Exploration on Effectiveness and Efficiency of Similar Sentence Matching

Gu, Yanhui; Yang, Zhenglu; Nakano, Miyuki; Kitsuregawa, Masaru


Polibits

On-line version ISSN 1870-9044

Polibits n.47 México Jan./Jul. 2013


Yanhui Gu, Zhenglu Yang, Miyuki Nakano, and Masaru Kitsuregawa

Institute of Industrial Science, University of Tokyo, Komaba 4-6-1, Meguro, Tokyo, 153-8505, Japan (e-mail: guyanhui@tkl.iis.u-tokyo.ac.jp, yangzl@tkl.iis.u-tokyo.ac.jp, miyuki@tkl.iis.u-tokyo.ac.jp, kitsure@tkl.iis.u-tokyo.ac.jp).

Manuscript received on December 15, 2012.
Accepted for publication on January 11, 2013.

Abstract

Similar sentence matching is essential for many applications, such as text summarization, image extraction, social media retrieval, and question answering. A number of studies have investigated this problem in recent years. Most of these techniques focus on effectiveness, and only a few address efficiency. In this paper, we address both the effectiveness and the efficiency of similar sentence matching. Given a sentence collection, we study how to identify, effectively and efficiently, the top-k sentences that are semantically most similar to a query. To this end, we first study several representative sentence similarity measures, from which we select the best-performing ones through cross-validation and dynamic weight tuning. The experimental evaluation demonstrates the effectiveness of our strategy. For efficiency, we introduce several optimization techniques that speed up the similarity computation. We further explore the trade-off between effectiveness and efficiency through extensive experiments.

Key words: String matching, information retrieval, natural language processing.
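
The top-k matching task described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the two measures used here (word-level Jaccard overlap and a character-level ratio from the standard `difflib` module), the fixed weights, and the function names `jaccard`, `string_sim`, `combined`, and `top_k` are all assumptions chosen for illustration. The abstract states only that several similarity measures are combined with tuned weights and that the top-k most similar sentences are returned.

```python
import heapq
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: shared tokens divided by all distinct tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty sentences are treated as identical
    return len(ta & tb) / len(ta | tb)


def string_sim(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Weighted sum of the two measures; the weights are illustrative,
    not values tuned in the paper."""
    w_word, w_str = weights
    return w_word * jaccard(a, b) + w_str * string_sim(a, b)


def top_k(query: str, sentences, k: int = 2, weights=(0.5, 0.5)):
    """Return the k sentences with the highest combined score, best first.
    This scans the whole collection; it applies no efficiency optimization."""
    return heapq.nlargest(k, sentences, key=lambda s: combined(query, s, weights))


if __name__ == "__main__":
    collection = [
        "a cat sat on a mat",
        "stock prices fell sharply today",
        "the cat sat on the mat",
    ]
    print(top_k("the cat sat on the mat", collection, k=2))
```

This sketch scores every sentence in the collection for each query, which is the naive baseline. The efficiency techniques mentioned in the abstract would aim to avoid such a full scan and are not reproduced here.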

