
Polibits

On-line version ISSN 1870-9044

Polibits  n.51 México Jan./Jun. 2015

http://dx.doi.org/10.17562/PB-51-9 

Soft Cardinality in Semantic Text Processing: Experience of the SemEval International Competitions

 

Sergio Jimenez1, Fabio A. Gonzalez1, and Alexander Gelbukh2

 

1 Departamento de Ingeniería de Sistemas e Industrial, Universidad Nacional de Colombia, Bogotá, Colombia (e-mail: fagonzalezo@unal.edu.co, sergiojimenezvargas@gmail.com).

2 Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico City, Mexico (e-mail: gelbukh@gelbukh.com).

 

Manuscript received on February 17, 2015.
Accepted for publication on May 27, 2015.
Published on June 15, 2015.

 

Abstract

Soft cardinality is a generalization of the classic set cardinality (i.e., the number of elements in a set) that exploits similarities between elements to provide a "soft" count of the number of elements in a collection. The model is general enough to be used interchangeably as the cardinality function in resemblance coefficients such as Jaccard's, Dice's, cosine, and others. Beyond that, cardinality-based features can be extracted from pairs of objects being compared in order to learn adaptive similarity functions from training data. This approach can be used to compare any objects that can be represented as sets or bags. We and other international teams used soft cardinality to address a series of natural language processing (NLP) tasks in the SemEval (semantic evaluation) competitions from 2012 to 2014. The systems based on soft cardinality were consistently among the best systems in every task in which they participated. This paper describes our experience in that journey by presenting the generalities of the model and some practical techniques for applying soft cardinality to NLP problems.

Key words: Similarity measure, soft computing, set cardinality, semantics, natural language processing.
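
To make the idea of "soft" counting concrete, the following Python sketch assumes the formulation commonly given in the soft cardinality literature: each element contributes the reciprocal of its summed similarity (raised to a softness exponent p) to all elements of the collection. That formula is not stated in the abstract above, so it should be read as an assumption. The names `soft_cardinality`, `soft_jaccard`, `crisp`, and `bigram_dice`, as well as the toy bigram similarity, are hypothetical illustrations and not the authors' implementation.

```python
from typing import Callable, Iterable

Similarity = Callable[[str, str], float]


def crisp(a: str, b: str) -> float:
    """Classic (hard) similarity: 1 for identical elements, 0 otherwise."""
    return 1.0 if a == b else 0.0


def bigram_dice(a: str, b: str) -> float:
    """Toy similarity (assumption): Dice coefficient over character bigrams."""
    if a == b:
        return 1.0
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def soft_cardinality(elements: Iterable[str], sim: Similarity, p: float = 1.0) -> float:
    """Soft count of a collection, assuming sim(x, x) == 1 for every x.

    Each distinct element adds 1 / sum_b sim(a, b)**p, so elements that
    resemble others in the collection count for less than one.
    """
    items = list(set(elements))  # treat the input as a set
    return sum(1.0 / sum(sim(a, b) ** p for b in items) for a in items)


def soft_jaccard(a: Iterable[str], b: Iterable[str], sim: Similarity, p: float = 1.0) -> float:
    """Jaccard coefficient with soft cardinality plugged in as the counting function."""
    set_a, set_b = set(a), set(b)
    card_union = soft_cardinality(set_a | set_b, sim, p)
    if card_union == 0:
        return 0.0  # both collections are empty
    # Intersection is inferred from |A| + |B| - |A ∪ B|.
    card_inter = soft_cardinality(set_a, sim, p) + soft_cardinality(set_b, sim, p) - card_union
    return card_inter / card_union


if __name__ == "__main__":
    words = ["cardinality", "cardinalities", "semantics"]
    print(soft_cardinality(words, crisp))        # hard count of distinct words
    print(soft_cardinality(words, bigram_dice))  # soft count, near-duplicates weigh less
    print(soft_jaccard(["soft", "set"], ["soft", "sets"], bigram_dice))
```

With the `crisp` similarity the function reduces to the classic count of distinct elements. With a graded similarity, two near-duplicate words are counted as fewer than two elements. The same function can be substituted into Dice's or the cosine coefficient in place of the classic cardinality.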

 


 

ACKNOWLEDGMENT

The second author acknowledges the support of LACCIR R1212LAC006 under the project "Multimodal image retrieval to support medical case-based scientific literature search." The third author acknowledges the support of the Mexican Government via SNI, CONACYT, and the Instituto Politécnico Nacional, SIP-IPN grants 20152100 and 20152095.

 


All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.