ISSN 1405-5546
Article identifier: S1405-55462014000300007
DOI: 10.13053/CyS-18-3-2043
México, September 2014
Vol. 18, No. 3, pp. 491–504

Regular articles

Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model

Grigori Sidorov¹, Alexander Gelbukh¹, Helena Gómez-Adorno¹, and David Pinto²

¹ Centro de Investigación en Computación, Instituto Politécnico Nacional, México D.F., México. sidorov@cic.ipn.mx, gelbukh@cic.ipn.mx, helena.adorno@gmail.com

² Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla, Puebla, México. dpinto@cs.buap.mx.

Article received on 25/07/2014.
Accepted on 12/09/2014.

Abstract

We show how to take similarity between features into account when calculating the similarity of objects in the Vector Space Model (VSM), for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that the similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can differ somewhat (which makes them different features) and still have much in common; for example, the words "play" and "game" are different but related. When there is no similarity between features, our soft similarity measure is equal to the standard similarity. To obtain such a measure, we generalize the well-known cosine similarity measure in VSM by introducing what we call the "soft cosine measure". We propose several formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider a new feature space for VSM that consists of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that the soft cosine measure provides better performance in our case study: the entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and the Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.

Keywords: Soft similarity, soft cosine measure, vector space model, similarity between features, Levenshtein distance, n-grams, syntactic n-grams.
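
The idea described in the abstract can be illustrated with a minimal sketch. It assumes one common formulation of the soft cosine, in which every product of feature weights is multiplied by the similarity s_ij of the two features involved; the abstract itself does not state the exact formulas, so this should be read as an illustration rather than as the authors' implementation. The function names (`levenshtein`, `feature_similarity`, `soft_dot`, `soft_cosine`) and the normalization of the Levenshtein distance into a similarity in [0, 1] are hypothetical choices made for this example. Feature weights are assumed to be non-negative.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences.

    Works on strings (distance in characters) and on tuples
    (distance in elements of an n-gram).
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def feature_similarity(f1, f2):
    """Assumed normalization (not from the source): 1 - distance / max length."""
    longest = max(len(f1), len(f2))
    return 1.0 if longest == 0 else 1.0 - levenshtein(f1, f2) / longest


def soft_dot(a, b, sim):
    """Sum of sim[i][j] * a[i] * b[j] over all pairs of features."""
    return sum(sim[i][j] * a[i] * b[j]
               for i in range(len(a))
               for j in range(len(b)))


def soft_cosine(a, b, sim):
    """Soft cosine of vectors a and b under feature-similarity matrix sim.

    With sim equal to the identity matrix this is the standard cosine.
    Assumes non-negative weights and similarities.
    """
    denom = math.sqrt(soft_dot(a, a, sim)) * math.sqrt(soft_dot(b, b, sim))
    return soft_dot(a, b, sim) / denom if denom else 0.0


if __name__ == "__main__":
    features = ["play", "player", "game"]
    sim = [[feature_similarity(f, g) for g in features] for f in features]
    doc1 = [1.0, 0.0, 0.0]  # contains only "play"
    doc2 = [0.0, 1.0, 0.0]  # contains only "player"
    identity = [[float(i == j) for j in range(3)] for i in range(3)]
    print("standard cosine:", soft_cosine(doc1, doc2, identity))
    print("soft cosine:    ", soft_cosine(doc1, doc2, sim))
```

Two documents that share no features have a standard cosine of zero, while the soft cosine is positive whenever their features are similar to each other. A purely string-based similarity such as this one cannot relate "play" and "game"; for that pair, the similarity would have to come from an external resource such as the synonym dictionary mentioned in the abstract.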


