SciELO - Scientific Electronic Library Online



Computación y Sistemas

Print version ISSN 1405-5546


SIDOROV, Grigori; GELBUKH, Alexander; GOMEZ-ADORNO, Helena and PINTO, David. Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model. Comp. y Sist. [online]. 2014, vol.18, n.3, pp.491-504. ISSN 1405-5546.

We show how to take similarity between features into account when calculating the similarity of objects in the Vector Space Model (VSM), for machine learning algorithms and other classes of methods that rely on similarity between objects. Unlike Latent Semantic Analysis (LSA), we assume that the similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can differ somewhat (which makes them different features) and still have much in common. For example, the words "play" and "game" are different but related. To account for this, we generalize the well-known cosine similarity measure in VSM by introducing what we call the "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider a new feature space for VSM consisting of pairs of the original features weighted by their similarity. When the features bear no similarity to each other, our soft similarity measure equals the standard similarity, and our formulas reduce to the standard cosine measure. Our experiments show that the soft cosine measure provides better performance in our case study: the entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and the Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
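The idea in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited form of the soft cosine measure, in which the dot product and both norms are weighted by a feature-similarity matrix `s`, so that the value is `sum(s[i][j]*a[i]*b[j])` divided by the product of the two soft norms. The abstract does not say how a Levenshtein distance is converted into a similarity, so the conversion `1 - distance / max_length` used here is an assumption, and all function names are hypothetical.

```python
import math


def levenshtein(x, y):
    """Edit distance between two sequences (strings, or tuples of n-gram elements)."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def feature_similarity(f, g):
    """Assumed conversion: 1 - distance / length of the longer feature."""
    longest = max(len(f), len(g))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(f, g) / longest


def similarity_matrix(features):
    """Pairwise feature similarities; the diagonal is 1.0 by construction."""
    return [[feature_similarity(f, g) for g in features] for f in features]


def soft_dot(a, b, s):
    """Dot product in which every pair of features (i, j) contributes s[i][j]."""
    n = len(a)
    return sum(s[i][j] * a[i] * b[j] for i in range(n) for j in range(n))


def soft_cosine(a, b, s):
    """Soft cosine of vectors a and b under the feature-similarity matrix s."""
    norm_a_sq = soft_dot(a, a, s)
    norm_b_sq = soft_dot(b, b, s)
    # A Levenshtein-derived matrix is not guaranteed to be positive
    # semidefinite, so guard against non-positive squared norms.
    if norm_a_sq <= 0 or norm_b_sq <= 0:
        return 0.0
    return soft_dot(a, b, s) / (math.sqrt(norm_a_sq) * math.sqrt(norm_b_sq))


if __name__ == "__main__":
    features = ["play", "player", "game"]
    s = similarity_matrix(features)
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    doc_a = [1, 0, 0]  # contains only "play"
    doc_b = [0, 1, 0]  # contains only "player"
    print(soft_cosine(doc_a, doc_b, identity))  # standard cosine
    print(soft_cosine(doc_a, doc_b, s))         # soft cosine
```

With the identity matrix (no similarity between distinct features) the function reduces to the standard cosine, which is 0 for these two vectors because they share no feature. With the Levenshtein-derived matrix, "play" and "player" are two edits apart over a maximum length of six, so the soft cosine is about 0.667. Note that a character-level edit distance gives "play" and "game" a similarity of 0 under this conversion; capturing that kind of relatedness would need a different source for `s`, such as the synonym dictionary the abstract mentions.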

Keywords: Soft similarity; soft cosine measure; vector space model; similarity between features; Levenshtein distance; n-grams; syntactic n-grams.



Creative Commons License: All the contents of this journal, except where otherwise noted, are licensed under a Creative Commons Attribution License.