Cosine Similarity

#Set #Distance

Cosine similarity is simply the inner product of two vectors after each has been normalized to unit length:

$$ d_{cos} = \frac{\vec A}{\vert \vec A \vert} \cdot \frac{\vec B }{ \vert \vec B \vert} $$
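
Written out in components, the same expression reads as follows. The symbols $A_i$ and $B_i$ for the $i$-th elements of the two vectors, and $\theta$ for the angle between them, are notation introduced here for illustration.

$$ d_{cos} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}} = \cos\theta $$

The value is 1 for vectors pointing in the same direction and 0 for orthogonal vectors. For term-frequency vectors, whose elements are non-negative, it always lies between 0 and 1.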


To apply cosine similarity to text, we first have to turn each sentence into a vector. There are many ways to do this. To illustrate cosine similarity, we use term frequency.

Term frequency is the number of times each word occurs in a sentence. We do not remove duplicates, so repeated words affect the similarity.

In principle, we could instead use the word set of each sentence to remove the effect of duplicate words. In most cases, however, a repeated word does make the sentences different, so keeping the counts is reasonable. If repeated words become a problem, we can consider using tf-idf.

The calculation takes the word set of sentence one and the word set of sentence two, uses their union as the labels of the vector elements, builds a term-frequency vector over these labels for each sentence, and then computes the cosine similarity of the two vectors.
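
The term-frequency approach described above can be sketched in Python. This is a minimal illustration, not the code behind this card: the function names `term_frequency_vectors` and `cosine_similarity`, the whitespace tokenization, the sorted label order, and the two example sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequency_vectors(sentence_one, sentence_two):
    """Return (labels, vector_one, vector_two) built from raw term counts."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    counts_one = Counter(sentence_one.lower().split())
    counts_two = Counter(sentence_two.lower().split())
    # The union of both word sets labels the vector elements.
    # Sorting only makes the element order reproducible.
    labels = sorted(set(counts_one) | set(counts_two))
    # A Counter returns 0 for words that a sentence does not contain.
    vector_one = [counts_one[word] for word in labels]
    vector_two = [counts_two[word] for word in labels]
    return labels, vector_one, vector_two


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty sentence has similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences.
labels, v1, v2 = term_frequency_vectors(
    "the cat sat on the mat", "the dog sat on the log"
)
print(labels)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(v1)      # [1, 0, 0, 1, 1, 1, 2]
print(v2)      # [0, 1, 1, 0, 1, 1, 2]
print(cosine_similarity(v1, v2))  # approximately 0.75
```

Here the repeated word "the" contributes 4 of the 6 units in the inner product, which shows how duplicates influence the result. Capping every count at 1 would give the word-set variant mentioned above.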


Lei Ma (2019). 'Cosine Similarity', Datumorphism, 05 April. Available at:
