Cosine Similarity
From the Wikipedia article on Cosine similarity:
In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [-1,1].
For example, two proportional vectors have a cosine similarity of 1, two orthogonal vectors have a similarity of 0, and two opposite vectors have a similarity of -1. In some contexts, the component values of the vectors cannot be negative, in which case the cosine similarity is bounded in [0,1].
So for two vectors A and B, each with n components, the cosine similarity \cos (\theta) is defined as:
\cos (\theta ) = \dfrac {A \cdot B} {\left\| A\right\| \left\| B\right\|} = \dfrac {\sum \limits_{i=1}^{n}{A_i B_i}} {\sqrt{\sum \limits_{i=1}^{n}{(A_i)^2}} \sqrt{\sum \limits_{i=1}^{n}{(B_i)^2}}}
For example, if A=(1,2,3) and B=(4,5,6), then the cosine similarity is:
\cos (\theta ) = \dfrac {1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6} {\sqrt{1^2 + 2^2 + 3^2} \sqrt{4^2 + 5^2 + 6^2}} = \dfrac {32} {\sqrt{14} \sqrt{77}} \approx 0.9746
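The arithmetic in this worked example can be checked in a few lines of Python. This is only a sanity check of the numbers above, not a full solution to the exercise; the variable names are illustrative.

```python
from math import sqrt

A = (1, 2, 3)
B = (4, 5, 6)

# Numerator: the dot product 1*4 + 2*5 + 3*6
dot = sum(a * b for a, b in zip(A, B))

# Denominator: the Euclidean lengths sqrt(14) and sqrt(77)
norm_a = sqrt(sum(a * a for a in A))
norm_b = sqrt(sum(b * b for b in B))

cos_theta = dot / (norm_a * norm_b)
print(dot)                  # 32
print(round(cos_theta, 4))  # 0.9746
```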
Write a function called cosine_similarity that takes two vectors as input and returns the cosine similarity between them.
Bonus challenge
Implement the following different versions of your function, and use %timeit (e.g. in an IPython terminal) to compare their performance on vectors of length 10,000:
- a naive Python version that uses for loops to compute the dot product and the lengths of the vectors, which are implemented as simple lists. You can get sqrt using from math import sqrt.
- a version that uses numpy arrays and the numpy.dot function to compute the dot product and the numpy.linalg.norm function to compute the lengths of the vectors.
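One possible way to approach the two versions is sketched below. The function names cosine_similarity_loops and cosine_similarity_numpy, the random test vectors, and the choice of 100 repetitions are all assumptions made for illustration. The sketch uses the standard-library timeit module so that it also runs outside IPython. Both functions assume non-zero vectors, as the definition requires; a zero vector makes the denominator zero.

```python
from math import sqrt
import timeit

import numpy as np


def cosine_similarity_loops(a, b):
    """Naive version: plain Python lists and an explicit for loop."""
    dot = 0.0
    sum_sq_a = 0.0
    sum_sq_b = 0.0
    for x, y in zip(a, b):
        dot += x * y        # accumulate the dot product
        sum_sq_a += x * x   # accumulate the squared length of a
        sum_sq_b += y * y   # accumulate the squared length of b
    return dot / (sqrt(sum_sq_a) * sqrt(sum_sq_b))


def cosine_similarity_numpy(a, b):
    """NumPy version: numpy.dot for the numerator, numpy.linalg.norm for the lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Illustrative test data: two random vectors of length 10,000,
# kept both as NumPy arrays and as plain lists.
rng = np.random.default_rng(0)
a_arr = rng.random(10_000)
b_arr = rng.random(10_000)
a_list = a_arr.tolist()
b_list = b_arr.tolist()

# Rough timing: total seconds for 100 calls of each version.
t_loops = timeit.timeit(lambda: cosine_similarity_loops(a_list, b_list), number=100)
t_numpy = timeit.timeit(lambda: cosine_similarity_numpy(a_arr, b_arr), number=100)
print(f"loops: {t_loops:.4f} s, numpy: {t_numpy:.4f} s (100 calls each)")
```

In an IPython session the same comparison can be made with %timeit cosine_similarity_loops(a_list, b_list) and %timeit cosine_similarity_numpy(a_arr, b_arr). Exact timings depend on the machine and the NumPy build, so none are quoted here.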