Laurent Haan wrote:
> On Apr 16, 10:45 am, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
> wrote:
>> Laurent Haan writes:
>>> I'm having problems modifying the formula for the cosine similarity
>>> to take into account weights given to the components of the
>>
>> ...
>>
>>> I'll illustrate the problem using the euclidean distance: I have a
>>> certain number of vectors and a query vector and I want to return
>>> the vector that minimizes the euclidean distance to the query
>>> vector.
>>
>>> Vector 1: [0.5, 0.5, 1]
>>> Vector 2: [1, 0, 0.5]
>>
>>> Query Vector : [1, 0.5, 0.5]
>>
>>> Distance 1: 0.5 + 0 + 0.5 = 1
>>> Distance 2: 0 + 0.5 + 0 = 0.5
>>
>>> Output: Vector 2
>>
>> That's not Euclidean distance. That's block distance. Euclidean
>> distance is the square root of the sum of squared differences.
>>
>>> I want to give each component an importance/weight. I've chosen
>>> values between [1, 10] since that allows me to immediately modify the
>>> euclidean distance formula to take into account the weight:
>>
>>> dist = sum(weight(i) * abs(x(i) - y(i)))
>>
>> ...
>>
>>> What I can't figure out is how to express the exact same thing
>>> with the cosine similarity. I tried modifying the formula in several
>>> ways, but each try failed.
>>
>> I wonder why you want to do this. If cosine does not work for you and
>> some other formula does, you could just use the other formula.
>>
>> However, here's a couple of thoughts, don't know how valuable.
>>
>> You were able to do your weighting with block distance because you had
>> access to something like individual components of the total distance.
>> Cosine is the dot product of normalized vectors. Normalize first: the
>> component x_k of vector x becomes x_k/length(x), where length(x) is
>> Euclidean, that is, square root of sum of squares. Then the cosine is
>> the sum of componentwise products, which you could weight, just like
>> block distance was the sum of componentwise differences.
>>
>> Alternatively, how about separate cosines for important and
>> unimportant components, and then weighted average of those?
>
>
> Thank you for your answer, it already brought me closer to the goal.
> There is only one problem left that I need to solve to get the correct
> result:
>
> In the block distance (thanks for the correction), it was logical to
> me that increasing the difference between two components would
> increase its importance, which means that the higher the importance,
> the bigger the number I would multiply the difference with.
>
> This doesn't work with the cosine similarity. This is also my last
> question, which probably is also the hardest:
>
> What should the components of the importance vector look like? Does a
> bigger number automatically mean that this term has a higher
> importance than another? At the moment, I construct a vector with
> values between [1, 10], with 10 being the highest importance, and I
> normalize that vector. Then I multiply each componentwise product by
> the corresponding component of that vector, as you explained.
> Unfortunately, the result is not convincing. I never achieve a perfect
> similarity of 1, even if the two vectors are the same.
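The effect you describe is easy to reproduce. Here is a minimal Python sketch of the scheme as I understand it from your description. The function names and the weights [10, 1, 5] are my own illustrative choices, and I am assuming that "normalize" means scaling to Euclidean length 1:

```python
import math

def unit(x):
    """Scale x to Euclidean length 1."""
    length = math.sqrt(sum(c * c for c in x))
    return [c / length for c in x]

def naive_weighted_cosine(u, v, w):
    """Weight the componentwise products of the unit vectors
    by the components of the normalized weight vector."""
    uh, vh, wh = unit(u), unit(v), unit(w)
    return sum(wk * uk * vk for wk, uk, vk in zip(wh, uh, vh))

query = [1, 0.5, 0.5]   # query vector from the original post
weights = [10, 1, 5]    # hypothetical importances in [1, 10]

# Ordinary cosine of a vector with itself: 1 up to rounding.
plain = sum(a * b for a, b in zip(unit(query), unit(query)))

# Same vector on both sides, but with the weights applied afterwards.
naive = naive_weighted_cosine(query, query, weights)

print(plain)  # 1, up to rounding
print(naive)  # roughly 0.68, clearly below 1
```

The squared components of a unit vector sum to 1, so weighting them by numbers that are all below 1 can only pull the total below 1, even when both vectors are identical.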
The normalization must also be weighted. For vectors u and v, with weight vector w, the weighted cosine is