Cosine similarity and word2vec

Hi, we are looking into using the concept embeddings MedCAT generates to understand the similarity between concepts. Our understanding is that MedCAT uses an initial vocabulary of tokens with embeddings based on word2vec to create embeddings of contexts around mentions of concepts, which in turn are used to learn the embeddings of the concepts themselves.
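To make sure we are describing the same pipeline, here is a toy sketch of our understanding (this is not MedCAT's actual code; the update rule and learning rate are purely illustrative): word2vec token vectors are averaged into a context embedding, and the concept embedding is nudged towards each observed context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical word2vec-style token embeddings (dimension 300 assumed).
token_emb = {w: rng.normal(size=300) for w in
             ["patient", "presents", "with", "chest", "pain"]}

def context_embedding(tokens):
    """Average the token vectors around a concept mention."""
    return np.mean([token_emb[t] for t in tokens], axis=0)

# Illustrative running-average update of a concept embedding.
concept_emb = np.zeros(300)
lr = 0.1  # hypothetical update rate, not a MedCAT parameter
ctx = context_embedding(["patient", "presents", "with", "pain"])
concept_emb = (1 - lr) * concept_emb + lr * ctx
```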

We also understand that MedCAT uses cosine similarity (i.e. the dot product of L2-normalised embedding vectors) during annotation to measure the similarity between context embeddings and concept embeddings. Could you let me know why cosine similarity was chosen for this purpose? The word2vec literature uses vector differences and averages for analogy and composition tasks, which led us to expect Euclidean distance to be the natural measure of similarity in that embedding space. We feel we are missing something here; perhaps there is a step during MedCAT training that turns the word2vec embeddings into a directional system. However, we notice that the concept embedding vectors in MedCAT's CDB are not unit vectors.
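One relationship that may be relevant to our question: once vectors are L2-normalised, squared Euclidean distance is a monotonic function of cosine similarity, so ranking neighbours by one is equivalent to ranking by the other. A small check of this identity (just illustrative arithmetic, nothing MedCAT-specific):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=50), rng.normal(size=50)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# L2-normalise both vectors, so their dot product equals the cosine.
u = a / np.linalg.norm(a)
v = b / np.linalg.norm(b)

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v),
# so nearest neighbours under either metric coincide.
assert np.isclose(np.linalg.norm(u - v) ** 2, 2 - 2 * cosine(u, v))
```

This equivalence only holds after normalisation, which is part of why we are puzzled that the stored concept vectors are not unit length.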

Any help as to what distance metric we should use between concepts and how we can combine their embeddings would be much appreciated!
