Cosine similarity and word2vec

csep · October 24, 2022, 12:13pm

Hi, we are looking into using the concept embeddings MedCAT generates to understand the similarity between concepts. Our understanding is that MedCAT uses an initial vocabulary of tokens with embeddings based on word2vec to create embeddings of contexts around mentions of concepts, which in turn are used to learn the embeddings of the concepts themselves.

We also understand that MedCAT uses cosine similarity (i.e. the dot product between normalised embedding vectors) during annotation to measure the similarity / distance between context embeddings and concepts. Could you let me know why cosine similarity was chosen for this purpose? As we read it, word2vec relies on differences and averages of vectors for various purposes, so we would have expected Euclidean distance to be a good measure of difference and similarity in its embedding space. I feel we are missing something here - maybe there is a step in MedCAT during training that translates word2vec embeddings into a directional system. But then we notice that the concept embedding vectors MedCAT stores in the cdb are not unit vectors.
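To make the question concrete, here is a minimal sketch of the two metrics we are comparing. It is plain Python on made-up toy vectors and does not use MedCAT's API; the names `concept_a`, `concept_b` and the helper functions are purely illustrative. It only shows the standard identity that, once vectors are scaled to unit length, squared Euclidean distance equals 2 - 2 * cosine similarity, so the two metrics differ only when vector length carries information.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def normalise(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalisation."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy embeddings, NOT real MedCAT concept vectors.
concept_a = [0.2, 1.5, -0.7]
concept_b = [1.1, 0.4, 0.3]

cos_ab = cosine_similarity(concept_a, concept_b)
dist_raw = euclidean_distance(concept_a, concept_b)
dist_unit = euclidean_distance(normalise(concept_a), normalise(concept_b))

# On unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_gap = abs(dist_unit ** 2 - (2 - 2 * cos_ab))

print("cosine similarity:", cos_ab)
print("Euclidean distance (raw vectors):", dist_raw)
print("Euclidean distance (unit vectors):", dist_unit)
print("gap in the identity:", identity_gap)
```

Since the vectors in the cdb are not unit length, the raw Euclidean distance and the cosine similarity can rank pairs of concepts differently, which is why we are unsure which one to use.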

Any help as to what distance metric we should use between concepts and how we can combine their embeddings would be much appreciated!
