Accessing MedCAT entities' concept embeddings

Hello. Is there a way to access the embeddings of the concepts that have been linked to the CDB?

Dear @zeljko , is there a way to access the context embeddings of the entities linked?

Hi @Hideaki, unfortunately that is not possible. Storing them would consume too much memory on big runs, so the context embeddings are calculated during the linking phase but not stored anywhere. You can, of course, access the embeddings of concepts from the CDB.

Thank you, @zeljko. I’ve used cat.cdb.cui2context_vectors on a few CUIs. For most of them I am able to access the four vector types. However, I get a KeyError for CUIs 840539006 and 91637004. I'm not sure what I’ve done wrong.

Also, may I please check, where does the original embedding for concepts in the CDB come from?

Hi @Hideaki,

Re the key error: not all CUIs have embeddings; it depends on whether the CUI received any training.

The original embeddings for concepts in the CDB come from the unsupervised training. Have a look at the MedCAT paper: it explains the training procedure and how the Vocab (word embeddings) is used to make concept embeddings.

Thank you, @zeljko! We received a private model from a different Trust and so I used cat.multiprocessing to annotate then checked the embeddings of the concepts. I suppose those CUIs did not receive training from the original organisation.

@Hideaki, just to double-check whether the concepts have received any training, you can explore:

cat.cdb.cui2count_train['<CUI OF INTEREST>']
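To avoid the KeyError altogether, the lookup can be guarded. This is only a minimal sketch: it assumes that `cui2context_vectors` and `cui2count_train` behave like plain dictionaries keyed by CUI, as the snippets in this thread suggest. The toy dictionaries stand in for the attributes on `cat.cdb`, the CUI `C0000001` is made up, and the helper name `get_context_vectors` is hypothetical.

```python
import numpy as np

# Toy stand-ins for cat.cdb.cui2context_vectors and cat.cdb.cui2count_train
# (assumption: both are dict-like and keyed by CUI).
cui2context_vectors = {
    "C0000001": {"long": np.ones(3), "short": np.zeros(3)},
}
cui2count_train = {"C0000001": 12}


def get_context_vectors(cui, vectors=cui2context_vectors, counts=cui2count_train):
    """Return (train_count, vectors_or_None) without raising KeyError."""
    count = counts.get(cui, 0)      # 0 if the CUI never received training
    return count, vectors.get(cui)  # None if no vectors are stored


trained = get_context_vectors("C0000001")
untrained = get_context_vectors("840539006")
```

With a real model you would pass `cat.cdb.cui2context_vectors` and `cat.cdb.cui2count_train` instead of the toy dictionaries.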

@Hideaki, thanks for joining the call. Apologies for the delay, as we are in the USA this week. We can go into the many-to-one map and context vectors next week when we are back.

@mart.ratas @anthony.shek

Thank you, @Jthteo. What I’m interested to find out are:

  1. Where the four types of vectors come from when we use cat.cdb.cui2context_vectors[CUI]. My understanding from Equation 10 of the 2021 paper is that there can be many context vectors (Vcntx) but only one concept vector (Vconcept). The Public and King’s 1.4 models have 4 types of these vectors and the King’s 1.2 model has 6.

  2. Whether the vectors of cat.cdb.cui2context_vectors map to preferred names in a one-to-one fashion. I found that there are duplicates of the preferred names when I run the below code for the Public model. There seem to be 29674 trained CUIs but 29314 unique preferred names.
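The snippet referred to in point 2 is not reproduced in this thread. A check of this kind could be sketched roughly as follows; this is not the original code. It assumes the CDB exposes a dict-like `cui2preferred_name` alongside `cui2context_vectors`, and the toy dictionaries and CUIs below are made up for illustration.

```python
from collections import Counter

# Toy stand-ins (assumption: dict-like, keyed by CUI) for
# cat.cdb.cui2context_vectors and cat.cdb.cui2preferred_name.
cui2context_vectors = {"C1": {}, "C2": {}, "C3": {}}
cui2preferred_name = {"C1": "Fever", "C2": "Fever", "C3": "Cough", "C4": "Rash"}

# Preferred names of trained CUIs only (those that have context vectors)
names = [
    cui2preferred_name[cui]
    for cui in cui2context_vectors
    if cui in cui2preferred_name
]
name_counts = Counter(names)

# Preferred names shared by more than one trained CUI
duplicates = {name: n for name, n in name_counts.items() if n > 1}

n_trained = len(cui2context_vectors)
n_unique_names = len(name_counts)
```

If `n_trained` is larger than `n_unique_names`, several CUIs share a preferred name, and `duplicates` lists which ones.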


Hi @Hideaki,

Although we have already spoken about this, I am posting the answer here so that it is publicly available.

As you have correctly pointed out, there is currently no single concept vector stored. Instead, there are multiple vectors, each representing a different context size. A single concept vector can be calculated from these, so it is never stored. The parameters for the different context vectors are specified in the MedCAT config file.

  1. Context Vector Sizes:
  • context_vector_sizes: This is a dictionary specifying the context window sizes of the different context vectors that will be calculated and used for linking. Each key represents a context type (‘xlong’, ‘long’, ‘medium’, ‘short’), and the corresponding value is the context window size used for that context type.
  • For example: ‘xlong’ context vectors have a context window size of 27, ‘long’ context vectors have a context window size of 18, and so on. This is not to be confused with the size (dimensionality) of the vectors themselves.
  2. Context Vector Weights:
  • context_vector_weights: This is a dictionary specifying the weight of each vector in the similarity score. The weights are used when calculating the overall similarity score based on multiple context types. Each key corresponds to a context type, and the value is the weight assigned to that context type.
  • The weights should add up to 1. In this case, the ‘long’ and ‘medium’ context vectors contribute more to the overall similarity score, with weights of 0.4 each, while ‘xlong’ and ‘short’ contribute less, with weights of 0.1 each.
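To make the two settings concrete, here is a small sketch of what they might look like as plain dictionaries and how a window size differs from a vector size. The ‘xlong’ and ‘long’ sizes and all four weights are the ones quoted above; the ‘medium’ and ‘short’ sizes are assumed for illustration. The helper `context_window` is hypothetical and simply takes the same number of tokens on each side of the target; MedCAT's actual token selection may differ.

```python
# Window sizes: 'xlong' and 'long' are quoted in this thread;
# 'medium' and 'short' are assumed values for illustration only.
context_vector_sizes = {"xlong": 27, "long": 18, "medium": 9, "short": 3}
# Weights as quoted in this thread
context_vector_weights = {"xlong": 0.1, "long": 0.4, "medium": 0.4, "short": 0.1}


def context_window(tokens, center, size):
    """Return up to `size` tokens on each side of tokens[center], inclusive."""
    start = max(0, center - size)
    return tokens[start:center + size + 1]


tokens = [f"t{i}" for i in range(100)]
windows = {
    name: context_window(tokens, 50, size)
    for name, size in context_vector_sizes.items()
}
window_lengths = {name: len(window) for name, window in windows.items()}
weights_sum = sum(context_vector_weights.values())
```

The window only decides how many surrounding tokens feed into a context vector; the resulting vector has the same dimensionality whichever window is used.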

If you want to combine multiple vectors into a single vector that represents them all, you can perform a weighted sum of the individual vectors, where each vector is multiplied by its corresponding weight.

# First initialise your cat object, then proceed with the following code:
import numpy as np

def unitvec(v):
    # Local helper: scale a vector to unit length
    return v / np.linalg.norm(v)

# Retrieve the stored context vectors for a trained CUI (assumed to be keyed by
# context type); here I will mention 4 explicitly.
vectors = cat.cdb.cui2context_vectors['<CUI OF INTEREST>']
v_xlong = vectors['xlong']
v_long = vectors['long']
v_medium = vectors['medium']
v_short = vectors['short']

# Retrieve the weights (a dictionary keyed by context type)
weights = cat.config.linking['context_vector_weights']

# Combine the unit vectors into a single weighted vector
v_combined = (weights['xlong'] * unitvec(v_xlong) + weights['long'] * unitvec(v_long)
              + weights['medium'] * unitvec(v_medium) + weights['short'] * unitvec(v_short))

Here v_combined is your single vector representation.

This is part of how we then calculate the prediction/similarity of the target word or phrase against the concept vectors.

Calculation of similarity:

The similarity calculated in this context is a weighted sum of cosine similarities between the vectors associated with a given Concept Unique Identifier (CUI) and the input vectors for the different context types. Cosine similarity is a measure that ranges from -1 to 1, where:

  • 1 indicates perfect similarity.
  • -1 indicates perfect dissimilarity.
  • 0 indicates orthogonality (no similarity).

Given this range, and because the weights are non-negative and add up to 1, the weighted sum of cosine similarities also falls within the range of -1 to 1 and cannot exceed 1.
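The calculation described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not MedCAT's actual implementation: the function name `weighted_similarity` is hypothetical, the vectors are random toy data, and context types missing from either side are simply skipped.

```python
import numpy as np


def unitvec(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def weighted_similarity(concept_vectors, input_vectors, weights):
    """Weighted sum of per-context-type cosine similarities."""
    total = 0.0
    for context_type, weight in weights.items():
        if context_type in concept_vectors and context_type in input_vectors:
            cos = float(np.dot(unitvec(concept_vectors[context_type]),
                               unitvec(input_vectors[context_type])))
            total += weight * cos
    return total


# Weights as quoted in this thread; toy random vectors for one concept
weights = {"xlong": 0.1, "long": 0.4, "medium": 0.4, "short": 0.1}
rng = np.random.default_rng(0)
concept = {name: rng.normal(size=8) for name in weights}

same = weighted_similarity(concept, concept, weights)
opposite = weighted_similarity(
    concept, {name: -vec for name, vec in concept.items()}, weights
)
mixed = weighted_similarity(
    concept, {name: rng.normal(size=8) for name in weights}, weights
)
```

Identical vectors give a score of 1 (up to floating-point error), exactly opposite vectors give -1, and anything else lands in between.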