Hello. Is there a way to access the embeddings of the concepts that have been linked to the CDB?
Hi @Hideaki, unfortunately that is not possible. Storing them would consume too much memory for big runs, so the context embeddings are calculated but not stored anywhere during the Linking phase. You can of course access the embeddings of concepts from the CDB.
Thank you, @zeljko. I’ve used cat.cdb.cui2context_vectors on a few CUIs. For most of them, I am able to access the four vector types. However, I get a KeyError with CUIs 840539006 and 91637004. Not sure what I’ve done wrong.
Also, may I please check, where does the original embedding for concepts in the CDB come from?
Hi @Hideaki,
Re the key error: not all CUIs have embeddings; it depends on whether the CUI received any training.
The original embeddings for concepts in the CDB come from the unsupervised training. Have a look at the MedCAT paper; it explains the training procedure and how the Vocab (word embeddings) is used to make concept embeddings.
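Since a CUI with no training simply has no entry, a lookup that tolerates missing keys avoids the KeyError. Below is a minimal sketch, assuming cat.cdb.cui2context_vectors behaves like a plain dict from CUI to a dict of vector type to vector. The helper names, the example CUI, the vector-type names, and the numbers are all made up for illustration; a plain dict stands in for the real CDB so the snippet runs without a model.

```python
from typing import Dict, List, Optional

# Stand-in for cat.cdb.cui2context_vectors (assumed shape):
# CUI -> {vector type -> vector}. All values here are invented.
ContextVectors = Dict[str, Dict[str, List[float]]]


def get_context_vectors(
    cui2context_vectors: ContextVectors, cui: str
) -> Optional[Dict[str, List[float]]]:
    """Return the context vectors for a CUI, or None if it has none."""
    return cui2context_vectors.get(cui)


def vector_types(cui2context_vectors: ContextVectors, cui: str) -> List[str]:
    """Return the sorted vector-type names stored for a CUI (empty if none)."""
    vectors = cui2context_vectors.get(cui)
    return sorted(vectors) if vectors is not None else []


# Hypothetical data: one CUI with vectors, and none for 840539006.
example: ContextVectors = {
    "C0000001": {"short": [0.1, 0.2], "long": [0.3, 0.4]},
}

missing = get_context_vectors(example, "840539006")   # no entry for this CUI
types_found = vector_types(example, "C0000001")       # names of stored types
```

With a loaded model, the same helpers could be called with cat.cdb.cui2context_vectors in place of `example`, provided the attribute has the shape assumed above.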
Thank you, @zeljko! We received a private model from a different Trust and so I used cat.multiprocessing to annotate then checked the embeddings of the concepts. I suppose those CUIs did not receive training from the original organisation.
@Hideaki just to double-check whether the concepts have received any training, you can explore:
cat.cdb.cui2count_train['<CUI OF INTEREST>']
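To check several CUIs at once, the training counts can be used to split a list into trained and untrained concepts. This is a sketch under the assumption that cat.cdb.cui2count_train is a dict from CUI to an integer count and that untrained CUIs may be absent from it. The function name and the counts are hypothetical, and a plain dict stands in for the CDB attribute.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_training(
    cui2count_train: Dict[str, int], cuis: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split CUIs into (trained, untrained) using their training counts."""
    trained: List[str] = []
    untrained: List[str] = []
    for cui in cuis:
        # A missing entry is treated the same as a count of zero.
        if cui2count_train.get(cui, 0) > 0:
            trained.append(cui)
        else:
            untrained.append(cui)
    return trained, untrained


# Invented counts for illustration only; not taken from any real model.
counts = {"C0000001": 12, "C0000002": 0}
trained, untrained = split_by_training(
    counts, ["C0000001", "C0000002", "840539006"]
)
```

Any CUI that ends up in `untrained` would be expected to have no context vectors either, which matches the key error described above.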
@Hideaki thanks for joining the call. Apologies for the delay, as we are in the USA this week. We can go into the many-to-1 map and context vectors next week when we are back.
Thank you, @Jthteo. What I’m interested to find out are:
- Where the four types of vectors come from when we use cat.cdb.cui2context_vectors[CUI]. My understanding from Equation 10 of the 2021 paper is that there can be many context vectors (Vcntx) but only one concept vector (Vconcept). The Public and King’s 1.4 models have 4 types of these vectors and the King’s 1.2 model has 6.
- Whether the vectors of cat.cdb.cui2context_vectors map to preferred names in a one-to-one fashion. I found that there are duplicates of the preferred names when I run the below code for the Public model. There seem to be 29674 trained CUIs but 29314 unique preferred names.
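One way such a duplicate check could be done is to group the trained CUIs by preferred name and keep the names shared by more than one CUI. This is a sketch rather than the code used for the numbers above. It assumes a dict from CUI to preferred name (cat.cdb.cui2preferred_name is assumed to have that shape) and a collection of trained CUIs, for example the keys of cat.cdb.cui2context_vectors. The function name, CUIs, and names below are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def duplicated_preferred_names(
    cui2preferred_name: Dict[str, str], trained_cuis: Iterable[str]
) -> Dict[str, List[str]]:
    """Map each preferred name shared by several trained CUIs to those CUIs."""
    name_to_cuis: Dict[str, List[str]] = defaultdict(list)
    for cui in trained_cuis:
        name = cui2preferred_name.get(cui)
        if name is not None:  # skip CUIs with no preferred name recorded
            name_to_cuis[name].append(cui)
    return {name: cuis for name, cuis in name_to_cuis.items() if len(cuis) > 1}


# Hypothetical data: two CUIs share one preferred name.
preferred = {
    "C0000001": "Fever",
    "C0000002": "Fever",
    "C0000003": "Cough",
}
duplicates = duplicated_preferred_names(preferred, preferred.keys())
n_unique_names = len(set(preferred.values()))
```

If several CUIs share a preferred name, the number of unique names will be lower than the number of trained CUIs, and `duplicates` lists exactly which CUIs collide.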