Out of interest, since the self-supervised MedCAT algorithm is a custom one, may I ask what the trained model is stored as? For example, in neural networks, the model would be the weights and biases, so I wonder what the self-supervised model represents? Hope you can shed light on this. Thank you.
The training procedure uses rules to check whether a text span matches a name in the CUI-to-vocabulary mapping. If there is no ambiguity, the algorithm creates a vectorized representation of the context around that concept (an average of the word vectors). There are heuristics that determine the context size and weights based on the size of the document, I think.
Then it applies an update function, scaled by a learning rate, to move the concept embedding closer to the context vector.
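For intuition, here is a minimal sketch of that update step. The helper names, window size, and learning rate are illustrative assumptions, not MedCAT's actual API:

```python
import numpy as np

def context_vector(tokens, idx, word_vectors, window=9):
    """Average the word vectors in a window around the detected span at tokens[idx]."""
    ctx = tokens[max(0, idx - window):idx] + tokens[idx + 1:idx + 1 + window]
    vecs = [word_vectors[t] for t in ctx if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

def update_concept(concept_vec, ctx_vec, lr=0.1):
    """Move the concept embedding a step toward the observed context vector."""
    if concept_vec is None:
        return ctx_vec.copy()  # first unambiguous sighting initializes the embedding
    return (1 - lr) * concept_vec + lr * ctx_vec
```

If the heuristics mentioned above hold, they mainly adjust the window size and the per-word weights; the core moving-average update is as sketched.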
So in this case the “weights” are actually the concept embeddings, and the training procedure updates these embeddings based on the contexts the concepts are found in.
This is how the system is able to disambiguate a mention and link it to the right concept given its context.
Thank you. So basically the model is the set of updated concept embeddings for the vocabulary entries that have received training. How is annotation (NER+L) done based on the updated embeddings?
Consider a toy case where you have the following concepts and their aliases (names); a toy index for them is sketched after the list:
concept 1: [“primary care physician”, “pcp”]
concept 2: [“pneumonia”, “pcp”]
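To make the ambiguity concrete, the reverse name-to-concept index for this toy case would look roughly like the hand-rolled dict below (not the real CDB structure):

```python
# Maps each name/alias to the candidate concepts it can refer to.
name2cuis = {
    "primary care physician": ["concept 1"],
    "pneumonia": ["concept 2"],
    "pcp": ["concept 1", "concept 2"],  # ambiguous: two candidate concepts
}
```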
During the unsupervised training procedure, if “pcp” comes up in the data there will be no training, since the name maps to more than one concept. However, consider the following sentences:
case 1: “Patient should consult with primary care physician”
case: 2"Take antibx as prophylaxis due to concern for pneumonia"
Training on these sentences moves each concept vector closer to the vectorized contexts in which it appears. Once enough of the context around the concepts has been learned, we can disambiguate overlapping names by calculating which concept vector is closest to the context vector.
Consider the new cases below:
case 1: “Patient should consult with PCP”
case 2: “Concern for PCP, run a cat scan”
Annotation is basically a nearest-neighbor search of the context vector against the list of existing concept vectors, followed by filtering on a similarity threshold. You can read it in the source code:
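Putting the pieces together, the linking step could look roughly like the sketch below. It reuses the toy `name2cuis` index and the `context_vector` helper from above, and the threshold value is an arbitrary assumption, not MedCAT's default:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def link(span, ctx_vec, name2cuis, concept_vectors, threshold=0.25):
    """Link a detected span to a concept, disambiguating via context similarity."""
    candidates = name2cuis.get(span.lower(), [])
    if not candidates:
        return None                       # unknown name: nothing to link
    if len(candidates) == 1:
        return candidates[0]              # unambiguous name: link directly
    # Ambiguous name: nearest-neighbor search over the candidate concept vectors.
    scored = [(cui, cosine(ctx_vec, concept_vectors[cui])) for cui in candidates]
    best_cui, best_sim = max(scored, key=lambda s: s[1])
    return best_cui if best_sim >= threshold else None  # similarity-threshold filter
```

So for case 1 above, the context around “PCP” should land closest to concept 1's vector, and for case 2 closest to concept 2's.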
Thank you. So basically, during annotation, the concept embeddings are only used to disambiguate ambiguous entities? Unambiguous entities get immediately linked to their concepts in the CDB?