Self-supervised MedCAT model

Hideaki · April 14, 2023, 10:43am

Hello,

Out of interest, since the self-supervised MedCAT algorithm is a custom one, may I ask what the trained model is stored as? For example, in neural networks, the model would be the weights and biases, so I wonder what the self-supervised model represents? Hope you can shed light on this. Thank you.

jkgenser · June 6, 2023, 7:30pm

The training procedure uses rules to check whether a text span matches a name in the mapping from CUI to vocabulary list. If the match is unambiguous, the algorithm creates a vectorized representation of the context around that concept (the average of the vectors of the surrounding words). There are heuristics that determine context size and weights based on the size of the document, I think.
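
To make the "average of vectors of words" idea concrete, here is a minimal sketch. It is not MedCAT's code: the function name `context_vector`, the fixed `window` parameter, the choice to skip the mention token itself, and the 2-d toy vectors are all assumptions for illustration.

```python
import numpy as np


def context_vector(tokens, index, word_vectors, window=2):
    """Average the vectors of words within `window` tokens of position `index`.

    The token at `index` (the concept mention) is skipped, and words that
    have no entry in `word_vectors` are ignored. Returns None when no
    neighbouring word has a vector.
    """
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    neighbours = [
        word_vectors[tok]
        for pos, tok in enumerate(tokens[start:end], start=start)
        if pos != index and tok in word_vectors
    ]
    if not neighbours:
        return None
    return np.mean(neighbours, axis=0)


# Hypothetical 2-d word vectors, for illustration only.
word_vectors = {
    "concern": np.array([1.0, 0.0]),
    "for": np.array([0.0, 1.0]),
    "scan": np.array([1.0, 1.0]),
}
tokens = ["concern", "for", "pcp", "run", "scan"]
ctx = context_vector(tokens, index=2, word_vectors=word_vectors, window=2)
```

Here "run" has no vector and is ignored, so `ctx` is the mean of the three remaining neighbours.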

Then, it applies an update function based on a learning rate to move the concept embedding closer to the context.
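
A simplified sketch of such an update step, assuming a plain linear interpolation toward the context vector. The function `update_concept_vector` and the fixed `learning_rate` are hypothetical; the actual MedCAT update rule may scale the step differently.

```python
import numpy as np


def update_concept_vector(concept_vec, context_vec, learning_rate=0.1):
    """Move `concept_vec` a fraction `learning_rate` of the way toward `context_vec`."""
    return concept_vec + learning_rate * (context_vec - concept_vec)


# Toy example: repeated updates pull the concept embedding toward the context.
concept = np.array([0.0, 0.0])
context = np.array([1.0, 1.0])
for _ in range(3):
    concept = update_concept_vector(concept, context, learning_rate=0.5)
```

Each call halves the remaining distance to the context in this toy setting, so a concept that is seen in similar contexts many times ends up close to their average.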

So in this case the “weights” are actually the concept embeddings and the training procedure updates these concept embeddings based on the context these concepts are found within.

This is how the system is able to disambiguate and link to a given concept given the context.

Hideaki · June 12, 2023, 5:08pm

Thank you. So basically the model is the updated word embeddings of the concepts in the vocabulary that have received training. How is annotation (NER+L) done based on the updated embeddings?

jkgenser · June 12, 2023, 5:26pm

Consider a toy case where you have the following concepts and aliases or names

concept 1: [“primary care physician”, “pcp”]
concept 2: [“pneumonia”, “pcp”]

During the unsupervised training procedure, if “pcp” comes up in the data there will be no training, because the name maps to both concepts and is therefore ambiguous. However, consider the following sentences:

case 1: “Patient should consult with primary care physician”
case 2: “Take antibx as prophylaxis due to concern for pneumonia”

The algorithm will move each concept vector closer to the vectorized contexts in the above examples. Once enough of the context around the concepts has been learned, we can calculate the closest concept vector to the context in cases where the names overlap, so the model will have learned to disambiguate them.

Consider the new cases below:

case 1: “Patient should consult with PCP”
case 2: “Concern for PCP, run a cat scan”

Annotation is basically a nearest neighbor search of the context vector against the list of existing concept vectors and filtering based on a similarity threshold. You can read it in the source code:

github.com

CogStack/MedCAT/blob/master/medcat/linking/vector_context_model.py#L135


      
                    s = np.dot(unitvec(vectors[context_type]), unitvec(cui_vectors[context_type]))
                    similarity += weight * s

                    # DEBUG
                    logger.debug("Similarity for CUI: %s, Count: %s, Context Type: %.10s, Weight: %s.2f, Similarity: %s.3f, S*W: %s.3f",
                                 cui, self.cdb.cui2count_train[cui], context_type, weight, s, s*weight)
            return similarity
        else:
            return -1

    def disambiguate(self, cuis: List, entity: Span, name: str, doc: Doc) -> Tuple:
        vectors = self.get_context_vectors(entity, doc)
        filters = self.config.linking['filters']

        # If it is trainer we want to filter concepts before disambiguation
        # do not want to explain why, but it is needed.
        if self.config.linking['filter_before_disamb']:
            # DEBUG
            logger.debug("Is trainer, subsetting CUIs")
            logger.debug("CUIs before: %s", cuis)
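
For the PCP toy case, the nearest-neighbor-plus-threshold idea can be sketched roughly as follows. This is an illustration, not the linked MedCAT code: the helper names `unit` and `link`, the concept keys, the threshold value, and the 2-d vectors are all made up for the example.

```python
import numpy as np


def unit(v):
    """Scale `v` to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def link(context_vec, candidate_cuis, concept_vectors, threshold=0.3):
    """Return (best_cui, similarity) for the candidate closest to the context.

    Similarity is the cosine between the context vector and each candidate's
    concept vector. If the best similarity is below `threshold`, no concept
    is linked and the first element is None.
    """
    best_cui, best_sim = None, -1.0
    for cui in candidate_cuis:
        vec = concept_vectors.get(cui)
        if vec is None:
            continue  # concept has no trained vector yet
        sim = float(np.dot(unit(context_vec), unit(vec)))
        if sim > best_sim:
            best_cui, best_sim = cui, sim
    if best_sim < threshold:
        return None, best_sim
    return best_cui, best_sim


# Hypothetical trained concept vectors for the two meanings of "pcp".
concept_vectors = {
    "physician": np.array([1.0, 0.1]),
    "pneumonia": np.array([0.1, 1.0]),
}
candidates = ["physician", "pneumonia"]

# Hypothetical context vectors for the two new sentences.
consult_ctx = np.array([0.9, 0.2])  # "Patient should consult with PCP"
scan_ctx = np.array([0.2, 0.8])     # "Concern for PCP, run a cat scan"

consult_cui, _ = link(consult_ctx, candidates, concept_vectors)
scan_cui, _ = link(scan_ctx, candidates, concept_vectors)
```

With these toy vectors the first context links to the physician concept and the second to the pneumonia concept; raising `threshold` high enough makes `link` return no concept at all.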

Hideaki · June 12, 2023, 5:46pm

Thank you. So basically, during annotation, the word embeddings are only used to annotate ambiguous entities? Unambiguous entities get immediately linked to the concepts in the CDB?
