Adding new concepts to a trained model or re-training a MedCAT model

Is it possible to re-train a model or add concepts to the CDB of a saved model?

I am using a trained MedCAT model (MedCAT version 1.2.9.dev13) and would like to add more concepts relevant to my specialty to the CDB, without losing the previous training.

I have tried to add concepts using cdb.add_concept() after the model is loaded, but although this adds the concepts to the CDB, it does not affect the model’s output. Alternatively, can I re-train a previously trained model on a new dataset so that it picks up the new concepts?

Thanks!

Welcome @elenaP !

You can absolutely add new concepts to the model or even create your own!
Keep in mind that the CUI and primary name must be unique.

Although cdb.add_concept() can be used, it does not apply any preprocessing to the concept name. Instead, try adding new concepts via cat.add_and_train_concept().
You can read more about the function arguments here. Internally it makes a call to cdb.add_concept().

It is on our TODO list to clarify/rename these two functions. It’s quite confusing imo.

But in short, try and work with the CAT object rather than directly with the CDB.

The CAT object consists of the CDB, Vocab and config, and stores them together.
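
For example, something along these lines (a minimal sketch assuming the MedCAT 1.x model-pack API; the CUI, name and paths are placeholders):

```python
from medcat.cat import CAT

# Load the saved model pack (the CAT object bundles CDB, Vocab and config).
cat = CAT.load_model_pack("path/to/model_pack.zip")

# Register a new concept. The name is preprocessed for you, and if you also
# pass an annotated example (see further down the thread) the linker gets
# trained on it. The CUI and name here are placeholders.
cat.add_and_train_concept(
    cui="C0000001",
    name="my specialty concept",
    name_status="P",  # mark this as the preferred/primary name
)

# Save everything back so the new concept persists.
cat.create_model_pack("path/to/updated_model_pack")
```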

Thank you for the reply. The add_and_train_concept() function takes a spacy_doc argument, described as the “Spacy representation of the document that was manually annotated”. Could you give some more details on this document’s format, please? A sample document might also be helpful.
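
For illustration, one way to obtain such a document, assuming the usual MedCAT 1.x behaviour where calling the loaded CAT object on a text returns the annotated spaCy Doc (the text, character offsets and CUI below are made up):

```python
# Hypothetical example text containing a manually annotated mention.
text = "Patient presents with my specialty concept in the left knee."

# Calling the CAT object on a text returns the spaCy Doc that MedCAT
# builds internally - this is the kind of object spacy_doc expects.
doc = cat(text)

# Point at the annotated mention as a spaCy Span; the offsets 22-42
# correspond to "my specialty concept" in the example text above.
span = doc.char_span(22, 42)

# Train the linker on this positive example of the concept in context.
cat.add_and_train_concept(
    cui="C0000001",
    name="my specialty concept",
    spacy_doc=doc,
    spacy_entity=span,
)
```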

@anthony.shek What type of model is it actually that is being trained?
Is it a neural network? Can I read about the model type anywhere?

Not a neural network.

You can read more about it here:

Hi @anthony.shek I have read that paper a few times and I am pretty sure it is not mentioned there which models are actually being used. It only mentions that the training is supervised and that MetaCAT uses bi-directional LSTM models. Is it because this information is confidential that it’s not explained in more depth?

It would be very helpful if you could shed some light on this :slight_smile:

So you’re right. MetaCAT uses a bi-LSTM, but it can be done with BERT as well.

As for the main NER pipeline: you can consider it a custom pipeline. It is basically a vocabulary-based lookup with context-based disambiguation to link words or phrases to the most appropriate concept.
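
Very roughly, the disambiguation step works like this (a toy sketch of the principle only, not MedCAT’s actual code; the CUIs and vectors are made up):

```python
import numpy as np

# Toy "CDB": candidate concepts sharing an ambiguous name, each with a
# learnt context vector (in MedCAT these are accumulated during training).
context_vectors = {
    "C0000001": np.array([0.9, 0.1, 0.0]),  # placeholder concept A
    "C0000002": np.array([0.1, 0.8, 0.3]),  # placeholder concept B
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(candidate_cuis, mention_context):
    """Pick the candidate CUI whose learnt context vector is most similar
    to the context vector of the current mention."""
    return max(candidate_cuis,
               key=lambda cui: cosine(context_vectors[cui], mention_context))

# Context vector of the mention found by the vocabulary lookup, e.g. an
# average of the word vectors around it (values made up here).
mention_context = np.array([0.85, 0.15, 0.05])
print(disambiguate(["C0000001", "C0000002"], mention_context))  # -> C0000001
```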

A pre-print (open access) is available here: https://arxiv.org/pdf/2010.01165.pdf

The algorithm in text form is here:
[screenshot from the paper describing the extraction and linking algorithm]

Meta-annotation, i.e. the contextualisation of an extracted concept, is essentially a bi-LSTM.
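
For intuition, a model of that kind is a small sequence classifier over the context window around a detected concept. A minimal PyTorch sketch, not MetaCAT’s actual implementation (the dimensions and the negation task are illustrative):

```python
import torch
import torch.nn as nn

class ToyMetaAnnotator(nn.Module):
    """Toy bi-LSTM classifier: given the token embeddings of a context
    window around a detected concept, predict a meta-annotation such as
    Negation = {affirmed, negated}."""

    def __init__(self, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_embeddings):          # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(token_embeddings)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden)
        return self.classifier(h)                 # (batch, n_classes)

model = ToyMetaAnnotator()
window = torch.randn(1, 20, 300)  # 20 context tokens, random embeddings
logits = model(window)            # e.g. scores for affirmed vs negated
```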

Hope that helps!

Aha ok, that’s interesting… So there are actually no statistical models involved in the main NER pipeline. No wonder I couldn’t find any then :slight_smile:
Thanks for confirming that!

No worries - not sure what you mean by statistical models, but the context similarity calculation, the comparisons, and the use of distributional semantic models are very much based on statistical techniques…