Is it possible to re-train a model or add concepts to the CDB of a saved model?
I am using a trained MedCAT model (MedCAT version 1.2.9.dev13) and would like to add more concepts relevant to my specialty to the CDB, without losing the previous training.
I have tried adding concepts using cdb.add_concept() after the model is loaded, but although this adds the concepts to the CDB, it does not affect the model’s output. Alternatively, can I re-train a previously trained model on a new data set so that it picks up the new concepts?
You can absolutely add new concepts to the model or even create your own!
Keep in mind that the CUI and primary name must be unique.
Although cdb.add_concept() can be used, it does not apply any preprocessing to the concept name. Instead, add new concepts via the function cat.add_and_train_concept(). You can read more about the function arguments here. Internally it makes a call to cdb.add_concept().
It is on our TODO list to clarify/rename these two functions. It’s quite confusing, imo.
But in short, try and work with the CAT object rather than directly with the CDB.
The CAT object consists of the CDB, Vocab and config files and stores them together.
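As a minimal sketch of working through the CAT object rather than the CDB (the exact keyword arguments can differ between MedCAT versions, so treat the names below as assumptions and check your version’s docs):

```python
# Hedged sketch: assumes a MedCAT 1.x-style CAT object exposing
# add_and_train_concept(cui=..., name=..., spacy_doc=...) and a
# cdb.cui2names mapping; argument names are assumptions, not the
# definitive API.
def add_specialty_concept(cat, cui, name, example_text=None):
    """Add a concept to a loaded model via the CAT object (not the raw
    CDB), so the concept name goes through MedCAT's usual name
    preprocessing. CUI and primary name must be unique."""
    if example_text is not None:
        # Optionally supply context so the concept gets trained, not
        # just added as a dictionary entry.
        doc = cat(example_text)  # running CAT on text yields a spaCy Doc
        cat.add_and_train_concept(cui=cui, name=name, spacy_doc=doc)
    else:
        cat.add_and_train_concept(cui=cui, name=name)
    # Return the names now registered for this CUI, for a quick check.
    return cat.cdb.cui2names.get(cui)
```

After adding, you would save the model pack as usual so the updated CDB is persisted alongside the Vocab and config.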
Thank you for the reply. The add_and_train_concept() function takes a spacy_doc argument, “Spacy representation of the document that was manually annotated”. Could you give some more details on this document’s format please? Or a sample document might be helpful.
Hi @anthony.shek, I have read that paper a few times and I am pretty sure it does not mention which models are actually being used. It only mentions that the training is supervised and that MetaCAT uses bi-directional LSTM models. Is this information confidential, or is there another reason it isn’t explained in more depth?
It would be very helpful if you could shed some light on this.
So you’re right. MetaCAT uses bi-LSTM but it can be done with BERT as well.
As for the main NER pipeline, you can consider it a custom pipeline: it is basically a vocabulary-based lookup with context-based disambiguation to link words or phrases to the most appropriate concept.
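To illustrate the general idea (a toy sketch of vocabulary lookup plus context-based disambiguation, not MedCAT’s actual implementation; the CUIs and vectors below are hypothetical), each candidate concept for an ambiguous name keeps a learned context vector, and the candidate whose vector is most similar to the current mention’s context wins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_vec, candidates):
    """Pick the candidate CUI whose stored context vector is closest to
    the context of the current mention (toy illustration of
    context-based disambiguation, not MedCAT's real code)."""
    return max(candidates, key=lambda cui: cosine(context_vec, candidates[cui]))

# "cold" could link to an illness or a temperature; hypothetical vectors:
candidates = {
    "CUI_ILLNESS": [0.9, 0.1, 0.0],
    "CUI_TEMPERATURE": [0.0, 0.2, 0.9],
}
print(disambiguate([0.8, 0.2, 0.1], candidates))  # context is closer to the illness vector
```

The disambiguation step is what lets the pipeline work without a conventional trained NER classifier: the vocabulary supplies the candidates, and the context vectors decide between them.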
Aha, ok, that’s interesting… So there are actually no statistical models involved in the main NER pipeline. No wonder I couldn’t find any then.
Thanks for confirming that
No worries - not sure what you mean by statistical models, but the context similarity calculation, comparison, and use of distributional semantic models are very much based on statistical techniques…