Is it possible to re-train a model or add concepts to the CDB of a saved model?
I am using a trained MedCAT model (MedCAT version 1.2.9.dev13) and would like to add more concepts relevant to my specialty to the CDB, without losing the previous training.
I have tried adding concepts using cdb.add_concept() after the model is loaded, but although this adds the concepts to the CDB, it does not affect the model’s output. Alternatively, can I re-train a previously trained model on a new data set so that it picks up the new concepts?
You can absolutely add new concepts to the model or even create your own!
Keep in mind that the CUI and primary name must be unique.
Although cdb.add_concept() can be used, it does not apply any preprocessing to the concept name. Instead, add new concepts via the function cat.add_and_train_concept(). You can read more about the function arguments here. Internally it makes a call to cdb.add_concept().
It is on our TODO list to clarify/rename these two functions. It’s quite confusing, imo.
But in short, try and work with the CAT object rather than directly with the CDB.
The CAT object consists of the CDB, Vocab and config files and stores them together.
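As a minimal sketch of working through the CAT object rather than the CDB (the exact keyword arguments can differ between MedCAT versions, so treat the names below as assumptions and check your version’s docs):

```python
# Hedged sketch: assumes a MedCAT 1.x-style CAT object exposing
# add_and_train_concept(cui=..., name=..., spacy_doc=...) and a
# cdb.cui2names mapping; argument names are assumptions, not the
# definitive API.
def add_specialty_concept(cat, cui, name, example_text=None):
    """Add a concept to a loaded model via the CAT object (not the raw
    CDB), so the concept name goes through MedCAT's usual name
    preprocessing. CUI and primary name must be unique."""
    if example_text is not None:
        # Optionally supply context so the concept gets trained, not
        # just added as a dictionary entry.
        doc = cat(example_text)  # running CAT on text yields a spaCy Doc
        cat.add_and_train_concept(cui=cui, name=name, spacy_doc=doc)
    else:
        cat.add_and_train_concept(cui=cui, name=name)
    # Return the names now registered for this CUI, for a quick check.
    return cat.cdb.cui2names.get(cui)
```

After adding, you would save the model pack as usual so the updated CDB is persisted alongside the Vocab and config.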
Thank you for the reply. The add_and_train_concept() function takes a spacy_doc argument, “Spacy representation of the document that was manually annotated”. Could you give some more details on this document’s format please? Or a sample document might be helpful.
Hi @anthony.shek, I have read that paper a few times and I am pretty sure it does not mention which models are actually being used. It only mentions that the training is supervised and that MetaCAT uses bi-directional LSTM models. Is this information confidential, or is there another reason it isn’t explained in more depth?
It would be very helpful if you could shed some light on this.
So you’re right. MetaCAT uses bi-LSTM but it can be done with BERT as well.
As for the main NER pipeline, you can consider it a custom pipeline: it is basically a vocabulary-based lookup with context-based disambiguation to link words or phrases to the most appropriate concept.
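To illustrate the general idea (a toy sketch of vocabulary lookup plus context-based disambiguation, not MedCAT’s actual implementation; the CUIs and vectors below are hypothetical), each candidate concept for an ambiguous name keeps a learned context vector, and the candidate whose vector is most similar to the current mention’s context wins:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(context_vec, candidates):
    """Pick the candidate CUI whose stored context vector is closest to
    the context of the current mention (toy illustration of
    context-based disambiguation, not MedCAT's real code)."""
    return max(candidates, key=lambda cui: cosine(context_vec, candidates[cui]))

# "cold" could link to an illness or a temperature; hypothetical vectors:
candidates = {
    "CUI_ILLNESS": [0.9, 0.1, 0.0],
    "CUI_TEMPERATURE": [0.0, 0.2, 0.9],
}
print(disambiguate([0.8, 0.2, 0.1], candidates))  # context is closer to the illness vector
```

The disambiguation step is what lets the pipeline work without a conventional trained NER classifier: the vocabulary supplies the candidates, and the context vectors decide between them.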
Aha, ok, that’s interesting… So there are actually no statistical models involved in the main NER pipeline. No wonder I couldn’t find any then.
Thanks for confirming that
No worries - not sure what you mean by statistical models, but the context similarity calculation, comparison, and use of distributional semantic models are very much based on statistical techniques…