MedCAT entity recognition is dictionary based. Each concept in the CDB is mapped to one or more "names" (other works often call these entries aliases). If any of these aliases appears in the text, it becomes an entity candidate. If the alias maps to only one CUI, it is an immediate match; otherwise, a disambiguation step is needed to determine which of the multiple CUIs is intended.
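To make the mechanism concrete, here is a minimal pure-Python sketch of the alias-to-CUI lookup described above. The dictionary contents and the ambiguous "cold" example are made up for illustration; the real CDB stores this mapping in a more elaborate structure.

```python
# Illustrative sketch of the alias -> CUI lookup; the entries below are
# invented examples, not real CDB contents.

# One alias can map to one or more concept IDs (CUIs).
alias_to_cuis = {
    "epidural~injection": {"CUI_A"},       # unambiguous alias: immediate match
    "cold": {"CUI_B", "CUI_C"},            # ambiguous: common cold vs. low temperature
}

def lookup(alias: str):
    """Return (candidate CUIs, needs_disambiguation) for an alias found in text."""
    cuis = alias_to_cuis.get(alias, set())
    return cuis, len(cuis) > 1

cuis, ambiguous = lookup("epidural~injection")
print(cuis, ambiguous)   # one CUI -> no disambiguation needed
cuis, ambiguous = lookup("cold")
print(ambiguous)         # multiple CUIs -> disambiguation step required
```

The disambiguation model's job is exactly the second case: picking one CUI out of the candidate set using context.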
Most of MedCAT's "machine learning" actually happens in the disambiguation step; the standard NER is dictionary based, preceded by a spell-checker and, depending on your config, some lemmatization.
You are likely using a model that does not have dictionary entries for your use case, so you'll need to add more entries to the CDB, either manually or via training.
The supervised training process will add new entries if they are not already present.
Thanks for your response and for the explanation; it was helpful. I will consider adding new entries via training or manually.
Based on my discussion with Anthony last year, I am trying to reproduce the MedCAT demo hosted at https://medcat.rosalind.kcl.ac.uk/. I have used the NHS TRUD SNOMED International files to map SNOMED codes further to ICD-10/OPCS-4 codes. I was under the impression that this would let me reproduce the demo's results. However, as the attached image from the MedCAT demo shows for the same test_string, the demo recognizes entities noticeably better; for instance, it picks up the important entity "Epidural Injection".
In this context, could you please advise whether I am missing any step needed to reproduce the above? Knowing which base model pack was used for the demo would also be helpful.
Also, I would like to know whether it is possible to replace MedCAT's dictionary-based matching with the NER from another spaCy model and use it together with MedCAT's ontology-linking capability.
I understand this is a complex area of work, and any help and guidance from your end is highly appreciated.
The demo is there for demonstrative purposes. In theory you should be able to download the demo artifacts, build the CAT from the vocab and CDB, and run inference over your documents, but many things could go wrong, such as mismatched library versions.
You could verify that epidural~injection is included as a name in the CDB you are using locally. If not, you'll have to add that name to your CDB. The NER is dictionary based, so it will only detect entities that are in the dictionary.
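A rough sketch of that check-then-add workflow, using a plain dict in place of the real CDB. Note that MedCAT normalizes multi-word names with `~` separators (hence `epidural~injection`); the CUI value here is a placeholder, and the real CDB API for adding names differs by MedCAT version, so treat this purely as an illustration of the logic.

```python
# Stand-in for the CDB's name -> CUIs mapping; contents are invented.
name2cuis = {
    "headache": ["CUI_HEADACHE"],
}

def ensure_name(name2cuis: dict, name: str, cui: str) -> bool:
    """Add `name -> cui` if absent; return True if it was already present."""
    # Mimic MedCAT-style normalization: lowercase, '~' between tokens.
    key = name.lower().replace(" ", "~")
    if key in name2cuis:
        return True
    name2cuis[key] = [cui]  # placeholder CUI, not a real SNOMED/UMLS code
    return False

present = ensure_name(name2cuis, "Epidural Injection", "CUI_EPIDURAL")
print(present)                             # False: name was missing, so it was added
print("epidural~injection" in name2cuis)   # True
```

In real MedCAT you would do the membership check and the addition against the loaded CDB object rather than a bare dict.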
I have not personally tried adding another NER component but in theory it should be possible.
The NER component is a spaCy pipeline component: it reads in a Doc and returns a Doc. In MedCAT, the NER returns a Doc with additional annotations.
In theory you could either subclass the existing pipe, or remove it and add a new pipe. Your new pipe would need to add similar annotations to the doc, in the form of doc._.ents, for downstream components (really the Linker) to work properly.
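The contract above can be sketched as follows. This uses tiny stand-in classes rather than real spaCy/MedCAT objects, so it only illustrates the shape of the component: a callable that takes a doc, writes entity annotations to `doc._.ents`, and returns the doc. The class names, the label value, and the tuple format of the entities are all assumptions for the sketch, not MedCAT's actual internal format.

```python
class Underscore:
    """Stand-in for spaCy's doc._ extension-attribute namespace."""
    pass

class Doc:
    """Stand-in for a spaCy Doc: just text plus the ._ namespace."""
    def __init__(self, text: str):
        self.text = text
        self._ = Underscore()

class CustomNER:
    """A pipe-shaped component: Doc in, same Doc out with annotations added."""
    def __init__(self, patterns: dict):
        self.patterns = patterns  # surface form -> label (placeholder CUIs here)

    def __call__(self, doc: Doc) -> Doc:
        ents = []
        for surface, label in self.patterns.items():
            start = doc.text.lower().find(surface)
            if start != -1:
                ents.append((surface, start, label))
        doc._.ents = ents  # downstream components (the Linker) read doc._.ents
        return doc

ner = CustomNER({"epidural injection": "CUI_EPIDURAL"})
doc = ner(Doc("Patient received an epidural injection."))
print(doc._.ents)
```

In a real replacement you would register the component with spaCy's pipeline machinery and make sure the annotations you write match whatever the Linker expects to read in your MedCAT version.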
I would recommend enriching your vocab first before trying to add a new NER model. The above is what I would do to get started with a separate NER component, but I haven't tested it, so treat it as a rough sketch.
That was really insightful. Thanks a lot for your response.
I will try the approaches you suggested; I appreciate your support. I hope we can keep in touch to exchange ideas and approaches around this work. If possible, please share your email address for that purpose.