Using different scispaCy models with MedCAT

Hello all,

I am a Data Scientist working in the NHS in the UK. I’m working on NER+L (named entity recognition and linking) using MedCAT.

Has anyone successfully used a different scispaCy model with an existing MedCAT model pack? I tried installing the new model and replacing the model name in the CAT config like below:

cat.config.general['spacy_model'] = "en_core_sci_md"

However, based on the performance I am seeing, I suspect this is not the right way to do it. If anyone has experience with this, please share your thoughts! Thanks in advance.

Hi @Prajwal_Khairnar,

Not sure why you are looking to use “en_core_sci_md”.
We migrated away from that model a while ago, as it is now unsupported.
There is very little difference from “en_core_web_md”.

You will not see any major difference unless you train the model.

Hi Anthony,

Thanks a lot for your response. I appreciate your time.

Based on testing various spaCy models on sample data (without MedCAT, just a spaCy NER pipeline), I observed differences like the following:

MedCAT modelpack using en_core_web_md:

NER spacy pipeline using en_core_sci_md:

I am trying to understand whether I can achieve entity recognition like the second case within MedCAT, as entity-recognition performance is my primary concern.

I’d appreciate any advice from your end. Thanks in advance.
Regards,
Prajwal

MedCAT entity recognition is dictionary based. Each concept in the CDB is mapped to one or more “names” (other work frequently calls these entries “aliases”). If any of these names is present in the text, it becomes an entity candidate. If the name maps to only one CUI, it is an immediate match; otherwise a disambiguation step is needed to determine which of the multiple CUIs applies.

Most of the “machine learning” in MedCAT actually happens around the disambiguation step; the standard NER is dictionary based, with a spell-checker that runs first and some lemmatization, depending on your config.

You are likely using a model that does not have dictionary entries for your use case, so you will need to add more entries to the CDB, either manually or via training.

The supervised training process will add new entries if they are not already there.
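To make that flow concrete, here is a minimal pure-Python sketch of how an alias-to-CUI table drives dictionary-based detection and where disambiguation comes in. This is not MedCAT’s actual implementation; the table, names, and CUIs below are invented for illustration, and only the “join tokens with `~`” convention mirrors how MedCAT stores names.

```python
# Toy alias -> CUI table, standing in for the CDB's name-to-concept mapping.
# All entries and CUIs here are made up for illustration.
alias_to_cuis = {
    "epidural~injection": ["C0857044"],   # one CUI: immediate match
    "cold": ["C0009443", "C0234192"],     # several CUIs: needs disambiguation
}

def normalise(span):
    # MedCAT stores multi-token names joined with "~"; mimic that here.
    return "~".join(span.lower().split())

def lookup(span):
    """Return (status, cuis) for a candidate text span."""
    cuis = alias_to_cuis.get(normalise(span), [])
    if not cuis:
        return "no-match", []        # not in the dictionary: never detected
    if len(cuis) == 1:
        return "immediate", cuis     # alias maps to exactly one concept
    return "disambiguate", cuis      # a learned disambiguator must pick one

print(lookup("Epidural Injection"))  # ('immediate', ['C0857044'])
print(lookup("cold"))                # ('disambiguate', ['C0009443', 'C0234192'])
print(lookup("sore throat"))         # ('no-match', [])
```

The “no-match” case is the one Jerome describes: adding the missing name to the CDB (manually or via supervised training, in real MedCAT) is what makes a previously missed span detectable.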

Hi Jerome,

Thanks for your response and for the explanation; it was very helpful. I will consider adding new entries manually or via training.

Following my discussion with Anthony last year, I am trying to reproduce the MedCAT demo hosted at https://medcat.rosalind.kcl.ac.uk/. I have used the NHS TRUD SNOMED International files to map SNOMED codes on to ICD-10/OPCS-4 codes. I was under the impression that this should let me produce the same results as the demo. However, as the attached screenshot from the MedCAT demo shows for the same test_string, the demo recognizes entities better; for instance, it picks up “Epidural Injection”, which is an important entity.

In this context, could you please advise whether I am missing a step needed to reproduce the demo? Knowing which base model pack the demo uses would also be helpful.

Also, I would like to know whether it is possible to use the NER from another spaCy model in place of the dictionary-based MedCAT matching, while keeping MedCAT’s ontology-linking capability.

I understand this is a complex work area and I just want to mention that any help and guidance from your end is highly appreciated.

Kind regards,
Prajwal

The demo is there for demonstrative purposes. In theory you should be able to download the demo artifacts, build the CAT from the vocab and CDB, and run inference over your documents, but there are many potential issues, such as mismatched library versions, that could go wrong.

You could verify that epidural~injection is included as a name in the CDB you are using locally. If it is not, you will have to add that name to your CDB. The NER is dictionary based, so it will only detect entities that are in the dictionary.
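A local check along those lines might look like the sketch below. The dict is only a stand-in so the snippet runs on its own; against a real model pack you would consult the loaded CDB’s name table instead (in MedCAT v1 this is the `name2cuis` attribute, though attribute names may differ between versions), and the CUI shown is invented.

```python
# Stand-in for the CDB's name -> CUIs table; replace with the real CDB's
# mapping (e.g. cat.cdb.name2cuis in MedCAT v1) when checking a model pack.
name2cuis = {"epidural~injection": ["C0857044"]}  # invented CUI

def has_name(name2cuis, raw_name):
    # MedCAT stores multi-token names lower-cased and joined with "~".
    key = "~".join(raw_name.lower().split())
    return key in name2cuis

print(has_name(name2cuis, "Epidural Injection"))  # True: detectable
print(has_name(name2cuis, "Lumbar Puncture"))     # False: would need adding
```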

I have not personally tried adding another NER component, but in theory it should be possible.

The NER component is a spaCy pipeline component: it reads in a Doc and returns a Doc. In MedCAT, the NER returns a Doc with additional annotations.

In theory you could either subclass Pipe, or remove the existing pipe and add a new one. Your new pipe would need to add similar annotations to the Doc, in the form of doc._.ents, for downstream components (really the Linker) to work properly.
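As an untested, illustrative sketch of that shape, here is the idea in plain Python. The `Doc` and `Underscore` classes below are minimal stand-ins for the real spaCy classes (so the snippet is self-contained), `toy_model` stands in for an external model such as en_core_sci_md, and the exact structure MedCAT’s Linker expects in `doc._.ents` is an assumption to verify against your MedCAT version.

```python
class Underscore:
    """Stand-in for spaCy's doc._ extension namespace."""
    def __init__(self):
        self.ents = []

class Doc:
    """Minimal stand-in for a spaCy Doc."""
    def __init__(self, text):
        self.text = text
        self._ = Underscore()

class ExternalNER:
    """A replacement NER pipe: reads a doc and returns the same doc with
    candidate entities written to doc._.ents, the slot the downstream
    MedCAT Linker consumes (per the discussion above)."""
    def __init__(self, model):
        self.model = model  # e.g. a loaded external NER pipeline

    def __call__(self, doc):
        for span in self.model(doc.text):  # external model proposes spans
            doc._.ents.append(span)        # annotate in the expected slot
        return doc

# Toy "model" that flags one hard-coded phrase, just to exercise the pipe.
def toy_model(text):
    phrase = "epidural injection"
    return [phrase] if phrase in text.lower() else []

doc = ExternalNER(toy_model)(Doc("Patient given an Epidural Injection."))
print(doc._.ents)  # ['epidural injection']
```

In a real pipeline the candidate spans would still need to be linked, so the Linker (and whatever span attributes it reads) remains the part to verify first.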

I would recommend enriching your vocab first, before trying to add a new NER model. The above is what I would do to get started with a separate NER component, but I have not tested it, so it is just a rough sketch.

Hi Jerome,

That was really insightful. Thanks a lot for your response.

I will try the approaches you suggested, and I appreciate your support. I hope we can stay in touch and keep exchanging ideas and approaches around this work. If possible, could you please share your email address?

Thanks and regards,
Prajwal
