Reenabling pipeline components

Disabling spaCy components means losing annotations such as sentence boundaries and the dependency parse (i.e. the head tree). What happens when we add the spaCy sentencizer and parser components back?

So far in my testing, nothing “bad” seems to happen. However, can others confirm this doesn’t lead to harmful side effects?

Hi @plandes, there should be no harmful effects unless you use NER components from spaCy that somehow clash with the MedCAT NER components. Things like the sentencizer or parser will not affect anything.

No, those are the only two components I add. I then merge non-medical NER results from a separate spaCy Language model instance.

However, I have noticed that the sentence boundaries fall in different places than with a non-MedCAT spaCy language model. Any insight on that?

Possibly because we’ve modified the standard spaCy tokenizer and added some other rules, but the differences should be minimal. One thing to check is that the spaCy model you are using is exactly the same as the one in the MedCAT pipeline, and that all the components that help detect sentence boundaries are enabled.
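To see exactly where the boundaries diverge, one option is to compare sentence spans as character offsets, which stay comparable even when the two pipelines tokenize differently. This is only a minimal sketch: the helper names `sentence_spans` and `diff_sentence_spans` are made up for illustration, and it assumes both pipelines produce a spaCy `Doc` with sentence boundaries set.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_char, end_char) of one sentence


def sentence_spans(doc) -> List[Span]:
    """Character offsets of each sentence in a spaCy Doc.

    Requires that some component (sentencizer, parser, ...) has set
    sentence boundaries, otherwise spaCy raises on `doc.sents`.
    """
    return [(s.start_char, s.end_char) for s in doc.sents]


def diff_sentence_spans(a: List[Span], b: List[Span]) -> Dict[str, List[Span]]:
    """Return the sentence spans produced by only one of the two pipelines."""
    set_a, set_b = set(a), set(b)
    return {
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }
```

Running the same text through both pipelines and passing the two span lists to `diff_sentence_spans` should show only the sentences that differ. Printing `nlp.pipe_names` and `nlp.meta["name"]` / `nlp.meta["version"]` for both spaCy `Language` objects is a quick way to confirm that the models and enabled components really match.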

This makes sense. Thank you.

One more question on this: can you provide an example of text where the spelling is fixed and an abbreviation is expanded, so that I can use it in a unit test case with my own setup? Thanks in advance.

Try typing “Intracerebral heorrhage and CKD” into the demo app here. It will detect the first part even though it is misspelt, and also CKD even though it is an abbreviation. The text itself will not be modified; internally, the model ignores the spelling mistake and recognizes the abbreviation.
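For a unit test built on that example, something along these lines might work. It is a sketch, not a confirmed recipe: it assumes the result has the shape `{"entities": {id: {"source_value": ..., "start": ..., "end": ...}}}`, which is what `cat.get_entities(text)` returns in the MedCAT 1.x releases as far as I know, and both helper names are hypothetical.

```python
from typing import Any, Dict, List

TEXT = "Intracerebral heorrhage and CKD"  # the misspelling is intentional


def detected_source_values(result: Dict[str, Any]) -> List[str]:
    """Collect the matched text of each entity, ordered by start offset.

    Assumes the result shape described above (an assumption about the
    MedCAT output format; adjust the keys if your version differs).
    """
    entities = result.get("entities", {}).values()
    return [e["source_value"] for e in sorted(entities, key=lambda e: e["start"])]


def text_untouched(text: str, result: Dict[str, Any]) -> bool:
    """True if every entity's offsets still point at its original text."""
    return all(
        text[e["start"]:e["end"]] == e["source_value"]
        for e in result.get("entities", {}).values()
    )
```

In a test, you would call `result = cat.get_entities(TEXT)` with your own loaded model, then assert that `"CKD"` appears in `detected_source_values(result)` and that `text_untouched(TEXT, result)` holds. Which concepts are actually linked depends on the model pack you load, so pin the expected values to your own setup.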
