MedCAT sentencisation and chunking

Hi all, we are using MedCAT to develop point-of-care NLP for clinical note-taking at UCLH. I have some questions about sentencisation, so that we can make sure it works with the format of clinical notes.

Clinicians often enter a list of diagnoses separated by line breaks, with no full stops, and we want to make sure that concepts spanning two lines are not picked up incorrectly, e.g.

Cancer of kidney
Stones - gallbladder

(may be detected incorrectly as [Cancer] [Kidney stones])

How does MedCAT currently handle sentence breaks and line breaks? My understanding is that line breaks are treated no differently from other whitespace in the default model; we were thinking of converting them to sentence breaks in a preprocessor.
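To make the preprocessor idea concrete, here is a minimal sketch in plain Python. It does not use any MedCAT API; the function name `linebreaks_to_sentence_breaks` and the choice of terminator characters are our own assumptions, and whether an added full stop is enough to make the sentenciser split there would still need checking against the actual pipeline.

```python
# Hypothetical preprocessor: not part of MedCAT.
# Assumption: a trailing full stop is enough for the downstream
# sentenciser to treat the end of each line as a sentence boundary.

# Lines already ending in one of these are left unchanged.
_TERMINATORS = (".", "!", "?", ":", ";")


def linebreaks_to_sentence_breaks(text: str) -> str:
    """Append a full stop to every non-empty line lacking terminal punctuation."""
    out = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped and not stripped.endswith(_TERMINATORS):
            stripped += "."
        out.append(stripped)
    return "\n".join(out)


if __name__ == "__main__":
    note = "Cancer of kidney\nStones - gallbladder"
    print(linebreaks_to_sentence_breaks(note))
```

One caveat with this approach: inserting characters shifts the character offsets, so any annotation spans would need to be mapped back onto the original note before being displayed.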

Also, the contextual information will almost always be confined to a single sentence. Is MetaCAT restricted to sentences, or could it be configured in this way?

Contextual information that spans sentences (at the paragraph level) is also useful (e.g. whether a paragraph is about past medical history / medication list / differential diagnosis), and we wondered if anyone has done any prior work on this using MedCAT?
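To illustrate what we mean by paragraph-level context, here is a rough rule-based sketch in plain Python. It is not based on MedCAT or MetaCAT; the heading patterns, the label names and the function `label_paragraphs` are all hypothetical, and it assumes paragraphs are separated by blank lines and start with a recognisable heading.

```python
import re

# Hypothetical heading patterns; a real list would need to be curated locally.
SECTION_PATTERNS = {
    "past_medical_history": re.compile(r"^\s*(pmh|past medical history)\b", re.I),
    "medication": re.compile(r"^\s*(medications?|drug history)\b", re.I),
    "differential_diagnosis": re.compile(r"^\s*(differential( diagnosis)?|ddx)\b", re.I),
}


def label_paragraphs(text):
    """Split text on blank lines and label each paragraph by its first line."""
    results = []
    for para in re.split(r"\n\s*\n", text.strip()):
        first_line = para.splitlines()[0] if para else ""
        label = "unknown"
        for name, pattern in SECTION_PATTERNS.items():
            if pattern.search(first_line):
                label = name
                break
        results.append((label, para))
    return results


if __name__ == "__main__":
    note = "PMH:\nAsthma\nCOPD\n\nMedications\nSalbutamol\n\nSeen in clinic today"
    for label, para in label_paragraphs(note):
        print(label, "|", para.replace("\n", " / "))
```

The label could then be attached to every concept detected inside that paragraph, as an extra piece of context alongside the sentence-level information.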

Thanks, Anoop

Looking at the, I don’t see it removing \n or \r characters. Are those newline characters already retained after tokenising?

It would be interesting to see whether it has much impact on downstream models.