MedCAT sentencisation and chunking

Hi all, we are using MedCAT to develop point-of-care NLP for clinical note-taking at UCLH. I have some questions about sentencisation, so that we can make sure it works with the format of our clinical notes.

Clinicians often enter a list of diagnoses separated by line breaks, with no full stops, and we want to make sure that concepts spanning two lines are not picked up incorrectly, e.g.

PMH:
Cancer of kidney
Stones - gallbladder

(may be detected incorrectly as [Cancer] [Kidney stones])

How does MedCAT currently handle sentence breaks and line breaks? My understanding is that the default model treats line breaks no differently from other whitespace; we were thinking of converting them to sentence breaks in a preprocessor.
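
For illustration, here is a minimal sketch of the kind of preprocessor we have in mind. It is plain Python and does not use any MedCAT API; the function name `linebreaks_to_sentence_breaks` and the choice of terminal punctuation are our own assumptions. It appends a full stop to every non-empty line that does not already end in punctuation, so that a downstream sentence splitter has an explicit boundary to work with.

```python
# Hypothetical helper, not part of MedCAT: an assumed preprocessing step
# that turns line breaks into explicit sentence breaks.

# Characters we assume already mark the end of a line's content
# (the colon keeps headers such as "PMH:" unchanged).
_TERMINAL = (".", "!", "?", ":", ";")


def linebreaks_to_sentence_breaks(text: str) -> str:
    """Append a full stop to each non-empty line lacking terminal punctuation."""
    out = []
    for line in text.splitlines():  # also handles \r\n and \r
        stripped = line.rstrip()
        if stripped and not stripped.endswith(_TERMINAL):
            stripped += "."
        out.append(stripped)
    return "\n".join(out)


note = "PMH:\nCancer of kidney\nStones - gallbladder"
print(linebreaks_to_sentence_breaks(note))
```

One caveat: inserting characters shifts the character offsets of everything that follows, so any annotations would need to be mapped back to the original note text before being shown to the clinician.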

Also, the contextual information will almost always be confined to a single sentence. Is MetaCAT restricted to sentences, or could it be configured to be?

Contextual information that spans sentences (at the paragraph level) is also useful (e.g. whether a paragraph is about past medical history / medication list / differential diagnosis). Has anyone done any prior work on this using MedCAT?

Thanks, Anoop

Looking at cleaner.py, I don't see it removing \n or \r characters. Are those newline characters already retained after tokenising?

It would be interesting to see whether it has much impact on downstream models.