MedCAT sentencisation and chunking

anoopshah · December 21, 2022, 2:07pm

Hi all, we are using MedCAT to develop point of care NLP for clinical note-taking at UCLH. I had some queries about the sentencisation so that we can ensure it works for the format of clinical notes.

Clinicians often enter a list of diagnoses with linebreaks and no full stops, and we want to make sure that concepts spanning lines are not incorrectly picked up, e.g.

PMH:
Cancer of kidney
Stones - gallbladder

(may be detected incorrectly as [Cancer] [Kidney stones])

How does MedCAT currently handle sentence breaks and line breaks? My understanding is that linebreaks are treated no differently from other whitespace in the default model; we were thinking of converting them to sentence breaks in a preprocessor.
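A minimal sketch of what such a preprocessor might look like, assuming it is enough to end each non-empty line with a full stop so that a sentence splitter sees a boundary there. The function name `linebreaks_to_sentence_breaks` is hypothetical and is not part of MedCAT; lines that already end in punctuation (including header colons such as "PMH:") are left unchanged.

```python
# Hypothetical helper, not a MedCAT API: make every line break look like
# a sentence break by appending a full stop where none is present.
_TERMINAL = (".", "!", "?", ":", ";")


def linebreaks_to_sentence_breaks(text: str) -> str:
    out = []
    for line in text.splitlines():
        stripped = line.rstrip()
        # Only add a full stop to non-empty lines lacking final punctuation.
        if stripped and not stripped.endswith(_TERMINAL):
            stripped += "."
        out.append(stripped)
    return "\n".join(out)


note = "PMH:\nCancer of kidney\nStones - gallbladder"
print(linebreaks_to_sentence_breaks(note))
```

Note that inserting characters shifts the character offsets, so any annotation spans would need mapping back to the original text.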

Also, the contextual information will almost always be confined to a single sentence. Is MetaCAT restricted to sentences, or could it be configured in this way?

Contextual information that spans sentences (at the paragraph level) is also useful, e.g. whether a paragraph is about past medical history / medication list / differential diagnosis. Has anyone done any prior work on this using MedCAT?
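To illustrate the kind of paragraph-level context meant here, a rough sketch that splits a note into sections by header lines. It assumes a header is a short line ending in a colon (e.g. "PMH:"); `split_sections` is a hypothetical name, and this is plain Python rather than a MedCAT feature.

```python
import re
from typing import List, Tuple

# Assumption: a section header is a short line of letters, spaces or
# slashes that ends in a colon, e.g. "PMH:" or "Medications:".
HEADER = re.compile(r"^\s*([A-Za-z][A-Za-z /]{0,40}):\s*$")


def split_sections(text: str) -> List[Tuple[str, str]]:
    """Return (header, body) pairs; text before any header gets header ''."""
    sections = []
    header, lines = "", []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # Close the previous section before starting a new one.
            if header or lines:
                sections.append((header, "\n".join(lines).strip()))
            header, lines = m.group(1).strip(), []
        else:
            lines.append(line)
    if header or lines:
        sections.append((header, "\n".join(lines).strip()))
    return sections
```

The section header could then be attached to each detected concept in a post-processing step, according to which section the concept's text falls in.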

Thanks, Anoop

Jthteo · December 23, 2022, 5:41pm

Looking at cleaner.py, I don’t see it removing \n or \r characters. Are those newline characters already retained after tokenising?

It would be interesting to see whether it has much impact on downstream models.
