Preprocessed text in medcat trainer projects

sangeetabose · February 3, 2023, 7:14am

Should one preprocess the text before creating projects for the MedCAT trainer? I am wondering whether preprocessing will actually hinder the ‘understandability’ of the context for the annotator. For example, if the text contains ‘Stage III?’, the ‘?’ tells the annotator that the doctor is not sure about stage III. But if the annotator receives preprocessed text with the ‘?’ removed, that context is lost. Please share your experiences.

tomolopolis · February 3, 2023, 12:25pm

Hi @sangeetabose - others can also chime in here - but generally it's best to avoid pre-processing where possible. The model can only learn to extract and disambiguate linked clinical terms, or to contextualise linked terms, through the context of the span.

One thing to be clear on, however, is annotation guidelines. Even if you’re annotating with only one annotator, writing comprehensive guidelines ensures you’ll gather consistent input data for model training / fine-tuning. Consistency is the most important thing here.

Hope that helps!

sangeetabose · February 3, 2023, 12:47pm

Thanks, that is useful. I did think the same. But the less preprocessing one does, the more ‘context of span’ one has. Hence we need to train for more contexts. Would that be right?

Jthteo · February 4, 2023, 8:51am

I agree that the preprocessing tends to remove valuable contextual features.

The only bit of pre-processing that needs to be thought through is what to do about document formatting features (especially in XHTML, where there may be line breaks and tabular tags).
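
As a rough illustration of that kind of formatting-only clean-up, here is a minimal sketch using Python's standard `html.parser`. It is not part of MedCAT or MedCATTrainer; the names `FormattingStripper` and `strip_formatting` are hypothetical, and the choice of which tags become line breaks or tabs is an assumption you would adapt to your own documents. The point is that only markup is dropped, while the clinical text itself (including markers such as ‘?’) is left alone.

```python
from html.parser import HTMLParser


class FormattingStripper(HTMLParser):
    """Hypothetical helper: drop XHTML tags, keep the text itself unchanged."""

    # Assumed tag sets; adjust to the markup found in your own documents.
    BLOCK_END_TAGS = {"p", "div", "tr", "li"}
    CELL_TAGS = {"td", "th"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here; turn them into a newline.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_END_TAGS:
            self.parts.append("\n")   # end of a paragraph or table row
        elif tag in self.CELL_TAGS:
            self.parts.append("\t")   # keep table cells visually separated

    def handle_data(self, data):
        # Text is passed through as-is, so 'Stage III?' keeps its '?'.
        self.parts.append(data)


def strip_formatting(xhtml: str) -> str:
    parser = FormattingStripper()
    parser.feed(xhtml)
    parser.close()
    text = "".join(parser.parts)
    # Trim stray spaces/tabs at line edges and drop empty lines.
    lines = [line.strip(" \t") for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = (
    "<p>Diagnosis: Stage III?<br/>Plan: review</p>"
    "<table><tr><td>Hb</td><td>120</td></tr></table>"
)
print(strip_formatting(sample))
```

The line breaks and cell boundaries survive as plain newlines and tabs, so the annotator still sees the layout of the original note without the tags getting in the way.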
