Preprocessed text in MedCAT Trainer projects

Should one preprocess the text before creating projects for the MedCAT trainer? I am wondering whether preprocessing will actually hinder the ‘understandability’ of the context for the annotator. For example, if the text contains "Stage III?", the question mark tells the annotator that the doctor is not sure of the stage. But if the annotator receives preprocessed text with the ‘?’ removed, that context is lost. Please share your experiences.

Hi @sangeetabose, others can also chime in here, but generally it’s best to avoid pre-processing where possible. The model can only learn to extract and disambiguate linked clinical terms, or to contextualise linked terms, through the context of the span.

One thing to be clear on, however, is annotation guidelines. Even if you’re annotating with only one annotator, writing some comprehensive guidelines ensures you’ll gather consistent input data for model training / fine-tuning. Consistency is the most important thing here.

Hope that helps!

Thanks, that is useful. I did think the same. But the less preprocessing one does, the more ‘context of span’ one has, hence we need to train for more contexts. Would that be right?

I agree that preprocessing tends to remove valuable contextual features.
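As a rough illustration of that point, here is a minimal sketch of a common punctuation-stripping step applied to the "Stage III?" example from the question. The function name `strip_punctuation` and the sample sentence are made up for illustration and are not part of MedCAT or MedCATtrainer.

```python
import re


def strip_punctuation(text: str) -> str:
    """Hypothetical 'cleaning' step: keep only word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text)


# Made-up sample sentence, based on the "Stage III?" example in the question.
raw = "Diagnosis: Stage III? Follow up in 2 weeks."
cleaned = strip_punctuation(raw)

print(raw)      # the '?' still signals the clinician's uncertainty
print(cleaned)  # the '?' is gone, so the uncertainty cue is lost
```

After this step neither the annotator nor the model can tell a confirmed stage from a queried one, which is the kind of contextual feature being discussed.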

The only bit of pre-processing that needs to be thought through is what to do about document formatting features (especially in XHTML, where there may be line breaks and tabular tags).
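One possible way to handle that, sketched with only the Python standard library: drop the markup but turn line-break and table tags into newlines and tabs, leaving the text itself (including characters such as ‘?’) untouched. The class name `XhtmlToText`, the helper `xhtml_to_text`, the tag sets and the sample markup are all assumptions for illustration; this is not something MedCAT or MedCATtrainer provides, and real documents may need a different mapping.

```python
from html.parser import HTMLParser


class XhtmlToText(HTMLParser):
    """Hypothetical converter: removes tags, keeps layout as newlines and tabs."""

    BLOCK_TAGS = {"p", "div", "tr", "li"}  # assumed: these end a line
    CELL_TAGS = {"td", "th"}               # assumed: these end a table cell

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both arrive here; emit a line break.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.CELL_TAGS:
            self.parts.append("\t")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text content is passed through unchanged, punctuation included.
        self.parts.append(data)


def xhtml_to_text(markup: str) -> str:
    parser = XhtmlToText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Made-up sample markup for illustration.
sample = (
    "<p>Stage III?<br/>Review in clinic.</p>"
    "<table><tr><td>Hb</td><td>10.2</td></tr></table>"
)
text = xhtml_to_text(sample)
print(text)
```

The idea is to limit pre-processing to the formatting layer, so the annotator sees readable text while the clinical wording stays exactly as written.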