I’m in the process of evaluating our own MedCAT model and I’m trying to understand how I should split my dataset so that I can evaluate performance after both unsupervised and supervised training.
My main question is: should the clinical notes used in an annotated evaluation set never be used during unsupervised training?
My confusion comes from my understanding of how MedCAT itself was evaluated. Based on the paper [1], it seems M3 at KCH (an unsupervised model) was trained on the entire KCH EPR dataset and then evaluated (F1, SD, IQR) using manually annotated notes from that same dataset as ground truth.
It also seems that M4 at KCH (M3 + supervised training) was evaluated using 10-fold cross-validation on the same annotated notes as M3. In this case I assume that for each fold M3 was trained (supervised) on 9/10 of the annotated notes and then evaluated on the remaining 1/10. However, it wasn’t clear to me whether that 1/10 of notes was also excluded from unsupervised training.
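For concreteness, this is roughly the split I have in mind for the supervised part (plain scikit-learn; the note IDs are made up):

```python
from sklearn.model_selection import KFold

# Made-up IDs standing in for the manually annotated evaluation notes
annotated_note_ids = [f"note_{i}" for i in range(300)]

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(annotated_note_ids)):
    train_ids = [annotated_note_ids[i] for i in train_idx]  # 9/10: supervised training
    test_ids = [annotated_note_ids[i] for i in test_idx]    # 1/10: evaluation (F1 etc.)
    print(f"fold {fold}: {len(train_ids)} train / {len(test_ids)} test")
    # What I can't tell from the paper: were these test notes also part of
    # the corpus used for the unsupervised training of M3?
```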
Am I correct to assume that for M3 and M4 the MedCAT models saw all the clinical notes in the evaluation set during unsupervised training?
Is this overlap between the annotated evaluation notes and the notes used for unsupervised training of no or minimal concern because the influence of a single note on the context embeddings of a concept is very small given a big enough training corpus?
[1] “Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit”, under section 2.4.4 Clinical Use Case NER+L Experimental Setup and Table 5.
I will preface this by saying that I wasn’t here when the paper was written and published, so the following is somewhat of an assumption.
I would say there was probably some overlap between the unsupervised and supervised training sets.
I would argue that the influence of a single note (or a few notes out of the entire corpus) on unsupervised training will indeed be minimal compared to evaluating on them in a supervised manner. Though (as far as I know) we’ve not tested this.
On the other hand, it’s important to note that the unsupervised process will not necessarily have been able to learn from all the concepts in the corpus. That’s because it is - as its name suggests - unsupervised. If a term is ambiguous, the model can only disambiguate it if the various candidate concepts have previously been trained on their unique, unambiguous names (up to a sufficient threshold of examples). For example, an abbreviation like “HR” can only be disambiguated if concepts such as “heart rate” and “hazard ratio” have each been seen often enough under their full, unambiguous names. So if the dataset didn’t contain enough unique names, the model may never have been able to train on the (potentially much larger number of) ambiguous names, and because of that the impact of these few notes may be lower still. But again (as far as I know), this has not been extensively tested.
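As a side note, if you want a rough feel for how well the unsupervised pass actually covered the concepts you care about, you can look at the per-concept training counts MedCAT keeps on the CDB. A minimal sketch below, assuming a MedCAT 1.x model pack; the path, the example CUIs and the threshold of 30 are placeholders I’ve made up, so check the attribute names against the version you’re running:

```python
from medcat.cat import CAT

# Hypothetical path to your unsupervised (M3-style) model pack
cat = CAT.load_model_pack("m3_model_pack.zip")

# Example CUIs you plan to evaluate on (placeholders)
eval_cuis = ["C0011849", "C0020538", "C0027051"]

# cui2count_train records how many training examples each concept accumulated
# during unsupervised training; concepts with very few examples will be hard
# (or impossible) to disambiguate when only their ambiguous names appear.
counts = cat.cdb.cui2count_train
for cui in eval_cuis:
    n = counts.get(cui, 0)
    flag = " <- possibly too few to disambiguate reliably" if n < 30 else ""
    print(f"{cui}: {n} unsupervised training examples{flag}")
```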
But to answer the main question - as I’m sure you’re aware - if you wish to be absolutely certain there’s no overfitting to the evaluation set, and/or if the evaluation set would be a large chunk of the unsupervised training data, you should avoid the overlap.
However, if the evaluation set would be a minuscule subset of the unsupervised dataset (the KCH corpus is at least hundreds of thousands - potentially millions - of notes vs a few hundred annotated for evaluation within those), and/or if the concern isn’t that strong, you should be fine with the slight overlap.
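And if you do decide to keep them strictly separate, the split itself is cheap: just drop the annotated notes (by note ID) from the corpus before the unsupervised pass. A rough sketch; the file names, column names and the exact training call are assumptions on my side rather than anything from the paper:

```python
import pandas as pd
from medcat.cat import CAT

# Hypothetical inputs: the full note corpus and the IDs of the annotated evaluation notes
corpus = pd.read_csv("epr_notes.csv")  # assumed columns: note_id, text
with open("annotated_note_ids.txt") as f:
    annotated_ids = {line.strip() for line in f}

# Unsupervised training only ever sees notes that are NOT in the annotated evaluation set
unsup_corpus = corpus[~corpus["note_id"].isin(annotated_ids)]

cat = CAT.load_model_pack("base_model_pack.zip")  # hypothetical base pack
cat.train(unsup_corpus["text"].values)            # unsupervised pass (MedCAT 1.x style call)
```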