I am using MedCAT to identify a concept in text, keeping only detections whose meta annotations are Subject: Patient and Presence: True.
Running the basic model results in many false positives.
On further inspection, at least one major source of these is a generic sentence printed on blood test documents, which mentions the condition as part of a warning about reference ranges.
In the trainer I’ve marked these with Presence as True but Subject as Other.
After annotating roughly 10, then 30, then more than 50 examples in this manner, I see little if any change in the fine-tuned model’s predictions. Notably, the Subject meta annotation stays at the initial prediction of Patient instead of changing to Other.
Here is the exact text causing the problem:
"""A transferrin saturation (>45% female, >50% male) with a raised
ferritin
(>200 ug/L in females, >300 ug/L in males) suggests iron overload
(EASL 2010
HFE Hemochromatosis)."""
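To check how many detections come from this snippet, something like the following could be used. It is only a rough sketch: it assumes each entity is a dictionary with `start` and `end` character offsets into the document text (as in the output further down), and the helper names `boilerplate_pattern`, `boilerplate_spans` and `in_boilerplate` are hypothetical, not part of MedCAT. Whitespace is matched loosely because the line breaks in the snippet vary between documents.

```python
import re

# The boilerplate sentence as quoted in this post.
BOILERPLATE = """A transferrin saturation (>45% female, >50% male) with a raised
ferritin
(>200 ug/L in females, >300 ug/L in males) suggests iron overload
(EASL 2010
HFE Hemochromatosis)."""


def boilerplate_pattern(snippet):
    """Build a regex matching the snippet with any whitespace between words."""
    words = snippet.split()
    return re.compile(r"\s+".join(re.escape(word) for word in words))


def boilerplate_spans(text, pattern):
    """Return (start, end) character offsets of every occurrence of the snippet."""
    return [match.span() for match in pattern.finditer(text)]


def in_boilerplate(entity, spans):
    """True if the entity's character span lies entirely inside a boilerplate span."""
    return any(s <= entity["start"] and entity["end"] <= e for s, e in spans)


# Hypothetical document: one genuine mention followed by the boilerplate.
doc = "Known hemochromatosis.\n" + BOILERPLATE
spans = boilerplate_spans(doc, boilerplate_pattern(BOILERPLATE))
start = doc.index("Hemochromatosis")  # the capitalised mention inside the snippet
print(in_boilerplate({"start": start, "end": start + 15}, spans))
```

This only measures how much of the false-positive load comes from the snippet; it does not change what the Subject model predicts.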
The fine-tuned model, with >400 documents annotated and at least 50 explicit examples of this text snippet annotated, returns:
11: {'pretty_name': 'Hemochromatosis (disorder)',
     'cui': '399187006',
     'type_ids': ['T-11'],
     'types': ['disorder'],
     'source_value': 'Hemochromatosis',
     'detected_name': 'hemochromatosis',
     'acc': 0.8494843000173569,
     'context_similarity': 0.8494843000173569,
     'start': 170,
     'end': 185,
     'icd10': [],
     'ontologies': ['SNOMED'],
     'snomed': [],
     'id': 11,
     'meta_anns': {'Presence': {'value': 'True',
                                'confidence': 1.0,
                                'name': 'Presence'},
                   'Subject': {'value': 'Patient', 'confidence': 1.0, 'name': 'Subject'},
                   'Time': {'value': 'Recent',
                            'confidence': 0.9991687536239624,
                            'name': 'Time'}}},
And the untrained model:
11: {'pretty_name': 'Hemochromatosis (disorder)',
     'cui': '399187006',
     'type_ids': ['T-11'],
     'types': ['disorder'],
     'source_value': 'Hemochromatosis',
     'detected_name': 'hemochromatosis',
     'acc': 1.0,
     'context_similarity': 1.0,
     'start': 170,
     'end': 185,
     'icd10': [],
     'ontologies': ['SNOMED'],
     'snomed': [],
     'id': 11,
     'meta_anns': {'Presence': {'value': 'True',
                                'confidence': 1.0,
                                'name': 'Presence'},
                   'Subject': {'value': 'Patient', 'confidence': 1.0, 'name': 'Subject'},
                   'Time': {'value': 'Recent',
                            'confidence': 0.9991687536239624,
                            'name': 'Time'}}}},
'tokens': []}
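For reference, the filtering that these meta annotations feed into can be sketched roughly as follows. This is only an illustration: it assumes the full output is a dictionary with an `entities` mapping keyed by id, shaped like the excerpts above, and `is_target` and `filter_entities` are hypothetical helper names, not MedCAT functions.

```python
TARGET_CUI = "399187006"  # Hemochromatosis (disorder), as in the output above


def is_target(entity, cui=TARGET_CUI):
    """True if the entity is the target concept with Subject=Patient and Presence=True."""
    meta = entity.get("meta_anns", {})
    return (
        entity.get("cui") == cui
        and meta.get("Subject", {}).get("value") == "Patient"
        and meta.get("Presence", {}).get("value") == "True"
    )


def filter_entities(output):
    """Keep only the entities that pass is_target, preserving their ids."""
    entities = output.get("entities", {})
    return {ent_id: ent for ent_id, ent in entities.items() if is_target(ent)}


# Trimmed-down, hypothetical output with the same shape as the excerpts above.
sample_output = {
    "entities": {
        11: {
            "cui": "399187006",
            "meta_anns": {
                "Presence": {"value": "True"},
                "Subject": {"value": "Patient"},
            },
        }
    },
    "tokens": [],
}
print(sorted(filter_entities(sample_output)))
```

With this filter, the boilerplate mention is only dropped once the Subject meta annotation actually comes back as Other, which is why the unchanged Patient prediction matters.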
Any advice?
I don’t think I should mark them as incorrect in the trainer, since they highlight the precise word “Hemochromatosis”, which I am otherwise looking for.
I suppose I could also have marked them with the hypothetical annotation.
Is this a matter of not annotating enough examples of this precise snippet?
Thanks