MedCAT meta-annotation model performing poorly

I am using MedCAT to identify a concept in text, with the desired meta-annotations of Subject: Patient and Presence: True.
Running the basic model results in many false positives.
On further inspection, at least one major source of these is a generic sentence printed on blood test documents that mentions the condition alongside a warning about reference ranges.

In the trainer I've marked these as True for Presence of the concept, but the Subject as Other.
With roughly 10, then 30, then >50 examples marked in this manner, I see little if any adjustment in the fine-tuned model's predictions. Notably, the Subject meta-annotation does not change from the initial prediction of Patient to Other.

Here is the exact text causing the problem:

“”"A transferrin saturation (>45% female, >50% male) with a raised


(>200 ug/L in females, >300 ug/L in males) suggests iron overload

(EASL 2010

HFE Hemochromatosis)."""

The fine-tuned model, with >400 documents annotated and at least 50 explicit examples of this text snippet annotated, returns:
```
11: {'pretty_name': 'Hemochromatosis (disorder)',
 'cui': '399187006',
 'type_ids': ['T-11'],
 'types': ['disorder'],
 'source_value': 'Hemochromatosis',
 'detected_name': 'hemochromatosis',
 'acc': 0.8494843000173569,
 'context_similarity': 0.8494843000173569,
 'start': 170,
 'end': 185,
 'icd10': ,
 'ontologies': ['SNOMED'],
 'snomed': ,
 'id': 11,
 'meta_anns': {'Presence': {'value': 'True',
   'confidence': 1.0,
   'name': 'Presence'},
  'Subject': {'value': 'Patient', 'confidence': 1.0, 'name': 'Subject'},
  'Time': {'value': 'Recent',
   'confidence': 0.9991687536239624,
   'name': 'Time'}}},
```

And the untrained model:
```
11: {'pretty_name': 'Hemochromatosis (disorder)',
 'cui': '399187006',
 'type_ids': ['T-11'],
 'types': ['disorder'],
 'source_value': 'Hemochromatosis',
 'detected_name': 'hemochromatosis',
 'acc': 1.0,
 'context_similarity': 1.0,
 'start': 170,
 'end': 185,
 'icd10': ,
 'ontologies': ['SNOMED'],
 'snomed': ,
 'id': 11,
 'meta_anns': {'Presence': {'value': 'True',
   'confidence': 1.0,
   'name': 'Presence'},
  'Subject': {'value': 'Patient', 'confidence': 1.0, 'name': 'Subject'},
  'Time': {'value': 'Recent',
   'confidence': 0.9991687536239624,
   'name': 'Time'}}}},
 'tokens': }
```

Any advice?

I think that I should not mark them as incorrect in the trainer, as they highlight the precise word "Hemochromatosis" that I am otherwise looking for.
I suppose I could also have marked them with the Hypothetical annotation.

Is this a matter of not annotating enough examples of this precise snippet?


How have you trained the meta-annotation model? Can you paste in the function and parameters that you've used? Thanks!

Thanks for the reply,

I have set up a project with the standard meta-annotations, downloaded the annotations, and loaded the base KCH model.

I haven't adjusted any parameters anywhere, and I don't know what you're referring to by "function" for the meta-annotations.

**Update:** I have managed to slightly reduce the model's accuracy by labelling these errors (+30 more examples of it) as incorrect. The meta-annotation predictions still show Subject: Patient and Presence: True, even though each example is labelled Other and Hypothetical.

Right, so the training for concepts and meta-annotations is currently separate.

The training that you have run here is for the NER+L only, after MedCATtrainer labelling. This explains why your meta-annotation performance has not changed.

The meta-annotation models sit on top of this, which is why you initialise them separately in a CAT object:
`cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab, meta_cats=[list of meta_cat models])`
You may have missed this if you loaded the model directly through a model pack.

The training for meta-annotations can be found here. (Still a work in progress when I get around to it.) If you get further, feel free to make a PR :smiley:

I guess it probably makes sense to have a general function to train them both at the same time, though.
I'll have a look into this in the main MedCAT repo.

Partially solved with:

```python
import json
import os
from datetime import date

from medcat.cat import CAT
from medcat.vocab import Vocab
from medcat.cdb import CDB
from medcat.meta_cat import MetaCAT
#from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBase
from medcat.tokenizers.meta_cat_tokenizers import TokenizerWrapperBPE
from tokenizers import ByteLevelBPETokenizer
import pandas as pd
import numpy as np

DATA_DIR = "/data/AS/Samora/base_kch_model/medcat_model_pack_422d1d38fc58f158/"

embeddings_path = DATA_DIR + 'embeddings.npy'

meta_annotation_list = ['Subject/Experiencer', 'Presence', 'Time']

vocab_path = DATA_DIR + "vocab.dat"
cdb_path = DATA_DIR + "cdb.dat"

embeddings = np.load(open(embeddings_path, 'rb'))

# Remap to avoid pathing issues with '/'.
# In future: no odd characters, only underscores!
folder_name_remap_dict = {'Subject/Experiencer': 'Subject',
                          'Presence': 'Presence',
                          'Time': 'Time'}

for current_meta_annotation in meta_annotation_list:
    DATA_DIR_meta = f"{DATA_DIR}meta_{folder_name_remap_dict.get(current_meta_annotation)}/"

    # Read vocab and merges from the target meta-annotation directory
    tokenizer = ByteLevelBPETokenizer(vocab=DATA_DIR_meta + "bbpe-vocab.json",
                                      merges=DATA_DIR_meta + "bbpe-merges.txt")

    # Read the model-specific config
    config_path = DATA_DIR_meta + 'config.json'
    with open(config_path) as json_file:
        config_file = json.load(json_file)

    # Instantiate with the default, internally generated config
    mc = MetaCAT(tokenizer=TokenizerWrapperBPE(tokenizer), embeddings=embeddings)
    # Set category name to the current meta target
    mc.config.general['category_name'] = current_meta_annotation

    # Write into the new model pack, overwriting the previously exported (untrained) meta model.
    # `model_dir`, `output_modelpack`, `cat` and `mctrainer_export_path` are defined elsewhere.
    version_id_string = f"medcat_model_pack_{cat.config.version['id']}"
    output_meta_path = os.path.join(model_dir, output_modelpack, version_id_string,
                                    'meta_' + folder_name_remap_dict.get(current_meta_annotation))

    mc.train(mctrainer_export_path, save_dir_path=output_meta_path)
```
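The remap dict above exists only because the '/' in 'Subject/Experiencer' breaks filesystem paths (the base model pack happens to use the folder name 'Subject'). If you control the folder names yourself, a small plain-Python sanitiser (not a MedCAT API) avoids hand-maintaining that dict:

```python
def sanitize_category_name(name: str) -> str:
    """Make a meta-annotation category name safe to use as a folder name
    by replacing path separators and spaces with underscores."""
    for bad in ('/', '\\', ' '):
        name = name.replace(bad, '_')
    return name

# e.g. 'Subject/Experiencer' -> 'Subject_Experiencer'
```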