Meta annotation basics

Hello, as I understand it, I can create meta-annotation models to, for example, indicate negation, as the mc_status model does. I have seen the notebook “Part 4.2 - Supervised Training and Meta-annotations.ipynb”, which essentially does the following:

import numpy as np
from tokenizers import ByteLevelBPETokenizer
from medcat.meta_cat import MetaCAT

tokenizer = ByteLevelBPETokenizer(vocab_file=DATA_DIR + "medmen-vocab.json", merges_file=DATA_DIR + "medmen-merges.txt")
embeddings = np.load(open(DATA_DIR + "embeddings.npy", 'rb'))
mc = MetaCAT(tokenizer=tokenizer, embeddings=embeddings, pad_id=len(embeddings) - 1, save_dir='mc_status', device='cpu')
mc.train(DATA_DIR + "MedCAT_Export.json", 'Status', nepochs=20)
mc.save(full_save=True)

which seems to create the mc_status model from the provided tokenizer and embeddings. This is great, but I don’t really understand how it works, so I hope I can pose these questions:

  1. Can I somehow create my own meta-annotations with other models or other logic? If so, how?
  2. What data is passed to the meta-annotation model during processing? Is it each identified token together with x tokens to the left and right of it?
  3. Why does a tokenizer need to be provided to MetaCAT? I thought the text had already been tokenized by the point in the pipeline where meta-annotations are handled.
  4. Where is the output of this mc_status model defined? For example, where is it defined that the output can be “{‘Status’: ‘Confirmed’}”?

I hope these questions are not too simple or off-topic.

Thanks in advance

Hi @bkakke - welcome to the CogStack Discourse community!

Answers to your questions:

  1. Yes - we have two alternative implementations so far: one using a Bi-LSTM and another via a Transformer, i.e. BERT. You can extend the API and re-use the base classes / configs etc. for your own implementations here. Each model implementation that actually does the heavy lifting is here. (See the config sketch after this list for how a model variant is selected.)
  2. Yes, exactly - the concept that has been identified and extracted by the NER+L MedCAT pipe, plus the surrounding context. The size of that context window is configurable (again, see the sketch after this list).
  3. Each model could theoretically have its own tokenizer, but in practice you can use the same BBPE (or BertTokenizer, i.e. WordPiece) tokenizer across the different tasks for which you’ll train different MetaCAT models. We use a BBPE or WordPiece tokenizer here as these are more effective tokenization methods for this classification scenario: with BBPE / WordPiece the vocab is not driven by the clinical terminology but is built directly from the corpus, so sub-word tokens can be used and word vectors learnt, alleviating the OOV problem seen with non-sub-word methods such as Word2Vec. Sub-word tokenization doesn’t make sense for NER+L, i.e. the MedCAT problem, as there we are ultimately aiming to link full tokens to somewhere in the configured terminology - we don’t care about learning a sub-word latent space that lets us perform some abstract downstream task such as classification or inference. (The short tokenizer example after this list shows this in action.)
  4. MetaCAT models are configured via the medcat.config_meta_cat.ConfigMetaCAT class. This defines which labels are predicted and what each label maps to in human-readable form. Collecting these annotations (i.e. labelled data) is done via the MedCATtrainer interface.
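
To make points 1, 2 and 4 concrete, here is a minimal sketch of the relevant configuration. The attribute names are taken from a recent MedCAT release and may differ between versions, and the label mapping shown is just an illustration, so treat this as indicative rather than definitive:

from medcat.config_meta_cat import ConfigMetaCAT

config = ConfigMetaCAT()
config.general['category_name'] = 'Status'  # the meta task this model predicts
config.general['category_value2id'] = {'Other': 0, 'Confirmed': 1}  # human-readable labels -> class ids (illustrative)
config.general['cntx_left'] = 15   # tokens of context to the left of the concept
config.general['cntx_right'] = 10  # tokens of context to the right of the concept
config.model['model_name'] = 'lstm'  # or 'bert' for the Transformer variant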
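
And for point 3, a small example (reusing the vocab / merges files from your snippet above) of how a trained BBPE tokenizer splits a term it has never seen as a whole word into known sub-word pieces, which is why OOV is not a problem:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer(vocab_file=DATA_DIR + "medmen-vocab.json", merges_file=DATA_DIR + "medmen-merges.txt")
encoding = tokenizer.encode("pneumonoultramicroscopicsilicovolcanoconiosis")
print(encoding.tokens)  # a list of sub-word pieces rather than a single unknown token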

Hope that helps.

Thanks,
Tom


Wait, you can configure the meta-annotation task to use custom models and your own custom meta tasks?! :exploding_head:

Where can I learn about all the features of MedCAT? It’s a lot more capable than I realised…

A good place to start is the paper: pre-print or journal pub.

Then there are tutorials for the API / features of the toolkit: here

Broader tutorials that cover the internals of building a new MedCAT model from scratch are here

There’s also documentation available for both: MedCAT and MedCATtrainer

If you have any specific questions, feel free to create a new post right here on Discourse!
