Meta annotation basics

Hello, as I understand it, I can create meta-annotation models to, for example, indicate negation, as the mc_status model does. I have seen the notebook “Part 4.2 - Supervised Training and Meta-annotations.ipynb”, which essentially does the following:

import numpy as np
from tokenizers import ByteLevelBPETokenizer
from medcat.meta_cat import MetaCAT

tokenizer = ByteLevelBPETokenizer(vocab_file=DATA_DIR + "medmen-vocab.json", merges_file=DATA_DIR + "medmen-merges.txt")
embeddings = np.load(open(DATA_DIR + "embeddings.npy", 'rb'))
mc = MetaCAT(tokenizer=tokenizer, embeddings=embeddings, pad_id=len(embeddings) - 1, save_dir='mc_status', device='cpu')
mc.train(DATA_DIR + "MedCAT_Export.json", 'Status', nepochs=20)

which seems to create the mc_status model based on the provided tokenizer and embeddings. This is great, but I don’t really understand how it works, so I hope I can pose these questions:

  1. Can I somehow make my own meta-annotations with other models/other logic? If so, how?
  2. What data is being passed to the meta-annotation model during processing? Is it each identified token together with x tokens from the left and right side of that token?
  3. Why does a tokenizer need to be provided to MetaCAT? I thought the text had already been tokenized at the point in processing where meta-annotations are handled.
  4. Where in this mc_status model is its output defined? For example, where is it defined that the output can be “{‘Status’: ‘Confirmed’}”?

I hope these questions are not too simple/off-topic.

Thanks in advance

Hi @bkakke - welcome to the CogStack discourse community!

Answers to your questions:

  1. Yes - we have two alternative implementations so far, one using a Bi-LSTM and another via a Transformer, i.e. BERT. You can extend the API and re-use the base classes / configs etc. for your own implementations here. Each model implementation that actually does the heavy lifting is here.
  2. Yes, exactly - the concept that has been identified and extracted by the MedCAT NER+L pipe, together with the surrounding context.
  3. Each model could theoretically have its own tokenizer, but in practice you can use the same BBPE (or BERTTokenizer, i.e. WordPiece) tokenizer across the different tasks for which you’ll train different MetaCAT models. We use a BBPE or WordPiece tokenizer here because these are more effective tokenization methods for this classification scenario: with BBPE / WordPiece the vocab is not driven by the clinical terminology but is built directly from the corpus, so sub-word tokens can be used and word vectors learnt, alleviating the OOV problem seen with non-subword methods, e.g. Word2Vec. Sub-word tokenization doesn’t make sense for NER+L, i.e. the core MedCAT problem, as there we are ultimately aiming to link full tokens to somewhere in the configured terminology; we don’t care about learning a sub-word latent space that lets us perform some abstract downstream task such as classification, inference etc.
  4. MetaCAT models are configured via the medcat.config_meta_cat.ConfigMetaCAT class. This defines which labels are being predicted, and what each label maps to in human-readable form. Collecting these annotations (i.e. labelled data) is done via the MedCATtrainer interface.
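To make answer 2 a bit more concrete, here is a minimal, hypothetical sketch of how a context window around an identified concept could be assembled before being fed to a meta-annotation classifier. The function name, window sizes and padding scheme are illustrative assumptions, not MedCAT’s actual implementation:

```python
def build_context(token_ids, concept_start, concept_end,
                  cntx_left=15, cntx_right=10, pad_id=0):
    """Collect the concept's token ids plus a fixed window of
    surrounding tokens, right-padding when the document is short."""
    left = token_ids[max(0, concept_start - cntx_left):concept_start]
    concept = token_ids[concept_start:concept_end + 1]
    right = token_ids[concept_end + 1:concept_end + 1 + cntx_right]
    window = left + concept + right
    # Pad so every training example has the same length
    target_len = cntx_left + len(concept) + cntx_right
    window += [pad_id] * (target_len - len(window))
    return window

# Concept spans token positions 5-6 of a 20-token document
doc = list(range(20))
print(build_context(doc, 5, 6, cntx_left=3, cntx_right=3))
# -> [2, 3, 4, 5, 6, 7, 8, 9]
```

The ids here would come from the same BBPE / WordPiece tokenizer discussed in answer 3, which is also why MetaCAT needs a tokenizer handed to it.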

Hope that helps.



Wait, you can configure the Meta-annotation task to be custom with different custom Meta Tasks?! :exploding_head:

Where can I learn about all the features of MedCAT? It’s a lot more capable than I realised…

A good place to start is the paper: pre-print or journal pub.

Then there’s tutorials for the API / features of the toolkit: here

Broader tutorials that cover the internals of building a new MedCAT model from scratch are here.

There’s also the documentation available here for: MedCAT and MedCATtrainer

If you have any specific questions feel free to create a new post right here on discourse!
