New Synonyms and train count not updating + general query

Summary: MedCAT version 2.5.3. I added a synonym ("aki") for the acute kidney injury CUI in MedCATtrainer, trained the model, and saved a new version. "aki" was added as a synonym, but it has a train count of 0 (even though it appears in the annotation export twice).

Secondly, a query: is there a generally recommended approach to training, particularly unsupervised training? I have a set of 14.5k historic pre-op documents from which we hope to extract diseases/disorders, including context. Is it recommended that I run unsupervised training on that set of documents as a kind of first step? In tandem with that, we have a list of acronyms that we use. We check those against the CDB and, if any are missing, we will inject them before beginning annotations.
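The acronym check described above could be sketched roughly as follows. This is a minimal illustration only: `known_names` is a hypothetical stand-in for the names already in the CDB, `acronyms` is a hypothetical in-house list, and `normalise` is an assumed simplification (the CDB stores names in its own preprocessed form, which may differ).

```python
# Hypothetical stand-in for the names the CDB already knows; in practice
# these would come from the loaded model, not be typed by hand.
known_names = {"acute kidney injury", "myocardial infarction", "mi"}

# Hypothetical in-house acronym list: acronym -> CUI it should map to.
acronyms = {"AKI": "14669001", "MI": "57054005"}


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so lookups are comparable.

    Assumption: this is a simplification, not MedCAT's own preprocessing.
    """
    return " ".join(name.lower().split())


def missing_acronyms(acronyms: dict, known: set) -> dict:
    """Return the acronyms whose normalised form is not in `known`."""
    return {a: cui for a, cui in acronyms.items() if normalise(a) not in known}


to_inject = missing_acronyms(acronyms, known_names)
print(to_inject)
```

Anything left in `to_inject` is a candidate to add as a synonym before annotation starts.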

What do you mean when you say “trained model”? Did you download the trainer export off trainer and train the model somewhere else? Or did you specify “Train model on submit” in the project config, then annotate, then save the subsequent model off trainer, and then download the model?

Unsupervised training is generally intended to be used before the supervised training / fine-tuning. Its general purpose is to find name/concept pairs that can be identified automatically (these are ones where a) the name only refers to one concept, or b) the specific concept is marked as primary for the specific name) and learn what contexts these appear in.

So whether or not you’ll find it useful to run the model over your data in an unsupervised manner depends on what the model already knows and what additional information it could obtain from the dataset. If you’ve got a model that doesn’t have existing training for a bunch of concepts you know (or expect) to exist in the dataset, then this may be worthwhile. But if the model already has good training data for the concepts at hand, this is unlikely to be super helpful.

The thing to note here is that the self-supervised training step isn’t magic. It’s limited to only training on names mapping to 1 CUI or in cases where the name is a primary name for a specific CUI.
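To make that rule concrete, here is a toy sketch of the eligibility check. It is not MedCAT's implementation: the `name_to_cuis` mapping, the boolean "primary" flags, the placeholder CUIs, and the function name are all hypothetical and only mirror the two conditions described above.

```python
# Toy mapping of name -> {CUI: is_primary}. The flags and the placeholder
# CUIs (OTHER_CUI, CUI_A, CUI_B) are hypothetical and do not reflect
# MedCAT's internal status encoding.
name_to_cuis = {
    "jaguar": {"14398006": False},                  # maps to one CUI only
    "mi": {"57054005": False, "OTHER_CUI": False},  # ambiguous, no primary
    "cold": {"CUI_A": True, "CUI_B": False},        # ambiguous, one primary
}


def auto_trainable_cui(name, mapping):
    """Return the CUI a name would be trained as, or None if ambiguous."""
    cuis = mapping.get(name, {})
    if len(cuis) == 1:
        # Condition a): the name refers to exactly one concept.
        return next(iter(cuis))
    primaries = [cui for cui, is_primary in cuis.items() if is_primary]
    if len(primaries) == 1:
        # Condition b): one concept is marked as primary for this name.
        return primaries[0]
    return None


for name in name_to_cuis:
    print(name, "->", auto_trainable_cui(name, name_to_cuis))
```

Names for which this returns `None` would contribute nothing during the self-supervised step.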

In general, when you’re trying to align the model to your specific use cases / datasets, we’ve found supervised training to be more useful. You’ve got a lot more control over what the model is learning in this situation. In the self-supervised case the model can (and often does) learn things that aren’t all that useful. For an illustrative example, if the name jaguar only maps to one CUI (14398006 | Panthera onca) then all mentions of Jaguars (the cars) will be trained as this concept. While this specific example might not be super concerning, the same could be true for a number of other names / concepts in the CDB. In fact, my quick look suggests that the vast majority of the approximately 3M names (3 019 360 / 3 080 845 in the model I’ve got) in the CDB only map to 1 concept (though notably a lot of these include drug dosages or are otherwise obscure, so you may never encounter them).
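For a sense of scale, the figures quoted above work out as follows. The counting helper and its toy mapping are hypothetical (placeholder CUIs included); only the two totals come from the post.

```python
def count_single_cui_names(name_to_cuis):
    """Count the names that map to exactly one CUI in a name -> CUIs mapping."""
    return sum(1 for cuis in name_to_cuis.values() if len(cuis) == 1)


# Toy mapping to show the counting; CUI_X and CUI_Y are placeholders.
toy = {"jaguar": {"14398006"}, "aki": {"14669001"}, "mi": {"CUI_X", "CUI_Y"}}
print(count_single_cui_names(toy), "of", len(toy))

# Figures quoted in the post: names mapping to one concept / all names.
single, total = 3_019_360, 3_080_845
share = single / total
print(f"{share:.1%} of names map to a single concept")
```

So roughly 98% of the names in that model would be eligible for self-supervised training on any text where they happen to appear.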

Very appreciative of the rapid response! I am testing out MedCAT as an option for our hospital, so it’s not quite in production yet. I basically had a clinical colleague annotate some pre-op assessment documents. When she completed that, we generated the export of the annotations. I used that export to train a base model I have on my Windows PC; I trained via Python, and saved it as a new model version. I can now compare the CUI for AKI in the base model versus the trained one.

I note the CUI state before training. Train via Python. Save new model.
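A before/after comparison like this can be reduced to diffing two snapshots of counts. The sketch below assumes the counts have already been read out of each model into plain dicts. How they are read is not shown, and all values are illustrative, not taken from the real models.

```python
def changed_counts(before, after):
    """Return {key: (old, new)} for every key whose count differs."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key, 0), after.get(key, 0)
        if old != new:
            changes[key] = (old, new)
    return changes


def new_but_untrained(before, after):
    """Names present only in `after` that still have a count of 0."""
    return sorted(k for k, v in after.items() if k not in before and v == 0)


# Illustrative snapshots only (hypothetical values).
cui_before = {"14669001": 10}
cui_after = {"14669001": 12}
name_before = {"acute kidney injury": 10}
name_after = {"acute kidney injury": 10, "aki": 0}

print(changed_counts(cui_before, cui_after))
print(changed_counts(name_before, name_after))
print(new_but_untrained(name_before, name_after))
```

Checking per-CUI and per-name counts separately makes it easier to tell whether training had no effect at all or whether only one of the counters failed to update.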

Note: save model with train is failing as per previous incident raised to you (not by me).

Re: unsupervised training: I accept what you’re saying… I recognise certain patterns may get affirmed that are not beneficial to us. I will not pursue it for the time being, and thank you for the advice.

Did the count_train for the CUI in question go up?

I suspect that if neither went up, something went wrong during the training process and was logged rather than raised (so that the training can continue), and you may not have seen the warning. Perhaps adding some debug logging could help, e.g.:

from medcat.trainer import logger as train_logger
import logging
train_logger.addHandler(logging.StreamHandler())
train_logger.setLevel(logging.DEBUG)

I am revisiting this issue, and will respond to you with more detail.

Debug logging enabled as suggested. Output attached. The log shows all 347 annotations were processed across 4 documents with no warnings or errors. No additional DEBUG-level output was produced; on reviewing the trainer source, the supervised training path in 2.5.3 contains no logger.debug calls beyond _train_meta_cat. Please let us know if you need anything further to diagnose the name-level count issue.
Here is a portion of the output log:
Running without a test set, or train==test
Annotation right-sided chest pain (285386001) [400:422]
Annotation Past medical history (392521001) [53:73]
Annotation epigastric hernia (289260002) [26:43]


Annotation RV (276756009) [1417:1419]
Annotation AKI (14669001) [793:796]
Annotation AKI (14669001) [690:693]
Annotation diarrhoeal (128333008) [712:722]


Annotation MI (57054005) [1348:1350]
Annotation alcohol excess (15167005) [231:245]
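To confirm from a log like this how often each name/CUI pair was seen, the annotation lines can be tallied with a small script. This is a sketch that assumes the line format shown in the excerpt above; the regular expression and variable names are illustrative.

```python
import re
from collections import Counter

# The excerpt from the log above, pasted as-is.
LOG = """\
Running without a test set, or train==test
Annotation right-sided chest pain (285386001) [400:422]
Annotation Past medical history (392521001) [53:73]
Annotation epigastric hernia (289260002) [26:43]
Annotation RV (276756009) [1417:1419]
Annotation AKI (14669001) [793:796]
Annotation AKI (14669001) [690:693]
Annotation diarrhoeal (128333008) [712:722]
Annotation MI (57054005) [1348:1350]
Annotation alcohol excess (15167005) [231:245]
"""

# Assumed line shape: "Annotation <name> (<cui>) [<start>:<end>]".
PATTERN = re.compile(r"^Annotation (?P<name>.+) \((?P<cui>\d+)\) \[\d+:\d+\]$")

counts = Counter()
for line in LOG.splitlines():
    match = PATTERN.match(line.strip())
    if match:
        # Lower-case the name so "AKI" and "aki" are tallied together.
        counts[(match["name"].lower(), match["cui"])] += 1

for (name, cui), n in counts.most_common():
    print(f"{n} x {name} ({cui})")
```

On this excerpt, ("aki", "14669001") is tallied twice, which matches the two occurrences reported in the annotation export.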

I’ve taken a quick look and I think there’s an issue where the per-name train count isn’t being updated in some cases. I’ll look into it in a bit more detail and let you know if/when I’ve got a fix implemented.

PS:
You should still be able to track per CUI train counts for the time being.

I’ve identified that the issue is already fixed in MedCAT v2.7.0.

But I’ve added a few tests for the future on this anyway.

You can see a few more details on what fixed it in the PR: