MedCAT trained models issues

Hi everyone,
I’ve recently started using MedCAT. Following the tutorials, everything worked perfectly with the MedMentions model. However, I ran into issues when trying out the big and small UMLS models and the SNOMED model:

  1. The big model doesn’t seem to work at all for entity recognition. For example, the following sentence yields empty results:
    “We have studied the patient on the subject of diabetes”
  2. The SNOMED model doesn’t load at all, failing with the following error:
ValidationError                           Traceback (most recent call last)
Cell In [3], line 1
----> 1 cat = CAT.load_model_pack(model_pack_path)
      2 cat_orig = CAT.load_model_pack(model_pack_path)
...
ValidationError: 1 validation error for Config
linking -> filters -> cuis
  value is not a valid set (type=type_error.set)

Hi! Thank you for the interest and the questions.

  1. The UMLS big model doesn’t seem to have been trained on the diabetes concept. I don’t know why that would be (it is mentioned quite a lot in the MIMIC-III training data), but given the concept has received no training, the empty result is not unexpected: the model doesn’t know the contexts in which to expect it.
    • You can check how much training a CUI has received by checking cat.cdb.cui2count_train[cui], though note that you may want to check whether the CUI is in the dict first, since it won’t be there if it has received no training (see the sketch at the end of this post)
      • If you don’t know the CUI for a concept/name, you can find it from cat.cdb.name2cuis[name] - though do bear in mind that many names are ambiguous and may refer to multiple concepts
    • You can also use cat.cdb.name2count_train in a similar manner for the name
    • Here are 20 of the most-trained concepts for this model, along with their corresponding names and train counts:
      • C0392360 (rationale, indication, with~indication, indication~of~contextual~qualifier, indications, reasons, reason, justification) 1202508
        C4288581 (notable, noted ) 880647
        C2826258 (subject~continuance, cont ) 810844
        C0205397 (seen ) 669220
        C4745084 (medical~condition ) 500584
        C0587081 (laboratory~report, lab~findings, interpretation~laboratory~test, laboratory~findings, interpretation~laboratory~tests, laboratory~test~observations, laboratory~test~interpretation, laboratory~test~finding, interpretation~of~laboratory~tests, test~result, laboratory~test~observation, labs, interpretation~of~laboratory~test, laboratory~finding, lab~finding, lab~result, laboratory~test~findings, laboratory~test~result) 468121
        C0043084 (weanings, wean, weaning, ablactation, weaned) 398961
        C0184666 (admitting, admission, admits, hospital~admissions, admissions, hospital~admission, admit~to~hospital, admissions~hospital, admit, admission~to~hospital, admission~hospital, admitted~hospital, hospitalization~admission, admitted) 388589
        C1514756 (receiving, receive, received ) 386376
        C1533810 (placement, placed, placement~action, place) 379965
        C5553941 (aper, specimen~appearance~assessment, specimen~appearance, appear, appearance) 359490
        C1292718 (is~a, is~a~attribute ) 357654
        C2986914 (nonclinical~study~title, stitle ) 356354
        C0746591 (mitral ) 341402
        C0220825 (evaluation~procedure, evaluations, assessment, evaluate, evaluated, investigation, effectiveness~assessment, evaluation, efficacy~assessment) 334406
        C2081612 (explanation~of~plan~:~medication, medication~:, plan~:~medication~treatment, plan~:~medication) 324319
        C0699992 (lasix ) 317946
        C4698386 (intubated ) 307567
        C1707455 (compare, comparison, compared ) 304762
        C2317096 (spo2~saturation~of~peripheral~oxygen, peripheral~oxygen~saturation, spo2, saturation~of~peripheral~oxygen) 292541

    • As such, NER does work correctly on the following (nonsensical) sentence:
      • Hospital admissions have been going up due to lab funding going down
      • {'entities': {0: {'pretty_name': 'Hospital Environment', 'cui': 'C0019994', 'type_ids': ['T073', 'T093'], 'types': ['', ''], 'source_value': 'Hospital', 'detected_name': 'hospital', 'acc': 0.99, 'context_similarity': 0.99, 'start': 0, 'end': 8, 'icd10': [], 'ontologies': ['NCI', 'MEDLINEPLUS', 'SNOMEDCT_US', 'RCD', 'CHV', 'PSY', 'LCH', 'LNC', 'NCI_FDA', 'CSP', 'MTH', 'HL7V3.0', 'MSH', 'LCH_NW', 'NCI_CDISC', 'AOD', 'SNMI'], 'snomed': [], 'id': 0, 'meta_anns': {}}}, 'tokens': []}
  2. This is a known issue (e.g. MedCAT model for SNOMED-CT).
    • Older models initialised this config value as a dict where a set was expected
      • Newer versions catch this discrepancy during validation
    • The current fix (for medcat 1.8.0+) is to run:
      • python -m medcat.utils.versioning fix-config <model_pack_path> <new_model_pack_path>
    • We have patched this in the current development branch but have yet to release it (probably soon in 1.10.0).
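
For reference, here’s a minimal sketch of the checks described in point 1, plus how a list like the one above can be produced. The model pack path is a placeholder, and the attribute names assume a recent medcat 1.x:

from medcat.cat import CAT
import heapq

# Load a model pack (placeholder path)
cat = CAT.load_model_pack("<model_pack_path>")

# Keys in name2cuis use MedCAT's preprocessed form: lowercase,
# with tokens joined by '~' (e.g. 'medical~condition')
name = "diabetes"
for cui in cat.cdb.name2cuis.get(name, []):
    # cui2count_train only contains CUIs that received some training,
    # hence the .get() with a default of 0
    print(cui, cat.cdb.cui2count_train.get(cui, 0))

# The 20 most-trained concepts, with their names and train counts
for cui, count in heapq.nlargest(20, cat.cdb.cui2count_train.items(),
                                 key=lambda kv: kv[1]):
    print(cui, sorted(cat.cdb.cui2names.get(cui, set())), count)

# NER + linking on a sentence, as in the example above
print(cat.get_entities("Hospital admissions have been going up due to lab funding going down"))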

Thanks a lot for the detailed answer!
I still have a few questions

  1. I am a bit confused about the possibilities of fine-tuning trained models. In your tutorials you mention fine-tuning as in supervised training. Is it a viable approach to run unsupervised training again but on a dataset which contains the concepts needed for my use case?
  2. Did the models which are available for download go through only the unsupervised step, or through the whole pipeline you suggest in your paper (i.e. unsupervised → additional annotation, repeat)?
  3. For the small UMLS model, the following command returns an empty dictionary. I am a bit confused, since I thought it was also trained on a subset of UMLS. Is this expected behaviour?
cat.cdb.name2count_train

I’ll try to answer to the best of my ability.

  1. Unsupervised training on documents that contain more of the concepts you are interested in will definitely be beneficial (see the sketch at the end of this post). However, you are unlikely to reach the same level of performance without any supervised training. We have tools available that can help you create annotations for supervised training; I’d recommend looking at MedCATtrainer.
  2. As far as I know, these models have not gone through any supervised training. Since supervised training is generally done with hospital data, the resulting models generally cannot be shared publicly.
  3. It looks like name2count_train is indeed empty for this model; for some reason it was never populated. If you wish to look up a name, you’d need to first find its CUIs (cdb.name2cuis) and then check the training counts (cdb.cui2count_train), as in the sketch at the end of the first answer.
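
Since you asked about re-running unsupervised training, here’s a rough sketch of what that could look like. The document list and pack names are placeholders; cat.train is the self-supervised training entry point in medcat 1.x, though its exact signature may vary between versions:

from medcat.cat import CAT

cat = CAT.load_model_pack("<model_pack_path>")

# Placeholder: an iterable of raw documents that mention the
# concepts relevant to your use case
my_documents = [
    "We have studied the patient on the subject of diabetes",
    # ... more documents from your own corpus
]

# Self-supervised training over the documents; this updates the
# context vectors and train counts (cui2count_train) in the CDB
cat.train(my_documents)

# Save the result as a new model pack (placeholder name)
cat.create_model_pack("retrained_model_pack")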

Thanks again for the detailed answer, everything is clear now!
I have one last question, just to be sure: was the model trained on MedMentions (used in the tutorial) also trained only unsupervised, or were the labels used for supervised training?

Your intuition is correct. The MedMentions model was indeed trained in a supervised manner on the MedMentions dataset, on top of the self-supervised training it received (on MIMIC, I think - but don’t quote me on that).
And from what I hear, it worked quite well. But it would probably still benefit from further supervised training if it were to be used in specific situations.

Just to clarify: the statement above about public models not receiving any supervised training was meant to refer to training on datasets that are not publicly available. Though as far as I know, the public models referred to in the readme received no supervised training at all (not even on MedMentions).