MedCAT trained models issues

Hi everyone,
I’ve recently started using MedCAT. Following the tutorials, everything worked perfectly with the MedMentions model. However, I ran into issues when trying out the big and small UMLS models and the SNOMED model:

  1. The big model doesn’t seem to work at all for entity recognition. For example, the following sentence yields empty results:
    “We have studied the patient on the subject of diabetes”
  2. The SNOMED model doesn’t load at all, failing with the following error:
ValidationError                           Traceback (most recent call last)
Cell In [3], line 1
----> 1 cat = CAT.load_model_pack(model_pack_path)
      2 cat_orig = CAT.load_model_pack(model_pack_path)
...
ValidationError: 1 validation error for Config
linking -> filters -> cuis
  value is not a valid set (type=type_error.set)

Hi! Thank you for the interest and the questions.

  1. The UMLS big model doesn’t seem to have been trained on the diabetes concept. I don’t know why that would be (it is mentioned quite a lot in the MIMIC-III training data), but given the concept has received no training, the empty result is not unexpected: the model doesn’t know the contexts in which to expect it.
    • You can check how much training a CUI has received by checking cat.cdb.cui2count_train[cui], though note that you may want to check whether the CUI is in the dict first, since it won’t be there if it has received no training (see the sketch at the end of this post)
      • If you don’t know the CUI for a concept/name, you can find it from cat.cdb.name2cuis[name] - though do bear in mind that many names are ambiguous and may refer to multiple concepts
    • You can also use cat.cdb.name2count_train in a similar manner for the name
    • Here are 20 of the most-trained concepts for this model, along with their corresponding names and train counts:
      • C0392360 (rationale, indication, with~indication, indication~of~contextual~qualifier, indications, reasons, reason, justification) 1202508
        C4288581 (notable, noted ) 880647
        C2826258 (subject~continuance, cont ) 810844
        C0205397 (seen ) 669220
        C4745084 (medical~condition ) 500584
        C0587081 (laboratory~report, lab~findings, interpretation~laboratory~test, laboratory~findings, interpretation~laboratory~tests, laboratory~test~observations, laboratory~test~interpretation, laboratory~test~finding, interpretation~of~laboratory~tests, test~result, laboratory~test~observation, labs, interpretation~of~laboratory~test, laboratory~finding, lab~finding, lab~result, laboratory~test~findings, laboratory~test~result) 468121
        C0043084 (weanings, wean, weaning, ablactation, weaned) 398961
        C0184666 (admitting, admission, admits, hospital~admissions, admissions, hospital~admission, admit~to~hospital, admissions~hospital, admit, admission~to~hospital, admission~hospital, admitted~hospital, hospitalization~admission, admitted) 388589
        C1514756 (receiving, receive, received ) 386376
        C1533810 (placement, placed, placement~action, place) 379965
        C5553941 (aper, specimen~appearance~assessment, specimen~appearance, appear, appearance) 359490
        C1292718 (is~a, is~a~attribute ) 357654
        C2986914 (nonclinical~study~title, stitle ) 356354
        C0746591 (mitral ) 341402
        C0220825 (evaluation~procedure, evaluations, assessment, evaluate, evaluated, investigation, effectiveness~assessment, evaluation, efficacy~assessment) 334406
        C2081612 (explanation~of~plan~:~medication, medication~:, plan~:~medication~treatment, plan~:~medication) 324319
        C0699992 (lasix ) 317946
        C4698386 (intubated ) 307567
        C1707455 (compare, comparison, compared ) 304762
        C2317096 (spo2~saturation~of~peripheral~oxygen, peripheral~oxygen~saturation, spo2, saturation~of~peripheral~oxygen) 292541

    • As such, NER does work correctly on the following (nonsensical) sentence:
      • Hospital admissions have been going up due to lab funding going down
      • {'entities': {0: {'pretty_name': 'Hospital Environment', 'cui': 'C0019994', 'type_ids': ['T073', 'T093'], 'types': ['', ''], 'source_value': 'Hospital', 'detected_name': 'hospital', 'acc': 0.99, 'context_similarity': 0.99, 'start': 0, 'end': 8, 'icd10': [], 'ontologies': ['NCI', 'MEDLINEPLUS', 'SNOMEDCT_US', 'RCD', 'CHV', 'PSY', 'LCH', 'LNC', 'NCI_FDA', 'CSP', 'MTH', 'HL7V3.0', 'MSH', 'LCH_NW', 'NCI_CDISC', 'AOD', 'SNMI'], 'snomed': [], 'id': 0, 'meta_anns': {}}}, 'tokens': []}
  2. This is a known issue (e.g. MedCAT model for SNOMED-CT).
    • Older models initialised this config value as a dict where a set was expected
      • Newer versions catch this discrepancy during validation
    • The current fix (for medcat 1.8.0+) is to run:
      • python -m medcat.utils.versioning fix-config <model_pack_path> <new_model_pack_path>
    • We have patched this in the current development branch but have yet to release it (probably soon in 1.10.0).
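
For reference, here’s a minimal sketch of the checks described in point 1, plus how a list like the one above can be produced. The model pack path is a placeholder, and the attribute names assume a recent medcat 1.x:

from medcat.cat import CAT
import heapq

# Load a model pack (placeholder path)
cat = CAT.load_model_pack("<model_pack_path>")

# Keys in name2cuis use MedCAT's preprocessed form: lowercase,
# with tokens joined by '~' (e.g. 'medical~condition')
name = "diabetes"
for cui in cat.cdb.name2cuis.get(name, []):
    # cui2count_train only contains CUIs that received some training,
    # hence the .get() with a default of 0
    print(cui, cat.cdb.cui2count_train.get(cui, 0))

# The 20 most-trained concepts, with their names and train counts
for cui, count in heapq.nlargest(20, cat.cdb.cui2count_train.items(),
                                 key=lambda kv: kv[1]):
    print(cui, sorted(cat.cdb.cui2names.get(cui, set())), count)

# NER + linking on a sentence, as in the example above
print(cat.get_entities("Hospital admissions have been going up due to lab funding going down"))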

Thanks a lot for the detailed answer!
I still have a few questions

  1. I am a bit confused about the possibilities of fine-tuning trained models. In your tutorials you mention fine-tuning as in supervised training. Is it a viable approach to run unsupervised training again but on a dataset which contains the concepts needed for my use case?
  2. Did the models which are available for download go through only the unsupervised step, or through the whole pipeline you suggest in your paper (i.e. unsupervised → additional annotation, repeat)?
  3. For the small UMLS model, the following command returns an empty dictionary. I am a bit confused, since I thought it was also trained on a subset of UMLS. Is this expected behaviour?
cat.cdb.name2count_train

I’ll try to answer to the best of my ability.

  1. Unsupervised training on documents that contain more of the concepts you are interested in will definitely be beneficial (see the sketch at the end of this post). However, you are unlikely to reach the same level of performance without any supervised training. We have tools available that can help you create annotations for supervised training; I’d recommend looking at MedCATtrainer.
  2. As far as I know, these models have not gone through any supervised training. Since supervised training is generally done with hospital data, the resulting models generally cannot be shared publicly.
  3. It looks like name2count_train is indeed empty for this model; for some reason it was never populated. If you wish to look up a name, you’d need to first find its CUIs (cdb.name2cuis) and then check the training counts (cdb.cui2count_train), as in the sketch at the end of the first answer.
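
Since you asked about re-running unsupervised training, here’s a rough sketch of what that could look like. The document list and pack names are placeholders; cat.train is the self-supervised training entry point in medcat 1.x, though its exact signature may vary between versions:

from medcat.cat import CAT

cat = CAT.load_model_pack("<model_pack_path>")

# Placeholder: an iterable of raw documents that mention the
# concepts relevant to your use case
my_documents = [
    "We have studied the patient on the subject of diabetes",
    # ... more documents from your own corpus
]

# Self-supervised training over the documents; this updates the
# context vectors and train counts (cui2count_train) in the CDB
cat.train(my_documents)

# Save the result as a new model pack (placeholder name)
cat.create_model_pack("retrained_model_pack")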

Thanks again for the detailed answer, everything is clear now!
I have one last question, just to be sure: was the model trained on MedMentions (used in the tutorial) also trained only unsupervised, or were the labels used for supervised training?

Your intuition is correct. The MedMentions model was indeed trained in a supervised manner on the MedMentions dataset, on top of the self-supervised training it received (on MIMIC, I think - but don’t quote me on that).
And from what I hear, it worked quite well. But it would probably still benefit from further supervised training if it were to be used in specific situations.

Just to clarify: the statement above about public models not receiving any supervised training was meant to refer to training on datasets that are not publicly available. Though as far as I know, the public models referred to in the readme received no supervised training at all (not even on MedMentions).