Issue with MedCAT UMLS full model

I’m facing a problem using the full model I got from UMLS.
from medcat.cat import CAT
cat = CAT.load_model_pack('model.zip')
text = "Kidney disfunction"
entities = cat.get_entities(text)
print(entities)
The model loading step gets stuck, and after some time the terminal shows “killed”.

The medmen_wstatus_2021_oct.zip model works perfectly.

You’re most likely running out of memory (RAM).
The full UMLS model is an extremely large one.

When I last loaded up the model, it took a whopping 31GB of memory.

EDIT: I just looked at it, and it seems it was “only” 20.8GB for the CAT instance model pack itself. This is the memray flamegraph from January of 2023:
https://mart-r.github.io/UMLSMemory/memray-flamegraph-umls.html
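
A quick check before loading can make the cause of a silent “killed” easier to spot: compare the machine’s physical memory with what the model pack needs. The sketch below is only an illustration. The helper names are made up here, the 20.8 figure is the measurement quoted above, and `os.sysconf` with these keys is assumed to be available (it is on Linux, but not on Windows).

```python
import os

BYTES_PER_GB = 1024 ** 3  # treating 1 GB as 2**30 bytes


def total_physical_memory_bytes():
    """Total physical RAM in bytes (Unix only; relies on os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def has_enough_memory(required_gb, total_bytes):
    """Return True if total_bytes is at least required_gb gigabytes."""
    return total_bytes >= required_gb * BYTES_PER_GB


# Possible usage before calling CAT.load_model_pack(...):
#   if not has_enough_memory(20.8, total_physical_memory_bytes()):
#       print("Not enough RAM for the full UMLS model; loading will likely be killed.")
```

This only compares against total RAM; memory already used by other processes is not taken into account.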

Oh okay, I only have 8 GB of memory.

Increasing my RAM worked. I was trying to use the full model to get the ICD10 codes, but after running it the ICD10 codes array is empty for every entity. I was expecting something similar to the demo page, where ICD10 codes are provided for each entity. Is there an extra step needed for this?

Unfortunately the ICD10 mappings weren’t created and included in the full UMLS model.

If you want these mappings, you can try and find them yourself from the raw UMLS download.
The UMLS preprocessing module might be of help:

Is there any way to map the entities that I get from the model to their corresponding ICD10 codes?
I can’t understand the preprocessing code above. I am new to Python, so a step-by-step procedure would be very helpful.

If you download the official UMLS release files and then use the class in the module I’ve provided, you should be able to do the following:

from medcat.utils.preprocess_umls import UMLS
path_to_mrconso = '/path/to/MRCONSO.RRF'  # CHANGE THIS TO REFER TO MRCONSO.RRF
path_to_mrsty = 'path/to/MRSTY.RRF'  # CHANGE THIS TO REFER TO MRSTY.RRF
umls = UMLS(path_to_mrconso, path_to_mrsty)
# get the mappings
umls2icd10 = umls.map_umls2icd10()
# now you should have a pandas DataFrame that should have the UMLS concept IDs (`CUI` column) as well as the ICD10 IDs (`CODE` column).
from medcat.cat import CAT
from medcat.utils.preprocess_umls import UMLS

cat = CAT.load_model_pack('Models/medmen_wstatus_2021_oct.zip')
text = "lower back pain, upper abdominal pain, headache"
entities = cat.get_entities(text, only_cui=False, addl_info=['cui2icd10', 'cui2ontologies', 'cui2snomed'])

path_to_mrconso = '/path/to/MRCONSO.RRF'  # CHANGE THIS TO REFER TO MRCONSO.RRF
path_to_mrsty = 'path/to/MRSTY.RRF'. # CHANGE THIS TO REFER TO MRSTY.RRF
umls = UMLS(path_to_mrconso, path_to_mrsty)
# get the mappings
umls2icd10 = umls.map_umls2icd10(cui)  # pass the CUI of each entity to this?

Would this work?

You don’t need the MedCAT model for this at all.

And you’d need to change the two lines for MRCONSO.RRF and MRSTY.RRF to point to the files on your disk that you’ve downloaded. (plus remove the . after the string).

I’m not getting the big picture here. My requirement is to get ICD10 codes for a given clinical text using the full UMLS model. The model returns the entities, but their ICD10 codes are empty. How can I get the ICD10 codes? How should I structure my code, i.e. where does the code you showed fit in with the code that gets the entities from the model? What do you mean when you say the model is not needed here? Isn’t the model what annotates the text and returns the entities? Is the code you showed part of training the model? If so, do I need to train a model to meet my requirement?

As I said before, the full UMLS model does not have the ICD10 codes embedded in it.
There is no way to extract something from the model that it does not have saved with it.

So in order to get the mappings from UMLS to ICD10, you would need to use the raw UMLS files that are used during preprocessing. In the way that I described above.

So it sounds like what you would need to do is:

  1. Get the UMLS → ICD10 mappings (this part does not require a model)
    • Download UMLS (I think 2022AA was used for the full model)
    • Use preprocessing to extract the relevant UMLS → ICD10 mappings
    • Create a direct dict mapping from the CUI column to the CODE column in the pandas.DataFrame
    • Save the mappings dict to disk
  2. Add the UMLS → ICD10 mappings to the model
    • Load up the model pack
    • Set mappings at cat.cdb.addl_info['cui2icd10']
    • Save model
  3. Use the newly saved model
    • It now has the UMLS → ICD10 mappings
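
The last bullets of step 1 and all of step 2 could look roughly like the sketch below. The MedCAT calls are the ones already used in this thread; the helper names, the choice of JSON as the on-disk format, and all paths are assumptions made for illustration. MedCAT is imported inside the function so that the two mapping helpers can be used without it.

```python
import json


def save_mapping(cui2icd10, path):
    """Write the CUI -> ICD10 mapping dict to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cui2icd10, f)


def load_mapping(path):
    """Read the CUI -> ICD10 mapping dict back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def add_icd10_to_model_pack(model_pack_path, mapping_path, save_folder):
    """Load a model pack, attach the mapping dict, and save a new model pack."""
    # Imported here so the helpers above work without MedCAT installed.
    from medcat.cat import CAT

    cat = CAT.load_model_pack(model_pack_path)
    # The value stored here is a plain dict keyed by CUI.
    cat.cdb.addl_info['cui2icd10'] = load_mapping(mapping_path)
    return cat.create_model_pack(save_folder)
```

Keeping the mapping in its own file means step 1 only has to be run once, and step 2 can be repeated for any model pack that uses the same CUIs.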

Thank you for your response. It seems clearer now.

from medcat.cat import CAT
from medcat.utils.preprocess_umls import UMLS

cat = CAT.load_model_pack("Models/medmen_wstatus_2021_oct")

path_to_mrconso = 'Models/MRCONSO.RRF'
path_to_mrsty = 'Models/MRSTY.RRF'
umls = UMLS(path_to_mrconso, path_to_mrsty)
umls2icd10 = umls.map_umls2icd10()
cat.cdb.addl_info['cui2icd10']= umls2icd10
save_folder = 'Models'
cat.create_model_pack(save_folder)

Based on my understanding, I tried the above code and it saved a new model in the specified folder. I’m still not getting the ICD10 codes with this newly created model. Do you see anything wrong in the above code? I tried this with the 2019 UMLS full release files.

The code sets the pandas.DataFrame in the addl_info. But the library expects a dict that maps each CUI to its ICD10 codes.

I explicitly mentioned this above as well:

How do I do that? Could you please help?

df = umls.map_umls2icd10()
umls2icd10 = dict(zip(df['CUI'], df['CODE']))
cat.cdb.addl_info['cui2icd10'] = umls2icd10

Does this work?

As far as I can tell, that should work.

It worked. But it seems there are only 13k concepts with ICD10 codes in the 2019 MRCONSO.

First of all, the full UMLS model was created with the 2022AA release, as I mentioned above. So the 2019 version may differ significantly.

As for why there would be “only 13k concepts”:
Is that based on the actual MRCONSO.RRF file?
If that’s the case, then there’s nothing I can do about it.
However, if you’re referring to the number of concepts that map to ICD10 in UMLS, the issue might be in the way you’ve created your mapping. What you’ve done allows any UMLS CUI to map to only one ICD10 code. There may be some CUIs that should map to more than one ICD10 code.
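
The difference can be seen on a toy example. In the sketch below the CUIs and codes are made up, and `group_codes_by_cui` is a hypothetical helper, not part of MedCAT. `dict(zip(...))` keeps only the last code seen for a repeated CUI, while grouping keeps all of them. Whether MedCAT expects a list of codes per CUI in `addl_info['cui2icd10']` is an assumption worth checking against the library version in use.

```python
from collections import defaultdict


def group_codes_by_cui(cuis, codes):
    """Map each CUI to the list of all distinct codes paired with it."""
    grouped = defaultdict(list)
    for cui, code in zip(cuis, codes):
        if code not in grouped[cui]:  # skip repeated (CUI, code) rows
            grouped[cui].append(code)
    return dict(grouped)


# Made-up rows standing in for the CUI and CODE columns of the DataFrame.
cuis = ["C0000001", "C0000001", "C0000002"]
codes = ["CODE_A", "CODE_B", "CODE_C"]

one_to_one = dict(zip(cuis, codes))            # later rows overwrite earlier ones
one_to_many = group_codes_by_cui(cuis, codes)  # every code is kept

print(one_to_one)
print(one_to_many)
```

With the real DataFrame the call would be `group_codes_by_cui(df['CUI'], df['CODE'])`.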

I think that is probably correct.

I am working with the 2023AA release right now and there are also around 13k UMLS CUIs associated with ICD10 codes:

And there are indeed some CUIs that map to multiple ICD10 codes: after reducing the dataframe to the unique rows of CUI and CODE, dropping duplicate CUIs leaves 11552 rows.
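
The same comparison can be sketched without pandas. The function name and the sample values below are made up for illustration; with a DataFrame the equivalent would be `df[['CUI', 'CODE']].drop_duplicates()` followed by `.drop_duplicates(subset='CUI')`, comparing the row counts after each call.

```python
def count_pairs_and_cuis(cuis, codes):
    """Return (number of unique (CUI, code) pairs, number of unique CUIs)."""
    unique_pairs = set(zip(cuis, codes))
    unique_cuis = {cui for cui, _ in unique_pairs}
    return len(unique_pairs), len(unique_cuis)


# Made-up rows standing in for the CUI and CODE columns.
cuis = ["C0000001", "C0000001", "C0000001", "C0000002"]
codes = ["CODE_A", "CODE_A", "CODE_B", "CODE_C"]

n_pairs, n_cuis = count_pairs_and_cuis(cuis, codes)
# More unique pairs than unique CUIs means at least one CUI has several codes.
has_multi_code_cuis = n_pairs > n_cuis
```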

Alternatively, you can see how many concepts with ICD codes are in the current UMLS version in total in the statistics overview:
https://www.nlm.nih.gov/research/umls/sourcereleasedocs/current/ICD10/stats.html

So the current UMLS version has 11560 concepts with ICD codes (see "Source Overlap" at the bottom of that page).

Okay. So it doesn’t make a difference whether I use the 2023 UMLS files or the 2019 files.