Trouble creating ICD10 code mappings for NER

Hi noob here :wink:

I am trying to do some entity recognition, but I am confused as to how to actually report them in the results.
For instance:

In [1]: from medcat.cat import CAT
In [2]: cat = CAT.load_model_pack('./medmen_wstatus_2021_oct.zip')
In [3]: cat.get_entities("epilepsy", only_cui=False)
Out[3]: 
{'entities': {0: {'pretty_name': 'Epilepsy',
   'cui': 'C0014544',
   'type_ids': ['T047'],
   'types': ['Disease or Syndrome'],
   'source_value': 'epilepsy',
   'detected_name': 'epilepsy',
   'acc': 0.6693785786848553,
   'context_similarity': 0.6693785786848553,
   'start': 0,
   'end': 8,
   'icd10': [],
   'ontologies': [],
   'snomed': [],
   'id': 0,
   'meta_anns': {'Status': {'value': 'Affirmed',
     'confidence': 0.7051243185997009,
     'name': 'Status'}}}},
 'tokens': []}

and the icd10 field is empty.
Then I followed the tutorial at https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/specialised/Preprocessing_SNOMED_CT.html and
I am able to get to the point where we do

In [91] cdb.addl_info['cui2icd10'] = sctid2icd10
In [92] cdb.save("SNOMED_cdb.dat")

But I am not able to create a model_pack from this cdb.

Could you please help me out?

Thanks

Andrea

I am not quite sure what’s stopping you from creating a model pack.

But for reference, you create a model pack using the CAT instance, not the CDB instance. The model pack includes the CDB and the Vocab. When saved to disk, it also includes the spaCy model used, along with any additional NER models or MetaCAT models if applicable.

Here’s how you do it, in a nutshell:

from medcat.cat import CAT
cat = CAT.load_model_pack('./medmen_wstatus_2021_oct.zip')
cdb = cat.cdb # reference to the CDB involved in the model pack
# relevant changes to the CDB - in your case, adding the cui2icd10 mappings
save_folder = '' # path to where you want to save the new model pack
cat.create_model_pack(save_folder)

Note that the CAT instance keeps track of the CDB instance so changes you’ve made to the CDB will be saved on disk as part of the model pack.
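To make the "relevant changes" step a bit more concrete, here is a minimal sketch. It assumes you already have a dict from CUI to ICD10 codes. The helper names `attach_icd10` and `save_with_icd10` are made up for illustration, and `addl_info['cui2icd10']` is simply the key used in the tutorial you followed, so treat this as a starting point rather than the definitive way to do it:

```python
def attach_icd10(cat, cui2icd10):
    """Store a CUI -> list-of-ICD10-codes mapping on the CDB held by `cat`.

    `attach_icd10` is a hypothetical helper, not part of MedCAT.
    """
    # Normalise every value to a list so each CUI maps to the same shape.
    cat.cdb.addl_info['cui2icd10'] = {
        cui: list(codes) for cui, codes in cui2icd10.items()
    }


def save_with_icd10(model_pack_path, cui2icd10, save_folder):
    """Load a model pack, attach the mapping, and write a new model pack."""
    # Imported here so attach_icd10 can be tried out without MedCAT installed.
    from medcat.cat import CAT

    cat = CAT.load_model_pack(model_pack_path)
    attach_icd10(cat, cui2icd10)
    # The CAT instance holds the modified CDB, so the mapping is saved with it.
    cat.create_model_pack(save_folder)
```

You would then call something like `save_with_icd10('./medmen_wstatus_2021_oct.zip', my_mapping, 'medmen_with_icd10')`, where `my_mapping` is whatever CUI to ICD10 dict you have built.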

mhmm … I guess this is the part where I am confused for medmen and snomed.

Could you please help me fill in the ??? in the code snippets below? Thanks!

# --- code 1
from medcat.cat import CAT
cat = CAT.load_model_pack('./medmen_wstatus_2021_oct.zip')
# cat.cdb.addl_info['cui2icd10'] = ???
ents = cat.get_entities("epilepsy", only_cui=False)
assert ents['entities'][0]['icd10'] == ['G40.909']
cat.create_model_pack('medmen_with_icd10')
# --- code 2
from medcat.cdb import CDB
from medcat.cat import CAT
cdb = CDB.load('SNOMED_cdb.dat')
def get_direct_refset_mapping(in_dict: dict) -> dict:
    ret_dict = dict()
    for k, vals in in_dict.items():
        ret_dict[k] = [v['code'] for v in vals]
    return ret_dict
from medcat.utils.preprocess_snomed import Snomed
snomed = Snomed('SnomedCT_InternationalRF2_PRODUCTION_20240301T120000Z')
icd_dict = snomed.map_snomed2icd10()
sctid2icd10 = get_direct_refset_mapping(icd_dict)
cdb.addl_info['cui2icd10'] = sctid2icd10

# cat = CAT(cdb=cdb, config=cdb.config, vocab=???)

I see you’re first (in code 1) loading the MedMentions model. That’s a UMLS based model.

You’re then (in code 2) seemingly loading a SNOMED based CDB. And then using our preprocessing tools to get and add the SNOMED to ICD10 mappings from the 2024 SNOMED International release.
As far as I know, this should work just fine.

Now, if you want to relate the UMLS terms in the MedMentions model to ICD10, you would need to use the UMLS version that this model was created with (with a different version you may run into unforeseen conflicts) and look for the corresponding ICD10 terms for each of the concepts where possible. This may not always be straightforward, though, since there can be one-to-many or many-to-one mappings.
Off the top of my head, I think this will involve looking into MRCONSO.RRF (and potentially some other files as well) and picking out the lines where the source is ICD10. In those lines the source ID should be the ICD10 code, which should allow you to link it to the UMLS concept.
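As a rough sketch of that MRCONSO.RRF lookup: the function below reads the pipe-delimited file and collects the codes for each CUI. It assumes the standard MRCONSO column order (CUI in field 0, source abbreviation SAB in field 11, CODE in field 13), and that the ICD10 rows you care about carry the SAB values `ICD10` or `ICD10CM`. Which sources you actually want is something you would need to check against your UMLS release. The function name `read_cui2icd10` is made up for this example.

```python
from collections import defaultdict

# Field positions in MRCONSO.RRF (pipe-delimited, no header row).
CUI_COL, SAB_COL, CODE_COL = 0, 11, 13


def read_cui2icd10(mrconso_path, sources=("ICD10", "ICD10CM")):
    """Collect ICD10 codes per UMLS CUI from an MRCONSO.RRF file.

    Hypothetical helper; `sources` is an assumption about which SAB
    values count as ICD10 for your purposes.
    """
    cui2codes = defaultdict(set)
    with open(mrconso_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("|")
            # Skip malformed lines and rows from other vocabularies.
            if len(fields) <= CODE_COL or fields[SAB_COL] not in sources:
                continue
            cui2codes[fields[CUI_COL]].add(fields[CODE_COL])
    # A set per CUI removes duplicates; sorted lists keep the output stable
    # and make one-to-many mappings easy to spot.
    return {cui: sorted(codes) for cui, codes in cui2codes.items()}
```

The resulting dict has the CUI to list-of-codes shape that could be assigned to `cat.cdb.addl_info['cui2icd10']` before calling `create_model_pack`. CUIs with more than one code are the one-to-many cases mentioned above, so it is worth inspecting those before relying on them.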