My use case requires about 3,000 concepts, and even the smallest off-the-shelf UMLS model has many thousands of concepts.
I started using this function to remove the concepts I don’t need from the small model, but it takes a very long time: multiple seconds to remove 10 items. In the time it took me to write this post, I removed about 800.
This is intractable if I want to use the 4mm set and then keep only, say, 20,000 UMLS concepts.
Does anyone have a recommendation or strategy for surgically creating a smaller subset CDB?
One strategy I have in mind is manually creating a new CDB object by taking the relevant internal objects from an existing CDB but keeping only the keys I care about.
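To make the idea concrete, here is a minimal sketch of that kind of subsetting on plain Python dicts. It does not touch a real CDB: the helper names (subset_cui_dict, subset_name_dict) and the toy data are hypothetical, and I'm assuming the internal objects look roughly like a CUI-keyed dict of name sets plus a name-keyed dict of CUI lists. A real CDB holds more internal structures than these two, and all of them would need the same treatment.

```python
from typing import Dict, Iterable, List, Set


def subset_cui_dict(cui_dict: Dict[str, Set[str]], keep: Iterable[str]) -> Dict[str, Set[str]]:
    """Return a new dict holding only the CUIs in `keep` (values are copied)."""
    keep_set = set(keep)
    return {cui: set(names) for cui, names in cui_dict.items() if cui in keep_set}


def subset_name_dict(name2cuis: Dict[str, List[str]], keep: Iterable[str]) -> Dict[str, List[str]]:
    """Drop unwanted CUIs from each name, and drop names left with no CUIs."""
    keep_set = set(keep)
    pruned: Dict[str, List[str]] = {}
    for name, cuis in name2cuis.items():
        kept = [cui for cui in cuis if cui in keep_set]
        if kept:
            pruned[name] = kept
    return pruned


# Toy data standing in for a CDB's internal dicts (made-up CUIs and names).
cui2names = {"C001": {"aspirin"}, "C002": {"ibuprofen"}, "C003": {"cold", "common~cold"}}
name2cuis = {"aspirin": ["C001"], "ibuprofen": ["C002"], "cold": ["C003", "C002"]}
wanted = ["C001", "C003"]

small_cui2names = subset_cui_dict(cui2names, wanted)
small_name2cuis = subset_name_dict(name2cuis, wanted)
print(sorted(small_cui2names))
print(small_name2cuis)
```

Building new dicts in one pass over the data avoids the per-item removal cost, which is where the time seems to go when removing concepts one by one.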
When I tried running filter_by_cui on the small UMLS model, I got this exception:
Exception: This CDB does not support subsetting - most likely because it is a `small/medium` version of a CDB
It looks to me like the only reason models without a filled-in cui2snames are not allowed is that the filter_by_cui method expects the dict to be filled for all CUIs.
So the simplest solution I can offer is to populate cui2snames from cui2names. In general, cui2snames would be expected to have more names for each CUI than cui2names, but all the names in cui2names should still fit in cui2snames (at least that was the case with the full UMLS model).
When populating based on cui2names, I’d want to make sure to map each CUI to a new set so that the two dicts can subsequently be modified independently.
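As a rough illustration of that copying step (a sketch on plain dicts, not the actual code in the PR; the function name populate_snames_from_names and the toy data are made up for this example):

```python
from typing import Dict, Set


def populate_snames_from_names(cui2names: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build a cui2snames-like dict where each CUI maps to a fresh copy of its names."""
    # set(names) creates a new set, so the two dicts never share a value object.
    return {cui: set(names) for cui, names in cui2names.items()}


# Toy data (made-up CUIs and names).
cui2names = {"C001": {"aspirin"}, "C002": {"ibuprofen"}}
cui2snames = populate_snames_from_names(cui2names)

# Changing one dict's set leaves the other untouched.
cui2snames["C001"].add("acetylsalicylic~acid")
print(cui2names["C001"])
print(cui2snames["C001"])
```

Had the sets been shared (for example by assigning the same set object to both dicts), adding or removing a name in one dict would silently change the other.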
You can refer to the PR I just opened:
PS:
I tried this with the small UMLS model (umls_sm_pt2ch_533bab5115c6c2d6, the updated small version you can get through the link on the MedCAT GitHub repo, with a parent-to-children mapping added):
filter_cui_list = ['C5393760', 'C3176945', 'C4680675', 'C0054729', 'C5364763', 'C4601133'] # random list of CUIs
cat.cdb.populate_cui2snames()
cat.cdb.filter_by_cui(filter_cui_list)