Advice on MedCAT for a small set of concepts

jkgenser · June 6, 2023, 9:16pm

My use case requires about 3,000 concepts, and even the smallest off-the-shelf UMLS model has many thousands of concepts.

I started using this function to remove the concepts I don’t need from the small model, but it is very slow: removing 10 items takes multiple seconds. In the time it took me to write this post, I removed about 800.

This is actually intractable if I want to start from the 4mm set and then keep only, say, 20,000 UMLS concepts.

Does anyone have a recommendation or strategy for surgically creating a smaller subset CDB?

One strategy I have in mind is manually creating a new CDB object by taking the relevant internal objects from an existing CDB but keeping only the keys I care about.
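The strategy above can be sketched on plain dictionaries. This is only an illustration with toy data, not MedCAT code: the helper names `subset_cui_keyed` and `subset_name_keyed` are hypothetical, the dicts merely stand in for CUI-keyed and name-keyed mappings such as `cui2names` and `name2cuis`, and a real CDB holds more internal mappings that would all need to be pruned consistently.

```python
from typing import Dict, List, Set


def subset_cui_keyed(mapping: Dict[str, Set[str]], keep: Set[str]) -> Dict[str, Set[str]]:
    """Build a new CUI-keyed dict with only the wanted CUIs, copying each set."""
    return {cui: set(values) for cui, values in mapping.items() if cui in keep}


def subset_name_keyed(name2cuis: Dict[str, List[str]], keep: Set[str]) -> Dict[str, List[str]]:
    """Build a new name-keyed dict, dropping names that no longer map to any kept CUI."""
    pruned: Dict[str, List[str]] = {}
    for name, cuis in name2cuis.items():
        kept = [cui for cui in cuis if cui in keep]
        if kept:
            pruned[name] = kept
    return pruned


# Toy stand-ins for CDB internals (made-up CUIs, not real UMLS data)
cui2names = {
    "C001": {"aspirin"},
    "C002": {"ibuprofen", "advil"},
    "C003": {"paracetamol", "acetaminophen"},
}
name2cuis = {
    "aspirin": ["C001"],
    "ibuprofen": ["C002"],
    "advil": ["C002"],
    "paracetamol": ["C003"],
    "acetaminophen": ["C003"],
}

keep = {"C001", "C003"}
small_cui2names = subset_cui_keyed(cui2names, keep)
small_name2cuis = subset_name_keyed(name2cuis, keep)

print(sorted(small_cui2names))
print(sorted(small_name2cuis))
```

Building new dicts in a single pass costs time proportional to the size of the original mappings, which is why it may scale better than removing unwanted concepts one at a time.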

When I tried running filter_by_cui on the small UMLS model, I got this exception:

Exception: This CDB does not support subsetting - most likely because it is a `small/medium` version of a CDB

mart.ratas · June 21, 2023, 9:02am

It looks to me like the only reason models without a filled-in cui2snames are not allowed is that the filter_by_cui method expects the dict to be filled for all CUIs.

So the simplest solution I can offer is to populate cui2snames from cui2names. In general, cui2snames would be expected to have more names for each CUI than cui2names, but all the names in cui2names should still fit in cui2snames (at least that was the case with the full UMLS model).

When populating based on cui2names, I’d want to make sure to map each CUI to a new set so that the two dicts can subsequently be modified independently.
You can refer to the PR I just opened:

github.com/CogStack/MedCAT: CU-86785yhfk Add method to populate cui2snames with data from cui2names
CogStack:master ← mart-r:subsetting · opened by mart-r at 08:53AM, 21 Jun 23 UTC · +21 -0

This is an attempt to allow subsetting of smaller models as well, in reference to: https://discourse.cogstack.org/t/advice-on-medcat-for-a-small-set-of-concepts/216 As far as I can tell, the only reason the `CDB.filter_by_cui` method fails on an empty `cui2snames` is that it expects all CUIs to have snames filled in as well, and it would throw an exception otherwise. If someone knows of other reasons, I'd be happy to hear them.
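The point about mapping each CUI to a new set can be illustrated with plain dictionaries. This is a minimal sketch with toy data, not the code from the PR: the function name `populate_snames_from_names` is hypothetical and the dicts only stand in for `cui2names` and `cui2snames`.

```python
from typing import Dict, Set


def populate_snames_from_names(cui2names: Dict[str, Set[str]],
                               cui2snames: Dict[str, Set[str]]) -> None:
    """Fill cui2snames from cui2names without sharing set objects between the two."""
    for cui, names in cui2names.items():
        # setdefault creates a fresh set for a missing CUI; update copies the names in
        cui2snames.setdefault(cui, set()).update(names)


# Toy stand-ins (made-up CUIs, not real UMLS data)
cui2names = {"C001": {"aspirin"}, "C002": {"ibuprofen", "advil"}}
cui2snames: Dict[str, Set[str]] = {}

populate_snames_from_names(cui2names, cui2snames)

# Changing one dict afterwards leaves the other untouched
cui2snames["C001"].add("acetylsalicylic")
print(cui2names["C001"])
print(cui2snames["C001"])
```

Assigning `cui2snames[cui] = cui2names[cui]` directly would make both dicts point at the same set, so a later change through one would silently show up in the other.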

PS:
I tried this with the small UMLS model (umls_sm_pt2ch_533bab5115c6c2d6, the updated small version, with a parent-to-children mapping added, that you can get through the link in the MedCAT GitHub repo):

filter_cui_list = ['C5393760', 'C3176945', 'C4680675', 'C0054729', 'C5364763', 'C4601133'] # random list of CUIs
cat.cdb.populate_cui2snames()
cat.cdb.filter_by_cui(filter_cui_list)

And that was successful.

jkgenser · June 26, 2023, 1:22pm

Amazing, thank you. I’ll give this a try!
