I’m interested in tagging health text with symptoms, diseases, body parts … and drugs. I’d like to do my analysis with SNOMED-CT and RxNorm.
Am I better off subsetting the pre-trained UMLS dataset somehow, or building an RxNorm + SNOMED-CT CDB from scratch and training that up? (even as I type it, the second option sounds like a bad idea)
As @Jthteo mentioned, taking the pre-trained UMLS model and filtering it down to just the SNOMED+RxNorm concepts will be the easier of the two approaches.
To apply this filter, add the relevant subset of SNOMED+RxNorm CUIs from UMLS to the set in the linking configuration, as follows:
cdb.config.linking['filters'] = {'cuis': set()}
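If it helps, here is a rough sketch of one way to build that set of CUIs from the UMLS MRCONSO.RRF file. It assumes the standard pipe-delimited layout (CUI in the first field, source abbreviation SAB in the twelfth) and that the sources are labelled SNOMEDCT_US and RXNORM in your UMLS release, so please check both against your copy. The function name `collect_cuis` and the sample rows are made up for illustration and are not part of MedCAT.

```python
from typing import Iterable, Set

# Assumed source abbreviations (SAB); verify against your UMLS release.
WANTED_SABS = {"SNOMEDCT_US", "RXNORM"}


def collect_cuis(mrconso_lines: Iterable[str], sabs: Set[str] = WANTED_SABS) -> Set[str]:
    """Return the CUIs of every MRCONSO row whose SAB is in `sabs`."""
    cuis = set()
    for line in mrconso_lines:
        fields = line.rstrip("\n").split("|")
        # Assumed layout: field 0 is the CUI, field 11 is the SAB.
        if len(fields) > 11 and fields[11] in sabs:
            cuis.add(fields[0])
    return cuis


# Invented rows in MRCONSO shape, only to show the filtering.
sample = [
    "C0000001|ENG|P|L1|PF|S1|Y|A1||111||SNOMEDCT_US|PT|111|Example finding|0|N||",
    "C0000002|ENG|P|L2|PF|S2|Y|A2||222||RXNORM|IN|222|Example drug|0|N||",
    "C0000003|ENG|P|L3|PF|S3|Y|A3||D01||MSH|MH|D01|Example heading|0|N||",
]
keep = collect_cuis(sample)

# With a real file you could pass the open file handle instead of `sample`,
# then hand the result to the filter setting shown above:
# cdb.config.linking['filters'] = {'cuis': keep}
```

On a full UMLS release you would pass the open MRCONSO.RRF file handle in place of `sample`; the file is read line by line, so it does not need to fit in memory.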
Unfortunately, we currently don’t have a freely available RxNorm+SNOMED CDB that we can share. I’m afraid if you want to take this option you will need to do this yourself.
The Specialised MedCAT tutorials walk through how to create a SNOMED MedCAT CDB. You will just need to convert RxNorm into the same format, save it as a CSV file, and then combine the two preprocessed CSVs using the command below: