I have been trying to build a NER+linking pipeline. I want to get the SNOMED CT codes of the recognized entities and then map each code to ICD-10. I looked up the MedCAT documentation and tutorials but they weren’t helpful. I just want to know the steps needed to achieve my goal.
Please refer to the CAT.get_entities method:
https://medcat.readthedocs.io/en/latest/autoapi/medcat/cat/index.html#medcat.cat.CAT.get_entities
If used with the default addl_info argument, each entity you receive will have an icd10 key. If the model has ICD-10 mappings embedded and the CUI found maps to an ICD-10 term, the value will be a list of the ICD-10 terms the CUI maps to.
An example result from CAT.get_entities for a SNOMED model:
{
  "entities": {
    "7": {
      "pretty_name": "Syncope",
      "cui": "271594007",
      "type_ids": [
        "67667581"
      ],
      "types": [
        ""
      ],
      "source_value": "Syncope",
      "detected_name": "syncope",
      "acc": 0.4381021320061105,
      "context_similarity": 0.4381021320061105,
      "start": 114,
      "end": 121,
      "icd10": [
        "R55"
      ],
      "ontologies": [
        "20220803_SNOMED_UK_CLINICAL_EXT"
      ],
      "snomed": [],
      "id": 7,
      "meta_anns": {}
    }
  }
}
So you’d need to look at the "icd10" value of each entity, i.e.:
entities = cat.get_entities(text)
ents = entities['entities']
for ent in ents.values():
    ent_cui = ent['cui']
    ent_icd10 = ent['icd10']
    # do what you wish with it
    print(ent_cui, '->', ent_icd10)
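To go from individual entities to a reusable SNOMED-to-ICD-10 lookup table, you can aggregate the pairs from the result dict. This is a plain-Python sketch over the get_entities output shape shown above; the snomed_to_icd10 helper name is my own, not part of the MedCAT API:

```python
def snomed_to_icd10(result):
    """Collect a SNOMED CUI -> ICD-10 codes mapping from a
    CAT.get_entities-style result dict."""
    mapping = {}
    for ent in result.get("entities", {}).values():
        codes = ent.get("icd10") or []
        if codes:
            mapping.setdefault(ent["cui"], set()).update(codes)
    # return sorted lists so the output is deterministic
    return {cui: sorted(codes) for cui, codes in mapping.items()}

# Example using the entity shown earlier in this thread:
result = {
    "entities": {
        "7": {"cui": "271594007", "pretty_name": "Syncope", "icd10": ["R55"]},
    }
}
print(snomed_to_icd10(result))  # {'271594007': ['R55']}
```

Entities whose icd10 list is empty or missing are simply skipped, so the table only contains CUIs that actually map to something.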
Thank you for the help. Can you please tell me the maximum text length MedCAT can perform NER and linking on?
The maximum is defined in the config at config.preprocessing.max_document_length:
https://medcat.readthedocs.io/en/latest/autoapi/medcat/config/index.html#medcat.config.Preprocessing.max_document_length
By default, this is 1M characters.
This is the upper limit for the spacy model we depend on under the hood. You should be able to lower this number, but as far as I know, if you increase it and feed MedCAT a larger document, it will fail due to an exception raised by spacy.
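If your documents can exceed that limit, one workaround (my suggestion, not an official MedCAT feature) is to split the text into chunks below max_document_length before calling cat.get_entities, preferably breaking on whitespace so candidate entities are not cut mid-word:

```python
def chunk_text(text, max_len=1_000_000):
    """Split text into chunks no longer than max_len characters,
    breaking on whitespace where possible so that words (and hence
    candidate entities) are not cut in half."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # back up to the last space inside the window, if any
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        # skip the space we broke on, if we broke on one
        start = end + 1 if end < len(text) and text[end] == " " else end
    return chunks
```

Each chunk can then be passed to cat.get_entities separately; note that the start/end character offsets in the results are relative to the chunk, so you would need to add the chunk's offset in the original document if you want absolute positions.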
Thank you. Can you also please tell me how I can improve the accuracy of MedCAT NER+linking?
It depends on the model you’re using as well as what data you’re using it on.
The models we’ve got publicly available aren’t going to provide the best performance, especially for very specialised data.
The way we generally train a model is as follows:
- Train on some public data (e.g. MIMIC-III or MIMIC-IV)
- Train on some relevant hospital data (potentially on multiple sites)
- Obtain supervised training datasets (by annotating with MedCATtrainer)
- Use the supervised training datasets to perform supervised training
Now, the publicly available models have generally only gone through the first of those steps.
So if you want the NER+L process to work well for your specific use case, you would need to do some additional training / fine tuning.
If your data is considerably different from what the model was originally trained on, unsupervised training on some of it may already give you increased performance.
However, to improve the performance further, you would need to do some (good quality!) supervised training as well.
One way to figure out what to focus the additional training / fine-tuning on is to look at the CUIs you’re interested in within the model’s CDB. If a CUI does not exist in cdb.cui2count_train, or its training count is small for many of the CUIs you’re interested in, that is a good indication that more training is needed on the topics you’re interested in.
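That check can be sketched in a few lines. cui2count_train is a plain dict of CUI to training count; the helper name and the threshold of 10 below are my own illustrative choices, not part of the MedCAT API:

```python
def undertrained_cuis(cui2count_train, cuis_of_interest, min_count=10):
    """Return the CUIs of interest that are missing from the training
    counts, or whose training count falls below min_count."""
    return sorted(
        cui for cui in cuis_of_interest
        if cui2count_train.get(cui, 0) < min_count
    )

# Hypothetical counts, as you would get from cat.cdb.cui2count_train:
counts = {"271594007": 523, "12345678": 4}
print(undertrained_cuis(counts, ["271594007", "12345678", "87654321"]))
# ['12345678', '87654321']
```

A CUI that is entirely absent from the dict is treated as having a count of 0, so it always shows up as undertrained.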
I gave the following input text: “leukemia”. The response was empty as MedCAT did not recognize it; I checked online and it is present in SNOMED CT. Should I update my concept database? Can I use ICD-10 as the concept database with a SNOMED CT model? Also, can you tell me how I can fine-tune my model on publicly available datasets such as NCBI Disease? These datasets are labelled.
The library isn’t necessarily designed to recognise a single word (it’d be far easier to do a dictionary lookup for that). It’s designed to recognise entities in free text.
If you give it a single word, it may very well be ambiguous. MedCAT uses the context around a word to disambiguate it. But if you don’t give it any context, then this cannot be done.
You cannot ‘update’ your concept database from SNOMED to ICD-10. The model was trained on SNOMED concepts; changing the database after training would do nothing but break the model. It would be like learning to drive in a car, but then being presented with an airplane during the driving test.
With that said, certain models do have ICD-10 mappings embedded. So if your model has them, every entity that maps to one or more ICD-10 terms should (at least by default) carry an "icd10" key listing the corresponding ICD-10 term(s).
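To check whether your model carries such mappings at all, you can inspect the CDB's additional info. In MedCAT v1 these mappings live in cdb.addl_info under "cui2icd10" (verify against your version); the sketch below works on a plain dict of that shape:

```python
def has_icd10_mappings(addl_info):
    """Report whether a CDB's addl_info carries any non-empty
    cui2icd10 entries (the source of the "icd10" key above)."""
    cui2icd10 = addl_info.get("cui2icd10", {})
    return any(codes for codes in cui2icd10.values())

# With a real model you would pass cat.cdb.addl_info instead:
print(has_icd10_mappings({"cui2icd10": {"271594007": ["R55"]}}))  # True
print(has_icd10_mappings({}))                                     # False
```

If this reports False, no amount of configuration will make "icd10" values appear; you would need a model packaged with the mappings.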
As for training, please refer to the tutorials: