I am a data analyst at Royal Brompton Hospital, currently with about 15,000 CT thorax reports. I would like to use MedCAT to analyse these reports. I am looking to see if MedCAT can identify positive/negative/suspect status for the following terms.
• Bronchiectasis – subterms would be mild/moderate/severe, central, varicose, cylindrical, and if possible an indication of localisation (i.e. upper, lower, middle lobes, bilateral, panlobar, localised, etc.)
• Mycetoma; aspergilloma
• Fungal ball
• Nodule – subterms would be cavitary or cavitating
• Pleural thickening
• Mucus plugging or tree-in-bud change
• Ground glass
I have also attached a screenshot of where I am currently with MedCAT. I know I can copy and paste each CT report in manually, but I was wondering if there is any way to process all 15,000 reports at once and get +ve/-ve/suspect values for the aforementioned terms?
10 May 2022
In short, the MedCAT user interface is one of the initial steps in creating a labelled dataset. This labelled dataset is used both to train the model and to validate its performance on a sample of your dataset (in your case, 129 documents out of 15,000).
The actual training and annotation process occurs outside the user interface. Once your sample of documents is annotated, go to the “Project annotate entities” tab, select all annotation projects relevant to your use case, and select “export”. You now have a labelled MedCATtrainer dataset.
Since you are using an open-source model, you can directly interchange the models, the MedCATtrainer export, and your dataset within the scripts found in the MedCAT tutorials. Just be sure to add your relevant concept filter.
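As a rough sketch of what assembling such a concept filter looks like in plain Python: the codes below are placeholders, not real SNOMED CT/UMLS identifiers, so you would need to look up the actual codes for your terms. In the MedCAT tutorials, the resulting set is then assigned to the model's linking filters.

```python
# Sketch: build a CUI filter for the concepts of interest.
# The codes below are PLACEHOLDERS -- look up the real SNOMED CT /
# UMLS identifiers for each term before using this.
TERM_TO_CUI = {
    "bronchiectasis": "CUI_BRONCHIECTASIS",     # placeholder code
    "mycetoma": "CUI_MYCETOMA",                 # placeholder code
    "nodule": "CUI_NODULE",                     # placeholder code
    "pleural thickening": "CUI_PLEURAL_THICK",  # placeholder code
    "ground glass": "CUI_GROUND_GLASS",         # placeholder code
}

# MedCAT expects the filter as a set of CUIs; the tutorials show this
# set being assigned to the model config's linking filters so that
# only these concepts are annotated.
cui_filter = set(TERM_TO_CUI.values())
print(sorted(cui_filter))
```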
Tutorial 4.2 demonstrates how to train a model from an export from MedCATtrainer (the MedCAT UI), and how to save your newly trained model.
Automated annotation of your dataset
Tutorial 4.3 then takes the trained model created in Tutorial 4.2 and describes how to annotate and meta-annotate all documents within your dataset. This will create a JSON file with all annotations across your 15k documents.
Processing the annotations
An example of how a project may choose to process annotations is demonstrated in Tutorial 5.
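As a rough illustration of that processing step (not the tutorial's actual code): given per-document annotations shaped roughly like MedCAT's output, a dict of entities each carrying a concept name and meta-annotation values, you can collapse them into one +ve/-ve/suspect status per term per report. The meta-annotation task name "Status" and its value set here are assumptions; match them to however your MedCATtrainer project is configured.

```python
# Sketch: collapse MedCAT-style annotations into one status per term
# per report. The "Status" meta-annotation name and its values are
# assumptions -- align them with your MedCATtrainer project config.
from collections import defaultdict

# Toy stand-in for the JSON produced when annotating all documents.
annotated_docs = {
    "report_001": {
        "entities": {
            0: {"pretty_name": "Bronchiectasis",
                "meta_anns": {"Status": {"value": "Affirmed"}}},
            1: {"pretty_name": "Pleural thickening",
                "meta_anns": {"Status": {"value": "Negated"}}},
        }
    },
    "report_002": {
        "entities": {
            0: {"pretty_name": "Nodule",
                "meta_anns": {"Status": {"value": "Hypothetical"}}},
        }
    },
}

# Map meta-annotation values onto the +ve/-ve/suspect labels asked for.
STATUS_MAP = {"Affirmed": "+ve", "Negated": "-ve", "Hypothetical": "suspect"}

def tabulate(docs):
    """Return {report_id: {term: '+ve' | '-ve' | 'suspect'}}."""
    table = defaultdict(dict)
    for doc_id, doc in docs.items():
        for ent in doc["entities"].values():
            status = ent["meta_anns"]["Status"]["value"]
            table[doc_id][ent["pretty_name"]] = STATUS_MAP.get(status, "suspect")
    return dict(table)

print(tabulate(annotated_docs))
```

From here, the per-report table can be written out to a CSV for the downstream analysis of all 15k reports.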
Note: I am aware of a pre-trained model at GSTT which should perform better than the available open-source model, since it is trained on real-world NHS data. It should be readily transferable, since GSTT and RBH are within the same NHS trust. Please DM me if it is of interest.
Yes, this is possible, but the first step, the project we set up for you on MedCATtrainer, is to tune the model to recognise the synonyms appropriately. Running on the rest of the 15k follows after the tuning is done and validated.
Thanks for pointing me towards the tutorial. I have been going through it, and it was helpful. So far I have used the model trained on MedMentions, and it is doing alright for the terms that are in MedMentions (such as cavity). However, it is unable to recognize terms in SNOMED CT or UMLS which are not in MedMentions. I was wondering if I could ask for the pre-trained GSTT model with terms from SNOMED CT and UMLS, to try it on the reports and check its performance?
Thanks for the advice about the tuning and validation. I have talked to Anand about it, and we have plans to do both. I understand that tuning and validation will be best in the long term, because they will tune the model specifically for our use case. However, for now I would like to test whether the pre-trained GSTT model is sufficiently good for our purposes. While the results might not be the best, if they are satisfactory enough I will be able to work with them on the next steps of the analysis.
By marking the annotations in MedCATtrainer, you evaluate its suitability at the same time.
I recommend against having too many MedCATtrainer instances and models, as one would rapidly lose track of a model's provenance. I recommend using the model at GSTT rather than creating new instances.