Reproduce MedCAT Experiments from Publication

Hello MedCAT team,

I would like to run experiments with MedCAT on the same datasets used in the MedCAT publication:
“Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit” (2021)

At the bottom of the publication it is stated:
“All code for running the experiments, the toolkit and integration with wider CogStack deployments are available here: MedCAT: GitHub - CogStack/MedCAT: Medical Concept Annotation Tool (…)”

But I did not find the code required to run the experiments in the MedCAT GitHub repository.
Did I perhaps overlook it somewhere?

I understand that it might not be possible to share the datasets freely (although MedMentions is in the repo), but it would help a lot to have access to the code used for the experiments. I could then add the datasets manually where they are needed.

I know that the original MedCAT models for the experiments in the publication are not publicly available (MedCAT model used in validation · Issue #379 · CogStack/MedCAT · GitHub) but I hope the code can be shared.

Kind regards
Kim Tang

Hi Kim Tang,

Thank you for your question.

I’ll preface this with the fact that I wasn’t a part of the team when the paper was prepared and published. So I won’t be able to give you the full details. But I’ll try and point you in the correct direction.

As I’m sure you’ve seen, MedCAT is an ongoing project, so the current state of the repository is vastly different from the state it was in when the results in the paper were produced.
To see the codebase as it was at the time, you’d need to go quite a way back in the repo history. As per the paper, the revisions were made in March 2021, so I would recommend looking for a commit from some time before that, for example one from around the date of the revision.

That said, most of the work was likely done, and most of the results gathered, before then. The publication seems to have originally been sent out in October 2020, so you might have better chances with a commit from somewhere around that time.

With that said, I am fairly certain the exact code used to train and validate isn’t going to be on there. The training process will have involved patient data from hospitals. This training will have taken place on the hospital network on this data, potentially with some credentialing. So it’s not something that could have been committed to the repository.

I believe what the “All code for running the experiments, the toolkit and integration with wider CogStack deployments are available here: MedCAT: GitHub - CogStack/MedCAT: Medical Concept Annotation Tool (…)” part meant is that the code for the training, saving, and using the model and/or toolkit is available there. Not necessarily a “run this file to replicate” kind of procedure. As I mentioned above, this would not really be feasible.

Hello Mart Ratas,

Thanks for your explanation, and for the idea of looking through the commits!

I will have a look and see what I can find.

Hello again @mart.ratas,

I hope it’s okay to continue here rather than opening a new question.

I am currently looking into the ShARe/CLEF 2014 Task 2 dataset, which was also used in the MedCAT paper.
The dataset comprises 300 clinical reports and for each report there is a file containing UMLS annotations, which were used to validate the MedCAT results.

But I don’t understand how exactly MedCAT was validated on this data (I could not find any details in the paper, in the GitHub issues, or here on the CogStack forum when searching for “CLEF” etc.).

So far I have identified three problems in the dataset when using MedCAT, and I am unsure how these were handled.

1. Overlapping concept spans

As I understand it, MedCAT does not return overlapping spans / UMLS annotations, but the dataset contains numerous overlapping annotated spans:

Gold standard annotations for 14888-014879-DISCHARGE_SUMMARY:

Double-checking that the data was parsed correctly from the XML files confirms the overlap:
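A minimal sketch of such an overlap check, for anyone who wants to reproduce it. It assumes the gold annotations have already been parsed into `(start, end)` character offsets with an exclusive end; the function name and the example offsets are made up for illustration and are not taken from the dataset.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def find_overlaps(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one character."""
    ordered = sorted(spans)
    overlaps = []
    for i, (start_a, end_a) in enumerate(ordered):
        for start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                # Spans are sorted by start, so no later span can overlap this one.
                break
            overlaps.append(((start_a, end_a), (start_b, end_b)))
    return overlaps


# Made-up offsets, only to show the shape of the result.
gold_spans = [(10, 25), (18, 30), (40, 45), (40, 52)]
for first, second in find_overlaps(gold_spans):
    print(first, "overlaps", second)
```

Spans that merely touch, such as `(0, 5)` and `(5, 10)`, are not counted as overlapping under this convention.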

2. CUI-less spans

Several annotations mark spans in the text but do not link them to UMLS CUIs; instead they label them “CUI-less” to indicate concepts without a CUI:


3. Annotations with interrupted spans

Several annotations contain not one but multiple spans with start and end markers, which make up the mention for the concept:

But MedCAT cannot match those instances, since the sliding window would “break” at these interruptions.
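To make the third case concrete, here is a small sketch of how I represent such annotations after parsing. The `GoldAnnotation` class, the example sentence, and the placeholder CUI are my own assumptions for illustration; they are not the dataset's format or anything from MedCAT.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GoldAnnotation:
    """Hypothetical container for one parsed gold annotation."""
    cui: str
    fragments: List[Tuple[int, int]]  # sorted (start, end) offsets, end exclusive

    @property
    def is_discontinuous(self) -> bool:
        # More than one fragment means the mention is interrupted.
        return len(self.fragments) > 1

    @property
    def envelope(self) -> Tuple[int, int]:
        # Smallest contiguous span covering all fragments.
        return (self.fragments[0][0], self.fragments[-1][1])


def mention_text(text: str, ann: GoldAnnotation) -> str:
    """Join the fragment texts, skipping the words in between."""
    return " ".join(text[start:end] for start, end in ann.fragments)


# Made-up sentence and placeholder CUI.
text = "The left atrium is moderately dilated."
ann = GoldAnnotation(cui="C0000000", fragments=[(4, 15), (30, 37)])

print(ann.is_discontinuous)      # True
print(mention_text(text, ann))   # left atrium dilated
print(text[slice(*ann.envelope)])  # left atrium is moderately dilated
```

A contiguous matcher could at best produce the envelope span, which includes the extra words, so an exact-span comparison against the gold fragments would not succeed.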

Do you know how these cases were handled in the validation or do you know if some information about that is available somewhere?

I would like to develop and compare a mapping approach with MedCAT, but a sound comparison is difficult without knowing these details.

Were CUI-less annotations excluded and the other annotations kept, or were all annotations from these three cases excluded, with MedCAT validated only on the remaining subset of annotations that it could theoretically find?
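To show why the choice matters, here is a rough sketch that scores the same predictions against the gold standard under different exclusion policies. Everything in it is an assumption on my side: the simplified record types, the exact-match scoring on CUI plus span, and the placeholder CUIs. It is not the procedure used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified records:
# gold = (cui, fragments), prediction = (cui, start, end)
Gold = Tuple[str, List[Tuple[int, int]]]
Pred = Tuple[str, int, int]


def filter_gold(gold: List[Gold], drop_cuiless: bool = True,
                drop_discontinuous: bool = True) -> List[Gold]:
    """Keep only the gold annotations allowed by the chosen policy."""
    kept = []
    for cui, fragments in gold:
        if drop_cuiless and cui == "CUI-less":
            continue
        if drop_discontinuous and len(fragments) > 1:
            continue
        kept.append((cui, fragments))
    return kept


def score(gold: List[Gold], predicted: List[Pred]) -> Dict[str, float]:
    """Exact-match precision/recall/F1 on (cui, start, end); an assumed metric."""
    gold_keys = {(cui, frags[0][0], frags[-1][1]) for cui, frags in gold}
    pred_keys = set(predicted)
    tp = len(gold_keys & pred_keys)
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example with placeholder CUIs.
gold = [
    ("C0000001", [(0, 5)]),
    ("CUI-less", [(10, 15)]),
    ("C0000002", [(20, 25), (30, 35)]),
]
predicted = [("C0000001", 0, 5)]

strict = score(filter_gold(gold), predicted)
full = score(filter_gold(gold, drop_cuiless=False, drop_discontinuous=False), predicted)
print("subset only:", strict)
print("all gold:   ", full)
```

In this toy case recall is 1.0 on the filtered subset but only one third on the full gold set, so reported numbers are not comparable unless the exclusion policy is known.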

Thanks a lot for reading and helping!
Please also let me know if I have misunderstood something.

Kind regards,