Reproduce MedCAT Experiments from Publication

Hello MedCAT team,

I would like to run experiments with MedCAT on the same datasets used in the MedCAT publication:
“Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit” (2021)

At the bottom of the publication it is stated:
“All code for running the experiments, the toolkit and integration with wider CogStack deployments are available here: MedCAT: GitHub - CogStack/MedCAT: Medical Concept Annotation Tool (…)”

However, I could not find the code required to run the experiments in the MedCAT GitHub repository.
Did I perhaps overlook it somewhere?

I understand that it might not be possible to share the datasets freely (although MedMentions is in the repo), but it would help a lot to have access to the code used for the experiments. I could then add the datasets manually where they are needed.

I know that the original MedCAT models for the experiments in the publication are not publicly available (MedCAT model used in validation · Issue #379 · CogStack/MedCAT · GitHub) but I hope the code can be shared.

Kind regards
Kim Tang

Hi Kim Tang,

Thank you for your question.

I’ll preface this with the fact that I wasn’t a part of the team when the paper was prepared and published. So I won’t be able to give you the full details. But I’ll try and point you in the correct direction.

As I’m sure you’ve seen, MedCAT is an ongoing project. The state the repository is in now is vastly different from the state it was in when the results in the paper were produced.
So to see the codebase as it was at the time of the results, you’d need to go quite a way back in the repo history. As per the paper, the revisions were made in March 2021, so I would recommend looking for a commit some time before that.
So perhaps something like this:

From the date of the revision.
That said, most of the work was likely done, and most of the results gathered, before this. The publication seems to have originally been submitted in October 2020, so you might have better chances with a commit from around that time.
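
If it helps, here is a minimal sketch of how one might find the last commit before a given date in a local clone. It only wraps the standard `git rev-list -n 1 --before=<date> <ref>` command. The function names, the clone path and the exact cutoff dates are my own assumptions for illustration, not anything taken from the paper or the repo.

```python
import subprocess
from typing import List


def rev_list_command(before: str, ref: str = "HEAD") -> List[str]:
    """Build the git command that prints the newest commit reachable
    from `ref` whose commit date is earlier than `before`."""
    return ["git", "rev-list", "-n", "1", f"--before={before}", ref]


def last_commit_before(repo_dir: str, before: str, ref: str = "HEAD") -> str:
    """Run the command inside an existing local clone and return the hash.

    Returns an empty string if no commit is older than `before`.
    """
    result = subprocess.run(
        rev_list_command(before, ref),
        cwd=repo_dir,          # path to the local clone (assumed to exist)
        capture_output=True,
        text=True,
        check=True,            # raise if git itself fails
    )
    return result.stdout.strip()


# Example usage (hypothetical clone path and illustrative cutoff dates):
#   sha = last_commit_before("MedCAT", "2021-03-01")
#   sha = last_commit_before("MedCAT", "2020-10-01")
# The resulting hash can then be inspected with `git checkout <sha>`.
```

Note that `--before` filters on the commit date, so the result is only an approximation of "the code as it was around then".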

With that said, I am fairly certain the exact code used to train and validate isn’t going to be on there. The training process will have involved patient data from hospitals. This training will have taken place on the hospital network on this data, potentially with some credentialing. So it’s not something that could have been committed to the repository.

I believe what the “All code for running the experiments, the toolkit and integration with wider CogStack deployments are available here: MedCAT: GitHub - CogStack/MedCAT: Medical Concept Annotation Tool (…)” part meant is that the code for the training, saving, and using the model and/or toolkit is available there. Not necessarily a “run this file to replicate” kind of procedure. As I mentioned above, this would not really be feasible.

Hello Mart Ratas,

Thanks for your explanation, and for the suggestion to look through the commit history!

I will have a look and see what I can find.