Reusing Vocabulary files

jxz · May 6, 2025, 4:07pm

I got access to your “SNOMED INT enriched with UMLS and trained unsupervised on MIMIC-III” model.

Just to be sure, I want to confirm that it is alright to use its vocabulary to build a new model alongside our existing SNOMED-Canada-based CDB. We created this CDB in-house.
Could you also confirm the corpus that was used to create this vocab (UMLS Metathesaurus + Wikipedia), please?
I’m curious, do you think creating a new vocabulary using Wikipedia and MIMIC-III together would lead to a noticeable boost in performance?

Thank you

mart.ratas · May 7, 2025, 9:08am

That should be fine. In fact, I’d be surprised if people hadn’t already reused it before.
That should indeed be the corpus the Vocab is based on.
It’s hard to tell. You can always try it out if you’re curious. The Vocab is used for the context embeddings of concepts. As such, if you create one that better captures the embeddings of the relevant words (or one that simply has embeddings for more words), it could very well lead to better performance. I’m sure it’s possible to do, though I don’t know whether it’s easy.
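
To make the "embeddings for more words" point concrete, here is a toy sketch in plain Python. It is not MedCAT's implementation: the two vocabularies, the words, the 2-dimensional vectors, and the helper names `context_vector` and `coverage` are all made up for illustration, and the context embedding is assumed to be a simple average of the known word vectors.

```python
from typing import Dict, List, Optional

# Hypothetical word-vector tables; every word and number here is invented.
small_vocab: Dict[str, List[float]] = {
    "chest": [1.0, 0.0],
    "pain": [0.0, 1.0],
}
large_vocab: Dict[str, List[float]] = {
    **small_vocab,
    "radiating": [1.0, 1.0],
    "arm": [0.5, 0.5],
}


def coverage(tokens: List[str], vocab: Dict[str, List[float]]) -> float:
    """Fraction of context tokens that have a vector in the vocab."""
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)


def context_vector(
    tokens: List[str], vocab: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of the tokens the vocab knows; unknown tokens are skipped."""
    known = [vocab[t] for t in tokens if t in vocab]
    if not known:
        return None  # nothing in the context could be embedded
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


context = ["chest", "pain", "radiating", "arm"]
for name, vocab in (("small", small_vocab), ("large", large_vocab)):
    print(name, coverage(context, vocab), context_vector(context, vocab))
```

With the smaller table, half of the context words are silently dropped, so the resulting context vector is built from less information. Checking coverage of your own corpus against a candidate vocabulary in this way is a cheap first test before retraining anything.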
