Creating Vocab from the Input TextString

Sen · May 22, 2023, 11:59pm

I have created SNOMET_CDB.DAT as mentioned in

https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/specialised/Preprocessing_SNOMED_CT.html

I am curious about how best to create a vocab. Can somebody point me to the code that creates a vocab from an input text file?
The following is not helpful for me.

patrickj · May 25, 2023, 8:55am

Hi Sen

Did you have any luck with this? I have the same question.

The following article:

https://www.researchgate.net/publication/351357469_Multi-domain_Clinical_Natural_Language_Processing_with_MedCAT_the_Medical_Concept_Annotation_Toolkit

says that “We have compiled our own VCB by scraping Wikipedia and enriching it with words from UMLS. Only the Wikipedia VCB is made public, but the full VCB can be built with scripts provided in the MedCAT repository (GitHub - CogStack/MedCAT: Medical Concept Annotation Tool).”

However, I can’t find the scripts.

Sen · May 25, 2023, 1:17pm

Thanks @patrickj. Now I know I am not alone. @anthony.shek, we would appreciate any guidance.

anthony.shek · May 25, 2023, 5:45pm

Hey @Sen @patrickj

Unfortunately, we do not share the exact way we do it. However, you can use the code below to create your own. Just replace the comments with your own code. Good luck!

from medcat.vocab import Vocab

vocab = Vocab()

# The input file (vocab_data.txt here) needs to be tab-separated, one token per line:
#   <token>\t<word_count>\t<vector_embedding_separated_by_spaces>
# Vector embeddings can be created with Word2Vec; you can also use embeddings
# calculated with transformer libraries and models such as BERT.
# Embeddings of 300 dimensions are standard.

vocab.add_words('vocab_data.txt', replace=True)  # load tokens, counts and vectors
vocab.make_unigram_table()                       # build the unigram table from the word counts
vocab.save("vocab.dat")                          # write the vocab to disk
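For anyone wondering how to get from a raw text file to the tab-separated input described in the comments above, here is a minimal sketch. It is not the script the MedCAT team uses. The helper names (`tokenize`, `count_tokens`, `write_vocab_data`) and the simple regex tokenizer are assumptions made for illustration, and the `vectors` dictionary stands in for embeddings you would train yourself, for example with Word2Vec. Tokens without an embedding are skipped here, which is a design choice rather than a requirement.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def tokenize(text: str) -> List[str]:
    # Hypothetical tokenizer: lowercase and keep runs of letters/digits.
    # Swap in whatever tokenizer matches your embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())


def count_tokens(lines: Iterable[str]) -> Counter:
    # Count how often each token occurs across all input lines.
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def write_vocab_data(path: str, counts: Counter,
                     vectors: Dict[str, List[float]]) -> int:
    # Write one row per token: token, TAB, count, TAB, space-separated vector.
    # Returns the number of rows written.
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for token, count in counts.most_common():
            vec = vectors.get(token)
            if vec is None:
                continue  # assumption: skip tokens that have no embedding
            vec_str = " ".join(f"{x:.6f}" for x in vec)
            f.write(f"{token}\t{count}\t{vec_str}\n")
            written += 1
    return written


if __name__ == "__main__":
    # Toy input; in practice, iterate over the lines of your own text file.
    lines = ["Chest pain and chest tightness", "Pain on exertion"]
    counts = count_tokens(lines)
    # Placeholder 3-dimensional vectors; real ones would come from your model.
    vectors = {"chest": [0.1, 0.2, 0.3], "pain": [0.4, 0.5, 0.6]}
    write_vocab_data("vocab_data.txt", counts, vectors)
```

The resulting `vocab_data.txt` can then be passed to `vocab.add_words` as in the snippet above.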

patrickj · May 25, 2023, 7:21pm

Thanks for the reply Anthony

Sen · May 26, 2023, 3:35am

Thanks for the help, @anthony.shek. I was able to build the vocab.
@patrickj, were you able to get it working? Let me know if you want a hand.
