Creating Vocab from the Input Text String

I have created SNOMET_CDB.DAT as mentioned in

https://htmlpreview.github.io/?https://github.com/CogStack/MedCATtutorials/blob/main/notebooks/specialised/Preprocessing_SNOMED_CT.html

I am curious about how best to create a vocab. Can somebody point me to the code that creates a vocab from an input text file?
The following is not helpful for me.

Hi Sen

Did you have any luck with this? I have the same question.

The following article:

https://www.researchgate.net/publication/351357469_Multi-domain_Clinical_Natural_Language_Processing_with_MedCAT_the_Medical_Concept_Annotation_Toolkit

says that “We have compiled our own VCB by scraping Wikipedia and enriching it with words from UMLS. Only the Wikipedia VCB is made public, but the full VCB can be built with scripts provided in the MedCAT repository (GitHub - CogStack/MedCAT: Medical Concept Annotation Tool).”

However, I can’t find the scripts.

Thanks @patrickj. Now I know I am not alone :slight_smile: @anthony.shek, any guidance would help us a lot.

Hey @Sen @patrickj

Unfortunately, we do not share the exact way we do it. However, you can use the code below to create your own. Just replace the comments with your own code. Good luck! :smiley:

from medcat.vocab import Vocab

vocab = Vocab()

# The vocab_data.txt file needs to be in tab-separated format, one token per line:
# <token>\t<word_count>\t<vector_embedding_separated_by_spaces>
# Vector embeddings can be created with Word2Vec. You can also use embeddings
# calculated with transformer libraries and models such as BERT.
# Embeddings of 300 dimensions are standard.

# Load the words, counts and vectors from the file, replacing existing entries
vocab.add_words('vocab_data.txt', replace=True)
# Build the unigram table from the word counts
vocab.make_unigram_table()
# Save the finished vocab to disk
vocab.save("vocab.dat")
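For anyone unsure how to get from a raw text file to the tab-separated file described above, here is a minimal sketch of one way to write it. It is not the method used for the published vocab. The function names `count_tokens` and `write_vocab_file` are hypothetical, the tokenizer is deliberately simple, and the `vectors` dictionary is assumed to come from an embedding model you have trained or downloaded separately (Word2Vec, for example).

```python
import re
from collections import Counter


def count_tokens(text):
    """Lowercase the text and count word tokens.

    This regex tokenizer is an assumption made for illustration only;
    swap in whatever tokenizer matches your embedding model.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def write_vocab_file(counts, vectors, path):
    """Write one tab-separated line per token: token, count, vector.

    `counts` is a Counter of token frequencies.
    `vectors` maps token -> sequence of floats and is assumed to come
    from a separately trained embedding model.
    Tokens that have no vector are skipped.
    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        # Most frequent tokens first; the order is only for readability
        for token, count in counts.most_common():
            vector = vectors.get(token)
            if vector is None:
                continue
            vector_text = " ".join(str(float(x)) for x in vector)
            fh.write(f"{token}\t{count}\t{vector_text}\n")
            written += 1
    return written
```

Calling `write_vocab_file(count_tokens(open("input.txt").read()), vectors, "vocab_data.txt")` would then produce a file in the shape the snippet above expects, where `input.txt` is a placeholder name for your own corpus.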

Thanks for the reply, Anthony.

Thanks for the help, @anthony.shek. I was able to build the vocab.
@patrickj, were you able to get it working? Let me know if you want a hand.