Using non-word2vec vectors (floret or WordPiece)

jkgenser · June 6, 2023, 8:44pm

I’m looking at using floret vectors from spaCy to build my vocabulary. One attraction is that floret dynamically builds vectors for out-of-vocabulary (OOV) words from the subwords it already knows.

Is this tractable with MedCAT? I’ve looked at the MedCAT code to some degree, specifically vector_context_model.py and context_based_linker. From what I can tell, get_context_tokens uses spaCy to get each token in the context. However, it subsequently calls vocab.vec(word), which does a straight dictionary lookup of the form vocab[word].

I think I would need to subclass the Vocab class so that, instead of a dictionary lookup, it does something like nlp.vocab["my_word"].vector. With floret vectors, spaCy is not doing a simple dictionary lookup; it builds the vector from hashed subword embeddings.
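To make the subclassing idea concrete, here is a minimal sketch under stated assumptions. DictVocab is a hypothetical stand-in that only mimics the behaviour described above (vec(word) as a plain dictionary lookup); it is not MedCAT's actual Vocab class. FallbackVocab and toy_subword_vector are also illustrative names, and the toy function is a simplified hashed n-gram scheme, not floret's real algorithm.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class DictVocab:
    """Hypothetical stand-in for a dictionary-backed vocab (not MedCAT's class)."""

    def __init__(self) -> None:
        self.vocab: Dict[str, Dict[str, Vector]] = {}

    def add_word(self, word: str, vec: Vector) -> None:
        self.vocab[word] = {"vec": vec}

    def __contains__(self, word: str) -> bool:
        return word in self.vocab

    def vec(self, word: str) -> Vector:
        # Plain dictionary lookup: raises KeyError for OOV words.
        return self.vocab[word]["vec"]


class FallbackVocab(DictVocab):
    """Uses stored vectors when present, otherwise asks vector_fn for one."""

    def __init__(self, vector_fn: Callable[[str], Vector]) -> None:
        super().__init__()
        self._vector_fn = vector_fn

    def __contains__(self, word: str) -> bool:
        # Assumption: callers test membership before calling vec(), so
        # report every word as known because a vector can always be built.
        return True

    def vec(self, word: str) -> Vector:
        entry = self.vocab.get(word)
        if entry is not None:
            return entry["vec"]
        return self._vector_fn(word)


def toy_subword_vector(word: str, dim: int = 4, buckets: int = 50, n: int = 3) -> Vector:
    """Toy hashed character n-gram vector (illustration only, not floret)."""
    padded = f"<{word}>"
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)] or [padded]
    total = [0.0] * dim
    for gram in ngrams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % buckets
        for d in range(dim):
            # Deterministic pseudo-embedding for this bucket, in [-0.5, 0.5].
            total[d] += ((bucket * 31 + d * 17) % 11) / 10.0 - 0.5
    return [t / len(ngrams) for t in total]


vocab = FallbackVocab(toy_subword_vector)
vocab.add_word("fever", [1.0, 0.0, 0.0, 0.0])
known = vocab.vec("fever")      # stored vector
unseen = vocab.vec("feverish")  # built on the fly from subwords
```

With a spaCy pipeline that has floret vectors loaded, vector_fn could instead be something like lambda w: nlp.vocab[w].vector. Whether MedCAT's real Vocab can be subclassed this cleanly, and whether it expects NumPy arrays rather than lists, would need to be checked against the MedCAT source.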

Additionally, I’m looking at subword tokenization like BPE or WordPiece, and I noticed that one of the blog posts suggests using BERT (which uses WordPiece). However, wouldn’t we run into the same issue, since the get_context_tokens routine uses the spaCy tokenizer, whose tokens would not align with the WordPiece tokens?
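One possible way around the alignment problem, sketched here under assumptions, is to keep the word-level tokens and derive one vector per word by splitting it into pieces and averaging the piece vectors. The sketch uses greedy longest-match-first splitting in the WordPiece style with a tiny made-up piece table. The function names and vectors are hypothetical, and this pools static piece vectors only; it does not reproduce BERT's contextual embeddings.

```python
from typing import Dict, List, Optional

Vector = List[float]


def wordpiece(word: str, pieces: Dict[str, Vector], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    out: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in pieces:
                match = cand
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word is unknown.
            return [unk]
        out.append(match)
        start = end
    return out


def pooled_word_vector(word: str, pieces: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of a word's pieces; None if the word cannot be split."""
    split = wordpiece(word, pieces)
    if not split or any(p not in pieces for p in split):
        return None
    dim = len(pieces[split[0]])
    return [sum(pieces[p][d] for p in split) / len(split) for d in range(dim)]


# Made-up piece table for illustration only.
piece_vectors: Dict[str, Vector] = {
    "cardio": [1.0, 0.0],
    "##myo": [0.0, 1.0],
    "##pathy": [1.0, 1.0],
}

split = wordpiece("cardiomyopathy", piece_vectors)
pooled = pooled_word_vector("cardiomyopathy", piece_vectors)
missing = pooled_word_vector("xyz", piece_vectors)
```

A function like pooled_word_vector could serve as the per-word lookup, so the context code would still see one vector per spaCy token even though the underlying vocabulary is subword-based.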

Let me know if there’s something obvious I’m missing as a way to create an alternate vocab.
