I’m looking at using floret vectors from spaCy to build my vocabulary. One of the things floret does is dynamically create OOV vectors from existing subwords (character n-gram hashes).
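To show what I mean by dynamic OOV vectors (a minimal sketch; `my_floret_model` is a placeholder for any pipeline packaged with floret-mode vectors):

```python
import spacy

# Placeholder name: any pipeline whose vectors were imported with
# floret ("spacy init vectors ... --mode floret").
nlp = spacy.load("my_floret_model")

# With floret there is no fixed word list: the vector for any string,
# seen or unseen, is summed from the hashes of its character n-grams.
vec = nlp.vocab["myocardiopathyy"].vector  # deliberately misspelled OOV word
print(vec.shape, vec.any())  # a real, non-zero vector despite being OOV
```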
Is this tractable with MedCAT? I’ve looked at the MedCAT code to some degree, specifically `vector_context_model.py` and `context_based_linker`. From what I can tell, `get_context_tokens` uses spaCy to get each token in the context, but it subsequently calls `vocab.vec(word)`, which does a straight dictionary lookup of the form `vocab[word]`.
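To illustrate, my reading of that lookup is essentially the following (a paraphrased sketch, not the actual MedCAT source; `DictVocab` is just a stand-in name):

```python
# Paraphrased sketch of my reading of MedCAT's Vocab.vec(), not the
# actual source: vectors live in a plain dict keyed by surface form.
class DictVocab:
    def __init__(self):
        self.vocab = {}  # word -> precomputed vector

    def vec(self, word):
        # Straight dictionary lookup: any word that wasn't added up
        # front has no vector at all (KeyError for OOV words).
        return self.vocab[word]
```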
I think I would need to subclass the `Vocab` class so that instead of a dictionary lookup we did something like `nlp.vocab["my_word"].vector`, since with floret vectors spaCy is not doing a simple dictionary lookup but composing the vector from subword hashes (the Bloom-embedding strategy spaCy uses).
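Roughly what I have in mind (a minimal sketch; it assumes `Vocab` is importable from `medcat.vocab` and can be subclassed this way, and `FloretVocab` and its constructor signature are my own invention):

```python
from medcat.vocab import Vocab  # assumed import path

class FloretVocab(Vocab):
    """Hypothetical subclass: defer vector lookup to spaCy/floret."""

    def __init__(self, nlp):
        super().__init__()
        self.nlp = nlp  # a spaCy pipeline loaded with floret vectors

    def vec(self, word):
        # Instead of self.vocab[word], ask spaCy for the lexeme vector;
        # with floret this is composed from subword hashes, so OOV
        # words get a usable vector instead of a KeyError.
        return self.nlp.vocab[word].vector
```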
Additionally, if I’m looking at subword tokenization like BPE or WordPiece: I noticed that one of the blog posts suggests using BERT (i.e., WordPiece). However, wouldn’t we run into the same issue, where the `get_context_tokens` routine uses the spaCy tokenizer, whose output would not align with the WordPiece tokens?
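For example (a quick sketch; the exact WordPiece split depends on the vocab file, so the second comment is indicative only):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.blank("en")  # plain spaCy tokenizer
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "acute myocardial infarction"
print([t.text for t in nlp(text)])
# ['acute', 'myocardial', 'infarction']
print(wordpiece.tokenize(text))
# rare words come back as several '##'-prefixed pieces, so the
# boundaries no longer line up one-to-one with spaCy's tokens
```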
Let me know if I’m missing an obvious way to create an alternate vocab.