I’m looking at using floret vectors from spaCy to build my vocabulary. One attraction is that floret dynamically builds vectors for out-of-vocabulary (OOV) words from existing subwords.
Is this tractable with MedCAT? I’ve looked at the MedCAT code to some degree, in particular at context_based_linker. From what I can tell, get_context_tokens uses spaCy to get each token in the context. However, it subsequently calls vocab.vec(word), which does a straight dictionary lookup.
I think I would need to subclass the Vocab class so that, instead of a dictionary lookup, it does something like nlp.vocab["my_word"].vector. With a floret vector, spaCy is not doing a simple dictionary lookup; it looks the vector up through the hash-embedding strategy that spaCy uses.
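To make the idea concrete, here is a rough sketch of what I have in mind. Everything in it is a stand-in: DictVocab only mimics a dictionary-backed vec(word) lookup and is not MedCAT's actual Vocab class, SubwordVocab is a hypothetical subclass, and vector_fn is a hypothetical hook where something like `lambda w: nlp.vocab[w].vector` would be plugged in. The toy_subword_vector function just fakes subword-composed vectors so the example runs without spaCy installed.

```python
from typing import Callable, Dict, List

Vector = List[float]


class DictVocab:
    """Stand-in for a dictionary-backed vocab (NOT MedCAT's real class)."""

    def __init__(self, vectors: Dict[str, Vector]) -> None:
        self.vocab = vectors

    def __contains__(self, word: str) -> bool:
        return word in self.vocab

    def vec(self, word: str) -> Vector:
        # Straight dictionary lookup: raises KeyError for OOV words.
        return self.vocab[word]


class SubwordVocab(DictVocab):
    """Hypothetical subclass that delegates every lookup to a function."""

    def __init__(self, vector_fn: Callable[[str], Vector]) -> None:
        super().__init__({})
        self.vector_fn = vector_fn

    def __contains__(self, word: str) -> bool:
        # Any word can be given a vector, so nothing counts as OOV.
        return True

    def vec(self, word: str) -> Vector:
        return self.vector_fn(word)


def toy_subword_vector(word: str, dim: int = 4, n: int = 3) -> Vector:
    """Toy replacement for a floret lookup: average per-n-gram vectors."""
    padded = f"<{word}>"
    ngrams = [padded[i:i + n] for i in range(len(padded) - n + 1)] or [padded]
    vec = [0.0] * dim
    for gram in ngrams:
        code = sum(ord(c) for c in gram)
        for i in range(dim):
            # Deterministic pseudo-embedding for this n-gram.
            vec[i] += ((code * (i + 1)) % 7) / 7.0
    return [v / len(ngrams) for v in vec]


plain = DictVocab({"heart": [1.0, 0.0, 0.0, 0.0]})
subword = SubwordVocab(toy_subword_vector)

print("cardiomyopathy" in plain)      # dictionary vocab has no entry for it
print(subword.vec("cardiomyopathy"))  # subword vocab still returns a vector
```

The open question for me is whether the rest of MedCAT only ever goes through vec(word) and a membership check, or whether other code reads the underlying dictionary directly, in which case overriding these two methods would not be enough.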
Additionally, I’m looking at subword tokenization like BPE or WordPiece, and I noticed that one of the blog posts suggests using BERT (i.e. WordPiece). However, wouldn’t we run into the same issue, since the get_context_tokens routine uses the spaCy tokenizer, whose tokens would not align with the WordPiece tokens?
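One workaround I can imagine is to keep the spaCy tokens as the unit, split each token into word pieces afterwards, and pool the piece vectors back into one vector per token. The sketch below is only a toy illustration of that alignment, not how MedCAT or BERT does it: the piece vocabulary, the vectors, and the wordpiece and word_vector helpers are all made up for the example, and mean pooling is just one assumed choice.

```python
from typing import Dict, List, Set

# Made-up piece vocabulary and vectors, purely for illustration.
TOY_PIECES: Set[str] = {"heart", "cardio", "##myo", "##pathy"}
TOY_PIECE_VECS: Dict[str, List[float]] = {
    "heart": [1.0, 0.0],
    "cardio": [1.0, 1.0],
    "##myo": [0.0, 1.0],
    "##pathy": [2.0, 1.0],
    "[UNK]": [0.0, 0.0],
}


def wordpiece(word: str, pieces: Set[str]) -> List[str]:
    """Greedy longest-match-first split, in the style of WordPiece."""
    out: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in pieces:
                out.append(cand)
                break
            end -= 1
        else:
            # No piece matched at this position: the whole word is unknown.
            return ["[UNK]"]
        start = end
    return out


def word_vector(word: str) -> List[float]:
    """Mean-pool the piece vectors so each word-level token gets one vector."""
    parts = wordpiece(word, TOY_PIECES)
    dim = len(TOY_PIECE_VECS["[UNK]"])
    return [sum(TOY_PIECE_VECS[p][i] for p in parts) / len(parts) for i in range(dim)]


# Pretend these came from the word-level (spaCy) tokenizer.
tokens = ["heart", "cardiomyopathy"]
for tok in tokens:
    print(tok, wordpiece(tok, TOY_PIECES), word_vector(tok))
```

This keeps one vector per word-level token, which is what the context code seems to expect, but it throws away the contextual part of BERT, so I’m not sure it is what the blog post intended.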
Let me know if there’s something obvious I’m missing as a way to create an alternate vocab.