We’re currently in the process of setting up a project in MedCATtrainer. We’ve uploaded the vocab.dat and cdb.dat from the UMLS small model. However, when attempting to open the project for annotation, we encounter the following error:
Our understanding is that the project should use the ‘en_core_web_md’ model, as specified in the config file (cat.general.spacy_model = 'en_core_web_md'). The ‘en_core_web_md’ model is downloaded during the container build.
We would appreciate any help in resolving this issue. Thank you.
The earlier medcat model packs shipped with a spacy model named simply spacy_model within the model pack. The same is specified within the config.
When the medcat library loads the model pack, it modifies the value of config.general.spacy_model to point to the unpacked model pack folder (currently line 375 of the medcat.CAT module).
However, due to the way MedCATtrainer is built, it only loads the CDB, not the entire model pack. Thus, this change to the name/path of the spacy model is never made (and it wouldn’t help anyway, since the spacy model wouldn’t be available at that location).
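To illustrate the difference between the two loading paths, here is a minimal sketch. It does not use medcat itself: the config is a stand-in object, the folder /models/umls_small is a hypothetical path, and the path rewrite in load_full_model_pack is my assumption about what the model pack loading roughly does, not a copy of the library code.

```python
import os
from types import SimpleNamespace


def make_config(spacy_model):
    # Stand-in for the medcat config; only the one field we care about.
    return SimpleNamespace(general=SimpleNamespace(spacy_model=spacy_model))


def load_full_model_pack(config, model_pack_path):
    # Assumed behaviour: when the whole model pack is loaded, the spacy model
    # name is rewritten to a path inside the unpacked model pack folder.
    name = os.path.basename(config.general.spacy_model)
    config.general.spacy_model = os.path.join(model_pack_path, name)
    return config


def load_cdb_only(config):
    # Loading just the CDB leaves the config untouched, so the bare name
    # "spacy_model" is later handed to spacy, which cannot resolve it.
    return config


pack_cfg = load_full_model_pack(make_config("spacy_model"), "/models/umls_small")
cdb_cfg = load_cdb_only(make_config("spacy_model"))

print(pack_cfg.general.spacy_model)  # a path inside the model pack folder
print(cdb_cfg.general.spacy_model)   # still the bare name "spacy_model"
```

The second case is the one MedCATtrainer ends up in, which is why the name stored in the CDB config has to be a model that spacy can find on its own.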
The fix script can be run in one of two ways:
python update_spacy_model_in_old_cdb.py <cdb_path> to apply the fix and overwrite the existing file
python update_spacy_model_in_old_cdb.py <cdb_path> <new_cdb_path> to write the fixed CDB to a new file
The script renames the spacy model in the CDB config from spacy_model to en_core_web_md (the small UMLS model used the 3.1.0 version of this, but newer ones should also work) and then saves the CDB back to disk. This allows MCT (MedCATtrainer) to successfully load and use the CDB.
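For anyone who prefers to make the change by hand, the core of it can be sketched roughly as follows. This is not the script itself, just an approximation under a few assumptions: the helper names fix_spacy_model_name and rewrite_cdb are made up for illustration, and I am assuming the medcat v1 style CDB.load / cdb.save calls and that the name is stored at cdb.config.general.spacy_model.

```python
OLD_NAME = "spacy_model"      # name stored by the earlier model packs
NEW_NAME = "en_core_web_md"   # model that MedCATtrainer has installed


def fix_spacy_model_name(config, old_name=OLD_NAME, new_name=NEW_NAME):
    """Rename the spacy model in a config; return True if it was changed."""
    if config.general.spacy_model == old_name:
        config.general.spacy_model = new_name
        return True
    return False


def rewrite_cdb(cdb_path, new_cdb_path=None):
    """Load a CDB, fix the spacy model name, and save it.

    If new_cdb_path is None, the original file is overwritten.
    """
    # Imported here so the helper above works without medcat installed.
    from medcat.cdb import CDB  # assumes the medcat v1 API

    cdb = CDB.load(cdb_path)
    if fix_spacy_model_name(cdb.config):
        cdb.save(new_cdb_path or cdb_path)
    return cdb
```

Calling rewrite_cdb("cdb.dat") would overwrite the file in place, while rewrite_cdb("cdb.dat", "cdb_fixed.dat") would leave the original untouched. A CDB whose config already names a different model is left as it is.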