Hello, I am facing a problem uploading the current full UMLS model to MedCATtrainer. I have been trying to upload the model pack through the admin Django app, but to no avail, and have encountered the errors below.
These are the 2 errors I have gotten:
I would be grateful for any help!
Hi,
This is most likely a resource issue.
If you’re trying to use the full UMLS model, then that’s a massive model. When loaded, it can take in excess of 21GB of memory (from what I recall), and that’s not including other overhead.
So if you’ve got less memory than that available, that’ll probably be the issue.
However, I’m open to other things being the culprit.
As such, you can do the following:
Open the model manually outside the container:
- Use Python 3.11 (that’s what the trainer uses)
- Install medcat~=1.12 (if you want to use trainer v2.17) or medcat~=1.14.0 (if you’re using the latest trainer v2.18)
- Load the model manually, i.e.:

```python
from medcat.cat import CAT

cat = CAT.load_model_pack("<model_pack_path>")
```
Let me know if the above causes similar (or other) issues.
But as a note, I did just recently (on Monday) verify that the model does load in the latest medcat version.
Thank you very much for your reply! I can see why it’s a memory issue, as my laptop only has 8GB of RAM.
May I check how I can use the model manually outside the container, but for the MedCATTrainer as well? I have tried using the UMLS model previously in MedCAT and it works just fine, but as I’m using the MedCATTrainer, is there a way for me to still use the same model there?
Thank you
I am not entirely sure what you’re saying.
> May I check how I can use the model manually outside the container, but for the MedCATTrainer as well?
I don’t really understand the question. You can just use the model file separately in multiple places, just like one would a Word document.
But I’m guessing I’ve somehow missed what you’re asking.
> I have tried using the UMLS model previously in MedCAT and it works just fine, but as I’m using the MedCATTrainer, is there a way for me to still use the same model there?
Which UMLS model are you using? Are you using the “full” UMLS model downloaded through the link in the README for MedCAT (this model is around 1.6GB when zipped); or are you using the MedMentions model used in the tutorials (this model is around 536MB when zipped)?
The former is the large one that requires 20+GB of RAM. The latter is a small version that should load fine in most situations.
Apologies for the misunderstanding. I don’t have very much experience in coding, hence might not be able to properly explain the errors/issues I am facing.
I am intending to use the full UMLS model in MedCATTrainer so that I can actively train the model to specifically look out for epilepsy-related terms, rather than the full database of terms.
Previously, I tried using the same UMLS model (the full one, 1.6GB when zipped and >5GB when unzipped) in MedCAT through a Jupyter notebook, and I was able to assess the model, but it took over an hour to load.
I understand that at this moment the problem is likely a resource issue, considering that my MacBook only has 8GB of RAM. Is there another way to circumvent this issue, besides using a higher-powered computer?
There are a few things you can do to slim down the model. But you still need to do so in an environment in which you can actually load the model. So if you were able to do so in a notebook somewhere, then that could work.
Though do note that it likely took an hour because the system ran out of RAM and used the SSD as additional memory (which is a lot slower). Loading the model would not normally take that long. On my M1 based MacBook Pro with 32GB of RAM the full UMLS model loads in 142s (2m22s).
And if this was on your own machine, then the difference between doing it in the notebook and doing it in MedCATtrainer is the fact that the former is running directly on your system and the latter is running in a docker container. A docker container won’t allow using all available system resources by design. It’s limited to a certain amount of resources and will not use more than that.
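If your machine does have spare RAM and you’re running the trainer via Docker Compose, one generic way to address the container limit is to set a memory limit on the service. The fragment below is purely illustrative — the service name and the limit are placeholders, not values from the actual MedCATtrainer compose file:

```yaml
# Illustrative docker-compose fragment; "medcattrainer" and "24g"
# are placeholders — check the project's own compose file for the
# real service names.
services:
  medcattrainer:
    mem_limit: 24g
```

On Docker Desktop (macOS/Windows) there is also a global memory cap in the Docker settings that applies on top of any per-container limit.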
Now, if you’re only interested in epilepsy terms, you can filter the CDB down to the concepts you’re interested in using CDB.filter_by_cui (i.e. cat.cdb.filter_by_cui(list_of_cuis)) and subsequently save the model again (CAT.create_model_pack). If you’re only keeping a subset of the concepts in the CDB, it will become a lot smaller.
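As a rough sketch of that workflow (the paths and the CUI list below are placeholders, and the first step still needs to run somewhere with enough RAM to load the full model):

```python
from medcat.cat import CAT

# Load the full model pack — this is the memory-heavy step
cat = CAT.load_model_pack("<model_pack_path>")

# Keep only the concepts you care about
# (placeholder list — substitute the actual epilepsy-related UMLS CUIs)
cuis_to_keep = ["C0014544"]
cat.cdb.filter_by_cui(cuis_to_keep)

# Save the slimmed-down model as a new, much smaller model pack
cat.create_model_pack("<output_dir>")
```

The resulting model pack can then be uploaded to the trainer as usual.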
There are a few other things that could make a small difference. For instance, in the latest beta release (v1.15.0b), there’s a method to lower the dimensionality of the vectors within the vocab (and the concept vectors). See vocab_utils.convert_vocab_vector_size and the PR for more details (though while in my limited testing this didn’t affect RAM usage, it could).