Hello everyone, I have been playing around with MedCAT for a while and everything so far runs pretty well. My problem is that when the dataset used for unsupervised training is huge, the training can take a very long time, too long for my liking. Does MedCAT already have a way to use a GPU as the training device to accelerate the unsupervised training? Or maybe a custom method to accelerate it? As far as I know there is the multibatch method, but that one is for supervised training.
Hi.
Unfortunately there is currently no built-in way to speed up the training procedure. This goes for both self-supervised and supervised training (though it generally doesn’t matter for supervised training).
The multiprocessing methods we do have are for inference - for using the model.
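For reference, a minimal sketch of that inference-side multiprocessing (assuming the MedCAT 1.x `CAT.multiprocessing` call that takes `(id, text)` pairs; the model pack path and the example documents are illustrative):

```python
from medcat.cat import CAT

# Illustrative path: load an existing model pack for inference.
cat = CAT.load_model_pack("model_pack.zip")

docs = [
    (1, "Patient with type 2 diabetes."),
    (2, "History of myocardial infarction."),
]

# Annotate documents in parallel; returns a dict keyed by document id.
results = cat.multiprocessing(docs, nproc=2)
```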
With that said, since medcat 1.10 we’ve had the option to merge CDBs.
So you should be able to split your training set and run the training on these parts in parallel and then merge the CDBs after you’re done.
https://medcat.readthedocs.io/en/latest/autoapi/medcat/utils/cdb_utils/index.html#medcat.utils.cdb_utils.merge_cdb
Bear in mind, the CDB merging method has had only limited testing and hasn’t been fully validated in all use cases.
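To make that route concrete, here is a rough sketch, not a tested recipe: the file paths, shard layout, and helper function are illustrative, and the `merge_cdb` signature is the one documented at the link above.

```python
from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab
from medcat.utils.cdb_utils import merge_cdb

vocab = Vocab.load("vocab.dat")  # illustrative path

def train_shard(base_cdb_path, texts, out_path):
    # Each shard starts from a copy of the same base CDB and is trained
    # separately (e.g. as its own job or process).
    cdb = CDB.load(base_cdb_path)
    cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
    cat.train(texts)          # self-supervised training on this shard only
    cat.cdb.save(out_path)

# After all shard runs finish, merge the resulting CDBs.
merged = merge_cdb(CDB.load("cdb_shard_0.dat"), CDB.load("cdb_shard_1.dat"))
merged.save("cdb_merged.dat")
```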
But can I personally accelerate the process? Like using JIT compiler from numba or something akin to that?
Maybe. I don’t know. We haven’t really tried.
But if you can split your dataset, the route I described above should work quite well, though with some extra steps.
The thing is, in most of our use cases, training is a one time (or at the very least infrequently repeated) step. So if it takes quite a lot of time, it’s not that big of an issue.
The other thing to consider is that if you’re concerned about the time it takes to train your model, you may well be entering the domain of diminishing returns.
I would recommend benchmarking your model at various stages of training on some of your relevant data to make sure that subsequent training is actually (meaningfully) beneficial for the model. Self-supervised training can only get you so far, after all. So if you don’t see improvements from further training, you might need to do supervised training to improve model performance.
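For that benchmarking, something as simple as tracking recall against a small hand-labelled sample between training rounds can be enough. A rough sketch (the sample format and helper are illustrative; only `cat.get_entities` is the actual MedCAT API):

```python
from medcat.cat import CAT

def quick_recall(cat: CAT, samples):
    """samples: list of (text, expected_cuis) pairs, where expected_cuis is a set of CUIs."""
    hits, total = 0, 0
    for text, expected_cuis in samples:
        found = {ent["cui"] for ent in cat.get_entities(text)["entities"].values()}
        hits += len(expected_cuis & found)
        total += len(expected_cuis)
    return hits / total if total else 0.0

# Run this after each training stage; stop (or switch to supervised
# training) once the score plateaus.
```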
I see. So far my technique is to use checkpoints so I can just “pause” the training and continue another time. Thank you for the advice.