Hello everyone, I have been playing around with MedCAT for a while and everything so far runs pretty well. My problem is that when the dataset used for unsupervised training is huge, the training can take a very long time, too long for my liking. Does MedCAT already have a way to use a GPU as the training device to accelerate the unsupervised training? Or maybe a custom method to accelerate it? As far as I know there is the multibatch method, but that one is for supervised training.
Hi.
Unfortunately there is currently no built-in way to speed up the training procedure. This goes for both self-supervised and supervised training (though it generally doesn’t matter for supervised training).
The multiprocessing methods we do have are for inference - for using the model.
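For reference, a minimal sketch of that inference-side multiprocessing (assuming the MedCAT 1.x `CAT.multiprocessing` call that takes `(id, text)` pairs; the model pack path and the example documents are illustrative):

```python
from medcat.cat import CAT

# Illustrative path: load an existing model pack for inference.
cat = CAT.load_model_pack("model_pack.zip")

docs = [
    (1, "Patient with type 2 diabetes."),
    (2, "History of myocardial infarction."),
]

# Annotate documents in parallel; returns a dict keyed by document id.
results = cat.multiprocessing(docs, nproc=2)
```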
With that said, since medcat 1.10 we’ve had the option to merge CDBs.
So you should be able to split your training set and run the training on these parts in parallel and then merge the CDBs after you’re done.
https://medcat.readthedocs.io/en/latest/autoapi/medcat/utils/cdb_utils/index.html#medcat.utils.cdb_utils.merge_cdb
Bear in mind, the CDB merging method has had only limited testing and hasn’t been fully validated in all use cases.
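To make that route concrete, here is a rough sketch, not a tested recipe: the file paths, shard layout, and helper function are illustrative, and the `merge_cdb` signature is the one documented at the link above.

```python
from medcat.cat import CAT
from medcat.cdb import CDB
from medcat.vocab import Vocab
from medcat.utils.cdb_utils import merge_cdb

vocab = Vocab.load("vocab.dat")  # illustrative path

def train_shard(base_cdb_path, texts, out_path):
    # Each shard starts from a copy of the same base CDB and is trained
    # separately (e.g. as its own job or process).
    cdb = CDB.load(base_cdb_path)
    cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
    cat.train(texts)          # self-supervised training on this shard only
    cat.cdb.save(out_path)

# After all shard runs finish, merge the resulting CDBs.
merged = merge_cdb(CDB.load("cdb_shard_0.dat"), CDB.load("cdb_shard_1.dat"))
merged.save("cdb_merged.dat")
```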
But can I personally accelerate the process? Like using JIT compiler from numba or something akin to that?
Maybe. I don’t know. We haven’t really tried.
But if you can split your dataset, the route I described above should work quite well, though with some extra steps.
The thing is, in most of our use cases, training is a one time (or at the very least infrequently repeated) step. So if it takes quite a lot of time, it’s not that big of an issue.
The other thing to consider is that if you’re concerned about the time it takes to train your model, you may well be entering the domain of diminishing returns.
I would recommend benchmarking your model at various stages of training on some of your relevant data to make sure that subsequent training is actually (meaningfully) beneficial for the model. Self-supervised training can only get you so far, after all. So if you don’t see improvements from further training, you might need to do supervised training to improve model performance.
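For that benchmarking, something as simple as tracking recall against a small hand-labelled sample between training rounds can be enough. A rough sketch (the sample format and helper are illustrative; only `cat.get_entities` is the actual MedCAT API):

```python
from medcat.cat import CAT

def quick_recall(cat: CAT, samples):
    """samples: list of (text, expected_cuis) pairs, where expected_cuis is a set of CUIs."""
    hits, total = 0, 0
    for text, expected_cuis in samples:
        found = {ent["cui"] for ent in cat.get_entities(text)["entities"].values()}
        hits += len(expected_cuis & found)
        total += len(expected_cuis)
    return hits / total if total else 0.0

# Run this after each training stage; stop (or switch to supervised
# training) once the score plateaus.
```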
I see. So far my technique is to use checkpoints so I can just “pause” the training and continue another time. Thank you for the advice.