I thought I’d share some observations on configuring NiFi workflows involving the MedCAT service. I’m attempting to optimise performance for the scenario of ingesting a large database.
The issue that prompted the investigation was the failure of a small proportion of flowfiles sent to the batch interface. Failure was indicated by:
2023-03-14 01:48:46,334 ERROR [Timer-Driven Process Thread-36] o.a.nifi.processors.standard.InvokeHTTP InvokeHTTP[id=cb72d2e0-d5c0-36c1-19b6-13a542a56e60] Request Processing failed: StandardFlowFileRecord[uuid=eff900e7-81ce-4312-abe2-218cb78d3ca1,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1678758435916-5, container=default, section=5], offset=11133732, length=125544],offset=0,name=eff900e7-81ce-4312-abe2-218cb78d3ca1,size=125544]
org.apache.nifi.processor.exception.ProcessException: IOException thrown from InvokeHTTP[id=cb72d2e0-d5c0-36c1-19b6-13a542a56e60]: java.net.SocketException: Broken pipe (Write failed)
Sending the problem flowfiles to MedCATservice via curl worked as expected.
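For anyone who wants to script that replay check rather than hand-craft curl calls, here is a minimal sketch. The endpoint path `/api/process_bulk` and the `{"content": [{"text": ...}]}` envelope are assumptions about the service version, not something confirmed in this thread; adjust to whatever your curl command used.

```python
# Replay saved flowfile documents against the MedCAT service, mirroring the
# curl check. ASSUMPTIONS (verify against your service): bulk endpoint is
# /api/process_bulk and it expects {"content": [{"text": ...}, ...]}.
import json
import urllib.request

def build_bulk_payload(texts):
    """Wrap raw document texts in the assumed bulk-request envelope."""
    return {"content": [{"text": t} for t in texts]}

def replay(url, texts, timeout=120):
    """POST a batch of documents and return (status, raw response body)."""
    body = json.dumps(build_bulk_payload(texts)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()
```

Calling `replay("http://localhost:5000/api/process_bulk", texts)` against a running service lets you reproduce exactly what InvokeHTTP sends, minus NiFi.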
I think I’ve tracked the problem down to the combination of MedCAT service thread/worker configuration and the NiFi thread count for the InvokeHTTP processor. The errors seem to occur when the number of threads required by MedCAT exceeds the Flask thread setting.
The combination that I find works without failures is:
NiFi processor: 16 concurrent tasks
APP_TORCH_THREADS=1 # no GPU on the server
Obviously the 16 is server dependent. The key seems to be forcing single-threaded operation per batch in this scenario.
The broken pipe can also happen due to malformed response messages (check the service for errors, and check that the messages you are sending do not contain empty strings) and/or too many requests at one time. I’ve noticed this behaviour with InvokeHTTP in general, irrespective of the service.
So far we have annotated large datasets with only 1-2 workers but large batches, keeping MedCAT on a high NPROC, and didn’t encounter major issues apart from the earlier torch problem (highlighted in your pull request) with the latest version. There may be some further issues related to how the thread hierarchy is handled. Will investigate.
I didn’t find any service errors. I resent several of the failed flowfiles via curl without any issues.
I suspect that 1-2 workers plus a high nproc is a reasonable alternative, and I would expect it to be reliable as long as 2*nproc is greater than or equal to the thread count set in the Flask config.
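That rule of thumb is easy to sanity-check before kicking off a run. A tiny sketch of it, generalising "2" to the worker count (the parameter names just mirror the env settings discussed in this thread, they are not an official API):

```python
# Heuristic from this thread: with a given worker count and nproc, requests
# should stay reliable as long as workers * nproc >= the Flask thread setting
# (the thread's example was 2 workers, hence "2 * nproc"). Pure arithmetic;
# the generalisation to arbitrary worker counts is my assumption.
def thread_budget_ok(workers, nproc, flask_threads):
    """True if the worker/nproc combination covers the configured Flask threads."""
    return workers * nproc >= flask_threads
```

For example, 2 workers at nproc 8 would cover a Flask thread setting of 16, while 2 workers at nproc 4 would not.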
I’ve been using batches of 50 docs and more workers because it makes it easy to be certain the thread count isn’t above the Flask setting. I don’t have any measurements suggesting it is a more efficient approach, just that it was an easy way to get an idea of why the errors were happening.
Can you share what kind of throughput we are talking about? I have a throughput of no more than 2 or 3 MB per 5 minutes. I think that is slow for 24 CPUs at almost 100% in a VMware Player VM with 50 GB RAM, Ubuntu 22.04 and Docker. The host machine has an Intel Xeon W-2175 CPU at 2.50 GHz with 14 cores (28 logical CPUs) and 64 GB RAM.
I was playing around with the same settings you mentioned, but no combination of APP_NPROC, WORKERS, THREADS and NiFi concurrent tasks gives me more than 3 MB per 5 minutes. For testing I rebuilt the MedCAT Docker container with various settings in applications/medcat/config/env_app. In NiFi, I stopped the surrounding processors so that MedCAT was the only processor running while emptying the queue.
My text files are medical intensive care records, on average 1.5 KB (but varying considerably). MedCAT outputs approximately 10x the amount of data that it reads, which is of course totally dependent on the contents of the individual records. The rows per flowfile I tried range from 50 to 1000.
I’m debugging some other problems at the moment, but will get some figures to you asap.
I’ll get back and confirm in more detail next week, but I recall the following problems:
Torch grabs heaps of threads in an inefficient way if you have a host without a GPU (which I did). There’s a modification in recent versions allowing TORCH_THREADS to be set to 1. I certainly found that performance degraded a lot if torch was left unrestricted.
I found that setting threads to a high number led to high CPU loads for only a very small proportion of the time.
On my server, 24 workers in the MedCAT service and 24 threads on the NiFi processor gave a high sustained CPU load, but I can’t remember the data throughput. I was focusing more on keeping the CPU load at a high constant level.
Thanks, I had already tried the options you mentioned after reading your previous posts. TORCH_THREADS=1 had no effect. Getting the 24 CPUs to 100% was not difficult; it is just that the performance did not increase beyond 3 MB/5 min. Really confusing. I also observed activity bursts, which I seemed to be able to prolong by increasing the rows per flowfile. However, I was trying things out, so I have no systematic overview of settings vs results, and perhaps my conclusions were wrong.
TORCH_THREADS requires a recent MedCAT service - it sounds like you were building your own, so it should work. As background, I was finding that lots of threads would get allocated briefly to gunicorn processes, which was inefficient in my testing as torch appeared to have priority over the main processing in grabbing the threads. The side effect was that I would frequently see requests to the MedCAT service fail because (I think) the host couldn’t grant the requested threads.
Anyhow, my current test setup is:
The MedCAT service processor is configured for 30 concurrent tasks, a 0 second run schedule and the default run duration. I think 30 is probably a bit too high for this server, even though it has the CPUs: top reports that about 1/4 of the gunicorn processes are below 90% load. If I drop to 24 then a much higher proportion sit at 100%. I still see the occasional one hitting a higher than 100% load, but so far I’m not getting failures.
This is using the MedCATservice bulk interface.
For the test I’ve terminated the Response relationship rather than sending it to an output queue.
Data is coming from OpenSearch. Document batch size is 200, and the Elasticsearch processor is on a 0.8 second run schedule. This appears to be enough to ensure that the MedCAT service always has data in its input queue.
These documents are from a clinical system and highly variable in length. A high proportion are quite short (progress notes). As a guide, the current queue for the MedCAT service is showing 25 batches (200 each) at 5.6 MB.
The 5 minute average is showing Read/Write 88 MB / 534 MB.
I played around more with the settings you suggested, but 10 MB/5 min throughput (60 MB/5 min out) is the max I have seen (and am using for now). I see your messages are about the same size (1.12 KB) as mine (1.5 KB).
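For comparing runs it can help to turn the MB-per-window figures into documents per second, using the average record sizes mentioned in this thread. Plain arithmetic, nothing service-specific:

```python
# Convert throughput quoted as "MB per 5-minute window" into documents per
# second, given an average document size in KB. The 1.5 KB default is the
# average record size mentioned in this thread.
def docs_per_second(mb_per_window, window_seconds=300, avg_doc_kb=1.5):
    kb_per_second = mb_per_window * 1024 / window_seconds
    return kb_per_second / avg_doc_kb

# 10 MB / 5 min at ~1.5 KB per document works out to roughly 23 docs/s
```

By the same arithmetic, the earlier 3 MB/5 min figure is roughly 7 docs/s, which makes the gap between setups easier to reason about than raw MB counts.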
I don’t understand why it is so different. One other thing that I found useful, and that you may already be doing, was to save some flowfiles and submit them directly to the MedCAT service using curl. This lets you watch loads without NiFi, i.e. you can bounce the MedCAT service up and down with different settings.
One other thing: how big is the Dutch variant model, and how much RAM is on the host? Is there enough RAM to comfortably hold the necessary model copies for your configuration? That is certainly a problem we had early on - I had to reduce the worker count as a result. Only after upgrading the RAM to >512 GB were we able to go down the path of using lots of workers.
Having watched the output of top carefully, I just don’t think the threading inside MedCAT is used for a high enough proportion of the time to give really high throughput. In my case the load on a process would only go to multiples of 100% (indicating high thread utilisation) for short bursts - this was something I could check with the curl approach.
I was on a short holiday last week, so sorry for the delayed response. Thanks for the additional details! I will try the no-NiFi procedure with curl as well, but I have to develop a script to use the bulk interface. Perhaps that will provide some insight.
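In case it saves time, such a driver script can be quite small. A sketch that reads plain-text documents from a folder and submits them in batches; the endpoint path and request envelope are assumptions about the bulk interface (check your MedCATservice version), the batching logic is the reusable part:

```python
# Sketch of a bulk-interface driver: read plain-text documents from a folder,
# batch them, and POST each batch to the service. ASSUMPTIONS (verify against
# your service): endpoint /api/process_bulk, envelope {"content":[{"text":...}]}.
import json
import urllib.request
from pathlib import Path

def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_folder(folder, url, batch_size=200):
    """POST every *.txt file under `folder` to the service in batches."""
    texts = [p.read_text() for p in sorted(Path(folder).glob("*.txt"))]
    for batch in chunked(texts, batch_size):
        body = json.dumps({"content": [{"text": t} for t in batch]}).encode()
        req = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=300).read()
```

With this you can vary `batch_size` and watch top while bouncing the service with different worker/thread settings, with NiFi entirely out of the loop.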
The Dutch model is here; the file sizes for the cdb and vocab files are approximately 400 MB and 800 MB. I do not know how much MedCAT needs per instance, but the VM has 50 GB RAM. That should be enough for at least 16 workers, I think.
I was looking at top as well and wondering about threads and who uses them. I have seen warnings about NiFi not being able to contact MedCAT (timeout), but these warnings disappear after a few minutes. Only with too many workers (more than 2) does throughput deteriorate and errors keep occurring.