Tika optimization and structured documents

Hi,
I have two questions concerning the Tika component of CogStack:

  • Is there a bulk interface for handling many small documents, and are there sample workflows illustrating how to use it?
  • Is there any facility for processing structured document formats? For example, some of the documents generated by our EMR are in XML-based formats, and their tags may be useful for recognising document structure such as tables, headings and possibly the templates used to construct the document. Could any Tika experts comment on this?

Thanks

Hello,

Although the Tika web service has a bulk API request, we don’t have a bulk method aimed specifically at small documents. You will probably need a separate Tika instance configured for small documents (with OCR disabled, if that suits your case). There are no sample workflows yet; I’ve added this to the to-do list.

By default there is no configuration that allows further customisation for processing other file types. There is only a general configuration for tesseract-ocr and a separate one for PDF, and both cover only the OCR settings. The only way to do this at the moment is to write your own content handler programmatically. I will add this to the requested-features list, but I’m not sure yet how generic it can be made.
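
In the meantime, if the XML is well-formed, one stopgap is to pull the structure out yourself before or alongside the Tika step. The sketch below is plain Python with the standard library, not a Tika content handler, and every tag name in it (heading, title, table, row, cell) is a hypothetical placeholder for whatever your EMR actually emits. It also assumes the tags carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: replace them with the ones your EMR's XML uses.
# Namespaced documents would need the "{namespace}tag" form instead.
HEADING_TAGS = {"heading", "title"}
TABLE_TAG = "table"
ROW_TAG = "row"
CELL_TAG = "cell"


def element_text(element):
    """Return all text inside an element, with whitespace collapsed."""
    return " ".join(" ".join(element.itertext()).split())


def extract_structure(xml_text):
    """Collect headings, tables and the flattened text of one XML document."""
    root = ET.fromstring(xml_text)
    headings = []
    tables = []
    for element in root.iter():
        if element.tag in HEADING_TAGS:
            headings.append(element_text(element))
        elif element.tag == TABLE_TAG:
            # Each table becomes a list of rows, each row a list of cell strings.
            rows = [
                [element_text(cell) for cell in row.iter(CELL_TAG)]
                for row in element.iter(ROW_TAG)
            ]
            tables.append(rows)
    return {"headings": headings, "tables": tables, "text": element_text(root)}
```

The returned dictionary keeps the headings and tables separate from the flattened text, so the structure can be stored next to the extracted text rather than lost.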

Thanks

Just some workarounds that can be used in the meantime:

  • You can easily process multiple documents at the same time with NiFi: all you need to do is set the number of concurrent tasks that the service script can request.
  • For small documents you need to set up another container with the env vars OMP_DYNAMIC=FALSE, OMP_NESTED=FALSE and OMP_THREAD_LIMIT=1. You can then use the /bulk_process method and see whether it performs well; if it doesn’t, I’d suggest relying on the method above of sending task requests concurrently.
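
For anyone testing the concurrent approach outside NiFi, here is a rough Python sketch of the same idea: one request per document, with a fixed number of requests in flight at once. The URL, the content type and the function names are assumptions made for illustration, not the service's documented API, so check them against your own deployment. The max_workers argument plays roughly the role of NiFi's concurrent tasks setting.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical endpoint: host, port and path depend on your deployment.
TIKA_URL = "http://localhost:8090/api/process"


def send_document(data, url=TIKA_URL):
    """POST one document's raw bytes and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def process_concurrently(documents, sender=send_document, max_workers=4):
    """Send each document as its own request, up to max_workers at a time.

    Results are returned in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sender, documents))
```

The sender is passed in as a parameter so that the dispatch logic can be tried with a stand-in function before pointing it at a real service. Start with a small max_workers and raise it while watching the container's CPU use, since each worker ties up one request on the Tika side.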