Thanks for providing access to MedCAT and also the online demo!
I read the MedCAT paper, which notes that token order can be ignored for names of up to two tokens.
But I could not try out that feature, either in the online demo or by downloading the model trained on SNOMED-CT and running it locally.
Is this behavior enabled by default, or can I activate it somehow?
Just to find out, I used a simple model to see whether the reverse word order actually works.
I set try_reverse_word_order to True, but it didn’t seem to work, at least not for the model I was using (a UMLS model trained on MIMIC-III).
(The right-hand side shows the correct order, the left-hand side the reversed order.)
In fact, as we can see from the in-code documentation, this is exactly the type of thing it’s supposed to find. But it doesn’t seem to be able to.
I’ve double checked that try_reverse_word_order is in fact True.
Just in case, I also looked at the code just to make sure the value is actually used. And it looks like it is:
With that said, the concept you tried and posted on GitHub was probably never meant to be supported: it has 4 tokens, whereas (as you quoted) the feature is supposed to work for up to 2 tokens.
Is it possible to increase the number of tokens that can have a changed order from the current 2 tokens to an arbitrary number (knowing that it will probably have a large impact on the computation time)?
It would be helpful if that was a configurable parameter to adjust for specific use cases.
Yes, it would technically be possible to do that. But there is currently no functionality that does that.
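To give a rough sense of the cost: if arbitrary reordering were done naively by trying every ordering of a candidate span (an assumption for illustration only, not a description of how MedCAT works internally), a span of n tokens would have n! orderings to look up. A small sketch, where `orderings_to_check` is a hypothetical helper:

```python
from math import factorial


def orderings_to_check(n_tokens: int) -> int:
    """Number of distinct orderings of n_tokens distinct tokens (n!)."""
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return factorial(n_tokens)


# Show how quickly the number of orderings grows with span length
for n in range(2, 7):
    print(f"{n} tokens -> {orderings_to_check(n)} orderings")
```

So two tokens mean only 2 orderings per span, whereas six tokens would already mean 720.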
What we would recommend instead is to add the specific irregularly ordered names to the CDB.
E.g. something like this worked in my simple example:
# expecting `cat` to be a pre-loaded model pack / CAT instance
from medcat.cdb_maker import prepare_name
# the CUI to add the reverse order name to
heart_disease_CUI = 'C0018799'
# the name to add
heart_disease_name_reverse = "disease heart"
# the names dict
names = dict()
# the dict will be filled in the below method
prepare_name(heart_disease_name_reverse, cat.pipe.spacy_nlp, names, cat.config)
# adding the name(s)
cat.cdb.add_names(heart_disease_CUI, names=names)
You can obviously iterate over the different names / orders of tokens you wish to add.
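A minimal sketch of that iteration, assuming names can simply be split on whitespace. `token_order_variants` is a hypothetical helper (not part of MedCAT); the commented-out lines reuse the `prepare_name` / `add_names` calls from the snippet above and expect the same pre-loaded `cat`:

```python
from itertools import permutations


def token_order_variants(name: str, max_tokens: int = 4) -> list[str]:
    """Return all reorderings of the whitespace-separated tokens in `name`,
    excluding the original order and any duplicates."""
    tokens = name.split()
    if len(tokens) > max_tokens:
        # n tokens give n! orderings, so cap the length to keep this manageable
        raise ValueError(f"{name!r} has more than {max_tokens} tokens")
    seen = {" ".join(tokens)}  # start with the original order so it is skipped
    variants = []
    for perm in permutations(tokens):
        candidate = " ".join(perm)
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants


print(token_order_variants("heart disease"))

# With a loaded `cat`, each variant could then be added as in the snippet above:
# from medcat.cdb_maker import prepare_name
# names = dict()
# for variant in token_order_variants("heart disease"):
#     prepare_name(variant, cat.pipe.spacy_nlp, names, cat.config)
# cat.cdb.add_names('C0018799', names=names)
```

Many reorderings of longer names will be meaningless or may collide with other concepts, so it is probably worth reviewing the generated variants before adding them.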
This shouldn’t need additional training. At least it didn’t in my simple example:
Unfortunately, while the suggested solution would work for several concepts with mixed token orders, it does not work well for my use case: freely written clinical reports, in which any SNOMED-CT concept could appear with a slightly mixed token order. In theory, I would have to add token order variants for every existing SNOMED-CT concept.
But I understand that this is also a more complex conceptual problem, since allowing mixed up token orders for the concepts could also increase the number of mismatches overall. Thanks for your help though, I will experiment a bit further with it.