Question regarding MedCAT and the token order "tolerance" mentioned in the paper

Good afternoon CogStack team,

Thanks for providing access to MedCAT and also the online demo!
I read the MedCAT paper, which notes that token order can be ignored for concepts of up to two tokens.
However, I could not reproduce that behaviour, either in the online demo or by downloading the model trained on SNOMED-CT and running it locally.

Is this behavior enabled by default, or can I activate it somehow?

I also created an issue on GitHub for this: Concept not found if token order is slightly changed contrary to mentioned note in paper · Issue #344 · CogStack/MedCAT · GitHub. I tried to get in touch by email as well, but the address does not seem to be reachable, so I figured it’s better to bring this up over here.

Kind regards,
Kim Tang

Hey @KimTang!

I think you can find the configuration for what you are looking for here:

The source code is below for you to inspect:

I haven’t used this feature myself. Let us know how it works for you.

Just to find out, I used a simple model to see if the reverse word order actually works.
I set the value to True, but it didn’t seem to work. At least not for the model I was using (which is a UMLS model trained on MIMIC-III).

(The right hand side is with the correct order, left hand side is with reversed order).
In fact, as we can see from the in-code documentation, this is exactly the type of thing it’s supposed to find. But it doesn’t seem to be able to.
I’ve double checked that try_reverse_word_order is in fact True.

Just in case, I also looked at the code just to make sure the value is actually used. And it looks like it is:

With that said, the concept you tried and posted on GitHub was probably never meant to be supported: it has 4 tokens, whereas (as you quoted) the feature was supposed to work for up to 2 tokens.

Thank you so much for the detailed answer!

Is it possible to increase the number of tokens that can have a changed order from the current 2 tokens to an arbitrary number (knowing that it will probably have a large impact on the computation time)?

It would be helpful if that was a configurable parameter to adjust for specific use cases.

Hi KimTang,

Yes, it would technically be possible to do that. But there is currently no functionality that does that.

What we would recommend instead is to add the specific irregularly ordered names to the CDB.
E.g. something like this worked in my simple example:

    # expecting `cat` to be a pre-loaded model pack / CAT instance
    from medcat.cdb_maker import prepare_name
    # the CUI to add the reverse order name to
    heart_disease_CUI = 'C0018799'
    # the name to add
    heart_disease_name_reverse = "disease heart"
    # the names dict
    names = dict()
    # the dict will be filled in the below method
    prepare_name(heart_disease_name_reverse, cat.pipe.spacy_nlp, names, cat.config)
    # adding the name(s)
    cat.cdb.add_names(heart_disease_CUI, names=names)

You can obviously iterate over the different names / orders of tokens you wish to add.
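As a rough sketch of that iteration, the reordered names can be generated with `itertools.permutations` and then passed through the same `prepare_name` / `add_names` calls as in the snippet above. The helper names `token_order_variants` and `add_order_variants`, the whitespace tokenisation, and the `max_tokens` cut-off are assumptions made for illustration; they are not part of MedCAT.

```python
from itertools import permutations


def token_order_variants(name, max_tokens=3):
    """Return every reordering of a whitespace-tokenised name, excluding the original.

    Names with fewer than 2 or more than `max_tokens` tokens yield no variants.
    The upper limit exists because the number of orderings grows factorially
    (n! for n distinct tokens).
    """
    tokens = name.split()
    if len(tokens) < 2 or len(tokens) > max_tokens:
        return []
    # A set removes duplicate orderings that arise when a token is repeated.
    variants = {" ".join(p) for p in permutations(tokens)}
    variants.discard(" ".join(tokens))
    return sorted(variants)


def add_order_variants(cat, cui, name, max_tokens=3):
    """Add each reordered variant of `name` to `cui` in the CDB of a loaded CAT instance.

    Hypothetical helper: it only wraps the calls shown in the snippet above.
    """
    # Imported lazily so token_order_variants can be used without MedCAT installed.
    from medcat.cdb_maker import prepare_name

    names = dict()
    for variant in token_order_variants(name, max_tokens):
        # Same call as in the snippet above; it fills `names` in place.
        prepare_name(variant, cat.pipe.spacy_nlp, names, cat.config)
    if names:
        cat.cdb.add_names(cui, names=names)
```

With a loaded `cat`, calling `add_order_variants(cat, 'C0018799', "heart disease")` would add only "disease heart". A three-token name already produces 5 extra variants and a four-token name 23, so it is probably worth keeping `max_tokens` small.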

This shouldn’t need additional training. At least it didn’t in my simple example:

Hi Mart, thanks for clarifying.

Unfortunately, while the suggested solution would work for several concepts with mixed token orders, it does not work well for my use case with freely written clinical reports, in which any SNOMED-CT concept could have a slightly mixed token order (in theory I would have to add token order variants for every existing SNOMED-CT concept).

But I understand that this is also a more complex conceptual problem, since allowing mixed up token orders for the concepts could also increase the number of mismatches overall. Thanks for your help though, I will experiment a bit further with it.