Question regarding MedCat and token order "tolerance" mentioned in paper

KimTang · September 19, 2023, 3:01pm

Good afternoon CogStack team,

Thanks for providing access to MedCAT and the online demo!
In the MedCAT paper I read a note saying that token order can be ignored for concepts of up to two tokens.
However, I could not reproduce that behavior, either in the online demo or by downloading the model trained on SNOMED-CT and running it locally.

Is this behavior enabled by default, or can I activate it somehow?

I also created an issue on GitHub for this: "Concept not found if token order is slightly changed contrary to mentioned note in paper" (Issue #344, CogStack/MedCAT). I tried to get in touch by email as well, but the address does not seem to be reachable, so I figured it is better to bring this up here.

Kind regards,
Kim Tang

anthony.shek · September 25, 2023, 1:56pm

Hey @KimTang!

I think you can find the configuration for what you are looking for here:
cat.cdb.config.Ner()

The source code is below for you to inspect:

I haven’t personally used this feature myself. Let us know how this works for you.

mart.ratas · September 26, 2023, 12:57pm

Just to find out, I used a simple model to see if the reverse word order actually works.
I set the value to True, but it didn’t seem to work. At least not for the model I was using (which is a UMLS model trained on MIMIC-III).

(The right hand side is with the correct order, left hand side is with reversed order).
In fact, as we can see from the in-code documentation, this is exactly the type of thing it’s supposed to find. But it doesn’t seem to be able to.
I’ve double checked that try_reverse_word_order is in fact True.
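
For anyone who wants to repeat the check, here is a minimal sketch of switching the flag on and reading it back. The attribute path `config.ner.try_reverse_word_order` is an assumption pieced together from the names mentioned in this thread, and `enable_reverse_word_order` is a hypothetical helper. The stand-in object only mimics that attribute layout so the snippet runs without a model pack.

```python
from types import SimpleNamespace


def enable_reverse_word_order(cat):
    """Set the reverse-word-order flag on a CAT-like object and return its new value.

    Assumes the flag lives at cat.config.ner.try_reverse_word_order.
    """
    cat.config.ner.try_reverse_word_order = True
    return cat.config.ner.try_reverse_word_order


# Stand-in for a loaded model pack (NOT a real MedCAT object); with MedCAT
# installed you would pass your own pre-loaded CAT instance instead.
fake_cat = SimpleNamespace(
    config=SimpleNamespace(ner=SimpleNamespace(try_reverse_word_order=False))
)

flag_is_set = enable_reverse_word_order(fake_cat)
print(flag_is_set)
```

As reported in this post, having the flag set to True did not change the annotations for the model tested here, so treat this as a way to verify the setting rather than a fix.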

Just in case, I also looked at the code to make sure the value is actually used. And it looks like it is:

github.com: CogStack/MedCAT/blob/master/medcat/ner/vocab_based_ner.py

    import logging
    from spacy.tokens import Doc
    from medcat.ner.vocab_based_annotator import maybe_annotate_name
    from medcat.pipeline.pipe_runner import PipeRunner
    from medcat.cdb import CDB
    from medcat.config import Config


    logger = logging.getLogger(__name__)


    class NER(PipeRunner):

        # Custom pipeline component name
        name = 'cat_ner'

        # Override
        def __init__(self, cdb: CDB, config: Config) -> None:
            self.config = config
            self.cdb = cdb

(This file has been truncated.)

With that said, the concept you tried and posted on GitHub was probably never meant to be supported: it has 4 tokens, whereas (as you quoted) the feature was supposed to work for up to 2 tokens.

KimTang · October 9, 2023, 7:34am

Thank you so much for the detailed answer!

Is it possible to increase the number of tokens that can have a changed order from the current 2 tokens to an arbitrary number (knowing that it will probably have a large impact on computation time)?

It would be helpful if that was a configurable parameter to adjust for specific use cases.

mart.ratas · October 9, 2023, 11:03am

Hi KimTang,

Yes, it would technically be possible to do that. But there is currently no functionality that does that.

What we would recommend instead is to add the specific irregularly ordered names to the CDB.
E.g. something like this worked in my simple example:

    # expecting `cat` to be a pre-loaded model pack / CAT instance
    from medcat.cdb_maker import prepare_name
    # the CUI to add the reverse order name to
    heart_disease_CUI = 'C0018799'
    # the name to add
    heart_disease_name_reverse = "disease heart"
    # the names dict
    names = dict()
    # the dict will be filled in the below method
    prepare_name(heart_disease_name_reverse, cat.pipe.spacy_nlp, names, cat.config)
    # adding the name(s)
    cat.cdb.add_names(heart_disease_CUI, names=names)

You can obviously iterate over the different names / orders of tokens you wish to add.
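
A minimal sketch of generating those alternative token orders with the standard library. `token_order_variants` and its `max_tokens` cap are hypothetical names introduced for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from itertools import permutations


def token_order_variants(name, max_tokens=4):
    """Return every reordering of the whitespace-separated tokens in `name`,
    excluding the original order. Names with fewer than 2 tokens or more
    than `max_tokens` tokens yield an empty list."""
    tokens = name.split()
    if not 2 <= len(tokens) <= max_tokens:
        return []
    original = " ".join(tokens)
    variants = []
    for perm in permutations(tokens):
        candidate = " ".join(perm)
        # Skip the original order and duplicates caused by repeated tokens
        if candidate != original and candidate not in variants:
            variants.append(candidate)
    return variants


print(token_order_variants("heart disease"))  # → ['disease heart']
```

Each variant could then be passed through `prepare_name` and `cat.cdb.add_names` as in the snippet above. A name with n distinct tokens produces n! - 1 variants, hence the cap, and many of those orderings will not be phrases anyone actually writes, so adding all of them may increase false matches.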

This shouldn’t need additional training. At least it didn’t in my simple example:

KimTang · October 9, 2023, 1:35pm

Hi Mart, thanks for clarifying.

Unfortunately, while the suggested solution would work for a handful of concepts with mixed token orders, it does not work well for my use case. I am working with freely written clinical reports, in which any SNOMED-CT concept could appear with a slightly mixed token order, so in theory I would have to add token order variants for every existing SNOMED-CT concept.

But I understand that this is also a more complex conceptual problem, since allowing mixed up token orders for the concepts could also increase the number of mismatches overall. Thanks for your help though, I will experiment a bit further with it.
