Understanding medcat

bkakke · September 12, 2022, 5:35pm

Hi, I am trying to understand how to build my own MedCAT pipelines. I think I understand pretty well how to set up the vocabulary and CDB, but all the rest still seems to be missing: the spaCy models, the config and the meta annotations. Isn't there a tutorial that shows how these are set up and what to be aware of when setting them up?
Thanks a lot

anthony.shek · September 13, 2022, 1:00pm

Hi there! Thanks for your question.

Right now we only have these tutorials available.

We are looking to update them. But if you have any specific questions let us know here! Thanks

bkakke · September 13, 2022, 2:24pm

Hi, thanks for the answer. So I have a question regarding the model packs. When I download a model pack, it has the following content:

.
└── mc_modelpack_snomed_int_16_mar_2022_25be3857ba34bdd5
    ├── cdb.dat
    ├── meta_Status
    │   ├── bbpe-merges.txt
    │   ├── bbpe-vocab.json
    │   ├── config.json
    │   └── model.dat
    ├── model_card.json
    ├── spacy_model
    │   ├── LICENSE
    │   ├── LICENSES_SOURCES
    │   ├── README.md
    │   ├── accuracy.json
    │   ├── attribute_ruler
    │   │   └── patterns
    │   ├── config.cfg
    │   ├── lemmatizer
    │   │   └── lookups
    │   │       └── lookups.bin
    │   ├── meta.json
    │   ├── ner
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── parser
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── senter
    │   │   ├── cfg
    │   │   └── model
    │   ├── tagger
    │   │   ├── cfg
    │   │   └── model
    │   ├── tok2vec
    │   │   ├── cfg
    │   │   └── model
    │   ├── tokenizer
    │   └── vocab
    │       ├── key2row
    │       ├── lookups.bin
    │       ├── strings.json
    │       └── vectors
    └── vocab.dat

My question is: how do I create the contents of the directory called spacy_model?
Is that just a matter of following these tutorials? https://spacy.io/usage/training

Also, should I manually generate model_card.json?

anthony.shek · September 13, 2022, 2:37pm

Right so this tutorial is what you should be looking at: Part_3_1_Building_a_Concept_Database_and_Vocabulary

To quickly summarise: there are 3 components to a MedCAT model, all contained in one place called the “model pack”:

CDB
Config (This is actually within the cdb)
Vocab

When you initialise the default config, a spaCy model can be set as follows:
from medcat.config import Config
from medcat.cdb_maker import CDBMaker

config = Config()
config.general['spacy_model'] = 'en_core_web_md'
maker = CDBMaker(config)

If you are using IPython notebooks, you can see the rest of the default config parameters with:
??cdb.config
Alternatively: help(cdb.config)

The settings that people most often want to change are held in:
cdb.config.general
cdb.config.linking
cdb.config.ner

To create a model pack, initialise a CAT object:
from medcat.cat import CAT
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

Then save it and everything you need should be in there:
cat.create_model_pack(DATA_DIR + "<my_first_medcat_modelpack_name>")
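
As a quick sanity check after saving, the top-level entries of an unzipped model pack can be compared against the listing posted earlier in this thread. This is a minimal sketch using only the Python standard library, not part of MedCAT: the function name summarise_model_pack is hypothetical, and the meta_* pattern is an assumption generalised from the single meta_Status directory in that listing.

```python
from pathlib import Path

# Top-level entries seen in the example pack listed earlier in this thread.
EXPECTED_FILES = ("cdb.dat", "vocab.dat", "model_card.json")
EXPECTED_DIRS = ("spacy_model",)


def summarise_model_pack(pack_dir):
    """Report which expected entries exist in an unzipped model pack directory.

    Hypothetical helper for illustration; it only looks at file names and
    never loads any MedCAT or spaCy objects.
    """
    pack = Path(pack_dir)
    summary = {name: (pack / name).is_file() for name in EXPECTED_FILES}
    summary.update({name: (pack / name).is_dir() for name in EXPECTED_DIRS})
    # Assumption: meta-annotation models sit in directories named "meta_<name>",
    # as with "meta_Status" in the listing.
    summary["meta_models"] = sorted(
        p.name for p in pack.glob("meta_*") if p.is_dir()
    )
    return summary
```

Calling summarise_model_pack on the pack directory returns a dict with True or False for each expected file or directory, plus the names of any meta_* directories it found.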

anthony.shek · September 13, 2022, 9:37pm

To answer the last part of the question: the model_card.json is auto-generated. When you load a model pack, there are some optional parameters which you can specify.

To check it out after you have created a modelpack:

Load the modelpack:
cat = CAT.load_model_pack('<path to downloaded modelpack zip file>')
Then check the output of:
cat.get_model_card(as_dict=True)
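
Since model_card.json is a plain JSON file inside the pack, it can also be inspected without loading the model. The following is a small sketch using only the standard library; read_model_card is a hypothetical helper name, and because the card's fields are not listed in this thread, the code makes no assumption about them.

```python
import json
from pathlib import Path


def read_model_card(pack_dir):
    """Load model_card.json from an unzipped model pack directory.

    Hypothetical helper for illustration; returns whatever JSON the file holds.
    """
    card_path = Path(pack_dir) / "model_card.json"
    with card_path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, sorted(read_model_card(pack_dir)) lists the top-level keys of the card, assuming the file holds a JSON object.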

bkakke · September 13, 2022, 9:43pm

Aha, this is really helpful! Thank you so much.
Do you know if any of the ner, tok2vec, tokenizer, tagger, senter, parser or lemmatizer components should somehow be optimized to better parse clinical notes from doctors, or if using the default spaCy models directly gives good results?
Thanks

anthony.shek · September 13, 2022, 11:04pm

The full details of the MedCAT config file can be found here.

Look under cat.config.general['spacy_disabled_components']: several of the spaCy components listed above have been disabled.

spacy_disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens', 'lemmatizer']
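
To see what that default means for the pack listed earlier, the component directories from its spacy_model folder can be compared against the disabled list. This sketch is plain Python name filtering, written under the assumption that the directory names match the component names; it does not load spaCy or MedCAT, and split_components is a hypothetical helper.

```python
# Disabled components, copied from the config default quoted above.
SPACY_DISABLED_COMPONENTS = [
    'ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer',
    'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens',
    'lemmatizer',
]

# Component directories from the spacy_model listing earlier in this thread.
# tokenizer and vocab are left out because they are not pipeline components.
PACK_COMPONENTS = [
    'attribute_ruler', 'lemmatizer', 'ner', 'parser',
    'senter', 'tagger', 'tok2vec',
]


def split_components(components, disabled):
    """Split component names into (kept, dropped) according to the disabled list."""
    disabled_set = set(disabled)
    kept = [c for c in components if c not in disabled_set]
    dropped = [c for c in components if c in disabled_set]
    return kept, dropped


kept, dropped = split_components(PACK_COMPONENTS, SPACY_DISABLED_COMPONENTS)
print("kept:", kept)        # kept: ['attribute_ruler', 'senter', 'tagger', 'tok2vec']
print("disabled:", dropped)  # disabled: ['lemmatizer', 'ner', 'parser']
```

Under that assumption, only attribute_ruler, senter, tagger and tok2vec from the listing would remain active, which are the ones worth looking at first when tuning for clinical notes.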

As for the tokenizers and tagger, check out the tokenizers.py file and the taggers.py file.

The default configs that we provide are designed for clinical records, but I’m sure there is always room for improvement. If you think something can be improved, let us know!
