Understanding medcat

Hi, I am trying to understand how I build my own MedCAT pipelines. I think I understand pretty well how I set up the vocabulary and CDB, however all the rest still seems to be missing - all the spacy models, the config and the meta annotations. Isn't there a tutorial that shows how these are set up and what to be aware of when setting them up?
Thanks a lot :slight_smile:

Hi there! Thanks for your question.

Right now we only have these tutorials available.

We are looking to update them. But if you have any specific questions let us know here! Thanks


Hi, thanks for the answer. So I have a question regarding the model packs. When I download a model pack it has the following content:

└── mc_modelpack_snomed_int_16_mar_2022_25be3857ba34bdd5
    ├── cdb.dat
    ├── meta_Status
    │   ├── bbpe-merges.txt
    │   ├── bbpe-vocab.json
    │   ├── config.json
    │   └── model.dat
    ├── model_card.json
    ├── spacy_model
    │   ├── LICENSE
    │   ├── LICENSES_SOURCES
    │   ├── README.md
    │   ├── accuracy.json
    │   ├── attribute_ruler
    │   │   └── patterns
    │   ├── config.cfg
    │   ├── lemmatizer
    │   │   └── lookups
    │   │       └── lookups.bin
    │   ├── meta.json
    │   ├── ner
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── parser
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── senter
    │   │   ├── cfg
    │   │   └── model
    │   ├── tagger
    │   │   ├── cfg
    │   │   └── model
    │   ├── tok2vec
    │   │   ├── cfg
    │   │   └── model
    │   ├── tokenizer
    │   └── vocab
    │       ├── key2row
    │       ├── lookups.bin
    │       ├── strings.json
    │       └── vectors
    └── vocab.dat

My question is: how do I create anything in the directory called spacy_model?
Is that just a matter of following these tutorials? https://spacy.io/usage/training

Also, should I manually generate model_card.json?

Right, so this tutorial is what you should be looking at: Part_3_1_Building_a_Concept_Database_and_Vocabulary

To quickly summarise: there are 3 components to a MedCAT model, which are all contained in one place called the “model pack”:

  1. CDB
  2. Config (this actually lives within the CDB)
  3. Vocab

When you initialise the default config, a spacy model can be set as follows:
from medcat.config import Config
from medcat.cdb_maker import CDBMaker

config = Config()
config.general['spacy_model'] = 'en_core_web_md'
maker = CDBMaker(config)

If you are using IPython notebooks, you can see the rest of the default config parameters here:
Alternatively, run help(cdb.config)

The main configurations which people would like to change are held in either:

To create a model pack, initialise a CAT object:
from medcat.cat import CAT

cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)

Then save it and everything you need should be in there:
cat.create_model_pack(DATA_DIR + "<my_first_medcat_modelpack_name>")

To answer the last part of the question: the model_card.json is auto-generated. When you load a model pack there are some optional parameters which you can specify.

To check it out after you have created a model pack:

Load the model pack:
cat = CAT.load_model_pack('<path to downloaded modelpack zip file>')
Then check the output of cat.get_model_card()
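If you prefer to inspect the file directly, model_card.json sits at the root of the unpacked model pack and is plain JSON. A minimal sketch of reading it; note the pack directory and card contents below are fabricated stand-ins just so the snippet runs, a real card is auto-generated for you:

```python
import json
import os
import tempfile

# Stand-in for an unpacked model pack directory; the card contents here are
# made up purely for illustration - medcat generates the real model_card.json.
pack_dir = tempfile.mkdtemp()
with open(os.path.join(pack_dir, 'model_card.json'), 'w') as f:
    json.dump(
        {'Model ID': 'mc_modelpack_snomed_int_16_mar_2022_25be3857ba34bdd5'},
        f,
    )

# Reading the card back is just a JSON load
with open(os.path.join(pack_dir, 'model_card.json')) as f:
    model_card = json.load(f)

print(model_card['Model ID'])
```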

Aha, this is really helpful! Thank you so much.
Do you know if any of the ner, tok2vec, tokenizer, tagger, senter, parser or lemmatizer components should somehow be optimised to better parse clinical notes from doctors, or if using the default spacy models gives good results when used directly?

So, in the medcat config file, full details of which can be found here.

You can see under cat.config.general['spacy_disabled_components'] that several of the above spacy components have been disabled:

spacy_disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens', 'lemmatizer']
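If you want to experiment with re-enabling one of these components, the disabled list is an ordinary Python list that can be edited before the pipeline is built. A minimal sketch on a stand-in dict (in a real pipeline this would be cat.config.general):

```python
# Stand-in for cat.config.general, which is treated as a plain dict here
general = {
    'spacy_disabled_components': [
        'ner', 'parser', 'vectors', 'textcat', 'entity_linker',
        'sentencizer', 'entity_ruler', 'merge_noun_chunks',
        'merge_entities', 'merge_subtokens', 'lemmatizer',
    ]
}

# Re-enable the lemmatizer by removing it from the disabled list;
# everything still listed stays switched off when the pipeline is built
general['spacy_disabled_components'].remove('lemmatizer')

print(general['spacy_disabled_components'])
```

Whether re-enabling a component actually helps on clinical text is something you would have to evaluate; the defaults reflect what MedCAT itself needs.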

As for the tokenisers and tagger, check out the tokenizers.py file and the taggers.py file.

The default configs that we provide are designed for clinical records, but I'm sure there is always room for improvement. If you think something can be improved, let us know!
