Hi, I am trying to understand how to build my own MedCAT pipelines. I think I understand fairly well how to set up the vocabulary and CDB, but everything else still seems to be missing: the spaCy models, the config and the meta annotations. Is there a tutorial that shows how these are set up and what to be aware of when setting them up?
Thanks a lot
Hi there! Thanks for your question.
Right now we only have these tutorials available.
We are looking to update them. But if you have any specific questions let us know here! Thanks
Hi, thanks for the answer. I have a question regarding the model packs. When I download a model pack, it has the following contents:
.
└── mc_modelpack_snomed_int_16_mar_2022_25be3857ba34bdd5
    ├── cdb.dat
    ├── meta_Status
    │   ├── bbpe-merges.txt
    │   ├── bbpe-vocab.json
    │   ├── config.json
    │   └── model.dat
    ├── model_card.json
    ├── spacy_model
    │   ├── LICENSE
    │   ├── LICENSES_SOURCES
    │   ├── README.md
    │   ├── accuracy.json
    │   ├── attribute_ruler
    │   │   └── patterns
    │   ├── config.cfg
    │   ├── lemmatizer
    │   │   └── lookups
    │   │       └── lookups.bin
    │   ├── meta.json
    │   ├── ner
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── parser
    │   │   ├── cfg
    │   │   ├── model
    │   │   └── moves
    │   ├── senter
    │   │   ├── cfg
    │   │   └── model
    │   ├── tagger
    │   │   ├── cfg
    │   │   └── model
    │   ├── tok2vec
    │   │   ├── cfg
    │   │   └── model
    │   ├── tokenizer
    │   └── vocab
    │       ├── key2row
    │       ├── lookups.bin
    │       ├── strings.json
    │       └── vectors
    └── vocab.dat
My question is: how do I create the contents of the directory called spacy_model?
Is that just a matter of following these tutorials? https://spacy.io/usage/training
Also, should I manually generate model_card.json?
Right, so this tutorial is the one you should be looking at: Part_3_1_Building_a_Concept_Database_and_Vocabulary
To quickly summarise: there are 3 components to a MedCAT model, all of which are contained in one place called the "model pack":
- CDB
- Config (This is actually within the cdb)
- Vocab
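As a rough way to see how these components map onto the files in an unpacked model pack, here is a minimal sketch in plain Python. It does not use MedCAT itself. The file names are taken from the directory listing earlier in this thread, and summarise_model_pack is a hypothetical helper name, not part of the MedCAT API.

```python
from pathlib import Path

# Entry names taken from the directory listing earlier in this thread.
EXPECTED = ("cdb.dat", "vocab.dat", "model_card.json", "spacy_model")


def summarise_model_pack(pack_dir):
    """Report which expected entries exist, plus any meta_* directories.

    Hypothetical helper for illustration only; not a MedCAT function.
    """
    root = Path(pack_dir)
    # Map each expected name to whether it is present in the pack directory.
    present = {name: (root / name).exists() for name in EXPECTED}
    # Meta-annotation models appear as directories prefixed with "meta_".
    meta_models = sorted(p.name for p in root.glob("meta_*") if p.is_dir())
    return present, meta_models
```

Calling summarise_model_pack on the unzipped pack directory returns a dict of booleans and a list of meta_* directory names, which is a quick check that a pack you built yourself has the same layout as a downloaded one.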
When you initialise the default config, a spaCy model can be set as follows:
from medcat.config import Config
from medcat.cdb_maker import CDBMaker
config = Config()
config.general['spacy_model'] = 'en_core_web_md'
maker = CDBMaker(config)
If you are using IPython notebooks, you can see the rest of the default config parameters with:
??cdb.config
Alternatively, use help(cdb.config)
The main configurations which people would like to change are held in either:
cdb.config.general
cdb.config.linking
cdb.config.ner
To create a model pack, initialise a CAT object:
cat = CAT(cdb=cdb, config=cdb.config, vocab=vocab)
Then save it and everything you need should be in there:
cat.create_model_pack(DATA_DIR + "<my_first_medcat_modelpack_name>")
To answer the last part of the question: model_card.json is auto-generated. When you load a model pack, there are some optional parameters which you can specify.
To check it out after you have created a modelpack:
Load the modelpack:
cat = CAT.load_model_pack('<path to downloaded modelpack zip file>')
Then check the output of:
cat.get_model_card(as_dict=True)
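If you only want to look at the card without loading the whole pack, the model_card.json file in the unpacked directory can also be read directly. This is a minimal sketch using only the standard library. It assumes the pack has already been unzipped, and read_model_card is a hypothetical helper name. No particular field names in the card are assumed.

```python
import json
from pathlib import Path


def read_model_card(pack_dir):
    """Load model_card.json from an unpacked model pack directory.

    Hypothetical helper for illustration only; not a MedCAT function.
    """
    card_path = Path(pack_dir) / "model_card.json"
    with card_path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, sorted(read_model_card('<path to unpacked modelpack directory>')) lists the top-level fields that were generated for that pack.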
Aha, this is really helpful! Thank you so much
Do you know if any of the ner, tok2vec, tokenizer, tagger, senter, parser or lemmatizer components should be optimised to better parse clinical notes from doctors, or whether the default spaCy models give good results when used directly?
thanks
Full details of the MedCAT config can be found here.
Under cat.config.general['spacy_disabled_components'] you can see that several of the spaCy components listed above are disabled by default:
spacy_disabled_components: list = ['ner', 'parser', 'vectors', 'textcat', 'entity_linker', 'sentencizer', 'entity_ruler', 'merge_noun_chunks', 'merge_entities', 'merge_subtokens', 'lemmatizer']
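To relate this list to the components asked about above, here is a small plain-Python sketch that splits them into those named in the disabled list and those that are not. It only compares names. It is an illustration, not how MedCAT applies the setting internally, and split_by_disabled is a hypothetical helper name.

```python
# Components named in the question above.
asked_about = ["ner", "tok2vec", "tokenizer", "tagger", "senter", "parser", "lemmatizer"]

# Default value quoted above.
spacy_disabled_components = [
    "ner", "parser", "vectors", "textcat", "entity_linker", "sentencizer",
    "entity_ruler", "merge_noun_chunks", "merge_entities", "merge_subtokens",
    "lemmatizer",
]


def split_by_disabled(components, disabled):
    """Split component names into (disabled, not disabled), keeping order."""
    disabled_set = set(disabled)
    off = [c for c in components if c in disabled_set]
    on = [c for c in components if c not in disabled_set]
    return off, on


off, on = split_by_disabled(asked_about, spacy_disabled_components)
print("disabled:", off)      # ['ner', 'parser', 'lemmatizer']
print("not disabled:", on)   # ['tok2vec', 'tokenizer', 'tagger', 'senter']
```

Note that senter is not the same name as sentencizer, so it does not appear in the disabled group.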
As for the tokenisers and tagger, check out the tokenizers.py and taggers.py files.
The default configs that we provide are designed for clinical records, but I'm sure there is always room for improvement. If you think something can be improved, let us know!