MedCAT French model only matches exact terms - accuracy and context_similarity always 1

Hello,

I recently discovered MedCAT with great interest, and I would like to use it for a research project involving a French psychiatric EHR. I managed to create a model that extracts concepts with reasonably good relevance, by following both the tutorials available here and the work that has been done in Dutch. However, when I use this model (trained on various types of data), all identified concepts have an accuracy and a context_similarity of 1.

It seems like my model is only performing exact string matching against the CDB, which undermines the real strength of MedCAT. For reference, here are the parameters I experimented with, although the results remain unchanged:

  • Vocab file: I tried both French FastText and Word2Vec embeddings
  • CDB file: I built a dictionary using the French terms from UMLS as well as the French version of SNOMED-CT. The dictionary contains only French terms (~350k terms for 190k concepts)
  • Model training: I used 2,000 freely available medical documents from the Frasimed corpus, as well as my own clinical dataset (~21k documents). No difference was observed before or after training
  • Documents to annotate: I tested the model on real clinical notes, web-based texts, and manually written examples

Do you have any idea what might be causing this behavior in my model? Here are the configuration options, along with the training stats:

Config(version=VersionInfo(history=, meta_cats=, cdb_info={‘Number of concepts’: 187229, ‘Number of names’: 275420, ‘Number of concepts that received training’: 0, ‘Number of seen training examples in total’: 0, ‘Average training examples per concept’: 0.0}, performance={‘ner’: {}, ‘meta’: {}}, description=‘No description’, id=‘be1bbfb146671ccb’, last_modified=‘29 April 2025’, location=None, ontology=None, medcat_version=‘1.15.0’), cdb_maker=CDBMaker(name_versions=[‘LOWER’, ‘CLEAN’], multi_separator=‘|’, remove_parenthesis=5, min_letters_required=2), annotation_output=AnnotationOutput(doc_extended_info=False, context_left=-1, context_right=-1, lowercase_context=True, include_text_in_output=False), general=General(spacy_disabled_components=[‘ner’, ‘parser’, ‘vectors’, ‘textcat’, ‘entity_linker’, ‘sentencizer’, ‘entity_ruler’, ‘merge_noun_chunks’, ‘merge_entities’, ‘merge_subtokens’], checkpoint=CheckPoint(output_dir=‘checkpoints’, steps=None, max_to_keep=1), usage_monitor=UsageMonitor(enabled=False, batch_size=100, file_prefix=‘usage_’, log_folder=‘.’), log_level=20, log_format=‘%(levelname)s:%(name)s: %(message)s’, log_path=‘./medcat.log’, spacy_model=‘fr_core_news_lg’, separator=‘~’, spell_check=True, diacritics=True, spell_check_deep=False, spell_check_len_limit=7, show_nested_entities=False, full_unlink=True, workers=7, make_pretty_labels=‘long’, map_cui_to_group=False, simple_hash=False), preprocessing=Preprocessing(words_to_skip={‘nos’}, keep_punct={‘.’, ‘:’}, do_not_normalize={‘VBN’, ‘VBG’, ‘VBP’, ‘VBD’, ‘JJR’, ‘JJS’}, skip_stopwords=False, min_len_normalize=5, stopwords=None, max_document_length=1000000), ner=Ner(min_name_len=2, max_skip_tokens=2, check_upper_case_names=False, upper_case_limit_len=3, try_reverse_word_order=False), linking=Linking(optim={‘type’: ‘linear’, ‘base_lr’: 1, ‘min_lr’: 5e-05}, context_vector_sizes={‘xlong’: 27, ‘long’: 18, ‘medium’: 9, ‘short’: 3}, context_vector_weights={‘xlong’: 0.1, ‘long’: 0.4, ‘medium’: 0.4, ‘short’: 0.1}, 
filters=LinkingFilters(cuis=set(), cuis_exclude=set()), train=False, random_replacement_unsupervised=0.8, disamb_length_limit=5, filter_before_disamb=False, train_count_threshold=10, always_calculate_similarity=False, calculate_dynamic_threshold=False, similarity_threshold_type=‘static’, similarity_threshold=0.3, negative_probability=0.5, negative_ignore_punct_and_num=True, prefer_primary_name=0.35, prefer_frequent_concepts=0.35, subsample_after=30000, devalue_linked_concepts=False, context_ignore_center_tokens=False, disamb=True), word_skipper=re.compile(‘^(nos)$’), punct_checker=re.compile(‘[^a-z0-9]+’), hash=‘79af0f569b59f28f’)

Thank you in advance, and congratulations on this incredible work that will help advance research on free-text clinical data.

Hi. Thanks for the interest in MedCAT.

We haven’t worked with a French model ourselves. But in principle there is no reason MedCAT shouldn’t work with French, given a CDB built on French concepts.

With that said, the config you’ve provided clearly states:

‘Number of concepts that received training’: 0, ‘Number of seen training examples in total’: 0,

Now, this could simply be because you’ve not saved the model after it’s been trained (the cdb_info is only set at save time).
But just to make sure, you can always manually call cat.cdb.make_stats() to find out whether or how many of your concepts have seen training examples.

In general, here’s how I would imagine the workflow for creating a French language model of MedCAT:

  • Download and preprocess the French version of an ontology (e.g. SNOMED-CT)
    • If an English version is used instead, many terms are likely to not match
  • Create a CDB based on the preprocessed French language ontology
  • Create Vocab that’s got word embeddings for words relevant in the French language
    • The Vocab we provide with our model comes with words in the English language
    • As such, many of these will probably not be in the French language texts
    • And even if they are, they’re unlikely to have the exact same meaning
      • At least in terms of relative meaning between different words
  • Create a relevant config
    • You need to use a French spacy model for it to correctly tokenize the language
    • You may want to consider allowing diacritics for spell checking and word splitting
      • This will allow letters 'àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ' on top of the regular Latin alphabet
  • Create a model pack (CAT) based on the above
  • Perform self-supervised training on some relevant data in French
    • The closer the data is to the final use case, the better, of course
  • Save model on file for later use

Now, it does seem like you have actually done all of the above. But I wanted to write it down just to make sure we’re on the same page.

So if, to the best of your knowledge, all of the above has been done, these are the troubleshooting steps I would take:

  • Look at whether/how many concepts have seen training (see cdb.make_stats() above)
    • If there’s been no training on anything, even after the 23k documents, we’d need to look at why that is
    • If there’s training, you might find that it’s not on as many concepts as you’d have hoped

By the way, 23 000 documents (depending on the size of the documents) may not actually be enough for the model to learn everything. Though it should certainly learn some things.

If you’re convinced that the training process has run, but few (or no) concepts have received training, perhaps you could add some logging to the training process. Some things in there may be failing quietly without notifying you. For example, setting the medcat logger to DEBUG level should show each NER detection and each linking update.
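
A minimal sketch of such a logging setup, using only Python's standard `logging` module. The logger name `medcat` is inferred from the `medcat.*` logger names MedCAT emits, and the format string is copied from the `log_format` value in the config above. The helper name `enable_medcat_debug_logging` and the `_medcat_debug` marker are hypothetical, introduced here for illustration only.

```python
import logging
import sys


def enable_medcat_debug_logging(stream=sys.stderr):
    """Send DEBUG-level messages from all `medcat.*` loggers to `stream`."""
    logger = logging.getLogger("medcat")
    logger.setLevel(logging.DEBUG)
    # Only attach our handler once: stacking handlers on repeated calls
    # would print every message more than once.
    if not any(getattr(h, "_medcat_debug", False) for h in logger.handlers):
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(levelname)s:%(name)s: %(message)s")
        )
        handler._medcat_debug = True  # hypothetical marker attribute
        logger.addHandler(handler)
    return logger


# Call this before running the training
logger = enable_medcat_debug_logging()
```

Child loggers such as `medcat.ner.vocab_based_annotator` inherit the level from `medcat`, so their messages should show up without further configuration. DEBUG output is verbose, so it is probably best tried on a handful of documents first.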

With logging enabled, you should be able to tell where the model fails: whether it never picks up the terms at all, whether it fails to link them to the appropriate concept, or whether something else is failing and the failure is only visible in the logs.

Let me know how you get along.

Hello,

Thank you for your quick and detailed reply. I realize I forgot to include the details of the training I performed. Here it is: I trained the model on approximately 27k documents (2k + 21k + 2k + 300), all focused on psychiatry, with varied origins and lengths.

INFO:medcat.cdb:{
  "Number of concepts": 62818,
  "Number of names": 121285,
  "Number of concepts that received training": 7811,
  "Number of seen training examples in total": 802138,
  "Average training examples per concept": 102.6933811291768
}

Here is the code I used to compile the .txt files into a DataFrame, and then to train the model:

import os

import pandas as pd

# `cat` is the CAT model object created earlier (not shown here)
data_dirs = [
    '/Users/myaccount/Documents/txt/',
]
documents = []
for data_dir in data_dirs:
    for filename in os.listdir(data_dir):
        if filename.endswith(".txt"):
            with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
                # Flatten each file into a single line of text
                text = f.read().replace('\n', ' ')
                documents.append({'text': text})

# Create a dataframe
df = pd.DataFrame(documents)
print(df.head(1))

# Train
print(f"There are {len(df['text'])} documents to train...")
cat.train(df.text, progress_print=100)
cat.cdb.print_stats()

I did follow the workflow you mentioned — and I would be very happy to share my files with the community once everything is functional. I also followed your advice to analyze the training process, and it appears to be running correctly. Here is a sample of the training logs for a few documents — perhaps you’ll notice something I missed? Despite this, the detected concepts still consistently return an accuracy and context_similarity of 1.

Training log
Maybe annotating name: hypothyroïdie
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: hypothyroïdie
NER detected an entity.
	Detected name: hypothyroïdie
	Link candidates: ['C0020676']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: hypothyroïdie
	Link candidates: ['C0020676']

Maybe annotating name: traitement
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: traitement
NER detected an entity.
	Detected name: traitement
	Link candidates: ['C2350609']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: traitement
	Link candidates: ['C2350609']

Maybe annotating name: traitement~par
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: traitement~par
NER detected an entity.
	Detected name: traitement~par
	Link candidates: ['C0678054', 'C1112342']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: traitement~par
	Link candidates: ['C0678054', 'C1112342']

Maybe annotating name: crampe
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: crampe
NER detected an entity.
	Detected name: crampe
	Link candidates: ['C0026821', 'C4324339']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: crampe
	Link candidates: ['C0026821', 'C4324339']

Maybe annotating name: fourmillements
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: fourmillements
NER detected an entity.
	Detected name: fourmillements
	Link candidates: ['C0016579']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: fourmillements
	Link candidates: ['C0016579']

Maybe annotating name: examen~physique
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: examen~physique
NER detected an entity.
	Detected name: examen~physique
	Link candidates: ['C0031809']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: examen~physique
	Link candidates: ['C0031809']
Maybe annotating name: absence
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: absence
NER detected an entity.
	Detected name: absence
	Link candidates: ['C0235956', 'C4316903']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: absence
	Link candidates: ['C0235956', 'C4316903']

[...]

Updating CUI: C0020676 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0020676 with negative=False
Updating CUI: C2350609 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609 with negative=False
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C0016579 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579 with negative=False
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0031809 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809 with negative=False
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0751115 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0751115 with negative=False
Updating CUI: C0027853 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0027853 with negative=False
Updating CUI: C0234146 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0234146 with negative=False
Updating CUI: C0151888 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888 with negative=False
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C1260928 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928 with negative=False
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C0853374 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374 with negative=False
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0235000 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000 with negative=False
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0240991 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991 with negative=False
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0200631 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0200631 with negative=False
Updating CUI: C0005778 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778 with negative=False
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0221423 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0221423 with negative=False
Updating CUI: C0024198 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0024198 with negative=False
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0553794 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0553794 with negative=False
Updating CUI: C0019348 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348 with negative=False
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0008049 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0008049 with negative=False
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0856955 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955 with negative=False
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0152025 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025 with negative=False
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0271681 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681 with negative=False
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C5208163 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5208163 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0194884 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0194884 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0441633 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0441633 with negative=False
Updating CUI: C0024485 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0024485 with negative=False
Updating CUI: C1518156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1518156 with negative=False
Updating CUI: C5849587 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587 with negative=False
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C0014038 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0014038 with negative=False
Updating CUI: C0338430 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0338430 with negative=False
Updating CUI: C0854581 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0854581 with negative=False
Updating CUI: C0021044 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0021044 with negative=False
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0041618 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618 with negative=False
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0014038 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0014038 with negative=False
Updating CUI: C0014038, with 0 negative words

[...]

INFO:medcat.cdb:{
  "Number of concepts": 62818,
  "Number of names": 121285,
  "Number of concepts that received training": 7811,
  "Number of seen training examples in total": 802217,
  "Average training examples per concept": 102.70349507105364
}

Do you think this could be due to a problem in the training phase, or that my training data (in terms of quantity or quality) might be insufficient?

On a related note, I’d also like to ask about contextualization — specifically regarding the detection of present/absent status for each identified concept. In the MedCAT tutorial for this step, it looks like a JSON file generated using medcattrainer is required. Can you confirm whether using this feature in MedCAT necessarily requires manually training a model via medcattrainer? Or is there another way to implement it?

Thanks again for your responsiveness. I’m really looking forward to fully using MedCAT in French and would be happy to share our results with the community.

It looks like the training worked: it trained 7811 concepts with a total of 802138 examples, which is 100+ examples per concept on average. However, this is likely to be distributed quite unevenly, with most of the concepts having seen only a few examples and a few having seen many. A sort of 80-20 rule (though it could also be 90-10), where 20% of the concepts may have received 80% of the training.
But the question is whether the concepts you need were trained, and whether the training they received was relevant and useful. If you’ve got a subset of concepts in mind that you know will be needed for the use case, I would look at the training count for these. For example:

cuis = ["73211009", "84757009"]  # list CUIs of interest
for cui in cuis:
    print(cui, ":", cat.cdb.cui2count_train.get(cui, 0))
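
To get a feel for how unevenly the training is spread, one could also compute the share of all training examples that went to the most-trained concepts. This is only a sketch: `top_share` is a hypothetical helper, and the counts below are made-up placeholders. In practice the input would be the same `cat.cdb.cui2count_train` mapping used above.

```python
def top_share(counts, fraction=0.2):
    """Share of all training examples held by the top `fraction` of trained concepts."""
    # Ignore concepts that never received training
    trained = sorted((c for c in counts.values() if c > 0), reverse=True)
    if not trained:
        return 0.0
    # Number of concepts in the top slice (at least one)
    k = max(1, round(len(trained) * fraction))
    return sum(trained[:k]) / sum(trained)


# Placeholder counts for illustration only (not real data)
example_counts = {"CUI_1": 800, "CUI_2": 100, "CUI_3": 50, "CUI_4": 30, "CUI_5": 20}
print(top_share(example_counts))  # share held by the top 20% of concepts
```

A value close to 1 would mean that a small group of concepts absorbed almost all of the training, in which case the per-concept average says little about the concepts you actually care about.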

If the training count for the concepts you were interested in is low (or 0), there could be 2 reasons:

  • The dataset(s) used for training didn’t contain the concept(s)
  • The dataset(s) did contain the relevant concepts, but MedCAT was unable to disambiguate them
    • When one name can refer to multiple concepts, MedCAT can do disambiguation
    • However, it can only do that if the concepts have sufficient training (above config.linking.train_count_threshold, which in your case looks to be 10)
      • If the concepts don’t have sufficient training, a similarity of -1 is recorded
    • In order for a concept to get training, it would need to have names that are unique to that concept
      • So if there’s a name that refers to 2 concepts, but these 2 concepts only have that one name, then this will never be able to be trained in a self-supervised manner
    • So if this is the case (which is somewhat likely), you have a few options
      • Enrich your CDB with more names before training
        • Hopefully after enrichment all concepts have a unique name
        • We’ve used UMLS enrichment before, so that may work
        • You can see this for some reference
      • Train on the same data with multiple epochs
        • It’s possible that there are unique names for these concepts and they did receive training
        • But the threshold wasn’t reached so disambiguation can’t be done
        • Running multiple epochs can allow reaching the threshold
        • And subsequently even train on the ambiguous examples
        • However, multiple epochs can also run the risk of overfitting
      • Do some supervised training
        • In supervised training we don’t have issues with the disambiguation
        • Because the annotator will have told us the ground truth
        • Though this is - of course - quite a bit more manual work
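
To see how many concepts are affected by the unique-name problem described above, a rough check is to list the concepts whose every name is shared with another concept. The sketch below works on a plain name-to-CUIs dictionary. It assumes the CDB exposes such a mapping (as `cat.cdb.name2cuis`); that attribute name is an assumption to verify against your MedCAT version. `cuis_without_unique_name` is a hypothetical helper, and the example mapping uses placeholder names and CUIs.

```python
def cuis_without_unique_name(name2cuis):
    """Return the CUIs for which every name is also a name of another CUI."""
    all_cuis = set()
    has_unique_name = set()
    for cuis in name2cuis.values():
        cuis = set(cuis)
        all_cuis |= cuis
        if len(cuis) == 1:
            # This name points to exactly one concept, so it is unambiguous
            has_unique_name |= cuis
    return all_cuis - has_unique_name


# Placeholder mapping for illustration only (not real CDB content)
example_name2cuis = {
    "name_a": ["CUI_1"],
    "name_b": ["CUI_1", "CUI_2"],
    "name_c": ["CUI_3", "CUI_4"],
}
print(sorted(cuis_without_unique_name(example_name2cuis)))
```

Concepts returned by this check cannot start receiving self-supervised training on their own, so they are the natural candidates for CDB enrichment or supervised training. Note that a concept with a unique name can still fall short of the training threshold if that name is rare in the training data.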

Regarding the present/absent status: we call these MetaAnnotations, and they’re handled by MetaCAT instances. There are a few tutorials on these as well. The MetaCAT models do not (generally) have a self-supervised training method, so they are indeed likely to need supervised training.
With that said, the supervised training doesn’t necessarily need to be done in MedCATtrainer. Any data that’s in the expected format will work.