MedCAT French model only matches exact terms - accuracy and context_similarity always 1

Hello,

I recently discovered MedCAT with great interest, and I would like to use it for a research project involving a French psychiatric EHR. I managed to create a model that extracts concepts with reasonably good relevance, by following both the tutorials available here and the work that has been done in Dutch. However, when I use this model (trained on various types of data), all identified concepts have an accuracy and a context_similarity of 1.

It seems like my model is only performing exact string matching against the CDB, which undermines the real strength of MedCAT. For reference, here are the parameters I experimented with, although the results remain unchanged:

  • Vocab file: I tried both French FastText and Word2Vec embeddings
  • CDB file: I built a dictionary using the French terms from UMLS as well as the French version of SNOMED-CT. The dictionary contains only French terms (~350k terms for 190k concepts)
  • Model training: I used 2,000 freely available medical documents from the Frasimed corpus, as well as my own clinical dataset (~21k documents). No difference was observed before or after training
  • Documents to annotate: I tested the model on real clinical notes, web-based texts, and manually written examples

Do you have any idea what might be causing this behavior in my model? Here are the configuration options, along with the training stats:

Config(version=VersionInfo(history=, meta_cats=, cdb_info={‘Number of concepts’: 187229, ‘Number of names’: 275420, ‘Number of concepts that received training’: 0, ‘Number of seen training examples in total’: 0, ‘Average training examples per concept’: 0.0}, performance={‘ner’: {}, ‘meta’: {}}, description=‘No description’, id=‘be1bbfb146671ccb’, last_modified=‘29 April 2025’, location=None, ontology=None, medcat_version=‘1.15.0’), cdb_maker=CDBMaker(name_versions=[‘LOWER’, ‘CLEAN’], multi_separator=‘|’, remove_parenthesis=5, min_letters_required=2), annotation_output=AnnotationOutput(doc_extended_info=False, context_left=-1, context_right=-1, lowercase_context=True, include_text_in_output=False), general=General(spacy_disabled_components=[‘ner’, ‘parser’, ‘vectors’, ‘textcat’, ‘entity_linker’, ‘sentencizer’, ‘entity_ruler’, ‘merge_noun_chunks’, ‘merge_entities’, ‘merge_subtokens’], checkpoint=CheckPoint(output_dir=‘checkpoints’, steps=None, max_to_keep=1), usage_monitor=UsageMonitor(enabled=False, batch_size=100, file_prefix=‘usage_’, log_folder=‘.’), log_level=20, log_format=‘%(levelname)s:%(name)s: %(message)s’, log_path=‘./medcat.log’, spacy_model=‘fr_core_news_lg’, separator=‘~’, spell_check=True, diacritics=True, spell_check_deep=False, spell_check_len_limit=7, show_nested_entities=False, full_unlink=True, workers=7, make_pretty_labels=‘long’, map_cui_to_group=False, simple_hash=False), preprocessing=Preprocessing(words_to_skip={‘nos’}, keep_punct={‘.’, ‘:’}, do_not_normalize={‘VBN’, ‘VBG’, ‘VBP’, ‘VBD’, ‘JJR’, ‘JJS’}, skip_stopwords=False, min_len_normalize=5, stopwords=None, max_document_length=1000000), ner=Ner(min_name_len=2, max_skip_tokens=2, check_upper_case_names=False, upper_case_limit_len=3, try_reverse_word_order=False), linking=Linking(optim={‘type’: ‘linear’, ‘base_lr’: 1, ‘min_lr’: 5e-05}, context_vector_sizes={‘xlong’: 27, ‘long’: 18, ‘medium’: 9, ‘short’: 3}, context_vector_weights={‘xlong’: 0.1, ‘long’: 0.4, ‘medium’: 0.4, ‘short’: 0.1}, 
filters=LinkingFilters(cuis=set(), cuis_exclude=set()), train=False, random_replacement_unsupervised=0.8, disamb_length_limit=5, filter_before_disamb=False, train_count_threshold=10, always_calculate_similarity=False, calculate_dynamic_threshold=False, similarity_threshold_type=‘static’, similarity_threshold=0.3, negative_probability=0.5, negative_ignore_punct_and_num=True, prefer_primary_name=0.35, prefer_frequent_concepts=0.35, subsample_after=30000, devalue_linked_concepts=False, context_ignore_center_tokens=False, disamb=True), word_skipper=re.compile(‘^(nos)$’), punct_checker=re.compile(‘[^a-z0-9]+’), hash=‘79af0f569b59f28f’)

Thank you in advance, and congratulations on this incredible work that will help advance research on free-text clinical data.

Hi. Thanks for the interest in MedCAT.

We haven’t worked with a French model ourselves, but in principle there is no reason MedCAT shouldn’t work with French, given a CDB built on French concepts.

With that said, the config you’ve provided clearly states:

‘Number of concepts that received training’: 0, ‘Number of seen training examples in total’: 0

Now, this could simply be because you’ve not saved the model after it’s been trained (the cdb_info is only set at save time).
But just to make sure, you can always manually call cat.cdb.make_stats() to find out whether or how many of your concepts have seen training examples.
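As a rough illustration of what to look for in those stats, here is a small sketch. It assumes that the dictionary returned by make_stats() uses the same keys as the cdb_info in your config; the helper name trained_fraction is made up for this example and is not part of MedCAT.

```python
def trained_fraction(stats: dict) -> float:
    """Return the share of concepts that have seen at least one training example."""
    n_concepts = stats.get("Number of concepts", 0)
    n_trained = stats.get("Number of concepts that received training", 0)
    return n_trained / n_concepts if n_concepts else 0.0


# Values copied from the cdb_info in the posted config.
# With a loaded model you would pass cat.cdb.make_stats() instead
# (assumed here to return a dict with these keys).
stats = {
    "Number of concepts": 187229,
    "Number of names": 275420,
    "Number of concepts that received training": 0,
    "Number of seen training examples in total": 0,
    "Average training examples per concept": 0.0,
}
print(f"{trained_fraction(stats):.1%} of concepts received training")
```

If this still reports 0% on a model you have trained and saved, the training itself is the first thing to investigate.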

In general, here’s how I would imagine the workflow for creating a French language model of MedCAT:

  • Download and preprocess the French version of an ontology (e.g. SNOMED-CT)
    • If an English version is used instead, many terms are likely to not match
  • Create a CDB based on the preprocessed French language ontology
  • Create Vocab that’s got word embeddings for words relevant in the French language
    • The Vocab we provide with our model comes with words in the English language
    • As such, many of these will probably not be in the French language texts
    • And even if they are, they’re unlikely to have the exact same meaning
      • At least in terms of relative meaning between different words
  • Create a relevant config
    • You need to use a French spacy model for it to correctly tokenize the language
    • You may want to consider allowing diacritics for spell checking and word splitting
      • This will allow letters 'àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ' on top of the regular Latin alphabet
  • Create a model pack (CAT) based on the above
  • Perform self-supervised training on some relevant data in French
    • The closer the data is to the final use case, the better, of course
  • Save model on file for later use
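To make the diacritics point from the config step concrete, here is a small standalone sketch using only Python's re module. The first pattern is the punct_checker shown in your posted config; the second one, extended with the letters listed above, is built here purely for illustration. How MedCAT applies these patterns internally is not shown, and the helper name is hypothetical.

```python
import re

# Letters listed in the workflow above.
DIACRITICS = "àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"

# Pattern shown as punct_checker in the posted config.
ascii_only = re.compile(r"[^a-z0-9]+")
# Hypothetical extended pattern, built here only for illustration.
with_diacritics = re.compile(f"[^a-z0-9{DIACRITICS}]+")


def non_letter_chunks(pattern: re.Pattern, word: str) -> list:
    """Return the substrings of `word` that the pattern treats as non-alphanumeric."""
    return pattern.findall(word.lower())


# Words taken from the kind of French clinical text being annotated.
for word in ["hypothyroïdie", "crampe", "fourmillements"]:
    print(word, non_letter_chunks(ascii_only, word), non_letter_chunks(with_diacritics, word))
```

With the ASCII-only pattern the "ï" in "hypothyroïdie" counts as a non-letter character, while the extended pattern accepts the whole word.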

Now, it does seem like you have actually done all of the above. But I wanted to write it down just to make sure we’re on the same page.

So if, to the best of your knowledge, all of the above has been done, these are the troubleshooting steps I would take:

  • Look at whether/how many concepts have seen training (see cdb.make_stats() above)
    • If there’s been no training on anything, even after the 23k documents, we’d need to look at why that is
    • If there’s training, you might find that it’s not on as many concepts as you’d have hoped

By the way, 23 000 documents (depending on the size of the documents) may not actually be enough for the model to learn everything. Though it should certainly learn some things.

If you’re convinced that the training process has run, but few (or no) concepts have received training, perhaps you could add some logging to the training run, for example by enabling debug-level output for the medcat loggers. Some things in there may be failing quietly without notifying you.
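A minimal sketch of such a logging setup, using only Python's standard logging module. It assumes MedCAT's loggers live under the "medcat" name (as in medcat.cdb) and reuses the log_format string from your config; adjust as needed.

```python
import logging
import sys

# "medcat" is the parent of loggers such as medcat.cdb, so configuring it covers them all.
medcat_logger = logging.getLogger("medcat")
medcat_logger.setLevel(logging.DEBUG)

# Send records to stdout. The guard avoids adding a second handler when the cell
# is re-run, since duplicate handlers make every message appear twice.
if not any(isinstance(h, logging.StreamHandler) for h in medcat_logger.handlers):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s: %(message)s"))
    medcat_logger.addHandler(handler)

# Run the training after this point, e.g. cat.train(...)
```

Debug output is very verbose, so it is probably best to try this on a handful of documents first.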

Now you should be able to tell where the model fails. Whether it never picks up the terms at all, or whether it fails to link them to the appropriate concept. Or whether something is failing and the failure is only visible in the logs.

Let me know how you get along.

Hello,

Thank you for your quick and detailed reply. I realize I forgot to include the details of the training I performed. Here it is: I trained the model on approximately 27k documents (2k + 21k + 2k + 300), all focused on psychiatry, with varied origins and lengths.

INFO:medcat.cdb:{
  "Number of concepts": 62818,
  "Number of names": 121285,
  "Number of concepts that received training": 7811,
  "Number of seen training examples in total": 802138,
  "Average training examples per concept": 102.6933811291768
}

Here is the code I used to compile the .txt files into a DataFrame, and then to train the model:

import os

import pandas as pd

# Assumes `cat` is an already created/loaded MedCAT CAT instance.
data_dirs = [
    '/Users/myaccount/Documents/txt/',
]

# Read every .txt file, one record per document, with newlines flattened to spaces
documents = []
for data_dir in data_dirs:
    for filename in os.listdir(data_dir):
        if filename.endswith(".txt"):
            with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
                text = f.read().replace('\n', ' ')
                documents.append({'text': text})

# Create a dataframe
df = pd.DataFrame(documents)
print(df.head(1))

# Train (self-supervised) and print the CDB stats afterwards
print(f"There are {len(df['text'])} documents to train...")
cat.train(df.text, progress_print=100)
cat.cdb.print_stats()

I did follow the workflow you mentioned — and I would be very happy to share my files with the community once everything is functional. I also followed your advice to analyze the training process, and it appears to be running correctly. Here is a sample of the training logs for a few documents — perhaps you’ll notice something I missed? Despite this, the detected concepts still consistently return an accuracy and context_similarity of 1.

Training log
Maybe annotating name: hypothyroïdie
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: hypothyroïdie
NER detected an entity.
	Detected name: hypothyroïdie
	Link candidates: ['C0020676']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: hypothyroïdie
	Link candidates: ['C0020676']

Maybe annotating name: traitement
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: traitement
NER detected an entity.
	Detected name: traitement
	Link candidates: ['C2350609']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: traitement
	Link candidates: ['C2350609']

Maybe annotating name: traitement~par
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: traitement~par
NER detected an entity.
	Detected name: traitement~par
	Link candidates: ['C0678054', 'C1112342']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: traitement~par
	Link candidates: ['C0678054', 'C1112342']

Maybe annotating name: crampe
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: crampe
NER detected an entity.
	Detected name: crampe
	Link candidates: ['C0026821', 'C4324339']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: crampe
	Link candidates: ['C0026821', 'C4324339']

Maybe annotating name: fourmillements
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: fourmillements
NER detected an entity.
	Detected name: fourmillements
	Link candidates: ['C0016579']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: fourmillements
	Link candidates: ['C0016579']

Maybe annotating name: examen~physique
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: examen~physique
NER detected an entity.
	Detected name: examen~physique
	Link candidates: ['C0031809']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: examen~physique
	Link candidates: ['C0031809']
Maybe annotating name: absence
DEBUG:medcat.ner.vocab_based_annotator:Maybe annotating name: absence
NER detected an entity.
	Detected name: absence
	Link candidates: ['C0235956', 'C4316903']

DEBUG:medcat.ner.vocab_based_annotator:NER detected an entity.
	Detected name: absence
	Link candidates: ['C0235956', 'C4316903']

[...]

Updating CUI: C0020676 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0020676 with negative=False
Updating CUI: C2350609 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609 with negative=False
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C2350609, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C2350609, with 0 negative words
Updating CUI: C0016579 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579 with negative=False
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0016579, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0016579, with 0 negative words
Updating CUI: C0031809 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809 with negative=False
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0031809, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0031809, with 0 negative words
Updating CUI: C0751115 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0751115 with negative=False
Updating CUI: C0027853 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0027853 with negative=False
Updating CUI: C0234146 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0234146 with negative=False
Updating CUI: C0151888 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888 with negative=False
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C0151888, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0151888, with 0 negative words
Updating CUI: C1260928 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928 with negative=False
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C1260928, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1260928, with 0 negative words
Updating CUI: C0853374 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374 with negative=False
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0853374, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0853374, with 0 negative words
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0235000 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000 with negative=False
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0235000, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0235000, with 0 negative words
Updating CUI: C0240991 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991 with negative=False
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0240991, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0240991, with 0 negative words
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0200631 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0200631 with negative=False
Updating CUI: C0005778 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778 with negative=False
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0005778, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0005778, with 0 negative words
Updating CUI: C0221423 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0221423 with negative=False
Updating CUI: C0024198 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0024198 with negative=False
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0553794 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0553794 with negative=False
Updating CUI: C0019348 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348 with negative=False
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0019348, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0019348, with 0 negative words
Updating CUI: C0008049 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0008049 with negative=False
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0856955 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955 with negative=False
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0856955, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856955, with 0 negative words
Updating CUI: C0152025 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025 with negative=False
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0152025, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0152025, with 0 negative words
Updating CUI: C0271681 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681 with negative=False
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0271681, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0271681, with 0 negative words
Updating CUI: C0541939 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939 with negative=False
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0541939, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0541939, with 0 negative words
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C5208163 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5208163 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0040405 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405 with negative=False
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0040405, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0040405, with 0 negative words
Updating CUI: C0194884 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0194884 with negative=False
Updating CUI: C0497156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156 with negative=False
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0497156, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0497156, with 0 negative words
Updating CUI: C0441633 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0441633 with negative=False
Updating CUI: C0024485 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0024485 with negative=False
Updating CUI: C1518156 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C1518156 with negative=False
Updating CUI: C5849587 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587 with negative=False
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C5849587, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C5849587, with 0 negative words
Updating CUI: C0014038 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0014038 with negative=False
Updating CUI: C0338430 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0338430 with negative=False
Updating CUI: C0854581 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0854581 with negative=False
Updating CUI: C0021044 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0021044 with negative=False
Updating CUI: C0856592 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592 with negative=False
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856592, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856592, with 0 negative words
Updating CUI: C0856593 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593 with negative=False
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0856593, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0856593, with 0 negative words
Updating CUI: C0041618 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618 with negative=False
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0041618, with 0 negative words
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0041618, with 0 negative words
Updating CUI: C0014038 with negative=False
DEBUG:medcat.linking.vector_context_model:Updating CUI: C0014038 with negative=False
Updating CUI: C0014038, with 0 negative words

[...]

INFO:medcat.cdb:{
  "Number of concepts": 62818,
  "Number of names": 121285,
  "Number of concepts that received training": 7811,
  "Number of seen training examples in total": 802217,
  "Average training examples per concept": 102.70349507105364
}

Do you think this could be due to a problem in the training phase, or that my training data (in terms of quantity or quality) might be insufficient?

On a related note, I’d also like to ask about contextualization — specifically regarding the detection of present/absent status for each identified concept. In the MedCAT tutorial for this step, it looks like a JSON file generated using medcattrainer is required. Can you confirm whether using this feature in MedCAT necessarily requires manually training a model via medcattrainer? Or is there another way to implement it?

Thanks again for your responsiveness. I’m really looking forward to fully using MedCAT in French and would be happy to share our results with the community.

It looks like the training worked: 7811 concepts received training, with 802138 examples in total, so 100+ examples per concept on average. That is likely to be distributed quite unevenly, though, with most concepts having seen only a few examples and a few having seen many. A sort of 80-20 rule (though it could also be 90-10), where 20% of the concepts may have received 80% of the training.
But the question is whether the concepts you need were trained, and whether the training they received was relevant and useful. If you’ve got a subset of concepts in mind that you know will be needed for the use case, I would look at the training count for these. For example:

cuis = ["73211009", "84757009"]  # list CUIs of interest
for cui in cuis:
    print(cui, ":", cat.cdb.cui2count_train.get(cui, 0))
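
To check how unevenly the training is spread (the 80-20 point above), something along these lines may help. It is only a sketch: the helper name training_share_of_top is made up for illustration, and the toy dictionary stands in for cat.cdb.cui2count_train.

```python
from typing import Dict


def training_share_of_top(counts: Dict[str, int], top_fraction: float = 0.2) -> float:
    """Share of all training examples seen by the most-trained `top_fraction` of concepts.

    Only concepts with at least one training example are considered.
    """
    trained = sorted((c for c in counts.values() if c > 0), reverse=True)
    total = sum(trained)
    if total == 0:
        return 0.0
    # Always look at one concept at minimum, even for tiny inputs
    n_top = max(1, int(len(trained) * top_fraction))
    return sum(trained[:n_top]) / total


# Toy counts (made up) standing in for cat.cdb.cui2count_train
toy_counts = {"A": 800, "B": 100, "C": 50, "D": 30, "E": 20, "F": 0}
share = training_share_of_top(toy_counts)
print(f"Top 20% of trained concepts received {share:.0%} of the examples")
```

With a real model you would pass cat.cdb.cui2count_train instead of toy_counts.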

If the training count for the concepts you were interested in is low (or 0), there could be 2 reasons:

  • The dataset(s) used for training didn’t contain the concept(s)
  • The dataset(s) did contain the relevant concepts, but MedCAT was unable to disambiguate them
    • When one name can refer to multiple concepts, MedCAT can do disambiguation
    • However, it can only do that if the concepts have sufficient training (above config.linking.train_count_threshold, which in your case looks to be 10)
      • If the concepts don’t have sufficient training, a similarity of -1 is recorded
    • In order for a concept to get training, it would need to have names that are unique to that concept
      • So if there’s a name that refers to 2 concepts, but these 2 concepts only have that one name, then this will never be able to be trained in a self-supervised manner
    • So if this is the case (which is somewhat likely), you have a few options
      • Enrich your CDB with more names before training
        • Hopefully after enrichment all concepts have a unique name
        • We’ve used UMLS enrichment before, so that may work
        • You can see this for some reference
      • Train on the same data with multiple epochs
        • It’s possible that there are unique names for these concepts and they did receive training
        • But the threshold wasn’t reached so disambiguation can’t be done
        • Running multiple epochs can allow reaching the threshold
        • And subsequently even train on the ambiguous examples
        • However, multiple epochs can also run the risk of overfitting
      • Do some supervised training
        • In supervised training we don’t have issues with the disambiguation
        • Because the annotator will have told us the ground truth
        • Though this is - of course - quite a bit more manual work
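
To see which concepts can never receive self-supervised training because none of their names is unique, a rough check could look like the sketch below. It assumes a name-to-CUI-list mapping shaped like cdb.name2cuis; the function name and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def cuis_without_unique_name(name2cuis: Dict[str, List[str]]) -> Set[str]:
    """Return the CUIs for which every name is shared with at least one other CUI."""
    all_cuis = {cui for cuis in name2cuis.values() for cui in cuis}
    # A name is unique when it maps to exactly one distinct CUI
    has_unique = {cuis[0] for cuis in name2cuis.values() if len(set(cuis)) == 1}
    return all_cuis - has_unique


# Toy mapping (made-up CUIs) standing in for cdb.name2cuis
toy_name2cuis = {
    "angoisse": ["C1", "C2"],
    "angoisses": ["C2"],
    "anxiété": ["C1", "C3"],
}
untrainable = cuis_without_unique_name(toy_name2cuis)
print(sorted(untrainable))
```

Any CUI this returns would need either an extra unique name in the CDB or supervised examples.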

Regarding the present/absent status: we call these MetaAnnotations. They’re handled by MetaCAT instances, and there are a few tutorials on these as well. The MetaCAT models do not (generally) have a self-supervised training method, so they are indeed likely to need supervised training.
With that said, the supervised training doesn’t necessarily need to be done in MedCATtrainer. Any data that’s in the expected format will work.
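
As a rough illustration of producing such data without MedCATtrainer, the sketch below wraps your own annotated spans into a trainer-style export dictionary. The field names (projects, documents, annotations, meta_anns with name/value) reflect my understanding of the export layout and should be checked against the MetaCAT tutorial before use. The helper name, the project name, the CUI and the "Presence" task with its "Absent" label are all made up.

```python
import json
from typing import Dict, List


def build_export(docs: List[Dict], project_name: str = "manual-annotations") -> Dict:
    """Wrap annotated documents in a MedCATtrainer-style export dict (assumed layout)."""
    documents = []
    for doc in docs:
        text = doc["text"]
        annotations = []
        for ann in doc["annotations"]:
            start, end = ann["start"], ann["end"]
            # Guard against offsets that do not point inside the text
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"Bad span {start}-{end} for text of length {len(text)}")
            annotations.append({
                "cui": ann["cui"],
                "start": start,
                "end": end,
                "value": text[start:end],
                # One entry per meta-annotation task, e.g. presence status
                "meta_anns": {
                    task: {"name": task, "value": value}
                    for task, value in ann["meta"].items()
                },
            })
        documents.append({"text": text, "annotations": annotations})
    return {"projects": [{"name": project_name, "documents": documents}]}


text = "Pas d'angoisse ce jour."
start = text.index("angoisse")
export = build_export([{
    "text": text,
    "annotations": [{
        "cui": "C1",  # made-up CUI
        "start": start,
        "end": start + len("angoisse"),
        "meta": {"Presence": "Absent"},  # hypothetical task and label
    }],
}])
print(json.dumps(export, ensure_ascii=False, indent=2))
```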

Hello,

Thanks again for your very detailed and quick reply. I followed all your recommendations, but unfortunately without success. I mainly worked on enriching my CDB, doubling the number of terms and ensuring that each name is unique (previously, multiple CUIs could share the same name). I also expanded my training dataset by downloading over 10k medical articles from Wikipedia — but this didn’t make any difference.

The result is that my model is able to extract very relevant concepts, which will be useful for my ongoing work, but the accuracy and context_similarity values remain stuck at 1, regardless of the text or concept.

As a comparison, I downloaded the English “UMLS small” model available on your portal, reset the training, and ran it on 1,000 reports from the MIMIC corpus. Most of the concepts identified by this model have an accuracy value different from 1.

I’m running out of ideas for identifying what might be wrong with my unsupervised model. Would it be possible to send you the relevant files from my model so you could help me investigate the source of the issue?

Thanks again for your time and support.

Thomas

Hi!

Sorry for the delayed response - I was on leave last week.

Let me make sure I understand this correctly. Does this mean each concept only has unique names? Because if that’s the case, then you will always see an accuracy of 1 - there is no need to disambiguate if there’s nothing ambiguous. But it would also mean ambiguous names would never be recognised since they’re not a part of the CDB.
Hopefully what you meant was that each concept does in fact have at least one unique name. But most of them will also have ambiguous names.

Given that you said you doubled the number of terms and didn’t mention removing any, I’ll assume it’s the latter.

A few questions here:

  • Was the model able to extract all or just some of the concepts?
  • Were the names that were extracted ambiguous? I.e., are there multiple CUIs for them in cdb.name2cuis / cdb.name2cuis2status?
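
One way to answer the second question is to compare the detected names with the name-to-CUI mapping. This is only a sketch: it assumes the entity output is a dictionary with an "entities" entry whose items carry a "detected_name" key (as I'd expect from cat.get_entities), and the function name and toy data are hypothetical.

```python
from typing import Dict, List


def ambiguous_detected_names(entities: Dict, name2cuis: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each detected name that has more than one candidate CUI to its candidates."""
    result: Dict[str, List[str]] = {}
    for ent in entities.get("entities", {}).values():
        name = ent["detected_name"]
        candidates = sorted(set(name2cuis.get(name, [])))
        if len(candidates) > 1:
            result[name] = candidates
    return result


# Toy data (made-up CUIs) standing in for cat.get_entities(text) and cdb.name2cuis
toy_entities = {"entities": {
    0: {"detected_name": "angoisse", "cui": "C1"},
    1: {"detected_name": "crise~d~angoisse", "cui": "C4"},
}}
toy_name2cuis = {"angoisse": ["C1", "C2"], "crise~d~angoisse": ["C4"]}
ambiguous = ambiguous_detected_names(toy_entities, toy_name2cuis)
print(ambiguous)
```

If this comes back empty for all of your documents, no disambiguation is happening, so there is nothing for the context model to score.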

I’ve got a feeling that the model was only able to extract unique names, and not those that weren’t unique. Though I may be wrong.

If you like, I can take a look. You’d need to send me a link to the model, as well as a list of the concept IDs you’re interested in. If you’re able, you can also send me a text or 2 to try. But I can have an LLM generate something as well (won’t be excellent, but better than nothing). You should be able to use the message functionality on my profile here, I believe.

Hello,

Apologies for the lack of clarity regarding the CDB. I actually compiled two databases: the UMLS, which includes 7 French-language sources, and the French version of SNOMED-CT. Since the identifiers (CUIs) were not the same across these sources, I merged concepts that had at least one name in common under a single CUI. This means each name maps to a single CUI, but each CUI may have several names, with at least one unique name. I also randomly assigned one name as “P” and the others as “A” in the status column.

That said, the error I’m encountering was already present when a single name/term could map to multiple CUIs, regardless of how many names were associated with each CUI.

I also noticed that spaCy seems to generate multiple variations for a given name (e.g., plural forms, gendered forms), which could cause a name to appear across multiple CUIs. To work around this, I imported the concepts using the English spaCy model, then switched to the French model for training and extraction — without any noticeable difference.

I would say that all concepts are being recognized correctly, though a few I expected to find are missing from the CDB (such as certain psychiatric semiology terms). Each name in the CDB is unique, so the cdb.name2cuis and cdb.name2cuis2status mappings contain only a single CUI per name.

However, some names are very similar — differing only by plural forms or slightly modified expressions — which I consider a form of ambiguity. Here’s an example with code for the word “anxiety” in French:

term = "angoisse"
matched = []

for cui, names in cdb.cui2names.items():
    for name in names:
        if term.lower() in name.lower():
            statut = cdb.name2cuis2status.get(name, {}).get(cui, "Inconnu")
            matched.append((cui, name, statut))

for cui, name, statut in matched:
    print(f"CUI : {cui} | Names : {name} | Statut : {statut}")

And the result:

CUI : 198288003 | Names : angoisse | Statut : A
CUI : 304896009 | Names : angoisse~de~castration | Statut : A
CUI : C0003477 | Names : angoisse~de~la~séparation | Statut : P
CUI : C0003477 | Names : trouble~de~l~angoisse~de~séparation | Statut : A
CUI : C0003477 | Names : angoisse~de~séparation | Statut : A
CUI : C0003477 | Names : trouble~de~l~angoisse~de~la~séparation | Statut : A
CUI : C0005604 | Names : angoisse~de~la~naissance | Statut : A
CUI : 25501002 | Names : trouble~d~angoisse~sociale | Statut : P
CUI : 25501002 | Names : trouble~d~angoisse~social | Statut : P
CUI : 225630000 | Names : angoisse~liée~aux~soin~dentaire | Statut : A
CUI : 225630000 | Names : angoisse~liée~aux~soins~dentaires | Statut : A
CUI : C0235101 | Names : angoisses | Statut : A
CUI : C0235106 | Names : angoisse~de~la~mort | Statut : A
CUI : C0235106 | Names : angoisse~devant~la~mort | Statut : P
CUI : C0235111 | Names : angoisse~complexe | Statut : A
CUI : C0262377 | Names : angoisse~situationnel | Statut : P
CUI : C0262377 | Names : angoisse~situationnelle | Statut : P
CUI : 231504006 | Names : dépression~liée~a~l~angoisse | Statut : P
CUI : 279622009 | Names : angoisse~de~performance | Statut : A
CUI : 279622009 | Names : angoisse~de~la~performance | Statut : A
CUI : 300895004 | Names : crise~d~angoisse | Statut : A
CUI : C0857238 | Names : patient~introspectif~et~susceptible~a~l~angoisse | Statut : P
CUI : C0859031 | Names : visage~angoisser | Statut : P
CUI : C0860603 | Names : symptôme~d~angoisse~sai | Statut : P
CUI : C0860603 | Names : symptômes~d~angoisse~sai | Statut : P
CUI : C1279420 | Names : angoisse~névrose~d | Statut : A
CUI : C1279420 | Names : névrose~d~angoisse | Statut : P
CUI : C1279420 | Names : névroses~d~angoisse | Statut : P
CUI : C1279420 | Names : nevrose~d~angoisse | Statut : A
CUI : 11806006 | Names : trouble~d~angoisse~de~séparation~de~l~enfance | Statut : P
CUI : 35429005 | Names : angoisse~d~anticipation | Statut : P
CUI : 81350009 | Names : angoisse~flottante | Statut : P
CUI : 85061001 | Names : trouble~d~angoisse~de~séparation~de~l~enfance~d~apparition~précoce | Statut : P

198288003 and C0235101 are very close: the name of the second is simply the plural of the first.

Thank you in advance. I’ve sent all the necessary files through the KCL contact page, since messaging appears to be disabled on the forum.

Best regards,

Thomas

I will take a little bit of a closer look.

But at first glance, the issue seems to be the extra concepts from UMLS. You don’t generally want concepts from both UMLS and Snomed. Especially given that there’s some overlap (NOTE: in certain fine-grained situations it may be useful, but not in general).

What you want to do instead is (roughly):

  • Start with a SnomedCT release of your choice and preprocess for CDB creation
  • Use a UMLS release (MRCONSO.RRF should suffice) and filter for your language
  • Find all the UMLS CUIs that correspond to the SnomedCT codes (SCUI column in UMLS) - i.e the relevant CUIs
  • Create a mapping from the UMLS CUIs to Snomed-CT codes based on above
  • Filter the whole data frame based on the relevant CUIs
  • Add a SnomedCT code to the dataframe
  • Merge with existing Snomed data frame
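
The steps above can be sketched roughly as follows, using plain dictionaries instead of data frames to keep it self-contained. The UMLS keys (CUI, LAT, SAB, SCUI, STR) are MRCONSO.RRF column names and the output keys (cui, name, name_status) follow the CDB CSV columns. The function name and toy rows are made up, and building the CUI-to-SNOMED mapping from the SNOMED-sourced rows of the unfiltered UMLS table is my assumption.

```python
from collections import defaultdict
from typing import Dict, List, Set


def enrich_snomed_with_umls(snomed_rows: List[Dict[str, str]],
                            umls_rows: List[Dict[str, str]],
                            language: str = "FRE") -> List[Dict[str, str]]:
    """Add UMLS names in `language` to existing SNOMED concepts, never new concepts."""
    # UMLS CUI -> SNOMED codes, taken from SNOMED-sourced rows (SCUI holds the code)
    cui2snomed: Dict[str, Set[str]] = defaultdict(set)
    for row in umls_rows:
        if row["SAB"].startswith("SNOMEDCT") and row["SCUI"]:
            cui2snomed[row["CUI"]].add(row["SCUI"])

    known_codes = {r["cui"] for r in snomed_rows}
    existing = {(r["cui"], r["name"].lower()) for r in snomed_rows}
    enriched = list(snomed_rows)
    for row in umls_rows:
        if row["LAT"] != language:
            continue
        for code in sorted(cui2snomed.get(row["CUI"], ())):
            key = (code, row["STR"].lower())
            # Skip codes absent from the SNOMED table and names already present
            if code not in known_codes or key in existing:
                continue
            existing.add(key)
            enriched.append({"cui": code, "name": row["STR"], "name_status": "A"})
    return enriched


# Toy rows with made-up identifiers
snomed = [{"cui": "S100", "name": "angoisse", "name_status": "P"}]
umls = [
    {"CUI": "U1", "LAT": "ENG", "SAB": "SNOMEDCT_US", "SCUI": "S100", "STR": "Anxiety"},
    {"CUI": "U1", "LAT": "FRE", "SAB": "MSHFRE", "SCUI": "", "STR": "état anxieux"},
    {"CUI": "U2", "LAT": "FRE", "SAB": "MSHFRE", "SCUI": "", "STR": "fièvre"},
]
enriched = enrich_snomed_with_umls(snomed, umls)
for r in enriched:
    print(r)
```

In the toy data, "fièvre" is dropped because its UMLS CUI has no SNOMED counterpart, so only names are added and the set of concepts stays the same.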

The idea here is that you only add names to the CDB, not new concepts. Because otherwise you end up having concepts that refer to the same thing, and the model is unlikely to be able to differentiate between them.
For reference, you can take a look at this:

(look for the "Optional: Enrich with UMLS terms." section).