Adding custom concepts / synonyms to cat

I am working in an area where a lot of slang and jargon are used. I am trying to add these terms to an existing MedCAT model using add_and_train_concepts() without providing a spacy_doc, since I don’t have annotated training documents for these new terms.

As expected, the concepts are registered in the CDB, but get_entities doesn’t detect them in the text - likely because they have no context vectors, causing the linker’s similarity score to be 0.

My questions:

  1. Is there a recommended way to add new surface forms/synonyms to an existing model without annotated training data?
  2. Is setting context_similarity_threshold = 0 and min_count = 0 the intended approach, or does it risk degrading performance on existing trained concepts?
  3. Is there another way to inherit context vectors from a related CUI (e.g., copying vectors from “methamphetamine” to “ice”)?

I realise there’s a thread asking a similar question, but I couldn’t get the answer I wanted.

Thanks in advance.

I will assume you’re using the latest MedCAT (2.5+). If you’re using <2 let me know and I can give you some further guidance.

There can be a distinct difference between adding new concepts and enriching with new synonyms.

With that said, whichever you're doing, the most “natural” way to add new names and concepts is to use the CDBMaker class. The reason I say this is that you can pass a DataFrame or even a file path to a method of this class, and it will handle all the nitty-gritty details of token splitting, adding names and subnames, and all other relevant information.

These are the columns it can use:
‘cui’, ‘name’, ‘ontologies’, ‘name_status’, ‘type_ids’, ‘description’
NOTE: Only cui and name are required. name_status should be A (automatic), P (primary/preferred name for this concept), or N (always disambiguate); type_ids are IDs from cdb.type_ids.keys() - they're based on some high-level SNOMED concepts, but they're not strictly required. ontologies is a |-separated list of source ontologies (again, optional).

Here’s how you’d go about it:

import pandas as pd

from medcat.cat import CAT
from medcat.model_creation.cdb_maker import CDBMaker

cat = CAT.load_model_pack("")  # load your model pack (fill in the path)
cdb = cat.cdb

# it's important to pass in your existing CDB, otherwise a new one will be created;
# doing it this way adds on top of the existing CDB
cdb_maker = CDBMaker(cdb.config, cdb)

# option 1 - you have added your names / concepts to a CSV
cdb_maker.prepare_csvs([file_path])  # file_path is the path to the CSV

# option 2 - you create a pandas DataFrame
df = pd.DataFrame(data=[], columns=[])  # make sure to include column headers
cdb_maker.prepare_csvs([df])

# now save the model pack if you need to - the changes were made in the existing CDB
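For instance, a DataFrame for adding slang synonyms might look like this (the CUI below is a made-up placeholder - substitute the real concept IDs from your CDB):

```python
import pandas as pd

# 'C0000000' is a placeholder CUI - use the real concept ID.
# Only 'cui' and 'name' are required; 'name_status' is optional.
df = pd.DataFrame(
    data=[
        ["C0000000", "methamphetamine", "P"],  # P: preferred name
        ["C0000000", "ice", "A"],              # A: automatic (regular synonym)
        ["C0000000", "crystal", "A"],
    ],
    columns=["cui", "name", "name_status"],
)
# cdb_maker.prepare_csvs([df]) would then add these on top of the existing CDB
```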

The reason I said there might be differences between adding new synonyms and adding new concepts has to do with the model's existing knowledge. If the concept you're adding a synonym to has already been trained and has context vectors, then the new synonym will work just as well as the existing ones. However, if you're adding a completely new concept, or a name to a concept with no training, you don't have that luxury.
In addition, the model's ability to recognise a newly added synonym / name depends on the name itself: how many concepts use it, and whether (and how much) those concepts have been trained. If you're adding a completely new name (one that doesn't previously exist in cdb.name2info) then the model should be able to identify the name right away, because there is no ambiguity.
However, if you're adding a name that does exist in cdb.name2info and therefore already maps to other concepts as well, things get quite a bit more complicated, because the model needs to disambiguate this name when it sees it. When doing that, all CUIs that have no prior training will be considered the least similar to the context at hand (-1). The model can only compare similarity between concepts that it has training data for (and which have enough of it, as per train_count_threshold). This does mean that if there's no training data for a specific concept then none of its ambiguous names will ever be picked up - the model has no comparison to make, because it doesn't know what context this concept appears in.

So if you’re adding ambiguous names you need to have some training on the concepts on top of just adding the names. This training can be unsupervised (just raw text), but unsupervised training can only train primary names for a concept, or unambiguous names. So more likely than not you’ll have to do some supervised training.
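To make the decision concrete, here's a small sketch of the logic above, using a plain dict in place of the CDB lookup (the exact structure of cdb.name2info varies between versions, so name2cuis below is a made-up stand-in mapping each name to the CUIs that use it):

```python
def will_need_training(name: str, name2cuis: dict) -> bool:
    """Per the logic above: a brand-new name is unambiguous and is picked
    up right away; a name that already maps to other CUIs requires context
    training, because untrained CUIs score -1 during disambiguation."""
    return bool(name2cuis.get(name))

# 'ice' already maps to another concept; 'glass pipe' does not exist yet
name2cuis = {"ice": {"C0000001"}}
print(will_need_training("ice", name2cuis))         # True - must disambiguate
print(will_need_training("glass pipe", name2cuis))  # False - new, unambiguous
```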

As for setting context_similarity_threshold = 0 all that will do is allow for worse matches to be shown to the user. But since the untrained concepts (in ambiguous names) have a similarity of -1 they will still not be output. Setting this to -1 might output something, but it will be rather arbitrary in cases of multiple CUIs with no training. And you’ll get a lot of nonsense. So I really don’t recommend this.

Setting min_count = 0 is equivalent to setting it to 1. That's because if there are no vectors to compare to (which is the case with no training) then the similarity is still not going to be found. So you still need some training. But the problem with setting this very low is that you'll then treat a very small number of training examples as golden samples. That may work in some cases, but there's no guarantee it will.

Copying context vectors could get you something, but it's not really something the model is designed to account for. For instance, you'd probably want to adjust the training count for the concept, and potentially for the names, as well. And this only really works if the 2 concepts don't share names (since that would lead to somewhat arbitrary disambiguation) - yet if they don't share names, they're probably not related enough for this to be useful. As such, I wouldn't recommend it unless you're really confident you know what the implications are.
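If you do go down that route despite the caveats, the mechanics would look roughly like this. This is a sketch with plain dicts standing in for the CDB's stores (in MedCAT v1 these live on the CDB as cui2context_vectors and cui2count_train; the layout may differ in v2, and the CUIs are hypothetical):

```python
import copy

# Stand-ins for the CDB's stores; a real CDB keeps one vector per context
# type (e.g. 'long', 'short') for each trained CUI.
cui2context_vectors = {
    "C_meth": {"long": [0.1, 0.2, 0.3], "short": [0.4, 0.5, 0.6]},
}
cui2count_train = {"C_meth": 120}

source_cui, target_cui = "C_meth", "C_ice"  # hypothetical CUIs

# Deep-copy the vectors so later training on one concept doesn't mutate the other
cui2context_vectors[target_cui] = copy.deepcopy(cui2context_vectors[source_cui])
# Also carry over the training count, otherwise threshold checks
# (train_count_threshold-style) still treat the target as untrained
cui2count_train[target_cui] = cui2count_train[source_cui]
```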

Hi @mart.ratas, thank you so much for your prompt and detailed response.

I am currently using MedCAT version 1.16.0, so if there are any version-specific considerations, I’d appreciate any additional guidance.

It seems like supervised training is the best way to move forward, particularly for ambiguous names like the slang terms we're working with. However, given the long list of synonyms, creating annotated training examples for each term would be quite time-consuming. A few follow-up questions:

  1. Is there a recommended minimum number of training examples per name/synonym?
  2. What is the recommended workflow? Should the supervised training be done before or after the first NER+L?

Thanks again for your help!

First of all, I wouldn’t recommend using medcat==1.16.0. If you need to use v1, use the latest medcat==1.16.8.

As for supervised training - MedCAT trains on a per-concept basis. So if you're adding multiple synonyms to one concept, you only really need to train that one concept rather than each of the specific names. Though most likely you'll want a diverse training dataset that includes the different names, otherwise some may not be picked up later down the line. I.e. if you only train on formal synonyms, the model might not be able to pick up the informal ones.
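For example, a small supervised set covering both a formal name and a slang name for the same concept might be shaped roughly like this (a MedCATtrainer-style export; the CUI is a placeholder and the exact schema fields can differ between versions, so treat this as an illustrative sketch):

```python
import json

# Hypothetical annotated examples covering a formal name and a slang
# synonym for the same (placeholder) CUI, so both get context training.
training_data = {
    "projects": [{
        "name": "slang_terms",
        "documents": [
            {
                "text": "Patient reports daily use of methamphetamine.",
                "annotations": [
                    {"cui": "C_meth", "value": "methamphetamine",
                     "start": 29, "end": 44},
                ],
            },
            {
                "text": "States he has been smoking ice for two years.",
                "annotations": [
                    {"cui": "C_meth", "value": "ice", "start": 27, "end": 30},
                ],
            },
        ],
    }],
}
# json.dump(training_data, open("train.json", "w")) and then pass the file
# to MedCAT's supervised trainer
```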

As for the specific question:

  1. There is no general answer I can give here that will always work well. Because it depends heavily on what the problem is exactly. The rule of thumb has been around 100-150 annotations per concept. But in some scenarios you may be able to get away with fewer and in others this may not be enough. For instance, if you’re trying to make the model good at disambiguating between 2 similar concepts that share a name, you may need more training.
  2. I’m not sure I understand this question fully. Normally your first step is to identify whether the current performance is up to the standard you require. The simplest way is to do a visual sanity check by running inference over the data and checking whether what you were interested in was actually linked by the model, but this is - somewhat obviously - quite subjective. The more objective measure would be to gather an annotated dataset and make a decision on the resulting metrics. If that’s not what you asked, please do clarify.