I will assume you’re using the latest MedCAT (2.5+). If you’re using a pre-2 version, let me know and I can give you some further guidance.
There can be a distinct difference between adding new concepts and enriching with new synonyms.
With that said, whichever you’re doing, the most “natural” way to add new names and concepts is to use the CDBMaker class, because you can pass a DataFrame or even a file path to one of its methods, and it will handle all the nitty-gritty details of token splitting, adding names, subnames, and all other relevant information.
These are the columns this can use:
'cui', 'name', 'ontologies', 'name_status', 'type_ids', 'description'
NOTE: Only cui and name are required. name_status should be A (automatic), P (primary/preferred name for this concept), or N (always disambiguate); type_ids are IDs from cdb.type_ids.keys() - they’re based on some high-level SNOMED concepts, but they’re not strictly required; ontologies is a |-separated list of source ontologies (again, optional).
Here’s how you’d go about it:
import pandas as pd

from medcat.cat import CAT
from medcat.model_creation.cdb_maker import CDBMaker

cat = CAT.load_model_pack("") # load your model pack
cdb = cat.cdb
# It's important to pass in your existing CDB, otherwise a new one will be created.
# Passing it in means everything is added on top of the existing CDB.
cdb_maker = CDBMaker(cdb.config, cdb)
# Option 1 - you have added your names / concepts to a CSV
cdb_maker.prepare_csvs([file_path]) # file_path is the path to the CSV
# Option 2 - you create a pandas DataFrame
df = pd.DataFrame(data=[], columns=[]) # make sure to include column headers
cdb_maker.prepare_csvs([df])
# Now save the model pack if you need to - the changes were made in the existing CDB
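For illustration, a populated DataFrame might look like the one below. The CUIs and names are made up; only cui and name are required, the rest are optional:

```python
import pandas as pd

# Illustrative rows only - the CUIs and names here are made up.
# "cui" and "name" are required; the other columns are optional.
df = pd.DataFrame(
    [
        {"cui": "C001", "name": "diabetes mellitus", "name_status": "P"},
        {"cui": "C001", "name": "DM", "name_status": "N"},  # ambiguous abbreviation
        {"cui": "C002", "name": "hypertension", "name_status": "P",
         "ontologies": "SNOMED|ICD10"},
    ],
    columns=["cui", "name", "name_status", "ontologies"],
)
```

Short abbreviations like “DM” are exactly the kind of ambiguous name where the disambiguation caveats further down apply.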
The reason I said there might be a difference between adding new synonyms and adding new concepts has to do with the model’s existing knowledge. If the concept you’re adding a synonym to has already been trained and has context vectors, then the new synonym will work just as well as the existing ones. However, if you’re adding a completely new concept, or a name to a concept with no training, you don’t have that luxury.
In addition, the model’s ability to recognise a newly added synonym / name depends on the name itself: how many concepts use it, and whether (and how many of) those concepts have received training. If you’re adding a completely new name (one that doesn’t already exist in cdb.name2info), the model should be able to identify it right away, because there is no ambiguity.
However, if you’re adding a name that does exist in cdb.name2info, and therefore already maps to other concepts as well, things get quite a bit more complicated, because the model needs to disambiguate the name whenever it sees it. When doing that, all CUIs without prior training are considered the least similar to the context at hand (a similarity of -1). The model can only compare similarity between concepts it has training data for (and which have enough of it, as per train_count_threshold). This means that if there’s no training data for a specific concept, none of its ambiguous names will ever be picked up - the model has no comparison to make, because it doesn’t know what context this concept appears in.
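To make that concrete, here’s a toy sketch of the selection rule just described. This is not MedCAT’s actual implementation - the function names and the default threshold value are made up - but it shows why untrained concepts behind ambiguous names never win:

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two context vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def disambiguate(candidate_cuis, context_vec, cui_vectors, train_counts,
                 train_count_threshold=1, context_similarity_threshold=0.25):
    scores = {}
    for cui in candidate_cuis:
        if train_counts.get(cui, 0) >= train_count_threshold and cui in cui_vectors:
            scores[cui] = cosine(context_vec, cui_vectors[cui])
        else:
            scores[cui] = -1.0  # untrained concepts: least similar to any context
    best = max(scores, key=scores.get)
    # Note: even with the threshold lowered to 0, a score of -1 never clears it,
    # so a name whose candidate CUIs are all untrained produces no output.
    if scores[best] < context_similarity_threshold:
        return None
    return best
```

For example, if only one candidate CUI has training and its vector matches the context, it wins; if none are trained, every score is -1 and nothing is returned.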
So if you’re adding ambiguous names, you need some training on the concepts on top of just adding the names. This training can be unsupervised (just raw text), but unsupervised training can only train the primary name of a concept or its unambiguous names. So more likely than not you’ll have to do some supervised training.
As for setting context_similarity_threshold = 0, all that does is allow worse matches to be shown to the user. But since the untrained concepts (behind ambiguous names) have a similarity of -1, they will still not be output. Setting it to -1 might output something, but the choice would be rather arbitrary in cases of multiple CUIs with no training, and you’d get a lot of nonsense. So I really don’t recommend this.
For untrained concepts, setting context_similarity_threshold = 0 is equivalent to setting it to 1: if there are no vectors to compare against (which is the case with no training), no similarity will be found either way. So you still need some training. The problem with setting the threshold very low is that you’ll then treat a very small number of training examples as golden samples. That may work in some cases, but there’s no guarantee it will.
Copying context vectors could get you somewhere, but it’s not really something the model is designed to account for. For instance, you’d probably want to adjust the train count for the concept, as well as (potentially) for the names. And this only really works if the two concepts don’t share names (since shared names would lead to somewhat arbitrary disambiguation); yet if they don’t share names, they’re probably not related enough for this to be useful. As such, I wouldn’t recommend it unless you’re really confident you know what the implications are.
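If you do want to see what’s involved, here is a purely conceptual sketch using plain dicts as stand-ins for the CDB internals. The real MedCAT attribute names and structures differ, and - to repeat - this approach is not recommended:

```python
# Conceptual only: plain dicts stand in for per-CUI context vectors and
# training counts. Real MedCAT internals are structured differently.
def copy_context_info(source_cui, target_cui, context_vectors, train_counts):
    # Deep-copy the vectors so later training on one concept
    # doesn't silently mutate the other.
    context_vectors[target_cui] = {
        ctx_type: list(vec)
        for ctx_type, vec in context_vectors[source_cui].items()
    }
    # Without also adjusting the train count, the target concept would still
    # look untrained (below train_count_threshold) despite having vectors.
    train_counts[target_cui] = train_counts[source_cui]
```

The two comments mark exactly the caveats above: both the vectors and the train count have to be kept consistent, and none of this helps when the two concepts share names.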