feat(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training #467

Open
mart-r wants to merge 15 commits into main from feat/medcat/CU-869d9n2rg-avoid-running-pipe-twice-in-sup-train

Collaborator

@mart-r mart-r commented May 13, 2026

This PR separates the callers used for supervised and self-supervised training.

Self-supervised training still runs the entire pipe, whereas supervised training runs only the tokenizer.

Along the way, I also fixed a few minor things:

  • The warning based on counting trainable components (i.e. when there is nothing to train)
  • The entities sent for supervised training now have a detected name
    • Without it, the name counts wouldn't be updated
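
A minimal sketch of the caller separation described above, for illustration only. Every name here (`ToyDoc`, `ToyPipeline`, `tokenize_only`, `run_full`, `get_caller`) is hypothetical and is not taken from the MedCAT code base; the NER and linker steps are placeholders that only record that they ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a pipeline document."""
    text: str
    tokens: List[str] = field(default_factory=list)
    steps_run: List[str] = field(default_factory=list)


class ToyPipeline:
    """Hypothetical pipeline: a tokenizer followed by NER and linking."""

    def tokenize_only(self, text: str) -> ToyDoc:
        # Whitespace splitting stands in for the real tokenizer.
        doc = ToyDoc(text=text, tokens=text.split())
        doc.steps_run.append("tokenizer")
        return doc

    def run_full(self, text: str) -> ToyDoc:
        doc = self.tokenize_only(text)
        # Placeholders for the real components; they only record that they ran.
        doc.steps_run.extend(["ner", "linker"])
        return doc


def get_caller(pipeline: ToyPipeline, supervised: bool) -> Callable[[str], ToyDoc]:
    """Pick the caller for a training mode (assumed dispatch, not MedCAT's API)."""
    return pipeline.tokenize_only if supervised else pipeline.run_full


pipe = ToyPipeline()
sup_doc = get_caller(pipe, supervised=True)("patient has kidney failure")
unsup_doc = get_caller(pipe, supervised=False)("patient has kidney failure")
print(sup_doc.steps_run)
print(unsup_doc.steps_run)
```

In this toy version the supervised caller never touches the NER or linker placeholders, which is where the saving would come from if annotations already supply the entities.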
Original(ish) description:
If all goes well this just works out of the box.

EDIT:
It looks like this doesn't work: the linker's unsupervised training needs the NER step to have run, because otherwise it doesn't know which entities to train on. I need to figure out a solution for that.

@mart-r mart-r closed this May 13, 2026
@mart-r mart-r changed the title from "fead(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training" to "feat(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training" May 13, 2026
@mart-r mart-r reopened this May 13, 2026
Contributor

@adam-sutton-1992 adam-sutton-1992 left a comment

LGTM. Should be a big speedup.

One minor question below.

Comment on lines +422 to +427
ent = self._pipeline.entity_from_tokens_in_doc(tkns, doc)
# avoid setting CUI so as to not show our hand
raw_name = ann['value'].lower().replace(
    " ", self.cdb.config.general.separator)
ent.detected_name = raw_name
ents.append(ent)
Contributor

As in, the correct CUI isn't already set when performing training?

Collaborator Author

You're right, it doesn't matter. Even in the previous case this would have already gone through the pipe. Might as well add that too.
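
For reference, the name normalisation in the snippet under review can be sketched in isolation as below. The helper name `to_detected_name` is hypothetical, and the default separator `"~"` is an assumption made for this example; in the PR the separator is read from `self.cdb.config.general.separator`.

```python
def to_detected_name(value: str, separator: str = "~") -> str:
    """Lower-case an annotation value and join its words with the separator.

    Mirrors ann['value'].lower().replace(" ", separator) from the diff;
    the "~" default is an assumption for this sketch.
    """
    return value.lower().replace(" ", separator)


name = to_detected_name("Kidney Failure")
print(name)
```

Note that only single spaces are replaced, so consecutive spaces in an annotation value would yield consecutive separators.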
