feat(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training by mart-r · Pull Request #467 · CogStack/cogstack-nlp

mart-r · 2026-05-13T13:54:29Z

This PR separates the callers used for supervised and self-supervised training.

Self-supervised training runs the entire pipe, while supervised training runs only the tokenizer.

Along the way, a few minor things were also fixed:

- The warning based on trainable component counting (i.e. when there is nothing to train)
- The entities sent for supervised training now have a detected name
  - Without it, the name counts wouldn't be updated
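
To illustrate the split between the two callers, here is a minimal sketch. It is not the actual MedCAT trainer: `Doc`, `tokenize`, `run_full_pipe`, and `Trainer` are hypothetical stand-ins, and the NER and linker steps are stubbed out. It only shows the idea that supervised preparation can skip everything after the tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document object (not the MedCAT class)."""
    text: str
    tokens: List[str] = field(default_factory=list)
    steps_run: List[str] = field(default_factory=list)


def tokenize(text: str) -> Doc:
    """Tokenizer-only caller: whitespace split, no NER or linking."""
    return Doc(text=text, tokens=text.split(), steps_run=["tokenizer"])


def run_full_pipe(text: str) -> Doc:
    """Full-pipe caller: tokenizer followed by stubbed NER and linker steps."""
    doc = tokenize(text)
    doc.steps_run += ["ner", "linker"]
    return doc


class Trainer:
    """Hypothetical trainer that holds one caller per training mode."""

    def __init__(self,
                 pipe_caller: Callable[[str], Doc],
                 tokenizer_caller: Callable[[str], Doc]) -> None:
        self._pipe_caller = pipe_caller
        self._tokenizer_caller = tokenizer_caller

    def prepare_self_supervised(self, text: str) -> Doc:
        # Self-supervised training needs detected entities, so run everything.
        return self._pipe_caller(text)

    def prepare_supervised(self, text: str) -> Doc:
        # Supervised training brings its own annotations, so tokens suffice.
        return self._tokenizer_caller(text)


trainer = Trainer(run_full_pipe, tokenize)
```

With this layout, the supervised path never pays for the NER and linker steps, which is where the expected speedup would come from.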

Original(ish) description

If all goes well this just works out of the box.

EDIT:
Looks like it's not working, since the linker's unsupervised training needs the NER step to have run; otherwise it doesn't know which entities to train on. Need to figure out a solution for that.
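
The dependency described in the edit above can be shown with a toy example. Everything here is an assumption made for illustration: `ner_step` and `linker_train_unsupervised` are hypothetical functions, and `CUI_1` is a placeholder identifier. The point is only that a linker that trains on detected entities has nothing to do if NER was skipped.

```python
from typing import Dict, List


def ner_step(tokens: List[str], vocab: Dict[str, str]) -> List[str]:
    """Toy NER: keep the tokens that appear in a known-name vocabulary."""
    return [tok for tok in tokens if tok in vocab]


def linker_train_unsupervised(entities: List[str],
                              counts: Dict[str, int]) -> int:
    """Toy linker training: update a count per detected entity.

    Returns how many entities were trained on.
    """
    for ent in entities:
        counts[ent] = counts.get(ent, 0) + 1
    return len(entities)


tokens = "patient has diabetes".split()
vocab = {"diabetes": "CUI_1"}  # placeholder CUI, not a real identifier

# Tokenizer only: no entities were detected, so the linker trains on nothing.
counts_without_ner: Dict[str, int] = {}
trained_without_ner = linker_train_unsupervised([], counts_without_ner)

# Tokenizer followed by NER: the linker now has an entity to train on.
counts_with_ner: Dict[str, int] = {}
trained_with_ner = linker_train_unsupervised(
    ner_step(tokens, vocab), counts_with_ner)
```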

adam-sutton-1992

LGTM. Should be a big speedup.

One minor question below.

adam-sutton-1992 · 2026-05-14T15:02:06Z

+                ent = self._pipeline.entity_from_tokens_in_doc(tkns, doc)
+                # avoid setting CUI so as to not show our hand
+                raw_name = ann['value'].lower().replace(
+                    " ", self.cdb.config.general.separator)
+                ent.detected_name = raw_name
+                ents.append(ent)


As in the correct CUI isn't already set when performing training?

mart-r · May 14, 2026

You're right, it doesn't matter. Even in the previous case this would already have gone through the pipe. Might as well set the CUI as well.
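
For readers following the diff above, the name preparation can be sketched in isolation. This mirrors only the lower-casing and separator replacement shown in the hunk; `to_detected_name` and `update_name_counts` are hypothetical helpers, and the `"~"` default is an assumption standing in for whatever `config.general.separator` holds.

```python
from typing import Dict


def to_detected_name(value: str, separator: str = "~") -> str:
    """Lower-case an annotation value and join its words with the separator.

    The default separator is an assumption; in the diff it is read from
    the CDB config rather than hard-coded.
    """
    return value.lower().replace(" ", separator)


def update_name_counts(counts: Dict[str, int], detected_name: str) -> None:
    """Hypothetical count update keyed on the detected name."""
    if not detected_name:
        # Without a detected name there is no key to update.
        return
    counts[detected_name] = counts.get(detected_name, 0) + 1


name_counts: Dict[str, int] = {}
update_name_counts(name_counts, to_detected_name("Heart Attack"))
update_name_counts(name_counts, "")  # entity with no detected name
```

The second call shows why an entity without a detected name would leave the name counts untouched.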

CU-869d9n2rg: Avoid running pipe twice for supervised training

4e3feb9

mart-r closed this May 13, 2026

mart-r changed the title ~~fead(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training~~ feat(medcat): CU-869d9n2rg Avoid running pipe twice for supervised training May 13, 2026

mart-r reopened this May 13, 2026

github-actions Bot added 6 commits May 13, 2026 15:04

CU-869d9n2rg: Fix typo

eca18ea

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer

08267ab

CU-869d9n2rg: Use separate callers for tokenizer and pipe in trainer creation

01c7898

CU-869d9n2rg: Fix training time instantiation

872a98a

CU-869d9n2rg: Set entity name when preparing for supervised training

69406bf

CU-869d9n2rg: Fix issue with trainable component counting

e83d8b4

adam-sutton-1992 approved these changes May 14, 2026

github-actions Bot added 8 commits May 14, 2026 16:08

CU-869d9n2rg: Add cui to entity for supervised training

c67fb32

CU-869d9n2rg: Using properly processed name when prepping for supervised training

68ee430

CU-869d9n2rg: Simplify trainer class - use pipe for different callers

ac94a02

CU-869d9n2rg: Simplify trainer class in init

ad07225

CU-869d9n2rg: Fix trainer utils tests

2f30634

CU-869d9n2rg: Update trainer tests with new trainer object

7c4a071

CU-869d9n2rg: Update trainer utils tests with new trainer object

2afc942

CU-869d9n2rg: Remove unused import

86a155e

mart-r wants to merge 15 commits into main from feat/medcat/CU-869d9n2rg-avoid-running-pipe-twice-in-sup-train
