
v0.1.0: monorepo merge — unified training + inference, canonical model, archive-free inference, vendored baselines #41

Draft
xuefei-wang wants to merge 228 commits into vanvalenlab:master from xuefei-wang:master

xuefei-wang (Collaborator) commented May 30, 2026

Summary

Merges the separate training repository (deepcelltypes-cell-type-assignment-pytorch)
into this repo and replaces the legacy CellTypeCLIPModel inference path with the
current canonical model. This is the v0.1.0 release cut.

Before this PR, vanvalenlab/deepcell-types was inference-only — it shipped
CellTypeCLIPModel, the dct_kit/ helpers, and a top-level __init__ that
exported just predict. After it, a single package covers training and
inference: inference stays a plain pip install deepcell-types, the full
training pipeline lives behind a [train] extra, and the four paper comparison
baselines are vendored behind per-baseline extras.

⚠️ Breaking changes — see below.

Canonical model

model.py is rewritten around CellTypeAnnotator; CellTypeCLIPModel /
CellTypeDataEncoder are removed. Canonical training defaults (scripts/train.py,
click-based CLI): --resnet_channels 48, --domain_weight 0.1,
--best_metric macro_f1.

  • Mean-intensity injection — per-cell mean marker intensity is scattered
    into a marker-position vector and injected as a CLS residual. The output
    projection is zero-init, so warm-starting from a checkpoint preserves
    predictions at step 0.
  • DANN domain adaptation via a gradient-reversal head, on by default
    (--domain_weight 0.1; 0 disables it).
  • Adapter-style fine-tuning: --freeze_backbone trains only the
    mean-intensity branches on top of an existing checkpoint; --unfreeze_ct_head
    additionally co-adapts the CT head / CLS token / final norm without unfreezing
    the transformer backbone.
  • Padding-channel positions are explicitly zeroed (masked_fill) through the
    channel encoder, fusion, and mean-intensity paths so masked tokens contribute
    exactly zero rather than leaking bias/spatial_feat into the transformer.
  • Self-describing checkpoints: scripts/train.py bundles ct2idx, n_heads,
    and compat_marker0_zero into the checkpoint, and inference asserts the
    vocabulary ordering matches (a permuted vocabulary previously passed the
    count-only check and silently mislabeled cells).
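The mean-intensity injection can be sketched for a single cell in NumPy. This is an illustration under stated assumptions: the function name, argument shapes, and the plain linear projection are hypothetical stand-ins for the real PyTorch module, whose internals are not shown in this PR. The sketch checks two properties described above. First, a zero-initialized output projection leaves the CLS token unchanged, which is what makes warm-starting safe. Second, padded channel slots never reach the residual.

```python
import numpy as np


def inject_mean_intensity(cls_token, marker_idx, mean_intensity, padding_mask,
                          proj, n_markers):
    """Hypothetical sketch: add a mean-intensity residual to one cell's CLS token.

    cls_token:      (d_model,) CLS embedding for the cell
    marker_idx:     (C,) marker-vocabulary index of each input channel slot
    mean_intensity: (C,) mean marker intensity inside the cell, per channel slot
    padding_mask:   (C,) True where the channel slot is padding
    proj:           (d_model, n_markers) output projection (zero-init at start)
    """
    real = ~padding_mask
    # Scatter only real channels into a fixed marker-position vector, so
    # padded slots contribute exactly zero whatever value they hold.
    marker_vec = np.zeros(n_markers)
    marker_vec[marker_idx[real]] = mean_intensity[real]
    # Residual added onto the CLS token.
    return cls_token + proj @ marker_vec
```

With `proj` all zeros the function returns `cls_token` unchanged. Editing the intensity stored in a padded slot leaves the output unchanged, even when the projection is non-zero.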

Canonical-only inference

  • Archive-free by default: the marker / cell-type registry ships as a small
    packaged vocab.json snapshot, so pip install deepcell-types +
    download_model() is enough to run predict() — the multi-GB TissueNet zarr
    archive is no longer required (pass zarr_path= / set
    DEEPCELL_TYPES_ZARR_PATH only if you need it). Verified identical
    predictions with vs. without the archive on the paper checkpoint.
  • Post-hoc abstention on by default (ct_abstention_k=0.2), bucketed
    per-FOV everywhere (CLI, Python API, library): cells below an IQR fence on
    the FOV confidence distribution are relabeled to the "Unknown" sentinel
    (skipped when k is disabled or the FOV has <4 cells).
  • Custom preprocessing hook: predict(..., preprocess=...) overrides the
    per-FOV normalization without retraining, backed by a bounded op library
    (apply_config, make_preprocessor, DEFAULT_CONFIG) and a
    composition-guided adaptation loop (skills/preproc-adapt/).
  • The bright-spot clip percentile (DCTConfig.PERCENTILE_THRESHOLD) is now
    99.9, matching the recipe the training archive was built with (was 99.0,
    a carryover from the original packaging).
  • predict(return_probabilities=True) returns a PredictionResult dataclass
    with the full per-cell softmax matrix, cell indices, and the pre-abstention
    argmax labels (cell_types_raw).
  • _torch_load_weights loads with weights_only=True and emits a loud warning
    if it has to fall back to unsafe pickle on an older torch; a missing
    checkpoint raises a clear FileNotFoundError pointing at download_model().
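The per-FOV abstention rule can be sketched as follows. The exact fence formula is not spelled out above. This sketch assumes a Tukey-style lower fence, Q1 - k·IQR, computed on each FOV's confidence distribution. The function name `abstain_per_fov` is hypothetical and is not the library's API.

```python
import numpy as np

UNKNOWN = "Unknown"


def abstain_per_fov(labels, confidences, fov_ids, k=0.2, min_cells=4):
    """Hypothetical sketch: relabel low-confidence cells to "Unknown", per FOV.

    Assumes the fence is Q1 - k * IQR of the FOV's confidences.
    """
    labels = np.asarray(labels, dtype=object).copy()
    conf = np.asarray(confidences, dtype=float)
    fov_ids = np.asarray(fov_ids)
    if not k:  # k=0 (or None) disables abstention: raw argmax labels
        return labels
    for fov in np.unique(fov_ids):
        idx = np.flatnonzero(fov_ids == fov)
        if idx.size < min_cells:  # too few cells for a meaningful IQR
            continue
        q1, q3 = np.percentile(conf[idx], [25, 75])
        fence = q1 - k * (q3 - q1)
        labels[idx[conf[idx] < fence]] = UNKNOWN
    return labels
```

Bucketing by FOV means a uniformly dim FOV is judged against its own confidence spread rather than a global threshold.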

New public API

  • predict, DCTConfig, PredictionResult, preprocess_fov, apply_config,
    make_preprocessor, and DEFAULT_CONFIG are importable from deepcell_types
    directly. preprocess_fov(raw, mask, native_mpp, channel_names) → PreprocessedFov is the standalone preprocessing entry point.

Monorepo: training pipeline

  • deepcell_types.training ships from this repo behind pip install "deepcell-types[train]": config.py, dataset.py, archive.py,
    annotations.py, baseline_features.py, gold_metadata.py, losses.py,
    metrics.py, patch.py, utils.py, abstention.py.
  • Scripts under scripts/: train.py, pretrain.py, predict.py,
    generate_openai_embeddings.py, generate_splits.py, split_val_for_test.py,
    plus the release-archive gate (validate_archive_contract.py,
    check_release_archive.sh).
  • Canonical split manifests committed under splits/
    (fov_split{,_valsubset,_test}.json + README), so the published
    train/val/test partition is reproducible from the repo.
  • Experiment logging is plain Python logging — no Weights & Biases dependency
    anywhere (--enable_wandb is gone; confusion matrices save locally as PNGs).
  • zarr>=3.1 pulls the Python floor up to 3.11 for the train extra.

Baselines

  • Four paper comparison baselines vendored under deepcell_types/baselines/
    (cellsighter, maps, nimbus, xgb), invoked through the unified runner
    python -m deepcell_types.baselines <name>, each with a self-contained
    install extra (baseline-cellsighter, baseline-maps, baseline-nimbus,
    baseline-xgboost).
  • Each baseline ships a README documenting every deviation from its upstream
    source; third-party licenses are tracked in deepcell_types/baselines/NOTICE.
  • extract_features_from_zarr(missing_value=...) lets each baseline choose its
    absent-marker sentinel: MAPS / CellSighter keep 0.0; XGBoost can pass
    np.nan so absent markers route through XGBoost's learned missing direction
    instead of being conflated with "present, intensity 0.0". The feature matrix
    records a present_markers mask and the cache stays missing-value-agnostic.
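One way to keep a cached feature matrix missing-value-agnostic is to store raw intensities plus the `present_markers` mask, and to apply the sentinel only when materializing features for a given baseline. The sketch below is a hypothetical helper, not the real `extract_features_from_zarr` signature. It assumes the mask is per marker, shared by all cells in the matrix.

```python
import numpy as np


def materialize_features(values, present_markers, missing_value=0.0):
    """Hypothetical sketch: apply an absent-marker sentinel at read time.

    values:          (n_cells, n_markers) cached intensities; columns for
                     absent markers hold placeholders and are overwritten
    present_markers: (n_markers,) bool mask stored alongside the cache
    missing_value:   0.0 for MAPS / CellSighter, np.nan for XGBoost
    """
    out = np.array(values, dtype=float, copy=True)  # never mutate the cache
    out[:, ~np.asarray(present_markers, dtype=bool)] = missing_value
    return out
```

With `missing_value=np.nan`, an absent marker becomes NaN and XGBoost can route it through its learned missing direction. A present marker with true intensity 0.0 stays 0.0.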

Breaking changes

  • CellTypeCLIPModel removed. No shim — use from deepcell_types import predict, DCTConfig.
  • All predict() arguments after mpp are keyword-only, preventing
    accidental transposition of the adjacent string arguments. device= is the
    preferred spelling (device_num= remains a deprecated alias).
  • predict(num_workers=...) default is now 0 (was 24) — 24 workers
    OOM'd machines with <64 GB RAM.
  • Abstention on by default changes returned labels vs. the unfiltered argmax
    of prior releases; pass ct_abstention_k=0 to recover raw argmax.
  • Clip percentile 99.0 → 99.9 shifts ~5% of predicted labels; on a
    held-out test-split sample it reproduces the canonical predictions slightly
    better (92.5% vs 91.9% argmax agreement).
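The keyword-only signature and the deprecated alias can be sketched with a hypothetical stand-in that only resolves its arguments. The real `predict()` takes more parameters than shown here. Two details are assumptions: how an integer `device_num` maps to a device string, and the CPU fallback.

```python
import warnings


def predict_sketch(raw, mask, mpp, *, device=None, device_num=None,
                   num_workers=0, ct_abstention_k=0.2):
    """Hypothetical stand-in for predict(); everything after mpp is keyword-only."""
    if device_num is not None:
        warnings.warn("device_num= is deprecated; use device=",
                      DeprecationWarning, stacklevel=2)
        if device is not None:
            raise ValueError("pass device= or device_num=, not both")
        # Assumption: an integer GPU index maps to a CUDA device string.
        device = f"cuda:{device_num}"
    # Assumption: fall back to CPU when no device is given.
    return {"device": device or "cpu", "num_workers": num_workers,
            "ct_abstention_k": ct_abstention_k}
```

Because of the bare `*`, a call like `predict_sketch(raw, mask, 0.5, "cpu")` raises `TypeError` instead of silently binding `"cpu"` to the wrong parameter.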

Packaging / infra

  • Package data now ships vocab.json, channel_mapping.yaml, and
    training/config/*.yaml (incl. combined_celltypes.yaml), which were
    previously outside the package tree and absent after pip install.
  • tifffile declared in the [train] extra.
  • CI workflow added (.github/workflows/ci.yml); inference vs. [train] test
    boundary enforced.
  • LICENSE text matches the OSI Apache 2.0 text exactly (#42, "LIC: Revert licence text to exactly match OSI Apache 2"); NOTICE
    aligned to the vanvalenlab convention.

Tests

35 test modules under tests/ (plus tests/baselines/) covering canonical
inference, abstention CLI, checkpoint round-trip, dataset/split/sampler
behavior, preprocessing + the preprocess hook, losses, hierarchical eval,
archive-contract validation, baseline feature splits, and vendored-baseline
equivalence against upstream.

See CHANGELOG.md
for the full 0.1.0 entry and migration notes.

xuefei-wang and others added 30 commits April 22, 2026 23:54
Two surviving issues from the cross-repo audit (deepcelltypes-cell-type-
assignment-pytorch reviews/2026-05-10-0850/deepcell-types/SYNTHESIS.md)
that PR #1 ("feat: support canonical annotator inference") did not address.
The other 3 findings (channel KeyError fallback, marker-embedding always-
normalize, marker_embeddings allocation shape) are already fixed on this
branch.

predict.py:
- `_torch_load_weights` previously caught `TypeError` from a too-old torch
  and silently fell back to unsafe pickle deserialization. Now emits a
  loud warning when the fallback fires, recommending an upgrade. Untrusted
  checkpoints can execute arbitrary code at unsafe `torch.load` time, so
  this fallback should be the rare exception, not silent.
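The loud-fallback pattern can be sketched with an injected loader callable standing in for `torch.load`, so the control flow can be shown without torch installed. The helper name is hypothetical. Older `torch.load` releases do not accept the `weights_only` keyword, which is what surfaces as the `TypeError`.

```python
import warnings
from pathlib import Path


def load_weights_sketch(path, loader):
    """Hypothetical sketch; `loader` stands in for torch.load."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"checkpoint not found at {path}; fetch it with download_model()")
    try:
        # Safe path: restricts unpickling to tensors and plain containers.
        return loader(path, weights_only=True)
    except TypeError:
        # Older loader without weights_only: fall back, but never silently.
        warnings.warn(
            "torch.load does not support weights_only=True; falling back to "
            "unsafe pickle loading. Upgrade torch; untrusted checkpoints can "
            "execute arbitrary code.", RuntimeWarning, stacklevel=2)
        return loader(path)
```

Like the pattern it illustrates, this also falls back if the loader raises `TypeError` for an unrelated reason. A stricter version could inspect the loader's signature first.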

model.py:
- Legacy `CellTypeDataEncoder.forward` (used for the older CLIP checkpoints
  via the `_is_canonical_checkpoint() == False` route) had:

      aug_mask = nn.functional.pad(mask.long(), (1, 0), mode="reflect")

  which fills the CLS slot by reflection, i.e. with a copy of the mask bit
  at index 1 (reflect padding mirrors about the edge element and does not
  repeat it). This is correct only when that channel is always real (not
  padding). Replace with explicit `torch.cat([torch.zeros(B, 1, dtype=bool), mask], dim=1)`
  to make CLS-always-visible the structural invariant. The canonical
  `annotator_model.py` already uses this pattern (lines 409-410); this
  brings legacy parity.
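The difference between reflect-padding the mask and explicitly prepending a visible CLS slot can be checked in NumPy. `np.pad(..., mode="reflect")` mirrors about the edge element without repeating it, the same convention as torch's reflect padding. The prepended slot therefore receives the value at index 1. The example assumes 1 marks a padded channel slot.

```python
import numpy as np

# 1 = padded channel slot, 0 = real channel (trailing padding).
mask = np.array([[0, 1, 1],   # cell with a single real channel
                 [0, 0, 1]])  # cell with two real channels

# Legacy approach: reflect-pad one slot on the left of the channel axis.
reflected = np.pad(mask, ((0, 0), (1, 0)), mode="reflect")

# Explicit approach: prepend an always-visible (0) CLS slot.
explicit = np.concatenate(
    [np.zeros((mask.shape[0], 1), dtype=mask.dtype), mask], axis=1)
```

For the first cell, `reflected[0, 0]` is 1, so its CLS token would be masked out. `explicit[:, 0]` is 0 for every cell.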

Smoke test: `CellTypeDataEncoder(...)` constructs and forwards without
error. No regression risk for canonical-checkpoint loads (those go through
`annotator_model.py`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In preparation for merging the training pipeline (currently in a separate
repo) into this package, collapse to a single supported architecture.
The legacy `CellTypeCLIPModel` path and its DCTConfig "legacy" profile
were carrying ~1.8k lines of config blobs and dual-mode branching that
would otherwise have to be ported into the training side as well.

Removes:
- `model.py` (CellTypeCLIPModel) and `loss.py` (CLIP/contrastive losses)
- `dct_kit/utils.py` (all four helpers had no remaining callers)
- 8 dead config blobs in `dct_kit/config/` — both deepseek-r1 and
  text-embedding-3-large JSON dumps, plus the legacy and (already-dead)
  `canonical_*.yaml` mirrors and the `tissue_celltype_mapping_merged`
  YAML

Simplifies:
- `predict.py`: drop `_is_canonical_checkpoint` routing, the legacy
  model/dataloader branches, and `_load_legacy_embeddings`
- `dct_kit/config.py::DCTConfig`: remove the `profile=` kwarg, the
  legacy package-bundled init path, and the embedding-loader methods
  (`get_channel_embedding`, `get_celltype_embedding`)
- `dataset.py::PatchDataset`: drop the `output_mode` parameter and the
  legacy `_combine_masks` / `_pad_images` / `_calcualte_marker_positivity`
  helpers — every batch is now canonical
- `tests/test_canonical_inference.py`: drop the two legacy-arm tests;
  the remaining 6 unit tests still pass
- `docs/index.md`: trim the legacy `master_channels.yaml` reference
  from the Limitations section

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the legacy CLIP model.py removed in the previous commit, the
canonical CellTypeAnnotator can reclaim the obvious filename. Updates
the two import sites (predict.py, tests/test_canonical_inference.py)
to match. This also lines the import path up with the training repo
(deepcelltypes-cell-type-assignment-pytorch), which has been using
`deepcelltypes.model.CellTypeAnnotator` all along — easing the
upcoming training-pipeline merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sets up the structure for absorbing the training pipeline currently
maintained in the deepcelltypes-cell-type-assignment-pytorch repo, while
preserving the lean inference-only install that today's users rely on.

- New empty package ``deepcell_types.training`` with an explanatory
  docstring; will be populated in subsequent phases (losses, dataset,
  annotations, ...).
- pyproject extras:
  - ``train`` — wandb / zarr (pinned >=3.1, <4 per the alpha
    metadata-cache bug) / torchvision / torchinfo / torchmetrics /
    pandas / scikit-learn / click / matplotlib
  - ``baselines`` — xgboost / optuna
  - ``analysis`` — plotly / seaborn / openpyxl / kaleido (pinned to
    skip the broken 0.2.1.post1)
  - ``all`` — fan-in convenience target
- Mirrored the [tool.pytest.ini_options] block from the training repo.

CI guard: tests/test_inference_deps.py imports the inference entry
points in a fresh subprocess and asserts that none of
{wandb, zarr, sklearn, pandas, torchvision, torchinfo, torchmetrics,
matplotlib} ends up in sys.modules. Future leaks from the training
side into the inference path will fail this test loudly. Subprocess
isolation prevents pytest's own imports from poisoning the check.
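The subprocess-isolated import guard can be sketched with the standard library alone. `leaked_modules` is a hypothetical helper name, not the actual test's code. It imports an entry point in a fresh interpreter and reports which forbidden top-level modules ended up in `sys.modules`.

```python
import json
import subprocess
import sys


def leaked_modules(entry_module, forbidden):
    """Hypothetical sketch: import `entry_module` in a fresh interpreter and
    return the forbidden top-level modules that ended up in sys.modules."""
    names = sorted(set(forbidden))
    code = (
        "import importlib, json, sys\n"
        f"importlib.import_module({entry_module!r})\n"
        f"print(json.dumps([m for m in {names!r} if m in sys.modules]))\n"
    )
    # A fresh subprocess keeps the caller's (e.g. pytest's) imports out of the check.
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=True)
    # Use the last line in case the imported module prints on import.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

Applied to this repo, the assertion would read `leaked_modules("deepcell_types.predict", ["wandb", "zarr", "sklearn", "pandas", "torchvision", "torchinfo", "torchmetrics", "matplotlib"]) == []`.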

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copies three self-contained modules from the training repo
(deepcelltypes-cell-type-assignment-pytorch) into
``deepcell_types/training/``:

- ``losses.py``: FocalLoss (referenced from upstream pytorch-multi-class-
  focal-loss) and the dormant HierarchicalLoss (coarse-grained CT loss
  driven by a YAML fine→coarse mapping). HierarchicalLoss is kept
  ``weight=0`` in the canonical recipe but is part of the released
  training surface area for follow-on experiments.
- ``annotations.py``: zarr-archive annotation extraction with KDTree
  centroid matching and the duplicate-label collapse / conflict-drop
  semantics the training pipeline depends on. Lazy-imports scipy and
  numcodecs so it stays cheap to import.
- ``gold_metadata.py``: Pan-M Gold-Standard subset → (tissue, modality)
  canonicalization, including the non-direct mappings (decidua →
  uterus, Vectra/Opal → cycif) used at evaluation time.

All three have zero cross-imports into the training-side ``config.py``
or ``utils.py``, so they land cleanly without waiting on Phase 6's
config reconciliation. The remaining training surfaces with config
dependencies — FullImageDataset, FOVGroupedSampler, augmentations,
create_dataloader, and the training portion of utils.py — are deferred
to Phase 6.

The CI guard (tests/test_inference_deps.py) still passes: importing
``deepcell_types.predict`` does not transitively reach
``deepcell_types.training``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates the canonical raw-FOV → archive preprocessing recipe from the
training repo (deepcelltypes-cell-type-assignment-pytorch:preprocessing.py)
and promotes it to the top-level public API.

The function is the single source of truth for transforming an ingested
raw FOV (``(C, H, W)`` intensity at a native MPP) into the format the
model consumes:

  1. resample to ``TissueNetConfig.STANDARD_MPP_RESOLUTION`` (0.5 µm/px)
  2. per-channel p99.9 clip (over nonzero pixels, matching the recovered
     production recipe from
     ``hubmap-to-zarr@origin/deepcell-types:preprocess_for_training.py``)
  3. per-channel min-max normalize to [0, 1]
  4. cast mask to uint32 and compute centroids in resampled coordinates
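Steps 2 and 3 can be sketched in NumPy. Resampling (step 1) and the mask and centroid handling (step 4) are omitted. `clip_and_normalize` is a hypothetical name, not `preprocess_fov` itself. Per the text, the clip percentile is taken over nonzero pixels only. This sketch assumes the min-max range is taken over all pixels of the clipped channel.

```python
import numpy as np


def clip_and_normalize(img, percentile=99.9):
    """Hypothetical sketch of steps 2-3 for a (C, H, W) intensity array."""
    out = np.zeros(img.shape, dtype=np.float32)
    for c, ch in enumerate(img.astype(np.float32)):
        nonzero = ch[ch > 0]
        if nonzero.size == 0:  # empty channel stays all zeros
            continue
        # Step 2: clip bright spots at the percentile of nonzero pixels.
        hi = np.percentile(nonzero, percentile)
        clipped = np.minimum(ch, hi)
        # Step 3: per-channel min-max normalization to [0, 1].
        lo, top = clipped.min(), clipped.max()
        if top > lo:
            out[c] = (clipped - lo) / (top - lo)
    return out
```

Computing the percentile over nonzero pixels keeps large empty regions from dragging the clip threshold down on sparse markers.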

Lives at the top level (``deepcell_types/preprocessing.py``), not under
``training/``, because public inference users need it too — running
``predict()`` against an arbitrary FOV requires this exact preprocessing
upstream. Re-exports the function from ``deepcell_types.__init__`` so
``from deepcell_types import preprocess_fov`` works.

Only numpy + skimage dependencies (both already in the base install) —
the inference-deps guard still passes.

The snapshot test from the training repo
(``tests/test_preprocessing.py::test_snapshot_against_production``)
will follow in Phase 9 when ``tests/`` is migrated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copies the training-side configuration module from B
(deepcelltypes-cell-type-assignment-pytorch:config.py) into
``deepcell_types/training/config.py`` verbatim. B is the up-to-date
source per the canonical-only-monorepo merge directive.

The migrated surface (1343 lines) includes:

- ``TissueNetConfig`` — heavy training-side config that opens the zarr
  archive directly (vs the inference-side ``DCTConfig`` which reads
  root attrs via JSON). Exposes ct2idx, marker2idx, domain2idx,
  tissue2idx, dataset_keys, tumor_datasets, tissue_celltype_mapping,
  domain_mapping, celltype_mapping, marker_positivity, plus the lazy
  per-dataset MP DataFrame loader.
- ``LazyMarkerPositivityDict`` — dict-shaped lazy loader for
  per-dataset marker-positivity DataFrames; avoids walking ~1.9k
  datasets at init when only ~285 carry MP data.
- ``archive_metadata_fingerprint`` / ``cached_archive_metadata_fingerprint``
  / ``archive_array_fingerprint`` — stable hashes used to invalidate
  cell-data and baseline-feature caches after in-place archive repairs.
- ``_discover_fov_keys`` — detects v7 (flat) vs v8 (5-level
  ``modality/tissue/cohort/sample/fov``) layouts and returns
  slash-joined leaf FOV keys that zarr and the filesystem both
  resolve.
- ``_patch_zarr_v3_alpha_metadata`` — workaround for zarr 3.0.0a*
  metadata-cache bugs; the ``[train]`` extra pins zarr>=3.1 to avoid
  needing this in fresh installs, but the patch stays for dev envs
  still on the alpha.
- ``extract_patch`` / ``extract_patch_from_zarr`` /
  ``compute_distance_transform`` — patch-extraction utilities consumed
  by training/dataset.py (next commit).
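The lazy, dict-shaped loading idea behind `LazyMarkerPositivityDict` can be sketched with `collections.abc.Mapping`. This is a hypothetical `LazyDict`, not the migrated class. It assumes the set of keys is known up front and that each value is loaded on first access and then cached.

```python
from collections.abc import Mapping


class LazyDict(Mapping):
    """Hypothetical sketch: values are loaded on first access and cached."""

    def __init__(self, keys, loader):
        self._keys = list(dict.fromkeys(keys))  # stable order, no duplicates
        self._key_set = frozenset(self._keys)
        self._loader = loader
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._key_set:
            raise KeyError(key)
        if key not in self._cache:
            self._cache[key] = self._loader(key)  # load once, then reuse
        return self._cache[key]

    def __contains__(self, key):
        # Override Mapping's default, which would call __getitem__ and load.
        return key in self._key_set

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)
```

The `__contains__` override matters. Without it, a membership test would trigger the expensive per-dataset load that the lazy dict exists to avoid.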

Zero ``deepcelltypes.*`` cross-imports: the file is fully self-contained
at the package level, so the copy lands without rewiring.
Inference-deps guard still passes: ``deepcell_types.predict`` does
not transitively reach ``deepcell_types.training.config``, so zarr
and pandas stay out of the base install.

The DCTConfig (inference) vs TissueNetConfig (training) behavioral
audit follows in a subsequent commit — the merge directive says B
wins where they diverge, and a couple of spots need aligning (the
domain2idx derivation, the ct2idx defensive casting). For now the
two coexist and the inference path is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copies B's dataset.py (1706 lines) to ``deepcell_types/training/dataset.py``.
Drop-in migration — B's relative imports (``from .config import ...``,
``from .annotations import ...``) resolve cleanly inside the new
``training/`` package since config.py and annotations.py already live
there from the previous commits.

Brings over:

- ``FullImageDataset`` — the canonical zarr-backed training Dataset.
  Returns ``(C_max, 1, H, W)`` raw*self_mask, ``(3, H, W)`` spatial
  context (self_mask, neighbor_mask, distance_transform), and the "?"
  marker-positivity validity mask required by the marker positivity
  loss.
- ``AugmentedDataset`` + ``DropOutChannels`` — train-time augmentations
  (horizontal/vertical flips, random channel dropout).
- ``FOVGroupedSampler`` — keeps samples from the same FOV together
  within a batch to amortize zarr open cost / preserve neighborhood
  context.
- ``create_fov_splits`` / ``save_fov_splits`` / ``load_fov_splits`` —
  stratified-by-modality FOV partitioning with sole-source detection
  so a single-FOV cell type stays in one split.
- ``compute_sample_weights`` — class-balanced sampling weights.
- ``create_dataloader`` — top-level factory the training scripts call.
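The grouping idea behind `FOVGroupedSampler` can be sketched with plain Python. This hypothetical `fov_grouped_batches` shuffles FOV order and the samples within each FOV, then keeps each FOV's samples contiguous before cutting batches. The real sampler may differ, for example in how it handles batches that span two FOVs.

```python
import random
from collections import defaultdict


def fov_grouped_batches(fov_of_sample, batch_size, seed=0):
    """Hypothetical sketch: batch sample indices so each FOV stays contiguous."""
    groups = defaultdict(list)
    for i, fov in enumerate(fov_of_sample):
        groups[fov].append(i)
    rng = random.Random(seed)
    fovs = list(groups)
    rng.shuffle(fovs)  # randomize FOV order each epoch/seed
    order = []
    for fov in fovs:
        idx = groups[fov][:]
        rng.shuffle(idx)  # randomize order within the FOV
        order.extend(idx)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

Keeping a FOV's samples adjacent means consecutive reads usually hit an already-open zarr group.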

Inline ``_Compose`` / ``_RandomHorizontalFlip`` / ``_RandomVerticalFlip``
are intentional re-implementations of the torchvision transforms; B
uses them to avoid a hard torchvision import at module-load time (the
[train] extra pulls torchvision in, but the pattern matches the rest
of the package's lazy-deps discipline).

Inference-deps guard still green: importing deepcell_types.predict
does not transitively reach deepcell_types.training.dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copies B's utils.py (1462 lines) to ``deepcell_types/training/utils.py``
verbatim and rewrites three lazy package imports from
``from deepcelltypes.X`` to ``from deepcell_types.training.X``.

Surface migrated:

- ``BatchData`` — dataclass collecting per-batch inputs (sample,
  spatial_context, ch_idx, padding_mask, marker_pos_mask, ct_label,
  domain_label, dataset_name, fov_name, cell_index, ...).
- ``LossesAndMetrics`` — per-epoch loss and metric accumulator.
- ``MPMetricsTracker`` — marker-positivity per-marker counters and
  threshold sweeps.
- ``PredLogger`` — atomic-write CSV predictions logger (5-field:
  labels, probs, cell_index, dataset_name, fov_name). Name collides
  with ``deepcell_types.predict.PredLogger`` but they live in
  different namespaces — A's is a 2-field inference result buffer
  with a different interface. Leaving both: B is the
  training-authoritative version, A's stays as the inference API
  contract.
- ``get_tissue_ct_exclude`` — per-sample tissue/dataset-aware ct
  exclusion list builder for training-time masking. Different
  function from A's ``_excluded_celltype_indices`` (which is the
  per-tissue public-API affordance for ``predict(tissue_exclude=...)``);
  both retained.
- Seed / dataloader hygiene: ``seed_everything``, ``worker_init_fn``,
  ``make_generator``.
- Label compaction: ``build_label_remap``, ``adjust_conf_mat_hierarchy``.
- Wandb logging: ``log_epoch_metrics``, ``log_confusion_matrix``
  (wandb is a lazy import inside the functions; module load works
  without it).
- Feature extraction: ``extract_features_from_zarr``,
  ``_extract_all_dataset_features``, ``compute_baseline_metrics``,
  ``save_baseline_predictions``.
- Atomic file utilities and a cache-metadata fingerprint helper used
  by the cell-data caching layer.
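A common way to get the atomic-write behaviour described for `PredLogger` and the file utilities is to write to a temporary file in the same directory and then `os.replace` it into place. The sketch below assumes that approach; `atomic_write_csv` is a hypothetical name, not the migrated helper.

```python
import csv
import os
import tempfile
from pathlib import Path


def atomic_write_csv(path, header, rows):
    """Hypothetical sketch: readers see either the old file or the complete new one."""
    path = Path(path)
    # Temp file in the same directory, so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before the rename
        os.replace(tmp, path)  # atomic rename over any existing file
    except BaseException:
        try:
            os.unlink(tmp)  # do not leave partial temp files behind
        except FileNotFoundError:
            pass
        raise
```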

Inference-deps guard still green: importing
``deepcell_types.predict`` does not transitively reach
``deepcell_types.training.utils`` (pandas, the heaviest dep here,
stays out of the base install).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copies B's abstention.py into ``deepcell_types/training/abstention.py``.

Placement decision: training-side, not top-level public API. The
``apply_abstention`` function takes a ``pandas.DataFrame`` as its
input, so promoting the module to ``deepcell_types/abstention.py``
would either force pandas into the inference base install (breaks
Phase 3's [train] dep split) or require a pandas-to-numpy refactor
of the public surface. Neither is justified now — the module's
existing callers in B all live in scripts and notebooks that will
be moved under training-side surfaces in Phase 9.

If the public release wants abstention as a first-class inference
feature later, two options: (a) refactor apply_abstention to take
arrays + group keys instead of a DataFrame and move it to
deepcell_types/abstention.py; (b) accept pandas in the inference
deps and move both the file and the [train] guard. Defer to Phase 10.

Updates one internal docstring reference from
``deepcelltypes.utils.adjust_conf_mat_hierarchy`` to the new
``deepcell_types.training.utils.adjust_conf_mat_hierarchy`` path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Intended to land in the previous commit, but the edit was rejected
because the file had not yet been read in that editing session. Updates the
``hierarchical_correct`` docstring's cross-reference from
``deepcelltypes.utils.adjust_conf_mat_hierarchy`` to the migrated
``deepcell_types.training.utils.adjust_conf_mat_hierarchy`` path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final bulk migration from
deepcelltypes-cell-type-assignment-pytorch (B) into the canonical-only
monorepo:

- ``scripts/`` (16 entry points, 528KB): train.py, pretrain.py,
  predict.py (script form, distinct from the library
  ``deepcell_types.predict``), benchmark_gold_standard.py,
  run_gold_standard_nimbus.py, generate_openai_embeddings{,_v8}.py,
  generate_manifest_index.py, generate_splits.py, ingest_gold_to_zarr.py,
  split_val_for_test.py, refine_mp_labels_with_intensity_v2.py,
  validate_archive_contract.py, _combine_v3_and_C.py,
  download_gold_standard.sh.
- ``tests/`` (22 new test modules merged with A's existing
  ``test_canonical_inference.py`` and ``test_inference_deps.py``).
  No filename collisions; the A tests stayed in place.
- ``config/combined_celltypes.yaml``: a small (a couple of KB) cell-type
  group taxonomy used by TissueNetConfig.combined_celltype_mapping.
  Skipped the 30MB ``marker_embeddings-deepseek-r1-70b.json`` (training
  artifact, not part of the public release surface — users regenerate
  via scripts/generate_openai_embeddings.py).

Import rewrites applied via sed across the migrated files AND across
``deepcell_types/training/`` itself (caught two lazy ``from
deepcelltypes.utils import ...`` imports inside training/dataset.py
that the verbatim copy preserved):

  deepcelltypes.model         -> deepcell_types.model           (top-level)
  deepcelltypes.preprocessing -> deepcell_types.preprocessing   (top-level)
  deepcelltypes.abstention    -> deepcell_types.training.abstention
  deepcelltypes.annotations   -> deepcell_types.training.annotations
  deepcelltypes.config        -> deepcell_types.training.config
  deepcelltypes.dataset       -> deepcell_types.training.dataset
  deepcelltypes.losses        -> deepcell_types.training.losses
  deepcelltypes.utils         -> deepcell_types.training.utils
  deepcelltypes.gold_metadata -> deepcell_types.training.gold_metadata
  from deepcelltypes import   -> from deepcell_types.training import
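The rewrite was applied with sed. The same ordered mapping can be expressed in Python as a sketch. Word boundaries keep a module name from matching a longer identifier that merely starts with it. The bare `from deepcelltypes import` form is applied last.

```python
import re

# Ordered (old, new) pairs, mirroring the mapping above.
REWRITES = [
    ("deepcelltypes.model", "deepcell_types.model"),
    ("deepcelltypes.preprocessing", "deepcell_types.preprocessing"),
    ("deepcelltypes.abstention", "deepcell_types.training.abstention"),
    ("deepcelltypes.annotations", "deepcell_types.training.annotations"),
    ("deepcelltypes.config", "deepcell_types.training.config"),
    ("deepcelltypes.dataset", "deepcell_types.training.dataset"),
    ("deepcelltypes.losses", "deepcell_types.training.losses"),
    ("deepcelltypes.utils", "deepcell_types.training.utils"),
    ("deepcelltypes.gold_metadata", "deepcell_types.training.gold_metadata"),
    ("from deepcelltypes import", "from deepcell_types.training import"),
]


def rewrite_imports(source: str) -> str:
    for old, new in REWRITES:
        # \b prevents e.g. "deepcelltypes.model" from matching "deepcelltypes.model_utils".
        source = re.sub(rf"\b{re.escape(old)}\b", new, source)
    return source
```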

Path fix in training/config.py: ``CONFIG_DIR`` now resolves three
parents up (deepcell_types/training/config.py ->
deepcell_types/training/ -> deepcell_types/ -> repo root -> config/),
one ``.parent`` deeper than B's original two-segment ``deepcelltypes/``
layout.

Test results: 239/245 passing. The 6 failures are all env/data
dependent, not migration bugs:

- 4 × test_v2.py::TestLossesAndMetricsCompute — needs torchmetrics
  (in the [train] extra, not in the base install). Pass when [train]
  is installed.
- 1 × test_preprocessing.py::test_snapshot_against_production — needs
  the production zarr archive at PRODUCTION_ARCHIVE AND zarr>=3.1
  (the [train] pin); dev env has zarr 3.0.0a5 and no archive.
- 1 × test_refine_mp_labels_v2.py::test_stage7_synthetic_gold_validation
  — imports from ``analysis/`` which was explicitly deferred from
  this migration (research cruft triage is a separate exercise).

Deferred from this migration: ``output/`` (62GB), ``models/`` (48GB),
``features/`` (8.9GB), ``baselines/`` (7.3GB), ``data/`` (4.9GB),
``wandb_tmp/``, ``embeddings/``, ``figures/``, ``splits/``, ``logs/``,
``analysis/`` (~400KB), ``experiments/`` (~400KB). The big ones are
training artifacts that should never be in git regardless; the small
ones (analysis/, experiments/) are research code to triage separately
before deciding whether they belong in the public release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the four baseline comparison repos into A as submodules,
matching B's layout under ``baselines/``. Each sub-repo is owned at
github.com/xuefei-wang/deepcelltypes-<name>.git and is now tracked
on ``main``.

Pre-flight: in each baseline repo, ``paper-faithfulness-alignment``
was 2-4 commits ahead of ``main`` with 0 behind. Fast-forwarded
``main`` to ``paper-faithfulness-alignment`` and pushed for each
repo before adding the submodule here, so A's pin lands on the same
commit that B's pin pointed to:

  cellsighter  79c79aa..e8c078d  (paper -> main FF)
  maps         c50c0eb..5b59f46
  nimbus       f3f65e9..9bfe11d
  xgboost      e4db5ed..b227380

A's submodule branch tracking points at ``main`` for all four;
``paper-faithfulness-alignment`` remains as a historical reference
in each sub-repo but is no longer the active branch.

The baseline source code is small (~200KB total tracked); B's local
``baselines/maps`` working tree had ~7GB of model artifacts that are
not git-tracked and stay in B. The fresh clone in A contains only
the tracked source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the two BLOCKER findings from the deep-review:

1. ``tifffile`` is a top-level import in ``scripts/ingest_gold_to_zarr.py``
   (and lazy at 4 other scripts) but was absent from ``pyproject.toml``.
   Any clean ``pip install deepcell-types[train]`` could not run the
   ingest pipeline. Added to the ``[train]`` extra.

2. ``training/config.py::CONFIG_DIR`` resolved three ``.parent``s up
   to a repo-root ``config/`` directory that does not exist after
   ``pip install`` (it would land at ``site-packages/config/``). The
   YAML file ``combined_celltypes.yaml`` therefore was unreachable
   from any non-editable install, and ``combined_celltype_mapping``
   silently returned ``{}`` — group-level cell-type logic invisibly
   broke for installed users.

   Fix: move ``config/combined_celltypes.yaml`` into the package at
   ``deepcell_types/training/config/combined_celltypes.yaml``, shorten
   ``CONFIG_DIR`` to ``Path(__file__).parent / "config"``, and extend
   ``[tool.setuptools.package-data]`` to include
   ``training/config/*.yaml`` so the wheel actually ships the file.
   Verified: ``yaml.safe_load(CONFIG_DIR / "combined_celltypes.yaml")``
   loads 48 entries from the new location.
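The installed-path problem can be reproduced with `pathlib` by substituting a hypothetical site-packages location for `Path(__file__)`. Three `.parent` hops from the installed module escape the package. One hop plus `"config"` stays inside it.

```python
from pathlib import PurePosixPath

# Hypothetical install location standing in for Path(__file__).
installed = PurePosixPath(
    "/venv/lib/python3.11/site-packages/deepcell_types/training/config.py")

# Old resolution: three parents up, then config/ (lands in site-packages/).
old_config_dir = installed.parent.parent.parent / "config"

# New resolution: config/ next to the module, inside the package.
new_config_dir = installed.parent / "config"
```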

Test suite after fix: 115 passed, 1 skipped, 1 failed. The single
failure is ``test_snapshot_against_production``, which needs zarr>=3.1
(the ``[train]`` extra pins it) plus the production archive available
at ``$PRODUCTION_ARCHIVE_PATH``. It is a pre-existing, environment-dependent
failure, not introduced by this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Themes addressed in one batch (see reviews/2026-05-10-2345/SYNTHESIS.md):

- J (errors H1): mp_macro_precision/recall used np.mean over arrays
  containing np.nan for vacuous markers — poisoning wandb dashboards
  with NaN every epoch. Switch to np.nanmean with an all-NaN guard,
  matching the existing macro_f1 treatment.

- A (API/simplification H3): rename predict.PredLogger to
  _InferenceResultBuffer (private) to remove the collision with the
  richer training-side PredLogger; same-name, incompatible-signature
  classes were a future-bug magnet.

- B (API/perf H2): num_workers default 24 → 0 in predict(). The
  docstring already warned "only safe with >64 GB RAM"; 24 workers each
  hold a full FOV in memory and re-run preprocessing.

- C (multiple): drop stale deepcelltypes-kit fallback paths in
  get_channel_embedding / get_celltype_embedding (path didn't exist
  post-merge → silent {} return); rewrite the training/config.py
  module docstring; fix docs/site/API-key.md broken
  "from utils import download_training_data" import.

- D (API H1, M4): TissueNetConfig default zarr_path is now None with
  DEEPCELL_TYPES_ZARR_PATH env-var fallback (was hard-coded /data2/...
  NFS path). Fix create_model docstring to name DCTConfig.

- I (errors H2/H3, M1, M5): narrow three broad `except Exception`
  blocks in dataset.py (_load_tissuenet_archive: cache build,
  modality attr, tissue attr) to (KeyError, AttributeError, TypeError,
  ValueError, OSError, json.JSONDecodeError, GroupNotFoundError).
  Add a >1% drop-rate guard so schema regressions can no longer
  silently lose hundreds of datasets. Narrow zarr-v3-alpha shim
  except to ImportError. Catch UnicodeDecodeError in
  _read_dataset_metadata.
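Three of the fixes above fit in a few lines each: the NaN-safe macro average (J), the `DEEPCELL_TYPES_ZARR_PATH` fallback (D), and the >1% drop-rate guard (I). All function names, the exception types raised, and the all-NaN return value below are assumptions for illustration. The real code may choose different fallbacks.

```python
import os

import numpy as np


def nan_safe_macro(per_marker):
    """Hypothetical: macro mean that skips vacuous (NaN) markers."""
    arr = np.asarray(per_marker, dtype=float)
    if np.all(np.isnan(arr)):  # also true for an empty array
        # Nothing to average; return NaN without nanmean's RuntimeWarning.
        return float("nan")
    return float(np.nanmean(arr))


def resolve_zarr_path(zarr_path=None):
    """Hypothetical: explicit argument wins, else the environment variable."""
    path = zarr_path or os.environ.get("DEEPCELL_TYPES_ZARR_PATH")
    if not path:
        raise ValueError("pass zarr_path= or set DEEPCELL_TYPES_ZARR_PATH")
    return path


def check_drop_rate(n_dropped, n_total, max_rate=0.01):
    """Hypothetical: fail loudly instead of silently losing many datasets."""
    if n_total and n_dropped / n_total > max_rate:
        raise RuntimeError(
            f"dropped {n_dropped}/{n_total} datasets "
            f"(> {max_rate:.0%}); possible schema regression")
```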

Also removed two vestigial "see MEMORY.md" cross-references in
LossesAndMetrics warning text (MEMORY.md never existed in this repo).

Tests: 243 passed, 1 skipped, 1 pre-existing env-dependent failure
(test_stage7_synthetic_gold_validation needs analysis/ on path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From the deep review's tests.md HIGH findings + the API/tests M1
cross-config agreement gap:

- _build_model n_markers mismatch → ValueError
- _build_model n_celltypes mismatch → ValueError
- _excluded_celltype_indices on unknown tissue → ValueError
- _excluded_celltype_indices positive case: returned exclusion rows
  contain every non-allowed index for the tissue
- _excluded_celltype_indices(tissue=None) passthrough
- PatchDataset with channel_names matching nothing → ValueError
  ("No input channels matched")
- DCTConfig and TissueNetConfig built from the same archive must
  agree on MAX_NUM_CHANNELS, CROP_SIZE, STANDARD_MPP_RESOLUTION,
  marker2idx, ct2idx (importorskip("zarr") gates the test on the
  training extra).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From reviews/2026-05-10-2345/simplification.md H1+H2 and complexity.md H2:

- Delete _zarr_group_filesystem_path and _read_v3_1d_array from
  training/utils.py. Both were verbatim copies of annotations.py's
  group_filesystem_path / read_v3_1d_array with zero callers across
  the repo (verified by grep). The annotations.py versions are the
  canonical ones imported by training/dataset.py.

- Delete the three pass-through static shim methods on FullImageDataset
  (_group_filesystem_path, _read_v3_1d_array, _centroid_to_cell_idx_fast).
  None were called anywhere — adding zero value, only obscuring that
  the real helpers live in annotations.py. Note: _build_centroid_tree
  is kept (also flagged but not in the HIGH list).

- Backport the zstd-level-aware codec read from dct_kit/config.py into
  annotations.py:read_v3_1d_array. The old training-side copy hardcoded
  Zstd(level=0) while the inference side correctly reads level from
  the codec config. With archives written at a non-zero compression
  level the training-side read would silently produce garbage. Both
  paths now share the level-aware contract.
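A minimal sketch of reading the level from the stored codec configuration instead of hard-coding `level=0`, assuming the zarr v3 `zarr.json` codec-chain layout shown in the docstring. `zstd_level_from_codecs` is a hypothetical helper, not the actual annotations.py function.

```python
def zstd_level_from_codecs(codecs, default=0):
    """Return the zstd level recorded in a zarr v3 codec chain.

    `codecs` is the "codecs" list from an array's zarr.json, e.g.
    [{"name": "bytes", ...}, {"name": "zstd", "configuration": {"level": 3}}].
    """
    for codec in codecs:
        if codec.get("name") == "zstd":
            # Fall back to `default` only when the level is genuinely absent.
            return int(codec.get("configuration", {}).get("level", default))
    raise ValueError("codec chain has no zstd codec")
```

The returned level can then be handed to whatever Zstd codec the reader constructs (for example `numcodecs.Zstd(level=...)`), so the training and inference paths derive the codec from the same metadata.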

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es (Theme F)

config.py and utils.py had grown to 1.3k and 1.5k LOC, mixing archive
fingerprinting, patch extraction, metric trackers, baseline IO, and the
core TissueNetConfig/RNG/log helpers in one place each. Carve four
focused modules out (verbatim, no logic changes):

- training/archive.py: zarr v3 alpha metadata patch, archive metadata
  / array fingerprinting, FOV-key discovery, and the per-process caches.
- training/patch.py: per-cell patch extraction
  (compute_distance_transform, extract_patch_from_zarr, extract_patch).
- training/metrics.py: confusion-matrix hierarchy adjustment,
  MP per-marker reduction, MPMetricsTracker, LossesAndMetrics,
  build_label_remap.
- training/baseline_features.py: baseline classifier feature extraction
  pipeline (_conf_mat_summary, compute_baseline_metrics,
  save_baseline_predictions, _extract_all_dataset_features,
  extract_features_from_zarr, _get_cell_data_from_ds).

Re-exports at the bottom of config.py and utils.py keep all
tests/scripts working unchanged (230 passed, 1 skipped, matching the
pre-split baseline). dataset.py is updated to import directly from
the new homes for cached_archive_metadata_fingerprint and extract_patch.

Two non-mechanical touches required to keep monkey-patch-based tests
green:
- baseline_features.extract_features_from_zarr looks up
  _discover_fov_keys and _extract_all_dataset_features via the
  config / utils modules at call time, so tests that monkeypatch
  those symbols on the legacy modules still take effect after the
  split. _FINGERPRINT_CACHE / _FOV_KEYS_CACHE dicts are re-exported
  from config.py for the same reason (test_dataset_cache mutates them).
- metrics.LossesAndMetrics.compute defers import of _conf_mat_summary
  to method-call time to avoid a metrics <-> baseline_features import
  cycle (baseline_features needs adjust_conf_mat_hierarchy at module
  load).
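Why the call-time lookup matters for monkeypatch-based tests: a name bound with `from config import ...` at import time is a private reference that later patches never reach. The sketch below uses an in-memory `types.ModuleType` as a hypothetical stand-in for config.py.

```python
import types

# Hypothetical stand-in for the legacy config module that tests monkeypatch.
config = types.ModuleType("config")
config._discover_fov_keys = lambda path: ["fov0", "fov1"]

# Import-time binding: equivalent to `from config import _discover_fov_keys`.
_discover_bound_early = config._discover_fov_keys


def extract_bound_early(path):
    return _discover_bound_early(path)


def extract_looked_up_late(path):
    # Resolve the helper on the legacy module at call time, so a monkeypatch
    # applied to config._discover_fov_keys after import is still honored.
    return config._discover_fov_keys(path)


# What monkeypatch.setattr(config, "_discover_fov_keys", ...) effectively does:
config._discover_fov_keys = lambda path: ["patched"]
```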

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
From reviews/2026-05-10-2345/docs.md HIGH findings:

- README: add a "Training" section describing the [train] extra and the
  four main entry points under scripts/. Move "Download the model"
  after "Installation" (it previously came first, so the steps could not
  be followed in reading order).

- docs/index.md: add a "Training" section explaining that training-only
  code lives under deepcell_types.training, gated behind the [train]
  extra, with pointers to scripts/{train,predict,pretrain,
  benchmark_gold_standard,ingest_gold_to_zarr}.py. Fix the long-standing
  "sorce" typo.

- docs/site/tutorial.md: bump the example archive placeholder from
  tissuenet-v8.zarr → tissuenet-v9.zarr to match DCTConfig's probe
  order (v9 is the canonical contemporary archive).

The docs.md HIGH for the broken `from utils import download_training_data`
import in docs/site/API-key.md was fixed in 88b95f9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five MEDIUM/HIGH findings from reviews/2026-05-10-2345 in one batch:

- complexity H1: TissueNetConfig.get_marker_positivity() and
  marker_positivity_labels[] now share a single LazyMarkerPositivityDict.
  Previously the plain-dict cache populated by get_marker_positivity()
  was discarded the first time marker_positivity_labels was accessed
  (the property replaced the field), causing wasted I/O and divergent
  caches. _marker_positivity_cache is now Optional[LazyMP...] and
  lazily constructed on first access; get_marker_positivity routes
  through marker_positivity_labels for a single source of truth.

- numerical M1: MarkerEmbeddingLayer.forward zeros output for
  padding positions (ch_idx == -1). Without this, F.normalize(proj(0))
  yielded a unit-norm direction equal to F.normalize(proj.bias) — a
  non-trivial embedding flowing into the transformer for tokens that
  should be invisible.

- numerical M2: CellTypeAnnotator.forward zeros spatial features
  for padding positions BEFORE the fusion concat. Otherwise padding
  tokens enter self.fusion with [0, spatial_feat] and emerge as
  W_spatial @ spatial_feat + bias.

- API M1: rename predict(tissue_exclude=...) → predict(tissue_filter=...).
  The old name was inverted — "tissue_exclude='colon'" actually meant
  "filter TO colon-associated cell types". The deprecated alias stays
  (keyword-only) and emits DeprecationWarning; passing both raises
  TypeError.

- API M3: predict(return_probabilities=True) returns a
  PredictionResult dataclass with cell_types, probabilities (full
  per-cell softmax matrix), and cell_indices. Default behaviour
  unchanged (returns list[str]). PredictionResult and DCTConfig are
  now hoisted to top-level so `from deepcell_types import
  PredictionResult, DCTConfig` works.
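A sketch of the rename-with-deprecated-alias pattern described for `tissue_filter` / `tissue_exclude`. The toy `predict` only shows the argument handling (the real function takes many more parameters), and the `_MISSING` sentinel is an assumption of this sketch; it lets an explicit `tissue_exclude=None` still trigger the warning.

```python
import warnings

_MISSING = object()  # sentinel: distinguishes "not passed" from an explicit None


def predict(image, *, tissue_filter=None, tissue_exclude=_MISSING):
    """Toy predict() that only resolves the renamed tissue keyword."""
    if tissue_exclude is not _MISSING:
        if tissue_filter is not None:
            raise TypeError("pass tissue_filter or tissue_exclude, not both")
        warnings.warn(
            "tissue_exclude is deprecated; use tissue_filter (same behaviour)",
            DeprecationWarning,
            stacklevel=2,
        )
        tissue_filter = tissue_exclude
    # The real function would run inference here; the toy returns the filter.
    return tissue_filter
```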

Tests: 233 passed, 1 skipped. Added 3 new tests covering
return_probabilities, tissue_exclude DeprecationWarning, and the
both-args TypeError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tests M3: add a regression anchor in test_train_loop_smoke.py that
  asserts scripts/train.py still contains the AMP scheduler-gate
  predicate. The 2-line _run_gated_step helper is faithful to the
  production behavior but a silent drift would otherwise let the
  emulator tests pass while real training desynchronizes OneCycleLR.

- tests M2: same idea for test_zero_channel_masking.py. The unit-test
  helper is a verbatim copy of __getitem__'s masking block; a refactor
  could let the copy drift. New test asserts
  training/dataset.py still contains _zero_channel_cache and
  fov_zero_mask.

- docs M4: add CHANGELOG.md documenting the 0.0.1 → 0.1.0 release
  (canonical-only refactor, training subpackage, breaking removal of
  CellTypeCLIPModel, deprecated tissue_exclude alias, num_workers=0
  default, TissueNetConfig env-var default). Bump version in
  pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
complexity H8: replace FullImageDataset.indices' positional 8-tuple
with a CellIndexRecord NamedTuple. Named fields make grep / refactor
safe (no more record[6] / record[5] magic numbers across 10+ call
sites). NamedTuple IS a tuple, so positional access still works for
backward compat with serialized caches that stored raw 8-tuples.
Production call sites in dataset.py now use .ct_label_standard,
.dataset_name, .fov_name, .ds_idx, .domain accessors. Mock-index
constructors in tests/{test_v2,test_samplers,test_stratified_splits,
test_dataset_splits}.py updated to build CellIndexRecord instances.
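A sketch of the NamedTuple pattern, showing why positional access survives. Only the last five field names come from the commit message; the field order and the first three names (`cell_idx`, `centroid_row`, `centroid_col`) are hypothetical placeholders.

```python
from typing import NamedTuple


class CellIndexRecord(NamedTuple):
    # ASSUMPTION: field order and the first three field names are illustrative;
    # the commit names only the last five accessors.
    cell_idx: int
    centroid_row: int
    centroid_col: int
    ct_label_standard: str
    dataset_name: str
    fov_name: str
    ds_idx: int
    domain: int


# A raw 8-tuple as an older serialized cache might have stored it.
legacy = (7, 120, 64, "T cell", "example_dataset", "fov3", 2, 0)
record = CellIndexRecord._make(legacy)
```

`CellIndexRecord._make` upgrades a raw tuple loaded from an old cache without copying field by field, and because the record compares equal to the tuple it came from, existing equality-based checks keep passing.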

complexity H7: introduce DataLoaderConfig dataclass + matching
create_dataloader_from_config(zarr_dir, dct_config, cfg) wrapper.
Lets new callers pass a single discoverable object instead of 20+
keyword arguments. The legacy keyword signature of create_dataloader
is preserved verbatim so train.py / predict.py / tests don't need
any change. Field defaults mirror create_dataloader's defaults
exactly — DataLoaderConfig() is equivalent to no-override.
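A sketch of the config-object wrapper. `create_dataloader` here is a stand-in with three keyword defaults (the real function takes 20+); `use_weighted_sampler` is a parameter the real function accepts, but every default value below is assumed. `defaults_in_sync` shows one way to keep the "DataLoaderConfig() is equivalent to no-override" promise testable.

```python
import inspect
from dataclasses import asdict, dataclass, fields


def create_dataloader(zarr_dir, dct_config, *, batch_size=256, num_workers=0,
                      use_weighted_sampler=True):
    """Stand-in for the real keyword-heavy function; echoes what it received."""
    return dict(zarr_dir=zarr_dir, dct_config=dct_config, batch_size=batch_size,
                num_workers=num_workers, use_weighted_sampler=use_weighted_sampler)


@dataclass(frozen=True)
class DataLoaderConfig:
    # Defaults must mirror create_dataloader's keyword defaults exactly.
    batch_size: int = 256
    num_workers: int = 0
    use_weighted_sampler: bool = True


def create_dataloader_from_config(zarr_dir, dct_config, cfg=None):
    """Forward every config field to the legacy keyword signature."""
    cfg = DataLoaderConfig() if cfg is None else cfg
    return create_dataloader(zarr_dir, dct_config, **asdict(cfg))


def defaults_in_sync():
    """True if every config field default equals create_dataloader's default."""
    params = inspect.signature(create_dataloader).parameters
    return all(params[f.name].default == f.default for f in fields(DataLoaderConfig))
```

Freezing the dataclass keeps a shared config from being mutated between loaders; `dataclasses.replace(cfg, batch_size=32)` derives a variant instead.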

Tests: 235 passed, 1 skipped (analysis-only env failure unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse training pipeline into deepcell-types (canonical-only)
…ne submodule rebase

Three independent bugs surfaced when running training against the current
master HEAD from a fresh workspace install:

1. tissue_idx kwarg mismatch (scripts/train.py:121, scripts/predict.py:208 + 334)
   scripts pass `tissue_idx=batch_data.tissue_idx` to
   `CellTypeAnnotator.forward(...)`, but the model's forward signature is
   `(sample, spatial_context, ch_idx, padding_mask, ct_exclude=None,
   return_attn_weights=False, domain_idx=None)` — no `tissue_idx`. The
   tissue-FiLM MP head experiment was rolled back (see memory
   `v10_mp_expansion_tissue_negative.md`) and the model dropped the
   parameter, but the scripts kept passing it. Result: every training /
   prediction run dies at the first forward pass with
   `TypeError: ...got an unexpected keyword argument 'tissue_idx'`.
   Fix: drop the kwarg at all three call sites. `batch_data.tissue_idx`
   is still populated by the dataloader and remains available to anyone
   who needs it downstream — the model just doesn't consume it.

2. Circular import between training/utils.py and training/baseline_features.py
   utils.py re-exports four symbols from baseline_features.py at module
   level for backward compat. baseline_features.py also imports private
   helpers (`_atomic_np_savez` etc.) from utils.py. When utils.py is
   imported first (training path) the cycle resolves fine, but when
   baseline_features.py is imported first (baseline path — e.g.
   `import xgb.run`), the partially-initialized utils.py reaches back to
   `baseline_features._extract_all_dataset_features` before that name is
   defined, and ImportError fires.
   Fix: convert the re-exports to a module-level `__getattr__` so the
   lookup is deferred until actual access, by which point both modules
   have finished initializing. Existing callers
   (`from deepcell_types.training.utils import save_baseline_predictions`,
   verified in tests/test_v2.py) keep working.

3. Submodule rebase (baselines/{maps,cellsighter,xgboost,nimbus})
   Each baseline's pyproject.toml listed `deepcelltypes @ git+...
   deepcelltypes-cell-type-assignment-pytorch.git` as a dep; that URL
   now resolves to the renamed research workspace (no longer a Python
   package) and `uv pip install` fails with a metadata-name mismatch.
   Each baseline also imported from `deepcelltypes.{config,utils,dataset}`
   — the pre-refactor flat layout. Companion commits on each submodule's
   `fix/post-refactor-imports` branch replace the dep URL with a plain
   `deepcell-types` and rebase imports onto
   `deepcell_types.training.{config,utils,dataset,metrics,baseline_features}`.
   This parent commit bumps the submodule pointers to those branch tips.
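A runnable sketch of the PEP 562 module-level `__getattr__` fix. In the real package the hook is simply `def __getattr__(name): ...` at the top level of utils.py; here both modules are built in memory (`demo_utils` and `demo_baseline_features` are hypothetical names) so the example runs as a single file.

```python
import importlib
import sys
import types

# In-memory stand-in for baseline_features.py.
baseline_features = types.ModuleType("demo_baseline_features")


def save_baseline_predictions(path):
    return f"saved predictions to {path}"


baseline_features.save_baseline_predictions = save_baseline_predictions
sys.modules["demo_baseline_features"] = baseline_features

# Names utils.py used to re-export eagerly at import time -> their home module.
_LAZY_REEXPORTS = {"save_baseline_predictions": "demo_baseline_features"}


def _utils_getattr(name):
    """PEP 562 hook: only called when normal attribute lookup on utils fails."""
    home = _LAZY_REEXPORTS.get(name)
    if home is None:
        raise AttributeError(f"module 'demo_utils' has no attribute {name!r}")
    # The import runs on first access, after both modules finished initializing.
    return getattr(importlib.import_module(home), name)


# In a real utils.py this is just a top-level `def __getattr__(name): ...`.
utils = types.ModuleType("demo_utils")
utils.__getattr__ = _utils_getattr
sys.modules["demo_utils"] = utils
```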

End-to-end verification: with the three fixes, a fresh workspace `uv sync`
+ smoke training (`scripts/train.py` with the v10 split + svd_512_v6
embeddings) gets through model build, GPU allocation, and reaches batch 0
of epoch 0. The xgboost baseline imports cleanly after
`uv pip install -e baselines/xgboost`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…port

fix(train,predict,utils): tissue_idx kwarg + circular import + baseline submodule rebase
uv.lock is regenerated on every branch switch when pyproject.toml
shapes differ (master vs training), so keeping it tracked produces
constant churn. /reviews/ holds local /deep-review outputs.
Untouched since Oct 2024 and broken since the kit was inlined in
0a8108e: it COPYs a non-existent top-level requirements.txt and
pip-installs the deleted deepcelltypes-kit/ directory. No CI, docs,
or scripts reference it.
xuefei-wang and others added 30 commits June 17, 2026 13:51
…mmit -> CSV)

Pins each reported test number to its checkpoint (sha256), the train/eval code
commit (d13fd54 for the MLP-head run, b598710 for the resMLP run; both ancestors
of PR #41), and the prediction CSV (sha256). Notes the self-pinning gap (configs
don't record git_commit) addressed separately on PR #41.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CKPT_CONFIG now stores the code commit the run executed under (git rev-parse
HEAD of the checkout owning train.py — the pinned worktree's HEAD when run via
a pin), so any checkpoint/result traces to an exact code snapshot. 'unknown'
when not a git repo. Closes the traceability gap noted in the recipe-ablation
manifest.
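A minimal sketch of recording the code commit with a fallback to 'unknown'. `code_commit` is a hypothetical helper name; the `git_commit` key matches the config field named in these notes, but the surrounding checkpoint dict is illustrative.

```python
import subprocess
from pathlib import Path


def code_commit(repo_dir):
    """Return `git rev-parse HEAD` for the checkout at repo_dir, or 'unknown'."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=Path(repo_dir),
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # No git binary, missing directory, or not a git checkout.
        return "unknown"
    return result.stdout.strip()
```

Usage would look like `ckpt_config["git_commit"] = code_commit(Path(__file__).resolve().parent)`; resolving from the script's own directory is what makes a pinned worktree report its own HEAD.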

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ty gap note

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e reported set

MAPS and CellSighter chose their best checkpoint by evaluating on the same set
they then reported (MAPS: data["X_val"] each epoch -> best-by-val-loss; CellSighter:
test_loader each epoch -> best-by-macro-accuracy). Selecting on the reported set is
leakage. XGBoost was already correct (FOV-grouped inner-val for early stopping).

Both baselines now mirror XGBoost: select on a FOV-grouped inner-validation set
carved from the TRAIN FOVs (10%), and report once on the untouched test set.

- dataloader.create_dataloader: additive, default-off `inner_val_ratio`/`inner_val_seed`.
  When >0, carves a FOV-grouped inner-val from train_indices, trains only on
  inner-train, and returns the inner-val loader via metadata["inner_val_loader"].
  Default 0.0 leaves the main-model path unchanged.
- maps/run.py: GroupShuffleSplit(test_size=0.1) on train FOVs; normalization stats,
  sampler, and per-epoch val-loss selection all from inner-train/inner-val.
- cellsighter/run.py: inner_val_ratio=0.1; selection on metadata["inner_val_loader"].
- READMEs: record the deviation from upstream selection protocol.
- test_maps_cellsighter_equivalence: drop the run.py byte-equivalence pin (logic now
  intentionally deviates from upstream) and replace with a behavioral inner-val check.

Consequence: this changes the published MAPS/CellSighter numbers (both now train
on ~90% of train cells) and requires a full re-run on the v10 archive to
regenerate the headline table.
Does not touch the abstention asymmetry (scoped out). CellSighter still selects by
macro-accuracy (separate finding, left unchanged).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CellSighter selected its best checkpoint by macro-accuracy while the main model
(scripts/train.py -> val_macro_f1) and the headline comparison use macro-F1. When
accuracy and F1 diverge (systematic in imbalanced multi-class settings) the
returned checkpoint was not the macro-F1-optimal one, depressing the reported
CellSighter macro-F1. Switch selection (on the held-out inner-val) to macro-F1;
update the saved checkpoint key and logging. Reported test metrics unchanged.
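Why the selection metric matters on imbalanced data: in the toy example below, one checkpoint wins on macro-accuracy and another wins on macro-F1. It assumes "macro-accuracy" means mean per-class accuracy (macro recall); the exact definition in CellSighter's code is not stated here. The helpers assume every class appears in `y_true`.

```python
def _counts(y_true, y_pred, n_classes):
    tp, fp, fn = [0] * n_classes, [0] * n_classes, [0] * n_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn


def macro_accuracy(y_true, y_pred, n_classes):
    """Mean per-class accuracy (= macro recall); ASSUMED definition."""
    tp, _, fn = _counts(y_true, y_pred, n_classes)
    return sum(tp[c] / (tp[c] + fn[c]) for c in range(n_classes)) / n_classes


def macro_f1(y_true, y_pred, n_classes):
    tp, fp, fn = _counts(y_true, y_pred, n_classes)
    return sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in range(n_classes)) / n_classes


y_true = [0] * 90 + [1] * 10
candidates = {
    # Over-predicts the rare class: perfect rare-class recall, poor precision.
    "epoch_a": [0] * 70 + [1] * 20 + [1] * 10,
    # Smaller, balanced errors on both classes.
    "epoch_b": [0] * 88 + [1] * 2 + [1] * 7 + [0] * 3,
}
best_by_acc = max(candidates, key=lambda k: macro_accuracy(y_true, candidates[k], 2))
best_by_f1 = max(candidates, key=lambda k: macro_f1(y_true, candidates[k], 2))
# epoch_a: macro-accuracy ~0.889, macro-F1 0.6875
# epoch_b: macro-accuracy ~0.839, macro-F1 ~0.855
```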

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…isjointness

Extract the inner-validation carve from create_dataloader into a
module-level _carve_inner_val_fovs helper (behavior-preserving; the
default no-op path returns train_indices unchanged) so the
leakage-critical FOV-grouping is unit-testable. Add regression tests
asserting whole-FOV grouping, train/inner-val disjointness, a clean
index partition, the no-op path, and the >=1-inner-train-FOV cap.
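A pure-Python sketch of a FOV-grouped inner-validation carve with the properties the new tests check (whole-FOV grouping, disjointness, clean partition, no-op at ratio 0, at least one inner-train FOV). The function name and signature are assumptions; the repo's `_carve_inner_val_fovs` may differ.

```python
import random


def carve_inner_val_fovs(train_indices, fov_of, ratio, seed=0):
    """Split train_indices into (inner_train, inner_val) by whole FOV."""
    fovs = sorted({fov_of[i] for i in train_indices})
    if ratio <= 0 or len(fovs) < 2:
        return list(train_indices), []  # no-op path
    # At least one val FOV, and always leave at least one inner-train FOV.
    n_val = min(max(1, round(ratio * len(fovs))), len(fovs) - 1)
    val_fovs = set(random.Random(seed).sample(fovs, n_val))
    inner_train = [i for i in train_indices if fov_of[i] not in val_fovs]
    inner_val = [i for i in train_indices if fov_of[i] in val_fovs]
    return inner_train, inner_val
```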

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t_strict knobs

Thread optional per-loader knobs through extract_patch, FullImageDataset, and
create_dataloader, each defaulting to current behavior so DCT/MAPS inputs stay
byte-identical:
- mask_intensities (default True): when False, return the full crop including
  neighbor intensities instead of raw*self_mask (single-cell input).
- crop_size/output_size (default dct_config): per-loader patch-size override.
- train_transform (default H/V flips): custom train-time spatial augmentation.
- split_strict (default True): downgrade split fingerprint mismatch to a warning
  when all split FOVs are present in the current archive.

Enables the faithful CellSighter baseline without altering the shared
single-cell path. Full suite: 313 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reimplement the CellSighter baseline to follow Amitay et al. (Nat Commun 2023)
training recipe inside our cross-tissue harness:
- ImageNet ResNet50 stem (7x7/s2 + maxpool) for 60x60 crops; --cifar_stem keeps
  the 32x32 CIFAR stem for ablation.
- Unmasked neighbor intensities (mask_intensities=False); --mask_self ablation
  restores the single-cell input.
- Geometric augmentation module (rotation, vectorized per-channel shift, mask
  dilation, flips@0.75). Poisson resampling omitted: preprocessed/raw is [0,1]
  min-max normalized, not photon counts, so Poisson would corrupt the signal.
- New flags: --crop_size (default 60), --seed (ensemble diversity),
  --test_split_file (final eval on a held-out split), --allow_split_mismatch.

Re-pin the cellsighter freeze snapshots as a drift guard (no longer an
upstream-identical port) and re-freeze its CLI option set. maps stays frozen.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n docs

b598710 (resMLP-head default) and ef1229f (checkpoint git_commit self-pinning)
are independent xuefei/master commits, not part of PR #41. Only d13fd54 is in
PR #41's lineage. Correct the three attribution claims in TRACEABILITY.md and
REPORT.md; numerical claims unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an explicit cross-file FOV disjointness check between --split_file
('train') and --test_split_file ('val'): load_fov_splits only checks
overlap within a single file, so a mismatched pair could silently leak
training FOVs into the reported number. Also warn loudly when
--test_split_file is omitted, since the final eval then reuses the
checkpoint-selection val loader (selection-on-the-eval-set, not a
held-out number). Re-pin the cellsighter run.py drift-guard SHA.
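A sketch of the cross-file disjointness check and the missing-test-split warning. The FOV collections are assumed to be already loaded from the two split files; `check_cross_file_disjoint` and `final_eval_fovs` are hypothetical names.

```python
import warnings


def check_cross_file_disjoint(train_fovs, test_fovs):
    """Raise if any FOV appears in both the training split and the test split."""
    overlap = set(train_fovs) & set(test_fovs)
    if overlap:
        sample = ", ".join(sorted(overlap)[:5])
        raise ValueError(
            f"{len(overlap)} FOV(s) are in both --split_file 'train' and "
            f"--test_split_file 'val' (e.g. {sample})"
        )


def final_eval_fovs(train_fovs, selection_val_fovs, test_fovs=None):
    """Pick the FOVs for the reported number, warning when it is not held out."""
    if test_fovs is None:
        warnings.warn(
            "no --test_split_file: the final evaluation reuses the checkpoint-"
            "selection val set, so the reported number is not held out",
            UserWarning,
            stacklevel=2,
        )
        return list(selection_val_fovs)
    check_cross_file_disjoint(train_fovs, test_fovs)
    return list(test_fovs)
```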

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + full-inv-freq)

The CellSighter baseline silently inherited DCT's sqrt-inverse-frequency
WeightedRandomSampler (1000-count floor) via create_dataloader's default,
not the original CellSighter's equal-proportion balancing (research-workspace
issue #96). Make the baseline genuinely faithful on the class-balancing axis.

Faithfully reproduce KerenLab/CellSighter's recipe:
- subsample_indices_per_class: caps the TRAIN pool to <=size_data cells/class
  (subsample_const_size; paper size_data=1000), deterministic per seed, val/test
  untouched.
- compute_sample_weights_equal: full-inverse-frequency weights weight=total/count
  (define_sampler with sample_batch=true).
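A sketch of the two balancing primitives: a per-class cap on the training pool and full-inverse-frequency weights. Names and signatures are hypothetical. With `weight = total / count`, every class contributes the same total weight (`count * total / count = total`), which is what makes sampling equal-proportion.

```python
import random
from collections import Counter, defaultdict


def subsample_per_class(indices, labels, size_data=1000, seed=0):
    """Cap the training pool at size_data cells per class, deterministically per seed."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    rng = random.Random(seed)
    kept = []
    for cls in sorted(by_class):  # fixed class order keeps the RNG stream reproducible
        members = by_class[cls]
        if len(members) > size_data:
            members = rng.sample(members, size_data)
        kept.extend(members)
    return sorted(kept)


def equal_proportion_weights(labels):
    """Full inverse-frequency weights: weight = total / count(class of sample)."""
    counts = Counter(labels)
    total = len(labels)
    return [total / counts[y] for y in labels]
```

The weights would then feed a `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)`; only the training pool is capped, leaving val/test untouched.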

Wire a `class_balance` {equal|sqrt|none} + `size_data` knob through
create_dataloader; the CellSighter baseline now defaults to the faithful
equal-proportion scheme (--class_balance equal --size_data 1000), with sqrt and
none as ablations. --no_weighted_sampler kept as a deprecated alias for
--class_balance none. Legacy use_weighted_sampler still honored when
class_balance is None (main DCT model and other callers unaffected).

Docs: README documents the now-faithful default + remaining hierarchy_match
deviation. Tests: 6 new unit tests for the weight law + size_data cap; updated
the cellsighter option-freeze set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A three-round faithfulness audit of the baselines against their original
papers/codebases found the baseline READMEs accurately documented architecture
and training mechanics but omitted several data-pipeline deviations. This adds
the verified, source-cited disclosures:

- CellSighter: neighbor-intensity self-mask (training/patch.py:176) vs upstream's
  raw neighbor intensities; [0,1] min-max normalization vs upstream raw counts;
  flips-only augmentation vs upstream's seven; 32x32 context vs 60px; sqrt- vs
  full-inverse-frequency sampler. Most are shared DeepCell Types preprocessing
  (fairness-neutral across models) but still deviate from how upstream
  CellSighter was trained. Self-mask impact was empirically tested
  (feat/faithful-cellsighter): the ranking did not change.
- Nimbus: prediction resize uses INTER_LINEAR vs upstream 0.0.5 INTER_NEAREST;
  mpp-based rescale vs magnification-ratio. Verified against the installed
  nimbus-inference==0.0.5 wheel; core primitives (sigmoid, prepare_binary_mask,
  cross-FOV normalization) confirmed faithful.
- XGBoost: no cellSize feature and no class balancing (conservative vs the
  neural baselines); tuning budget not matched (Optuna only for XGBoost).

Documentation only; no code or behavior changes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Remove pre-existing F401 (pandas, typing.Dict/Any, torch.nn.functional)
and F541 (placeholder-less f-string) in run.py, surfaced by ruff --fix
during the baseline integration. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	README.md
#	tests/test_preprocess_hook.py
# Conflicts:
#	README.md
…sole balancer

The DCT backbone training now uses the WeightedRandomSampler
(compute_sample_weights in dataset.py) as the SOLE rare-class balancer.
The redundant per-class FocalLoss alpha weighting is removed entirely
(cleaner than an interlock), making double-weighting structurally
impossible. The focal term (gamma, via --focal_gamma) is kept unchanged.

Concretely:
- Delete the compute_class_weights() helper in scripts/train.py (its only
  caller; MAPS keeps its own separate compute_class_weights in
  baselines/maps/run.py, untouched).
- Delete the call site plus the plumbing that only fed it (the dataset-layout
  isinstance checks + train_dataset_ref/train_indices unwrapping, and the
  now-unused AugmentedDataset/FullImageDataset imports). label_remap is
  retained; it is used elsewhere.
- FocalLoss alpha is now hard-coded to None for backbone training.
- Remove the --no_class_weights Click flag, its main() parameter, and its
  "no_class_weights" key from the checkpoint config dict. Removing the config
  key is safe: extra/missing config keys do not break checkpoint loading.
- Update the LossesAndMetrics double-weighting warning in metrics.py to drop
  the stale --no_class_weights reference.
- CHANGELOG: note the schema change and the default change.
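A sketch of per-sample focal loss showing what dropping `alpha` means. With `alpha=None` every class is scaled by 1, so rare-class emphasis comes only from the sampler, while `gamma` still down-weights easy examples. The function and its `gamma=2.0` default are illustrative, not the repo's `FocalLoss`.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[target]  # predicted probability of the true class
    alpha_t = 1.0 if alpha is None else alpha[target]  # None: no per-class weighting
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=None` this reduces to plain cross-entropy, `-log(p_t)`.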

This makes the no-flags default reproduce the released-checkpoint recipe,
which was trained with --no_class_weights. It CHANGES scripts/train.py's
no-flag default versus v0.1.0 development builds (which applied class weights
by default); the released checkpoint and the stage-2 head retrain
(scripts/retrain_head.py, plain CrossEntropyLoss) are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add model version 2026-06-15 (deepcell-types_2026-06-15_resmlp.pt, md5
704616a1...) and set it as _latest, so download_model() / predict.py default
to the two-stage residual-MLP head-retrain model (80.27 hier macro-F1 on the
held-out 129-FOV test split, vs tuned XGBoost 79.03 and the prior Frozen-CLS
74.20). The resMLP head is auto-detected by predict._build_model via
ct_head.inp.0.weight, so no caller change is needed. The prior Frozen-CLS
release (2026-05-17) is retained in the registry for reproducibility.
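A sketch of key-based head auto-detection. Only the key `ct_head.inp.0.weight` comes from the release note; the function name, the 'default' label, and the optional `module.` prefix stripping are hypothetical.

```python
RESMLP_KEY = "ct_head.inp.0.weight"  # key named in the release note


def detect_head_type(state_dict_keys):
    """Return 'resmlp' if the residual-MLP head's first weight is present."""
    # Hypothetical normalization: drop a DataParallel-style "module." prefix.
    keys = {k[len("module."):] if k.startswith("module.") else k
            for k in state_dict_keys}
    return "resmlp" if RESMLP_KEY in keys else "default"
```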

NOTE: the asset deepcell-types_2026-06-15_resmlp.pt must be uploaded to
users.deepcell.org/models/ before this pin resolves for end users.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…add headline number

The pr-31 archive-free-README merge took its README wholesale (--theirs),
which reverted master's richer Training section (retrain_head.py stage-2 recipe,
evaluate_on_test.sh, the leakage-free-test-split headline sentence) because
pr-31 branched before that landed. Restore it as a union (pr-31's archive-free
inference sections + master's Training section) and state the headline number:
two-stage resMLP 80.27 hierarchical macro-F1 vs tuned XGBoost 79.03.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Track A: infra fixes + drop-class-weighting recipe (consolidates #43,#34,#38,#39,#31,#30,#41)
Promote two-stage resMLP to the headline released model
Recipe-ablation review docs with corrected commit attributions (was #44)
# Conflicts:
#	deepcell_types/training/dataloader.py
Track B: faithful CellSighter + leakage-free baseline selection (consolidates #35,#33,#42,#37)