Skip to content

refactor: remove verified dead code#791

Open
nabinchha wants to merge 2 commits into
mainfrom
codex/fix-789-dead-code-cleanup
Open

refactor: remove verified dead code#791
nabinchha wants to merge 2 commits into
mainfrom
codex/fix-789-dead-code-cleanup

Conversation

@nabinchha

@nabinchha nabinchha commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

📋 Summary

Remove production code that a triple review confirmed is unreachable or used only by tests. The review covers static references, runtime registration and callback paths, repository history, extension surfaces, and open-PR dependencies; capacity planning and completion diagnostics used by #745 and #743 are explicitly retained.

🔗 Related Issue

Closes #789

🔄 Changes

✨ Added

  • Move reusable admission-event test fixtures into the intentional data_designer.engine.testing namespace.

🔧 Changed

🗑️ Removed

  • Remove unused DataFrame/config I/O helpers, CLI wrappers, builder helpers, parser utilities, error types, and private methods.
  • Remove sync-engine leftovers from async concurrency and skip-metadata tracking.
  • Remove tests that exclusively exercised deleted internals while retaining tests of live behavior.

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the API-removal boundaries established by the triple review:

  • io_helpers.py — removes two unexported, undocumented helpers with no production callers.
  • async_concurrency.py — removes the obsolete executor while retaining the live shared event-loop lifecycle.
  • skip_tracker.py — removes the orphaned sync-engine round-trip restoration subgraph; no open PR references it.

🧪 Testing

  • make test passes — not run directly; the complete package test suites were run via pytest.
  • Full package suites: 3,735 passed, 1 skipped.
  • Focused capacity/completion/scheduler suite: 133 passed.
  • Ruff lint, Ruff format, pre-commit hooks, and git diff --check pass.
  • Unit tests updated.
  • E2E tests: N/A — no end-to-end behavior changed.

✅ Checklist

  • Follows commit message conventions.
  • Commits are signed off (DCO).
  • Architecture docs: N/A — this removes internal dead code without changing supported architecture.

Description updated with AI

Remove unreachable helpers and production code exercised only by tests after a triple review of static references, runtime registration paths, and extension surfaces. Keep reusable observability fixtures in the explicit engine.testing namespace.

Closes #789

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
@nabinchha nabinchha requested a review from a team as a code owner July 1, 2026 20:52
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Fern preview: https://nvidia-preview-pr-791.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR removes verified dead code across three packages (data-designer-config, data-designer-engine, data-designer) and relocates two test/diagnostic helpers (InMemoryAdmissionEventSink, CorrelatedRuntimeView) from the production observability module into the intentional data_designer.engine.testing namespace.

  • Removals: Unused DataFrame/config I/O helpers, AsyncConcurrentExecutor (full class), TaskResult, RealizePydanticTypes, SkipMetadataRestoreContext/round-trip helpers, _validate_cell_output, DatasetBatchManagementError, UnknownModelAliasError, and several other private methods/constants that had no remaining callers.
  • Test updates: Tests exercising deleted internals are dropped; tests that used production-only seams (set_processor_runner, verify_entry_point, etc.) are updated to either call live APIs or use local helpers that manipulate internal state directly.

Confidence Score: 5/5

Safe to merge — all removed symbols are confirmed unreachable in production paths, every import site across the codebase has been updated, and the 3,724-test suite passes.

Static cross-references for every removed symbol were verified to be clean: no remaining callers of AsyncConcurrentExecutor, _validate_cell_output, TaskResult, RealizePydanticTypes, SkipMetadataRestoreContext, UnknownModelAliasError, or any of the removed private methods. The two observability helpers are cleanly re-exported from their new testing namespace and all import sites in tests have been updated consistently.

No files require special attention. PydanticTypeBlock in types.py is now without a producer since RealizePydanticTypes was removed, but it may be intentionally retained as an extension point for external plugins.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/async_concurrency.py Removes entire AsyncConcurrentExecutor class and its supporting types (Success, Failure, TaskResult); retains only the process-wide ensure_async_engine_loop() singleton used by the live async engine.
packages/data-designer-engine/src/data_designer/engine/observability.py Removes InMemoryAdmissionEventSink and CorrelatedRuntimeView from the production module; these are cleanly re-exported from the new testing namespace.
packages/data-designer-engine/src/data_designer/engine/testing/observability.py New file — homes InMemoryAdmissionEventSink and CorrelatedRuntimeView in the intentional testing namespace; implementation is identical to what was removed from observability.py.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/skip_tracker.py Removes SkipMetadataRestoreContext, SKIP_METADATA_RESTORE_ID_COLUMN_PREFIX, prepare_records_for_skip_metadata_round_trip, restore_skip_metadata, and _choose_restore_id_column — all confirmed dead by grep.
packages/data-designer-engine/src/data_designer/engine/models/parsers/postprocessors.py Removes RealizePydanticTypes postprocessor class; PydanticTypeBlock type definition remains in types.py but is now without a producer.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Removes set_processor_runner() and single_column_config_by_name cached_property; tests are updated to use a local _replace_processors helper that accesses _processor_runner directly.
packages/data-designer-config/src/data_designer/config/utils/io_helpers.py Removes ensure_config_dir_exists() and smart_load_dataframe() helpers plus the now-unused PurePosixPath import; both functions had no production callers.
packages/data-designer/src/data_designer/cli/repositories/persona_repository.py Removes get_dataset_name() and get_dataset_prefix() convenience methods; the test that depended on get_dataset_prefix() is updated to use an inline constant.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph removed["Removed from Production"]
        ACE["AsyncConcurrentExecutor\n(async_concurrency.py)"]
        TR["TaskResult dataclass\n(task_model.py)"]
        RPT["RealizePydanticTypes\n(postprocessors.py)"]
        SMR["SkipMetadata round-trip helpers\n(skip_tracker.py)"]
        IH["ensure_config_dir_exists /\nsmart_load_dataframe\n(io_helpers.py)"]
        UMA["UnknownModelAliasError\n(errors.py)"]
        VCO["_validate_cell_output\n(custom.py)"]
    end

    subgraph moved["Moved to Testing Namespace"]
        IMAS_old["observability.InMemoryAdmissionEventSink"]
        CRV_old["observability.CorrelatedRuntimeView"]
    end

    subgraph testing["data_designer.engine.testing"]
        IMAS_new["testing.InMemoryAdmissionEventSink"]
        CRV_new["testing.CorrelatedRuntimeView"]
    end

    subgraph kept["Retained"]
        EAEL["ensure_async_engine_loop()\n(async_concurrency.py)"]
    end

    IMAS_old --> IMAS_new
    CRV_old --> CRV_new
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    subgraph removed["Removed from Production"]
        ACE["AsyncConcurrentExecutor\n(async_concurrency.py)"]
        TR["TaskResult dataclass\n(task_model.py)"]
        RPT["RealizePydanticTypes\n(postprocessors.py)"]
        SMR["SkipMetadata round-trip helpers\n(skip_tracker.py)"]
        IH["ensure_config_dir_exists /\nsmart_load_dataframe\n(io_helpers.py)"]
        UMA["UnknownModelAliasError\n(errors.py)"]
        VCO["_validate_cell_output\n(custom.py)"]
    end

    subgraph moved["Moved to Testing Namespace"]
        IMAS_old["observability.InMemoryAdmissionEventSink"]
        CRV_old["observability.CorrelatedRuntimeView"]
    end

    subgraph testing["data_designer.engine.testing"]
        IMAS_new["testing.InMemoryAdmissionEventSink"]
        CRV_new["testing.CorrelatedRuntimeView"]
    end

    subgraph kept["Retained"]
        EAEL["ensure_async_engine_loop()\n(async_concurrency.py)"]
    end

    IMAS_old --> IMAS_new
    CRV_old --> CRV_new
Loading

Reviews (2): Last reviewed commit: "fix: preserve active capacity diagnostic..." | Re-trigger Greptile

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Code Review: PR #791refactor: remove verified dead code

Author: nabinchha · Base: main · +108 / −2,158 across 47 files · Closes #789

Summary

This PR removes production code that a "triple review" concluded is unreachable
or exercised only by tests. It spans all three packages (config / engine /
interface-CLI) and touches two high-connectivity abstractions
(AsyncTaskScheduler, ExecutionGraph). Notable moving parts:

  • Deletes 3 modulesengine/capacity.py, plus the test-only
    test_capacity.py and test_async_concurrency.py.
  • Slims async_scheduler.py — removes the entire capacity_plan() reporting
    path and all _observed_max_* accounting state / _record_observed_task_state().
  • Slims async_concurrency.py — deletes AsyncConcurrentExecutor and its
    result dataclasses, keeping only the shared ensure_async_engine_loop()
    lifecycle.
  • Relocates test seamsInMemoryAdmissionEventSink and
    CorrelatedRuntimeView move from the shipped engine/observability.py into the
    intentional engine/testing/ namespace (good: removes test-only types from the
    production surface).
  • Removes assorted private helpers / unused error types / CLI helpers and the
    RealizePydanticTypes postprocessor.

The change is well-scoped for a cleanup: deletions are matched by their test
deletions, and the PR body documents the risky boundaries. My independent static
verification confirms no production callers remain for the removed symbols
(smart_load_dataframe, capacity_plan/AsyncCapacityPlan, AsyncConcurrentExecutor,
set_processor_runner, is_all_complete, get_ready_tasks, seed_frontier,
compute_cell_dependencies, to_mermaid, extract_message_from_exception_string,
_validate_tool_configs_against_providers, UnknownModelAliasError,
ensure_config_dir_exists, update_cells, DatasetBatchManagementError,
single_column_config_by_name, etc.). Removed private methods
(_record_observed_task_state, _enqueue_ready_task, _increment_waiter,
_decrement_waiter, _validate_cell_output, _ensure_images_directory) have no
residual call sites.

Findings

1. Orphaned PydanticTypeBlock left behind — MEDIUM

The PR removes RealizePydanticTypes (the only producer of PydanticTypeBlock)
from postprocessors.py, and drops PydanticTypeBlock from every import and the
parser tests. But the class itself survives in
engine/models/parsers/types.py:82:

class PydanticTypeBlock(BaseModel):
    serialized: str
    obj: BaseModel

After this PR nothing constructs it and nothing imports it (grep finds only the
definition). This is exactly the category of dead code the PR set out to remove —
it looks like an incomplete deletion. Either remove PydanticTypeBlock too, or if
it is intentionally retained as a public/plugin-facing block type, say so in the PR
description so a future cleanup doesn't flag it again. Recommend removing it for
consistency with the PR's stated goal.

2. capacity.py / AsyncCapacityPlan are a documented deliverable in plans/645/ — MEDIUM (verify intent)

AsyncCapacityPlan is not just an internal symbol — it is specified as a
required contract throughout the active plans/645/ async-scheduling epic
(last touched 2026-05-20, feat: implement async scheduling admission control #661).
For example:

  • plans/645/contracts.md: lists AsyncCapacityPlan as a "diagnostic/reporting
    DTO, emitted to explain a run" owned by data-designer-engine.
  • plans/645/benchmark-plan.md / capacity-model.md: "Benchmarks and traces
    must include AsyncCapacityPlan plus per-layer observed maxima."
  • plans/645/architecture.md / module-ownership.md: list it as the
    user/operator-facing capacity explanation.

So this is arguably not dead code but an unwired, planned feature — the
reporting surface for an epic that is still in flight. Removing it may be correct
(the epic may have dropped this deliverable), but the PR neither references that
decision nor updates the plans. Two concerns:

  • Backward-compat / blast radius: the structural analysis ranks
    AsyncTaskScheduler as the feat: support IndexRange and PartitionBlock seed selection strategy #8 most-connected node (132 deps). capacity_plan()
    is a public method on it. If any out-of-tree benchmark/diagnostic tooling calls
    it (the plans imply benchmarks do), this is a silent break. Worth a sentence in
    the PR confirming no benchmark harness consumes it.
  • Doc drift: plans/645/* still mandate AsyncCapacityPlan. Please either
    update those plans (mark the deliverable dropped) or confirm they're superseded,
    so the plan and the code don't contradict each other. This is the one place the
    "triple review covered repository history" claim is hard to take on faith.

3. Behavioral shift in the parser pipeline — LOW (correct, but worth flagging)

Removing RealizePydanticTypes changes observable parse output: JSON blocks that
previously became PydanticTypeBlock (validated Pydantic instances) now stay as
StructuredDataBlock (raw dicts). The updated test_parser.py asserts exactly
this (result.parsed[2].obj == {"baz": 3} instead of Foo(baz=3)). Since nothing
in production constructed the parser with RealizePydanticTypes, this is dead in
practice — but it is a semantic change to a public-ish parser, not a pure no-op
deletion. Fine to proceed; just calling it out as more than "delete unused code."

4. Test-quality regression risk from the scheduler-accounting deletions — LOW

test_async_scheduler.py loses test_scheduler_dispatch_does_not_scan_ready_frontier,
which was a genuine regression guard (it asserted the scheduler applies frontier
deltas rather than O(n)-scanning via get_ready_tasks). Its removal is justified
because get_ready_tasks itself is deleted, but the underlying performance
invariant ("dispatch must not full-scan") is now untested. Consider whether a
lighter-weight assertion on the delta-driven path should remain. Not blocking.

Style / conventions

  • New file engine/testing/observability.py has the SPDX header,
    from __future__ import annotations, and modern typing — consistent with the
    repo. __all__ in engine/testing/__init__.py is updated correctly (though note
    it mixes string literals "CorrelatedRuntimeView" with the existing
    LineFanoutDirectorySeedReader.__name__ idiom — pre-existing, not introduced here).
  • io_helpers.py: dropping PurePosixPath from the import is correct;
    urlparse and _maybe_rewrite_url remain in use, so no over-pruning there.
  • Import-direction: the structural analysis flags 39 "config → engine" violations,
    but these are pre-existing graph artifacts (the flagged .set() is
    observability/logging), not introduced by this PR. No new reverse imports are
    added — the changes only delete imports.
  • The set_processor_runner production method is replaced by a local
    _replace_processors test helper poking builder._processor_runner directly.
    That's a reasonable trade (removes a production method that existed only for
    tests) but does reach into a private attribute; acceptable for a test helper.

Recommendations before merge

  1. Remove the now-orphaned PydanticTypeBlock from parsers/types.py (Finding 1),
    or document why it stays.
  2. Reconcile the AsyncCapacityPlan removal with plans/645/* — update the plans
    or add a PR note confirming the deliverable was dropped and no benchmark tooling
    depends on capacity_plan() (Finding 2).
  3. Confirm the "full package suites: 3,724 passed" run was against this exact diff
    (I could not execute tests in this environment — the package is not importable in
    the review shell — so test verification rests on the author's stated results).

Structural Impact

Structural Impact (graphify, 2.2s)

Risk: HIGH (2 core abstraction(s) modified; 39 import direction violation(s))

  • 47 Python files, 637 AST entities, 15/82 clusters

  • 3 Python file(s) deleted (impact not fully analyzable)

Import Direction Violations (39)

Legal direction: interface -> engine -> config

  • ._is_missing_value() (config) --calls--> .set() (engine)
  • column_types() (config) --calls--> .set() (engine)
  • generate_analysis_report() (config) --calls--> .set() (engine)
  • _validate_skip_scope() (config) --calls--> .set() (engine)
  • required_columns() (config) --calls--> .set() (engine)
  • +34 more

Core Abstractions Modified

High-Connectivity Changes

  • AsyncTaskScheduler (132 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
  • .set() (112 deps) in packages/data-designer-engine/src/data_designer/engine/observability.py
  • ExecutionGraph (105 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/execution_graph.py
  • CompletionTracker (91 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/completion.py
  • Task (89 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_model.py
  • ProviderError (83 deps) in packages/data-designer-engine/src/data_designer/engine/models/clients/errors.py
  • AdaptiveRequestAdmissionController (81 deps) in packages/data-designer-engine/src/data_designer/engine/models/request_admission/controller.py
  • RowGroupBufferManager (78 deps) in packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py
  • +355 more

Cross-Package Dependencies

  • mcp_command() (interface) --calls--> .run() (engine)
  • models_command() (interface) --calls--> .run() (engine)
  • providers_command() (interface) --calls--> .run() (engine)
  • tools_command() (interface) --calls--> .run() (engine)
  • .run_create() (interface) --calls--> create() (engine)
  • .__init__() (interface) --calls--> .set() (engine)
  • +561 more

Reviewer note on structural risk: The HIGH rating is driven by blast radius on
AsyncTaskScheduler/ExecutionGraph and by 3 whole-file deletions. In practice
the edits to those god-nodes are subtractive (removing an unused public method and
private accounting state), so the runtime blast radius is smaller than the raw
connectivity suggests — provided no external consumer calls capacity_plan()
(see Finding 2). The 39 import-direction violations are pre-existing and not
introduced by this PR.

Verdict

Approve with minor changes requested. The core deletions are correct and
independently verified to have no in-tree production callers, and the test-seam
relocation into engine/testing/ is a genuine improvement to the production
surface. Two loose ends should be tightened before merge: the orphaned
PydanticTypeBlock (Finding 1) and the AsyncCapacityPlan vs. plans/645/
contradiction (Finding 2) — the latter mostly needs confirmation that this is a
deliberately dropped deliverable and not the accidental removal of an in-flight
feature's reporting surface. Neither is a correctness bug in shipping code.

Note: automated tests were not executed during this review (the package is not
importable in the CI review shell); functional verification relies on the author's
reported suite results.

Restore the capacity-plan module, scheduler observation state, completion APIs, and their tests because open PRs #745 and #743 actively extend or assert these artifacts.

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove triple-verified dead and test-only internal code

1 participant