1278 recursive databricks and spark checks and custom queries #1311

Draft
rob-h-w wants to merge 4 commits into datacontract:main from rob-h-w:1278-recursive-databricks-and-spark-checks

rob-h-w commented Jun 16, 2026

Supports recursive Databricks and Spark checks and custom queries. Replaces the ibis Databricks connection implementation, which requires write permissions on connect, until a new ibis release stops doing that.

  • Tests pass (uv run pytest)
  • Code formatted (uv run ruff check --fix && uv run ruff format)
  • [-] README.md updated (if relevant)
  • CHANGELOG.md entry added

I don't see a README.md update that's in scope.

rob-h-w added 4 commits June 10, 2026 18:16
To the Contribution > Troubleshooting section, because I ran into that issue after checking out.
This is the beginning of an attempt to support ODCS' nested constraint and
quality definition capabilities in `test` subcommands, where the underlying
technology supports it. See [the relevant
issue](datacontract#1278).

Add recursive traversal of nested struct and array-of-struct fields when
generating ibis quality checks, scoped to verified backends only.

Check generation (create_checks.py):
- Add _iter_property_paths() to recursively yield (model, field_path, prop,
  is_nested) tuples for nested struct fields and array item models
- Struct recursion enabled for dataframe and databricks; array recursion for
  dataframe only
- Nested SQL quality checks emit MetricType.UNSUPPORTED with a warning preset
  on all other backends
- Use get_server_type() instead of server.type so imported DCS contracts with
  type="custom" are resolved correctly
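
A minimal sketch of the recursive traversal described above, under stated assumptions: properties are plain dicts with `type`, `properties`, and `items` keys, and the two gating sets, the function name, and the `{model}__{field}` naming of array item models are illustrative stand-ins rather than the actual code in `create_checks.py`.

```python
from typing import Iterator

# Hypothetical gating sets mirroring the bullets above (state at this commit).
STRUCT_SERVER_TYPES = {"dataframe", "databricks"}
ARRAY_SERVER_TYPES = {"dataframe"}


def iter_property_paths(
    model: str, props: dict, server_type: str, prefix: str = "", nested: bool = False
) -> Iterator[tuple]:
    """Yield (model, field_path, prop, is_nested) for every reachable property."""
    for name, prop in props.items():
        path = f"{prefix}.{name}" if prefix else name
        yield model, path, prop, nested
        kind = prop.get("type")
        if kind == "object" and server_type in STRUCT_SERVER_TYPES:
            # Struct fields keep the parent model and extend the dotted path.
            yield from iter_property_paths(
                model, prop.get("properties", {}), server_type, path, True
            )
        elif kind == "array" and server_type in ARRAY_SERVER_TYPES:
            items = prop.get("items", {})
            if items.get("type") == "object":
                # Assumption: array items become their own model, {model}__{field}.
                yield from iter_property_paths(
                    f"{model}__{name}", items.get("properties", {}), server_type, "", True
                )
```

On a backend outside both sets, the generator yields only top-level properties, which is where a caller could emit the unsupported-metric warning instead of a check.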

Check execution (ibis_check_execute.py):
- Add _resolve_expr() / _resolve_nested_expr() for dotted-path ibis expressions
- Add _resolve_dtype() / _field_present() for nested schema introspection
- Update _run_present() to reuse the already-resolved model schema rather than
  re-fetching the table, fixing a case-sensitivity failure on Oracle
- Update _run_type(), _run_freshness(), _run_duplicate(), _missing_expr(),
  _valid_expr(), _invalid_expr(), _samples_for() to accept resolved expressions
  instead of bare column names
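
To illustrate the nested schema introspection, here is a rough sketch that resolves a dotted path against an already-fetched schema. It assumes the schema is a plain nested dict (a stand-in for the ibis schema object), and the function names are hypothetical analogues of `_resolve_dtype()` and `_field_present()`, not their real implementations.

```python
from typing import Any, Optional


def resolve_dtype(schema: dict, field_path: str) -> Optional[Any]:
    """Walk a dotted path through a nested schema; return None if any part is missing."""
    node: Any = schema
    for part in field_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def field_present(schema: dict, field_path: str) -> bool:
    """Presence check that reuses the resolved schema instead of re-fetching the table."""
    return resolve_dtype(schema, field_path) is not None
```

Because both helpers work on the schema object the caller already holds, a presence check needs no second table lookup.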

Spark temp view materialisation (kafka.py, connect.py):
- Add add_spark_nested_views() to create {model}__{field} Spark temp views for
  nested struct fields and exploded array-of-struct items
- Call add_spark_nested_views_for_contract() in the dataframe and
  Databricks-via-Spark connection paths before creating the ibis pyspark backend
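
The `{model}__{field}` view naming can be sketched as SQL text generation. This is an assumption-laden illustration: the helper name is hypothetical, it handles top-level fields only, and the actual `add_spark_nested_views()` may build the views through the DataFrame API rather than SQL strings.

```python
def nested_view_sql(model: str, field: str, is_array: bool) -> str:
    """Build Spark SQL that exposes a nested field as a {model}__{field} temp view."""
    view = f"{model}__{field}"
    if is_array:
        # One output row per array element; struct members become columns.
        select = f"SELECT item.* FROM {model} LATERAL VIEW explode({field}) exploded AS item"
    else:
        # Struct members become top-level columns of the view.
        select = f"SELECT {field}.* FROM {model}"
    return f"CREATE OR REPLACE TEMPORARY VIEW {view} AS {select}"
```

With a live session, each statement could be passed to `spark.sql(...)` before the ibis pyspark backend is created.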

Tests:
- tests/fixtures/dataframe/datacontract_nested.yaml: nested struct + array fixture
- tests/test_create_checks_nested.py: unit tests for recursive generation and
  backend gating
- tests/test_ibis_check_execute.py: regression tests for Oracle-style presence
  check without extra table lookup
- tests/test_test_dataframe.py: Spark integration pass/fail for nested struct,
  nested SQL quality, and array-item checks
- tests/test_test_databricks.py: unit test confirming nested struct SQL enabled
  and array recursion suppressed for Databricks

Created with Claude Sonnet 4.6.
…atacontract#1278)

Implement support for recursive nested struct and array checks on Databricks
that require only SELECT permission. This enables data contract validation on
read-only SQL warehouses without requiring CREATE VOLUME or CREATE TABLE
permissions.

Changes:

- New module `databricks_nested_models.py`: CTE-based virtual model generation
  for array item checks. Uses `LATERAL VIEW OUTER explode_outer()` to expose
  nested array elements as queryable tables without creating real volumes.

- Modified `_connect_databricks()` in `connect.py`: Introduced `_NoVolumeBackend`
  subclass that overrides `_post_connect()` with a no-op, bypassing ibis' default
  `CREATE VOLUME IF NOT EXISTS` call. Connection succeeds on read-only warehouses.

- Updated `connect_ibis()` Databricks branch: Builds and attaches virtual model
  CTE queries to the backend connection for downstream table resolution.

- Enabled array recursion for Databricks: Added "databricks" to
  `_SUPPORTED_NESTED_ARRAY_SERVER_TYPES` in `create_checks.py`, reaching
  feature parity with the dataframe backend.

- Enhanced `_resolve_table()` in `ibis_check_execute.py`: Falls back to virtual
  model CTE queries before attempting list_tables(), allowing nested array
  models (e.g., `orders__items`) to resolve via pre-built WITH clauses.

- Test updates: Rewrote Databricks auth tests to patch the correct backend method,
  added `test_no_create_volume_on_connect` to verify volume creation is skipped,
  flipped nested array expectations to enable checks on array items.

- New test file `test_connect_databricks_virtual_models.py`: Unit tests for CTE
  query generation and schema filtering logic.
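
A hedged sketch of the virtual-model idea from the list above: build a CTE query per array field, then consult those queries before listing real tables. The helper names, the tuple return shape, and the `list_tables` callable are illustrative assumptions, not the contents of `databricks_nested_models.py` or `_resolve_table()`.

```python
from typing import Callable


def array_item_cte(model: str, field: str) -> tuple:
    """Return (virtual model name, SELECT-only query) for an array-of-struct field."""
    name = f"{model}__{field}"
    query = (
        f"WITH {name} AS ("
        f"SELECT item.* FROM {model} "
        f"LATERAL VIEW OUTER explode_outer({field}) exploded AS item"
        f") SELECT * FROM {name}"
    )
    return name, query


def resolve_table(name: str, virtual_models: dict, list_tables: Callable[[], list]) -> tuple:
    """Prefer a pre-built CTE query; fall back to the catalog only when needed."""
    if name in virtual_models:
        return "sql", virtual_models[name]
    if name in list_tables():
        return "table", name
    raise KeyError(f"unknown model: {name}")
```

Checking the virtual models first means a nested model such as `orders__items` never triggers a catalog lookup, and the query itself creates nothing on the warehouse.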

Result: 52 real-world data contract checks now pass against Databricks without
any CREATE/WRITE operations. Recursive struct checks (dotted paths) and recursive
array item checks (CTE virtual models) are both fully supported.
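
The no-op override behind the read-only connection can be shown with a stand-in base class. `FakeBackend` below is hypothetical and only imitates a backend whose post-connect hook issues a write statement; it is not ibis' Databricks backend, and the recorded statement text is a placeholder.

```python
class FakeBackend:
    """Stand-in backend: records statements it would run against the warehouse."""

    def __init__(self) -> None:
        self.executed = []

    def connect(self) -> "FakeBackend":
        self._post_connect()
        return self

    def _post_connect(self) -> None:
        # Imitates a default hook that needs write permission.
        self.executed.append("CREATE VOLUME IF NOT EXISTS <volume>")


class NoVolumeBackend(FakeBackend):
    """Subclass that skips the write-requiring hook entirely."""

    def _post_connect(self) -> None:
        return None
```

Overriding only the hook keeps the rest of the connection logic untouched, which makes the workaround easy to drop once the upstream behaviour changes.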
Per the PR template.