Skip to content

Add namespace-scoped collections with IVF vector indexing for agent database#240

Open
JingYuChe wants to merge 82 commits into
oceanbase:developfrom
JingYuChe:develop
Open

Add namespace-scoped collections with IVF vector indexing for agent database#240
JingYuChe wants to merge 82 commits into
oceanbase:developfrom
JingYuChe:develop

Conversation

@JingYuChe

@JingYuChe JingYuChe commented Jun 22, 2026

Copy link
Copy Markdown

Summary

Solution Description

Summary by CodeRabbit

  • New Features
    • Added namespace-enabled collections with namespace-scoped CRUD, querying, and prewarm, including strict per-namespace isolation and use_namespace routing controls.
    • Introduced IVF dense vector indexing via IVFConfiguration (IVFIndexType/IVFIndexLib) and extended schema/config to support IVF.
  • Bug Fixes
    • Improved remote connection resilience during catalog initialization and added friendlier namespace kernel error messages.
  • Documentation
    • Expanded query() include to support distances and broadened where_document operator docs (including $regex).
  • Tests
    • Significantly expanded namespace + IVF + filtering/hybrid-search operator coverage, including lifecycle and concurrency scenarios.

JingYuChe and others added 30 commits April 28, 2026 21:11
namespace调用oceanbase的内核DBMS_LOGIC_TABLE包,内核自行管理元数据表
…must

OceanBase rejects a bool with only must_not when it is nested under must.
Combine $not_contains with metadata filters via outer filter + must_not instead.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/pyseekdb/client/client_base.py (2)

5490-5500: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle metadata-only namespace upserts on new IDs.

_namespace_add() still rejects payloads that only have metadata, so any upsert() for a brand-new id with metadata-only data will fail here even though the update path already supports that shape.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 5490 - 5500, The
_namespace_add() method currently rejects payloads that contain only metadata
without embeddings or documents, causing metadata-only upserts on new IDs to
fail. Update the _namespace_add() method to support metadata-only payloads by
allowing cases where embeddings and documents are None but metadata is present,
similar to how the update path already handles this scenario. Ensure the method
can process the add_metas, add_docs, and add_embs parameters correctly when some
of these are None values.

2394-2408: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the namespace recovery path here.

has_collection() treats a catalog-only namespace collection as existing, so this early return hands back an incomplete handle instead of letting create_collection(..., use_namespace=True) rebuild the missing physical tables.

♻️ Suggested fix
-        if self.has_collection(name):
-            return self.get_collection(name, embedding_function=embedding_function)
+        if self.has_collection(name):
+            if use_namespace and self._is_incomplete_ns_collection(name):
+                return self.create_collection(
+                    name=name,
+                    schema=schema,
+                    configuration=configuration,
+                    embedding_function=embedding_function,
+                    use_namespace=use_namespace,
+                    **kwargs,
+                )
+            return self.get_collection(name, embedding_function=embedding_function)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 2394 - 2408, The
get_or_create_collection method has an early return that checks if a collection
exists and returns it immediately, but this bypasses the namespace recovery
mechanism. When use_namespace is True, the method should allow create_collection
to run and rebuild any missing physical tables in the namespace, rather than
returning an incomplete collection handle from the early has_collection check.
Either condition the early return to skip when use_namespace is True, or remove
the early return entirely so that the create_collection call with use_namespace
parameter can properly rebuild catalog-only collections that are missing their
physical tables.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration_tests/test_namespace_lifecycle_concurrency.py`:
- Around line 133-136: The test currently only verifies that drop_thread is
still alive, which proves the injected slow drop is paused but not that
has_thread is blocked on the lifecycle lock. Add an assertion to check that
has_thread is alive before calling drop_can_finish.set() to definitively prove
that checker.has_namespace() is blocked waiting for the lifecycle lock to be
released.

In `@tests/integration_tests/test_namespace_upsert_concurrency.py`:
- Around line 48-71: The thread synchronization calls barrier.wait() and
thread.join() lack timeouts, which can cause the test to hang indefinitely if
any thread stalls or fails to reach the barrier. Add a timeout parameter to the
barrier.wait() call inside the _upsert_once function to prevent indefinite
waiting at the synchronization point, and also add a timeout parameter to each
thread.join() call in the loop that waits for all threads to complete. This
ensures the test will fail cleanly with a timeout error instead of hanging the
entire test suite.

---

Outside diff comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 5490-5500: The _namespace_add() method currently rejects payloads
that contain only metadata without embeddings or documents, causing
metadata-only upserts on new IDs to fail. Update the _namespace_add() method to
support metadata-only payloads by allowing cases where embeddings and documents
are None but metadata is present, similar to how the update path already handles
this scenario. Ensure the method can process the add_metas, add_docs, and
add_embs parameters correctly when some of these are None values.
- Around line 2394-2408: The get_or_create_collection method has an early return
that checks if a collection exists and returns it immediately, but this bypasses
the namespace recovery mechanism. When use_namespace is True, the method should
allow create_collection to run and rebuild any missing physical tables in the
namespace, rather than returning an incomplete collection handle from the early
has_collection check. Either condition the early return to skip when
use_namespace is True, or remove the early return entirely so that the
create_collection call with use_namespace parameter can properly rebuild
catalog-only collections that are missing their physical tables.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c6c39f62-b1e5-4942-a1c8-731369aab9a0

📥 Commits

Reviewing files that changed from the base of the PR and between 626caca and 0be3947.

📒 Files selected for processing (9)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/collection.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/unit_tests/test_get_or_create_namespace_concurrency.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • tests/unit_tests/test_namespace_upsert_reconcile.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/pyseekdb/client/collection.py
  • tests/unit_tests/test_namespace.py

Comment thread tests/integration_tests/test_namespace_lifecycle_concurrency.py Outdated
Comment thread tests/integration_tests/test_namespace_upsert_concurrency.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 2400-2403: The conflict fallback mechanism that calls
get_collection() does not re-verify the namespace state before returning, which
can result in returning incompletely initialized collections or collections in
the wrong mode under concurrent namespace creation. Before returning a
conflicted collection in the code around the has_collection(name) check and
_is_incomplete_ns_collection(name) validation, you must re-validate the
namespace state to ensure the returned collection from get_collection() is
properly initialized with ready physical tables and in the correct namespace
mode. Apply this same fix to both occurrences mentioned at lines 2400-2403 and
2422-2424.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cc496da7-150a-4d63-8ec6-e3db8a55efc0

📥 Commits

Reviewing files that changed from the base of the PR and between 0be3947 and 8b50723.

📒 Files selected for processing (8)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/meta_info.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/validators.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_namespace.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/validators.py
  • src/pyseekdb/client/meta_info.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/collection.py
  • tests/unit_tests/test_namespace.py

Comment thread src/pyseekdb/client/client_base.py
[pyseekdb] 多个client并发调用get_or_create_collection接口,在sdk_collection存在多个同名collection

https://project.alipay.com/project/W24001004732/P26001026476/workitemDetail?openWorkItemId=2026062300116920158&status=status

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 2365-2367: The `_catalog_database()` method returns the database
identifier without any validation or proper escaping, creating a SQL injection
risk if the database name contains special characters like backticks. Add
validation to the `_catalog_database()` method to enforce allowed characters
(using a similar pattern to `_validate_collection_name()` which restricts to
alphanumeric and underscore characters), or alternatively implement proper
identifier escaping in `_qtable()` to escape backticks within the database name
(convert backtick to double-backtick for proper MySQL/OceanBase syntax). This
validation should apply to all SQL queries using `_qtable()` that include the
database identifier.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 72f1fff2-1e04-464c-8323-8c0367faf763

📥 Commits

Reviewing files that changed from the base of the PR and between 8b50723 and 444d4ba.

📒 Files selected for processing (5)
  • src/pyseekdb/client/client_base.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_get_or_create_collection_multiprocess.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
💤 Files with no reviewable changes (1)
  • tests/integration_tests/test_logic_table_monitoring_p1.py

Comment thread src/pyseekdb/client/client_base.py
@JingYuChe JingYuChe changed the title pyseekdb support agent database Add namespace-scoped collections with IVF vector indexing for agent database Jun 24, 2026
@JingYuChe

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/integration_tests/test_namespace_lifecycle_concurrency.py (1)

98-139: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Synchronize before releasing the blocked drop.

The polling loop can run after Line 133 releases DROP_NAMESPACE, so this may not exercise has_namespace() during the blocked drop. Wait for one successful has_namespace() call while the drop is still paused, then release the drop and assert the final state.

Suggested patch
     errors: list[Exception] = []
     has_results: list[bool] = []
+    has_completed_while_drop_blocked = threading.Event()
@@
                 assert drop_started.wait(timeout=30), "drop did not start"
                 for _ in range(10):
                     has_results.append(checker.has_namespace(ns_name))
+                    has_completed_while_drop_blocked.set()
                     time.sleep(0.05)
@@
-        drop_can_finish.set()
+        observed_during_blocked_drop = has_completed_while_drop_blocked.wait(timeout=30)
+        drop_can_finish.set()
         drop_thread.join(timeout=120)
         has_thread.join(timeout=120)
 
         assert errors == []
-        assert has_results
+        assert observed_during_blocked_drop, "has_namespace did not complete while drop was blocked"
+        assert has_results and has_results[0] is True
         assert checker.has_namespace(ns_name) is False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/test_namespace_lifecycle_concurrency.py` around lines
98 - 139, The concurrency test in test_namespace_lifecycle_concurrency should
ensure has_namespace() runs while the DROP_NAMESPACE operation is still blocked,
not only after it is released. Update the _has_worker and drop synchronization
so the polling loop waits for at least one successful
checker.has_namespace(ns_name) call before drop_can_finish is set, then release
the blocked delete_namespace path and verify the final false state afterward.
Use the existing symbols drop_started, drop_can_finish, _drop_worker,
_has_worker, and checker.has_namespace to keep the test focused on the intended
overlap.
src/pyseekdb/client/client_base.py (2)

5252-5277: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fail when duplicate-row reconciliation exhausts its retry budget.

After 120 attempts this loop just falls through, so _namespace_upsert() can return successfully while duplicate business IDs still exist. That turns a retry failure into silent data corruption.

Suggested fix
             for attempt in range(120):
                 duplicate_count = self._count_namespace_records_by_id(
                     table_name, ns_id, ltable_id, record_id
                 )
                 if duplicate_count <= 1:
                     break
                 ...
                 if attempt < 119:
                     time.sleep(0.05 * min(attempt + 1, 10))
+            else:
+                raise ValueError(
+                    f"Failed to reconcile duplicate namespace rows for record_id={record_id!r}"
+                )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 5252 - 5277, The
duplicate-row reconciliation loop in `_namespace_upsert()` currently stops after
120 attempts without surfacing failure, which can let duplicate business IDs
remain silently. Update this retry block to explicitly detect when the budget is
exhausted and raise an error or otherwise fail the upsert instead of falling
through; keep the fix near the `_count_namespace_records_by_id`,
`_delete_namespace_records_by_id`, and `_namespace_add` calls so the retry
exhaustion is handled in the same path.

212-225: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Persist the actual namespace index layout, not a hard-coded one.

_build_default_ltable_schema() always records FULLTEXT and IVF entries, but _create_namespace_physical_tables() only creates those indexes when the corresponding config is present. That leaves logic_schema_table claiming optional indexes exist even when the physical table was created without them.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 212 - 225, The default
logical-table schema in _build_default_ltable_schema() is hard-coding FULLTEXT
and IVF indexes even when _create_namespace_physical_tables() does not create
them. Update the schema-building flow so the recorded index_info reflects the
actual indexes created for the namespace, using the same config-driven
conditions as _create_namespace_physical_tables() and keeping only the indexes
that were actually provisioned.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 1191-1194: The unique-index migration in the catalog setup is
currently hidden by contextlib.suppress(Exception), so failures on existing
sdk_* tables can leave the catalog unconstrained. In client_base.py, update the
CREATE UNIQUE INDEX calls in the catalog migration path to stop swallowing all
exceptions and let failures surface (or handle them explicitly with
logging/propagation), using the same approach for both the collection and
namespace index creation blocks referenced by the catalog setup methods.

In `@src/pyseekdb/client/validators.py`:
- Around line 51-54: Reject boolean n_results explicitly in _validate_n_results
before the existing integer validation, since bool is a subclass of int and
True/False can slip through as valid counts. Update the guard in
src/pyseekdb/client/validators.py so the check in _validate_n_results treats
bool as invalid, while still allowing only real integers within the existing
max_results constraint.

---

Outside diff comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 5252-5277: The duplicate-row reconciliation loop in
`_namespace_upsert()` currently stops after 120 attempts without surfacing
failure, which can let duplicate business IDs remain silently. Update this retry
block to explicitly detect when the budget is exhausted and raise an error or
otherwise fail the upsert instead of falling through; keep the fix near the
`_count_namespace_records_by_id`, `_delete_namespace_records_by_id`, and
`_namespace_add` calls so the retry exhaustion is handled in the same path.
- Around line 212-225: The default logical-table schema in
_build_default_ltable_schema() is hard-coding FULLTEXT and IVF indexes even when
_create_namespace_physical_tables() does not create them. Update the
schema-building flow so the recorded index_info reflects the actual indexes
created for the namespace, using the same config-driven conditions as
_create_namespace_physical_tables() and keeping only the indexes that were
actually provisioned.

In `@tests/integration_tests/test_namespace_lifecycle_concurrency.py`:
- Around line 98-139: The concurrency test in
test_namespace_lifecycle_concurrency should ensure has_namespace() runs while
the DROP_NAMESPACE operation is still blocked, not only after it is released.
Update the _has_worker and drop synchronization so the polling loop waits for at
least one successful checker.has_namespace(ns_name) call before drop_can_finish
is set, then release the blocked delete_namespace path and verify the final
false state afterward. Use the existing symbols drop_started, drop_can_finish,
_drop_worker, _has_worker, and checker.has_namespace to keep the test focused on
the intended overlap.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2563a8a3-7220-4637-ade1-285ec6909c6e

📥 Commits

Reviewing files that changed from the base of the PR and between 8b50723 and c688423.

📒 Files selected for processing (21)
  • src/pyseekdb/client/__init__.py
  • src/pyseekdb/client/admin_client.py
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/client_seekdb_embedded.py
  • src/pyseekdb/client/client_seekdb_server.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/validators.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_get_or_create_collection_multiprocess.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_drop_validation.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • tests/integration_tests/test_namespace_lifecycle.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • tests/unit_tests/test_namespace_upsert_reconcile.py
💤 Files with no reviewable changes (1)
  • tests/integration_tests/test_logic_table_monitoring_p1.py
🚧 Files skipped from review as they are similar to previous changes (12)
  • src/pyseekdb/client/init.py
  • src/pyseekdb/client/client_seekdb_embedded.py
  • src/pyseekdb/client/admin_client.py
  • src/pyseekdb/client/namespace.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/unit_tests/test_namespace.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • src/pyseekdb/client/collection.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_namespace_lifecycle.py
  • tests/integration_tests/test_namespace_drop_validation.py

Comment thread src/pyseekdb/client/client_base.py Outdated
Comment thread src/pyseekdb/client/validators.py
@JingYuChe

Copy link
Copy Markdown
Author

@coderabbitai resume

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Reviews resumed.

@JingYuChe

Copy link
Copy Markdown
Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (4)
tests/unit_tests/test_document_query_builder.py (2)

50-56: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Also assert that $not_contains survives wrapping.

document_expr_as_knn_filter() is supposed to preserve the original must_not clause and add the permissive exists filter. This test only checks the added filter, so an implementation that drops must_not would still pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_document_query_builder.py` around lines 50 - 56, The
test for document KNN wrapping only verifies the added exists filter and can
miss a regression where the original $not_contains / must_not clause is dropped.
Update test_pure_must_not_adds_exists_filter in document_expr_as_knn_filter /
build_document_hybrid_expression to also assert that the wrapped bool query
still contains the original must_not from $not_contains while keeping the
permissive exists filter.

29-40: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Strengthen the nested $and/$or assertion.

This test only proves that some bool.must wrapper exists. It would still pass if the nested $or branch or the "c" term were dropped during translation, which is the main behavior this case is meant to protect. Please assert the concrete nested structure here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_document_query_builder.py` around lines 29 - 40, The
nested `$and`/`$or` test in `test_and_or_nested` is too weak because it only
checks for a generic `bool.must` wrapper and could miss dropped branches during
translation. Update the assertions to verify the concrete structure returned by
`build_document_hybrid_expression`, including the nested `$or` branch and the
`"c"` term, so the test specifically protects the full nested boolean
translation.
src/pyseekdb/client/client_base.py (1)

149-154: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Parenthesize the mixed and/or for clarity. The final clause "1061" in message and "duplicate" in message binds tighter than the surrounding or, which is the intended grouping but is easy to misread.

♻️ Suggested tweak
     if (
         "already exists" in message
         or "duplicate key name" in message
         or "code=1061" in message
-        or "1061" in message and "duplicate" in message
+        or ("1061" in message and "duplicate" in message)
     ):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/client_base.py` around lines 149 - 154, The conditional
in client_base.py mixes or and and in a way that is correct but hard to read;
update the message-checking logic in the relevant exception handler to
explicitly parenthesize the final "1061" in message and "duplicate" in message
clause alongside the other or conditions. Keep the behavior the same, but make
the grouping obvious by clarifying the boolean expression used in that
duplicate-detection check.

Source: Linters/SAST tools

tests/integration_tests/namespace_hybrid_search_helpers.py (1)

137-139: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Use strict=True in zip for the ground-truth distance. Since l2_squared computes the expected-KNN baseline, a query/corpus dimension mismatch would silently truncate and yield a wrong expected ordering, masking a real regression instead of failing loudly.

♻️ Suggested tweak
-    return sum((x - y) ** 2 for x, y in zip(a, b))
+    return sum((x - y) ** 2 for x, y in zip(a, b, strict=True))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration_tests/namespace_hybrid_search_helpers.py` around lines 137
- 139, The l2_squared helper used for the expected KNN baseline should fail on
vector length mismatches instead of silently truncating. Update the zip call
inside l2_squared in namespace_hybrid_search_helpers.py to use strict=True so
any query/corpus dimension mismatch raises immediately, and keep the behavior
localized to this ground-truth distance helper.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 254-275: The unused merge_into_knn_filter helper should be removed
because knn filter merging is already handled elsewhere and client_base does not
use it. Delete the merge_into_knn_filter function from document_query_builder
and remove its import from client_base, then verify _build_knn_expression and
_inject_filter_into_knn continue to cover all merge behavior.
- Around line 14-22: The query-string escaping in _query_string_contains is
using SQL-style escaping instead of search-syntax escaping, so
Lucene/Elasticsearch reserved characters can be misread. Update
_query_string_contains to use a query_string-aware escaper for the query
payload, and apply the same escaping approach anywhere the document query
builder combines terms with $and/$or so those joins preserve search syntax
correctly.

In `@src/pyseekdb/client/kernel_errors.py`:
- Line 64: The boolean expression in the kernel error check mixes or and and
without parentheses, which Ruff flags in the kernel_errors logic. Update the
condition in the relevant text-matching branch to group the and subexpression
explicitly, keeping the existing behavior while making the intent clear and
lint-safe. Use the surrounding kernel_errors.py check that references "being
dropped" and "namespace" to locate the expression.

In `@tests/integration_tests/test_namespace_query_where_document.py`:
- Around line 98-119: The document filter tests can pass vacuously when the
query returns no rows, so add a non-empty result assertion before checking for
violations. Update test_where_document_not_contains, test_where_document_and,
test_where_document_or, test_where_document_regex, and the two nested tests to
assert that docs is not empty (like the baseline test) after ns.query, then keep
the existing exclusion checks so regressions that filter everything out are
caught.

---

Nitpick comments:
In `@src/pyseekdb/client/client_base.py`:
- Around line 149-154: The conditional in client_base.py mixes or and and in a
way that is correct but hard to read; update the message-checking logic in the
relevant exception handler to explicitly parenthesize the final "1061" in
message and "duplicate" in message clause alongside the other or conditions.
Keep the behavior the same, but make the grouping obvious by clarifying the
boolean expression used in that duplicate-detection check.

In `@tests/integration_tests/namespace_hybrid_search_helpers.py`:
- Around line 137-139: The l2_squared helper used for the expected KNN baseline
should fail on vector length mismatches instead of silently truncating. Update
the zip call inside l2_squared in namespace_hybrid_search_helpers.py to use
strict=True so any query/corpus dimension mismatch raises immediately, and keep
the behavior localized to this ground-truth distance helper.

In `@tests/unit_tests/test_document_query_builder.py`:
- Around line 50-56: The test for document KNN wrapping only verifies the added
exists filter and can miss a regression where the original $not_contains /
must_not clause is dropped. Update test_pure_must_not_adds_exists_filter in
document_expr_as_knn_filter / build_document_hybrid_expression to also assert
that the wrapped bool query still contains the original must_not from
$not_contains while keeping the permissive exists filter.
- Around line 29-40: The nested `$and`/`$or` test in `test_and_or_nested` is too
weak because it only checks for a generic `bool.must` wrapper and could miss
dropped branches during translation. Update the assertions to verify the
concrete structure returned by `build_document_hybrid_expression`, including the
nested `$or` branch and the `"c"` term, so the test specifically protects the
full nested boolean translation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d4fbc95-a5a6-43f4-aae7-2525641835cf

📥 Commits

Reviewing files that changed from the base of the PR and between c688423 and 908186a.

📒 Files selected for processing (50)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/client_seekdb_server.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/document_query_builder.py
  • src/pyseekdb/client/filters.py
  • src/pyseekdb/client/kernel_errors.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/schema.py
  • src/pyseekdb/client/validators.py
  • tests/integration_tests/namespace_dml_helpers.py
  • tests/integration_tests/namespace_fts_helpers.py
  • tests/integration_tests/namespace_hybrid_search_helpers.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_get_or_create_collection_multiprocess.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_dml.py
  • tests/integration_tests/test_namespace_dml_multi_coll.py
  • tests/integration_tests/test_namespace_dml_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_dml_multi_ns.py
  • tests/integration_tests/test_namespace_drop_validation.py
  • tests/integration_tests/test_namespace_get_delete_filters.py
  • tests/integration_tests/test_namespace_hybrid_search_combined.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_search_index.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_vector.py
  • tests/integration_tests/test_namespace_lifecycle.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_optional_indexes.py
  • tests/integration_tests/test_namespace_prewarm.py
  • tests/integration_tests/test_namespace_prewarm_lob.py
  • tests/integration_tests/test_namespace_query.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/integration_tests/test_namespace_reconnect.py
  • tests/integration_tests/test_namespace_session_vars.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/unit_tests/test_build_document_query.py
  • tests/unit_tests/test_build_knn_filter.py
  • tests/unit_tests/test_build_query_expression.py
  • tests/unit_tests/test_document_query_builder.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_get_or_create_namespace_concurrency.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_kernel_errors.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • tests/unit_tests/test_namespace_upsert_reconcile.py
🚧 Files skipped from review as they are similar to previous changes (36)
  • tests/unit_tests/test_build_knn_filter.py
  • tests/integration_tests/test_namespace_hybrid_search_search_index.py
  • tests/integration_tests/test_namespace_hybrid_search_combined.py
  • tests/integration_tests/test_namespace_lifecycle_concurrency.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/unit_tests/test_namespace_lifecycle_lock.py
  • src/pyseekdb/client/client_seekdb_server.py
  • tests/integration_tests/test_namespace_dml_multi_coll_multi_ns.py
  • tests/unit_tests/test_build_document_query.py
  • tests/unit_tests/test_get_or_create_namespace_concurrency.py
  • tests/integration_tests/test_logic_table_monitoring_p1.py
  • tests/integration_tests/test_namespace_session_vars.py
  • tests/integration_tests/test_namespace_dml_multi_ns.py
  • tests/integration_tests/test_get_or_create_collection_concurrency.py
  • tests/unit_tests/test_build_query_expression.py
  • src/pyseekdb/client/schema.py
  • tests/integration_tests/test_namespace_hybrid_search_vector.py
  • tests/integration_tests/test_namespace_dml_multi_coll.py
  • tests/integration_tests/test_namespace_reconnect.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_upsert_concurrency.py
  • tests/integration_tests/test_namespace_query.py
  • tests/integration_tests/test_namespace_prewarm.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/namespace.py
  • tests/integration_tests/test_namespace_get_delete_filters.py
  • tests/integration_tests/test_namespace_hybrid_search_fulltext.py
  • tests/integration_tests/test_namespace_prewarm_lob.py
  • tests/unit_tests/test_get_or_create_collection_concurrency.py
  • tests/integration_tests/test_namespace_hybrid_search_multi_coll_multi_ns.py
  • tests/integration_tests/test_namespace_hybrid_search_triple_branch_multi_coll_multi_ns.py
  • tests/integration_tests/namespace_dml_helpers.py
  • tests/integration_tests/test_namespace_drop_validation.py
  • tests/unit_tests/test_namespace.py
  • tests/integration_tests/namespace_fts_helpers.py

Comment thread src/pyseekdb/client/document_query_builder.py
Comment thread src/pyseekdb/client/document_query_builder.py Outdated
Comment thread src/pyseekdb/client/kernel_errors.py Outdated
Comment thread tests/integration_tests/test_namespace_query_where_document.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pyseekdb/client/document_query_builder.py (1)

149-159: 🎯 Functional Correctness | 🔴 Critical

The fast path for $and/$or combining $contains clauses alters boolean semantics for multi-word terms.

The optimization at lines 149-159 (and 173-183) joins all $contains values with spaces and forces a default_operator.

  • Non-optimized path: Generates separate query_string clauses for each condition. In Lucene/Elasticsearch, a raw string like "foo bar" without an explicit operator defaults to foo OR bar. Thus {"$and": [...]} results in (foo OR bar) AND baz.
  • Optimized path: Joins to "foo bar baz" and sets default_operator: "and". Lucene parses this as foo AND bar AND baz.

This changes the semantics for any multi-word term. The optimization should only apply if every $contains value is a single atomic term (no spaces or operators), or it should be removed to fall back to the correct bool composition.

Example discrepancy

Input: {"$and": [{"$contains": "foo bar"}, {"$contains": "baz"}]}

  • Expected (current fallback): (foo OR bar) AND baz
  • Current Fast Path: foo AND bar AND baz
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/pyseekdb/client/document_query_builder.py` around lines 149 - 159, The
`$and`/$`or` fast-path in `document_query_builder` changes semantics for
multi-word `$contains` values by collapsing them into one `query_string` and
forcing `default_operator`; update the optimization in the `$and` and `$or`
branches to only trigger for single-token atomic terms, or remove it and use the
existing bool composition fallback. Keep the logic aligned with the surrounding
query-building path in `document_query_builder` so multi-word phrases still
behave the same as the non-optimized `query_string` clauses.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 14-17: The backslash handling in _escape_query_string_term is
duplicated because _QUERY_STRING_RESERVED_RE already matches backslashes, so
remove the manual text.replace("\\", "\\\\") and rely on the regex substitution
alone. Update the logic in _escape_query_string_term so every reserved
character, including backslash, is escaped exactly once and consistently with
the other Lucene query_string characters.

---

Outside diff comments:
In `@src/pyseekdb/client/document_query_builder.py`:
- Around line 149-159: The `$and`/$`or` fast-path in `document_query_builder`
changes semantics for multi-word `$contains` values by collapsing them into one
`query_string` and forcing `default_operator`; update the optimization in the
`$and` and `$or` branches to only trigger for single-token atomic terms, or
remove it and use the existing bool composition fallback. Keep the logic aligned
with the surrounding query-building path in `document_query_builder` so
multi-word phrases still behave the same as the non-optimized `query_string`
clauses.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78a0309a-8d7d-49fa-8794-0debfa57241b

📥 Commits

Reviewing files that changed from the base of the PR and between 908186a and 1ac0899.

📒 Files selected for processing (10)
  • docs/guide/dql.md
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/document_query_builder.py
  • src/pyseekdb/client/kernel_errors.py
  • tests/integration_tests/namespace_hybrid_search_helpers.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/unit_tests/test_document_query_builder.py
  • tests/unit_tests/test_namespace.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/pyseekdb/client/kernel_errors.py
  • tests/integration_tests/test_namespace_query_where_document.py
  • tests/unit_tests/test_document_query_builder.py
  • src/pyseekdb/client/configuration.py
  • tests/unit_tests/test_namespace.py

Comment thread src/pyseekdb/client/document_query_builder.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit_tests/test_namespace_no_index_embedding_dim.py (1)

11-47: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add coverage for None embeddings.

_validate_namespace_explicit_embedding_dimensions() now has a dedicated if vec is None: continue branch, but these tests only cover full-vector match/mismatch paths. A small regression test here would lock in the intended mixed-payload behavior for both indexed and no-index namespaces.

Suggested test
 class TestNoIndexExplicitEmbeddingDimension:
+    def test_ignores_none_embeddings(self):
+        _validate_namespace_no_index_explicit_embeddings(
+            [None, [0.0] * 384],
+            expected_dimension=384,
+        )
+
     def test_accepts_matching_dimension(self):
         _validate_namespace_no_index_explicit_embeddings(
             [[0.0] * 384, [1.0] * 384],
             expected_dimension=384,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit_tests/test_namespace_no_index_embedding_dim.py` around lines 11 -
47, Add regression coverage for None embeddings in the existing namespace
dimension tests. Extend TestNoIndexExplicitEmbeddingDimension and
TestIvfExplicitEmbeddingDimension to call
_validate_namespace_no_index_explicit_embeddings and
_validate_namespace_explicit_embedding_dimensions with a payload that mixes None
and valid vectors, and assert that None entries are skipped while the remaining
embeddings still validate correctly. Use the existing helper names and
expected_dimension/has_vector_index parameters so the new test locks in the vec
is None branch behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit_tests/test_namespace_no_index_embedding_dim.py`:
- Around line 11-47: Add regression coverage for None embeddings in the existing
namespace dimension tests. Extend TestNoIndexExplicitEmbeddingDimension and
TestIvfExplicitEmbeddingDimension to call
_validate_namespace_no_index_explicit_embeddings and
_validate_namespace_explicit_embedding_dimensions with a payload that mixes None
and valid vectors, and assert that None entries are skipped while the remaining
embeddings still validate correctly. Use the existing helper names and
expected_dimension/has_vector_index parameters so the new test locks in the vec
is None branch behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 941c6c59-88ef-4cbc-b37c-81fdf3d6b2e8

📥 Commits

Reviewing files that changed from the base of the PR and between 1ac0899 and b39abbd.

📒 Files selected for processing (9)
  • src/pyseekdb/client/client_base.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/namespace.py
  • src/pyseekdb/client/validators.py
  • tests/integration_tests/test_namespace_constraints.py
  • tests/integration_tests/test_namespace_optional_indexes.py
  • tests/unit_tests/test_namespace.py
  • tests/unit_tests/test_namespace_no_index_embedding_dim.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/integration_tests/test_namespace_optional_indexes.py
  • src/pyseekdb/client/collection.py
  • src/pyseekdb/client/configuration.py
  • src/pyseekdb/client/namespace.py
  • tests/unit_tests/test_namespace.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants