Move AsyncMinHashLSH from experimental to datasketch.aio#314
Conversation
- Create new `datasketch/aio/` submodule with AsyncMinHashLSH, AsyncMinHashLSHInsertionSession, and AsyncMinHashLSHDeleteSession - Add optional AsyncMinHashLSH export to main datasketch package - Deprecate experimental module with warnings pointing to new location - Update tests to use new import path - Rename `experimental_aio` dependency group to `aio` (keep alias) - Remove experimental from coverage omit New import paths: - `from datasketch.aio import AsyncMinHashLSH` - `from datasketch import AsyncMinHashLSH` Old imports still work but emit DeprecationWarning. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Sort __all__ lists alphabetically (RUF022) - Add noqa: E402 for intentional imports after deprecation warnings - Add blank line after docstring Example section (D413) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bug fixes: - Add missing await on self.execute() in AsyncRedisBuffer - Fix getmany() to correctly use Redis pipeline (don't await individual commands) - Add getmany() method to AsyncMongoListStorage for get_subset_counts support - Fix typo: "AsyncMinHash" -> "AsyncMinHashLSH" in error message - Remove dead code: await self.keys (storage objects aren't awaitable) Performance: - Change tuple buffers to lists in AsyncMongoBuffer for O(1) append Documentation: - Fix module docstring example to use prepickle=True for string keys - Simplify insertion_session and delete_session docstring examples Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This test exercises the getmany() method that was fixed in a previous commit, ensuring proper coverage for the async implementation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow triggers from workflow_run (which runs after main Test completes and doesn't show as PR checks) to push/pull_request triggers. This makes integration test results visible directly in PR status checks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change workflow trigger from workflow_run to push/pull_request triggers for consistency with MongoDB and Redis test workflows. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Initial plan * Update documentation and fix RedisStorage import issue Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com> * Add clarifying comments for optional Redis dependency handling Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ekzhu <320302+ekzhu@users.noreply.github.com>
…ecations (#313) - Delete orphaned datasketch/experimental/aio/storage.py. Nothing imports from it (the shims all forward to datasketch.aio.lsh), and the stale file still contained the pre-fix bugs that datasketch/aio/storage.py silently fixed: un-awaited AsyncRedisBuffer.execute(), tuple-based buffer stacks, missing AsyncMongoListStorage.getmany, and a broken pipeline getmany impl. - Drop the dead try/except ImportError fallback around `from datasketch.aio import AsyncMinHashLSH` in datasketch/__init__.py. datasketch.aio.storage already degrades gracefully when motor/redis are absent, so the fallback was unreachable and only served to rebind the public symbol to None for type checkers. - Fix AsyncMinHashLSH.get_subset_counts under prepickle=True: query keys now get pickled to match the stored representation, so the lookup actually finds the inserted keys. Adds a regression test. - Refactor the experimental deprecation shims to emit exactly one DeprecationWarning per deprecated import path (down from three). Uses PEP 562 __getattr__ with globals() caching so the warning fires on first attribute access and subsequent accesses skip the hook. Static analyzers see the __all__ names via TYPE_CHECKING-guarded imports. - Minor: add asyncio.run(main()) to the datasketch.aio docstring example so it's copy-paste runnable, and add a KEEP IN SYNC comment on the duplicated experimental_aio extras group in pyproject.toml. Co-authored-by: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request transitions the asynchronous MinHash LSH implementation from an experimental status to a stable datasketch.aio module, including deprecation proxies for the old paths. The changes also introduce a getmany method for storage backends and update the project configuration to include an aio extra. Feedback focuses on resolving a potential type error regarding the storage basename, restoring explicit storage initialization to prevent runtime issues, and optimizing MongoDB's getmany implementation with asyncio.gather.
Addresses the gemini-code-assist review comment on PR #314: the previous sequential loop issued N round trips in series. asyncio.gather runs the underlying find() queries concurrently. https://claude.ai/code
|
@dipeshbabu could you help review this PR? |
|
There’s one blocking issue here in the Redis async storage path.
I think this should be fixed before merge, either by routing |
Hardcoding `LRANGE` in `AsyncRedisListStorage.getmany` broke the inherited method on `AsyncRedisSetStorage`: the subclass overrides `_get_items` to use `SMEMBERS`, but the pipeline in `getmany` bypassed that override and would fail with `WRONGTYPE` against set keys. Route `getmany` through `_get_items` so the subclass override applies, following the existing `itemcounts`/`_get_len` pattern in the same file. Add a regression test that calls `getmany` on an unordered (set-backed) hashtable storage, which exercises the previously-broken path under Redis. Update the three integration-test workflows (mongo, redis, cassandra) to install the new `aio` extra instead of the deprecated `experimental_aio` alias, aligning CI install coverage with the new public API surface. https://claude.ai/code
dipeshbabu
left a comment
There was a problem hiding this comment.
The Redis `getmany()` issue I flagged is fixed in `feead1d`; routing through `_get_items()` restores the
`AsyncRedisSetStorage`/`SMEMBERS` behavior, and the added regression test covers that path. The integration workflows are also now
installing the new `aio` extra, and the latest CI run is green.
I found one remaining blocker before promoting this out of `experimental`: `prepickle=True` is not applied consistently in the async
public key APIs. `has_key()` and `_remove()` still use the caller's raw key, while `_insert()` stores the pickled key. This breaks
`has_key()` and `remove()` for non-bytes keys with `prepickle=True`.
Please fix that and add regression coverage in `test/aio/test_lsh.py`, ideally under `TestAsyncMinHashLSHWithPrepickle` starting at
line 583. The tests should cover `has_key()`, `remove()`, and ideally `delete_session.remove()` with a string key.`AsyncMinHashLSH.has_key()` forwarded the caller's raw key to storage,
and `_remove()` reused the raw key for the existence check, hashtable
lookup, and the delete itself. Under `prepickle=True` the storage holds
pickled bytes, so:
await lsh.insert("a", m)
await lsh.has_key("a") # returned False
await lsh.remove("a") # raised ValueError
Mirror the sync `MinHashLSH._remove` pattern: pickle the key inside
`_remove` first and call the storage primitives (`self.keys.has_key`,
`self.keys.get`, `self.keys.remove`, `hashtable.remove_val`) directly so
they all see the stored representation.
`has_key()` now pickles internally for the public API. `_insert()` was
calling `has_key()` after it had already pickled the key, which
accidentally worked but would have started double-pickling once
`has_key` learned to pickle on its own; switch `_insert`'s duplicate
check to `self.keys.has_key()` directly to keep that path single-pickle.
Add regression tests under `TestAsyncMinHashLSHWithPrepickle` covering
`has_key`, `remove`, and `delete_session.remove` with string keys.
https://claude.ai/code
Add four tests covering behaviors that are part of the public async contract but were only tested implicitly (or not at all): - test_insert_duplicate_raises: re-inserting an existing bytes key must raise ValueError - test_insert_check_duplication_false: check_duplication=False bypass must allow re-insertion without error - test_is_empty: is_empty() must flip False after insert and back to True after removing all keys - test_insert_duplicate_prepickle: duplicate detection must also work for pickled string keys and the bypass must still apply Design gaps (no async `merge` or `add_to_query_buffer`) are tracked separately - they are missing APIs rather than missing tests and are out of scope for this PR. https://claude.ai/code
Pipeline command methods in redis.asyncio are synchronous (they queue and return self). The previous approach routed through async _get_items which awaited the pipeline return value — while Pipeline.__await__ makes this technically valid, it adds unnecessary async machinery to what should be a synchronous queue operation. Revert AsyncRedisListStorage.getmany to the original direct pipe.lrange() pattern, and add an explicit getmany override in AsyncRedisSetStorage that uses pipe.smembers(). This is the "override in subclass" approach from the reviewer's suggestion, and keeps pipeline command queuing purely synchronous. https://claude.ai/code
The sync MinHashLSH._query_b unpickles candidates before returning them when prepickle is enabled. The async version was missing this step, returning raw pickled bytes instead of original keys. https://claude.ai/code
Summary
Promotes
AsyncMinHashLSHout of thedatasketch.experimentalnamespace into a first-classdatasketch.aiosubmodule, deprecates the old import paths, and runs the storage-backend integration tests directly on PRs. This is a re-submission of #301 from a new branch, with the description rewritten to accurately reflect the changes.What does this PR do?
New
datasketch.aiosubmoduledatasketch/aio/__init__.py,datasketch/aio/lsh.py, and movesdatasketch/experimental/aio/storage.pytodatasketch/aio/storage.py.AsyncMinHashLSH,AsyncMinHashLSHInsertionSession, andAsyncMinHashLSHDeleteSessionfromdatasketch.aio.AsyncMinHashLSHfrom the top-leveldatasketchpackage sofrom datasketch import AsyncMinHashLSHworks. The import itself is always safe; instantiation still requiresmotor/redis.asyncioto be installed.Deprecation of
datasketch.experimentaldatasketch.experimental,datasketch.experimental.aio, anddatasketch.experimental.aio.lshnow re-export fromdatasketch.aioand emit aDeprecationWarningon attribute access.__getattr__and cached back intoglobals()so the warning fires exactly once per process and intermediate package imports don't trigger redundant warnings.TYPE_CHECKINGre-exports keep static analyzers happy without importing the real symbols eagerly.Packaging
aiooptional-dependency group inpyproject.toml(aiounittest,motor>3.6.0).experimental_aioas a duplicated alias for backwards compatibility (PEP 621 does not support referencing another optional-dependency group).*/experimental/*from the coverageomitlist now that the module is no longer experimental-only.Docs
docs/documentation.rstanddocs/lsh.rstupdated to point atdatasketch.aio.AsyncMinHashLSH.CI
test-cassandra.yml,test-mongo.yml, andtest-redis.ymlnow trigger onpush/pull_requestagainstmasterdirectly instead of cascading off a completedTestworkflow run, so integration tests actually gate PRs.Tests
test/aio/test_lsh.pyupdated to import from the new location and gains a test for theget_subset_countsasync method.Backwards compatibility
Old imports still work:
New preferred imports:
Checklist
DeprecationWarning.https://claude.ai/code