HNSW index update with CDC #21917
Draft: cpegeric wants to merge 241 commits into matrixorigin:main from cpegeric:cdc_sqlexecutor_cleanup
daviszhen approved these changes on Jul 15, 2025
User description
What type of PR is this?
Which issue(s) this PR fixes:
issue #21835
What this PR does / why we need it:
To update the HNSW index via CDC changes.
The design doc:
https://github.com/cpegeric/mo-docs/blob/hnsw_cdc/design/mo/sql/20250501-cpegeric-hnswsync.md
PR Type
Enhancement, Tests
Description
• Implement comprehensive HNSW index CDC (Change Data Capture) synchronization functionality
• Add new `HnswSync` struct and `hnswCdcUpdate` SQL function for processing CDC updates via multi-threaded operations
• Introduce `hnswSyncSinker` for updating HNSW indexes with CDC changes, supporting both float32 and float64 vector types
• Refactor HNSW architecture by replacing `HnswBuildIndex` and `HnswSearchIndex` with a unified `HnswModel` structure
• Add CDC data structures (`VectorIndexCdc`, `VectorIndexCdcEntry`) and insert, update, and delete operations
• Implement transaction-aware SQL execution with the `RunTxn` function and enhanced error handling
• Add comprehensive test suites covering CDC sinker functionality, synchronization operations, and model operations
• Integrate CDC task creation into HNSW index creation workflow with automatic cleanup placeholders
• Enhance array casting with dimension validation and standardize error message formats across vector operations
• Add distributed test cases for HNSW CDC synchronization scenarios including bulk loads and incremental updates
Changes walkthrough 📝
12 files
hnsw_sinker_test.go
Add comprehensive test suite for HNSW CDC sinker
pkg/cdc/hnsw_sinker_test.go
• Comprehensive test suite for HNSW CDC sinker functionality with 692 lines of test code
• Mock implementations for SQL executors and error handling scenarios
• Test cases covering sinker creation, execution, error handling, and data processing
• Tests for snapshot and atomic batch processing with vector data
sync_test.go
Add test suite for HNSW CDC synchronization
pkg/vectorindex/hnsw/sync_test.go
• Test suite for HNSW CDC synchronization with various operation scenarios
• Tests covering upsert, delete, insert operations with single and multiple models
• Mock implementations for SQL execution and streaming operations
• Shuffle testing for concurrent operation handling
search_test.go
Enhance HNSW search tests with multi-file support
pkg/vectorindex/hnsw/search_test.go
• Added mock functions for catalog SQL operations and multi-file scenarios
• New test helper functions for creating metadata and index batches
• Enhanced test coverage for search operations with multiple index files
model_test.go
Add comprehensive test suite for HnswModel functionality
pkg/vectorindex/hnsw/model_test.go
• Added comprehensive test suite for `HnswModel` functionality
• Tests cover search operations, loading/unloading, add/remove operations, and SQL generation
• Includes edge case testing for nil model scenarios
• Uses mock SQL functions for testing database interactions
func_hnsw_test.go
Add test cases for HNSW CDC update function
pkg/sql/plan/function/func_hnsw_test.go
• Added test cases for `hnswCdcUpdate` function
• Tests various error conditions including null arguments and invalid JSON
• Validates function parameter validation and error handling
build_test.go
Update HNSW build tests to use new HnswModel structure
pkg/vectorindex/hnsw/build_test.go
• Updated test code to use `HnswModel` instead of `HnswSearchIndex`
• Changed function call from `NewHnswBuildIndex` to `NewHnswModelForBuild`
• Updated struct initialization to use new model type
types_test.go
Add test cases for vector index CDC functionality
pkg/vectorindex/types_test.go
• Added test cases for CDC functionality
• Tests Insert, Delete, Upsert operations and JSON serialization
• Validates CDC data structure behavior and state management
sinker_test.go
Update sinker tests for new function signature
pkg/cdc/sinker_test.go
• Updated test calls to `NewSinker` to include the new `cnUUID` parameter
• Maintains test compatibility with updated function signature
cdc_test.go
Update CDC test mocks for new sinker signature
pkg/frontend/cdc_test.go
• Updated mock `NewSinker` stub to include `cnUUID` parameter
• Maintains test compatibility with updated function signature
function_id_test.go
Update function ID tests for HNSW CDC function
pkg/sql/plan/function/function_id_test.go
• Updated predefined function IDs to include `HNSW_CDC_UPDATE`
• Incremented `FUNCTION_END_NUMBER` to maintain test consistency
vector_hnsw_sync.result
Add test results for HNSW CDC synchronization functionality
test/distributed/cases/vector/vector_hnsw_sync.result
• Added test results for HNSW CDC synchronization scenarios
• Covers empty data, bulk load, and incremental update test cases
• Validates vector search functionality after CDC operations
vector_hnsw_sync.sql
Add comprehensive HNSW CDC synchronization test cases
test/distributed/cases/vector/vector_hnsw_sync.sql
• Added comprehensive test cases for HNSW CDC synchronization
• Tests PITR and CDC task creation, data operations, and vector searches
• Includes scenarios for empty tables, bulk loads, and incremental updates
6 files
sync.go
Implement HNSW index CDC synchronization functionality
pkg/vectorindex/hnsw/sync.go
• New CDC synchronization functionality for HNSW index updates via SQL function `hnsw_cdc_update()`
• `HnswSync` struct for managing CDC insert, update, and delete operations
• Multi-threaded processing support with concurrent model loading and vector operations
• SQL generation for metadata and storage table updates
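The multi-threaded processing described above could be sketched roughly as follows. This is not the PR's code: the `op` type, the `applyConcurrently` helper, and the map standing in for the index are all hypothetical. The sketch routes every change for a given key to the same worker, one simple way to keep per-key ordering while still processing different keys in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// op is a hypothetical CDC change: "upsert" or "delete" on one key.
type op struct {
	kind string
	key  int64
	vec  []float32
}

// applyConcurrently fans ops out to nworkers goroutines (nworkers > 0).
// Ops for the same key always land on the same worker, so their
// relative order is preserved. The map is a stand-in for the index.
func applyConcurrently(index map[int64][]float32, ops []op, nworkers int) {
	var mu sync.Mutex
	var wg sync.WaitGroup
	queues := make([]chan op, nworkers)
	for i := range queues {
		// Buffer every queue so the dispatch loop never blocks.
		queues[i] = make(chan op, len(ops))
		wg.Add(1)
		go func(q <-chan op) {
			defer wg.Done()
			for o := range q {
				mu.Lock()
				if o.kind == "delete" {
					delete(index, o.key)
				} else {
					index[o.key] = o.vec
				}
				mu.Unlock()
			}
		}(queues[i])
	}
	for _, o := range ops {
		w := int(uint64(o.key) % uint64(nworkers))
		queues[w] <- o
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}

func main() {
	index := map[int64][]float32{}
	ops := []op{
		{"upsert", 1, []float32{1, 0}},
		{"upsert", 2, []float32{0, 1}},
		{"delete", 1, nil},
		{"upsert", 2, []float32{9, 9}},
	}
	applyConcurrently(index, ops, 4)
	fmt.Println(len(index), index[2])
}
```

A real index would bring its own concurrency rules; the single mutex here only keeps the stand-in map safe.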
hnsw_sinker.go
Add HNSW CDC sinker for vector index updates
pkg/cdc/hnsw_sinker.go
• New `hnswSyncSinker` implementation for updating HNSW indexes via CDC changes
• Support for both float32 and float64 vector types with JSON-based CDC updates
• Transaction-based SQL execution with error handling and rollback support
• Processing of snapshot and tail data with atomic batch operations
func_hnsw.go
Implement HNSW CDC update function for vector index synchronization
pkg/sql/plan/function/func_hnsw.go
• Implemented `hnswCdcUpdate` function for processing CDC updates
• Validates input parameters (database name, table name, dimension, CDC JSON)
• Calls `hnsw.CdcSync` to perform the actual synchronization
• Includes comprehensive error handling and logging
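As an illustration of the parameter checks listed above, here is a minimal sketch. It assumes nullable arguments are modeled as pointers; the helper name `validateCdcArgs` and the error texts are hypothetical and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// validateCdcArgs checks the four inputs named in the description:
// database name, table name, dimension, and the CDC JSON payload.
// Pointers model nullable SQL arguments; error texts are hypothetical.
func validateCdcArgs(db, table *string, dim int32, cdcJSON *string) error {
	switch {
	case db == nil || *db == "":
		return errors.New("database name is null or empty")
	case table == nil || *table == "":
		return errors.New("table name is null or empty")
	case dim <= 0:
		return fmt.Errorf("invalid dimension %d", dim)
	case cdcJSON == nil:
		return errors.New("cdc payload is null")
	case !json.Valid([]byte(*cdcJSON)):
		return errors.New("cdc payload is not valid JSON")
	}
	return nil
}

func main() {
	db, tbl, payload := "db1", "idx_tbl", `{"cdc":[]}`
	fmt.Println(validateCdcArgs(&db, &tbl, 3, &payload))

	broken := `{"cdc":[`
	fmt.Println(validateCdcArgs(&db, &tbl, 3, &broken))
	fmt.Println(validateCdcArgs(nil, &tbl, 3, &payload))
}
```

Rejecting bad input before any synchronization work starts keeps failures cheap, which matches the null-argument and invalid-JSON cases the tests are said to cover.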
util.go
Add CDC task generation for HNSW index synchronization
pkg/sql/compile/util.go
• Added `genCdcHnswIndex` function to generate CDC task creation SQL
• Creates PITR and CDC task SQL statements for HNSW index synchronization
• Includes placeholder logic for future CDC task registration
list_builtIn.go
Register HNSW CDC update function in built-in functions
pkg/sql/plan/function/list_builtIn.go
• Added `HNSW_CDC_UPDATE` function definition to built-in functions list
• Configured function signature with varchar and int32 parameters returning uint64
ddl_index_algo.go
Integrate CDC task creation into HNSW index creation
pkg/sql/compile/ddl_index_algo.go
• Added call to `genCdcHnswIndex` in vector HNSW index handling
• Executes generated CDC SQL statements during index creation
9 files
model.go
Refactor HNSW model with CDC support and enhanced operations
pkg/vectorindex/hnsw/model.go
• New `HnswModel` struct replacing `HnswBuildIndex` with enhanced CDC support
• Added dirty tracking, atomic length counters, and view mode support
• Enhanced file operations with checksum validation and streaming SQL loading
• Methods for concurrent vector operations and model lifecycle management
types.go
Add CDC data structures and operations for vector index
pkg/vectorindex/types.go
• Added CDC-related constants (`CDC_INSERT`, `CDC_UPSERT`, `CDC_DELETE`)
• Introduced `VectorIndexCdc` and `VectorIndexCdcEntry` structs for CDC operations
• Added `HnswCdcParam` struct for CDC parameters
• Implemented CDC data manipulation methods (Insert, Upsert, Delete, ToJson)
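To make the shape of these structures concrete, here is a minimal sketch. The type, constant, and method names follow the walkthrough above. The constant values, field names, JSON keys, and the use of an `int64` primary key are assumptions made for illustration and may differ from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Operation tags. The names follow the PR; the string values are assumed.
const (
	CDC_INSERT = "I"
	CDC_UPSERT = "U"
	CDC_DELETE = "D"
)

// VectorIndexCdcEntry is one change: an operation, a primary key and,
// for inserts and upserts, the vector. Field names are hypothetical.
type VectorIndexCdcEntry[T float32 | float64] struct {
	Type string `json:"t"`
	PKey int64  `json:"pk"`
	Vec  []T    `json:"v,omitempty"`
}

// VectorIndexCdc accumulates entries until the batch is serialized.
type VectorIndexCdc[T float32 | float64] struct {
	Data []VectorIndexCdcEntry[T] `json:"cdc"`
}

func (c *VectorIndexCdc[T]) Insert(pk int64, v []T) {
	c.Data = append(c.Data, VectorIndexCdcEntry[T]{Type: CDC_INSERT, PKey: pk, Vec: v})
}

func (c *VectorIndexCdc[T]) Upsert(pk int64, v []T) {
	c.Data = append(c.Data, VectorIndexCdcEntry[T]{Type: CDC_UPSERT, PKey: pk, Vec: v})
}

func (c *VectorIndexCdc[T]) Delete(pk int64) {
	c.Data = append(c.Data, VectorIndexCdcEntry[T]{Type: CDC_DELETE, PKey: pk})
}

// ToJson serializes the accumulated batch as a JSON string.
func (c *VectorIndexCdc[T]) ToJson() (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	var batch VectorIndexCdc[float32]
	batch.Insert(1, []float32{0.5, 1})
	batch.Upsert(1, []float32{0.25, 2})
	batch.Delete(2)
	js, err := batch.ToJson()
	fmt.Println(js, err)
}
```

The type parameter is one way to cover both float32 and float64 vectors with a single definition; the PR may handle the two element types differently.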
sinker.go
Add HNSW sync sinker support and improve error handling
pkg/cdc/sinker.go
• Added support for `CDCSinkType_HnswSync` sink type
• Updated `NewSinker` function signature to include `cnUUID` parameter
• Fixed potential nil pointer dereference in error handling
ddl.go
Add placeholder logic for CDC task cleanup in DDL operations
pkg/sql/compile/ddl.go
• Added TODO comments for CDC task cleanup in `DropIndex` and `DropTable` methods
• Placeholder logic for cleaning up CDC tasks when dropping vector/fulltext indexes
sqlexec.go
Add transaction-aware SQL execution function
pkg/vectorindex/sqlexec/sqlexec.go
• Added `RunTxn` function for executing SQL operations within transactions
• Provides transaction-aware SQL execution with proper context and options setup
func_cast.go
Enhance array casting with dimension validation
pkg/sql/plan/function/func_cast.go
• Enhanced array casting with dimension validation
• Added bypass for max dimension check when width equals `MaxArrayDimension`
• Improved error handling for dimension mismatches
hnsw.go
Relax table scan validation in HNSW query building
pkg/sql/plan/hnsw.go
• Commented out table scan validation in `buildHnswCreate`
• Relaxed constraints on child node type checking
cdc_options.go
Add HNSW sync sink type support in CDC options
pkg/frontend/cdc_options.go
• Added support for `CDCSinkType_HnswSync` in CDC options validation
• Extended sink type validation to include HNSW sync type
cdc_exector.go
Update CDC executor to pass CN UUID to sinker
pkg/frontend/cdc_exector.go
• Updated `NewSinker` call to include `cnUUID` parameter
• Passes executor's CN UUID to sinker creation
2 files
build.go
Refactor HNSW build to use unified model structure
pkg/vectorindex/hnsw/build.go
• Refactored to use `HnswModel` instead of `HnswBuildIndex` for consistency
• Removed duplicate `HnswBuildIndex` struct and related methods
• Updated build operations to work with the new unified model structure
search.go
Refactor HNSW search to use HnswModel instead of HnswSearchIndex
pkg/vectorindex/hnsw/search.go
• Removed `HnswSearchIndex` struct and related methods (`loadChunk`, `LoadIndex`, `Search`)
• Replaced `HnswSearchIndex` with `HnswModel` in the `HnswSearch` struct
• Refactored `LoadMetadata` to be a standalone function and updated field mappings
• Updated `LoadIndex` method to use new `HnswModel.LoadIndex` signature
2 files
function_id.go
Add function ID for HNSW CDC update function
pkg/sql/plan/function/function_id.go
• Added `HNSW_CDC_UPDATE` function ID constant
• Updated `FUNCTION_END_NUMBER` and function registry mapping
types.go
Add HNSW sync sink type constant
pkg/cdc/types.go
• Added `CDCSinkType_HnswSync` constant for HNSW synchronization sink type
3 files
vector_hnsw.result
Update vector dimension error message format
test/distributed/cases/vector/vector_hnsw.result
• Updated error message format for dimension mismatch from "vector ops between different dimensions" to "expected vector dimension X != actual dimension Y"
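For illustration, a check that emits the new message format might look like the sketch below. The helper name `checkDimension` is hypothetical; only the message text is taken from the description above.

```go
package main

import "fmt"

// checkDimension returns an error in the standardized format when the
// vector length differs from the expected dimension.
func checkDimension(expected int, vec []float32) error {
	if len(vec) != expected {
		return fmt.Errorf("expected vector dimension %d != actual dimension %d", expected, len(vec))
	}
	return nil
}

func main() {
	fmt.Println(checkDimension(3, []float32{1, 2, 3}))
	fmt.Println(checkDimension(3, []float32{1, 2}))
}
```

Reporting both the expected and the actual dimension makes a mismatch easier to diagnose than the older generic message.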
vector_index.result
Update vector index error message format
test/distributed/cases/vector/vector_index.result
• Updated error message format for dimension mismatch to use new standardized format
array.result
Update array dimension error message format
test/distributed/cases/array/array.result
• Updated error messages for array dimension mismatches to use new standardized format