fulltext/ivfflat/hnsw index update with ISCP #22414
Open
cpegeric wants to merge 292 commits into matrixorigin:main from cpegeric:cdc_fulltext_merge
Labels
kind/feature
Possible security concern
Review effort 5/5
size/XXL (denotes a PR that changes 2000+ lines)
User description
What type of PR is this?
Which issue(s) this PR fixes:
issue #21835
What this PR does / why we need it:
Adds the CDC index update feature.
PR Type
Enhancement, Tests
Description
• Implements CDC (Change Data Capture) integration for fulltext, IVFFLAT, and HNSW indexes with asynchronous update capabilities
• Adds comprehensive SQL writers for different index types with CDC operations (Insert, Upsert, Delete)
• Introduces `IndexConsumer` for processing ISCP data and managing index synchronization operations
• Implements HNSW model unification with CDC synchronization functionality and parallel processing support
• Adds async index support in DDL operations with automatic CDC task lifecycle management
• Enhances data type support in ISCP utilities (JSON, arrays, date/time, UUID, etc.)
• Provides extensive test coverage for all new CDC and async index functionality
• Integrates CDC task management into table operations (CREATE, DROP, ALTER, TRUNCATE)
• Adds new `HNSW_CDC_UPDATE` built-in function for processing CDC operations
• Improves error handling and transaction support across index operations
Diagram Walkthrough
File Walkthrough
15 files
index_sqlwriter.go
Add index SQL writers for CDC operations
pkg/iscp/index_sqlwriter.go
• Implements SQL writers for different index types (Fulltext, IVFFLAT, HNSW) with CDC operations
• Provides `IndexSqlWriter` interface with methods for Insert, Upsert, Delete, and SQL generation
• Includes base implementation `BaseIndexSqlWriter` and specialized writers for each index algorithm
• Handles row serialization and SQL generation for index update operations
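As a rough illustration of the writer pattern described above, the sketch below buffers CDC rows and renders them as SQL. Only the interface name `IndexSqlWriter` and the operation names come from this PR; the method signatures, the `demoWriter` type, the `pk` column, and the generated SQL text are assumptions made for illustration and are not taken from `pkg/iscp/index_sqlwriter.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// IndexSqlWriter mirrors the interface named in the PR. The method set
// follows the description; the exact signatures here are assumed.
type IndexSqlWriter interface {
	Insert(pk int64, text string)
	Upsert(pk int64, text string)
	Delete(pk int64)
	ToSql() []string
}

// demoWriter is a hypothetical writer that targets a single index table.
type demoWriter struct {
	table   string
	inserts []string
	deletes []string
}

// quote wraps s in single quotes and doubles any embedded quote.
func quote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func (w *demoWriter) Insert(pk int64, text string) {
	w.inserts = append(w.inserts, fmt.Sprintf("(%d, %s)", pk, quote(text)))
}

// Upsert is modeled as delete-then-insert of the same key.
func (w *demoWriter) Upsert(pk int64, text string) {
	w.Delete(pk)
	w.Insert(pk, text)
}

func (w *demoWriter) Delete(pk int64) {
	w.deletes = append(w.deletes, fmt.Sprintf("%d", pk))
}

// ToSql emits deletes first so an upserted key is removed before re-insert.
func (w *demoWriter) ToSql() []string {
	var out []string
	if len(w.deletes) > 0 {
		out = append(out, fmt.Sprintf("DELETE FROM %s WHERE pk IN (%s)",
			w.table, strings.Join(w.deletes, ", ")))
	}
	if len(w.inserts) > 0 {
		out = append(out, fmt.Sprintf("INSERT INTO %s VALUES %s",
			w.table, strings.Join(w.inserts, ", ")))
	}
	return out
}

func main() {
	var w IndexSqlWriter = &demoWriter{table: "idx_tbl"}
	w.Insert(1, "hello")
	w.Upsert(2, "it's")
	for _, s := range w.ToSql() {
		fmt.Println(s)
	}
}
```

Ordering deletes ahead of inserts is one simple way to make an upsert safe to replay. The real writers may batch or order statements differently.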
sync.go
Add HNSW CDC synchronization implementation
pkg/vectorindex/hnsw/sync.go
• Implements HNSW index CDC synchronization functionality
• Provides `CdcSync` function to update HNSW indexes via CDC data
• Includes `HnswSync` struct for managing index updates with parallel processing
• Handles model loading, updating, and SQL generation for index persistence
model.go
Add HNSW model implementation for index management
pkg/vectorindex/hnsw/model.go
• Implements `HnswModel` struct for HNSW index management
• Provides methods for index operations (Add, Remove, Contains, Search)
• Includes file I/O operations for saving/loading index data
• Handles SQL generation for database persistence and cleanup operations
build_dml_util.go
Add async index support in DML operations
pkg/sql/plan/build_dml_util.go
• Adds async index support by checking `IndexAlgoParams` for async flag
• Skips synchronous index operations when async mode is enabled
• Updates `MultiTableIndex` struct to include `IndexAlgoParams` field
• Applies async checks to fulltext and IVF index operations
create.go
Add async option to index creation syntax
pkg/sql/parsers/tree/create.go
• Adds `Async` boolean field to `IndexOption` struct
• Updates `Format` method to include async option in SQL output
• Enables parsing and formatting of async index creation syntax
index_consumer.go
Add IndexConsumer for CDC index synchronization
pkg/iscp/index_consumer.go
• Implements a new `IndexConsumer` struct that processes ISCP data for index updates
• Handles both snapshot and tail data types with different processing strategies
• Manages SQL generation and execution for index synchronization operations
• Provides methods for insert, delete, and upsert operations on index data
ddl.go
Integrate CDC tasks into DDL operations
pkg/sql/compile/ddl.go
• Integrates CDC task management into DDL operations (CREATE, DROP, ALTER, TRUNCATE)
• Adds calls to create and drop CDC tasks for vector and fulltext indexes
• Implements automatic CDC task lifecycle management during table operations
cdc_util.go
Add CDC task management utilities
pkg/sql/compile/cdc_util.go
• Implements utility functions for CDC task management (create, delete, register)
• Provides PITR (Point-in-Time Recovery) creation and management for indexes
• Handles validation and lifecycle management of index CDC tasks
secondary_index_utils.go
Add async parameter support for indexes
pkg/catalog/secondary_index_utils.go
• Adds support for `async` parameter in index configurations
• Implements `IsIndexAsync` function to check if an index is asynchronous
• Updates parameter parsing to handle async flag for different index types
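To make the async check concrete, here is a minimal sketch of how such a flag could be read from serialized index parameters. The function name echoes `IsIndexAsync` from the PR, but the signature, the flat JSON encoding, and the `"async": "true"` representation are assumptions, not the actual code in `pkg/catalog/secondary_index_utils.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isIndexAsync reports whether the serialized index parameters carry an
// async flag. Assumptions: parameters are a flat JSON object of strings and
// the flag is stored under the key "async" with the value "true".
func isIndexAsync(params string) (bool, error) {
	if params == "" {
		return false, nil // no parameters: treat the index as synchronous
	}
	m := map[string]string{}
	if err := json.Unmarshal([]byte(params), &m); err != nil {
		return false, fmt.Errorf("parse index params: %w", err)
	}
	return m["async"] == "true", nil
}

func main() {
	inputs := []string{
		`{"lists":"16","async":"true"}`,
		`{"lists":"16"}`,
		"",
	}
	for _, p := range inputs {
		async, err := isIndexAsync(p)
		fmt.Println(async, err)
	}
}
```

A caller in the DML planner could use a check like this to skip synchronous index maintenance and leave the update to the CDC task.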
types.go
Add CDC data structures and operations
pkg/vectorindex/types.go
• Defines CDC-related data structures and constants
• Implements `VectorIndexCdc` for managing CDC operations (insert, delete, upsert)
• Adds JSON serialization support for CDC data structures
func_hnsw.go
Add HNSW CDC update function implementation
pkg/sql/plan/function/func_hnsw.go
• Implements `hnswCdcUpdate` function for processing HNSW CDC operations
• Handles JSON deserialization of CDC data and calls synchronization logic
• Provides parameter validation and error handling for CDC updates
ddl_index_algo.go
Integrate CDC tasks into index algorithm handling
pkg/sql/compile/ddl_index_algo.go
• Integrates CDC task creation into fulltext and IVF-flat index handling
• Adds async parameter checking and CDC task registration
• Updates index creation workflow to support asynchronous updates
sqlexec.go
Add transaction-based SQL execution support
pkg/vectorindex/sqlexec/sqlexec.go
• Adds `RunTxn` function for executing SQL operations within transactions
• Provides transaction-based SQL execution with proper context and options
list_builtIn.go
Register HNSW CDC update function
pkg/sql/plan/function/list_builtIn.go
• Registers the new `HNSW_CDC_UPDATE` function in the built-in function list
• Defines function signature and parameter types for CDC update operations
function_id.go
Add HNSW CDC update function ID
pkg/sql/plan/function/function_id.go
• Adds `HNSW_CDC_UPDATE` function ID and registers it in the function registry
• Updates function end number to accommodate new function
3 files
util.go
Enable additional data type support in ISCP utilities
pkg/iscp/util.go
• Uncomments and enables support for additional data types in row extraction and SQL conversion
• Adds support for JSON, bit, array types, date/time types, decimal types, UUID, and other specialized types
• Includes `appendHex` function for binary data formatting
• Enhances NULL value handling with proper type casting
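For readers unfamiliar with hex-encoding binary values into SQL text, a minimal sketch follows. The name `appendHex` comes from the PR, but the signature and the `x'...'` literal form shown here are assumptions; the actual helper in `pkg/iscp/util.go` may format its output differently.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// appendHex appends src to dst as a hexadecimal SQL literal such as x'6869'.
// The literal form is an assumption; the PR only says the helper formats
// binary data.
func appendHex(dst, src []byte) []byte {
	dst = append(dst, "x'"...)
	start := len(dst)
	// Grow dst by the encoded length, then encode in place.
	dst = append(dst, make([]byte, hex.EncodedLen(len(src)))...)
	hex.Encode(dst[start:], src)
	return append(dst, '\'')
}

func main() {
	buf := []byte("INSERT INTO t VALUES (")
	buf = appendHex(buf, []byte("hi"))
	buf = append(buf, ')')
	fmt.Println(string(buf))
}
```

Appending into a caller-supplied buffer avoids an intermediate string allocation per value, which can matter when many rows are serialized into one statement.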
fulltext.go
Enhance fulltext index tokenization support
pkg/sql/plan/fulltext.go
• Enhances fulltext index tokenization to support both table scan and values scan
• Adds support for composite primary keys in fulltext operations
• Improves parameter handling and type validation for fulltext functions
func_cast.go
Enhance array dimension validation in casting
pkg/sql/plan/function/func_cast.go
• Improves array dimension validation in string-to-array casting
• Adds proper dimension checking and error reporting for array types
• Handles maximum dimension bypass for flexible array operations
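The dimension check described above can be sketched as follows. This is an illustrative parser, not the code in `func_cast.go`: the function `parseVector`, the sentinel `maxArrayDimension` and its value, and the error wording are all assumptions. The idea shown is that a cast fails when the parsed length differs from the target dimension, unless the target is declared with the maximum dimension, which acts as a wildcard.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// maxArrayDimension is an assumed sentinel: a target declared with this
// dimension accepts vectors of any length.
const maxArrayDimension = 65535

// parseVector converts a literal such as "[1, 2, 3]" to []float32 and
// validates its length against wantDim.
func parseVector(s string, wantDim int) ([]float32, error) {
	s = strings.TrimSpace(s)
	if len(s) < 2 || s[0] != '[' || s[len(s)-1] != ']' {
		return nil, fmt.Errorf("malformed vector literal %q", s)
	}
	body := strings.TrimSpace(s[1 : len(s)-1])
	var out []float32
	if body != "" {
		for _, f := range strings.Split(body, ",") {
			v, err := strconv.ParseFloat(strings.TrimSpace(f), 32)
			if err != nil {
				return nil, fmt.Errorf("bad element %q: %w", f, err)
			}
			out = append(out, float32(v))
		}
	}
	// Skip the length check when the target uses the wildcard dimension.
	if wantDim != maxArrayDimension && len(out) != wantDim {
		return nil, fmt.Errorf("expected vector dimension %d, got %d", wantDim, len(out))
	}
	return out, nil
}

func main() {
	fmt.Println(parseVector("[1, 2, 3]", 3))
	fmt.Println(parseVector("[1, 2]", 3))
	fmt.Println(parseVector("[1, 2]", maxArrayDimension))
}
```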
15 files
index_consumer_test.go
Add test suite for index consumer functionality
pkg/iscp/index_consumer_test.go
• Adds comprehensive test suite for index consumer functionality
• Includes mock implementations for retriever, SQL executor, and transaction executor
• Tests HNSW snapshot and tail operations with various data scenarios
• Validates SQL generation and execution for index updates
sync_test.go
Add comprehensive tests for HNSW sync operations
pkg/vectorindex/hnsw/sync_test.go
• Provides extensive test coverage for HNSW synchronization operations
• Tests various CDC operations including upsert, delete, insert scenarios
• Includes tests for multi-file operations and shuffled data handling
• Validates sync behavior with empty datasets and large data volumes
index_sqlwriter_test.go
Add comprehensive tests for index SQL writers
pkg/iscp/index_sqlwriter_test.go
• Adds comprehensive test cases for index SQL writers (fulltext, HNSW, IVF-flat)
• Tests SQL generation for different index types and primary key configurations
• Validates handling of composite primary keys and multi-part indexes
search_test.go
Update HNSW search tests for new model
pkg/vectorindex/hnsw/search_test.go
• Updates test cases to work with new model structure
• Adds mock functions for testing multi-file scenarios
• Enhances test coverage for metadata and catalog operations
model_test.go
Add comprehensive HnswModel tests
pkg/vectorindex/hnsw/model_test.go
• Adds comprehensive tests for the new `HnswModel` functionality
• Tests model operations like load, unload, add, remove, and search
• Validates SQL generation and file handling capabilities
func_hnsw_test.go
Add tests for HNSW CDC update function
pkg/sql/plan/function/func_hnsw_test.go
• Adds test cases for the new `hnswCdcUpdate` function
• Tests various error conditions and parameter validation scenarios
• Validates function behavior with null and invalid inputs
mysql_sql_test.go
Update parser tests for async index support
pkg/sql/parsers/dialect/mysql/mysql_sql_test.go
• Updates test cases to include `async` keyword in index creation statements
• Validates parsing of async parameter for different index types (HNSW, IVF-flat, fulltext)
build_test.go
Update HNSW build tests for new model
pkg/vectorindex/hnsw/build_test.go
• Updates test cases to use `HnswModel` instead of `HnswSearchIndex`
• Adjusts function calls and type references for the new model structure
types_test.go
Add tests for CDC data structures
pkg/vectorindex/types_test.go
• Adds tests for CDC data structures and operations
• Validates JSON serialization and CDC operation methods
• Tests insert, delete, upsert operations and state management
vector_ivf_async.result
IVF vector index async functionality test results
test/distributed/cases/vector/vector_ivf_async.result
• Added comprehensive test results for IVF vector index with ASYNC functionality
• Tests include creating tables with vector columns, inserting vector data, and creating async IVF indexes
• Validates vector similarity search using `L2_DISTANCE` function with various query vectors
• Tests both small datasets and large datasets (10k-20k records) with bulk data loading
vector_ivf_async.sql
IVF vector index async functionality test cases
test/distributed/cases/vector/vector_ivf_async.sql
• Added test cases for IVF vector index with ASYNC support
• Tests table creation, index creation with `ASYNC` keyword, and vector similarity queries
• Includes tests for both small manual inserts and large bulk data loads
• Validates that async index building works correctly with concurrent data operations
vector_hnsw_async.result
HNSW vector index async functionality test results
test/distributed/cases/vector/vector_hnsw_async.result
• Added test results for HNSW vector index with ASYNC functionality
• Tests include CRUD operations (insert, update, delete) with async index updates
• Validates vector similarity search performance with large datasets
• Tests concurrent data loading while async index building is in progress
vector_hnsw_async.sql
HNSW vector index async functionality test cases
test/distributed/cases/vector/vector_hnsw_async.sql
• Added comprehensive test cases for HNSW vector index with ASYNC support
• Tests CRUD operations with async index updates and vector similarity queries
• Includes scenarios with concurrent data loading and index building
• Validates proper handling of insert, update, and delete operations with async indexes
fulltext_async.sql
Fulltext index async functionality test cases
test/distributed/cases/fulltext/fulltext_async.sql
• Added test cases for fulltext index with ASYNC functionality
• Tests fulltext search with `MATCH...AGAINST` queries on async indexes
• Includes multilingual content (English and Chinese) for comprehensive testing
• Tests handling of NULL values in fulltext indexed columns
fulltext_async.result
Fulltext index async functionality test results
test/distributed/cases/fulltext/fulltext_async.result
• Added expected test results for fulltext index with ASYNC functionality
• Validates fulltext search results using TF-IDF relevancy algorithm
• Tests search functionality across multiple columns with async index building
• Confirms proper handling of multilingual content and NULL values
2 files
build.go
Refactor HNSW build to use unified model
pkg/vectorindex/hnsw/build.go
• Refactors `HnswBuildIndex` to use `HnswModel` instead of the original struct
• Removes duplicate code by consolidating index functionality into shared model
• Updates function signatures and method calls to use the new model structure
search.go
Refactor HNSW search to use unified model
pkg/vectorindex/hnsw/search.go
• Refactors search functionality to use `HnswModel` instead of `HnswSearchIndex`
• Moves metadata loading logic to shared functions
• Simplifies search implementation by leveraging unified model structure
1 file
util.go
Improve error handling in fulltext SQL generation
pkg/sql/compile/util.go
• Updates `genInsertIndexTableSqlForFullTextIndex` to return error alongside SQL
• Improves error handling in fulltext index SQL generation
2 files
watermark_updater.go
Add safety check for empty table ID list
pkg/iscp/watermark_updater.go
• Adds safety check to prevent SQL execution with empty table ID list
• Improves error handling in database cleanup operations
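The empty-list guard matters because a statement built from an empty ID list would end in `IN ()`, which is not valid SQL. A minimal sketch of the pattern is below; the function `buildCleanupSQL`, the table name `demo_log`, and the column `table_id` are placeholders, not names from `pkg/iscp/watermark_updater.go`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildCleanupSQL returns a DELETE for the given table IDs, or ok=false when
// the list is empty so the caller can skip execution entirely.
// Table and column names are placeholders.
func buildCleanupSQL(tableIDs []uint64) (sql string, ok bool) {
	if len(tableIDs) == 0 {
		return "", false
	}
	ids := make([]string, len(tableIDs))
	for i, id := range tableIDs {
		ids[i] = strconv.FormatUint(id, 10)
	}
	return fmt.Sprintf("DELETE FROM demo_log WHERE table_id IN (%s)",
		strings.Join(ids, ",")), true
}

func main() {
	if sql, ok := buildCleanupSQL([]uint64{7, 42}); ok {
		fmt.Println(sql)
	}
	if _, ok := buildCleanupSQL(nil); !ok {
		fmt.Println("nothing to clean up; skipping SQL execution")
	}
}
```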
iteration.go
Improve error handling and context in iteration
pkg/iscp/iteration.go
• Adds proper error handling for `CollectChanges` function
• Sets system account context for consumer operations
17 files