fix: add support for list type data in field stats #1369
Conversation
current: stats for list type fields show UNSUPPORTED
change: add support for list type fields

Especially useful in OTel metrics, where `data_point_bucket_counts` and `data_point_explicit_bounds` are left as arrays and are not considered for flattening.
Walkthrough

The field statistics calculation logic has been refactored out of `src/storage/object_storage.rs` into a new dedicated module, `src/storage/field_stats.rs`, with added support for list type fields.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ObjectStorage
    participant FieldStats
    participant DataFusion
    participant Arrow

    Client->>ObjectStorage: Request field stats calculation
    ObjectStorage->>FieldStats: call calculate_field_stats(...)
    FieldStats->>DataFusion: Register Parquet as table
    FieldStats->>FieldStats: collect_all_field_stats (parallel per field)
    loop For each field
        FieldStats->>DataFusion: SQL query for field stats
        DataFusion->>Arrow: Stream batches
        FieldStats->>FieldStats: Process Arrow batches, aggregate stats
    end
    FieldStats->>ObjectStorage: Push stats to internal stream
    ObjectStorage->>Client: Return result
```
Actionable comments posted: 1
🔭 Outside diff range comments (1)
src/storage/field_stats.rs (1)

386-914: Add tests for list type field statistics

The test suite is comprehensive but doesn't include tests for the newly added list type support. Consider adding test cases to verify that list fields are properly handled in field statistics.

Would you like me to generate test cases for list type fields to ensure the new functionality is properly tested?
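A minimal sketch of what such a test could look like, assuming the module exposes the `format_arrow_value` helper referenced in the diff below; the builder usage and the expected `"[1, 2]"` output are illustrative assumptions, not the repository's actual test code.

```rust
use arrow_array::builder::{Int64Builder, ListBuilder};

#[test]
fn formats_int_list_values() {
    // Build a single-row list column: [1, 2]
    let mut builder = ListBuilder::new(Int64Builder::new());
    builder.values().append_value(1);
    builder.values().append_value(2);
    builder.append(true); // close the first list entry

    let list_array = builder.finish();

    // Assumed per-cell formatter from this module; the expected output mirrors
    // the "[a, b]" formatting described for list values.
    assert_eq!(format_arrow_value(&list_array, 0), "[1, 2]");
}
```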
🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)
375-382: Consider handling additional list types for comprehensive support

The current implementation handles the `List` type, but Arrow also has `LargeList` and `FixedSizeList` types that might appear in OpenTelemetry or other data sources. Consider adding support for these types to ensure comprehensive list handling.

Add support for the additional list types after the current `List` case:

```diff
 DataType::Null => "NULL".to_string(),
+DataType::LargeList(_field) => {
+    try_downcast!(arrow_array::LargeListArray, array, |list_array: &arrow_array::LargeListArray| {
+        let child_array = list_array.values();
+        let offsets = list_array.value_offsets();
+        let start = offsets[idx] as usize;
+        let end = offsets[idx + 1] as usize;
+
+        let formatted_values: Vec<String> = (start..end)
+            .map(|i| format_arrow_value(child_array.as_ref(), i))
+            .collect();
+
+        format!("[{}]", formatted_values.join(", "))
+    })
+}
+DataType::FixedSizeList(_field, size) => {
+    try_downcast!(arrow_array::FixedSizeListArray, array, |list_array: &arrow_array::FixedSizeListArray| {
+        let child_array = list_array.values();
+        let start = idx * (*size as usize);
+        let end = start + (*size as usize);
+
+        let formatted_values: Vec<String> = (start..end)
+            .map(|i| format_arrow_value(child_array.as_ref(), i))
+            .collect();
+
+        format!("[{}]", formatted_values.join(", "))
+    })
+}
 _ => {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/storage/field_stats.rs (1 hunks)
- src/storage/mod.rs (1 hunks)
- src/storage/object_storage.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
src/storage/object_storage.rs (5)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
src/storage/field_stats.rs (4)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (2)
src/storage/mod.rs (1)

39-39: LGTM!

The public module declaration correctly exposes the new `field_stats` module as part of the storage crate's public API.

src/storage/object_storage.rs (1)

62-62: Clean refactoring!

The field statistics logic has been successfully extracted to a dedicated module while maintaining the same interface. The synchronous calculation during upload aligns with the established pattern to prevent race conditions.

Also applies to: 885-897
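A rough sketch of the upload-time call site this comment describes. `calculate_field_stats` is the name used in the walkthrough above, but the wrapper, parameter types, and error type shown here are assumptions for illustration only.

```rust
use std::path::Path;
use arrow_schema::Schema;

// Hypothetical wrapper showing where the synchronous stats call sits in the
// upload flow; the staged parquet file must still exist when stats are computed.
async fn upload_and_collect_stats(stream_name: &str, parquet_path: &Path, schema: &Schema) {
    // ... upload the parquet file to object storage first ...

    // Stats are computed before the staged file is deleted; a failure only
    // logs a warning instead of failing the upload (assumed behavior).
    if let Err(err) = calculate_field_stats(stream_name, parquet_path, schema).await {
        tracing::warn!("failed to calculate field stats for {stream_name}: {err}");
    }

    // ... delete the staged file afterwards ...
}
```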
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/storage/field_stats.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
src/storage/field_stats.rs (6)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1288
File: src/handlers/http/modal/mod.rs:279-301
Timestamp: 2025-04-07T13:23:10.092Z
Learning: For critical operations like writing metadata to disk in NodeMetadata::put_on_disk(), it's preferred to let exceptions propagate (using expect/unwrap) rather than trying to recover with fallback mechanisms, as the failure indicates a fundamental system issue that needs immediate attention.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
🔇 Additional comments (10)
src/storage/field_stats.rs (10)
1-53: LGTM! Well-organized imports and reasonable concurrency limit.

The imports are comprehensive and the `MAX_CONCURRENT_FIELD_STATS` constant of 10 provides a sensible balance between performance and resource usage.
54-73: Well-designed data structures for field statistics.

The hierarchical design from DistinctStat → FieldStat → DatasetStats is clear and properly captures the statistics data model.
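A minimal sketch of that hierarchy, with field names assumed from the review text rather than copied from the module:

```rust
// One row of the "top distinct values" output for a field.
struct DistinctStat {
    distinct_value: String,
    count: i64,
}

// Aggregated statistics for a single field.
struct FieldStat {
    field_name: String,
    count: i64,
    distinct_count: i64,
    distinct_stats: Vec<DistinctStat>,
}

// Statistics for the whole dataset: one FieldStat per field.
struct DatasetStats {
    dataset_name: String,
    field_stats: Vec<FieldStat>,
}
```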
126-153: Excellent concurrency handling for field statistics calculation.

The use of `buffer_unordered` with a reasonable concurrency limit and the approach of collecting field names upfront to avoid lifetime issues demonstrate good async Rust practices.
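A condensed sketch of that pattern, assuming a per-field helper `calculate_single_field_stats` that returns `Option<FieldStat>`; the function names and the concurrency constant mirror the review text, the rest is illustrative.

```rust
use datafusion::prelude::SessionContext;
use futures::stream::{self, StreamExt};

const MAX_CONCURRENT_FIELD_STATS: usize = 10;

// Field names are collected into owned Strings up front so the futures don't
// borrow the schema, then processed with bounded concurrency.
async fn collect_all_field_stats(
    ctx: &SessionContext,
    stream_name: &str,
    field_names: Vec<String>,
) -> Vec<FieldStat> {
    stream::iter(field_names)
        .map(|field| calculate_single_field_stats(ctx, stream_name, field))
        .buffer_unordered(MAX_CONCURRENT_FIELD_STATS)
        .filter_map(|maybe_stat| async move { maybe_stat })
        .collect()
        .await
}
```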
155-220: Solid implementation with appropriate error handling.

The streaming approach for handling large result sets is memory-efficient, and the error handling appropriately logs warnings while gracefully degrading. The direct downcasts for expected SQL result columns are correct for this context.
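A sketch of the streaming shape this refers to, using DataFusion's `execute_stream`; the function name and the elided aggregation body are assumptions.

```rust
use datafusion::prelude::SessionContext;
use futures::StreamExt;

// Stream the stats query batch by batch instead of collecting all rows,
// logging and skipping batches that fail rather than aborting the whole field.
async fn stream_stats_query(ctx: &SessionContext, sql: &str) -> Option<()> {
    let df = ctx.sql(sql).await.ok()?;
    let mut batches = df.execute_stream().await.ok()?;

    while let Some(batch) = batches.next().await {
        let Ok(batch) = batch else {
            tracing::warn!("failed to read a record batch for field stats");
            continue;
        };
        // ... downcast the expected columns and aggregate counts here ...
        let _ = batch.num_rows();
    }
    Some(())
}
```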
222-254: Well-structured SQL generation with proper security measures.

The SQL query uses CTEs for clarity and properly escapes field/stream names to prevent injection attacks. The window functions efficiently calculate the required statistics.
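An illustrative (not verbatim) shape of such a query builder: identifiers are double-quoted with embedded quotes escaped, a CTE computes per-value counts, and window functions derive the totals. The limit of 50 and the exact column aliases are assumptions.

```rust
// Builds a stats query for one field; field/stream names cannot break out of
// the quoted identifiers because embedded double quotes are doubled.
fn build_field_stats_query(field_name: &str, stream_name: &str) -> String {
    let field = field_name.replace('"', "\"\"");
    let table = stream_name.replace('"', "\"\"");
    format!(
        r#"WITH value_counts AS (
    SELECT "{field}" AS distinct_value, COUNT(*) AS distinct_count
    FROM "{table}"
    GROUP BY "{field}"
)
SELECT distinct_value,
       distinct_count,
       SUM(distinct_count) OVER () AS total_count,
       COUNT(*) OVER () AS distinct_values
FROM value_counts
ORDER BY distinct_count DESC
LIMIT 50"#
    )
}
```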
256-270: Excellent error handling macro for safe downcasting.

The `try_downcast!` macro provides consistent error handling across the codebase by logging warnings and gracefully degrading to "UNSUPPORTED" when downcasts fail.
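A minimal sketch of a macro with the behavior described here; the actual definition in `field_stats.rs` may differ in details such as the log message.

```rust
use arrow_array::Array;

// Downcast `$array` to `$ty` and run `$body` on it; on failure, log a warning
// and fall back to the "UNSUPPORTED" sentinel instead of panicking.
macro_rules! try_downcast {
    ($ty:ty, $array:expr, $body:expr) => {
        match $array.as_any().downcast_ref::<$ty>() {
            Some(typed) => $body(typed),
            None => {
                tracing::warn!(
                    "failed to downcast array of type {:?} to {}",
                    $array.data_type(),
                    stringify!($ty)
                );
                "UNSUPPORTED".to_string()
            }
        }
    };
}
```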
358-373: List type support successfully addresses the PR objective.

The implementation properly handles list arrays by extracting child values and formatting them recursively. This directly addresses the PR's goal of supporting list type fields that were previously marked as UNSUPPORTED. The use of the `try_downcast!` macro ensures robust error handling.
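A sketch of that handling for the plain `List` case, in the style of the macro above; `format_arrow_value` is the module's per-cell formatter referenced earlier, the wrapper function and its name are illustrative.

```rust
use arrow_array::{Array, ListArray};

// Format row `idx` of a List column as "[v1, v2, ...]" by formatting each
// child element with the same per-cell formatter used for scalar types.
fn format_list_cell(array: &dyn Array, idx: usize) -> String {
    try_downcast!(ListArray, array, |list_array: &ListArray| {
        let child = list_array.value(idx); // the row's child values as an ArrayRef
        let formatted: Vec<String> = (0..child.len())
            .map(|i| format_arrow_value(child.as_ref(), i))
            .collect();
        format!("[{}]", formatted.join(", "))
    })
}
```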
374-383: Comprehensive data type support with proper fallback handling.

The function covers a wide range of Arrow data types and appropriately logs unsupported types while providing a clear fallback value.
957-1071: Excellent test coverage for list type support validates PR objective.

The comprehensive tests for both `int_list` and `float_list` fields confirm that list type data is now properly supported with correct formatting and statistics calculation. This directly validates that the PR objective has been achieved.
385-956: Comprehensive test suite covers diverse scenarios and edge cases.

The tests thoroughly validate functionality across multiple data types, special characters, empty tables, streaming behavior, and large datasets, ensuring robust operation in various conditions.
Summary by CodeRabbit

New Features
- Field statistics now support list type fields, which previously showed as UNSUPPORTED.

Refactor
- Field statistics calculation has been extracted from `src/storage/object_storage.rs` into a dedicated `src/storage/field_stats.rs` module.