
fix: add support for list type data in field stats #1369


Merged
merged 3 commits into parseablehq:main on Jul 8, 2025

Conversation

@nikhilsinhaparseable (Contributor) commented Jul 8, 2025

current: stats for list type fields show UNSUPPORTED
change: add support for list type fields

especially useful in OTel metrics, where `data_point_bucket_counts` and `data_point_explicit_bounds` are left as arrays and are not considered for flattening

Summary by CodeRabbit

  • New Features

    • Added the ability to compute and view detailed field-level statistics for datasets, including distinct value counts and top values per field.
  • Refactor

    • Improved performance and reliability of field statistics calculation by restructuring and optimizing the underlying implementation.

coderabbitai bot (Contributor) commented Jul 8, 2025

Walkthrough

The field statistics calculation logic has been refactored from object_storage.rs into a new module field_stats.rs. The new module is made public in mod.rs, and all related code and tests are removed from object_storage.rs, which now imports the relocated function.

Changes

  • src/storage/field_stats.rs: New module implementing field-level statistics calculation, serialization, error handling, and tests.
  • src/storage/mod.rs: Declares pub mod field_stats; to expose the new module.
  • src/storage/object_storage.rs: Removes all field statistics code and tests; now imports calculate_field_stats from field_stats.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ObjectStorage
    participant FieldStats
    participant DataFusion
    participant Arrow

    Client->>ObjectStorage: Request field stats calculation
    ObjectStorage->>FieldStats: call calculate_field_stats(...)
    FieldStats->>DataFusion: Register Parquet as table
    FieldStats->>FieldStats: collect_all_field_stats (parallel per field)
    loop For each field
        FieldStats->>DataFusion: SQL query for field stats
        DataFusion->>Arrow: Stream batches
        FieldStats->>FieldStats: Process Arrow batches, aggregate stats
    end
    FieldStats->>ObjectStorage: Push stats to internal stream
    ObjectStorage->>Client: Return result
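For orientation, the flow above could be sketched roughly as follows. This is an illustrative snippet only; the function, table, and field names are assumptions, not the repo's code.

    use datafusion::prelude::{ParquetReadOptions, SessionContext};
    use futures::StreamExt;

    // Rough sketch of the diagrammed flow: register one staged parquet file as a
    // table, run a per-field stats query, and stream the resulting record batches.
    async fn field_stats_for_parquet(
        parquet_path: &str,
        table_name: &str,
        field_name: &str,
    ) -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        ctx.register_parquet(table_name, parquet_path, ParquetReadOptions::default())
            .await?;

        // Hypothetical per-field query; the real module builds a richer CTE-based query.
        let sql = format!(
            "SELECT \"{field_name}\", COUNT(*) AS count FROM \"{table_name}\" \
             GROUP BY \"{field_name}\" ORDER BY count DESC LIMIT 5"
        );
        let mut stream = ctx.sql(&sql).await?.execute_stream().await?;

        while let Some(batch) = stream.next().await {
            let batch = batch?;
            // Each RecordBatch holds Arrow arrays; columns are downcast to aggregate stats.
            println!("rows in batch: {}", batch.num_rows());
        }
        Ok(())
    }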

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • parmesant

Poem

In fields of data, stats now bloom,
Moved to a module, with plenty of room.
Parquet and Arrow, they join the dance,
Distinct counts tallied at every chance.
Refactored with care, the code hops ahead—
A rabbit’s delight in tidy homestead! 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65702b7 and f27b3e0.

📒 Files selected for processing (1)
  • src/storage/field_stats.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
src/storage/field_stats.rs (12)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1288
File: src/handlers/http/modal/mod.rs:279-301
Timestamp: 2025-04-07T13:23:10.092Z
Learning: For critical operations like writing metadata to disk in NodeMetadata::put_on_disk(), it's preferred to let exceptions propagate (using expect/unwrap) rather than trying to recover with fallback mechanisms, as the failure indicates a fundamental system issue that needs immediate attention.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1346
File: src/parseable/streams.rs:351-355
Timestamp: 2025-06-16T05:20:18.593Z
Learning: In the Parseable codebase, arrow files are expected to always have valid creation or modified timestamps as a basic system assumption. The conversion flow uses expect() on file metadata operations to enforce this invariant with fail-fast behavior rather than graceful error handling, as violations represent fundamental system issues that should cause immediate failures.
Learnt from: de-sh
PR: parseablehq/parseable#1239
File: src/parseable/streams.rs:70-80
Timestamp: 2025-03-24T06:08:47.309Z
Learning: In the Parseable codebase, arrow filenames follow a structure ending with ".data.arrows". When converting to parquet, the "front" part (after the first dot) already includes ".data" at the end, so there's no need to add it explicitly in the arrow_path_to_parquet function.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1346
File: src/storage/object_storage.rs:838-840
Timestamp: 2025-06-16T02:09:49.200Z
Learning: In Parseable server, filenames for arrow files are generated server-side during ingestion using a strict format pattern: `{stream_hash}.date={date}.hour={hour}.minute={minute}.{custom_partitions}{hostname}.data.{extension}`. This guarantees that when splitting by dots, the second element will always be `date={value}`, making direct indexing safe when extracting the date part by splitting on '=' and accessing index 1.
Learnt from: de-sh
PR: parseablehq/parseable#0
File: :0-0
Timestamp: 2025-03-20T15:50:45.435Z
Learning: Pay close attention to code comments for typos and semantic clarity during reviews for the Parseable project.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1346
File: src/parseable/streams.rs:680-686
Timestamp: 2025-06-16T05:30:13.379Z
Learning: The parquet conversion system uses a minute-based atomic design where each conversion task operates on different sets of arrow files moved to timestamped "inprocess_" directories, ensuring no conflicts between tasks and no retry mechanisms that could cause file corruption.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/modal/mod.rs:218-221
Timestamp: 2025-02-12T17:39:43.392Z
Learning: In the Parseable codebase, staging directory access failures should trigger a panic rather than being handled as recoverable errors, as it indicates a critical system issue.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (7)
src/storage/field_stats.rs (7)

19-53: Well-organized imports and appropriate concurrency limit.

The import organization is clean and covers all necessary dependencies. The MAX_CONCURRENT_FIELD_STATS constant provides reasonable concurrency control for field statistics calculations.


54-72: Clean data structure design for hierarchical statistics.

The data structures follow a logical hierarchy from individual distinct values to field-level and dataset-level statistics. The implementation with Serialize and Debug traits is appropriate for the use case.


77-126: Robust main function with improved error handling.

The function properly addresses the previous feedback by using ok_or_else() for path validation instead of expect(). The logic correctly handles stream creation, statistics calculation, and result publishing to the internal dataset stream.


131-222: Efficient concurrent processing with robust error handling.

The use of buffer_unordered for concurrent field statistics calculation is well-designed. The error handling properly degrades gracefully by logging warnings and continuing with other fields when individual queries fail.
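A minimal sketch of the buffer_unordered pattern described above, reusing the MAX_CONCURRENT_FIELD_STATS limit mentioned earlier; the struct fields and the query helper are placeholders, not the repo's definitions.

    use futures::{stream, StreamExt};

    const MAX_CONCURRENT_FIELD_STATS: usize = 10;

    struct FieldStat {
        field_name: String,
        count: i64,
        distinct_count: i64,
    }

    // Stand-in for the real per-field SQL query against DataFusion.
    async fn query_single_field(field: &str) -> Result<FieldStat, String> {
        Ok(FieldStat { field_name: field.to_string(), count: 0, distinct_count: 0 })
    }

    // Run per-field queries with bounded parallelism; fields whose queries fail
    // are skipped rather than failing the whole calculation.
    async fn collect_all_field_stats(field_names: Vec<String>) -> Vec<FieldStat> {
        stream::iter(field_names)
            .map(|field| async move { query_single_field(&field).await })
            .buffer_unordered(MAX_CONCURRENT_FIELD_STATS)
            .filter_map(|result| async move { result.ok() })
            .collect()
            .await
    }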


224-256: Sophisticated SQL generation with proper escaping.

The SQL query design using CTEs is excellent for readability and performance. The field name escaping properly handles special characters, and the use of window functions efficiently calculates both total and distinct counts in a single query.
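One way such a query could be shaped is sketched below: group by the field, use window functions for the totals, and escape embedded double quotes by doubling them. This is a hedged illustration; the module's actual CTE may differ.

    // Illustrative only: build a stats query for one field, escaping embedded quotes.
    fn build_stats_sql(stream_name: &str, field_name: &str) -> String {
        let stream = stream_name.replace('"', "\"\"");
        let field = field_name.replace('"', "\"\"");
        format!(
            r#"WITH grouped AS (
                SELECT "{field}" AS distinct_value, COUNT(*) AS value_count
                FROM "{stream}"
                GROUP BY "{field}"
            )
            SELECT distinct_value,
                   value_count,
                   SUM(value_count) OVER () AS total_count,
                   COUNT(*) OVER () AS distinct_count
            FROM grouped
            ORDER BY value_count DESC
            LIMIT 5"#
        )
    }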


258-385: Comprehensive Arrow data type support fulfilling PR objectives.

The format_arrow_value function excellently addresses the PR objective by adding proper support for List type fields (lines 360-375). The try_downcast! macro ensures consistent error handling across all data types, and the comprehensive coverage of Arrow data types makes this robust for various field types in OpenTelemetry metrics.
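As a rough illustration of the list handling described (helper names assumed; null handling omitted), the per-row formatting can delegate each element back to a scalar formatter such as the module's format_arrow_value:

    use arrow_array::{Array, ListArray};

    // Format the list at row `idx` as "[v1, v2, ...]".
    fn format_list_value(
        array: &ListArray,
        idx: usize,
        format_scalar: impl Fn(&dyn Array, usize) -> String,
    ) -> String {
        let row = array.value(idx); // ArrayRef containing only this row's elements
        let formatted: Vec<String> = (0..row.len())
            .map(|i| format_scalar(row.as_ref(), i))
            .collect();
        format!("[{}]", formatted.join(", "))
    }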


387-1074: Exceptional test coverage validating all functionality.

The test suite is remarkably comprehensive, covering various data types, edge cases, error scenarios, and specifically validating the new list type support. The tests for special characters in field names, empty tables, and large datasets demonstrate thorough consideration of real-world scenarios.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🔭 Outside diff range comments (1)
src/storage/field_stats.rs (1)

386-914: Add tests for list type field statistics

The test suite is comprehensive but doesn't include tests for the newly added list type support. Consider adding test cases to verify that list fields are properly handled in field statistics.

Would you like me to generate test cases for list type fields to ensure the new functionality is properly tested?

🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)

375-382: Consider handling additional list types for comprehensive support

The current implementation handles List type, but Arrow also has LargeList and FixedSizeList types that might appear in OpenTelemetry or other data sources. Consider adding support for these types to ensure comprehensive list handling.

Add support for additional list types after the current List case:

 DataType::Null => "NULL".to_string(),
+DataType::LargeList(_field) => {
+    try_downcast!(arrow_array::LargeListArray, array, |list_array: &arrow_array::LargeListArray| {
+        let child_array = list_array.values();
+        let offsets = list_array.value_offsets();
+        let start = offsets[idx] as usize;
+        let end = offsets[idx + 1] as usize;
+
+        let formatted_values: Vec<String> = (start..end)
+            .map(|i| format_arrow_value(child_array.as_ref(), i))
+            .collect();
+
+        format!("[{}]", formatted_values.join(", "))
+    })
+}
+DataType::FixedSizeList(_field, size) => {
+    try_downcast!(arrow_array::FixedSizeListArray, array, |list_array: &arrow_array::FixedSizeListArray| {
+        let child_array = list_array.values();
+        let start = idx * (*size as usize);
+        let end = start + (*size as usize);
+
+        let formatted_values: Vec<String> = (start..end)
+            .map(|i| format_arrow_value(child_array.as_ref(), i))
+            .collect();
+
+        format!("[{}]", formatted_values.join(", "))
+    })
+}
 _ => {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0629a3f and 93c82e9.

📒 Files selected for processing (3)
  • src/storage/field_stats.rs (1 hunks)
  • src/storage/mod.rs (1 hunks)
  • src/storage/object_storage.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
src/storage/object_storage.rs (5)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
src/storage/field_stats.rs (4)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: coverage
🔇 Additional comments (2)
src/storage/mod.rs (1)

39-39: LGTM!

The public module declaration correctly exposes the new field_stats module as part of the storage crate's public API.

src/storage/object_storage.rs (1)

62-62: Clean refactoring!

The field statistics logic has been successfully extracted to a dedicated module while maintaining the same interface. The synchronous calculation during upload aligns with the established pattern to prevent race conditions.

Also applies to: 885-897

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93c82e9 and 65702b7.

📒 Files selected for processing (1)
  • src/storage/field_stats.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
src/storage/field_stats.rs (6)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:832-843
Timestamp: 2025-06-18T06:45:37.070Z
Learning: Stats calculation for parquet files in Parseable is done synchronously during the upload process because files are deleted from staging after upload. This prevents race conditions and ensures stats are calculated while files are still available locally.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1288
File: src/handlers/http/modal/mod.rs:279-301
Timestamp: 2025-04-07T13:23:10.092Z
Learning: For critical operations like writing metadata to disk in NodeMetadata::put_on_disk(), it's preferred to let exceptions propagate (using expect/unwrap) rather than trying to recover with fallback mechanisms, as the failure indicates a fundamental system issue that needs immediate attention.
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: coverage
🔇 Additional comments (10)
src/storage/field_stats.rs (10)

1-53: LGTM! Well-organized imports and reasonable concurrency limit.

The imports are comprehensive and the MAX_CONCURRENT_FIELD_STATS constant of 10 provides a sensible balance between performance and resource usage.


54-73: Well-designed data structures for field statistics.

The hierarchical design from DistinctStat → FieldStat → DatasetStats is clear and properly captures the statistics data model.
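For orientation, the hierarchy could look roughly like this; the field names are assumptions, not the repo's exact definitions.

    use serde::Serialize;

    #[derive(Serialize, Debug)]
    struct DistinctStat {
        distinct_value: String,
        count: i64,
    }

    #[derive(Serialize, Debug)]
    struct FieldStat {
        field_name: String,
        count: i64,
        distinct_count: i64,
        distinct_stats: Vec<DistinctStat>,
    }

    #[derive(Serialize, Debug)]
    struct DatasetStats {
        dataset_name: String,
        field_stats: Vec<FieldStat>,
    }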


126-153: Excellent concurrency handling for field statistics calculation.

The use of buffer_unordered with a reasonable concurrency limit and the approach of collecting field names upfront to avoid lifetime issues demonstrates good async Rust practices.


155-220: Solid implementation with appropriate error handling.

The streaming approach for handling large result sets is memory-efficient, and the error handling appropriately logs warnings while gracefully degrading. The direct downcasts for expected SQL result columns are correct for this context.
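A sketch of that downcast pattern for a result batch, assuming a (value, count) column layout; as noted in the learnings above, DataFusion's parquet reader defaults to view types, so a string column arrives as StringViewArray.

    use arrow_array::{Array, Int64Array, RecordBatch, StringViewArray};

    // Read one row of a hypothetical (value, count) result batch.
    fn read_stats_row(batch: &RecordBatch, row: usize) -> Option<(String, i64)> {
        let values = batch.column(0).as_any().downcast_ref::<StringViewArray>()?;
        let counts = batch.column(1).as_any().downcast_ref::<Int64Array>()?;
        Some((values.value(row).to_string(), counts.value(row)))
    }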


222-254: Well-structured SQL generation with proper security measures.

The SQL query uses CTEs for clarity and properly escapes field/stream names to prevent injection attacks. The window functions efficiently calculate the required statistics.


256-270: Excellent error handling macro for safe downcasting.

The try_downcast! macro provides consistent error handling across the codebase by logging warnings and gracefully degrading to "UNSUPPORTED" when downcasts fail.
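The macro plausibly has a shape like the following; this is a guess consistent with the suggested diff earlier in the review, not the actual definition.

    // Hedged sketch: attempt the downcast, run the closure on success,
    // otherwise log a warning and fall back to the "UNSUPPORTED" marker.
    macro_rules! try_downcast {
        ($ty:ty, $array:expr, $body:expr) => {
            match $array.as_any().downcast_ref::<$ty>() {
                Some(typed) => $body(typed),
                None => {
                    tracing::warn!("failed to downcast array to {}", stringify!($ty));
                    "UNSUPPORTED".to_string()
                }
            }
        };
    }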


358-373: List type support successfully addresses the PR objective.

The implementation properly handles list arrays by extracting child values and formatting them recursively. This directly addresses the PR's goal of supporting list type fields that were previously marked as UNSUPPORTED. The use of try_downcast! macro ensures robust error handling.


374-383: Comprehensive data type support with proper fallback handling.

The function covers a wide range of Arrow data types and appropriately logs unsupported types while providing a clear fallback value.


957-1071: Excellent test coverage for list type support validates PR objective.

The comprehensive tests for both int_list and float_list fields confirm that list type data is now properly supported with correct formatting and statistics calculation. This directly validates that the PR objective has been achieved.
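A test along these lines could be sketched as follows; illustrative only, since the repo's actual tests exercise the full stats pipeline rather than the formatter in isolation.

    use arrow_array::{types::Int64Type, Array, ListArray};

    #[test]
    fn list_rows_can_be_built_and_formatted() {
        // Two list rows: [1, 2, 3] and [4, 5].
        let list = ListArray::from_iter_primitive::<Int64Type, _, _>(vec![
            Some(vec![Some(1), Some(2), Some(3)]),
            Some(vec![Some(4), Some(5)]),
        ]);
        assert_eq!(list.len(), 2);
        // With the module's formatter in scope, one would assert e.g.
        // assert_eq!(format_arrow_value(&list, 0), "[1, 2, 3]");
    }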


385-956: Comprehensive test suite covers diverse scenarios and edge cases.

The tests thoroughly validate functionality across multiple data types, special characters, empty tables, streaming behavior, and large datasets, ensuring robust operation in various conditions.

@nitisht nitisht merged commit d4a22e9 into parseablehq:main Jul 8, 2025
13 of 14 checks passed
@nikhilsinhaparseable nikhilsinhaparseable deleted the stats-fix branch July 8, 2025 13:57