feat: Support for fault-tolerant execution #779

Open: 11 commits to merge into base branch main

@dervoeti (Member) commented Aug 4, 2025

Description

Implementation for #576

When providing credentials for S3, Azure or GCS, we currently need a SecretClass. It would be easier if the user could provide a Secret as an alternative, as we discussed recently. To be consistent with the current state, I used SecretClass exclusively for now.

Things I did not do in this PR to ease the review process (we could tackle these in a separate PR):

  • Combine redundant logic from FTE and Catalog modules
  • Use config-utils instead of doing the load_env_from_files stunt

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and deployed operator works
  • Integration tests passed (for non trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

@dervoeti force-pushed the feat/fault-tolerant-execution branch from 59f062b to 4cc640f (August 5, 2025 18:37)
@dervoeti force-pushed the feat/fault-tolerant-execution branch from 4cc640f to 3d113df (August 5, 2025 18:44)
@dervoeti self-assigned this (Aug 5, 2025)
@dervoeti moved this to Development: In Review in Stackable Engineering (Aug 5, 2025)
@dervoeti force-pushed the feat/fault-tolerant-execution branch from fc684e3 to 85a09cc (August 5, 2025 19:40)

@sbernauer (Member) left a comment

I dropped some comments on the CRD code and YAML.
As a possible alternative, I can propose a complex enum style that we can think through together. I'm just pasting it here for reference for now:

spec:
  clusterConfig:
    faultTolerantExecution:
      query:
        retryAttempts: 4 # maps to query-retry-attempts
        retryInitialDelay: 10s
        retryMaxDelay: 60s
        retryDelayScaleFactor: 2.0 # f32
        exchangeManager: # Optional
          deduplicationBufferSize: 64Mi # Quantity
          encryptionEnabled: true
          sinkBufferPoolMinSize: 20
          sinkBuffersPerPartition: 4
          sinkMaxFileSize: 2Gi # Quantity
          sourceConcurrentReaders: 8
          s3:
            baseDirectories:
              - s3://trino-exchange-bucket/spooling
            connection: # Mandatory
              reference: minio-connection
            maxErrorRetries: 10
            uploadPartSize: 10Mi # Quantity
      # OR
      task:
        retryAttemptsPerTask: 4 # maps to task-retry-attempts-per-task
        retryInitialDelay: 10s
        retryMaxDelay: 60s
        retryDelayScaleFactor: 2.0 # f32
        exchangeManager: # Mandatory
          # ... same struct as above
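
For reference, the "complex enum" shape proposed in the YAML above could be modelled roughly as follows. This is only a minimal sketch using the standard library: the type and field names (`FaultTolerantExecution`, `RetryConfig`, `ExchangeManagerConfig`, the `*_secs` fields) are hypothetical and simply mirror the YAML keys, and the serde/JsonSchema derives that a real CRD type would need are left out. It assumes the two variants map to Trino's `retry-policy` values `QUERY` and `TASK`.

```rust
/// Retry timing shared by both variants (hypothetical names mirroring the YAML keys).
#[derive(Clone, Debug, PartialEq)]
pub struct RetryConfig {
    pub initial_delay_secs: u64, // retryInitialDelay
    pub max_delay_secs: u64,     // retryMaxDelay
    pub delay_scale_factor: f32, // retryDelayScaleFactor
}

/// Heavily reduced exchange manager settings (hypothetical, S3 only).
#[derive(Clone, Debug, PartialEq)]
pub struct ExchangeManagerConfig {
    pub base_directories: Vec<String>,
    pub s3_connection_reference: String,
}

/// Exactly one of `query` or `task` can be configured.
#[derive(Clone, Debug, PartialEq)]
pub enum FaultTolerantExecution {
    Query {
        retry_attempts: u32,
        retry: RetryConfig,
        // Optional for query-level retries.
        exchange_manager: Option<ExchangeManagerConfig>,
    },
    Task {
        retry_attempts_per_task: u32,
        retry: RetryConfig,
        // Mandatory for task-level retries, so the type has no Option here.
        exchange_manager: ExchangeManagerConfig,
    },
}

impl FaultTolerantExecution {
    /// Value assumed for Trino's `retry-policy` property.
    pub fn retry_policy(&self) -> &'static str {
        match self {
            Self::Query { .. } => "QUERY",
            Self::Task { .. } => "TASK",
        }
    }

    /// Uniform access to the exchange manager regardless of the variant.
    pub fn exchange_manager(&self) -> Option<&ExchangeManagerConfig> {
        match self {
            Self::Query { exchange_manager, .. } => exchange_manager.as_ref(),
            Self::Task { exchange_manager, .. } => Some(exchange_manager),
        }
    }
}
```

The point of this shape is that "exchange manager is mandatory for task retries" is enforced by the type itself rather than by a runtime validation step.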


/// Data size of the coordinator's in-memory buffer used to store output of query stages.
#[serde(skip_serializing_if = "Option::is_none")]
pub exchange_deduplication_buffer_size: Option<String>,

@sbernauer (Member):

Hmm, ideally we would have some sort of validating struct for byte sizes, like Duration for times.
That being said, WDYT of using Option<Quantity> here? This way we are using k8s quantities and are consistent with e.g. the memory limit config.
Same for all other byte values, such as uploadPartSize or sinkMaxFileSize
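
To illustrate what accepting k8s quantities would imply, here is a rough sketch of translating a quantity string into a Trino data size string. The function `quantity_to_trino_data_size` is hypothetical and is not part of op-rs or trino-operator. It only handles integer values with binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) or plain byte counts, and it assumes Trino interprets `kB`/`MB`/`GB`/`TB` as powers of 1024, so the number can be carried over unchanged.

```rust
/// Hypothetical helper: converts a Kubernetes quantity with a binary suffix
/// (e.g. "64Mi") into a Trino-style data size string (e.g. "64MB").
/// Returns None for anything it does not understand (decimal suffixes,
/// fractional values, garbage input).
pub fn quantity_to_trino_data_size(quantity: &str) -> Option<String> {
    // Assumed mapping from k8s binary suffixes to Trino data size units.
    const SUFFIXES: [(&str, &str); 4] =
        [("Ki", "kB"), ("Mi", "MB"), ("Gi", "GB"), ("Ti", "TB")];

    let quantity = quantity.trim();
    for &(k8s_suffix, trino_unit) in SUFFIXES.iter() {
        if let Some(number) = quantity.strip_suffix(k8s_suffix) {
            // Only whole numbers are accepted in this sketch.
            let value: u64 = number.parse().ok()?;
            return Some(format!("{}{}", value, trino_unit));
        }
    }

    // No suffix: treat the value as a plain number of bytes.
    quantity
        .parse::<u64>()
        .ok()
        .map(|bytes| format!("{}B", bytes))
}
```

For example, `quantity_to_trino_data_size("64Mi")` yields `Some("64MB")`, while a decimal-suffixed value such as `"64M"` is rejected with `None` instead of being silently reinterpreted.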

@dervoeti (Member, Author):

I tried using MemoryQuantity, but we'd need to derive JsonSchema on it in op-rs.
Also, if we do this, we would need to change it in other places as well, at least in trino-operator (e.g. query_max_memory). This would be a breaking change.


#[derive(Clone, Debug, Deserialize, Eq, JsonSchema, PartialEq, Serialize)]
#[serde(rename_all = "camelCase")]
pub struct AzureExchangeConfig {

@sbernauer (Member):

I have to admit I'm not sure how to feel about the Azure and Google integration. On the one hand it's nice, on the other hand IIRC we don't support it in any other CRD yet.
I feel like we should carefully think about adding it. Do we want an AzureConnection similar to S3Connection? Can you specify a flavor on the S3Connection? etc...
Maybe keep the struct in code (clippy allow unused) and leave this for a future issue?

@dervoeti force-pushed the feat/fault-tolerant-execution branch from 36baa33 to 9565064 (August 6, 2025 19:04)