
doc/design: Configurable cluster replica sizes #33120


Open
SangJunBak wants to merge 1 commit into main from jun/config-cluster-replica-sizes

Conversation

@SangJunBak (Contributor) commented Jul 23, 2025

Rendered: https://github.com/MaterializeInc/materialize/blob/ade1eb3a1501b4b74335e7774c1dc0f6d9450972/doc/developer/20250723_configurable_replica_sizes.md

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@SangJunBak marked this pull request as ready for review July 23, 2025 15:54
@SangJunBak force-pushed the jun/config-cluster-replica-sizes branch from 9c1b057 to d5a4478 on July 23, 2025 15:55
@SangJunBak force-pushed the jun/config-cluster-replica-sizes branch from d5a4478 to ade1eb3 on July 23, 2025 15:58
@SangJunBak (Contributor, Author) commented Jul 24, 2025

Talked more about this in cloud hangout. Main takeaways:

  • A cluster replica size CRD would be nice for validating the cluster replica size schema (right now we'd need to just look for logs in envd); see the sketch after this list.
  • Generations / hashes of cluster replica sizes would be nice and important if we ever wanted to roll out a modification of a cluster replica size to all clusters. This can be achieved manually by the user via a mechanism like blue/green in self-managed, but it would be useful if we ever wanted to do this in Cloud.
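
A minimal sketch of what such a CRD-backed size definition could look like, assuming a hypothetical ClusterReplicaSize kind and made-up field names (workers, scale, resource limits); the real schema would come out of the design doc:

```yaml
# Hypothetical CRD instance; the kind and all field names are illustrative only.
apiVersion: materialize.cloud/v1alpha1
kind: ClusterReplicaSize
metadata:
  name: "100cc-custom"
spec:
  workers: 4          # timely workers per process
  scale: 1            # processes per replica
  cpuLimit: "2"       # Kubernetes CPU limit per process
  memoryLimit: 16Gi   # Kubernetes memory limit per process
```

With an OpenAPI schema attached to the CRD, malformed size definitions would be rejected by the Kubernetes API server at apply time instead of only surfacing in envd logs.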

@antiguru (Member) left a comment

Not a full review, just some thoughts.

It feels to me that we're struggling with how to represent data in Materialize that is somewhere between configuration and system data. Configuration data in my eyes is somewhat static and doesn't change once set. System data is something that can evolve over time, especially without restarting envd. This design moves the cluster replica size information closer to system data away from configuration data.

I feel there might be a design that makes this clearer; I'm not sure if it's the right time to implement it, but hear me out. I think we could have the cluster replica sizes in a table, and then the question of applying configurations becomes a question of who gets to write what to the table. The current way would be that the cluster replica size configuration is blindly applied to the table. A CRD would also be appended to the table. We could imagine SQL commands to write entries to the table. RBAC would allow us to limit who can write to the table.

One wrinkle is that cluster replicas name the replica size, but we don't "depend" on it, in the sense that the cluster replica size is not a nameable object. We assume it exists, and crash if it doesn't. When we introduce a table with cluster replica sizes, we need to make sure that deleting an entry does not get us into a state where envd couldn't start anymore.

Member:

The file should be in the design doc location, one level down.


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
- If a cluster size is modified, any existing clusters with that size shouldn’t be affected. Only newly created cluster replicas with the modified cluster size will.
Member:

What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.

Contributor:

I imagine the statefulset needs to be redeployed for any changes to take effect. If envd diffs the sizes and redeploys, that could be a problem; otherwise it'd still be a manual process to redeploy replicas post size change.

Contributor (Author):

> What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.

We could potentially get the cluster replica size definition from load_remote_system_parameters (code pointer). However, it runs into the potential case where:

  1. We have a default param set for this new dyncfg which includes a size used to build a builtin cluster like mz_catalog_server
  2. User decides to get rid of it in their configmap, causing envd not to boot up

The main approach I thought of was to completely separate the set of replica sizes used to create builtin cluster replicas and keep them as CLI args / config variables, where users aren't expected to edit this set. Then user-defined custom ones can live in the dyncfg and merge into the pre-existing set in the catalog (sketched below). Another thing we could do is keep them all unified but document / log errors stating that the sizes specified to bootstrap builtin cluster replicas need to exist in the configmap.
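
A minimal sketch of that separation, assuming a hypothetical ConfigMap name, dyncfg key (cluster_replica_sizes), and illustrative size fields; builtin sizes would stay in CLI args / config variables and only the user-defined set would be synced from here:

```yaml
# Hypothetical ConfigMap; the key name and size fields are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: materialize-replica-sizes
data:
  cluster_replica_sizes: |
    {
      "custom-small": {"workers": 1, "scale": 1, "memory_limit": "4Gi"},
      "custom-large": {"workers": 8, "scale": 2, "memory_limit": "64Gi"}
    }
```

Builtin sizes such as the one used for mz_catalog_server would not live in this map, so removing an entry here could never prevent envd from booting.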

Contributor (Author):

But assuming we keep all sizes unified in the dyncfg, as Justin said we'd either:

  • No longer watch for statefulset changes for cluster replica sizes and force users to manually update already-created clusters
  • Sync this dyncfg to make changes to the statefulset via the cloud resource controller.


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
Member:

Do we store a copy of the cluster replica size configuration in the catalog so we can restart envd even if the configmap contains errors or is absent?

Contributor (Author):

We don't store a copy of it in the catalog, but my thinking was that if we were to have it stored as a dyncfg, it'd be saved in the catalog as a system variable. Then if the configmap contains errors, we'd use the last-synced values or the default values, similar to what happens when LD fails to sync.


## Out of Scope

- Allow configurable sizes in Cloud
Contributor:

What's the reason for not wanting to consider cloud? For the rollout of swap, being able to specify the replica sizes in dyncfgs would be a huge boon. We could set up LD to perform a slow rollout of swap, based on the risk segment and the Mz version.

@SangJunBak (Contributor, Author) commented Jul 28, 2025

I think mainly because:

  • Only a single self-managed potential customer has asked for it
  • We don't want to support user-configurable cluster sizes for Cloud and the complexity that comes with it

But I didn't even consider the rollout of swap! I think it'd be nice to find a way to slowly roll out sizes in this design doc, but one pretty big piece of complexity is how we'd roll out modifications to existing customer clusters with the sizes we want to modify. One simple idea is to have it configurable through dyncfg but only roll it out on a new release using 0dt.

Contributor:

> Only a single self-managed potential customer has asked for it

I'd view asks from the team that helps our internal management of MZ Cloud at about the same level as a "customer ask" in this case. If the ways customers and MZ would use these are very different then we may need to separate them, but LD vs. file dyncfg are close enough. If customers would really want `create replica size`, that maybe takes on a life of its own.

Contributor:

The latest thinking is that we might introduce new replica sizes, instead of swapping out the spilling mechanism for cc sizes, in which case this might not be relevant for the swap rollout after all. It still seems useful to be able to configure replica sizes through LD, to make one-off adjustments without having to manually edit statefulsets and hope the changes stick around for long enough.


## The Problem

We want users to be able to safely configure cluster replica sizes at runtime for self-managed.
Contributor:

I think we could scope this more broadly to encompass everyone managing Materialize: both MZ and self-managed customers and community users. If those don't overlap, perhaps we just revisit the MZ Cloud use case later.


Comment on lines +52 to +53
- Allow orchestratord to create a configmap in a volume in environmentd. Then glue the path to the dyncfg synchronization.
- Create a custom cluster size CRD. This will allow orchestratord to handle statefulset creation in the future.
Contributor:

Yeah, for dyncfg file stuff for self-managed I was thinking that we'd need to add a sessionVarConfigMap or something like that so that we can create the configmap ahead of time and allow it to be configured per MZ (or shared between MZs).
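
A rough sketch of how that could surface in the Materialize CR, assuming a hypothetical sessionVarConfigMap field name and reusing the hypothetical ConfigMap from above; the actual field name and shape would be decided in the design doc:

```yaml
# Hypothetical Materialize CR excerpt; sessionVarConfigMap is illustrative only.
apiVersion: materialize.cloud/v1alpha1
kind: Materialize
metadata:
  name: my-environment
spec:
  # Name of a pre-created ConfigMap holding dyncfg overrides such as
  # cluster replica sizes; it could be shared between multiple Materialize CRs.
  sessionVarConfigMap: materialize-replica-sizes
```

Creating the ConfigMap ahead of time keeps it under the user's IaC, and referencing it by name is what would make sharing it across environments possible.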

Comment on lines +102 to +103
- What should the interface be in the Materialize CR? We have the following options:
- A boolean that signals if we want to create the configmap
Contributor:

We could have orchestratord create the configmap if one is not passed in, or we could require one to be passed in via the MZ CR.

Allowing it to be passed in is definitely easier to manage from an IaC perspective, and allows us to share the configmap across MZs if that's ever a thing someone wants to do.

Contributor (Author):

Passing in seems most reasonable!


1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
Contributor:

It may be quite easy for users to miss if something won't parse, but that could be fine initially as long as we log it pretty loudly. If we wanted to make the process easier to debug, we could potentially pack dyncfg parsing issues into an internal table?

Contributor (Author):

That's a neat idea!
