# doc/design: Configurable cluster replica sizes #33120
## Conversation
Talked more about this in cloud hangout. Main takeaways:
Not a full review, just some thoughts.
It feels to me that we're struggling with how to represent data in Materialize that sits somewhere between configuration and system data. Configuration data in my eyes is somewhat static and doesn't change once set. System data is something that can evolve over time, especially without restarting envd. This design moves the cluster replica size information away from configuration data and closer to system data.
I feel there might be a design that makes this clearer; I'm not sure if it's the right time to implement it, but hear me out. I think we could have the cluster replica sizes in a table, and then the question of applying configuration becomes a question of who gets to write what to the table. The current way would be that the cluster replica size configuration is blindly applied to the table. A CRD would also be appended to the table. We could imagine SQL commands to write entries to the table. RBAC would allow us to limit who can write to the table.
One wrinkle is that cluster replicas name the replica size, but we don't "depend" on it, in the sense that the cluster replica size is not a nameable object. We assume it exists, and crash if it doesn't. When we introduce a table with cluster replica sizes, we need to make sure that deleting an entry cannot get us into a state where envd can no longer start.
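As a rough illustration of that last point, a guard along these lines could refuse to drop a size that is still referenced. All types and names below are hypothetical simplifications, not the actual catalog API:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical, simplified stand-ins for catalog state.
struct ReplicaSizeSpec {
    workers: usize,
    memory_limit_gib: u64,
}

struct ClusterReplica {
    name: String,
    size: String, // names an entry in the size table
}

/// Refuse to drop a size that any existing replica still names, so a later
/// envd restart can never hit a dangling size reference.
fn try_remove_size(
    sizes: &mut BTreeMap<String, ReplicaSizeSpec>,
    replicas: &[ClusterReplica],
    size_name: &str,
) -> Result<(), String> {
    let in_use: BTreeSet<&str> = replicas.iter().map(|r| r.size.as_str()).collect();
    if in_use.contains(size_name) {
        return Err(format!(
            "cannot remove size {size_name:?}: still referenced by an existing cluster replica"
        ));
    }
    sizes.remove(size_name);
    Ok(())
}
```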
The file should be in the design doc location, one level down.
> 1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
> 1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
> - If a cluster size is modified, any existing clusters with that size shouldn’t be affected. Only newly created cluster replicas with the modified cluster size will.
What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.
I imagine the statefulset needs to be redeployed for any changes to take effect. If envd diffs the sizes and redeploys, that could be a problem; otherwise it'd still be a manual process to redeploy replicas after a size change.
> What would happen if envd restarts? At the moment, we'd use the new cluster replica size definition, but it's not clear to me what should happen in your design.
We could potentially get the cluster replica size definition from `load_remote_system_parameters` (code pointer). However, it runs into the potential case where:
- We have a default param set for this new dyncfg which includes a size used to build a builtin cluster like `mz_catalog_server`
- The user decides to get rid of it in their configmap, causing envd not to boot up

The main approach I thought of was to completely separate the set of replica sizes used to create builtin cluster replicas and keep them as CLI args / config variables, where users aren't expected to edit this set. Then user-defined custom sizes can live in the dyncfg and merge into the pre-existing set in the catalog. Another option is to keep them all unified but document / log errors that the sizes used to bootstrap builtin cluster replicas need to exist in the configmap.
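A minimal sketch of that merge step, under the assumption that builtin sizes come from CLI args / config and user-defined sizes come from the dyncfg (the type and function names are made up for illustration):

```rust
use std::collections::BTreeMap;

// Hypothetical size spec; the real definition would live in the catalog code.
#[derive(Clone)]
struct ReplicaSizeSpec {
    workers: usize,
    memory_limit_gib: u64,
}

/// Builtin sizes (CLI args / config variables) always win; user-defined sizes
/// from the dyncfg are layered on top only where they don't collide. This keeps
/// the sizes used for builtin clusters like mz_catalog_server out of reach of
/// the configmap, so a bad edit can't prevent envd from booting.
fn merge_sizes(
    builtin: &BTreeMap<String, ReplicaSizeSpec>,
    user_defined: &BTreeMap<String, ReplicaSizeSpec>,
) -> BTreeMap<String, ReplicaSizeSpec> {
    let mut merged = builtin.clone();
    for (name, spec) in user_defined {
        // Silently ignoring a collision with a builtin size may surprise users;
        // logging a warning here would likely be friendlier.
        merged.entry(name.clone()).or_insert_with(|| spec.clone());
    }
    merged
}
```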
But assuming we keep all sizes unified in the dyncfg, as Justin said, we'd either:
- No longer watch for statefulset changes for cluster replica sizes and force users to manually update already-created clusters
- Sync this dyncfg to make changes to the statefulset via the cloud resource controller.
> 1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
> 1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
Do we store a copy of the cluster replica size configuration in the catalog so we can restart envd even if the configmap contains errors or is absent?
We don't store a copy of it in the catalog, but my thinking was that if we were to have it stored as a dyncfg, it'd be saved in the catalog as a system variable. Then if the configmap contains errors, we'd use the last-synced values or the default values, similar to what happens when LD fails to sync.
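A rough sketch of that fallback behavior, assuming the configmap payload is JSON and that serde / serde_json are available (the payload shape below is made up, not pinned down by the design):

```rust
use std::collections::BTreeMap;

use serde::Deserialize;

// Made-up payload shape; the design doesn't fix the dyncfg format yet.
#[derive(Clone, Deserialize)]
struct ReplicaSizeSpec {
    workers: usize,
    memory_limit: String,
}

/// Parse the synced configmap contents; on any error, keep serving the last
/// successfully synced sizes, mirroring what happens when LD fails to sync.
fn sync_sizes(
    raw_configmap: &str,
    last_synced: &BTreeMap<String, ReplicaSizeSpec>,
) -> BTreeMap<String, ReplicaSizeSpec> {
    match serde_json::from_str::<BTreeMap<String, ReplicaSizeSpec>>(raw_configmap) {
        Ok(sizes) => sizes,
        Err(err) => {
            // In envd this would be a proper warning in the pod logs.
            eprintln!("failed to parse replica size configmap, keeping last-synced values: {err}");
            last_synced.clone()
        }
    }
}
```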
> ## Out of Scope
>
> - Allow configurable sizes in Cloud
What's the reason for not wanting to consider cloud? For the rollout of swap, being able to specify the replica sizes in dyncfgs would be a huge boon. We could set up LD to perform a slow rollout of swap, based on the risk segment and the Mz version.
I think mainly because:
- Only a single self-managed potential customer has asked for it
- We don't want to support user-configurable cluster sizes for Cloud and the complexity that comes with it

But I didn't even consider the rollout of swap! I think it'd be nice to find a way to do a slow rollout of sizes in this design doc, but one pretty big piece of complexity is how we'd roll out modifications to existing customer clusters whose sizes we want to modify. One simple idea is to have it configurable through dyncfg but only roll it out on a new release using 0dt.
> Only a single self-managed potential customer has asked for it

I'd view asks from the team that help our internal management of MZ cloud at about the same level as a "customer ask" in this case. If the ways customers and MZ would use these are very different, then we may need to separate them, but LD vs. file dyncfg are close enough. If customers would really want `create replica size`, that maybe takes on a life of its own.
The latest thinking is that we might introduce new replica sizes, instead of swapping out the spilling mechanism for `cc` sizes, in which case this might not be relevant for the swap rollout after all. It still seems useful to be able to configure replica sizes through LD, to make one-off adjustments without having to manually edit statefulsets and hope the changes stick around for long enough.
> ## The Problem
>
> We want users to be able to safely configure cluster replica sizes on runtime for self managed.
I think we could scope this more broadly to encompass everyone managing Materialize: both MZ and self-managed customers and community users. If those don't overlap, perhaps we just revisit the MZ cloud use case later.
> - Allow orchestratord to create a configmap in a volume in environmentd. Then glue the path to the dyncfg synchronization.
> - Create a custom cluster size CRD. This will allow orchestratord to handle statefulset creation in the future.
Yeah, for dyncfg file stuff for self-managed I was thinking that we'd need to add a `sessionVarConfigMap` or something like that so that we can create the configmap ahead of time and allow it to be configured per MZ (or shared between MZs).
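For the "glue the path to the dyncfg synchronization" part, a very rough sketch of what the envd side could look like: a simple polling loop over the mounted file. The mount path and reload hook are assumptions, and a real implementation would plug into the existing dyncfg machinery rather than poll:

```rust
use std::fs;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Poll the mounted configmap file and hand any changed contents to a reload
/// hook. Mounted configmaps are eventually updated in place by the kubelet
/// (unless mounted via subPath), so re-reading the file picks up edits
/// without restarting the pod.
fn watch_replica_size_file(path: &Path, mut reload: impl FnMut(&str)) {
    let mut last_contents = String::new();
    loop {
        match fs::read_to_string(path) {
            Ok(contents) if contents != last_contents => {
                reload(&contents);
                last_contents = contents;
            }
            Ok(_) => {}
            Err(err) => eprintln!("could not read {}: {err}", path.display()),
        }
        thread::sleep(Duration::from_secs(30));
    }
}
```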
> - What should the interface be in the Materialize CR? We have the following options:
>   - A boolean that signals if we want to create the configmap
We could have orchestratord create the configmap if one is not passed in, or we could require one to be passed in via the MZ CR.
Allowing it to be passed in is definitely easier to manage from an IaC perspective, and allows us to share the configmap across MZs if that's ever a thing someone wants to do.
Passing in seems most reasonable!
> 1. To verify the cluster replica sizes in the database itself, one can run `SHOW custom_user_cluster_replica_sizes`
> 1. If the configmap fails to sync, we’ll print out a warning in the logs of the environmentd pod on which field is causing the issue.
It may be quite easy for users to miss if something won't parse... but could be fine initially as long as we log it pretty loudly. If we wanted to make the process easier to debug, we could potentially pack dyncfg parsing issues into an internal table?
That's a neat idea!
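A small sketch of what that could look like: validate each configmap entry independently and collect the failures in a shape that could back an internal diagnostics table. All names here are hypothetical; no such table exists in this design yet:

```rust
/// One row of hypothetical parse diagnostics for the replica size dyncfg.
struct DyncfgParseIssue {
    key: String,
    error: String,
}

/// Validate each (key, value) entry independently so one bad field doesn't
/// hide problems with the rest, and collect every failure for later display.
fn validate_entries(
    entries: &[(String, String)],
    parse: impl Fn(&str) -> Result<(), String>,
) -> Vec<DyncfgParseIssue> {
    entries
        .iter()
        .filter_map(|(key, value)| {
            parse(value.as_str()).err().map(|error| DyncfgParseIssue {
                key: key.clone(),
                error,
            })
        })
        .collect()
}
```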
Rendered: https://github.com/MaterializeInc/materialize/blob/ade1eb3a1501b4b74335e7774c1dc0f6d9450972/doc/developer/20250723_configurable_replica_sizes.md