-
Notifications
You must be signed in to change notification settings - Fork 6
SYSTEM PRESHUTDOWN command for graceful shutdown swarm node #852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: antalya-25.3
Are you sure you want to change the base?
Conversation
{ | ||
if (getContext()->isPreShutdownCalled()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that operations prohibited in PreShutdown phase (e.g getting new tasks), are allowed in Shutdown phase?
If yes, is it correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand ClickHouse does not have specific shutdown phase. On SYSTEM SHUTDOWN
just calls kill(0, SIGTERM)
. Without PRESHUTDOW
this caused error on initiator as well as already taken but unfinished tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then what is the purpose of shutdown_called flag?
From the first glance I would expect that all checks that are true for preshutdown_called should be true for shutdown_called as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, understood, flag set in destructor.
Yes, make sense to set preshutdown there too.
a842ce6
to
9642c42
Compare
As I understand, regular queries would still be processed. Literally the only thing that's stopped is the "swarm node" work. Wouldn't it make more sense to rename the command to something more meaningful? (e.g UNREGISTER FROM SWARM) This is just a question, not a change request |
It's a topic to discussion. |
Need to make something "system stop swarm/system start swarm" |
SYSTEM STOP SWARM MODE |
Co-authored-by: Davit Mnatobishvili <[email protected]>
Co-authored-by: Davit Mnatobishvili <[email protected]>
SYSTEM STOP SWARM MODE |
void Context::shutdown() TSA_NO_THREAD_SAFETY_ANALYSIS | ||
{ | ||
shared->shutdown(); | ||
} | ||
|
||
void Context::stopSwarmMode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The individual operations of this function are atomic, but the function itself is not. Can't this cause problems?
For instance, consider two STOP/START queries running in different threads:
thread1 -> stop swarm mode
metrics updated, it is now false
context switch
thread2 -> start swarm mode
metrics updated, it is now true
swarm mode enabled
thread 1 -> resume
swarm mode disabled
The resulting state is: swarm mode boolean = true, metrics = false.
@@ -693,6 +693,20 @@ BlockIO InterpreterSystemQuery::execute() | |||
case Type::START_MOVES: | |||
startStopAction(ActionLocks::PartsMove, true); | |||
break; | |||
case Type::STOP_SWARM_MODE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should also be a single atomic operation?
read_context.packet.type = read_context.executor.getConnections().receivePacketTypeUnlocked(async_callback); | ||
read_context.has_read_packet_part = PacketPart::Type; | ||
suspend_callback(); | ||
if (e.code() == ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a comment explaining what ErrorCodes::ATTEMPT_TO_READ_AFTER_EOF
usually means and why you are catching it?
The same for the below if statement
@@ -4500,6 +4502,21 @@ std::shared_ptr<Cluster> Context::tryGetCluster(const std::string & cluster_name | |||
return res; | |||
} | |||
|
|||
void Context::unregisterInDynamicClusters() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my ignorance.
What does "dynamic" cluster actually mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structure looks ok, we just need to sort out the atomic thing and I need to understand the dynamic cluster thing
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
SYSTEM PRESHUTDOWN command for graceful shutdown swarm node
Documentation entry for user-facing changes
Solved #759
New command
SYSTEM PRESHUTDOWN
.Scenario:
We want to scale down swarm cluster. On node which we want to shutdown we call
SYSTEM PRESHUTDOWN
, after that node stops to accept new distributed commands. It can still processed objects which is started processed beforeSYSTEM PRESHUTDOWN
. When all that objects successfully processed, we can kill that node without any errors or lost data in responses on initiator.After
SYSTEM PRESHUTDOWN
on swarm node:On initiator node:
skip_unavailable_shards=true
unexpected closing of socket is legal if no data packets were accepted before. This allow to shutdown non-autodiscovery node too.Exclude tests: