
Log warning if initiateChannel fails #131520


Open
wants to merge 3 commits into main

Conversation

DaveCTurner
Contributor

Also account for the fact that `channelFuture.cause()` might be `null`.
@DaveCTurner DaveCTurner requested a review from mhl-b July 18, 2025 12:00
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations auto-backport Automatically create backport pull requests when merged v9.2.0 v9.1.1 v8.19.1 v9.0.5 v8.18.5 labels Jul 18, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor Author

Note that in most cases we already log a failure here much further up the stack, for instance:

logger.warn(
    () -> format(
        """
            Successfully discovered master-eligible node [%s] at address [%s] but could not connect to it at its \
            publish address of [%s]. Each node in a cluster must be accessible at its publish address by all other \
            nodes in the cluster. See %s for more information.""",
        remoteNode.descriptionWithoutAttributes(),
        transportAddress,
        remoteNode.getAddress(),
        ReferenceDocs.NETWORK_BINDING_AND_PUBLISHING
    ),
    e
);

logger.warn(
    () -> format(
        "received join request from [%s] but could not connect back to the joining node",
        joinRequest.getSourceNode()
    ),
    e
);

// Only warn every 6th failure. We work around this log while stopping integ test clusters in InternalTestCluster#close
// by temporarily raising the log level to ERROR. If the nature of this log changes in the future, that workaround might
// need to be adjusted.
final Level level = currentFailureCount % 6 == 1 ? Level.WARN : Level.DEBUG;
logger.log(level, () -> format("failed to connect to %s (tried [%s] times)", discoveryNode, currentFailureCount), e);

However it does seem unusual enough to deserve its own log message every time too.

Comment on lines 271 to 277
     Channel channel = connectFuture.channel();
     if (channel == null) {
-        ExceptionsHelper.maybeDieOnAnotherThread(connectFuture.cause());
-        throw new IOException(connectFuture.cause());
+        final var cause = connectFuture.cause();
+        logger.warn(Strings.format("failed to initiate channel to [%s]", node), cause);
+        ExceptionsHelper.maybeDieOnAnotherThread(cause);
+        throw new IOException(cause);
     }
Contributor

I believe you need a callback. Checking null on `channel` might not be enough.
connectFuture.addListener(f -> { if (f.isSuccess() == false) { log.error(...); } })
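
For illustration, a minimal sketch of the listener-based approach being suggested, assuming a Netty ChannelFuture (io.netty.channel.ChannelFutureListener) and a log4j Logger; the message text is a placeholder rather than the actual Elasticsearch code:

    connectFuture.addListener((ChannelFutureListener) future -> {
        // runs once the connect attempt has resolved, whichever thread completes it
        if (future.isSuccess() == false) {
            // future.cause() may be null, so pass it through rather than dereferencing it
            logger.warn("failed to initiate channel", future.cause());
        }
    });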

Contributor Author

We add that listener here:

I don't think we want to log every such failure, and definitely not at error. The logging we have today is enough for that.

Contributor Author

Wait, sorry, I see what you mean: if `channel == null` we should still wait for `connectFuture` to complete before logging the error.
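
As a rough sketch of that idea, assuming we are happy to block briefly on the connect future (this only illustrates the comment, not necessarily the change that ends up in the PR):

    Channel channel = connectFuture.channel();
    if (channel == null) {
        // make sure the connect attempt has actually finished before reading cause()
        connectFuture.awaitUninterruptibly();
        final var cause = connectFuture.cause(); // may still be null
        logger.warn(Strings.format("failed to initiate channel to [%s]", node), cause);
        ExceptionsHelper.maybeDieOnAnotherThread(cause);
        throw new IOException(cause);
    }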

Contributor

Ah, I see. `addListener(connectFuture, connectContext);` should be enough. The `if (channel == null)` block is confusing: `channel` is undefined until the future is resolved. We should pass only `connectFuture`, not the channel, to the `Netty4TcpChannel`; once `connectFuture` is resolved, `Netty4TcpChannel` should update its own channel.

Contributor

channel is undefined

I meant there are a few steps that can go wrong - channel initialization and registration - and failures are dispatched on a different thread, either the event loop or the global executor. In all cases the channel will be closed forcibly. But using a channel that failed to initialize means we don't have our handlers attached.
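
To make the handler point concrete, here is a generic Netty sketch (not the Elasticsearch transport code; eventLoopGroup, remoteAddress and LoggingHandler are stand-ins): handlers are only attached inside ChannelInitializer#initChannel, so if initialization fails Netty force-closes the channel and the pipeline never receives those handlers.

    Bootstrap bootstrap = new Bootstrap()
        .group(eventLoopGroup)
        .channel(NioSocketChannel.class)
        .handler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
                // our handlers are attached here; if this throws, Netty closes the
                // channel and none of these handlers end up in the pipeline
                ch.pipeline().addLast(new LoggingHandler());
            }
        });
    ChannelFuture connectFuture = bootstrap.connect(remoteAddress); // may fail later, on another thread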

@DaveCTurner DaveCTurner requested a review from mhl-b July 18, 2025 16:00
Labels
auto-backport Automatically create backport pull requests when merged :Distributed Coordination/Network Http and internode communication implementations >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0
3 participants