
Log warning if initiateChannel fails #131520


Open
wants to merge 3 commits into main

Conversation

DaveCTurner
Contributor

Also account for the fact that `channelFuture.cause()` might be `null`.
@DaveCTurner DaveCTurner requested a review from mhl-b July 18, 2025 12:00
@DaveCTurner DaveCTurner added >non-issue :Distributed Coordination/Network Http and internode communication implementations auto-backport Automatically create backport pull requests when merged v9.2.0 v9.1.1 v8.19.1 v9.0.5 v8.18.5 labels Jul 18, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor Author

Note that in most cases we already log a failure here much further up the stack, for instance:

logger.warn(
    () -> format(
        """
            Successfully discovered master-eligible node [%s] at address [%s] but could not connect to it at its \
            publish address of [%s]. Each node in a cluster must be accessible at its publish address by all other \
            nodes in the cluster. See %s for more information.""",
        remoteNode.descriptionWithoutAttributes(),
        transportAddress,
        remoteNode.getAddress(),
        ReferenceDocs.NETWORK_BINDING_AND_PUBLISHING
    ),
    e
);

logger.warn(
    () -> format(
        "received join request from [%s] but could not connect back to the joining node",
        joinRequest.getSourceNode()
    ),
    e
);

// Only warn every 6th failure. We work around this log while stopping integ test clusters in InternalTestCluster#close
// by temporarily raising the log level to ERROR. If the nature of this log changes in the future, that workaround might
// need to be adjusted.
final Level level = currentFailureCount % 6 == 1 ? Level.WARN : Level.DEBUG;
logger.log(level, () -> format("failed to connect to %s (tried [%s] times)", discoveryNode, currentFailureCount), e);

However it does seem unusual enough to deserve its own log message every time too.

Comment on lines 271 to 277
     Channel channel = connectFuture.channel();
     if (channel == null) {
-        ExceptionsHelper.maybeDieOnAnotherThread(connectFuture.cause());
-        throw new IOException(connectFuture.cause());
+        final var cause = connectFuture.cause();
+        logger.warn(Strings.format("failed to initiate channel to [%s]", node), cause);
+        ExceptionsHelper.maybeDieOnAnotherThread(cause);
+        throw new IOException(cause);
     }
Contributor

I believe you need a callback. Checking null on `channel` might not be enough.
connectFuture.addListener(f -> { if (f.isSuccess() == false) { log.error(...); } })
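
For illustration, a minimal sketch of the listener-based approach being suggested, assuming a Netty ChannelFuture (io.netty.channel.ChannelFutureListener) and a log4j Logger; the message text is a placeholder rather than the actual Elasticsearch code:

    connectFuture.addListener((ChannelFutureListener) future -> {
        // runs once the connect attempt has resolved, whichever thread completes it
        if (future.isSuccess() == false) {
            // future.cause() may be null, so pass it through rather than dereferencing it
            logger.warn("failed to initiate channel", future.cause());
        }
    });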

Contributor Author

We add that listener here:

I don't think we want to log every such failure, and definitely not at error. The logging we have today is enough for that.

Contributor Author

Wait, sorry, I see what you mean: if `channel == null` we should still wait for `connectFuture` to complete before logging the error.
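
As a rough sketch of that idea, assuming we are happy to block briefly on the connect future (this only illustrates the comment, not necessarily the change that ends up in the PR):

    Channel channel = connectFuture.channel();
    if (channel == null) {
        // make sure the connect attempt has actually finished before reading cause()
        connectFuture.awaitUninterruptibly();
        final var cause = connectFuture.cause(); // may still be null
        logger.warn(Strings.format("failed to initiate channel to [%s]", node), cause);
        ExceptionsHelper.maybeDieOnAnotherThread(cause);
        throw new IOException(cause);
    }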

Contributor

Ah, I see. `addListener(connectFuture, connectContext);` should be enough. The `if (channel == null)` block is confusing: `channel` is undefined until the future is resolved. We should pass only `connectFuture`, not the channel, to the `Netty4TcpChannel`; once `connectFuture` is resolved, `Netty4TcpChannel` should update its own channel.

Contributor

channel is undefined

I meant there are a few steps that can go wrong - channel initialization and registration - and failures are dispatched on a different thread, either the event loop or the global executor. In all cases the channel will be closed forcibly. But using a channel that failed to initialize means we don't have our handlers attached.
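
To make the handler point concrete, here is a generic Netty sketch (not the Elasticsearch transport code; eventLoopGroup, remoteAddress and LoggingHandler are stand-ins): handlers are only attached inside ChannelInitializer#initChannel, so if initialization fails Netty force-closes the channel and the pipeline never receives those handlers.

    Bootstrap bootstrap = new Bootstrap()
        .group(eventLoopGroup)
        .channel(NioSocketChannel.class)
        .handler(new ChannelInitializer<SocketChannel>() {
            @Override
            protected void initChannel(SocketChannel ch) {
                // our handlers are attached here; if this throws, Netty closes the
                // channel and none of these handlers end up in the pipeline
                ch.pipeline().addLast(new LoggingHandler());
            }
        });
    ChannelFuture connectFuture = bootstrap.connect(remoteAddress); // may fail later, on another thread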

@DaveCTurner DaveCTurner requested a review from mhl-b July 18, 2025 16:00
Labels
auto-backport Automatically create backport pull requests when merged :Distributed Coordination/Network Http and internode communication implementations >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0
3 participants