Skip to content

Commit bafb0ed

Browse files
committed
dns-resolve: Do not treat never accessed responses as expired.
If a daemon has multiple remotes to connect to and it already reached pretty high backoff interval on those connections, it is possible that it will never be able to connect, if the DNS TTL value is too low. For example, if ovn-controller has 3 remotes in the ovn-remote configuration, where each is presented as a host name, the following is happening when it reaches default max backoff of 8 seconds: 1. Tries to connect to the first remote - sends async DNS request. 2. Since jsonrpc/reconnect modules are not really aware of this, they just treat connection as temporarily failed - 8 second backoff. 3. After the backoff - switching to a new remote - sending DNS request. 4. Temporarily failing - 8 second backoff. 5. Switching to the third remote - sending DNS request. 6. Temporarily failing - 8 second backoff. 7. Finally trying the first remote again - checking DNS. 8. There is a processed response, but it is already 24 seconds old. 9. If DNS TTL is lower than 24 seconds - consider expired - send a new DNS request. 10. Go to step 2. With that, if DNS TTL is lower than 3x of the backoff interval, the process will never be able to connect without some external help to break the loop. A proper solution for this should include: 1. Making jsonrpc and reconnect and all the users of these modules aware of the DNS request being made. This means introduction of a new RECONNECT state for DNS request and not switching to a new target while we're still in this state. 2. Making the poll loop state machine properly react to DNS responses by waiting on the file descriptor provided by the unbound library. However, such solution will be very invasive to the code structure and all the involved libraries, so it may not be something that we would want to backport as a bug fix to stable branches. Instead, making a much simpler change to allow use of never previously accessed DNS replies for a short period of time, so the loop can be broken. It's not caching if we just requested the value, but didn't use it yet, it's a "transaction in progress" situation in which we can use the response even if TTL is zero. Without a proper solution though we can't be sure that the process will ever look at the result of asynchronous request, so we need to have an upper limit for such "transactions in progress". Limiting them to a fairly arbitrary, but big enough, value of 5 minutes. In the worst case where the address actually goes stale in between our request and the first access, we'll try to use the stale value once and then re-request right away on failure to connect. This solution seems reasonable and simple enough to backport to stable branches while working on the proper solution on main. Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2025-June/053738.html Acked-by: Eelco Chaudron <[email protected]> Acked-by: Mike Pattrick <[email protected]> Signed-off-by: Ilya Maximets <[email protected]>
1 parent cd77890 commit bafb0ed

File tree

1 file changed

+23
-3
lines changed

1 file changed

+23
-3
lines changed

lib/dns-resolve.c

Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@
2828
#include "openvswitch/hmap.h"
2929
#include "openvswitch/vlog.h"
3030
#include "timeval.h"
31+
#include "util.h"
3132

3233
VLOG_DEFINE_THIS_MODULE(dns_resolve);
3334

@@ -43,6 +44,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
4344
enum resolve_state {
4445
RESOLVE_INVALID,
4546
RESOLVE_PENDING,
47+
RESOLVE_GOOD_UNUSED,
4648
RESOLVE_GOOD,
4749
RESOLVE_ERROR
4850
};
@@ -220,19 +222,37 @@ resolve_find_or_new__(const char *name)
220222
return req;
221223
}
222224

225+
#define RESOLVE_UNUSED_TIMEOUT 300
226+
223227
static bool
224228
resolve_check_expire__(struct resolve_request *req)
225229
OVS_REQUIRES(dns_mutex__)
226230
{
227-
return time_now() > req->time + req->ub_result->ttl;
231+
int ttl = req->ub_result->ttl;
232+
233+
/* When we just sent a request, but didn't look at the response yet, it's
234+
* not caching, but a "transaction in progress" situation, so we can use
235+
* the response even with TTL of 0 and more than 1 second passed.
236+
* Allow such values to be accessed for at least RESOLVE_UNUSED_TIMEOUT
237+
* seconds without considering them stale. This is necessary in case of
238+
* large backoff intervals on connections or if the process is doing some
239+
* other work not looking at the response for longer than TTL. */
240+
if (req->state == RESOLVE_GOOD_UNUSED) {
241+
ttl = MAX(ttl, RESOLVE_UNUSED_TIMEOUT);
242+
/* Not a "transaction in progress" anymore, normal caching rules should
243+
* apply from this point forward. */
244+
req->state = RESOLVE_GOOD;
245+
}
246+
247+
return time_now() > req->time + ttl;
228248
}
229249

230250
static bool
231251
resolve_check_valid__(struct resolve_request *req)
232252
OVS_REQUIRES(dns_mutex__)
233253
{
234254
return (req != NULL
235-
&& req->state == RESOLVE_GOOD
255+
&& (req->state == RESOLVE_GOOD || req->state == RESOLVE_GOOD_UNUSED)
236256
&& !resolve_check_expire__(req));
237257
}
238258

@@ -289,7 +309,7 @@ resolve_callback__(void *req_, int err, struct ub_result *result)
289309

290310
req->ub_result = result;
291311
req->addr = addr;
292-
req->state = RESOLVE_GOOD;
312+
req->state = RESOLVE_GOOD_UNUSED;
293313
req->time = time_now();
294314
}
295315

0 commit comments

Comments
 (0)