matheus23
Member

@matheus23 matheus23 commented Jul 22, 2025

Description

This is an alternative to #3396 but also refactors the code a bit more.

Instead of trying to switch to a different address when the best address is about to become outdated, we track path validity the same way for all paths we have.

This means the BestAddr struct and module completely disappear. Instead we replace them with a PathValidity struct and module and track it inside PathState.

Then we implement the "stickiness" that the best address had (i.e. not recalculating the best address all the time and sticking to a path as long as it's valid) by tracking a best: UdpSendAddr and a best_ipv4: UdpSendAddr in NodeUdpPaths.

This also tries to generally update the NodeUdpPaths when we receive data instead of when we try to send.
This means that NodeUdpPaths::get_send_addr might be slightly outdated, but it'll snap back to the "correct" thing once we receive any pings or data or call me maybes etc.
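
To make the idea concrete, here is a minimal sketch of per-path validity plus a sticky best path that is only recomputed on the receive side. It is not the code from this PR: the `PathValidity` and `NodeUdpPaths` types below are heavily simplified stand-ins, `on_pong` and `update_best` are hypothetical names, `best` is modelled as an `Option<SocketAddr>` rather than a `UdpSendAddr`, and the value of `TRUST_UDP_ADDR_DURATION` is a placeholder.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Placeholder value for illustration; not necessarily what iroh uses.
const TRUST_UDP_ADDR_DURATION: Duration = Duration::from_secs(5);

/// Simplified stand-in: validity is tracked for every path, not only the best.
#[derive(Debug, Clone, Copy)]
struct PathValidity {
    latency: Duration,
    trust_until: Instant,
}

impl PathValidity {
    fn is_valid(&self, now: Instant) -> bool {
        now < self.trust_until
    }
}

/// Simplified stand-in for the per-node UDP path state.
#[derive(Debug, Default)]
struct NodeUdpPaths {
    paths: HashMap<SocketAddr, PathValidity>,
    best: Option<SocketAddr>,
}

impl NodeUdpPaths {
    /// Hypothetical receive-side hook: a pong (re)validates the path it came from.
    fn on_pong(&mut self, addr: SocketAddr, latency: Duration, now: Instant) {
        let validity = PathValidity { latency, trust_until: now + TRUST_UDP_ADDR_DURATION };
        self.paths.insert(addr, validity);
        self.update_best(now);
    }

    /// Sticky selection: keep the current best while it is valid,
    /// otherwise fall back to the lowest-latency valid path (if any).
    fn update_best(&mut self, now: Instant) {
        let current_is_valid = self
            .best
            .and_then(|addr| self.paths.get(&addr))
            .is_some_and(|validity| validity.is_valid(now));
        if current_is_valid {
            return;
        }
        self.best = self
            .paths
            .iter()
            .filter(|(_, validity)| validity.is_valid(now))
            .min_by_key(|(_, validity)| validity.latency)
            .map(|(addr, _)| *addr);
    }

    /// The send side only reads the cached choice, so it can be slightly stale.
    fn get_send_addr(&self) -> Option<SocketAddr> {
        self.best
    }
}
```

In this sketch a faster path that validates later does not displace the current best; it only takes over once the current best has expired and new data arrives.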

I think this is a much cleaner approach, albeit with a bigger diff.

Breaking Changes

There shouldn't be any breaking changes.

Notes & open questions

There's some stuff, like clearing the best address when NodeMap::reset is called, that I didn't implement. I'm not sure there's any point to that code path: it seems to be called only on Endpoint::stop(), so I don't know why we'd want to clear the state there.

Maybe more? I'm curious about what you think.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.
    • List all breaking changes in the above "Breaking Changes" section.
    • Open an issue or PR on any number0 repos that are affected by this breaking change. Give guidance on how the updates should be handled or do the actual updates themselves. The major ones are:

@matheus23 matheus23 self-assigned this Jul 22, 2025
@matheus23 matheus23 changed the base branch from main to maint-0.35 July 22, 2025 10:13
@n0bot n0bot bot added this to iroh Jul 22, 2025
@github-project-automation github-project-automation bot moved this to 🏗 In progress in iroh Jul 22, 2025
@matheus23 matheus23 marked this pull request as ready for review July 22, 2025 14:36

github-actions bot commented Jul 22, 2025

Netsim report & logs for this PR have been generated and are available at: LOGS
This report will remain available for 3 days.

Last updated for commit: 8b8d146


matheus23/refactor-path-validity.7b03147
Perf report:

| test | case | throughput_gbps | throughput_transfer |
| --- | --- | ---: | ---: |
| iroh_latency_20ms | 1_to_1 | 1.27 | 2.28 |
| iroh_latency_20ms | 1_to_3 | 3.95 | 7.33 |
| iroh_latency_20ms | 1_to_5 | 6.36 | 11.49 |
| iroh_latency_20ms | 1_to_10 | 10.08 | 15.59 |
| iroh_latency_20ms | 2_to_2 | 2.43 | 4.22 |
| iroh_latency_20ms | 2_to_4 | 4.92 | 8.66 |
| iroh_latency_20ms | 2_to_6 | 7.45 | 13.21 |
| iroh_latency_20ms | 2_to_10 | 12.10 | 21.05 |
| iroh_relay_only | 1_to_1 | 2.57 | 2.59 |
| iroh_relay_only | 1_to_3 | 4.08 | 4.09 |
| iroh_relay_only | 1_to_5 | 4.39 | 4.40 |
| iroh_relay_only | 1_to_10 | 4.45 | 4.45 |
| iroh_relay_only | 2_to_2 | 3.35 | 3.36 |
| iroh_relay_only | 2_to_4 | 5.80 | 5.81 |
| iroh_relay_only | 2_to_6 | 6.00 | 6.02 |
| iroh_relay_only | 2_to_10 | 7.42 | 7.44 |
| iroh | 1_to_1 | 1.20 | 2.06 |
| iroh | 1_to_3 | 3.83 | 6.92 |
| iroh | 1_to_5 | 6.12 | 10.73 |
| iroh | 1_to_10 | 9.98 | 15.36 |
| iroh | 2_to_2 | 2.37 | 4.05 |
| iroh | 2_to_4 | 4.92 | 8.65 |
| iroh | 2_to_6 | 7.40 | 13.03 |
| iroh | 2_to_10 | 12.03 | 20.83 |
| iroh_latency_200ms | 1_to_1 | 1.21 | 2.10 |
| iroh_latency_200ms | 1_to_3 | 3.80 | 6.83 |
| iroh_latency_200ms | 1_to_5 | 6.11 | 10.69 |
| iroh_latency_200ms | 1_to_10 | 9.41 | 15.70 |
| iroh_latency_200ms | 2_to_2 | 2.38 | 4.09 |
| iroh_latency_200ms | 2_to_4 | 4.98 | 8.85 |
| iroh_latency_200ms | 2_to_6 | 7.40 | 13.03 |
| iroh_latency_200ms | 2_to_10 | 11.89 | 20.43 |
| iroh_cust_10gb | 1_to_1 | 2.57 | 2.82 |
| iroh_cust_10gb | 1_to_3 | 6.72 | 7.29 |
| iroh_cust_10gb | 1_to_5 | 11.56 | 12.59 |
| iroh_cust_10gb | 1_to_10 | 15.20 | 16.06 |
| iroh_cust_10gb | 2_to_2 | 4.80 | 5.24 |
| iroh_cust_10gb | 2_to_4 | 8.10 | 8.72 |
| iroh_cust_10gb | 2_to_6 | 10.84 | 11.58 |
| iroh_cust_10gb | 2_to_10 | 16.90 | 17.96 |

match self {
Source::ReceivedPong => TRUST_UDP_ADDR_DURATION,
// // TODO: Fix time
// Source::BestCandidate => Duration::from_secs(60 * 60),
Contributor

what about the BestCandidate situation, how is that handled now?

Member Author

BestCandidate used to be set in NodeUdpPaths::assign_best_addr_from_candidates_if_empty, which in my experience was only called on the receive side the very first time you'd send data from the magic socket.

For some reason it doesn't have a best address set at that point, so it instead picks one from all the PathStates it has and assigns the best it can find with Source::BestCandidate.

I... am really unsure why in that special case it should trust this candidate for one hour. This seems ridiculously high and totally out of proportion with TRUST_UDP_ADDR_DURATION otherwise.

In practice I don't think this is all too bad (even if the best addr never expires for an hour). In theory that allows an attacker to waste lots of resources on our end if they manage to spoof an address from the other side.
Or it can mean we'd keep sending on an address that isn't actually reachable anymore (doesn't pong) for quite a while before we fall back to the relay. Usually these cases would also coincide with the addrs in the call-me-maybe changing, though, which would clear the best addr as well.

Idk. All in all, very weird code. But writing this comment makes me want to jump back into iroh 0.35 code and see why this NodeUdpPaths::assign_best_addr_from_candidates_if_empty function was needed at all.

Contributor

> For some reason it doesn't have a best address set at that point, so it instead picks one from all the PathStates it has and assigns the best it can find with Source::BestCandidate.

I think the idea was to avoid flip-flopping and keep this stable unless new information comes in.

Member Author

It shouldn't flip flop though, at least not more often than every TRUST_UDP_ADDR_DURATION, since once you validate a path for that duration, you don't pick another until it expires :/

Member Author

Actually that's incorrect (and we've observed more frequent flip-flopping), because we'll usually have multiple paths that hole-punch successfully, and they'll all turn Valid at around the same time (and we'll choose the lowest latency ones).

Then they all expire at roughly the same time, at which point we flip between them back and forth a bit.

It's... really hard to fix this flip-flopping, because (1) both ends of the pipe will usually choose to switch the best address at around the same time (after TRUST_UDP_ADDR_DURATION), (2) it's very hard to agree on "the same" path on both ends of the connection (which would be the most stable configuration), and (3) the latency of a path goes up the more it's used.

Ideally all of this becomes irrelevant once we integrate path validation with our QUIC stack. At that point there's no reason for both ends to "agree" on the path to send and receive on, and it should be rare to "accidentally" lose path validation packets.
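
The interaction of simultaneous expiry with point (3) can be illustrated with a toy model. This is not iroh code: `simulate_best_path` is a hypothetical function, it assumes exactly two paths with the same base latency, and it assumes the path carrying traffic always measures `load_penalty` more than the idle one when both are re-validated in the same round.

```rust
use std::time::Duration;

/// Toy model (not iroh code): two paths, indexed 0 and 1, share the same
/// base latency, and the path currently carrying traffic measures
/// `load_penalty` more. Returns the chosen path after each expiry round.
fn simulate_best_path(rounds: usize, base: Duration, load_penalty: Duration) -> Vec<usize> {
    // Latency measured for `path` while `used` is the path carrying traffic.
    let measured = |path: usize, used: usize| {
        if path == used { base + load_penalty } else { base }
    };
    let mut best = 0;
    let mut history = vec![best];
    for _ in 0..rounds {
        // Both paths expire together, so the choice is made from scratch
        // and goes to whichever path currently measures lower latency.
        best = if measured(0, best) <= measured(1, best) { 0 } else { 1 };
        history.push(best);
    }
    history
}
```

With any non-zero `load_penalty` the model alternates between the two paths on every expiry, and with a zero penalty it stays on path 0. Real measurements are noisier, but this is the kind of feedback loop described above.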


github-actions bot commented Jul 22, 2025

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3398/docs/iroh/

Last updated: 2025-08-01T07:39:37Z

Contributor

@flub flub left a comment

Nice! Contains some changes I've long wanted 😄
The path validity seems to be a much better abstraction than best-addr was. If this has seen enough testing I don't think there's anything blocking?

Comment on lines +26 to +27
confirmed_at: Instant,
trust_until: Instant,
Contributor

It seems these always have the same relation. Why store both?

This is mostly an observation; it doesn't bother me that much if you want to keep it.

Member Author

Yeah I agree.

There's some more "duplicated state" stored here with some fields in PongReply.

If I were to refactor this further, I'd clean up the PathValidity struct a little more. At this point it's literally a combination of the fields in BestAddr with another PongReply.
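
As a rough sketch of the simplification discussed in this thread: if `trust_until` is always `confirmed_at` plus a fixed duration, it could be derived instead of stored. The struct below is a hypothetical, trimmed-down `PathValidity` (the real one has more fields), it assumes every source uses the same trust duration, and the constant's value is a placeholder.

```rust
use std::time::{Duration, Instant};

/// Placeholder value for illustration; not necessarily what iroh uses.
const TRUST_UDP_ADDR_DURATION: Duration = Duration::from_secs(5);

/// Hypothetical trimmed-down `PathValidity` that stores only `confirmed_at`.
#[derive(Debug, Clone, Copy)]
struct PathValidity {
    confirmed_at: Instant,
}

impl PathValidity {
    fn new(confirmed_at: Instant) -> Self {
        Self { confirmed_at }
    }

    /// Derived on demand, so it cannot drift out of sync with `confirmed_at`.
    fn trust_until(&self) -> Instant {
        self.confirmed_at + TRUST_UDP_ADDR_DURATION
    }

    /// A path is trusted strictly before `trust_until`.
    fn is_valid(&self, now: Instant) -> bool {
        now < self.trust_until()
    }
}
```

The trade-off is one addition per check in exchange for one field fewer and no way for the two instants to disagree.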

Self {
paths,
best_addr,
chosen_candidate: None,
best_ipv4: best, // we only use ipv4 addrs in tests
Contributor

Could be worth a debug_assert! perhaps? I wouldn't be surprised if I ended up writing tests with IPv6 at some point and it would catch me out. That probably works entirely fine - Quinn's test suite is entirely IPv6 and doesn't seem to cause issues anywhere.
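
One way the suggested check could look, as a sketch only: `best_ipv4_for_tests` is a hypothetical helper, and it assumes the address is available as a plain `SocketAddr`, whereas the actual code works with `UdpSendAddr`.

```rust
use std::net::SocketAddr;

/// Hypothetical test helper: passes `best` through for use as `best_ipv4`,
/// and panics in debug builds if the address is not IPv4.
fn best_ipv4_for_tests(best: SocketAddr) -> SocketAddr {
    debug_assert!(
        best.is_ipv4(),
        "test setup assumes IPv4 addresses, got {best}"
    );
    best
}
```

Since `debug_assert!` is compiled out when debug assertions are disabled, this costs nothing in release builds while still catching an IPv6 address in a debug test run.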

dsocolobsky pushed a commit to PsycheFoundation/psyche that referenced this pull request Jul 24, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jul 28, 2025
…ith `PathValidity` (#3400)

## Description

#3398 but now rebased on `main`. See its description for more
information.

There's only one small change: We don't need to think about
`NodeMap::reset`, as that's removed in `main`.

## Breaking Changes

None.

## Notes & open questions

<!-- Any notes, remarks or open questions you have to make about the PR.
-->

## Change checklist
<!-- Remove any that are not relevant. -->
- [x] Self-review.
@matheus23 matheus23 force-pushed the matheus23/refactor-path-validity branch from c6b3906 to 8e4d739 Compare July 30, 2025 09:13
@matheus23 matheus23 changed the title from "fix(iroh): Also keep track of non-best path's validity" to "fix(iroh): Also keep track of non-best path's validity (v0.35)" Aug 1, 2025