Federation with matrix.org fails: stale SRV-based cache entry overrides .well-known delegation (port 8443 instead of 443) #385

@oly-nittka

Summary

Federation from a Tuwunel homeserver to matrix.org fails silently: make_join returns [404] <non-json bytes>. As a result, accepting DM invitations (and likely any other join to a room hosted on matrix.org) fails with "the room no longer exists or the invitation is no longer valid".

Tuwunel version: v1.5.1

Please note: I'm a new user to Matrix/Tuwunel. Federation didn't work, so I had Claude Code investigate the issue and it came up with the below fix. I hardly have any experience with Rust, either, but I thought I'd share what Claude found out, maybe it helps someone. In case it's bullshit, just delete the issue, please.


Root Cause Analysis

The Immediate Cause

The federation resolver cache (stored persistently in RocksDB) contained a stale entry for matrix.org originating from a previous SRV-based resolution:

matrix.org → FedDest::Named("matrix.org", ":8443")

This cached destination was set at a time when matrix.org/.well-known/matrix/server was either not yet consulted or not yet set up, causing the code to fall through to the SRV path (actual_dest_4). The SRV record _matrix._tcp.matrix.org currently points to:

10 5 8443 matrix-federation.matrix.org.cdn.cloudflare.net.
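For readers unfamiliar with SRV records: the four fields in the record above are priority, weight, port, and target, in that order (RFC 2782). The following is only an illustrative sketch of how such a line breaks down. SrvRecord and parse_srv_rdata are hypothetical names made up for this example and are not part of Tuwunel.

```rust
/// Hypothetical representation of SRV record data (not a Tuwunel type).
#[derive(Debug, PartialEq)]
struct SrvRecord {
    priority: u16,
    weight: u16,
    port: u16,
    target: String,
}

/// Parses "priority weight port target" as printed by common DNS tools.
/// Returns None if a field is missing, is not a valid u16, or extra fields follow.
fn parse_srv_rdata(rdata: &str) -> Option<SrvRecord> {
    let mut fields = rdata.split_whitespace();
    let priority: u16 = fields.next()?.parse().ok()?;
    let weight: u16 = fields.next()?.parse().ok()?;
    let port: u16 = fields.next()?.parse().ok()?;
    // Drop the trailing dot that marks a fully qualified name.
    let target = fields.next()?.trim_end_matches('.').to_owned();
    if fields.next().is_some() {
        return None;
    }
    Some(SrvRecord { priority, weight, port, target })
}
```

Applied to the record above, the port field is 8443, which is where the port in the stale cache entry comes from.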

Since the destination cache has a TTL of 18–36 hours and is stored in RocksDB (persistent across restarts), every request to matrix.org used the stale entry and connected to https://matrix.org:8443 instead of https://matrix-federation.matrix.org:443. Port 8443 on Cloudflare for this host returns HTTP 521 ("Web Server Is Down") with a plain-text body, which ruma reports as [521] <non-json bytes> (or [404] after internal mapping), causing make_join to fail.

Confirmed by: Clearing the destination cache entry for matrix.org via the admin room (!admin server clear-caches) immediately resolved the issue — make_join succeeded and federation with matrix.org worked correctly.
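To illustrate why clearing the entry helps, here is a minimal sketch of a TTL-style cache lookup. It assumes the cache is consulted before any resolution is attempted, which matches the behaviour described above but is not taken from Tuwunel's source. CachedDest and lookup are hypothetical names.

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for a cached federation destination (not Tuwunel's actual type).
struct CachedDest {
    dest: String,
    expire: SystemTime,
}

/// Returns the cached destination while it has not expired.
/// Only when there is no entry, or the entry has expired, is `resolve` called.
fn lookup(
    cached: Option<&CachedDest>,
    now: SystemTime,
    resolve: impl FnOnce() -> String,
) -> String {
    match cached {
        Some(entry) if entry.expire > now => entry.dest.clone(),
        _ => resolve(),
    }
}
```

In this model, an unexpired entry holding matrix.org:8443 is returned without .well-known being consulted again. Removing the entry forces the resolve path to run, which is consistent with the observation that clearing the cache fixed federation immediately.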

The Underlying Code Bug

Clearing the cache fixes the stale destination, but it does not touch a secondary bug: whenever the resolver re-populates its caches, the IP override cache stores the wrong port.

In both actual_dest_2 and actual_dest_3_2 in src/service/resolver/actual.rs, split_at(pos) is used to separate the hostname and port from a host:port string, where pos is the index of the : character. This means port includes the leading colon (e.g. ":443"). The subsequent port.parse::<u16>() call then always fails because ":443" is not a valid u16 string, falling back to unwrap_or(8448):

// actual_dest_2, line 126
let (host, port) = dest.as_str().split_at(pos);
self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache).await?;
//                                      ^^^^^^^^^^^^^^^^^^^ always fails, port = ":443"

// actual_dest_3_2, line 169  
let (host, port) = delegated.split_at(pos);
self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache).await?;
//                                      ^^^^^^^^^^^^^^^^^^^ always fails, port = ":443"
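The behaviour can be reproduced outside Tuwunel with a few lines of standalone Rust. This is only a sketch of the parsing pattern quoted above. port_as_parsed_today is a hypothetical function name, and the colon is located with find(':') as the issue describes.

```rust
/// Mirrors the quoted pattern: `split_at` keeps the ':' in the second half,
/// so the port string is ":443" rather than "443".
fn port_as_parsed_today(host_port: &str) -> (&str, u16) {
    // Assumption: the caller found the colon position with `find(':')`.
    let pos = host_port.find(':').unwrap_or(host_port.len());
    let (host, port) = host_port.split_at(pos);
    // ":443" is not a valid u16, so `parse` returns Err and the fallback is used.
    (host, port.parse::<u16>().unwrap_or(8448))
}
```

Calling port_as_parsed_today("matrix-federation.matrix.org:443") yields the host unchanged and port 8448, not 443, because the parse of ":443" fails.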

Note: The FedDest returned by both functions correctly stores ":443" as a PortString (the try_into() conversion works fine for the string type), so the URL used for the request is correct (https://matrix-federation.matrix.org:443). The bug only affects the port stored in the CachedOverride, which maps the hostname to resolved IPs plus a port number. Since the reqwest HTTP client uses the URL's port for the actual TCP connection (not the port from the resolved SocketAddr), this secondary bug does not directly cause connection failures in the current code — but it is incorrect and may cause subtle issues if connection behavior changes.


Proposed Fix

--- a/src/service/resolver/actual.rs
+++ b/src/service/resolver/actual.rs
@@ -123,7 +123,8 @@ impl super::Service {
 	async fn actual_dest_2(&self, dest: &ServerName, cache: bool, pos: usize) -> Result<FedDest> {
 		debug!("2: Hostname with included port");
 		let (host, port) = dest.as_str().split_at(pos);
-		self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache)
+		let port_num = port.trim_start_matches(':').parse::<u16>().unwrap_or(8448);
+		self.conditional_query_and_cache(host, port_num, cache)
 			.await?;
 
 		Ok(FedDest::Named(
@@ -166,7 +167,8 @@ impl super::Service {
 	async fn actual_dest_3_2(&self, cache: bool, delegated: &str, pos: usize) -> Result<FedDest> {
 		debug!("3.2: Hostname with port in .well-known file");
 		let (host, port) = delegated.split_at(pos);
-		self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache)
+		let port_num = port.trim_start_matches(':').parse::<u16>().unwrap_or(8448);
+		self.conditional_query_and_cache(host, port_num, cache)
 			.await?;
 
 		Ok(FedDest::Named(
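A possible variant of the same fix, shown as a standalone sketch rather than a patch: strip_prefix(':') removes exactly one leading colon, whereas trim_start_matches(':') would also accept input such as "::443". split_host_port is a hypothetical helper and does not exist in Tuwunel. Like the original code, it assumes pos is the byte index of the colon.

```rust
/// Hypothetical helper: split "host:port" at the colon index `pos` and parse
/// the port without its leading ':'. Falls back to `default_port` if the port
/// is missing, non-numeric, or out of range for u16.
fn split_host_port(input: &str, pos: usize, default_port: u16) -> (&str, u16) {
    let (host, port) = input.split_at(pos);
    let port = port
        .strip_prefix(':')
        .and_then(|p| p.parse::<u16>().ok())
        .unwrap_or(default_port);
    (host, port)
}
```

With this helper, "matrix-federation.matrix.org:443" split at its colon gives port 443, while an unparsable port such as "example.org:notaport" still falls back to the default.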

Workaround

Until a fix is released, clearing the resolver cache via the admin room resolves the issue immediately:

!admin server clear-caches

Or query/inspect the specific entry first:

!admin query resolver destinations-cache matrix.org

Additional Notes

  • The Matrix spec (Section 3.1) states that when .well-known returns a delegated hostname with an explicit port, SRV lookups must be skipped. The actual_dest_3 logic correctly implements this (taking the if let Some(pos) = delegated.find(':') branch). The stale cache issue is therefore not a logic error in the current resolution code, but a persistence problem: a cache entry from a previous code path (pre-.well-known, or during a transient .well-known fetch failure) survives until expiry or manual clearing.

  • The issue is particularly hard to diagnose because Tuwunel's release binary has max_level_info compiled in, making TUWUNEL_LOG=debug ineffective. The diagnosis required network-level tracing (ss -tn inside the container network namespace) to observe the actual destination port.
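The delegation rule from the first note above can be sketched as a small decision function. This is a simplified illustration, not Tuwunel's code: NextStep and next_step_for_delegated are hypothetical names, IP literals (including bracketed IPv6) are deliberately ignored, and treating an unparsable port as "no port" is an assumption made for this example.

```rust
/// Hypothetical outcome of inspecting the delegated name from .well-known.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Explicit port given: connect to host:port directly, no SRV lookup.
    Direct { host: String, port: u16 },
    /// No explicit port: try SRV records for this host, then fall back to port 8448.
    SrvThenDefault { host: String },
}

/// Decides the next resolution step for a delegated server name.
/// Simplification: does not handle IP literals such as "[::1]:443".
fn next_step_for_delegated(delegated: &str) -> NextStep {
    match delegated.rsplit_once(':') {
        Some((host, port)) => match port.parse::<u16>() {
            Ok(port) => NextStep::Direct { host: host.to_owned(), port },
            // Assumption: an unparsable port is treated like a missing one.
            Err(_) => NextStep::SrvThenDefault { host: delegated.to_owned() },
        },
        None => NextStep::SrvThenDefault { host: delegated.to_owned() },
    }
}
```

For the delegated name matrix-federation.matrix.org:443 this returns the Direct variant with port 443, so the SRV record pointing at port 8443 should never come into play on a fresh resolution.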

Labels: protocol issue (The actual problem is with the protocol. The defect may or may not be mitigated by the server.)
