Summary
Federation from a Tuwunel homeserver to matrix.org fails silently with make_join returning [404] <non-json bytes>. This causes accepting DM invitations (and likely other joins to rooms hosted on matrix.org) to fail with "the room no longer exists or the invitation is no longer valid".
Tuwunel version: v1.5.1
Please note: I'm a new user to Matrix/Tuwunel. Federation didn't work, so I had Claude Code investigate the issue and it came up with the below fix. I hardly have any experience with Rust, either, but I thought I'd share what Claude found out, maybe it helps someone. In case it's bullshit, just delete the issue, please.
Root Cause Analysis
The Immediate Cause
The federation resolver cache (stored persistently in RocksDB) contained a stale entry for matrix.org originating from a previous SRV-based resolution:
matrix.org → FedDest::Named("matrix.org", ":8443")
This cached destination was set at a time when matrix.org/.well-known/matrix/server was either not yet consulted or not yet set up, causing the code to fall through to the SRV path (actual_dest_4). The SRV record _matrix._tcp.matrix.org currently points to:
10 5 8443 matrix-federation.matrix.org.cdn.cloudflare.net.
Since the destination cache has a TTL of 18–36 hours and is stored in RocksDB (persistent across restarts), every request to matrix.org used the stale entry and connected to https://matrix.org:8443 instead of https://matrix-federation.matrix.org:443. Port 8443 on Cloudflare for this host returns HTTP 521 ("Web Server Is Down") with a plain-text body, which ruma reports as [521] <non-json bytes> (or [404] after internal mapping), causing make_join to fail.
Confirmed by: Clearing the destination cache entry for matrix.org via the admin room (!admin server clear-caches) immediately resolved the issue — make_join succeeded and federation with matrix.org worked correctly.
The Underlying Code Bug
Clearing the cache fixes the symptom, but there is also a secondary bug in the resolver that causes the IP override cache to store the wrong port.
In both actual_dest_2 and actual_dest_3_2 in src/service/resolver/actual.rs, split_at(pos) is used to separate the hostname and port from a host:port string, where pos is the index of the : character. This means port includes the leading colon (e.g. ":443"). The subsequent port.parse::<u16>() call then always fails because ":443" is not a valid u16 string, so unwrap_or(8448) always yields the default port 8448:
// actual_dest_2, line 126
let (host, port) = dest.as_str().split_at(pos);
// port.parse::<u16>() always fails here, because port == ":443"
self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache).await?;

// actual_dest_3_2, line 169
let (host, port) = delegated.split_at(pos);
// port.parse::<u16>() always fails here, because port == ":443"
self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache).await?;
Note: The FedDest returned by both functions correctly stores ":443" as a PortString (the try_into() conversion works fine for the string type), so the URL used for the request is correct (https://matrix-federation.matrix.org:443). The bug only affects the port stored in the CachedOverride, which maps the hostname to resolved IPs plus a port number. Since the reqwest HTTP client uses the URL's port for the actual TCP connection (not the port from the resolved SocketAddr), this secondary bug does not directly cause connection failures in the current code — but it is incorrect and may cause subtle issues if connection behavior changes.
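The parsing behaviour can be reproduced outside of Tuwunel with the standard library alone. The following is a minimal sketch, not code from the repository: buggy_port and fixed_port are hypothetical names that only mirror the two expressions discussed in this issue.

```rust
/// Hypothetical stand-in for the current parsing: split at the index of ':'
/// and parse the right-hand part, which still contains the colon.
fn buggy_port(dest: &str) -> u16 {
    let pos = dest.find(':').expect("destination must contain ':'");
    let (_host, port) = dest.split_at(pos); // port == ":443"
    port.parse::<u16>().unwrap_or(8448) // parse fails, so this is always 8448
}

/// Hypothetical stand-in for the proposed fix: drop the leading colon first.
fn fixed_port(dest: &str) -> u16 {
    let pos = dest.find(':').expect("destination must contain ':'");
    let (_host, port) = dest.split_at(pos);
    port.trim_start_matches(':').parse::<u16>().unwrap_or(8448)
}

fn main() {
    let dest = "matrix-federation.matrix.org:443";
    println!("buggy: {}", buggy_port(dest));
    println!("fixed: {}", fixed_port(dest));
}
```

With the explicit port 443 in the input, the first function still yields the default 8448, while the second yields 443.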
Proposed Fix
--- a/src/service/resolver/actual.rs
+++ b/src/service/resolver/actual.rs
@@ -123,7 +123,8 @@ impl super::Service {
async fn actual_dest_2(&self, dest: &ServerName, cache: bool, pos: usize) -> Result<FedDest> {
debug!("2: Hostname with included port");
let (host, port) = dest.as_str().split_at(pos);
- self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache)
+ let port_num = port.trim_start_matches(':').parse::<u16>().unwrap_or(8448);
+ self.conditional_query_and_cache(host, port_num, cache)
.await?;
Ok(FedDest::Named(
@@ -166,7 +167,8 @@ impl super::Service {
async fn actual_dest_3_2(&self, cache: bool, delegated: &str, pos: usize) -> Result<FedDest> {
debug!("3.2: Hostname with port in .well-known file");
let (host, port) = delegated.split_at(pos);
- self.conditional_query_and_cache(host, port.parse::<u16>().unwrap_or(8448), cache)
+ let port_num = port.trim_start_matches(':').parse::<u16>().unwrap_or(8448);
+ self.conditional_query_and_cache(host, port_num, cache)
.await?;
Ok(FedDest::Named(
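One possible refinement, offered only as a sketch: the patch above still maps any unparsable port to 8448 silently, and trim_start_matches(':') would also accept a suffix such as "::443". A stricter helper could strip exactly one colon and return an Option, leaving the fallback decision to the caller. parse_port_suffix is a hypothetical name and is not part of Tuwunel.

```rust
/// Hypothetical helper: parse the ":<port>" suffix produced by `split_at(pos)`.
/// Returns None when the suffix is not exactly one colon followed by a valid u16.
fn parse_port_suffix(suffix: &str) -> Option<u16> {
    suffix.strip_prefix(':')?.parse::<u16>().ok()
}

fn main() {
    for suffix in [":443", ":8448", "::443", ":notaport", ""] {
        match parse_port_suffix(suffix) {
            Some(port) => println!("{suffix:?} -> port {port}"),
            None => println!("{suffix:?} -> invalid, caller decides the fallback"),
        }
    }
}
```

A caller that wants the existing behaviour could still write parse_port_suffix(port).unwrap_or(8448), but it could also log the invalid input before falling back.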
Workaround
Until a fix is released, clearing the resolver cache via the admin room resolves the issue immediately:
!admin server clear-caches
Or query/inspect the specific entry first:
!admin query resolver destinations-cache matrix.org
Additional Notes
- The Matrix spec (Section 3.1) states that when .well-known returns a delegated hostname with an explicit port, SRV lookups must be skipped. The actual_dest_3 logic correctly implements this (taking the if let Some(pos) = delegated.find(':') branch). The stale cache issue is therefore not a logic error in the current resolution code, but a persistence problem: a cache entry from a previous code path (pre-.well-known, or during a transient .well-known fetch failure) survives restarts and stays in use until it expires or is cleared manually.
- The issue is particularly hard to diagnose because Tuwunel's release binary has max_level_info compiled in, making TUWUNEL_LOG=debug ineffective. The diagnosis required network-level tracing (ss -tn inside the container network namespace) to observe the actual destination port.