test suite use of dpd, etc. doesn't work #8999

Description

@davepacheco

(this is all based on a meeting this morning with @jgallagher, @internet-diglett, and me -- my apologies for any errors here)

Background

  • In real systems, Dendrite configuration is switch-specific. Switches may have different uplinks available, different routing rules, even different external connectivity (in future multi-rack work). That means Nexus always needs to know which switch it's talking to.
  • (I believe) Dendrite does not currently know which switch it is (and doesn't need to).
  • In a real system, MGS (Management Gateway) is the source of truth for which switch the "switch zone" is attached to.

In order for Nexus to do anything with switches, it needs to be able to find the Dendrite instances for each switch. There are two parts to this problem:

  • knowing the (IP address, TCP port) pairs where Dendrite instances are running
  • knowing which switch each Dendrite instance is attached to

How this works today

Finding services

To find the switch zone services, Nexus looks them up in internal DNS (e.g., the generic _dendrite._tcp SRV records).

So how does DNS get filled in with these addresses and ports? In general, internal DNS contents get computed in two different places: (1) during rack setup, RSS computes them; and (2) after that, Nexus computes them from the current target blueprint during blueprint execution. In real systems, when RSS does it, the IPs are those of the switch zones it knows about and the TCP ports are the hardcoded ports for these services. There's not really another approach here: the switch zone is started before RSS is even running, so unlike with control plane zones, RSS has no way to configure which TCP ports it uses. In real systems, when blueprint execution computes internal DNS, it also uses the hardcoded ports. (The linked code does mention overrides for testing, but those are only used for unit tests. They're not present in the Nexus that gets spun up by the test suite, and I think that's for the best.)
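
For example, the resulting SRV records can be inspected directly with dig (here assuming the internal DNS server from the omicron-dev run shown below, which listens on [::1]:53652):

$ dig @::1 -p 53652 _dendrite._tcp.control-plane.oxide.internal SRV +short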

Which switch does each service correspond with?

As mentioned above, Nexus asks MGS which switch it's attached to, and it then assumes that all the other services at the same IP correspond to that same switch.
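
A minimal sketch of this inference (not Nexus's actual code; the addresses are made up for illustration):

use std::collections::HashMap;
use std::net::{IpAddr, Ipv6Addr};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

fn main() {
    // On a real system each switch zone has a distinct IP, so mapping
    // "IP of the MGS that reported switch X" to X is unambiguous.
    let switch_by_ip: HashMap<IpAddr, SwitchLocation> = HashMap::from([
        (IpAddr::V6(Ipv6Addr::new(0xfd00, 0, 0, 0, 0, 0, 0, 1)), SwitchLocation::Switch0),
        (IpAddr::V6(Ipv6Addr::new(0xfd00, 0, 0, 0, 0, 0, 0, 2)), SwitchLocation::Switch1),
    ]);

    // A Dendrite instance found via the _dendrite._tcp SRV records gets
    // attributed to a switch purely by its IP address.
    let dendrite_ip = IpAddr::V6(Ipv6Addr::new(0xfd00, 0, 0, 0, 0, 0, 0, 1));
    println!("dendrite at {} manages {:?}", dendrite_ip, switch_by_ip.get(&dendrite_ip));

    // In the test suite everything runs on ::1, so both MGS entries would
    // share one key and the map could no longer distinguish the switches.
}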

The problem

None of this works in the test suite.

Problem 1: Finding the IP/port addresses of the networking services

The test harness sets up a virtual control plane that includes MGS and Dendrite, and those wind up running on arbitrary ports. As with everything else in the test suite, this is done to support concurrent execution of multiple tests.

The test harness plays the role of RSS in this context. It appears to configure internal DNS with the actual TCP ports for the MGS and Dendrite that it started:

self.dendrite.get(&switch_location).unwrap().port, // Dendrite (dpd)
self.gateway.get(&switch_location).unwrap().port,  // MGS (management gateway)
self.mgd.get(&switch_location).unwrap().port,      // mgd

So initially, the TCP ports will be correct. However, as soon as the test system executes a blueprint, those TCP ports will wind up changed to the wrong values (the hardcoded ones for production systems). I verified this in cargo xtask omicron-dev run-all (which runs basically the same environment as each test suite test):

$ cargo xtask omicron-dev run-all
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.66s
     Running `target/debug/xtask omicron-dev run-all`
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.17s
     Running `target/debug/omicron-dev run-all`
omicron-dev: setting up all services ... 
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.22889.0.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.22889.0.log"
DB URL: postgresql://root@[::1]:48588/omicron?sslmode=disable
DB address: [::1]:48588
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.22889.2.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.22889.2.log"
omicron-dev: Adding disks to first sled agent
omicron-dev: services are running.
omicron-dev: nexus external API:    127.0.0.1:12220
omicron-dev: nexus internal API:    [::1]:12221
omicron-dev: cockroachdb pid:       22893
omicron-dev: cockroachdb URL:       postgresql://root@[::1]:48588/omicron?sslmode=disable
omicron-dev: cockroachdb directory: /dangerzone/omicron_tmp/.tmpdq7MIy
omicron-dev: internal DNS HTTP:     http://[::1]:53334
omicron-dev: internal DNS:          [::1]:53652
omicron-dev: external DNS name:     oxide-dev.test
omicron-dev: external DNS HTTP:     http://[::1]:41750
omicron-dev: external DNS:          [::1]:55935
omicron-dev:   e.g. `dig @::1 -p 55935 test-suite-silo.sys.oxide-dev.test`
omicron-dev: management gateway:    http://[::1]:47805 (switch0)
omicron-dev: silo name:             test-suite-silo
omicron-dev: privileged user name:  test-privileged
...

Enable the initial target blueprint:

$ ./target/debug/omdb --dns-server [::1]:53652 nexus blueprints list
note: Nexus URL not specified.  Will pick one from DNS.
note: using Nexus URL http://[::1]:12221
T ENA ID                                   PARENT TIME_CREATED             
* no  bfd9cbc5-8f90-4839-9808-38db6dbbbf12 <none> 2025-09-04T21:26:33.764Z 

$ ./target/debug/omdb --dns-server [::1]:53652 nexus blueprints target enable current -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using Nexus URL http://[::1]:12221
set target blueprint bfd9cbc5-8f90-4839-9808-38db6dbbbf12 to enabled

Wait a second, then see that the internal DNS version has been bumped:

$ ./target/debug/omdb --dns-server [::1]:53652 db dns show
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:48588/omicron?sslmode=disable
note: database schema version matches expected (186.0.0)
GROUP    ZONE                         ver UPDATED              REASON                                                                  
internal control-plane.oxide.internal 2   2025-09-04T21:28:25Z blueprint bfd9cbc5-8f90-4839-9808-38db6dbbbf12 (initial test blueprint) 
external oxide-dev.test               2   2025-09-04T21:26:34Z create silo: "test-suite-silo"                           

$ ./target/debug/omdb --dns-server [::1]:53652 db dns diff internal 2
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using database URL postgresql://root@[::1]:48588/omicron?sslmode=disable
note: database schema version matches expected (186.0.0)
DNS zone:                   control-plane.oxide.internal (Internal)
requested version:          2 (created at 2025-09-04T21:28:25Z)
version created by Nexus:   e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
version created because:    blueprint bfd9cbc5-8f90-4839-9808-38db6dbbbf12 (initial test blueprint)
changes:                    names added: 6, names removed: 5

+  _dendrite._tcp                                     (records: 1)
+      SRV  port 12224 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
+  _mgd._tcp                                          (records: 1)
+      SRV  port  4676 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
+  _mgs._tcp                                          (records: 1)
+      SRV  port 12225 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
+  _nexus._tcp                                        (records: 2)
+      SRV  port 12223 a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f.host.control-plane.oxide.internal
+      SRV  port 12221 e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c.host.control-plane.oxide.internal
+  a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f.host          AAAA ::1
+  dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host AAAA ::2
-  _dendrite._tcp                                     (records: 1)
-      SRV  port 52689 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
-  _mgd._tcp                                          (records: 1)
-      SRV  port 45341 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
-  _mgs._tcp                                          (records: 1)
-      SRV  port 47805 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
-  _nexus._tcp                                        (records: 1)
-      SRV  port 12221 e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c.host.control-plane.oxide.internal
-  dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host AAAA ::1

The correct ports for the services started by the test suite have been replaced by the stock hardcoded ones.

This isn't a huge deal because we don't enable blueprint execution by default in tests, but I think it's fair to say this is at least partly working by accident.

Problem 2: which switch goes with each service?

In the test suite, all of these services run on localhost, which breaks Nexus's assumption that it can determine which switch a given service (say, Dendrite) manages based on which MGS instance it shares an IP address with.

Proposed solution

For the problem of discovering switch zone services more reliably, we suggested:

  • add a new field to the blueprint that specifies the IP address and TCP port for each switch service (a rough sketch follows this list)
    • RSS can fill this in properly for the first blueprint because it has this information
    • the test runner can likewise fill this in properly for the first blueprint, for the same reason
  • we need to keep this up to date if scrimlets get moved around -- I can't remember from the call if we determined how to do this. (On real systems, the planner has the information to do this. In the test suite, though, I think that would cause it to do the wrong thing.)
  • during blueprint execution, in computing DNS, Nexus would use these addresses and ports rather than inferring the switch zone addresses and hardcoding TCP ports
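
A hypothetical sketch of what that blueprint field might look like; the names here are invented for illustration and are not existing omicron types:

use std::collections::BTreeMap;
use std::net::SocketAddrV6;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Addresses of the services running in one switch zone.
#[derive(Debug)]
struct SwitchServiceAddrs {
    dendrite: SocketAddrV6,
    mgs: SocketAddrV6,
    mgd: SocketAddrV6,
}

/// Carried in the blueprint: filled in by RSS (or the test runner) for the
/// initial blueprint, updated if scrimlets move around, and used verbatim
/// when blueprint execution computes internal DNS.
type BlueprintSwitchServices = BTreeMap<SwitchLocation, SwitchServiceAddrs>;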

Recall that this problem is less urgent than the next one: the initial DNS contents in the test suite are fine today, and only blueprint execution (which is disabled out of the box) breaks them.

For the problem of figuring out which switch each service goes with:

  • Dendrite should be able to tell clients directly which switch it's managing
    • Dendrite should ask MGS for this information. (It already talks to its local MGS for other reasons.)
    • Dendrite should support a start-time config option for the address/port of the MGS to use, so that the test suite can point it at the MGS that it started rather than assuming it should talk to the one on localhost at the hardcoded port. (A rough sketch follows this list.)
  • Rather than assuming the switch is determined by the IP address, Nexus should ask Dendrite which switch it's managing and use that
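
A hypothetical sketch of the Dendrite side; dpd has no such endpoint or config option today, and all names here are invented for illustration:

use std::net::SocketAddrV6;

/// Start-time Dendrite configuration, extended so the test suite can point
/// dpd at the MGS it started instead of assuming localhost and the
/// hardcoded MGS port.
struct DpdConfig {
    mgs_address: SocketAddrV6,
}

#[derive(Debug, Clone, Copy)]
enum SwitchLocation {
    Switch0,
    Switch1,
}

/// Response from a new "which switch am I managing?" endpoint: at startup
/// dpd asks the MGS at `mgs_address` which switch it's attached to, then
/// reports that directly to clients like Nexus, which no longer needs to
/// infer it from a shared IP address.
struct SwitchIdentity {
    switch_location: SwitchLocation,
}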

Another option related to all this is:

  • have the inventory collector ask each MGS which switch it's attached to and record that
  • maybe do the same with Dendrite and the other switch zone services (after having located them using the generic _dendrite._tcp records)?
  • have blueprint planner latch this state
  • have blueprint execution create DNS names specific to each rack, switch, and service, like $rack_id.switch_0._dendrite._tcp (separate from the existing _dendrite._tcp records).

This way, Nexus would look up ServiceName::Dendrite(RackId, SwitchLocation) and get back exactly one record: the right Dendrite for that switch in that rack. This would be more aligned with how we intended DNS to be used (all SRV records under a given name should be fungible, which isn't currently the case for _dendrite._tcp).
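
For illustration, a rack/switch-scoped record (naming invented here, following the scheme above) would live alongside the existing generic one:

$rack_id.switch_0._dendrite._tcp                   (records: 1)
    SRV  port 12224 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal
_dendrite._tcp                                     (records: 1)
    SRV  port 12224 dendrite-b6d65341-167c-41df-9b5c-41cded99c229.host.control-plane.oxide.internal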
