diff --git a/doc/gnpsi/gnpsi_hld.md b/doc/gnpsi/gnpsi_hld.md
new file mode 100644
index 00000000000..e59fb2d7ff4
--- /dev/null
+++ b/doc/gnpsi/gnpsi_hld.md
@@ -0,0 +1,465 @@
+# SONiC gNPSI HLD
+
+## [Table of Content](#table-of-content)
+  * [1. Revision](#1-revision)
+  * [2. Scope](#2-scope)
+  * [3. Definitions/Abbreviations](#3-definitions-abbreviations)
+  * [4. Overview](#4-overview)
+  * [5. Requirements](#5-requirements)
+  * [6. High-Level Design](#6-high-level-design)
+    + [GNPSI application](#gnpsi-application)
+    + [SONiC gNPSI implementation](#sonic-gnpsi-implementation)
+    + [Repository and Module](#repository-and-module)
+    + [Files](#files)
+    + [Alternative](#alternative)
+  * [7. Configuration and management](#7-configuration-and-management)
+    + [7.1 Config DB Enhancements](#71-config-db-enhancements)
+  * [8. Warmboot and Fastboot Design Impact](#8-warmboot-and-fastboot-design-impact)
+  * [9. Memory Consumption](#9-memory-consumption)
+  * [10. Testing Design](#10-testing-design)
+    + [10.1. Unit Test cases](#101-unit-test-cases)
+      - [Config DB monitor](#config-db-monitor)
+      - [Stats](#stats)
+    + [10.2. System Test cases](#102-system-test-cases)
+      - [Ondatra](#ondatra)
+
+### 1. Revision
+
+| Revision | Author | Updated date |
+| -------- | ------ | ------------ |
+| v0.1     | Google | Jun 16, 2025 |
+
+### 2. Scope
+
+Add a service that supports gNPSI to SONiC sFlow.
+
+### 3. Definitions/Abbreviations
+
+[gNPSI](https://github.com/openconfig/gnpsi) - gRPC Network Packet Sampling Interface
+
+### 4. Overview
+
+sFlow uses the `hsflowd` agent within a Docker container to send sample datagrams over UDP, whereas gNPSI is a new gRPC service that streams packet samples to the telemetry infrastructure.
+
+It aims to address the following issues of sFlow:
+
+* Lack of encryption and authentication (vulnerable to man-in-the-middle attacks)
+* Challenges with UDP-based transport (packet loss)
+* Reliance on VIP collector discovery
+
+gNPSI solves these problems:
+
+* gRPC provides security and authentication
+* gRPC over TCP provides sequencing and retransmission
+* The dial-in model avoids collector discovery
+
+### 5. Requirements
+
+Integrate gNPSI into the current sampling service, i.e. sFlow. gNPSI is an optional feature that is enabled through configuration and does not break existing sFlow support.
+
+### 6. High-Level Design
+
+#### GNPSI application
+
+SONiC runs sFlow in a Docker container through the `hsflowd` agent. When sampled packets come in, `hsflowd` accumulates them into UDP datagrams and sends them to the collector.
+
+![Current sflow stack](images/image1.png "Current sflow stack")
+
+After the introduction of gNPSI, a new gNPSI server process is brought up within the sFlow container. `hsflowd` is configured with an internal local loopback collector, just like any other sFlow collector.
+
+**Advantage**: This local loopback approach keeps the architecture streamlined and simplifies maintenance of the upstream code. Refer to the [Alternative](#alternative) section for the rationale behind this design.
+
+When sampled packets come in, `hsflowd` sends them to the local loopback collector. The gNPSI process then reads from the local loopback and relays the samples to the subscribed clients.
+
+![sFlow switch stack with gNPSI](images/image2.png "new switch stack")
+
+![sFlow sample path with gNPSI](images/image3.png "sample path")
+
+#### SONiC gNPSI implementation
+
+The gNPSI relay server is implemented as a library in the [open-source gNPSI repository](https://github.com/openconfig/gnpsi). The SONiC gNPSI server integrates this relay server implementation and adds configuration, telemetry, and security functionality.
+
+Upon initialization, the gNPSI process retrieves the path configuration from CONFIG_DB and activates or deactivates the relay server accordingly. It also collects statistics from the relay server and updates COUNTERS_DB every 30 seconds.
+
+The gNPSI process supports at most 3 concurrent clients. This modification is platform-independent.
+
+![gNPSI process sequence diagram](images/image4.png "gNPSI process sequence diagram")
+
+#### Repository and Module
+
+This design introduces a SONiC application extension. The primary modification is a new `sonic-gnpsi` repository, which uses the [open-source gNPSI repository on GitHub](https://github.com/openconfig/gnpsi) to build the server code. The `sonic-gnpsi` repository is added as a submodule of `sonic-buildimage`.
+
+This ensures modularity and maintainability, making it easier to update and manage dependencies.
+
+Build dependencies include the new `sonic-gnpsi` repository, the GitHub `openconfig/gnpsi` repository, and `bazel`.
+
+The repository dependencies are shown below.
+
+![Image showing how gNPSI is integrated into SONiC repo](images/image5.png "gnpsi repos")
+
+#### Files
+
+* `gnpsi.cc`: main file
+* `/db_monitor`: monitors gNPSI-related CONFIG_DB events
+* `/server`: helper that starts and stops the relay server and the stats thread, and writes to APPL_STATE_DB
+* `/utils`: stats, credential, and authz logger utilities
+
+#### Alternative
+
+![alternative design without local loopback](images/image6.png "alternative")
+
+**Advantages:**
+
+* Avoids the overhead of writing to and reading from the local loopback IP.
+
+**Disadvantages:**
+
+* Introduces architectural complexities.
+* Difficult to upstream and maintain the sFlow open-source C code.
+* Cannot share a common gNPSI relay server implementation with other protocols (NetFlow/IPFIX).
+
+Given that the overhead of using the local loopback collector is negligible, we opted for the local loopback collector solution instead of integrated processes.
+
+### 7. Configuration and management
+
+gNMI paths
+
+These paths control the enablement and the gRPC port of the gNPSI relay server:
+
+* `/system/grpc-servers/grpc-server[name=gnpsi]/config/enable`
+* `/system/grpc-servers/grpc-server[name=gnpsi]/config/port`
+
+These paths cover the stats collection for gNPSI:
+
+* `/system/grpc-servers/grpc-server[name=gnpsi]/clients/client[address=][port=]/state/bytes-sent`
+* `/system/grpc-servers/grpc-server[name=gnpsi]/clients/client[address=][port=]/state/packets-sent`
+* `/system/grpc-servers/grpc-server[name=gnpsi]/clients/client[address=][port=]/state/sample-send-error`
+
+Server flags
+
+* `gnpsi_grpc_port`: gRPC port for the gNPSI server
+* `gnpsi_max_clients`: the maximum number of clients that can be connected to the gNPSI server at the same time
+* `udp_port`: UDP port from which sFlow samples are read. The switch configures localhost and this port as the sFlow collector.
+
+#### 7.1 Config DB Enhancements
+
+gNPSI uses `CONFIG_DB`, `APPL_STATE_DB` and `COUNTERS_DB`.
+
+The DB schema is shown below:
+
+```
+CONFIG_DB:
+  "GNPSI|global": {
+    "admin_state": "ENABLE"|"DISABLE",
+    "port": ""
+  }
+APPL_STATE_DB:
+  "GNPSI:global": {
+    "admin_state": "ENABLE"|"DISABLE",
+    "port": ""
+  }
+COUNTERS_DB:
+  "COUNTERS:GNPSI:/": {
+    "bytes_sent": "0",
+    "packets_sent": "0",
+    "packets_error": "0"
+  }
+```
+
+* For CONFIG_DB and APPL_STATE_DB, the `admin_state` and `port` fields are added for configuration purposes.
+* For COUNTERS_DB, gNPSI adds the stats fields `bytes_sent`, `packets_sent` and `packets_error` for each collector.
+
+### 8. Warmboot and Fastboot Design Impact
+
+Since gNPSI does not interact with the switch hardware, it has no impact on warm reboot or fast reboot.
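To make the `GNPSI|global` schema from section 7.1 more concrete, the following is a minimal Python sketch of how a consumer might validate such an entry before starting or stopping the relay server. It is illustrative only: the actual `sonic-gnpsi` code is C++, and the names `GnpsiConfig` and `parse_gnpsi_config`, the sample port value, and the assumption that `port` is stored as a decimal string are all hypothetical.

```python
from dataclasses import dataclass

# The two admin_state values listed in the schema.
VALID_ADMIN_STATES = ("ENABLE", "DISABLE")


@dataclass(frozen=True)
class GnpsiConfig:
    """Hypothetical in-memory view of the "GNPSI|global" entry."""
    enabled: bool
    port: int


def parse_gnpsi_config(entry):
    """Validate a dict holding the fields of a "GNPSI|global" entry.

    Assumes (not confirmed by the design) that "port" is a decimal string.
    """
    admin_state = entry.get("admin_state")
    if admin_state not in VALID_ADMIN_STATES:
        raise ValueError(f"invalid admin_state: {admin_state!r}")
    try:
        port = int(entry.get("port", ""))
    except (TypeError, ValueError):
        raise ValueError(f"invalid port: {entry.get('port')!r}") from None
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return GnpsiConfig(enabled=(admin_state == "ENABLE"), port=port)


# Example with an arbitrary, made-up port value.
config = parse_gnpsi_config({"admin_state": "ENABLE", "port": "50051"})
print(config)
```

Rejecting malformed entries up front mirrors what the `TestConfigDbInvalidEventFailure` unit test expects from `db_monitor`, although the real error handling may differ.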
+
+### 9. Memory Consumption
+
+Memory consumption is not significant based on collected stats.
+
+![gNPSI memory usage](images/image7.png "memory usage")
+
+(All data was collected in a Google-internal device environment.)
+
+There is a ~10 MB (5%) memory increase after integrating the feature, even when it is disabled by configuration.
+
+### 10. Testing Design
+
+#### 10.1. Unit Test cases
+
+##### Config DB monitor
+
+| Test case | Expected behavior |
+| --------- | ----------------- |
+| TestConfigDbAdminStateEventSuccess | db_monitor receives an `admin_state` event successfully |
+| TestConfigDbPortEventSuccess | db_monitor receives a `port` event successfully |
+| TestConfigDbAllEventSuccess | db_monitor receives `admin_state` and `port` events successfully |
+| TestConfigDbInvalidEventFailure | db_monitor reports an invalid event as an error |
+| TestConfigDbIgnoreUninterestedEventSuccess | db_monitor does not return events it is not interested in |
+| TestConfigDbConsecutiveEventsSuccess | db_monitor receives several events in the correct order |
+
+##### Stats
+
+| Test case | Expected behavior |
+| --------- | ----------------- |
+| ReadCounterDbStatsSucceeds | Stats can be read from COUNTERS_DB correctly |
+| UpdateCounterDbStatsSucceeds | Stats can be written to COUNTERS_DB correctly |
+| MergeEmptyExistingStatsNoopSucceeds | Merging empty stats has no effect |
+| MergeSameIpStatsDoubleCountSucceeds | Merging stats for the same IP doubles the stats count |
+| MergeDiffIpStatsNoopSucceeds | Merging stats for a different IP does not change the existing stats |
+| StatsThreadStartSucceeds | The stats thread can start and update the stats correctly |
+| StatsThreadStopStartSucceeds | The stats thread can stop and restart |
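The merge behavior that the stats tests describe can be pictured with a small Python sketch. The semantics here are inferred from the test names only (per-IP counters that are summed on merge), and `ClientStats` and `merge_stats` are hypothetical names; the actual stats utility in `sonic-gnpsi` is written in C++ and may behave differently.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClientStats:
    """Counters mirroring the COUNTERS_DB fields in section 7.1."""
    bytes_sent: int = 0
    packets_sent: int = 0
    packets_error: int = 0


def merge_stats(existing: Dict[str, ClientStats],
                update: Dict[str, ClientStats]) -> Dict[str, ClientStats]:
    """Return a new per-IP map with the counters in `update` added to `existing`."""
    # Copy first so the caller's map is left untouched.
    merged = {
        ip: ClientStats(s.bytes_sent, s.packets_sent, s.packets_error)
        for ip, s in existing.items()
    }
    for ip, delta in update.items():
        current = merged.setdefault(ip, ClientStats())
        current.bytes_sent += delta.bytes_sent
        current.packets_sent += delta.packets_sent
        current.packets_error += delta.packets_error
    return merged


existing = {"10.0.0.1": ClientStats(bytes_sent=100, packets_sent=2)}
# Merging the same IP sums (here: doubles) its counters.
print(merge_stats(existing, existing))
# Merging a different IP leaves the existing entry unchanged.
print(merge_stats(existing, {"10.0.0.2": ClientStats(bytes_sent=5, packets_sent=1)}))
```

Under this reading, merging an empty map is a no-op, which matches `MergeEmptyExistingStatsNoopSucceeds`.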
+
+#### 10.2. System Test cases
+
+##### Ondatra
+
+**TestGNPSISubscribe**
+
+This test verifies that a client can successfully establish a basic subscription to the gNPSI (gRPC Network Packet Sampling Interface) service on a device.
+
+**Steps:**
+
+1. Configure DUT: It enables the gNPSI service on the Device Under Test (DUT).
+2. Dial gNPSI: It creates a gNPSI client and establishes a connection to the DUT.
+3. Subscribe: The client sends a Subscribe request to the DUT.
+4. Wait for Response: It waits for an initial response or error from the subscription. The test doesn't expect a data sample, just a successful connection.
+
+**Expected behavior:**
+
+1. The client should successfully create a Subscribe RPC stream with the gNPSI service on the DUT without any errors. The test passes if the connection is established correctly.
+
+**TestClientReceiveSampleSucceed**
+
+This test ensures that a gNPSI client can receive a sampled packet after traffic is sent through the device.
+
+**Steps:**
+
+1. Configure DUT: It enables both sFlow (for packet sampling) and gNPSI on the DUT.
+2. Set Sampling Rate: It configures a high sampling rate on a random interface to ensure packets are captured easily.
+3. Create Client: It starts a gNPSI client and subscribes to the DUT.
+4. Send Traffic: It uses a separate control device to send packets to the monitored interface on the DUT.
+5. Wait for Sample: The test waits for the gNPSI client to receive a sample packet from the DUT.
+
+**Expected behavior:**
+
+1. After traffic is sent, the gNPSI client must receive at least one sFlow sample packet from the DUT within the timeout.
+
+**TestGNPSISubscribeMultipleClients**
+
+This test verifies that the DUT correctly enforces the maximum number of allowed gNPSI clients.
+
+**Steps:**
+
+1. Configure DUT: It enables the gNPSI service on the DUT.
+2. Connect Max Clients: It enters a loop and successfully connects the maximum number of allowed clients (maxClients). Each client establishes and maintains its subscription.
+3. Attempt Extra Connection: After all allowed slots are filled, it attempts to connect one more "extra" client.
+
+**Expected behavior:**
+
+1. The first maxClients clients should connect without issues. The connection attempt from the extra client must fail.
+2. The error returned should specifically indicate that the client limit has been reached.
+
+**TestClientReconnectAfterServiceRestart**
+
+This test checks whether a gNPSI client can gracefully handle a service restart on the DUT and successfully reconnect.
+
+**Steps:**
+
+1. Initial Connection: It connects a client and verifies it receives a packet sample.
+2. Start Traffic: It starts sending continuous traffic in the background to ensure samples are always being generated.
+3. Stop Service: It remotely disables the gNPSI service on the DUT.
+4. Check for Error: It verifies that the client receives an EOF (End-Of-File) error, indicating the server closed the connection.
+5. Restart Service: It re-enables the gNPSI service on the DUT.
+6. Reconnect: A new client is created, and it attempts to subscribe and receive a new sample.
+
+**Expected behavior:**
+
+1. The original client should detect the service shutdown via an error.
+2. After the service is restarted, a new client should be able to connect and start receiving packet samples again.
+
+**TestClientReconnectAfterSwitchReboot**
+
+This test validates the robustness of the gNPSI service across different types of system reboots.
+
+**Steps:**
+
+1. Initial State: It connects a client and confirms it receives a sample packet to ensure the setup is working.
+2. Perform Reboots: The test runs through two scenarios:
+   * NSF Reboot: A "graceful" Non-Stop Forwarding reboot.
+   * Cold Reboot: A complete power-cycle-style reboot.
+3. Post-Reboot Verification: For each reboot type, the test:
+   * Waits for the switch to come back online and stabilize.
+   * Confirms that the sFlow and gNPSI configurations have been correctly reapplied.
+   * Creates a new gNPSI client.
+   * Sends new traffic to the DUT.
+   * Checks that the new client successfully receives a sample packet.
+
+**Expected behavior:**
+
+1. The gNPSI service and its configuration must persist across both NSF and cold reboots.
+2. After the DUT is back online, a new client must be able to connect and receive packet samples successfully.
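The client-limit behavior exercised by `TestGNPSISubscribeMultipleClients` can be sketched as a small registry that rejects subscriptions beyond a configured maximum. This is a hedged Python illustration, not the relay server's actual logic: `ClientRegistry` and `ClientLimitReached` are hypothetical names, and the default of 3 simply mirrors the client limit stated in section 6.

```python
import threading


class ClientLimitReached(Exception):
    """Raised when a subscription would exceed the configured client limit."""


class ClientRegistry:
    """Tracks connected clients and enforces a maximum count."""

    def __init__(self, max_clients=3):
        self._max_clients = max_clients
        self._clients = set()
        self._lock = threading.Lock()  # subscriptions may arrive concurrently

    def add(self, client_id):
        with self._lock:
            if client_id in self._clients:
                return  # already registered, nothing to do
            if len(self._clients) >= self._max_clients:
                raise ClientLimitReached(
                    f"client limit of {self._max_clients} reached")
            self._clients.add(client_id)

    def remove(self, client_id):
        with self._lock:
            self._clients.discard(client_id)

    def count(self):
        with self._lock:
            return len(self._clients)


registry = ClientRegistry(max_clients=3)
for name in ("client-1", "client-2", "client-3"):
    registry.add(name)
try:
    registry.add("client-4")  # the "extra" client from the test
except ClientLimitReached as err:
    print("rejected:", err)
```

Freeing a slot with `remove` lets a later client connect, which is the kind of behavior the reconnect tests rely on after the original stream is closed.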
+
diff --git a/doc/gnpsi/images/image1.png b/doc/gnpsi/images/image1.png
new file mode 100644
index 00000000000..dd9df3796f6
Binary files /dev/null and b/doc/gnpsi/images/image1.png differ
diff --git a/doc/gnpsi/images/image2.png b/doc/gnpsi/images/image2.png
new file mode 100644
index 00000000000..978683dc215
Binary files /dev/null and b/doc/gnpsi/images/image2.png differ
diff --git a/doc/gnpsi/images/image3.png b/doc/gnpsi/images/image3.png
new file mode 100644
index 00000000000..9929ebcda3e
Binary files /dev/null and b/doc/gnpsi/images/image3.png differ
diff --git a/doc/gnpsi/images/image4.png b/doc/gnpsi/images/image4.png
new file mode 100644
index 00000000000..fce6273f960
Binary files /dev/null and b/doc/gnpsi/images/image4.png differ
diff --git a/doc/gnpsi/images/image5.png b/doc/gnpsi/images/image5.png
new file mode 100644
index 00000000000..9704b1f67b6
Binary files /dev/null and b/doc/gnpsi/images/image5.png differ
diff --git a/doc/gnpsi/images/image6.png b/doc/gnpsi/images/image6.png
new file mode 100644
index 00000000000..3be2b830efd
Binary files /dev/null and b/doc/gnpsi/images/image6.png differ
diff --git a/doc/gnpsi/images/image7.png b/doc/gnpsi/images/image7.png
new file mode 100644
index 00000000000..0c463d0f5be
Binary files /dev/null and b/doc/gnpsi/images/image7.png differ