
feature/gpu-hardware-validation #159


Closed
wants to merge 94 commits into from

Conversation

mattdf
Member

@mattdf mattdf commented Mar 21, 2025

closes #102

JannikSt and others added 9 commits March 24, 2025 16:19
* basic interconnect test

* add issue tracker for software & hardware issues 

* introduce a new flag to ignore errors
…ode-wallet

PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed
* Improve the synthetic data validation code to ensure we preserve file structure 

* added temporary S3 access until we're fully decentralized, as the validator needs access to the file SHA mapping
@mattdf mattdf marked this pull request as ready for review March 26, 2025 16:18
@JannikSt JannikSt requested a review from Copilot March 27, 2025 05:28
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds GPU hardware validation functionality by introducing new GPU challenge endpoints and extending task configuration to include port mappings for Docker containers. Key changes include:

  • Addition of the "ports" field in task models and test payloads.
  • Creation of a new module for GPU challenge messages in the shared models.
  • Updates in the Docker service and manager to support port-bindings when launching containers.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Summary per file:

  • orchestrator/src/api/routes/task.rs: Updated test payloads for task creation to include ports.
  • shared/src/models/gpu_challenge.rs: Added new GPU challenge request/response definitions.
  • worker/src/docker/service.rs: Integrated port binding support when starting containers.
  • worker/Cargo.toml: Updated reqwest features.
  • shared/src/models/task.rs: Extended task models with a ports field.
  • orchestrator/src/api/routes/heartbeat.rs: Updated heartbeat test payload to include ports.
  • worker/src/api/routes/mod.rs: Added GPU challenge routes module.
  • shared/src/models/mod.rs: Registered GPU challenge module.
  • worker/src/api/server.rs: Registered new GPU challenge endpoints.
  • worker/src/docker/docker_manager.rs: Updated Docker manager to pass port binding configuration.
Comments suppressed due to low confidence (3)

worker/src/docker/service.rs:162

  • [nitpick] Consider adding an inline comment explaining how the constant BINDABLE_PORTS_START and the 'next_bound_port' variable are used to assign host ports for container bindings.
let mut port_bindings = ::std::collections::HashMap::new();
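For illustration, a commented version of that allocation logic might look like the sketch below; the value of BINDABLE_PORTS_START, the helper name, and the bollard types are assumptions, not the exact code in this PR:

    use std::collections::HashMap;
    use bollard::models::PortBinding;

    // Host ports are handed out sequentially starting at this constant
    // (hypothetical value), one per container port to be exposed.
    const BINDABLE_PORTS_START: u16 = 20000;

    fn build_port_bindings(container_ports: &[String]) -> HashMap<String, Option<Vec<PortBinding>>> {
        let mut port_bindings = HashMap::new();
        // `next_bound_port` tracks the next free host port; each container
        // port is mapped to the current value before the counter advances.
        let mut next_bound_port = BINDABLE_PORTS_START;
        for container_port in container_ports {
            port_bindings.insert(
                container_port.clone(), // e.g. "8000/tcp"
                Some(vec![PortBinding {
                    host_ip: Some("0.0.0.0".to_string()),
                    host_port: Some(next_bound_port.to_string()),
                }]),
            );
            next_bound_port += 1;
        }
        port_bindings
    }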

shared/src/models/task.rs:56

  • [nitpick] Consider adding a documentation comment to clarify the purpose and expected format of the 'ports' field in the task model.
pub ports: Option<Vec<String>>,
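A doc comment along those lines could be as simple as the following sketch (the stated format is an assumption based on Docker's usual "port/protocol" notation, not confirmed by this PR):

    /// Container ports to expose when the task runs in Docker, given in
    /// "container_port/protocol" form (e.g. "8000/tcp"); `None` means the
    /// container exposes no ports.
    pub ports: Option<Vec<String>>,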

worker/src/api/server.rs:58

  • [nitpick] Ensure that the newly added GPU challenge routes are thoroughly covered by integration tests.
.service(gpu_challenge_routes())
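For illustration, an integration test for those routes could follow the standard actix-web test pattern; the route path and handler below are placeholders, not the actual endpoints added in this PR:

    use actix_web::{test, web, App, HttpResponse};

    // Placeholder handler standing in for a real GPU challenge endpoint.
    async fn status() -> HttpResponse {
        HttpResponse::Ok().finish()
    }

    #[actix_web::test]
    async fn gpu_challenge_status_returns_ok() {
        let app = test::init_service(
            App::new().route("/gpu-challenge/status", web::get().to(status)),
        )
        .await;

        let req = test::TestRequest::get()
            .uri("/gpu-challenge/status")
            .to_request();
        let resp = test::call_service(&app, req).await;
        assert!(resp.status().is_success());
    }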

Member

@JannikSt JannikSt left a comment

Great job - can't wait to try it out once I have my storage issues resolved.

As discussed today in person regarding dangling threads: I see multiple tokio::spawn instances that do not provide an explicit way to cancel them.
I suggest using the cancellation token (which we should have in every package) together with a tokio::select! statement to ensure these tasks are properly terminated:

    tokio::spawn(async move {
        let mut retries = 0;
        loop {
            tokio::select! {
                // exit cleanly when the token is cancelled
                _ = cancel_token_clone.cancelled() => break,
                // ... existing work/retry branch elided ...
            }
        }
    });

Member

@manveerxyz manveerxyz left a comment

We should probably update the installation script in /worker/scripts/install.sh to include the setup of Docker and nvidia-ctk, right?

Alternatively, I can note these as prerequisites in the documentation.

@JannikSt
Member

We should probably update the installation script in /worker/scripts/install.sh to include the setup of Docker and nvidia-ctk, right?

Alternatively, I can note these as prerequisites in the documentation.

I would not update the installation script - it should purely install the worker. The software check detects when these are missing, and they are very specific to your setup.

}

impl<'a> HardwareValidator<'a> {
pub fn new(wallet: &'a Wallet, contracts: Arc<Contracts>) -> Self {
Self { wallet, contracts }
let verifier_url = env::var("GPU_VERIFIER_SERVICE_URL")
Member

We usually use parameters for everything right now with the start cmd - we only use env for one key. IMO we should stick to one approach.
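
For example, passing the verifier URL as a start-command parameter instead of an env var could look roughly like the clap sketch below; the flag name, struct, and default value are assumptions:

    use clap::Parser;

    #[derive(Parser)]
    struct Args {
        /// Base URL of the GPU verifier service.
        #[arg(long, default_value = "http://localhost:9010")]
        gpu_verifier_service_url: String,
    }

    fn main() {
        let args = Args::parse();
        println!("verifier url: {}", args.gpu_verifier_service_url);
    }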

continue;
} else {
let mut sessions = self.node_sessions.lock().await;
let session = sessions.get_mut(&node.id).unwrap();
Member

Rather use ? instead of unwrap to prevent panics:
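
For illustration, a drop-in rewrite of the quoted lines (assuming the enclosing function returns a Result and anyhow-style errors are available; the error message is made up):

    let mut sessions = self.node_sessions.lock().await;
    let session = sessions
        .get_mut(&node.id)
        .ok_or_else(|| anyhow::anyhow!("no challenge session found for node {}", node.id))?;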

@@ -54,6 +56,7 @@ pub async fn start_server(
.service(invite_routes())
.service(task_routes())
.service(challenge_routes())
.service(gpu_challenge_routes(cancellation_token.clone()))
Member

I would rather put the cancellation token into the app state
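
For illustration, a minimal sketch of that approach with actix-web (the state struct and field names are assumptions, not the actual server code):

    use actix_web::{web, App, HttpServer};
    use tokio_util::sync::CancellationToken;

    // Hypothetical shared state carrying the token alongside other data.
    struct AppState {
        cancellation_token: CancellationToken,
    }

    #[actix_web::main]
    async fn main() -> std::io::Result<()> {
        let state = web::Data::new(AppState {
            cancellation_token: CancellationToken::new(),
        });

        HttpServer::new(move || {
            // handlers can reach the token via web::Data<AppState>
            App::new().app_data(state.clone())
        })
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
    }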

cancellation_token: CancellationToken,
pub state: Arc<DockerState>,
has_gpu: bool,
pub has_gpu: bool,
Member

Why is this public? Consider rechecking the setup in the command and just sharing the full node config.

count,
benchmark,
},
}
}

pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
Member

Can we break up the function here? It's very long and would be easier to understand when split into multiple functions.
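
For illustration only, the split could follow the phases of the challenge; the helper names and stub types below are hypothetical, just to make the shape concrete:

    // Hypothetical stand-ins for the real types.
    struct DiscoveryNode { id: String }
    type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

    struct HardwareValidator;

    impl HardwareValidator {
        pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
            for node in nodes {
                self.validate_node(&node).await?;
            }
            Ok(())
        }

        // One helper per phase keeps each piece short and easier to test.
        async fn validate_node(&self, node: &DiscoveryNode) -> Result<()> {
            self.start_challenge(node).await?;
            self.poll_challenge_status(node).await
        }

        async fn start_challenge(&self, _node: &DiscoveryNode) -> Result<()> { Ok(()) }
        async fn poll_challenge_status(&self, _node: &DiscoveryNode) -> Result<()> { Ok(()) }
    }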


match session.status {
NodeChallengeStatus::Init | NodeChallengeStatus::Running => {
info!("Node {} challenge is still pending", node.id);
Member

Duplicate log? See further down.

info!("Node {} challenge is still pending", node.id);
continue;
} else {
session.attempts += 1;
Member

Duplicate of the handling in the failed branch.

@JannikSt
Member

As discussed in meeting today, ideally we create a new PR from this one @mattdf

@JannikSt JannikSt closed this Jul 8, 2025
Development

Successfully merging this pull request may close these issues.

Create docker container for running validator challenges
6 participants