
feature/gpu-hardware-validation #159


Closed
wants to merge 94 commits into from

Conversation

mattdf
Member

@mattdf mattdf commented Mar 21, 2025

closes #102

JannikSt and others added 9 commits March 24, 2025 16:19
* basic interconnect test

* add issue tracker for software & hardware issues 

* introduce a new flag to ignore errors
…ode-wallet

PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed
* Improve the synthetic data validation code to ensure we preserve file structure 

* added temporary S3 access until we're fully decentralized, as the validator needs access to the file SHA mapping
@mattdf mattdf marked this pull request as ready for review March 26, 2025 16:18
@JannikSt JannikSt requested a review from Copilot March 27, 2025 05:28
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds GPU hardware validation functionality by introducing new GPU challenge endpoints and extending task configuration to include port mappings for Docker containers. Key changes include:

  • Addition of the "ports" field in task models and test payloads.
  • Creation of a new module for GPU challenge messages in the shared models.
  • Updates in the Docker service and manager to support port-bindings when launching containers.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Summary per file:

  • orchestrator/src/api/routes/task.rs: Updated test payloads for task creation to include ports.
  • shared/src/models/gpu_challenge.rs: Added new GPU challenge request/response definitions.
  • worker/src/docker/service.rs: Integrated port binding support when starting containers.
  • worker/Cargo.toml: Updated reqwest features.
  • shared/src/models/task.rs: Extended task models with a ports field.
  • orchestrator/src/api/routes/heartbeat.rs: Updated heartbeat test payload to include ports.
  • worker/src/api/routes/mod.rs: Added GPU challenge routes module.
  • shared/src/models/mod.rs: Registered GPU challenge module.
  • worker/src/api/server.rs: Registered new GPU challenge endpoints.
  • worker/src/docker/docker_manager.rs: Updated Docker manager to pass port binding configuration.
Comments suppressed due to low confidence (3)

worker/src/docker/service.rs:162

  • [nitpick] Consider adding an inline comment explaining how the constant BINDABLE_PORTS_START and the 'next_bound_port' variable are used to assign host ports for container bindings.
let mut port_bindings = ::std::collections::HashMap::new();
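For illustration, a commented version of that allocation logic might look like the sketch below; the value of BINDABLE_PORTS_START, the helper name, and the bollard types are assumptions, not the exact code in this PR:

    use std::collections::HashMap;
    use bollard::models::PortBinding;

    // Host ports are handed out sequentially starting at this constant
    // (hypothetical value), one per container port to be exposed.
    const BINDABLE_PORTS_START: u16 = 20000;

    fn build_port_bindings(container_ports: &[String]) -> HashMap<String, Option<Vec<PortBinding>>> {
        let mut port_bindings = HashMap::new();
        // `next_bound_port` tracks the next free host port; each container
        // port is mapped to the current value before the counter advances.
        let mut next_bound_port = BINDABLE_PORTS_START;
        for container_port in container_ports {
            port_bindings.insert(
                container_port.clone(), // e.g. "8000/tcp"
                Some(vec![PortBinding {
                    host_ip: Some("0.0.0.0".to_string()),
                    host_port: Some(next_bound_port.to_string()),
                }]),
            );
            next_bound_port += 1;
        }
        port_bindings
    }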

shared/src/models/task.rs:56

  • [nitpick] Consider adding a documentation comment to clarify the purpose and expected format of the 'ports' field in the task model.
pub ports: Option<Vec<String>>,
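A doc comment along those lines could be as simple as the following sketch (the stated format is an assumption based on Docker's usual "port/protocol" notation, not confirmed by this PR):

    /// Container ports to expose when the task runs in Docker, given in
    /// "container_port/protocol" form (e.g. "8000/tcp"); `None` means the
    /// container exposes no ports.
    pub ports: Option<Vec<String>>,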

worker/src/api/server.rs:58

  • [nitpick] Ensure that the newly added GPU challenge routes are thoroughly covered by integration tests.
.service(gpu_challenge_routes())
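For illustration, an integration test for those routes could follow the standard actix-web test pattern; the route path and handler below are placeholders, not the actual endpoints added in this PR:

    use actix_web::{test, web, App, HttpResponse};

    // Placeholder handler standing in for a real GPU challenge endpoint.
    async fn status() -> HttpResponse {
        HttpResponse::Ok().finish()
    }

    #[actix_web::test]
    async fn gpu_challenge_status_returns_ok() {
        let app = test::init_service(
            App::new().route("/gpu-challenge/status", web::get().to(status)),
        )
        .await;

        let req = test::TestRequest::get()
            .uri("/gpu-challenge/status")
            .to_request();
        let resp = test::call_service(&app, req).await;
        assert!(resp.status().is_success());
    }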

Member

@JannikSt JannikSt left a comment

Great job - can't wait to try it out once I have my storage issues resolved.

As discussed today in person regarding dangling threads: I see multiple tokio::spawn instances that do not provide an explicit way to cancel them.
I suggest using the cancellation token (which we should have in every package) together with a tokio::select! statement to ensure these tasks are properly terminated:

    tokio::spawn(async move {
        let mut retries = 0;
        loop {
            tokio::select! {
                // exit cleanly when the token is cancelled
                _ = cancel_token_clone.cancelled() => break,
                // ... existing work/retry branch elided ...
            }
        }
    });

Member

@manveerxyz manveerxyz left a comment

We should probably update the installation script in /worker/scripts/install.sh to include the setup of Docker and nvidia-ctk, right?

Alternatively, I can note these as prerequisites in the documentation.

@JannikSt
Member

We should probably update the installation script in /worker/scripts/install.sh to include the setup of Docker and nvidia-ctk, right?

Alternatively, I can note these as prerequisites in the documentation.

I would not update the installation script - it should purely install the worker. The software check detects when these are missing, and they are very specific to your setup.

}

impl<'a> HardwareValidator<'a> {
pub fn new(wallet: &'a Wallet, contracts: Arc<Contracts>) -> Self {
Self { wallet, contracts }
let verifier_url = env::var("GPU_VERIFIER_SERVICE_URL")
Member

We usually use parameters for everything right now with the start cmd - we only use env for one key. IMO we should stick to one approach.
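
For example, passing the verifier URL as a start-command parameter instead of an env var could look roughly like the clap sketch below; the flag name, struct, and default value are assumptions:

    use clap::Parser;

    #[derive(Parser)]
    struct Args {
        /// Base URL of the GPU verifier service.
        #[arg(long, default_value = "http://localhost:9010")]
        gpu_verifier_service_url: String,
    }

    fn main() {
        let args = Args::parse();
        println!("verifier url: {}", args.gpu_verifier_service_url);
    }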

continue;
} else {
let mut sessions = self.node_sessions.lock().await;
let session = sessions.get_mut(&node.id).unwrap();
Member

Rather use ? instead of unwrap to prevent panics:
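
For illustration, a drop-in rewrite of the quoted lines (assuming the enclosing function returns a Result and anyhow-style errors are available; the error message is made up):

    let mut sessions = self.node_sessions.lock().await;
    let session = sessions
        .get_mut(&node.id)
        .ok_or_else(|| anyhow::anyhow!("no challenge session found for node {}", node.id))?;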

@@ -54,6 +56,7 @@ pub async fn start_server(
.service(invite_routes())
.service(task_routes())
.service(challenge_routes())
.service(gpu_challenge_routes(cancellation_token.clone()))
Member

I would rather put the cancellation token into the app state
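
For illustration, a minimal sketch of that approach with actix-web (the state struct and field names are assumptions, not the actual server code):

    use actix_web::{web, App, HttpServer};
    use tokio_util::sync::CancellationToken;

    // Hypothetical shared state carrying the token alongside other data.
    struct AppState {
        cancellation_token: CancellationToken,
    }

    #[actix_web::main]
    async fn main() -> std::io::Result<()> {
        let state = web::Data::new(AppState {
            cancellation_token: CancellationToken::new(),
        });

        HttpServer::new(move || {
            // handlers can reach the token via web::Data<AppState>
            App::new().app_data(state.clone())
        })
        .bind(("0.0.0.0", 8080))?
        .run()
        .await
    }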

cancellation_token: CancellationToken,
pub state: Arc<DockerState>,
has_gpu: bool,
pub has_gpu: bool,
Member

Why is this public? Consider rechecking the setup in the command and just sharing the full node config.

count,
benchmark,
},
}
}

pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
Member

Can we break up the function here? It's very long and would be easier to understand when split into multiple functions.
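
For illustration only, the split could follow the phases of the challenge; the helper names and stub types below are hypothetical, just to make the shape concrete:

    // Hypothetical stand-ins for the real types.
    struct DiscoveryNode { id: String }
    type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

    struct HardwareValidator;

    impl HardwareValidator {
        pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
            for node in nodes {
                self.validate_node(&node).await?;
            }
            Ok(())
        }

        // One helper per phase keeps each piece short and easier to test.
        async fn validate_node(&self, node: &DiscoveryNode) -> Result<()> {
            self.start_challenge(node).await?;
            self.poll_challenge_status(node).await
        }

        async fn start_challenge(&self, _node: &DiscoveryNode) -> Result<()> { Ok(()) }
        async fn poll_challenge_status(&self, _node: &DiscoveryNode) -> Result<()> { Ok(()) }
    }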


match session.status {
NodeChallengeStatus::Init | NodeChallengeStatus::Running => {
info!("Node {} challenge is still pending", node.id);
Member

Duplicate log? See further down.

info!("Node {} challenge is still pending", node.id);
continue;
} else {
session.attempts += 1;
Member

Duplicate of the handling in the failed branch.

@JannikSt
Member

As discussed in meeting today, ideally we create a new PR from this one @mattdf

@JannikSt JannikSt closed this Jul 8, 2025
Development

Successfully merging this pull request may close these issues.

Create docker container for running validator challenges
6 participants