feature/gpu-hardware-validation #159
Conversation
* …etting Improvement/shm size setting
* feature: ability to disable node ejection
* improve the resiliency of validator
* add improvement to metrics reporting
* …enge very basic validator functionality
* basic interconnect test
* add issue tracker for software & hardware issues
* introduce a new flag to ignore errors
* …ode-wallet PRI-1097: Create generate-node-wallet command to skip generating provider wallet if not needed
* Improve the synthetic data validation code to ensure we preserve file structure
* added a temporary s3 access until we're fully decentralized as the validator needs access to the file sha mapping
Pull Request Overview
This PR adds GPU hardware validation functionality by introducing new GPU challenge endpoints and extending task configuration to include port mappings for Docker containers. Key changes include:
- Addition of the "ports" field in task models and test payloads.
- Creation of a new module for GPU challenge messages in the shared models.
- Updates in the Docker service and manager to support port-bindings when launching containers.
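For illustration, a task-creation payload carrying the new field might look like the following minimal sketch; only the "ports" field is confirmed by this PR, the other field names and values are assumptions:

```rust
fn main() {
    // Hypothetical payload; only "ports" comes from this PR's schema.
    let payload = serde_json::json!({
        "image": "nvidia/cuda:12.2.0-runtime-ubuntu22.04",
        "name": "gpu-challenge",
        "ports": ["8000", "8080"],
    });
    println!("{payload}");
}
```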
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| orchestrator/src/api/routes/task.rs | Updated test payloads for task creation to include ports. |
| shared/src/models/gpu_challenge.rs | Added new GPU challenge request/response definitions. |
| worker/src/docker/service.rs | Integrated port binding support when starting containers. |
| worker/Cargo.toml | Updated reqwest features. |
| shared/src/models/task.rs | Extended task models with a ports field. |
| orchestrator/src/api/routes/heartbeat.rs | Updated heartbeat test payload to include ports. |
| worker/src/api/routes/mod.rs | Added GPU challenge routes module. |
| shared/src/models/mod.rs | Registered GPU challenge module. |
| worker/src/api/server.rs | Registered new GPU challenge endpoints. |
| worker/src/docker/docker_manager.rs | Updated Docker manager to pass port binding configuration. |
Comments suppressed due to low confidence (3)
worker/src/docker/service.rs:162
- [nitpick] Consider adding an inline comment explaining how the constant BINDABLE_PORTS_START and the 'next_bound_port' variable are used to assign host ports for container bindings.
let mut port_bindings = ::std::collections::HashMap::new();
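A minimal sketch of the kind of comment and logic the nitpick asks for; the constant's value, the helper function, and the simplified map type are assumptions, not this PR's code:

```rust
use std::collections::HashMap;

// Hypothetical value; the real BINDABLE_PORTS_START is not shown here.
const BINDABLE_PORTS_START: u16 = 40000;

// Host ports are handed out sequentially starting at BINDABLE_PORTS_START;
// `next_bound_port` tracks the next free host port so that each container
// port receives a unique host-side binding.
fn assign_host_ports(container_ports: &[String]) -> HashMap<String, u16> {
    let mut next_bound_port = BINDABLE_PORTS_START;
    let mut port_bindings = HashMap::new();
    for container_port in container_ports {
        port_bindings.insert(container_port.clone(), next_bound_port);
        next_bound_port += 1;
    }
    port_bindings
}
```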
shared/src/models/task.rs:56
- [nitpick] Consider adding a documentation comment to clarify the purpose and expected format of the 'ports' field in the task model.
pub ports: Option<Vec<String>>,
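One possible doc comment along those lines; the exact entry format is an assumption, not confirmed by the PR:

```rust
/// Container ports to publish when the task's container is launched, as
/// strings (e.g. ["8000", "8080"]); each entry is mapped to a free host
/// port by the worker. `None` means no ports are published.
pub ports: Option<Vec<String>>,
```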
worker/src/api/server.rs:58
- [nitpick] Ensure that the newly added GPU challenge routes are thoroughly covered by integration tests.
.service(gpu_challenge_routes())
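A minimal integration-test sketch for that concern; `gpu_challenge_routes` is the PR's route factory, while the route path here is an assumption:

```rust
use actix_web::{test, App};

#[actix_web::test]
async fn gpu_challenge_route_is_mounted() {
    let app = test::init_service(App::new().service(gpu_challenge_routes())).await;
    // Hypothetical path; use whatever the module actually registers.
    let req = test::TestRequest::get()
        .uri("/gpu-challenge/status")
        .to_request();
    let resp = test::call_service(&app, req).await;
    // A mounted route should not 404, regardless of its payload handling.
    assert_ne!(resp.status(), actix_web::http::StatusCode::NOT_FOUND);
}
```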
great job - can't wait to try it out once I have my storage issues resolved.
As discussed today in person regarding dangling threads: I see multiple tokio::spawn instances that do not provide an explicit way to cancel them.
I suggest using the cancellation token (which we should have in every package) together with the tokio::select! statement to ensure these tasks are properly terminated:
tokio::spawn(async move {
    let mut retries = 0;
    loop {
        tokio::select! {
            _ = cancel_token_clone.cancelled() => break,
            // ... the actual work runs in another branch here, updating `retries`
        }
    }
});
we should probably update the installation script in /worker/scripts/install.sh to include the setup of docker and nvidia-ctk, right?
alternatively I can note these as prerequisites in the documentation
I would not update the installation script - this should purely install the worker. The software check detects when this is missing, and it is very specific to your setup.
impl<'a> HardwareValidator<'a> {
    pub fn new(wallet: &'a Wallet, contracts: Arc<Contracts>) -> Self {
        Self { wallet, contracts }
        let verifier_url = env::var("GPU_VERIFIER_SERVICE_URL")
we usually use parameters for everything right now via the start cmd - we only use env for keys. IMO we should stick to one approach
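One way to support both in a single approach, sketched with clap (assuming its "derive" and "env" features); the struct is illustrative, only the env var name comes from the diff:

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    /// GPU verifier service URL, settable as a CLI flag or via the env var.
    #[arg(long, env = "GPU_VERIFIER_SERVICE_URL")]
    gpu_verifier_service_url: String,
}

fn main() {
    let args = Args::parse();
    println!("verifier: {}", args.gpu_verifier_service_url);
}
```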
            continue;
        } else {
            let mut sessions = self.node_sessions.lock().await;
            let session = sessions.get_mut(&node.id).unwrap();
rather use `?` instead of `unwrap` to prevent panics:
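A sketch of the suggested change, assuming the surrounding function returns an anyhow::Result; the error message is illustrative:

```rust
let mut sessions = self.node_sessions.lock().await;
let session = sessions
    .get_mut(&node.id)
    // Convert the Option into an error instead of panicking.
    .ok_or_else(|| anyhow::anyhow!("no session found for node {}", node.id))?;
```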
@@ -54,6 +56,7 @@ pub async fn start_server(
     .service(invite_routes())
     .service(task_routes())
     .service(challenge_routes())
+    .service(gpu_challenge_routes(cancellation_token.clone()))
I would rather put the cancellation token into the app state
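A sketch of the app-state alternative; `AppState` and the route path are assumptions for illustration:

```rust
use actix_web::{get, web::Data, HttpResponse, Responder};
use tokio_util::sync::CancellationToken;

// Hypothetical shared state registered once at server setup.
struct AppState {
    cancellation_token: CancellationToken,
}

#[get("/gpu-challenge/status")]
async fn status(state: Data<AppState>) -> impl Responder {
    if state.cancellation_token.is_cancelled() {
        return HttpResponse::ServiceUnavailable().finish();
    }
    HttpResponse::Ok().finish()
}

// At startup, instead of threading the token through every route factory:
// App::new()
//     .app_data(Data::new(AppState { cancellation_token: token.clone() }))
//     .service(status)
```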
 cancellation_token: CancellationToken,
 pub state: Arc<DockerState>,
-has_gpu: bool,
+pub has_gpu: bool,
why is this public? consider rechecking the setup in the command and just sharing the full node config
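A rough sketch of what sharing the full config could look like; `NodeConfig` and its fields are hypothetical, not this PR's types:

```rust
use std::sync::Arc;

// Hypothetical: read the node settings once at startup and share them,
// rather than exposing individual fields like `pub has_gpu`.
#[derive(Clone)]
pub struct NodeConfig {
    pub has_gpu: bool,
    // ...other node settings
}

pub struct DockerService {
    config: Arc<NodeConfig>, // replaces the standalone `pub has_gpu: bool`
}
```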
            count,
            benchmark,
        },
    }
}

pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
Can we break up the function here? It's very long and would be easier to understand when split into multiple functions.
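One possible decomposition, purely as a shape suggestion; the helper name and granularity are assumptions:

```rust
impl<'a> HardwareValidator<'a> {
    pub async fn validate_nodes(&self, nodes: Vec<DiscoveryNode>) -> Result<()> {
        for node in nodes {
            self.process_node(&node).await?;
        }
        Ok(())
    }

    // Hypothetical helper: start a session if none exists, then advance it
    // according to its current challenge status.
    async fn process_node(&self, node: &DiscoveryNode) -> Result<()> {
        // ...moved-out body of the per-node handling
        Ok(())
    }
}
```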
match session.status {
    NodeChallengeStatus::Init | NodeChallengeStatus::Running => {
        info!("Node {} challenge is still pending", node.id);
duplicate log? see further down
info!("Node {} challenge is still pending", node.id); | ||
continue; | ||
} else { | ||
session.attempts += 1; |
Duplicate code to the handling in the failed case.
As discussed in the meeting today, ideally we create a new PR from this one @mattdf
closes #102