-
Notifications
You must be signed in to change notification settings - Fork 97
Cloudwatch Agent + DCGM Integration for nvidia-inference test #665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Cloudwatch Agent + DCGM Integration for nvidia-inference test #665
Conversation
| BertInferenceImage string `flag:"bertInferenceImage" desc:"BERT inference container image"` | ||
| InferenceMode string `flag:"inferenceMode" desc:"Inference mode for BERT (throughput or latency)"` | ||
| GpuRequested int `flag:"gpuRequested" desc:"Number of GPUs required for inference"` | ||
| NodeType string `flag:"nodeType" desc:"Instance type for cluster nodes"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does nodeType required in inference test? It is not used anywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Will remove it
| testConfig = TestConfig{ | ||
| InferenceMode: "throughput", | ||
| GpuRequested: 1, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Could you comment above about this is config default value?
- We should maintain all config default value here including
InferenceModeto keep the consistency
| // Render CloudWatch Agent manifest with dynamic dimensions | ||
| renderedCloudWatchAgentManifest, err := manifests.RenderCloudWatchAgentManifest(testConfig.MetricDimensions) | ||
| if err != nil { | ||
| log.Printf("Warning: failed to render CloudWatch Agent manifest: %v", err) | ||
| } | ||
|
|
||
| manifestsList := [][]byte{ | ||
| manifests.NvidiaDevicePluginManifest, | ||
| } | ||
|
|
||
| if len(testConfig.MetricDimensions) > 0 { | ||
| manifestsList = append(manifestsList, manifests.DCGMExporterManifest, renderedCloudWatchAgentManifest) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- manifest render should happened inside the
if len(testConfig.MetricDimensions) > 0 {}after the manifestsList. if the metric dimension not exist, there is no need to do manifest render right? - Can you also revised this in training here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made PR: #667 to revise in nvidia-training.
|
The build failed. you might need to sync the branch to pick up the latest merged changes |
6ac5638 to
e167589
Compare
5ae0cf0 to
1ae622e
Compare
Issue #, if available:
Description of changes:
[DO NOT MERGE until nvidia-training PR is merged] The tests will only pass after the original changes are applied.
This PR contains changes to the
nvidia-inferencetest to deploy dcgm and cloudwatch manifests when enabled by--metricDimensionsflag.It also has standardization for flag formatting and common functions for daemonset deployment similar to
nvidia-trainingBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.