@annapoornanarayan
Contributor

Issue #, if available:

Description of changes:
[DO NOT MERGE until the nvidia-training PR is merged] The tests will pass only after those changes are applied.
This PR changes the nvidia-inference test to deploy the DCGM and CloudWatch manifests when enabled by the --metricDimensions flag.
It also standardizes flag formatting and adds common functions for daemonset deployment, similar to nvidia-training.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

BertInferenceImage string `flag:"bertInferenceImage" desc:"BERT inference container image"`
InferenceMode string `flag:"inferenceMode" desc:"Inference mode for BERT (throughput or latency)"`
GpuRequested int `flag:"gpuRequested" desc:"Number of GPUs required for inference"`
NodeType string `flag:"nodeType" desc:"Instance type for cluster nodes"`
Contributor

Is nodeType required in the inference test? It is not used anywhere.

Contributor Author

Right, it is not used. Will remove it.

Comment on lines +40 to +43
testConfig = TestConfig{
	InferenceMode: "throughput",
	GpuRequested:  1,
}
Contributor

  • Could you add a comment above this block noting that these are the config default values?
  • We should maintain all config default values here, including InferenceMode, to keep things consistent.
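
One way to act on the two points above is to keep every default in a single, commented place. The following is only a sketch: the defaultTestConfig helper is hypothetical, the struct lists just the fields visible in this diff, and the []string type for MetricDimensions is an assumption.

```go
package main

import "fmt"

// TestConfig mirrors only the fields visible in this PR; the real struct
// may carry more. The MetricDimensions type is an assumption.
type TestConfig struct {
	BertInferenceImage string
	InferenceMode      string
	GpuRequested       int
	MetricDimensions   []string
}

// defaultTestConfig is a hypothetical helper that collects all config
// default values in one place so they are easy to find and review.
func defaultTestConfig() TestConfig {
	return TestConfig{
		// Config default values; command-line flags override them.
		InferenceMode: "throughput",
		GpuRequested:  1,
	}
}

func main() {
	cfg := defaultTestConfig()
	fmt.Printf("%s %d\n", cfg.InferenceMode, cfg.GpuRequested) // prints "throughput 1"
}
```

Fields without a meaningful default (for example the image) are left at their zero value here, which keeps the defaults block short and makes it obvious which values the caller must supply.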

Comment on lines 58 to 69
// Render CloudWatch Agent manifest with dynamic dimensions
renderedCloudWatchAgentManifest, err := manifests.RenderCloudWatchAgentManifest(testConfig.MetricDimensions)
if err != nil {
	log.Printf("Warning: failed to render CloudWatch Agent manifest: %v", err)
}

manifestsList := [][]byte{
	manifests.NvidiaDevicePluginManifest,
}

if len(testConfig.MetricDimensions) > 0 {
	manifestsList = append(manifestsList, manifests.DCGMExporterManifest, renderedCloudWatchAgentManifest)
}
Contributor

  • The manifest render should happen inside the if len(testConfig.MetricDimensions) > 0 {} block, after manifestsList is declared. If no metric dimensions are set, there is no need to render the manifest, right?
  • Can you also revise this in the training test here?
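
The restructuring suggested above could look roughly like the sketch below. It is not the PR's actual code: the placeholder manifests, the buildManifests helper, and the stand-in renderCloudWatchAgentManifest (with an assumed []string parameter) are hypothetical and only replace the real manifests package so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder manifests standing in for the real manifests package.
var (
	nvidiaDevicePluginManifest = []byte("nvidia-device-plugin")
	dcgmExporterManifest       = []byte("dcgm-exporter")
)

// renderCloudWatchAgentManifest is a stand-in for
// manifests.RenderCloudWatchAgentManifest; its signature is assumed.
func renderCloudWatchAgentManifest(dims []string) ([]byte, error) {
	if len(dims) == 0 {
		return nil, fmt.Errorf("no metric dimensions provided")
	}
	return []byte("cloudwatch-agent dims=" + strings.Join(dims, ",")), nil
}

// buildManifests (hypothetical) always includes the device plugin and
// renders the CloudWatch Agent manifest only when dimensions are set.
func buildManifests(dims []string) ([][]byte, error) {
	manifestsList := [][]byte{nvidiaDevicePluginManifest}

	if len(dims) > 0 {
		rendered, err := renderCloudWatchAgentManifest(dims)
		if err != nil {
			return nil, fmt.Errorf("render CloudWatch Agent manifest: %w", err)
		}
		manifestsList = append(manifestsList, dcgmExporterManifest, rendered)
	}
	return manifestsList, nil
}

func main() {
	without, err := buildManifests(nil)
	if err != nil {
		panic(err)
	}
	with, err := buildManifests([]string{"InstanceId"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(without), len(with)) // prints "1 3"
}
```

Unlike the snippet under review, this sketch returns the render error instead of logging a warning, so a failed render can never append a nil manifest to the list. Whether the test should fail or only warn in that case is a choice for the authors.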

Contributor Author

Opened PR #667 to revise this in nvidia-training.

@wwvela
Contributor

wwvela commented Aug 6, 2025

The build failed. You might need to sync the branch to pick up the latest merged changes.

@annapoornanarayan force-pushed the nvidia-inference-cw-agent branch from 6ac5638 to e167589 on August 11, 2025 18:32
@annapoornanarayan force-pushed the nvidia-inference-cw-agent branch from 5ae0cf0 to 1ae622e on August 11, 2025 19:37