# Add ROCm libraries benchmark tests #2261
@@ -0,0 +1,128 @@
# Test Framework

Automated benchmark testing framework for ROCm libraries with system detection, results collection, and performance tracking.

## Features

- Automated benchmark execution (rocFFT, rocRAND, rocSOLVER, hipBLASLt)
- Hardware, OS, GPU, and ROCm auto-detection
- Local storage (JSON) and API upload
- LKG (Last Known Good) comparison
- File rotation and configurable logging
- Modular, extensible architecture
## Configuration

```bash
# Required environment variables
export THEROCK_BIN_DIR=/path/to/rocm/bin
export ARTIFACT_RUN_ID=$WORKFLOW_RUN_ID
export AMDGPU_FAMILIES=gfx950-dcgpu
```
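Since the scripts depend on these variables, a fail-fast check before launching any benchmark can save a wasted CI run. The sketch below is illustrative, not part of the framework; only the three variable names are taken from the block above.

```python
import os

# The three names come from the required-variables block above; the check
# itself is an illustrative sketch, not framework code.
REQUIRED_VARS = ("THEROCK_BIN_DIR", "ARTIFACT_RUN_ID", "AMDGPU_FAMILIES")

def missing_vars(env=os.environ):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example: only one of the three variables is set.
print(missing_vars({"THEROCK_BIN_DIR": "/opt/rocm/bin"}))
```

A runner script could call `missing_vars()` at startup and exit with a clear message instead of failing later inside a benchmark binary.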
### Run Benchmarks

```bash
python3 build_tools/github_actions/test_executable_scripts/test_rocfft_benchmark.py
```
## Project Structure

```
build_tools/github_actions/
├── configs/
│   ├── config.yml                  # Main configuration
│   └── benchmarks/                 # Benchmark configs
│
├── test_executable_scripts/        # Benchmark scripts
│   ├── test_rocfft_benchmark.py
│   ├── test_rocrand_benchmark.py
│   ├── test_rocsolver_benchmark.py
│   └── test_hipblaslt_benchmark.py
│
└── utils/                          # Framework utilities
    ├── test_client.py              # Main client API
    ├── config/                     # Configuration management
    ├── system/                     # System detection
    └── results/                    # Results handling
```
## Config Setup

### Main Config: `configs/config.yml`

```yaml
Config:
  Core:
    LogLevel: INFO        # DEBUG, INFO, WARNING, ERROR
    LogToFile: true
    LogDirectory: "./logs"
    UploadTestResultsToAPI: true

  Results:
    OutputDirectory: "./results"
    SaveJSON: true
```
## Architecture

### Core Components

- **TestClient** - Main API for test execution and results management
- **System Detection** - Hardware, OS, and ROCm detection
- **Results Handling** - Local storage and API submission with retry
- **Configuration** - YAML-based config with environment variable expansion

### Test Flow

1. Initialize TestClient → Detect system and load config
2. Run Benchmarks → Execute binary and capture output
3. Parse Results → Extract metrics from log file
4. Upload Results → Submit to API and save locally
5. Compare with LKG → Fetch and compare scores
6. Report Results → Display table and return status
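Step 3 of the flow (extracting metrics from a captured log) can be sketched as below. The log format, case names, and the `GFLOPS` unit here are hypothetical; the real scripts define their own output format and parsing.

```python
import re

# Hypothetical log line format: "<case_name>: <score> <unit>".
# The real benchmark logs may differ; this only illustrates step 3.
METRIC_RE = re.compile(r"^(?P<name>\S+):\s+(?P<score>[0-9.]+)\s+(?P<unit>\S+)$")

def parse_log(text):
    """Extract (name, score, unit) records from a benchmark log, skipping noise."""
    results = []
    for line in text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            results.append({"name": m["name"], "score": float(m["score"]), "unit": m["unit"]})
    return results

sample = """\
rocfft_case_512x512: 1234.5 GFLOPS
some unrelated log noise
rocfft_case_1024x1024: 987.0 GFLOPS
"""
print(parse_log(sample))
```

Returning plain dicts keeps the records easy to serialize as JSON for local storage and API upload in steps 4 and 5.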
## Adding New Benchmarks

### 1. Create Script

```python
from utils import TestClient
from utils.logger import log


def run_benchmarks():
    """Run benchmarks and save output to log file."""
    pass


def parse_results():
    """Parse benchmark results from log file."""
    pass


def main():
    client = TestClient(auto_detect=True)
    client.print_system_summary()
    run_benchmarks()
    test_results, table = parse_results()
    client.upload_results(...)


if __name__ == '__main__':
    main()
```
### 2. Add Config (Optional)

Create `configs/benchmarks/your_benchmark.json`:

```json
{
  "test_cases": ["case1", "case2"]
}
```
## Documentation

- [README](README.md) - Main framework documentation
- [Utils Module](../utils/README.md) - Framework utilities
- [Configuration Guide](config.yml) - Configuration options
|
|
@@ -0,0 +1,25 @@
```json
{
  "input_shapes": [
    "8192 320 320 1",
    "2048 640 640 1",
    "512 1280 1280 1",
    "8192 320 1280 1",
    "512 10240 1280 1",
    "2048 5120 640 1",
    "8192 2560 320 1",
    "512 1280 5120 1",
    "2048 640 2560 1",
    "154 320 768 1",
    "154 1280 768 1",
    "4096 40 4096 16",
    "1024 80 1024 16",
    "1024 80 77 16"
  ],
  "ntinput_shapes": [
    "4096 4096 40 16",
    "1024 1024 80 16",
    "4096 77 40 16",
    "256 77 160 16",
    "1024 77 80 16"
  ]
}
```
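Each shape string above packs four space-separated values (apparently M, N, K, and a batch count) into one test case. A script might turn them into benchmark invocations like this; the binary name, flag names, and field interpretation are assumptions for illustration, not the actual hipBLASLt bench CLI contract.

```python
import json

# A trimmed stand-in for the shapes config above.
config_text = '{ "input_shapes": ["8192 320 320 1", "2048 640 640 1"] }'

config = json.loads(config_text)
commands = []
for shape in config["input_shapes"]:
    # Assumed field order: M N K batch_count (not confirmed by the PR).
    m, n, k, batch = shape.split()
    # Hypothetical command line; real scripts resolve the binary via THEROCK_BIN_DIR.
    commands.append(["hipblaslt-bench", "-m", m, "-n", n, "-k", k, "--batch_count", batch])
print(commands[0])
```

Keeping shapes as strings in JSON lets one splitting rule cover every case, at the cost of the field order living only in convention.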
@@ -0,0 +1,28 @@
```json
{
  "generic": [
    "16777216",
    "14348907",
    "9765625",
    "4096 4096",
    "6561 6561",
    "3125 3125",
    "256 256 256",
    "243 243 243",
    "125 125 125",
    "100 100 100 -t 2 -o",
    "100 100 100 -t 3 -o",
    "200 200 200 -t 2 -o",
    "200 200 200 -t 3 -o",
    "192 192 192 -t 2 -o",
    "192 192 192 -t 3 -o",
    "64 64 64 -t 2 -o",
    "60 -b 1024"
  ],
  "gfx94X-dcgpu": [
    "336 336 56 --double -o"
  ],
  "gfx950-dcgpu": [
    "336 336 56 --double -o"
  ],
  "gfx1151": []
}
```
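This config keys extra cases by GPU family (matching the `AMDGPU_FAMILIES` values used elsewhere in the PR), with a `generic` list for all targets and an empty list meaning "no extras". One plausible selection rule is generic-plus-family-extras; whether the real scripts merge or replace is an assumption here.

```python
import json
import os

# Trimmed stand-in for the per-family config above.
config = json.loads("""
{
  "generic": ["16777216", "4096 4096"],
  "gfx950-dcgpu": ["336 336 56 --double -o"],
  "gfx1151": []
}
""")

def cases_for(family, cfg):
    """Generic cases plus any family-specific extras (unknown families add nothing)."""
    return list(cfg.get("generic", [])) + list(cfg.get(family, []))

print(cases_for(os.environ.get("AMDGPU_FAMILIES", "gfx1151"), config))
```

With this rule, `gfx1151: []` documents that the family was considered and deliberately gets only the generic cases.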
@@ -0,0 +1,33 @@
```yaml
# Test Framework Configuration
# Configuration for logging, API integration, and test execution

Config:
  Core:
    # Logging Configuration
    LogLevel: INFO          # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
    LogToFile: true         # Enable file logging with rotation
    LogDirectory: "./logs"  # Directory for log files
    LogMaxSizeMB: 10        # Maximum log file size before rotation
    LogBackupCount: 5       # Number of backup log files to keep

    # Execution Metadata
    # Environment variables with defaults (format: ${VAR:-default})
    DeployedUser: "${DEPLOYED_USER:-therockbot}"
    ExecutionLabel: "${EXECUTION_LABEL:-therock_pr}"
    CIGroup: "${CI_GROUP:-therock_pr}"

    # Results API Configuration
    UploadTestResultsToAPI: true
    ResultsAPI:
      URL: "${API_URL}"                   # API URL from GitHub secrets
      FallbackURL: "${API_FALLBACK_URL}"  # Fallback API URL from GitHub secrets
      APIKey: "${API_KEY:-}"              # API key from environment (secure)
      Timeout: 30                         # Request timeout in seconds
      MaxRetries: 3                       # Maximum retry attempts
      RetryDelay: 5                       # Delay between retries in seconds

  Results:
    # Local Results Output
    OutputDirectory: "./results"
    SaveJSON: true
```
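The `${VAR:-default}` syntax in the config mirrors shell parameter expansion: use the environment value if set and non-empty, otherwise the default (possibly empty, as in `${API_KEY:-}`). The framework's actual expansion code is not shown in this PR; the following is a minimal sketch of that behavior.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; sketch only, not the framework's implementation.
_VAR_RE = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?::-(?P<default>[^}]*))?\}")

def expand(value, env=os.environ):
    """Replace ${VAR} / ${VAR:-default} references with environment values."""
    def sub(m):
        val = env.get(m["name"])
        if val:  # unset OR empty falls back, like shell ${VAR:-default}
            return val
        return m["default"] if m["default"] is not None else ""
    return _VAR_RE.sub(sub, value)

print(expand("${DEPLOYED_USER:-therockbot}", env={}))
```

Note that a bare `${VAR}` with no default expands to an empty string here when unset; a stricter implementation might raise instead for required secrets like `API_URL`.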
|
|
@@ -62,6 +62,15 @@ def _get_script_path(script_name: str) -> str:
| "platform": ["linux", "windows"], | ||
| "total_shards": 6, | ||
| }, | ||
| "hipblaslt_bench": { | ||
| "job_name": "hipblaslt_bench", | ||
| "fetch_artifact_args": "--blas --tests", | ||
| "timeout_minutes": 60, | ||
| "test_script": f"python {_get_script_path('test_hipblaslt_benchmark.py')}", | ||
| # TODO(lajagapp): Add windows test | ||
| "platform": ["linux"], | ||
| "total_shards": 1, | ||
| }, | ||
|
> **Contributor (on lines +65 to +73):** This may be an issue, as this will get run during each PR, push to main, and scheduled run, probably resulting in long queue times and machine shortages. I would imagine we want to add benchmark tests as a separate (perhaps nightly) run on separate machines. Is this the case? How frequently do we want to run these, and how long do they take?
    # SOLVER tests
    "hipsolver": {
        "job_name": "hipsolver",
|
|
@@ -80,6 +89,15 @@ def _get_script_path(script_name: str) -> str:
        "platform": ["linux"],
        "total_shards": 1,
    },
    "rocsolver_bench": {
        "job_name": "rocsolver_bench",
        "fetch_artifact_args": "--blas --tests",
        "timeout_minutes": 60,
        "test_script": f"python {_get_script_path('test_rocsolver_benchmark.py')}",
        # TODO(lajagapp): Add windows test
        "platform": ["linux"],
        "total_shards": 1,
    },
    # PRIM tests
    "rocprim": {
        "job_name": "rocprim",
|
|
@@ -142,6 +160,15 @@ def _get_script_path(script_name: str) -> str:
        "platform": ["linux", "windows"],
        "total_shards": 1,
    },
    "rocrand_bench": {
        "job_name": "rocrand_bench",
        "fetch_artifact_args": "--rand --tests",
        "timeout_minutes": 60,
        "test_script": f"python {_get_script_path('test_rocrand_benchmark.py')}",
        # TODO(lajagapp): Add windows test
        "platform": ["linux"],
        "total_shards": 1,
    },
    "hiprand": {
        "job_name": "hiprand",
        "fetch_artifact_args": "--rand --tests",
|
|
@@ -160,6 +187,15 @@ def _get_script_path(script_name: str) -> str:
        "platform": ["linux"],
        "total_shards": 1,
    },
    "rocfft_bench": {
        "job_name": "rocfft_bench",
        "fetch_artifact_args": "--fft --rand --tests",
        "timeout_minutes": 60,
        "test_script": f"python {_get_script_path('test_rocfft_benchmark.py')}",
        # TODO(lajagapp): Add windows test
        "platform": ["linux"],
        "total_shards": 1,
    },
    "hipfft": {
        "job_name": "hipfft",
        "fetch_artifact_args": "--fft --rand --tests",
|
|
> **Reviewer:** For the "Add windows test" TODOs, can we open a GitHub issue and link it here, so we can keep track?