Commit a710941

hw-management: thermal: Add retry for ASIC temperature reading

Add a retry for ASIC temperature reading in the sync script. This increases robustness when a single read of the ASIC temperature/input file fails.

Unit tests added: unittest/hw_mgmgt_sync/asic_populate_temperature/test_asic_temp_populate.py

Bug: 4280981 4625429
Signed-off-by: Oleksandr Shamray <[email protected]>
1 parent 3e208b7 commit a710941

5 files changed, +2333 -13 lines changed

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# ASIC Temperature Populate Test Suite

This directory contains comprehensive unit tests for the `asic_temp_populate` function from `hw_management_sync.py`, covering all major functionality, error conditions, and edge cases.

## Features

- **Beautiful colored output** with ASCII icons for terminal compatibility
- **Configurable test iterations** - ALL tests repeat N iterations with random parameter generation
- **Detailed comprehensive reporting** enabled by default
- **🧠 Enhanced intelligent error reporting** with smart analysis, severity classification, and actionable recommendations
- **Sensor read error cleanup** before each test iteration
- **Comprehensive test coverage** of all major scenarios (17+ test scenarios)
- **Hardware-aware testing** with actual ASIC constants (retry counts, temperature limits)
- **Standalone executable** test file
- **Advanced performance metrics and analysis**

## Reporting Features

### Detailed Comprehensive Reporting (Default)
- **Execution Statistics**: Complete breakdown of test results
- **Performance Metrics**: Timing analysis including average, slowest, and fastest tests
- **Test Categories**: Results grouped by test type (normal_operation, error_handling, etc.)
- **Test Coverage**: Statistics on ASIC configurations, temperature ranges, and error conditions tested
- **Input Parameter Analysis**: Detailed analysis of test parameters and their success rates
- **Failure Analysis**: Categorized error patterns and detailed failure information
- **Recommendations**: Intelligent suggestions for test improvements

### Basic Reporting (`--simple` flag)
- Test pass/fail counts
- Overall success rate
- Failed test details with input parameters
- Basic execution time

### Sample Detailed Report Output (Default)
```
================================================================================
[GEAR] COMPREHENSIVE TEST RESULTS REPORT [GEAR]
================================================================================

[STATS] EXECUTION STATISTICS:
Total Tests Run: 33
[+] Passed: 33
[-] Failed: 0
Success Rate: 100.0%

[PERF] PERFORMANCE METRICS:
Average Test Time: 0.021s
Slowest Test: 0.032s
Fastest Test: 0.009s

[COV] TEST COVERAGE:
ASIC Configurations: 1
Temperature Ranges: 2
Error Conditions: 1
File Operations: 2

[REC] RECOMMENDATIONS:
[+] All tests passed! Great job!

================================================================================
```

## Usage

### Basic Execution (Detailed Reporting - Default)
```bash
python3 test_asic_temp_populate.py
```

### With Custom Iterations
```bash
python3 test_asic_temp_populate.py -i 10 # Run 10 iterations per test
```

### With Verbose Output
```bash
python3 test_asic_temp_populate.py -v
```

### With Simple Reporting
```bash
python3 test_asic_temp_populate.py --simple
```

### Combined Options
```bash
python3 test_asic_temp_populate.py -i 10 -v # 10 iterations, verbose, detailed
python3 test_asic_temp_populate.py -i 5 --simple # 5 iterations, simple reporting
```

### Help
```bash
python3 test_asic_temp_populate.py --help
```
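
For orientation, the options above could be declared roughly as follows. This is a minimal sketch, not the test file's actual parser: the long option names `--iterations` and `--verbose` and the helper name `build_parser` are assumptions, and the default of 5 iterations is taken from the wrapper script's help text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="ASIC temperature populate tests")
    # -i: iterations per test; the long name --iterations is an assumption.
    parser.add_argument("-i", "--iterations", type=int, default=5,
                        help="number of iterations for ALL tests")
    # -v: verbose output; the long name --verbose is an assumption.
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="verbose output")
    # -s / --simple: switch from detailed (default) to simple reporting.
    parser.add_argument("-s", "--simple", action="store_true",
                        help="simple reporting (detailed is the default)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-i", "10", "-v"])
print(f"iterations={args.iterations} verbose={args.verbose} simple={args.simple}")
```

Using `store_true` keeps detailed reporting as the default when `--simple` is not given.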

## Test Scenarios

### Core Functionality Tests
1. **Normal Condition Testing** - Tests normal operation when all temperature attribute files are present and readable
2. **Input Read Error Default Values** - Tests behavior when the main temperature input file cannot be read
3. **Input Read Error Retry Logic** - Tests the 3-retry error handling mechanism
4. **Other Attributes Read Error** - Tests behavior when threshold or cooling level files cannot be read
5. **Random ASIC Configuration** - Tests all ASICs with randomized configurations (temperature range 0-800)
6. **SDK Temperature Conversion** - Tests the `sdk_temp2degree()` function
7. **Argument Validation** - Tests that function arguments are properly validated
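
To illustrate the idea behind the retry scenario, here is a minimal sketch of a read that tolerates transient failures. It is an assumption-laden illustration, not the code in `hw_management_sync.py`: the names `read_with_retry`, `RETRY_COUNT`, and `FlakyReader` are hypothetical, and the real script may instead count failures across polling cycles rather than retrying inside one call.

```python
from typing import Callable, Optional

RETRY_COUNT = 3  # assumed to correspond to the "3-retry" mechanism


def read_with_retry(read_fn: Callable[[], str],
                    retries: int = RETRY_COUNT) -> Optional[int]:
    """Call read_fn up to `retries` times; return the parsed int, or None."""
    for _ in range(retries):
        try:
            return int(read_fn().strip())
        except (OSError, ValueError):
            continue  # treat the failure as transient and try again
    return None  # every attempt failed; the caller decides on a default


class FlakyReader:
    """Test double that raises OSError a fixed number of times, then succeeds."""

    def __init__(self, failures: int, value: str) -> None:
        self.failures = failures
        self.value = value
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("simulated read error")
        return self.value
```

A reader that fails twice still yields a value on the third attempt, while one that fails three times exhausts the retries and returns `None`.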

### Advanced Error Handling Tests
8. **Error Handling No Crash** - Tests that the function doesn't crash under various error conditions
9. **ASIC Not Ready Conditions** - Tests behavior when ASIC is not ready (SDK not started)
10. **Invalid Temperature Values** - Tests handling of invalid, non-numeric, or extreme temperature values
11. **Temperature File Write Errors** - Tests behavior when writing output files fails (permissions, disk full, etc.)

### System Integration Tests
12. **Symbolic Link Existing Files** - Tests behavior when thermal output files already exist as symbolic links
13. **ASIC Chipup Completion Logic** - Tests chipup completion counting and asics_init_done logic
14. **ASIC Temperature Reset Functionality** - Tests the asic_temp_reset function behavior
15. **Counter and Logging Mechanisms** - Tests counter increments and logging ID mechanisms
16. **File System Permission Scenarios** - Tests various file system permission and access scenarios
    - **Note**: `/var/run/hw-management/` directory always maintains r/w access (production requirement)
    - Mixed permission errors only affect source files, ready files, and config files
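
Permission scenarios like these can be exercised without touching the real file system by patching `open`. The sketch below shows the general technique with `unittest.mock`; `read_temperature`, its default value of 0, and the path strings are hypothetical stand-ins, not names taken from the suite.

```python
import unittest
from unittest import mock


def read_temperature(path: str, default: int = 0) -> int:
    """Hypothetical reader: return the file's integer value, or `default` on error."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return default


class ReadTemperatureTest(unittest.TestCase):
    def test_permission_error_returns_default(self):
        # PermissionError is a subclass of OSError, so the default is returned.
        with mock.patch("builtins.open", side_effect=PermissionError("denied")):
            self.assertEqual(read_temperature("/hypothetical/source/file"), 0)

    def test_valid_value_is_parsed(self):
        # mock_open serves the given data without creating any file.
        with mock.patch("builtins.open", mock.mock_open(read_data="345\n")):
            self.assertEqual(read_temperature("/hypothetical/source/file"), 345)
```

Because only `builtins.open` is patched inside each `with` block, the patch is undone as soon as the test case finishes.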

## Test Configuration

- **ASIC Count**: 2 ASICs (`asic` and `asic1` refer to the same ASIC, i.e. `asic` === `asic1`)
- **Input Path Template**: `/sys/module/sx_core/asic0/`
- **Output Path**: `/var/run/hw-management/thermal/`
- **Temperature Range**: 0-800 for random testing
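
As a rough illustration of how randomized inputs in this range might be produced per iteration, consider the sketch below. The helper `make_iteration_params`, the `seed` parameter, and the zero-based ASIC indexing are assumptions made for the example; only the 0-800 range and the ASIC count come from the configuration above.

```python
import random
from typing import Dict, List, Optional

TEMP_MIN, TEMP_MAX = 0, 800  # range stated in the test configuration
ASIC_COUNT = 2               # ASIC count stated in the test configuration


def make_iteration_params(iterations: int,
                          seed: Optional[int] = None) -> List[Dict[int, int]]:
    """Return one {asic_index: raw_temperature} mapping per iteration."""
    # A private generator keeps runs reproducible when a seed is supplied.
    rng = random.Random(seed)
    return [
        {idx: rng.randint(TEMP_MIN, TEMP_MAX) for idx in range(ASIC_COUNT)}
        for _ in range(iterations)
    ]


for params in make_iteration_params(3, seed=1):
    print(params)
```

Passing a fixed seed is one way to reproduce a failing iteration; leaving it as `None` gives different values on every run.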

## Output Files Generated

Each ASIC generates the following output files:
- `asic{N}` - Processed temperature value
- `asic{N}_temp_norm` - Constant value
- `asic{N}_temp_crit` - Constant value
- `asic{N}_temp_emergency` - Constant value
- `asic{N}_temp_trip_crit` - Constant value
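
A test can derive the expected file set from this naming pattern. The following is a small sketch under the assumption that the files live directly in the output path given in the test configuration; `expected_output_files` and `SUFFIXES` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import List

OUTPUT_DIR = PurePosixPath("/var/run/hw-management/thermal")
# An empty suffix stands for the processed temperature file itself.
SUFFIXES = ["", "_temp_norm", "_temp_crit", "_temp_emergency", "_temp_trip_crit"]


def expected_output_files(asic_name: str) -> List[str]:
    """Return the output paths expected for one ASIC name such as 'asic1'."""
    return [str(OUTPUT_DIR / f"{asic_name}{suffix}") for suffix in SUFFIXES]


for path in expected_output_files("asic1"):
    print(path)
```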

## Error Handling

The test suite includes **enhanced intelligent error reporting** with:

### 🧠 **Smart Error Analysis**
- **Error Classification**: Automatic categorization (Temperature Processing, File System, ASIC Readiness, etc.)
- **Severity Assessment**: CRITICAL, HIGH, and MEDIUM severity levels with priority recommendations
- **Root Cause Analysis**: Intelligent identification of potential causes based on error patterns

### 🔧 **Actionable Solutions**
- **Fix Recommendations**: Specific, actionable suggestions based on error type and context
- **Hardware Constants Context**: Relevant ASIC constants (75000 mC temperature limits, 3-retry count, etc.)
- **Environmental Context**: Temperature ranges, ASIC configs, and test scenarios when errors occur

### 📊 **Comprehensive Details**
- **Critical Stack Traces**: Highlights the most relevant error lines from full stack traces
- **Input Parameters**: Complete context of parameters that caused the error
- **Performance Impact**: Execution time analysis for failed operations
- **Smart Recommendations**: Pattern-based suggestions for preventing similar errors
- **Crash Recovery**: Automatic continuation after failures with detailed logging
- **Success Rate Calculation**: Statistical analysis of test reliability

## Notes

- **Comprehensive Coverage**: Tests all major code paths, error conditions, and edge cases in `asic_temp_populate`
- **Non-Destructive Testing**: All tests use extensive mocking and don't affect the actual file system
- **Random Parameter Generation**: Each iteration uses different random parameters for thorough testing
- **Error Condition Testing**: Covers all error scenarios including file permissions, invalid data, and system failures
- **Reset Functionality**: Tests ASIC temperature reset logic and counter mechanisms
- **File System Integration**: Tests symbolic links, file permissions, and directory access scenarios
- **Logging and Counters**: Validates logging mechanisms and error counter logic
- **Sensor Cleanup**: Cleans sensor_read_error before each test iteration
- **All tests repeat N iterations**: Each test scenario runs multiple times with different parameters
- **Enterprise Grade**: Comprehensive error reporting and detailed analysis suitable for production use
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
#!/bin/bash
# ASIC Temperature Populate Test Runner
# Simple wrapper script for easy test execution
########################################################################
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# 3. Neither the names of the copyright holders nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# Alternatively, this software may be distributed under the terms of the
# GNU General Public License ("GPL") version 2 as published by the Free
# Software Foundation.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
#

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TEST_SCRIPT="$SCRIPT_DIR/test_asic_temp_populate.py"

echo "[GEAR] ASIC Temperature Populate Test Runner [GEAR]"
echo "======================================================="

if [[ "$1" == "--help" || "$1" == "-h" ]]; then
    echo "Usage: $0 [OPTIONS]"
    echo ""
    echo "Options:"
    echo " -i NUM Number of iterations for ALL tests (default: 5)"
    echo " -v Verbose output"
    echo " -s Simple basic reporting (detailed is default)"
    echo " -h, --help Show this help message"
    echo ""
    echo "Examples:"
    echo " $0 # Run with default 5 iterations (detailed reporting)"
    echo " $0 -i 10 # Run with 10 iterations (detailed reporting)"
    echo " $0 -i 3 -v # Run with 3 iterations and verbose output"
    echo " $0 -i 5 -s # Run with 5 iterations and simple reporting"
    echo " $0 -i 2 -v -s # Run with 2 iterations, verbose, and simple reporting"
    exit 0
fi

# Check if Python 3 is available
if ! command -v python3 &> /dev/null; then
    echo "[FAIL] Python 3 is not installed or not in PATH"
    exit 1
fi

# Check if test script exists
if [[ ! -f "$TEST_SCRIPT" ]]; then
    echo "[FAIL] Test script not found: $TEST_SCRIPT"
    exit 1
fi

# Make sure test script is executable
chmod +x "$TEST_SCRIPT"

# Run the test with all provided arguments
echo "[INFO] Running ASIC Temperature Populate tests..."
echo "-------------------------------------------------------"

python3 "$TEST_SCRIPT" "$@"
exit_code=$?

echo ""
echo "======================================================="
if [[ $exit_code -eq 0 ]]; then
    echo "[PASS] All tests completed successfully!"
else
    echo "[FAIL] Some tests failed. Check output above for details."
fi

exit $exit_code
