43 changes: 41 additions & 2 deletions .github/workflows/generate_matrix_page.yaml
@@ -32,8 +32,12 @@ jobs:
steps:
- name: Set dynamic env vars
run: |
# GPU Operator dashboard paths
echo "DASHBOARD_DATA_FILEPATH=${DASHBOARD_OUTPUT_DIR}/gpu_operator_matrix.json" >> "$GITHUB_ENV"
echo "DASHBOARD_HTML_FILEPATH=${DASHBOARD_OUTPUT_DIR}/gpu_operator_matrix.html" >> "$GITHUB_ENV"
# Network Operator dashboard paths
Why do we want to have the NNO dashboard updated in the same workflow as the GPU operator? But even then, I think it would be better to separate the NNO steps from the GPU ones. E.g. "Set GPU operator env vars", "Set NNO env vars", "Fetch GPU operator CI results", "Fetch NNO CI results", etc.

echo "NNO_DASHBOARD_DATA_FILEPATH=${DASHBOARD_OUTPUT_DIR}/network_operator_matrix.json" >> "$GITHUB_ENV"
echo "NNO_DASHBOARD_HTML_FILEPATH=${DASHBOARD_OUTPUT_DIR}/network_operator_matrix.html" >> "$GITHUB_ENV"
echo "GH_PAGES_BRANCH=${{ github.event.inputs.gh_pages_branch || 'gh-pages' }}" >> "$GITHUB_ENV"
env:
DASHBOARD_OUTPUT_DIR: ${{ env.DASHBOARD_OUTPUT_DIR }}
@@ -67,27 +71,62 @@ jobs:
- name: Install Dependencies
run: |
pip install -r workflows/gpu_operator_dashboard/requirements.txt
pip install -r workflows/nno_dashboard/requirements.txt

- name: Fetch CI Data
run: |
echo "Processing PR: ${{ steps.determine_pr.outputs.PR_NUMBER }}"
# GPU Operator
python -m workflows.gpu_operator_dashboard.fetch_ci_data \
--pr_number "${{ steps.determine_pr.outputs.PR_NUMBER }}" \
--baseline_data_filepath "${{ env.DASHBOARD_DATA_FILEPATH }}" \
--merged_data_filepath "${{ env.DASHBOARD_DATA_FILEPATH }}"
# Network Operator
python -m workflows.nno_dashboard.fetch_ci_data \
--pr_number "${{ steps.determine_pr.outputs.PR_NUMBER }}" \
--baseline_data_filepath "${{ env.NNO_DASHBOARD_DATA_FILEPATH }}" \
--merged_data_filepath "${{ env.NNO_DASHBOARD_DATA_FILEPATH }}"


- name: Generate HTML Dashboard (only if JSON changed)
run: |
cd "${{ env.DASHBOARD_OUTPUT_DIR }}"

# Check if GPU Operator JSON changed
GPU_CHANGED=false
if [[ ${{ github.event_name }} == "pull_request_target" ]] && git diff --exit-code gpu_operator_matrix.json; then
echo "no changes"
echo "GPU Operator: no changes"
else
echo "GPU Operator: changes detected"
GPU_CHANGED=true
fi

# Check if Network Operator JSON changed
NNO_CHANGED=false
if [[ ${{ github.event_name }} == "pull_request_target" ]] && git diff --exit-code network_operator_matrix.json; then
echo "Network Operator: no changes"
else
cd "${{ github.workspace }}"
echo "Network Operator: changes detected"
NNO_CHANGED=true
fi

cd "${{ github.workspace }}"

# Generate GPU Operator dashboard if changed
if [ "$GPU_CHANGED" = true ]; then
echo "Generating GPU Operator dashboard..."
python -m workflows.gpu_operator_dashboard.generate_ci_dashboard \
--dashboard_data_filepath "${{ env.DASHBOARD_DATA_FILEPATH }}" \
--dashboard_html_filepath "${{ env.DASHBOARD_HTML_FILEPATH }}"
fi

# Generate Network Operator dashboard if changed
if [ "$NNO_CHANGED" = true ]; then
echo "Generating Network Operator dashboard..."
python -m workflows.nno_dashboard.generate_ci_dashboard \
--dashboard_data_filepath "${{ env.NNO_DASHBOARD_DATA_FILEPATH }}" \
--dashboard_html_filepath "${{ env.NNO_DASHBOARD_HTML_FILEPATH }}"
fi

- name: Deploy HTML to GitHub Pages
uses: JamesIves/github-pages-deploy-action@v4
43 changes: 40 additions & 3 deletions workflows/gpu_operator_dashboard/fetch_ci_data.py
@@ -92,15 +92,20 @@ class TestResult:
test_status: str
prow_job_url: str
job_timestamp: str
test_flavor: Optional[str] = None # NNO-specific: test configuration flavor

def to_dict(self) -> Dict[str, Any]:
return {
result = {
OCP_FULL_VERSION: self.ocp_full_version,
GPU_OPERATOR_VERSION: self.gpu_operator_version,
"test_status": self.test_status,
"prow_job_url": self.prow_job_url,
"job_timestamp": self.job_timestamp,
}
# Include test_flavor only if it's set (NNO-specific)
if self.test_flavor is not None:
result["test_flavor"] = self.test_flavor
return result

def composite_key(self) -> TestResultKey:
repo, pr_number, job_name, build_id = extract_build_components(self.prow_job_url)
@@ -571,8 +576,15 @@ def merge_ocp_version_results(
bundle_result_limit: Optional[int] = None
) -> Dict[str, Any]:
"""Merge results for a single OCP version."""
# Initialize the structure
merged_version_data = {"notes": [], "bundle_tests": [], "release_tests": [], "job_history_links": []}
# Initialize the structure with all possible fields
merged_version_data = {
"notes": [],
"bundle_tests": [],
"release_tests": [],
"job_history_links": [],
"test_flavors": {}
What is the meaning of test flavors? It looks like you want to incorporate them into "bundle_tests"/"release_tests" instead of having a separate section. E.g.

| OpenShift | GPU Operator | Network Operator |
| --- | --- | --- |
| 4.19.17 | 25.10.1 | 25.3 (eth), 25.3 (infiniband) |

Let's think about what would be the best way to organize the data.

Also, this doesn't seem to belong in "gpu_operator_dashboard". We'll need to change the directory structure. Maybe keep the shared code separately, operator-specific code in the respective directories.

}
# Update with existing data (preserves any additional fields)
merged_version_data.update(existing_version_data)

# Merge bundle tests with limit
@@ -599,6 +611,31 @@
# Convert back to sorted list for JSON serialization
merged_version_data["job_history_links"] = sorted(list(all_job_history_links))

# Merge test_flavors (NNO-specific) if present
new_test_flavors = new_version_data.get("test_flavors", {})
existing_test_flavors = merged_version_data.get("test_flavors", {})

# Merge test flavors by combining results for each flavor
for flavor_name, flavor_data in new_test_flavors.items():
if flavor_name not in existing_test_flavors:
existing_test_flavors[flavor_name] = {"results": [], "job_history_links": set()}

# Merge results for this flavor (using same logic as release_tests)
new_flavor_results = flavor_data.get("results", [])
existing_flavor_results = existing_test_flavors[flavor_name].get("results", [])
existing_test_flavors[flavor_name]["results"] = merge_release_tests(
new_flavor_results, existing_flavor_results
)

# Merge job history links for this flavor
new_flavor_links = flavor_data.get("job_history_links", set())
existing_flavor_links = existing_test_flavors[flavor_name].get("job_history_links", set())
all_flavor_links = set(existing_flavor_links if isinstance(existing_flavor_links, (set, list)) else [])
all_flavor_links.update(new_flavor_links)
existing_test_flavors[flavor_name]["job_history_links"] = sorted(list(all_flavor_links))

merged_version_data["test_flavors"] = existing_test_flavors

return merged_version_data


136 changes: 136 additions & 0 deletions workflows/nno_dashboard/README.md
@@ -0,0 +1,136 @@
# NVIDIA Network Operator Dashboard Workflow

This workflow generates an HTML dashboard showing NVIDIA Network Operator test results across different operator versions and OpenShift versions. It fetches test data from CI systems and creates visual reports for tracking test status over time.

## Overview

The dashboard workflow:
- Fetches test results from Google Cloud Storage based on pull request data
- Supports various network operator test patterns including:
- `nvidia-network-operator-legacy-sriov-rdma`
- `nvidia-network-operator-e2e`
- DOCA-based tests (e.g., `doca4-nvidia-network-operator-*`)
- Merges new results with existing baseline data
- Generates HTML dashboard reports
- Automatically deploys updates to GitHub Pages

## Architecture

This dashboard **reuses** the GPU Operator Dashboard code and only overrides the operator-specific parts:
- ✅ Imports all core logic from `workflows.gpu_operator_dashboard.fetch_ci_data`
- ✅ Overrides only the Network Operator specific parts:
- Regex patterns to match network operator job names
- Artifact paths (`network-operator-e2e/artifacts/`)
- Version field names (`network_operator_version` vs `gpu_operator_version`)
- ✅ Maintains a clean, DRY codebase with minimal duplication

This design makes maintenance easier - bug fixes in the core logic automatically benefit both dashboards.
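The reuse-and-override pattern described above can be sketched as follows. The class and attribute names here are hypothetical stand-ins, not the repo's actual API; the real shared code lives in `workflows/gpu_operator_dashboard/fetch_ci_data.py`:

```python
import re


class CoreFetcher:
    """Stand-in for the shared, operator-agnostic fetch logic."""
    JOB_NAME_REGEX = re.compile(r"nvidia-gpu-operator")
    VERSION_FIELD = "gpu_operator_version"

    def matches(self, job_name: str) -> bool:
        # Core logic stays generic; subclasses swap in their own pattern.
        return bool(self.JOB_NAME_REGEX.search(job_name))


class NnoFetcher(CoreFetcher):
    """Overrides only the operator-specific pieces."""
    JOB_NAME_REGEX = re.compile(r"nvidia-network-operator-(?:e2e|legacy-sriov-rdma)")
    VERSION_FIELD = "network_operator_version"


print(NnoFetcher().matches(
    "pull-ci-rh-ecosystem-edge-nvidia-ci-main-doca4-nvidia-network-operator-e2e"))
```

Because `NnoFetcher` inherits everything else, a bug fix in `CoreFetcher.matches` (or in the real shared module) reaches both dashboards without duplication.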

## Supported Test Patterns

The dashboard recognizes the following test job patterns:
- `pull-ci-rh-ecosystem-edge-nvidia-ci-main-{version}-nvidia-network-operator-legacy-sriov-rdma`
- `pull-ci-rh-ecosystem-edge-nvidia-ci-main-{version}-nvidia-network-operator-e2e`
- `rehearse-{id}-pull-ci-rh-ecosystem-edge-nvidia-ci-main-doca4-nvidia-network-operator-*`

Example URL that will be processed:
```
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/67673/rehearse-67673-pull-ci-rh-ecosystem-edge-nvidia-ci-main-doca4-nvidia-network-operator-legacy-sriov-rdma/1961127149603655680/
```

## Usage

### Prerequisites

```console
pip install -r workflows/nno_dashboard/requirements.txt
```

**Important:** Before running fetch_ci_data.py, create the baseline data file and initialize it with an empty JSON object if it doesn't exist:

```console
echo '{}' > nno_data.json
```

### Fetch CI Data

```console
# Process a specific PR
python -m workflows.nno_dashboard.fetch_ci_data --pr_number "123" --baseline_data_filepath nno_data.json --merged_data_filepath nno_data.json

# Process all merged PRs - limited to 100 most recent (default)
python -m workflows.nno_dashboard.fetch_ci_data --pr_number "all" --baseline_data_filepath nno_data.json --merged_data_filepath nno_data.json

# Process with bundle result limit (keep only last 50 bundle tests per version)
python -m workflows.nno_dashboard.fetch_ci_data --pr_number "all" --baseline_data_filepath nno_data.json --merged_data_filepath nno_data.json --bundle_result_limit 50
```

### Generate Dashboard

```console
python -m workflows.nno_dashboard.generate_ci_dashboard --dashboard_data_filepath nno_data.json --dashboard_html_filepath nno_dashboard.html
```

The dashboard generator also **reuses** the GPU Operator dashboard code:
- Imports all HTML generation logic from `workflows.gpu_operator_dashboard.generate_ci_dashboard`
- Uses Network Operator specific templates (in `templates/` directory)
- Only aliases `NETWORK_OPERATOR_VERSION` as `GPU_OPERATOR_VERSION` for compatibility
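The aliasing can be pictured with a minimal stand-in. Only the two constant names mirror the repo; the helper and sample data are hypothetical:

```python
# Shared generator code looks up the operator version via one constant:
GPU_OPERATOR_VERSION = "gpu_operator_version"


def cell_text(result: dict, version_key: str = GPU_OPERATOR_VERSION) -> str:
    # Render one dashboard cell from a single test result.
    return f"{result['ocp_full_version']} / {result[version_key]}"


# The NNO dashboard points its own field name at the same hook:
NETWORK_OPERATOR_VERSION = "network_operator_version"
nno_result = {"ocp_full_version": "4.16.0", NETWORK_OPERATOR_VERSION: "24.10.0"}
print(cell_text(nno_result, version_key=NETWORK_OPERATOR_VERSION))
```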

### Running Tests

First, make sure `pytest` is installed. Then, run:

```console
python -m pytest workflows/nno_dashboard/tests/ -v
```

## GitHub Actions Integration

- **Automatic**: Processes merged pull requests to update the dashboard with new test results and deploys to GitHub Pages
- **Manual**: Can be triggered manually via GitHub Actions workflow dispatch

## Data Structure

The fetched data follows this structure:

```json
{
"doca4": {
"notes": [],
"bundle_tests": [
{
"ocp_full_version": "4.16.0",
"network_operator_version": "24.10.0",
"test_status": "SUCCESS",
"prow_job_url": "https://...",
"job_timestamp": "1234567890"
}
],
"release_tests": [...],
"job_history_links": [
"https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/..."
]
}
}
```
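A consumer of this JSON can walk the structure in a few lines. This sketch uses only the sample keys shown above; real data files may carry additional fields:

```python
import json

raw = """
{
  "doca4": {
    "notes": [],
    "bundle_tests": [
      {
        "ocp_full_version": "4.16.0",
        "network_operator_version": "24.10.0",
        "test_status": "SUCCESS",
        "prow_job_url": "https://...",
        "job_timestamp": "1234567890"
      }
    ],
    "release_tests": [],
    "job_history_links": []
  }
}
"""

data = json.loads(raw)
# Flatten into (ocp_prefix, operator_version, status) rows:
rows = [
    (prefix, r["network_operator_version"], r["test_status"])
    for prefix, version_data in data.items()
    for r in version_data["bundle_tests"]
]
print(rows)
```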

## Troubleshooting

### No data being fetched

1. Verify the PR number exists and has network operator test runs
2. Check that the job names match the expected patterns (see the regex in fetch_ci_data.py, lines 36-40)
3. Ensure the test artifacts contain the required files:
- `finished.json`
- `network-operator-e2e/artifacts/ocp.version`
- `network-operator-e2e/artifacts/operator.version`

### Regex pattern not matching

The regex pattern is designed to match:
- Repository: `rh-ecosystem-edge_nvidia-ci` or `openshift_release` (for rehearse jobs)
- OCP version prefix: Can be `doca4`, `nno1`, or other custom prefixes
- Job suffix: Must contain `nvidia-network-operator` followed by test type

If your job names don't match, you may need to adjust the `TEST_RESULT_PATH_REGEX` pattern in `fetch_ci_data.py`.
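For illustration, a pattern along these lines would cover the three job-name shapes listed above. It is a hypothetical simplification; consult the actual `TEST_RESULT_PATH_REGEX` in `fetch_ci_data.py` for the authoritative pattern:

```python
import re

# Simplified sketch: optional rehearse prefix, an OCP version prefix
# (e.g. "doca4" or "4.17"), and the network-operator test suffix.
pattern = re.compile(
    r"(?:rehearse-\d+-)?pull-ci-rh-ecosystem-edge-nvidia-ci-main-"
    r"(?P<version>[\w.]+)-nvidia-network-operator-(?P<test_type>[\w-]+)"
)

m = pattern.fullmatch(
    "rehearse-67673-pull-ci-rh-ecosystem-edge-nvidia-ci-main-"
    "doca4-nvidia-network-operator-legacy-sriov-rdma"
)
print(m.group("version"), m.group("test_type"))
```

Note that `[\w.]+` cannot cross a hyphen, which is what stops the `version` group at `doca4` before the `-nvidia-network-operator-` literal.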
