Skip to content

Conversation

@NirWolfer
Copy link
Contributor

Description

Today we have a few CI pipelines that still use nodeLabel: 'master' in their matrix file, causing some steps to run on the master node.

This is a security risk and we should run only on containers, the master node will be taken down in the future not allowing this type of run.

What

Replace master agent definition with docker definition

Why ?

HPCINFRA-3388

How ?

It is optional but for complex PRs please provide information about the design,
architecture, approach, etc.

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

@NirWolfer NirWolfer marked this pull request as ready for review October 22, 2025 14:04
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates the weekly disaster-recovery CI pipeline from executing on the Jenkins master node to running inside Kubernetes-managed containers. The change removes runs_on_agents: master and replaces it with a kubernetes configuration block (cloud, namespace, resource limits/requests, architecture-specific node selectors) plus a runs_on_dockers entry pointing to quay.io/podman/stable:v5.0.2. The agentSelector: master directives are removed from both the "Determine release_tag" and "pipeline_stop" steps. This aligns with the existing CI framework documented in .ci/README.md, which describes a matrix-driven approach where all steps run in Docker containers defined in the runs_on_dockers catalog.

Potential Issues

  1. Git command execution in container: Line 43 runs git describe --tags --abbrev=0 to determine the release tag. The quay.io/podman/stable:v5.0.2 image may not have git installed, which would cause the step to fail silently or error out. You should verify that the container includes git or switch to an image that does (similar to how other pipelines use purpose-built CI images from harbor.mellanox.com).

  2. Jenkins API access: Line53 uses Jenkins.instance.getItemByFullName('libxlio/LibXLIO-release') to retrieve job metadata. This requires the container to have access to Jenkins internals, which may not be possible from a standard Podman container. Typically, Jenkins API calls require the script to run with elevated privileges or on the Jenkins master. This could break the "pipeline_stop" step.

  3. Resource limits may be too restrictive: The limits are set to 256Mi memory and 100m CPU. If the Groovy script performs any non-trivial operations (parsing Jenkins job metadata, running git commands, etc.), these limits may be insufficient and cause OOM kills or throttling. Consider profiling the actual resource usage or aligning with other pipelines that use more generous limits.

  4. Inconsistency with other CI files: Other matrix files in .ci/ (e.g., matrix_job.yaml, release_matrix_job.yaml) use purpose-built Docker images from harbor.mellanox.com with pre-installed toolchains and the swx-jenkins user. Using a generic Podman image here diverges from that pattern and may introduce environment inconsistencies or credential issues.

  5. No jnlpImage specified for the Docker step: The kubernetes section defines jnlpImage: 'harbor.mellanox.com/toolbox/c3po-jnlp:latest' for the agent, but the runs_on_dockers entry does not specify a separate JNLP sidecar. If the pipeline framework expects each Docker step to have its own JNLP container, this may cause connectivity issues.

Confidence Score: 2/5

The change correctly addresses the security requirement to move off the master node, but there are multiple execution risks (git availability, Jenkins API access, resource limits) that could cause the pipeline to fail. The use of a generic Podman image instead of a purpose-built CI image is inconsistent with the rest of the .ci/ directory and suggests the change may not have been fully tested in the target environment.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@NirWolfer
Copy link
Contributor Author

NirWolfer commented Oct 22, 2025

Swapped image to harbor.mellanox.com/toolbox/c3po-jnlp:latest to have git
Increased resources for the container running this pipeline

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates the disaster-recovery (DR) weekly CI pipeline from executing on the Jenkins master node to running in Kubernetes containers. The runs_on_agents section with nodeLabel: 'master' has been replaced with a runs_on_dockers configuration that deploys a c3po-jnlp Jenkins agent image in a Kubernetes pod (namespace runai-hpcx-nightly-prod-sw, cloud kubernetes). Additionally, all agentSelector references pointing to the "base" category have been removed from pipeline steps, since the unified Docker-based approach no longer requires explicit agent selection. This change eliminates the security risk of running CI steps directly on the master node, which is scheduled for decommissioning per HPCINFRA-3388.

Important Files Changed

Filename Score Overview
.ci/pipeline/dr_weekly_matrix.yaml 4/5 Replaced master node agent with Kubernetes pod running c3po-jnlp container and removed agentSelector references from pipeline steps

Confidence score: 4/5

  • This PR is safe to merge with careful attention to the container image URL
  • Score reflects a straightforward configuration migration with clear intent, but relies on an external dependency (the c3po-jnlp image) and a previous review noted the Harbor URL may not work during disaster-recovery scenarios requiring geoDNS URLs
  • Pay close attention to the Docker image reference at line 17 to ensure it resolves correctly in both normal operations and disaster-recovery situations

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant Kubernetes
    participant GroovyContainer
    participant ReleaseJob as LibXLIO-release Job
    participant Email as Email System

    User->>Jenkins: Trigger LIBXLIO-dr-weekly-launcher
    Jenkins->>Kubernetes: Request pod with groovy container
    Kubernetes->>GroovyContainer: Deploy nbu-harbor.gtm.nvidia.com/toolbox/c3po-jnlp:latest
    
    alt MAIL_TO parameter provided
        GroovyContainer->>Email: Send "weekly DR build started" email
    end
    
    GroovyContainer->>GroovyContainer: Determine release_tag from params or git
    GroovyContainer->>Jenkins: Check if LibXLIO-release job is enabled
    
    alt Job is disabled
        GroovyContainer->>ReleaseJob: Enable job
    end
    
    GroovyContainer->>ReleaseJob: Trigger build with parameters (release_tag, release_folder, notification_email, sha1)
    ReleaseJob-->>GroovyContainer: Return build result
    
    alt Build failed
        GroovyContainer->>Jenkins: Set currentBuild.result = 'FAILURE'
    end
    
    alt MAIL_TO parameter provided
        GroovyContainer->>Email: Send "weekly DR build ended" email with result
    end
    
    Jenkins-->>User: Build complete
Loading

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates the weekly disaster-recovery (DR) CI pipeline away from the Jenkins master node to container-based execution for security compliance (HPCINFRA-3388). The changes delete obsolete JJB and matrix files that were tied to the master node, add a new do_dr_build parameter to the release job to optionally trigger DR builds from the containerized pipeline, update the release matrix to auto-determine release tags for DR runs, and ensure the release script checks out tags explicitly when running in ephemeral containers. The net effect is that all DR and release workflows now execute exclusively in Docker containers instead of the persistent master environment, which is scheduled for decommission.

Important Files Changed

Filename Score Overview
.ci/pipeline/dr_weekly_jjb.yaml 5/5 Deleted entire JJB template that ran daily DR launcher on master node
.ci/pipeline/dr_weekly_matrix.yaml 4/5 Removed weekly DR matrix file with master nodeLabel and job orchestration logic
.ci/pipeline/release_jjb.yaml 5/5 Added new boolean parameter do_dr_build to release job template
.ci/pipeline/release_matrix_job.yaml 3/5 Updated jnlpImage registry and added conditional release_tag determination for DR builds
.ci/do_release.sh 4/5 Added explicit git checkout of release tag before version extraction

Confidence score: 3/5

  • This PR requires careful review as critical DR automation logic is being restructured with potential gaps in coverage.
  • Score reflects the deletion of substantial orchestration and notification logic in dr_weekly_matrix.yaml without clear evidence it was recreated elsewhere, the use of a harbor registry URL that a previous reviewer flagged as incompatible with DR failover scenarios (needs geoDNS URL), and the missing error handling in do_release.sh for the new git checkout command. The auto-tag determination logic in release_matrix_job.yaml also risks picking the wrong tag if the repository state is unexpected.
  • Pay close attention to .ci/pipeline/release_matrix_job.yaml (registry URL must support geoDNS for DR), dr_weekly_matrix.yaml (verify email notifications and job orchestration are preserved elsewhere), and .ci/do_release.sh (add error handling for tag checkout).

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant Git
    participant Docker
    participant NFS as "NFS Release Storage"
    
    User->>Jenkins: "Trigger release job with release_tag, revision, release_folder"
    Jenkins->>Jenkins: "Execute pipeline_start (set displayName, check do_dr_build)"
    Jenkins->>Docker: "Start privileged container (rhel8.6)"
    Docker->>Git: "git clone libdpcp repository"
    Git-->>Docker: "libdpcp source code"
    Docker->>Docker: "Build libdpcp (autogen, configure, make install)"
    Docker->>Git: "git checkout tags/${release_tag}"
    Git-->>Docker: "Repository at release_tag"
    Docker->>Docker: "Parse version from configure.ac"
    Docker->>Docker: "Validate release_tag matches configure.ac version"
    Docker->>Docker: "Build source RPM and tarball (build_pkg.sh -s -t)"
    Docker->>Docker: "Check if do_release=true"
    alt do_release is true
        Docker->>NFS: "Create destination directory ${release_folder}/${release_tag}"
        Docker->>NFS: "Copy ${pkg_name} to ${DST_DIR}"
        Docker->>NFS: "Copy ${tarball_name} to ${DST_DIR}"
        Docker->>NFS: "Create symlink to ${pkg_name}"
        NFS-->>Docker: "Release artifacts stored"
    else do_release is false
        Docker->>Docker: "Skip package release"
    end
    Docker-->>Jenkins: "Build artifacts (RPM, logs)"
    Jenkins->>Jenkins: "Archive artifacts (build_pkg.log, packages/*.rpm)"
    Jenkins->>Jenkins: "Execute pipeline_stop"
    Jenkins->>User: "Send email notification with build status"
Loading

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates the libxlio CI pipelines from running on the Jenkins master node to running exclusively in Docker containers for security compliance. The key changes consolidate the disaster-recovery (DR) weekly pipeline into the main release pipeline by deleting the standalone dr_weekly_jjb.yaml and dr_weekly_matrix.yaml files and adding a do_dr_build boolean parameter to release_jjb.yaml. When do_dr_build=true, the release pipeline now auto-detects the Git tag via git describe in release_matrix_job.yaml and adjusts email subjects accordingly. The do_release.sh script was updated to explicitly checkout the release tag before version extraction, ensuring the packaged code matches the tagged release. The JNLP agent image reference was updated to NVIDIA's internal registry (nbu-harbor.gtm.nvidia.com).

Important Files Changed

Filename Score Overview
.ci/pipeline/dr_weekly_jjb.yaml 1/5 Completely deleted the standalone disaster-recovery weekly pipeline job template that ran on master node
.ci/pipeline/dr_weekly_matrix.yaml 1/5 Completely removed the weekly DR pipeline orchestration matrix with no replacement in this PR
.ci/pipeline/release_jjb.yaml 4/5 Added do_dr_build boolean parameter to enable dual-purpose release/DR pipeline behavior
.ci/pipeline/release_matrix_job.yaml 2/5 Migrated to new container registry, added conditional tag detection for DR builds, and customized email subjects
.ci/do_release.sh 2/5 Added explicit git tag checkout before version extraction to ensure correct code is packaged

Confidence score: 2/5

  • This PR has significant issues that could break CI functionality and requires careful review before merging.
  • Score reflects: (1) complete deletion of DR weekly pipeline files without visible replacement, (2) unguarded parameter references in Groovy that will throw exceptions if do_dr_build is undefined, (3) unvalidated git checkout operations that can fail silently, (4) inconsistent handling of the new container registry URL (prior review noted geoDNS issues for DR scenarios), and (5) missing error handling for the git describe command that could set empty release tags.
  • Pay close attention to release_matrix_job.yaml (lines 58-60, 72) and do_release.sh (line47) - these contain unguarded operations that will fail at runtime. Verify that the DR weekly functionality has truly been moved elsewhere or document the intentional removal.

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant Git
    participant Docker
    participant Kubernetes
    participant BuildSystem
    participant ReleaseFolder
    participant Email

    User->>Jenkins: Trigger release job (with release_tag, revision, do_release params)
    Jenkins->>Jenkins: Execute pipeline_start (set displayName, get tag if DR build)
    Jenkins->>Git: Clone repository at specified tag/commit
    Jenkins->>Kubernetes: Request privileged pod (8Gi memory, 7 CPU)
    Kubernetes->>Docker: Spin up RHEL 8.6 container
    
    Docker->>Git: Clone libdpcp repository
    Docker->>BuildSystem: Run autogen.sh, configure, make install for libdpcp
    
    Docker->>Git: Checkout release tag
    Docker->>BuildSystem: Parse version from configure.ac
    Docker->>BuildSystem: Execute build_pkg.sh to create src.rpm and tarball
    Docker->>Docker: Validate release_tag matches configure.ac version
    
    alt do_release is true
        Docker->>ReleaseFolder: Create destination directory as swx-jenkins user
        Docker->>ReleaseFolder: Copy src.rpm to release folder
        Docker->>ReleaseFolder: Copy tarball to release folder
        Docker->>ReleaseFolder: Create symbolic link to package
    else do_release is false
        Docker->>Docker: Skip package release
    end
    
    Docker->>Jenkins: Archive build artifacts (build_pkg.log, *.rpm)
    Jenkins->>Jenkins: Execute pipeline_stop
    
    alt notification_email is set
        Jenkins->>Email: Send build completion notification with status
    end
    
    Jenkins->>User: Report build result
Loading

Additional Comments (1)

  1. .ci/pipeline/dr_weekly_matrix.yaml

    logic: the entire file was deleted - this completely breaks the weekly DR (disaster recovery) pipeline. The referenced Jira ticket HPCINFRA-3388 only mentions migrating from master nodes to containers, not removing the pipeline entirely. Unless this functionality was moved to a different file, the weekly DR builds will no longer run. Was this functionality moved to another file (e.g., merged into another matrix file), or is the weekly DR pipeline intentionally being removed?

5 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates libxlio's disaster-recovery (DR) weekly and release pipelines from master-node execution to container-based execution, addressing security requirements in HPCINFRA-3388. The changes delete two legacy DR pipeline files (dr_weekly_jjb.yaml, dr_weekly_matrix.yaml), consolidate DR and release workflows by adding a do_dr_build parameter, and modify the release matrix and JJB configurations to run exclusively in Docker containers. The do_release.sh script now explicitly checks out the release tag before version validation, making it self-contained for ephemeral container environments. These changes align with the broader .ci/ infrastructure documented in the repository, which uses Jenkins Job Builder (JJB) templates, matrix YAML files, and Docker images to define reproducible CI/CD pipelines.

Important Files Changed

Filename Score Overview
.ci/pipeline/dr_weekly_jjb.yaml 4/5 Entire JJB template for weekly DR launcher deleted; functionality must be reimplemented elsewhere
.ci/do_release.sh 3/5 Adds explicit git checkout of release tag before version extraction, but lacks error handling for invalid tags
.ci/pipeline/dr_weekly_matrix.yaml 4/5 Complete removal of weekly DR matrix definition that ran on master node
.ci/pipeline/release_matrix_job.yaml 3/5 Updates harbor registry URL and adds conditional git-describe logic for DR builds, but missing error handling
.ci/pipeline/release_jjb.yaml 2/5 Changes default sha1 to 'vNext' and adds do_dr_build parameter, breaking manual release workflow expectations

Confidence score: 2/5

  • This PR introduces breaking changes to the manual release workflow and has multiple unhandled edge cases that could cause silent failures in production pipelines
  • Score reduced because: (1) the sha1 default change in release_jjb.yaml diverges from documented behavior and will cause manual releases to check out the wrong commit unless users override parameters; (2) no error handling for git operations (git checkout, git describe) that can fail silently; (3) harbor registry URL may not work in disaster-recovery scenarios (per previous review feedback); (4) missing validation for do_dr_build parameter existence
  • Pay close attention to .ci/pipeline/release_jjb.yaml (breaking default change), .ci/do_release.sh (needs tag validation), and .ci/pipeline/release_matrix_job.yaml (needs error handling for git describe failure)

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant Git
    participant Docker
    participant Build System
    participant NFS Release Folder
    participant Email Service

    User->>Jenkins: Trigger release job with parameters (release_tag, revision, do_release)
    Jenkins->>Jenkins: Execute pipeline_start (update display name)
    
    alt do_dr_build is true
        Jenkins->>Git: Get latest tag (git describe)
        Git-->>Jenkins: Return tag
    end
    
    Jenkins->>Docker: Start privileged Kubernetes pod (rhel8.6 container)
    Docker-->>Jenkins: Container ready
    
    Jenkins->>Build System: Execute Build-dpcp step
    Build System->>Git: Clone libdpcp repository
    Git-->>Build System: Repository cloned
    Build System->>Build System: ./autogen.sh && ./configure --prefix=/usr
    Build System->>Build System: sudo make install
    
    Jenkins->>Build System: Execute Release step (.ci/do_release.sh)
    Build System->>Git: Checkout release tag
    Build System->>Build System: Parse configure.ac for version
    Build System->>Build System: Validate tag matches version
    Build System->>Build System: Build packages (contrib/build_pkg.sh -s -t)
    
    alt do_release is true
        Build System->>NFS Release Folder: Create destination directory
        Build System->>NFS Release Folder: Copy src.rpm package
        Build System->>NFS Release Folder: Copy tarball
        Build System->>NFS Release Folder: Create symbolic link
        NFS Release Folder-->>Build System: Packages released
    end
    
    Build System-->>Jenkins: Packages built and released
    Jenkins->>Jenkins: Archive artifacts (build_pkg.log, *.rpm)
    
    Jenkins->>Jenkins: Execute pipeline_stop
    
    alt notification_email is not empty
        Jenkins->>Email Service: Send notification email with build status
        Email Service-->>User: Email delivered
    end
    
    Jenkins-->>User: Release job completed
Loading

5 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR migrates the XLIO CI pipelines to eliminate execution on the Jenkins master node, addressing security requirements per HPCINFRA-3388. The changes remove two legacy weekly disaster-recovery (DR) job definitions (dr_weekly_jjb.yaml, dr_weekly_matrix.yaml) that relied on nodeLabel: master, consolidate DR functionality into the existing release pipeline by introducing a do_dr_build parameter, migrate the JNLP agent image to the new nbu-harbor.gtm.nvidia.com registry, and add an explicit git checkout tags/${release_tag} in do_release.sh to ensure version extraction occurs from the correct commit. The DR and release workflows are now unified in release_matrix_job.yaml and release_jjb.yaml, with conditional logic to distinguish between DR builds (which auto-detect the latest tag via git describe) and manual release builds (which use the release_tag parameter).

Important Files Changed

Filename Score Overview
.ci/pipeline/dr_weekly_jjb.yaml 5/5 Deleted the JJB template for the weekly DR job; functionality consolidated elsewhere
.ci/pipeline/dr_weekly_matrix.yaml 4/5 Deleted the matrix definition for weekly DR launcher; orchestration logic must now exist in other files
.ci/pipeline/release_matrix_job.yaml 3/5 Updated JNLP registry, added conditional logic for DR builds, customized email subjects
.ci/pipeline/release_jjb.yaml 2/5 Added do_dr_build parameter but changed default branch to 'vNext', breaking release workflow expectations
.ci/do_release.sh 4/5 Added explicit checkout of release tag to ensure version validation reads from correct commit

Confidence score: 2/5

  • This PR introduces significant risk due to the release_jjb.yaml changes that alter the default checkout behavior for release builds, potentially breaking manual release workflows if not carefully coordinated.
  • Score reflects three critical unresolved issues: (1) the default sha1 change to 'vNext' in release_jjb.yaml will cause release builds to checkout the wrong branch by default, contradicting the parameter description; (2) the geoDNS harbor URL concern raised in previous review remains unaddressed, which will break DR builds in PDC; (3) the complete removal of dr_weekly_matrix.yaml without visible replacement logic leaves uncertainty about whether email notifications, job orchestration, and release-tag determination have been properly migrated to the consolidated pipeline.
  • Pay close attention to .ci/pipeline/release_jjb.yaml (verify whether 'vNext' default is intentional and whether conditional logic is missing) and .ci/pipeline/release_matrix_job.yaml (confirm the harbor URL works in all environments and that the git describe error handling is sufficient for empty repositories or missing tags).

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant KubernetesPod as "Kubernetes Pod<br/>(il-ipp-blossom-prod)"
    participant DockerContainer as "Docker Container<br/>(rhel8.6)"
    participant GitHub
    participant ReleaseFolder as "Release Folder<br/>(/auto/sw/release)"
    
    User->>Jenkins: "Trigger release job<br/>(release_tag, revision)"
    Jenkins->>GitHub: "Checkout code from tag"
    GitHub-->>Jenkins: "Repository content"
    Jenkins->>KubernetesPod: "Create privileged pod<br/>(swx-media namespace)"
    KubernetesPod->>DockerContainer: "Launch container<br/>(nbu-harbor registry)"
    DockerContainer->>DockerContainer: "Clone libdpcp from GitHub"
    DockerContainer->>DockerContainer: "Build and install libdpcp"
    DockerContainer->>DockerContainer: "Execute .ci/do_release.sh"
    DockerContainer->>DockerContainer: "Validate version matches tag"
    DockerContainer->>DockerContainer: "Build source RPM and tarball"
    alt do_release == true
        DockerContainer->>ReleaseFolder: "Create release directory"
        DockerContainer->>ReleaseFolder: "Copy .src.rpm and .tar.gz"
        DockerContainer->>ReleaseFolder: "Create symbolic link"
    end
    DockerContainer-->>Jenkins: "Archive artifacts<br/>(build_pkg.log, *.rpm)"
    Jenkins->>User: "Send email notification<br/>(success/failure)"
Loading

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@NirWolfer
Copy link
Contributor Author

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR consolidates CI infrastructure to eliminate the security risk of running jobs on the master Jenkins node by deleting the DR weekly standalone pipeline and folding its functionality into the existing release pipeline via a new do_dr_build parameter.

Key changes:

  • Deleted dr_weekly_jjb.yaml and dr_weekly_matrix.yaml that used nodeLabel: 'master'
  • Added do_dr_build parameter to release pipeline to support both regular releases and DR weekly builds
  • Changed default sha1 from 'tags/$release_tag' to 'vNext' in release_jjb.yaml:24
  • Updated jnlpImage registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com in release_matrix_job.yaml:18
  • Added logic to dynamically determine release_tag from git describe when do_dr_build=true
  • Disabled shallow clone and set depth=2 to support tag fetching

Critical issue:

  • The added git checkout tags/${release_tag} at release_matrix_job.yaml:51 runs unconditionally for all builds, but will fail for regular release builds when the default branch is 'vNext' and the tag hasn't been merged there yet

Confidence Score: 2/5

  • This PR has a critical logic error that will break regular release builds
  • The unconditional git checkout tags/${release_tag} will fail for normal release builds when the default branch is 'vNext' and the tag exists only on master or another branch. The change from 'tags/$release_tag' to 'vNext' as the default sha1 creates a mismatch where the pipeline checks out vNext but then tries to checkout a tag that may not exist in that branch's history
  • Pay close attention to .ci/pipeline/release_matrix_job.yaml which has the logic error that needs fixing

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 3/5 Added do_dr_build parameter, changed default sha1 from 'tags/$release_tag' to 'vNext', disabled shallow clone
.ci/pipeline/release_matrix_job.yaml 2/5 Added git checkout command, do_dr_build logic, and changed jnlpImage registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com

Sequence Diagram

sequenceDiagram
    participant User
    participant JJB as Jenkins Job Builder
    participant Pipeline as Release Pipeline
    participant Container as Docker Container
    participant Git as Git Repository
    participant Release as Release Folder

    alt DR Weekly Build
        User->>JJB: Trigger with do_dr_build=true
        JJB->>Pipeline: Start on vNext branch
        Pipeline->>Git: Clone vNext (depth=2, no shallow)
        Pipeline->>Container: Start in nbu-harbor.gtm.nvidia.com/toolbox/c3po-jnlp
        Container->>Git: git describe --tags --abbrev=0
        Git-->>Container: Return latest tag (e.g. 3.50.3)
        Container->>Container: Set release_tag env var
        Container->>Git: git checkout tags/${release_tag}
        Container->>Container: Run do_release.sh
        Container->>Release: Copy packages to /auto/sw/release/
        Pipeline->>User: Email "Weekly DR build ended"
    else Regular Release Build
        User->>JJB: Trigger with release_tag parameter
        JJB->>Pipeline: Start on vNext branch (default)
        Pipeline->>Git: Clone vNext (depth=2, no shallow)
        Pipeline->>Container: Start in nbu-harbor.gtm.nvidia.com/toolbox/c3po-jnlp
        Container->>Git: git checkout tags/${release_tag}
        Note over Container,Git: This will fail if tag not in vNext!
        Container->>Container: Run do_release.sh
        Container->>Release: Copy packages to /auto/sw/release/
        Pipeline->>User: Email "Release build ended"
    end
Loading

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@NirWolfer NirWolfer requested a review from dpressle November 4, 2025 09:53
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR consolidates the DR weekly build pipeline into the existing release pipeline by deleting the standalone DR weekly files and adding a do_dr_build boolean parameter to control behavior.

Key Changes:

  • Removed dedicated dr_weekly_jjb.yaml and dr_weekly_matrix.yaml files
  • Added do_dr_build parameter to release pipeline to distinguish DR builds from regular releases
  • Updated container image registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com
  • Changed default sha1 checkout from tags/$release_tag to vNext
  • Added explicit git checkout tags/${release_tag} step in release script
  • Changed notification email subject to differentiate DR vs Release builds

Critical Issues Found:

  • The unconditional git checkout tags/${release_tag} on line 51 of release_matrix_job.yaml will fail for regular release builds because the JJB now checks out vNext by default (where the tag doesn't exist yet), breaking the normal release workflow
  • The change in default sha1 from tags/$release_tag to vNext in release_jjb.yaml creates a breaking change for manual release builds where users must now explicitly set both parameters

Confidence Score: 1/5

  • This PR will break regular release builds due to incorrect git checkout logic
  • Score reflects critical logic error where unconditional git checkout of tag will fail for non-DR builds. The default sha1 change from tags/$release_tag to vNext combined with unconditional checkout means regular releases will attempt to checkout a tag that doesn't exist on the vNext branch, causing immediate failure
  • release_matrix_job.yaml and release_jjb.yaml require immediate fixes before merge

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 2/5 Changed default sha1 from 'tags/$release_tag' to 'vNext', added do_dr_build param, disabled shallow clone - introduces breaking change for manual release builds
.ci/pipeline/release_matrix_job.yaml 1/5 Added git checkout and conditional release_tag resolution, changed image registry - has critical bug where checkout will fail for non-DR builds on vNext branch

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant ReleaseJob as LibXLIO-release
    participant Container as Kubernetes Container
    participant Git
    
    alt DR Weekly Build (do_dr_build=true)
        User->>Jenkins: Trigger with do_dr_build=true
        Jenkins->>ReleaseJob: Checkout vNext (sha1 default)
        ReleaseJob->>Git: git describe --tags --abbrev=0
        Git-->>ReleaseJob: returns latest tag
        ReleaseJob->>Container: Start build with release_tag
        Container->>Git: git checkout tags/${release_tag}
        Container->>Container: Build dpcp dependency
        Container->>Container: Run do_release.sh
        Container-->>ReleaseJob: Build artifacts
        ReleaseJob->>User: Email "Weekly DR build ended"
    else Regular Release (do_dr_build=false)
        User->>Jenkins: Trigger with release_tag param
        Jenkins->>ReleaseJob: Checkout vNext (sha1 default)
        ReleaseJob->>Container: Start build with release_tag
        Container->>Git: git checkout tags/${release_tag}
        Note over Container,Git: ERROR: Tag doesn't exist on vNext branch!
        Container->>Container: Build dpcp dependency
        Container->>Container: Run do_release.sh
        Container-->>ReleaseJob: Build artifacts
        ReleaseJob->>User: Email "Release build ended"
    end
Loading

4 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@NirWolfer
Copy link
Contributor Author

@NirWolfer
Copy link
Contributor Author

bot:retest

3 similar comments
@NirWolfer
Copy link
Contributor Author

bot:retest

@NirWolfer
Copy link
Contributor Author

bot:retest

@NirWolfer
Copy link
Contributor Author

bot:retest

dpressle
dpressle previously approved these changes Nov 5, 2025
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR consolidates DR weekly and release builds into a single pipeline that runs on Kubernetes containers instead of the master node, improving security posture.

Key changes:

  • Deleted separate DR weekly pipeline files (dr_weekly_jjb.yaml, dr_weekly_matrix.yaml) that used nodeLabel: 'master'
  • Added do_dr_build boolean parameter to existing release pipeline to handle both release and DR builds
  • Updated container image registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com
  • Changed git clone from shallow (shallow-clone: true) to non-shallow (shallow-clone: false, depth: 2)

Issues found:

  • Critical: refspec in release_jjb.yaml missing tags reference - DR builds will fail when attempting to checkout tags/${release_tag}

Confidence Score: 3/5

  • This PR addresses the security concern but has a critical bug that will break DR builds
  • The refactoring successfully eliminates the master node security risk, but the missing tags refspec in release_jjb.yaml:58 will cause DR builds to fail when attempting to checkout tags. Regular release builds should continue working.
  • Pay close attention to .ci/pipeline/release_jjb.yaml - the refspec must include tags for DR builds to function

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 3/5 Added do_dr_build parameter and changed shallow-clone to false; refspec missing tags which breaks DR builds
.ci/pipeline/release_matrix_job.yaml 3/5 Added DR build logic with conditional tag checkout and dynamic release_tag; runs on Kubernetes containers instead of master node

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant ReleasePipeline as Release Pipeline (Container)
    participant Git
    
    alt Regular Release Build (do_dr_build=false)
        User->>Jenkins: Trigger with release_tag param
        Jenkins->>Git: Checkout $sha1 (defaults to tags/$release_tag)
        Jenkins->>ReleasePipeline: Run in Kubernetes container
        ReleasePipeline->>ReleasePipeline: Skip tag checkout (do_dr_build=false)
        ReleasePipeline->>ReleasePipeline: Run .ci/do_release.sh
        ReleasePipeline->>User: Send release notification email
    end
    
    alt DR Weekly Build (do_dr_build=true)
        User->>Jenkins: Trigger with do_dr_build=true
        Jenkins->>Git: Checkout $sha1 (uses vNext branch)
        Jenkins->>ReleasePipeline: Run in Kubernetes container
        ReleasePipeline->>Git: git describe --tags to get release_tag
        ReleasePipeline->>Git: git checkout tags/${release_tag}
        ReleasePipeline->>ReleasePipeline: Run .ci/do_release.sh
        ReleasePipeline->>User: Send Weekly DR notification email
    end
    
    Note over Jenkins,ReleasePipeline: Previously: DR builds ran on master node<br/>Now: All builds run in containers
Loading

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Consolidates DR weekly builds into the release pipeline by removing standalone DR job files and adding a do_dr_build flag to toggle between release and DR weekly builds.

Major changes:

  • Deleted dr_weekly_jjb.yaml and dr_weekly_matrix.yaml that ran on master node (nodeLabel: 'master')
  • Added do_dr_build boolean parameter to release_jjb.yaml for DR build mode
  • Updated registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com
  • Modified release pipeline to conditionally checkout tags and determine release_tag for DR builds

Critical issues found:

  • Shell variable syntax error on line 51 of release_matrix_job.yaml - ${do_dr_build} won't work in bash context
  • Git refspec in release_jjb.yaml missing tags/branches, breaking DR builds that need to checkout tags
  • git describe will fail due to missing tags from incomplete refspec

Confidence Score: 1/5

  • This PR will break DR weekly builds due to syntax errors and missing git refspec configuration
  • Score reflects critical blocking issues: shell variable syntax incompatibility will cause immediate script failure, and incomplete git refspec prevents tag checkout operations required for DR builds. While the security goal of removing master node usage is correct, the implementation has fundamental errors that will prevent the DR build feature from functioning
  • Pay close attention to release_jjb.yaml (git refspec) and release_matrix_job.yaml (shell syntax)

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 1/5 Added do_dr_build flag, disabled shallow-clone, but refspec missing tags causing DR builds to fail
.ci/pipeline/release_matrix_job.yaml 1/5 Added DR build logic and updated registry, but shell variable syntax error will cause failures

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant ReleaseJJB as release_jjb.yaml
    participant ReleaseMatrix as release_matrix_job.yaml
    participant Git
    participant Container as Docker Container

    User->>Jenkins: Trigger release pipeline (do_dr_build=true/false)
    Jenkins->>ReleaseJJB: Load job config
    ReleaseJJB->>Git: Fetch (refspec: PRs only)
    Note over ReleaseJJB,Git: Missing tags & branches!
    ReleaseJJB->>Container: Spawn container (nbu-harbor registry)
    Container->>ReleaseMatrix: Execute pipeline_start
    alt do_dr_build == true
        ReleaseMatrix->>Git: git describe --tags --abbrev=0
        Note over ReleaseMatrix,Git: Fails - no tags fetched!
        Git-->>ReleaseMatrix: release_tag
    end
    ReleaseMatrix->>ReleaseMatrix: Build-dpcp step
    ReleaseMatrix->>ReleaseMatrix: Release step
    alt do_dr_build == true
        ReleaseMatrix->>Git: git checkout tags/${release_tag}
        Note over ReleaseMatrix,Git: Syntax error: ${do_dr_build}
    end
    ReleaseMatrix->>ReleaseMatrix: Execute do_release.sh
    ReleaseMatrix->>User: Send email (Weekly DR / Release)
Loading

4 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

@NirWolfer
Copy link
Contributor Author

bot:retest

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Consolidates DR weekly builds into the release pipeline by removing separate DR pipeline files and adding a do_dr_build boolean parameter. Also updates container registry from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com.

Major changes:

  • Deleted dr_weekly_jjb.yaml and dr_weekly_matrix.yaml (previously ran on master node)
  • Added do_dr_build parameter to release pipeline to support both release and DR builds
  • Updated jnlp container image registry path
  • Modified pipeline to conditionally checkout tags and set release_tag based on build type
  • Updated email subject to differentiate between "Weekly DR" and "Release" builds

Critical issues found:

  • Git refspec in release_jjb.yaml:58 only fetches PRs, missing tags and branches needed for DR builds
  • Shell conditional syntax error in release_matrix_job.yaml:51 - uses bash syntax ${do_dr_build} for Jenkins parameter
  • git describe --tags will fail because tags aren't fetched by the refspec

Confidence Score: 1/5

  • This PR has critical logic errors that will break DR weekly builds in production
  • The DR build workflow will fail immediately due to missing git refs (tags not fetched) and incorrect shell conditional syntax. The refspec only fetches PRs but DR builds require tags for both checkout and git describe operations. These are blocking issues that must be fixed before merge.
  • .ci/pipeline/release_jjb.yaml and .ci/pipeline/release_matrix_job.yaml both require fixes before DR builds will work

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 1/5 Added do_dr_build parameter but refspec incomplete - missing tags/branches needed for DR builds to work
.ci/pipeline/release_matrix_job.yaml 1/5 Added DR build logic and updated container registry, but shell conditional syntax incorrect and depends on missing git refs

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant GitRepo as Git Repository
    participant Container as K8s Container
    participant ReleaseFolder as Release Folder

    alt DR Weekly Build (do_dr_build=true)
        User->>Jenkins: Trigger with do_dr_build=true
        Jenkins->>GitRepo: Fetch refs (PRs only - missing tags)
        Note over Jenkins,GitRepo: ⚠️ Refspec incomplete
        Jenkins->>Jenkins: Run pipeline_start
        Jenkins->>Jenkins: git describe --tags --abbrev=0
        Note over Jenkins: ⚠️ Fails - no tags fetched
        Jenkins->>Container: Spawn build container
        Container->>Container: Build libdpcp
        Container->>GitRepo: git checkout tags/${release_tag}
        Note over Container: ⚠️ Syntax error in conditional
        Container->>Container: Run .ci/do_release.sh
        Container->>ReleaseFolder: Copy packages (if do_release=true)
        Container->>Jenkins: Report results
        Jenkins->>User: Send "Weekly DR build ended" email
    else Regular Release (do_dr_build=false)
        User->>Jenkins: Trigger with release_tag
        Jenkins->>GitRepo: Checkout tags/$release_tag
        Jenkins->>Container: Spawn build container
        Container->>Container: Build libdpcp
        Container->>Container: Run .ci/do_release.sh (no git checkout)
        Container->>ReleaseFolder: Copy packages
        Container->>Jenkins: Report results
        Jenkins->>User: Send "Release build ended" email
    end
Loading

4 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@dpressle dpressle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your commit message is not good i want to see something about DR weekly release test job not running on master

@NirWolfer
Copy link
Contributor Author

Added mention that the DR weekly now runs on the release pipeline and without running on master:
Weekly DR test job now runs using the release pipeline, which doesnt use the master node (runs on containers only)

@NirWolfer NirWolfer requested a review from dpressle November 10, 2025 08:51
Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR consolidates the DR weekly build pipeline into the release pipeline by removing separate DR pipeline files that ran on the master node, and adding a do_dr_build boolean parameter to the release pipeline to support both workflows securely within containers.

Key changes:

  • Removed dr_weekly_jjb.yaml and dr_weekly_matrix.yaml files that used nodeLabel: 'master' (security risk)
  • Added do_dr_build parameter to release_jjb.yaml to enable DR build mode
  • Updated release_matrix_job.yaml to conditionally checkout tags and determine release tag for DR builds
  • Changed jnlpImage from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com

Critical issue found:
The refspec in release_jjb.yaml:58 only fetches PRs (+refs/pull/*:refs/remotes/origin/pr/*), but DR builds require tags for:

  1. git checkout tags/${release_tag} on line 52 of release_matrix_job.yaml
  2. git describe --tags --abbrev=0 on line 62 of release_matrix_job.yaml

Without tags in the refspec, DR builds (do_dr_build=true) will fail with "pathspec 'tags/x.x.x' did not match any file(s)" errors.

Confidence Score: 0/5

  • This PR cannot be merged - DR builds will fail immediately due to missing tags
  • The refspec configuration is missing tag fetching, which breaks the core DR build functionality added by this PR. The git checkout tags/${release_tag} and git describe --tags commands will fail because no tags are available in the repository after checkout.
  • .ci/pipeline/release_jjb.yaml requires immediate attention - the refspec must include tags for DR builds to function

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 0/5 Added do_dr_build parameter but refspec only fetches PRs - DR builds will fail when trying to checkout/describe tags
.ci/pipeline/release_matrix_job.yaml 2/5 Added DR build logic and updated jnlpImage, but git operations depend on tags that won't be fetched by JJB refspec

Sequence Diagram

sequenceDiagram
    participant User
    participant JJB as Jenkins Job Builder
    participant SCM as Git SCM
    participant Pipeline as release_matrix_job.yaml
    participant Script as do_release.sh
    
    User->>JJB: Trigger with do_dr_build=true
    JJB->>SCM: Fetch with refspec (PRs only)
    Note over SCM: ❌ Tags not fetched!
    SCM-->>Pipeline: Checkout $sha1
    
    alt do_dr_build == true
        Pipeline->>SCM: git checkout tags/${release_tag}
        Note over SCM: ❌ Fails - tag not fetched
        Pipeline->>SCM: git describe --tags --abbrev=0
        Note over SCM: ❌ Fails - no tags available
    end
    
    Pipeline->>Script: Execute do_release.sh
    Script->>Script: Build packages
Loading

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Today we have a few CI pipelines that still use nodeLabel: 'master' in
their matrix file, causing some steps to run on the master node.

This is a security risk and we should run only on containers, the master
node will be taken down in the future not allowing this type of run.

Removed weekly dr job that uses nodelabel definition, move this ability
to the release pipeline
Weekly DR test job now runs using the release pipeline, which doesnt
use the master node (runs on containers only)

Signed-off-by: NirWolfer <[email protected]>
@greptile-apps
Copy link

greptile-apps bot commented Nov 11, 2025

Greptile Overview

Greptile Summary

This PR successfully addresses the security requirement by removing nodeLabel: 'master' and consolidating DR weekly builds into the containerized release pipeline. The approach eliminates two separate pipeline files and adds a do_dr_build parameter to control DR vs release behavior.

Key changes:

  • Deleted dr_weekly_jjb.yaml and dr_weekly_matrix.yaml (which used insecure master node)
  • Added do_dr_build boolean parameter to release_jjb.yaml to distinguish DR builds
  • Added conditional logic in release_matrix_job.yaml to checkout tags and determine version for DR builds
  • Updated harbor registry URL from harbor.mellanox.com to nbu-harbor.gtm.nvidia.com

Critical issues found:

  • Git refspec in release_jjb.yaml:58 only fetches PRs, missing tags/branches needed for DR builds to execute git checkout tags/${release_tag} and git describe --tags
  • Shell variable syntax ${do_dr_build} in release_matrix_job.yaml:51 is incorrect - Jenkins boolean parameters need explicit string comparison

Confidence Score: 2/5

  • This PR will break DR builds due to missing git refspec configuration and incorrect shell syntax
  • The security goal (removing master node usage) is achieved, but two critical bugs will cause DR builds to fail: (1) refspec doesn't include tags needed for checkout/describe operations, (2) shell variable syntax won't properly evaluate Jenkins boolean parameter. Regular release builds should work unchanged.
  • Both release_jjb.yaml and release_matrix_job.yaml require fixes before DR builds can function

Important Files Changed

File Analysis

Filename Score Overview
.ci/pipeline/release_jjb.yaml 2/5 Added do_dr_build parameter but refspec missing tags/branches needed for DR builds to checkout and run git describe
.ci/pipeline/release_matrix_job.yaml 3/5 Added DR build logic with conditional checkout and git describe, but shell variable syntax may fail and missing tag fetch will break it

Sequence Diagram

sequenceDiagram
    participant User
    participant Jenkins
    participant JJB as release_jjb.yaml
    participant Matrix as release_matrix_job.yaml
    participant Git
    participant Container as Kubernetes Pod

    User->>Jenkins: Trigger release build
    Note over User,Jenkins: Parameters: do_dr_build, release_tag, sha1
    
    Jenkins->>JJB: Load job configuration
    JJB->>Git: Fetch code (refspec: +refs/pull/*)
    Note over JJB,Git: ⚠️ Only fetches PRs, missing tags!
    
    Jenkins->>Matrix: Execute pipeline_start
    
    alt do_dr_build == true
        Matrix->>Git: git describe --tags --abbrev=0
        Note over Matrix,Git: ⚠️ Will fail - no tags fetched!
        Git-->>Matrix: Returns latest tag
        Matrix->>Matrix: Set env.release_tag
    end
    
    Matrix->>Matrix: Update build display name
    
    Jenkins->>Container: Provision Kubernetes pod
    Note over Container: nodeSelector: kubernetes.io/arch=amd64<br/>Image: nbu-harbor.gtm.nvidia.com/toolbox/c3po-jnlp
    
    Container->>Container: Build-dpcp step
    Container->>Git: Clone libdpcp
    Container->>Container: Build and install libdpcp
    
    Container->>Container: Release step
    
    alt do_dr_build == true
        Container->>Git: git checkout tags/${release_tag}
        Note over Container,Git: ⚠️ Will fail - tags not in refspec!
    end
    
    Container->>Container: Execute do_release.sh
    Container->>Container: Archive artifacts
    
    Jenkins->>Matrix: Execute pipeline_stop
    Matrix->>User: Send notification email
    Note over Matrix,User: Subject includes "Weekly DR" or "Release"
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@NirWolfer
Copy link
Contributor Author

bot:retest

@galnoam galnoam merged commit e0d4799 into Mellanox:vNext Nov 18, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants