TauferLab/pegasus_flux_user_deploy

User Deployment of Pegasus on Flux Systems

This repo contains a collection of utility scripts for installing and deploying HTCondor and Pegasus on systems running LLNL's Flux resource manager.

Warning

This repo is NOT meant to be used for system-wide or multi-user deployments of HTCondor or Pegasus. This repo is only meant for single-user deployments in which all HTCondor daemons are launched on and managed from a single node. If you are trying to deploy HTCondor and Pegasus across an entire Flux-scheduled system, refer to the official HTCondor and Pegasus documentation. However, you can still refer to the section on Running Pegasus Workflows Under Flux for information about how to configure Pegasus workflows to use Flux for job scheduling and management.

Installing HTCondor and Pegasus

To install HTCondor and Pegasus with Flux support, users can run the install.sh script. This script will download HTCondor and Pegasus and install both of them in the same directory. The table below summarizes all the options that can be provided to install.sh:

| Flag | Requires Value? | Flag Required? | Default Value | Description |
|------|-----------------|----------------|---------------|-------------|
| `-a` | Yes | No | `x86_64` | Sets the architecture for which to download Pegasus |
| `-o` | Yes | No | `rhel` | Sets the OS for which to download Pegasus |
| `-v` | Yes | No | `8` | Sets the OS version for which to download Pegasus |
| `-p` | Yes | No | `$PWD/pegasus_install` | Sets the installation prefix for HTCondor and Pegasus |
| `-j` | Yes | No | `1` | Sets the number of parallel build jobs to pass to `make -j` |
| `-w` | Yes | No | None | Sets extra flags to pass to `wget` when downloading tarballs |
| `-c` | Yes | No | None | Sets extra flags to pass to `cmake` when building HTCondor from source |
| `-d` | No | No | N/A | If provided, allows `install.sh` to delete existing directories |
| `-s` | No | No | N/A | If provided, clones Git repos with SSH instead of HTTPS |
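For example, a typical build might look like the following. The flag values here are illustrative, not required, and `install.sh` is assumed to be in the current directory:

```shell
# Illustrative invocation of install.sh with an 8-way parallel build;
# the guard skips the install if the script is not present.
PREFIX="$PWD/pegasus_install"   # install prefix for both HTCondor and Pegasus
if [ -x ./install.sh ]; then
    ./install.sh -a x86_64 -o rhel -v 8 -p "$PREFIX" -j 8 -d
fi
```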

After running install.sh, a new script called pegasus_prefix.sh will be created in the same directory as install.sh. The other scripts in this repo rely on pegasus_prefix.sh to locate the installation.

Note

Valid values for the -a, -o, and -v options can be found by looking at the tarballs at https://download.pegasus.isi.edu/pegasus/5.1.2.dev.0.

Important

Flux support is still in progress for HTCondor under this PR, and Flux support in Pegasus has been added as of the 5.1.2 release. As a result, install.sh currently builds HTCondor from source using this branch, and it installs Pegasus using pre-release tarballs found here. If you'd rather build a different version of HTCondor or Pegasus, you can edit the variables at the top of install.sh, but keep in mind that doing so may break Flux support.

Starting and Stopping HTCondor and Pegasus

To simplify the process of starting and stopping HTCondor and Pegasus for a single user, this repo provides the start_pegasus.sh and stop_pegasus.sh scripts. Before running either script, users should first source the setup_pegasus_env.sh script. This script sets several environment variables necessary for the other scripts and HTCondor to work.

After sourcing setup_pegasus_env.sh, users can start HTCondor and configure it for Pegasus by running start_pegasus.sh. This script starts all of HTCondor's daemons on the current node by running condor_master, and it configures HTCondor's GLite/BLAHP component for Pegasus using the pegasus-configure-glite command.

After starting HTCondor, users can check the status of the deployment by running check_pegasus.sh. This script prints the following information, which can be used to verify that HTCondor is running correctly:

  • The running HTCondor daemons
  • The version of Pegasus
  • The status of HTCondor
  • The status of any running HTCondor jobs

Finally, users can shut down HTCondor by running stop_pegasus.sh. This script simply identifies the PID of condor_master and kills that process.
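Putting the steps above together, a complete single-node session looks roughly like this (the guard skips everything if the scripts are not in the current directory):

```shell
# A complete single-node HTCondor/Pegasus session using the repo's scripts.
if [ -x ./start_pegasus.sh ]; then
    . ./setup_pegasus_env.sh   # must be sourced: it exports env variables
    ./start_pegasus.sh         # starts condor_master, configures GLite/BLAHP
    ./check_pegasus.sh         # prints daemons, versions, and job status
    ./stop_pegasus.sh          # kills condor_master when finished
fi
```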

Note

All of these scripts depend on the pegasus_prefix.sh script created by install.sh. Users should not delete this script unless they are uninstalling HTCondor and Pegasus, which can be done with the uninstall.sh script.

Running Pegasus Workflows Under Flux

Running Pegasus workflows under any batch scheduler (e.g., Slurm, Flux, LSF) requires configuring Pegasus to use HTCondor's BLAHP (formerly glite) component. BLAHP converts HTCondor jobs into batch scripts for a specified batch scheduler, and it provides an interface for other scheduling-related tasks (e.g., checking job status).

To configure Pegasus to use BLAHP's Flux support, users need to add two profiles to their site definitions. First, the pegasus.style profile should be set to glite. This tells Pegasus to use BLAHP. Second, the condor.grid_resource profile should be set to batch flux. This tells BLAHP to generate batch scripts and commands for Flux. Below are some examples of how to specify this.

YAML Config

```yaml
sites:
- name: local-flux
  directories:
  # The following is a shared directory amongst all the nodes in the cluster
  - type: sharedScratch
    path: /lfs/local-flux/glite-sharedfs-example/shared-scratch
    fileServers:
    - url: file:///lfs/local-flux/glite-sharedfs-example/shared-scratch
      operation: all
  profiles:
    pegasus:
      style: glite
    condor:
      grid_resource: batch flux
    # This last part isn't necessary for Flux support, but it's a good
    # idea to include it
    env:
      PEGASUS_HOME: /lfs/software/pegasus
```
Python API

```python
from Pegasus.api import Directory, FileServer, Operation, Site, SiteCatalog, Workflow

wflow = Workflow("example_workflow")
sites = SiteCatalog()

shared_scratch_dir = "/lfs/local-flux/glite-sharedfs-example/shared-scratch"
flux_site = Site("local-flux").add_directories(
    Directory(Directory.SHARED_SCRATCH, shared_scratch_dir).add_file_server(
        FileServer("file://" + shared_scratch_dir, Operation.ALL)
    )
)

flux_site.add_pegasus_profile(style="glite")
flux_site.add_condor_profile(grid_resource="batch flux")

sites.add_sites(flux_site)

wflow.add_site_catalog(sites)
```

After telling Pegasus to use Flux for a site, users can set additional Pegasus profiles to define resource and other job requirements. The Pegasus documentation has a table that explains each profile here.
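For example, per-job resource requirements can be expressed with profiles in the pegasus namespace. The keys below (cores, runtime, memory) are common ones, but verify them and their units against that table for your Pegasus version:

```yaml
profiles:
  pegasus:
    cores: 4        # CPU cores requested per job
    runtime: 1800   # expected per-job runtime, in seconds
    memory: 2048    # memory requested per job, in MB
```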

Finally, users need to tell Pegasus to use the site configured for Flux. This can be done in one of two ways. If you want your entire workflow to run on the Flux site, simply add -s <site_name> to your pegasus-plan invocation. If you only want specific jobs to run on the Flux site, you can set the selector.execution_site profile for those jobs to the name of the Flux site.
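A whole-workflow invocation might look like the following. Here "local-flux" is the site name from the examples above, and workflow.yml is a placeholder for your workflow file:

```shell
# Hypothetical plan-and-submit command targeting the Flux site; the guard
# skips the command if pegasus-plan is not installed.
if command -v pegasus-plan >/dev/null 2>&1; then
    pegasus-plan --dir work --sites local-flux --submit workflow.yml
fi
```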

Hierarchical Scheduling in Pegasus with Flux

Pegasus and HTCondor are not designed to support Flux's hierarchical scheduling capabilities, so neither tool will try to perform any hierarchical scheduling on its own. However, there are two ways in which users can take advantage of hierarchical scheduling.

First, users can utilize Flux's hierarchical scheduling capabilities to isolate their Pegasus workflows from other users. This allows workflows to run through Flux without incurring overheads from having to contend with other users' jobs in the system-wide scheduler. In the past, the only way to avoid this overhead was to use pegasus-mpi-cluster (PMC), which orchestrates workflow DAGs through a single MPI program. PMC avoids the overheads of system-wide scheduling, but it also eliminates all overheads from batch scheduling logic, which can produce unrealistically fast workflow makespans.

To isolate your Pegasus workflow from other users, simply do the following:

  1. Get a Flux allocation with flux alloc or flux batch
  2. Within the allocation, start HTCondor and Pegasus using the instructions above
  3. Within the allocation, plan and run your workflow with Pegasus
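The steps above can be sketched as follows; the node count and time limit are illustrative:

```shell
# Sketch of running an isolated Pegasus workflow inside a Flux allocation;
# the guard skips everything if flux is not installed.
if command -v flux >/dev/null 2>&1; then
    flux alloc -N 4 -t 30m   # step 1: open an interactive allocation
    # Then, inside the allocation's shell:
    #   . ./setup_pegasus_env.sh && ./start_pegasus.sh               # step 2
    #   pegasus-plan --dir work --sites local-flux --submit wf.yml   # step 3
fi
```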

The way to take advantage of hierarchical scheduling is to simply use a script that invokes Flux as the transformation for your Pegasus job(s). By putting your job's logic in a shell script, you can use Flux commands like flux batch and flux alloc within that script to perform hierarchical scheduling within a single job.

Testing with Docker

To let users experiment with Pegasus on a Flux system, we provide a Dockerfile that sets up a Flux instance, an installation of Pegasus and HTCondor, and the setup_pegasus_env.sh, start_pegasus.sh, stop_pegasus.sh, and check_pegasus.sh scripts.

Caution

This Dockerfile is still under development. There is no guarantee that it will work yet. This notice will be removed once the Dockerfile is complete and ready for use.

Copyright and License

Copyright 2026 Global Computing Lab.

The code in this repository is distributed under the terms of the Apache License, Version 2.0 with LLVM Exceptions.

See LICENSE and COPYRIGHT for more details.

SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

Acknowledgements

This material is based upon work supported by the US National Science Foundation under Grant Nos. 2530461, 2513101, 2331152, 2223704, 2138811, and 2103845.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL-LDRD Program under Project No. 24-SI-005.

