Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
ab02a2d
biglib directory is added to the sandbox
rgoldouz Sep 26, 2022
5d6ac11
Merge pull request #647 from rgoldouz/master
klannon Sep 26, 2022
ca98d08
adapting to lobster-python3 to run3 CMSSWs
anpicci Jan 28, 2025
daf9696
latest updates
anpicci Feb 5, 2025
e645fe4
working version
anpicci Feb 7, 2025
6254ae5
working version2
anpicci Feb 7, 2025
ea3c9cf
working version 3
anpicci Feb 7, 2025
20d2e40
Merge branch 'lobster-python3-run3' into lobster-python3-run3-v2
anpicci Oct 3, 2025
1583538
Merge pull request #1 from anpicci/lobster-python3-run3-v2
anpicci Oct 3, 2025
0f607c8
aligning to Reza's PR
anpicci Oct 3, 2025
52a8cb9
removing unneeded commented lines
anpicci Oct 3, 2025
eb2129b
Addressing Andrew's comments
anpicci Oct 6, 2025
f7f585b
Merge branch 'lobster-python3' into lobster-python3-run3
anpicci Oct 6, 2025
bf3b4c0
Update hackathon_instructions.md
anpicci Oct 9, 2025
41d9fd5
Update hackathon_instructions.md
anpicci Oct 9, 2025
720a2e1
Fixes to pinned packages
Andrew42 Oct 17, 2025
c23022e
fixing error in task.py
anpicci Oct 25, 2025
b10aa39
add wq print statement
ywan2 Oct 27, 2025
39d57a2
Merge branch 'lobster-python3-run3' of https://github.com/anpicci/lob…
ywan2 Oct 27, 2025
42e44a0
Merge remote-tracking branch 'origin' into lobster-python3-run3
anpicci Dec 8, 2025
223973b
do not consider vscode open ssh sessions as a danger; manipulation of…
anpicci Dec 10, 2025
610658a
modifications to make lobster working both with py2- and py3-compatib…
anpicci Dec 23, 2025
d678e6d
Fix worker/runtime robustness: CMSSW sandbox + siteconf/XRootD discov…
anpicci Dec 24, 2025
51e5021
Fix xrootd streaming PFN construction for absolute LFNs (#26)
anpicci Feb 18, 2026
1387069
fix: retry config hot-reload import from base directory (#29)
anpicci Feb 22, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 90 additions & 0 deletions docs/FailureThresholdRecovery.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# Recovering after `threshold_for_failure` / `threshold_for_skipping` is reached

When Lobster stops retrying work because units/files hit configured retry limits,
you can recover in-place from the same working directory after fixing the root
cause.

## Why retries stopped

- `threshold_for_failure` controls per-unit retries. In task creation, units with
`failed > threshold_for_failure` are skipped; units with
`failed == threshold_for_failure` are run as isolation tasks.
- `threshold_for_skipping` controls per-file access retries. Files are considered
for new tasks only while `skipped < threshold_for_skipping`.

## Recovery workflow

1. **Fix the origin of failures**
- Example: bad input endpoint, missing credentials, broken executable,
site/storage issue.

2. **Stop the running Lobster process (if still running)**
- Use `lobster terminate <workdir>` for graceful shutdown.

3. **Raise thresholds in the running workdir config**
- Run `lobster configure <workdir>` and increase:
- `advanced.threshold_for_failure`
- `advanced.threshold_for_skipping`
- Set them high enough for the current counters:
- For failed units, retries resume when the threshold is at least the
current `failed` counter (units at equality are retried in isolation).
- For skipped files, threshold must be strictly greater than the current
`skipped` counter.

4. **Resume processing from the same workdir**
- Start again with `lobster process <workdir>` (or `--foreground` while
debugging).

5. **Validate progress**
- Use `lobster status <workdir>` to monitor failed/skipped summaries.

## Which config should you edit?

Edit the **workdir copy** (`<workdir>/config.py`), not the original config file
you used for the first `lobster process` launch.

- `lobster configure <workdir>` opens exactly `<workdir>/config.py`.
- When resuming from an existing run, Lobster loads state from the workdir
(`config.pkl` / checkpointed config) rather than re-applying your original
startup config path.

## Notes

- `threshold_for_failure` and `threshold_for_skipping` are runtime-mutable
options; changes are wired to `source.update_stuck` and can be applied to an
existing run.
- If you only need a clean wrap-up/merge after processing, `lobster process
--finalize <workdir>` forces both thresholds to `0` (no new retries).


## Quick answer to the common sequence

Your sequence is close, with one tweak:

1. Edit `<workdir>/config.py` (this is the right file).
2. `lobster configure <workdir>` is just a convenience command to open that
same file in `$EDITOR`.
- So do **either** manual edit **or** `lobster configure`, not both.
3. Don’t just hope: verify in logs/status.
- If `lobster process` is running, Lobster watches `<workdir>/config.py`
and applies updates after mtime changes.
- Then check `configure.log` / `process.log` and run
`lobster status <workdir>`.


## When is `config.pkl` rewritten?

- **Not immediately on file edit.** Editing `<workdir>/config.py` alone does not
rewrite `config.pkl`.
- `config.pkl` is rewritten when Lobster processes configuration updates in the
running control loop (`Actions.update_configuration -> config.save()`).

Practical behavior:

- If `lobster process <workdir>` is already running, it will detect the modified
`config.py` (mtime change), apply the update, and then rewrite `config.pkl`.
- If Lobster is not running, start/restart with `lobster process <workdir>` so
the update loop runs and persists the new config.

So yes, you need a running `lobster process` instance for the change to be
applied and saved to `config.pkl`.
38 changes: 38 additions & 0 deletions docs/StorageFileURLs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# `file://` output handling

When an output destination in `StorageConfiguration(output=[...])` starts with
`file://`, Lobster treats it as a **local filesystem stage-out target**.

## Behavior

- Lobster strips the `file://` prefix and uses the remaining string as a local
directory prefix. For each workflow output file, it appends the per-task
remote filename and copies data with `shutil.copy2(...)`.
- A `file://` stage-out target is only attempted if the destination parent
directory already exists on the worker (`os.path.isdir(os.path.dirname(...))`).
It does not auto-create missing output directories during task stage-out.
- After copying, Lobster verifies transfer correctness via local `stat` size
comparison. If size checks fail, that stage-out method is considered failed.
- If no configured output method succeeds for a produced file, the task raises a
stage-out error.

## Important limitation

Yes: with `file://`, Lobster can only stage out to storage that is directly
reachable from the worker as a local filesystem path.

If workers cannot access that path (or parent directories are missing), the
`file://` method fails and Lobster must succeed via another configured output
method (for example `root://`) or the task fails stage-out.

## Practical implication

For values like:

- `file:///project01/ndcms//store/user/$USER/...`

Lobster will stage out by copying to local paths under:

- `/project01/ndcms//store/user/$USER/.../<remotename>`

(Extra slashes are tolerated by normal POSIX path handling.)
33 changes: 16 additions & 17 deletions hackathon_instructions.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
First login to glados, and then run the following commands.
The general directory structure is:
```
lobster-python3
lobster-python3-run3
lobster
```
If you'd like to use a different setup, just adjust the paths accordingly.
Expand All @@ -14,42 +14,41 @@ unset PYTHONPATH
mkdir lobster-python3
cd lobster-python3

git clone https://github.com/NDCMS/lobster.git
git clone https://github.com/anpicci/lobster.git
cd lobster
git checkout lobster-python3
git checkout lobster-python3-run3

conda env create -f lobster_env.yaml -n lobster
conda activate lobster

cd .. # back to lobster-python3/
cd .. # back to lobster-python3-run3/

git clone git@github.com:dmwm/WMCore.git --branch 2.3.5
cd WMCore
sed -i -E '/^(gfal2|htcondor|kkmysqlclient|rucio-clients|Sphinx|coverage|memory-profiler|mox3|nose|nose2|pycodestyle|pylint|pymongo)/d' requirements.txt
pip install -r requirements.txt .

cd ../lobster # to lobster-python3/lobster
cd ../lobster # to lobster-python3-run3/lobster
```

**Note:** The yaml file and the above command creates an environment named `lobster`,
if you'd like to change the name of the environment, change the value after the `-n` flag
e.g. `conda env create -f lobster_env.yaml -n lobster-python3`
**Note:** The yaml file and the above command create an environment named `lobster`. If you'd like to change the name of the environment, change the value after the `-n` flag, e.g., `conda env create -f lobster_env.yaml -n lobster-python3-run3`


Then, still in the cloned lobster directory, run the following command to install lobster as an editable package:
```
pip install -e .
```

Now that the `lobster` env is setup, in the future all you need to do is run the following:
Now that the `lobster` env is set up, in the future all you need to do is run the following:
```
unset PYTHONPATH
unset PERL5LIB
conda activate lobster
export PATH=/afs/crc.nd.edu/group/ccl/software/x86_64/RedHat9/cctools/7.11.1/bin:$PATH ##needed to use parrot_run when sandboxing CMSSW
```

# Running a Simple Config
In the lobster repository, there is a python script called "simple.py". This has been updated to work with `lobster-python3` and can be run in the following way:
In the lobster repository, there is a Python script called "simple.py". This has been updated to work with `lobster-python3` and can be run in the following way:

1. Set up the necessary CMSSW release in the same directory as where you're running the config file (see directions below).
2. unset the pythonpath and start the `lobster` environment
Expand All @@ -68,20 +67,20 @@ After the jobs are completed, check the output. In general, lobster output is st

# Setting up a CMSSW environment for the simple example
For the simple.py script, we're using CMSSW_10_6_26. There are two options:
1. install CMSSW_10_6_26 inside the same directory where simple.py is located
1. Install CMSSW_10_6_26 inside the same directory where simple.py is located
- `lobster/examples/`
- inside the examples directory run `unset PERL5LIB` and `cmsrel CMSSW_10_6_26` (NOTE: this has to be done outside of lobster conda environment)
- reminder: DO NO do cmsenv
- inside the examples directory, run `unset PERL5LIB` and `cmsrel CMSSW_10_6_26` (NOTE: this has to be done outside of lobster conda environment)
- reminder: DO NOT do cmsenv
2. install CMSSW_10_6_26 somewhere else, and edit the path in simple.py on line 45: `release='<your-path-to-CMSSW_10_6_26>'`

# Possible Errors
After submitting a lobster process and starting a work_queue_factory, if there are no errors in `process.err` or `process_debug.log` but workers are never assigned to the job, try the follwing:
After submitting a lobster process and starting a work_queue_factory, if there are no errors in `process.err` or `process_debug.log` but workers are never assigned to the job, try the following:
- Kill the current work_queue_factory.
- Try running a worker directly with the following command:
- `apptainer exec --bind /cvmfs:/cvmfs --bind $CONDA_PREFIX:/conda_env /afs/crc.nd.edu/group/ccl/software/runos/images/cc7-wq-7.11.1.img /conda_env/bin/work_queue_worker -M "lobster_${USER}.*" -dall --cores 1 --disk 10000 -t 150`
- If the manual worker succeeds and not the work_queue_factory then it may indicate a bug in lobster, or your environment. Save your logs and contact the development team for help.
- If the manual worker succeeds and not the work_queue_factory, then it may indicate a bug in lobster or your environment. Save your logs and contact the development team for help.

# Other Troubleshooting
- Did you remember to re-new your proxy?
- Did you remember to renew your proxy?
- Unset PYTHONPATH and PERL5LIB?
- In the right conda env? Run `work_queue_factory --version` and check the output for 7.11.1.
- In the right conda env? Run `work_queue_factory --version` and check the output for 7.11.1.
38 changes: 34 additions & 4 deletions lobster/actions.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,23 @@
import os
import time
import traceback

from contextlib import contextmanager
from lobster.commands.plot import Plotter
from lobster import util

logger = logging.getLogger('lobster.actions')

@contextmanager
def _temporary_cwd(path):
old = os.getcwd()
changed = bool(path and os.path.isdir(path) and os.path.abspath(path) != os.path.abspath(old))
try:
if changed:
os.chdir(path)
yield
finally:
if changed:
os.chdir(old)

def runplots(plotter, foremen):
try:
Expand Down Expand Up @@ -44,12 +55,31 @@ def update_configuration(self):
logger.info('updating configuration')
self.__last_config_update = time.time()
new_config = imp.load_source('userconfig', configfile).config
self.config.update(new_config)
self.config.save()
util.register_checkpoint(self.config.workdir, 'configuration_check', self.__last_config_update)
except ModuleNotFoundError as e:
base_dir = getattr(self.config, 'base_directory', None)
if base_dir and os.path.isdir(base_dir):
logger.warning(
"reload import failed for module '{}' from workdir context; retrying from base directory {}".format(
e.name, base_dir))
with _temporary_cwd(base_dir):
new_config = imp.load_source('userconfig', configfile).config
else:
logger.exception('failed to update configuration:')
logger.error(
"missing module '{}' while importing workdir config; base_directory is unavailable, make imports resilient in config.py".format(
e.name))
util.PartiallyMutable.purge()
new_config = None
except Exception:
logger.exception('failed to update configuration:')
logger.error('configuration reload imports <workdir>/config.py as-is; guard top-level commands (e.g. git) in your config with fallback logic')
util.PartiallyMutable.purge()
new_config = None

if new_config is not None:
self.config.update(new_config)
self.config.save()
util.register_checkpoint(self.config.workdir, 'configuration_check', self.__last_config_update)

for method, args in util.PartiallyMutable.changes():
if method is None:
Expand Down
11 changes: 8 additions & 3 deletions lobster/cmssw/sandbox.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ def __dontpack(self, fn):
return False

def _recycle(self, outdir):
release_and_arch = re.compile(r'sandbox-(.*)-(slc.*)-[A-Fa-f0-9]*.tar.bz2$')
release_and_arch = re.compile(r'sandbox-(.*?)-(slc\d+_\S+|el\d+_\S+)-[A-Fa-f0-9]+\.tar\.bz2$')
shutil.copy2(self.recycle, outdir)
m = release_and_arch.search(self.recycle)
if not m:
Expand All @@ -65,9 +65,14 @@ def _recycle(self, outdir):
return rtname, rtarch, os.path.join(outdir, os.path.split(self.recycle)[-1])

def _get_cmssw_arch(self, dirname):
# First, search for 'slc*'
candidates = glob.glob('{}/.SCRAM/slc*'.format(dirname))
# If no 'slc*' is found, fallback to searching for 'el*'
if not candidates:
candidates = glob.glob('{}/.SCRAM/el*'.format(dirname))

if len(candidates) != 1:
raise AttributeError("Can't determine SCRAM arch!")
raise AttributeError("Can't determine SCRAM arch! in {0}".format(dirname))
return os.path.basename(candidates[0])

def _get_cmssw_version(self, dirname):
Expand Down Expand Up @@ -102,7 +107,7 @@ def ignore_file(fn):
tarball = tarfile.open(outfile, "w|bz2")

# package bin, etc
subdirs = ['bin', 'cfipython', 'external', 'lib', 'python']
subdirs = ['bin', 'cfipython', 'external', 'lib', 'python', 'biglib']
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is biglib used for? We did our full Run 3 production without it.

subdirs += [os.path.join('src', incl) for incl in self.include]

for (path, dirs, files) in os.walk(os.path.join(indir, 'src')):
Expand Down
34 changes: 28 additions & 6 deletions lobster/commands/plot.py
Original file line number Diff line number Diff line change
Expand Up @@ -139,19 +139,26 @@ def mp_pie(vals, labels, name, plotdir=None, **kwargs):
ax.set_prop_cycle(monochrome)

newlabels = []
total = sum(vals) # total == 0 causes division error
if total != 0:
total = sum(vals)
if total > 0:
for label, val in zip(labels, vals):
if float(val) / total < .025:
newlabels.append('')
else:
newlabels.append(label)
else:
newlabels = [''] * len(labels)

with open(os.path.join(plotdir, name + '.dat'), 'w') as f:
for l, v in zip(labels, vals):
f.write('{0}\t{1}\n'.format(l, v))

patches, texts = ax.pie([max(0, val) for val in vals], labels=newlabels, **kwargs)
if total > 0:
patches, texts = ax.pie([max(0, val) for val in vals], labels=newlabels, **kwargs)
else:
patches, texts = [], []
ax.text(0.5, 0.5, 'No data', ha='center', va='center')
ax.set_axis_off()

if paper:
for p, h in zip(patches, itertools.cycle(hatching)):
Expand Down Expand Up @@ -287,9 +294,24 @@ def mp_plot(a, xlabel, stub=None, ylabel='tasks', bins=50, modes=None, ymax=None
if '/' not in ylabel:
ax.set_ylabel('{} / {:.0f} min'.format(ylabel,
(bins[1] - bins[0]) * 24 * 60.))
elif float('inf') not in a: # error passing inf to hist
ax.hist([y for (x, y) in a], bins=bins,
histtype='stepfilled', stacked=True, **kwargs)
else:
cleaned = []
has_data = False
for _, y in a:
arr = np.asarray(y)
finite = arr[np.isfinite(arr)]
if len(finite) > 0:
has_data = True
cleaned.append(finite)
else:
cleaned.append(np.array([]))

if has_data:
ax.hist(cleaned, bins=bins,
histtype='stepfilled', stacked=True, **kwargs)
else:
ax.text(0.5, 0.5, 'No finite data', ha='center', va='center')
ax.set_axis_off()
elif mode & Plotter.PROF:
filename += '-prof'
data['data'] = []
Expand Down
5 changes: 3 additions & 2 deletions lobster/commands/process.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
from lobster.core.source import TaskProvider

import work_queue as wq
print('\n\n\nUsing WQ version:', wq.__version__, '\n\n\n')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ywan2 this was helpful for debugging, but I don't know if we need it in the official release. I guess it doesn't hurt. What do you think @Andrew42 @anpicci?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should keep any "mandatory" print statements to a minimum. I believe this would only ever get printed once when the user does a lobster process path/to/lobster_config.py, so probably not very disruptive, but maybe there should be a better way to log/convey this information to the user.


logger = logging.getLogger('lobster.core')

Expand Down Expand Up @@ -136,7 +137,7 @@ def localkill(num, frame):
process = psutil.Process()
preserved = [f.name for f in args.preserve]
preserved += [os.path.realpath(os.path.abspath(f)) for f in preserved]
openfiles = [f for f in process.open_files() if f.path not in preserved]
openfiles = [f for f in process.open_files() if f.path not in preserved and "vscode" not in f.path]
openconns = process.connections()

for c in openconns:
Expand All @@ -147,7 +148,7 @@ def localkill(num, frame):
logger.error("cannot daemonize due to open files")
for f in openfiles:
logger.error("open file: {}".format(f.path))
raise RuntimeError("open files or connections")
raise RuntimeError(f"open files or connections {f.path}")

with daemon.DaemonContext(
detach_process=not args.foreground,
Expand Down
Loading