Changes for el9-based cmssw versions #693
base: lobster-python3
Changes from all commits
ab02a2d
5d6ac11
ca98d08
daf9696
e645fe4
6254ae5
ea3c9cf
20d2e40
1583538
0f607c8
52a8cb9
eb2129b
f7f585b
bf3b4c0
41d9fd5
720a2e1
c23022e
b10aa39
39d57a2
42e44a0
223973b
610658a
d678e6d
51e5021
1387069
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,90 @@ | ||
| # Recovering after `threshold_for_failure` / `threshold_for_skipping` is reached | ||
|
|
||
| When Lobster stops retrying work because units/files hit configured retry limits, | ||
| you can recover in-place from the same working directory after fixing the root | ||
| cause. | ||
|
|
||
| ## Why retries stopped | ||
|
|
||
| - `threshold_for_failure` controls per-unit retries. In task creation, units with | ||
| `failed > threshold_for_failure` are skipped; units with | ||
| `failed == threshold_for_failure` are run as isolation tasks. | ||
| - `threshold_for_skipping` controls per-file access retries. Files are considered | ||
| for new tasks only while `skipped < threshold_for_skipping`. | ||
|
|
||
| ## Recovery workflow | ||
|
|
||
| 1. **Fix the origin of failures** | ||
| - Example: bad input endpoint, missing credentials, broken executable, | ||
| site/storage issue. | ||
|
|
||
| 2. **Stop the running Lobster process (if still running)** | ||
| - Use `lobster terminate <workdir>` for graceful shutdown. | ||
|
|
||
| 3. **Raise thresholds in the running workdir config** | ||
| - Run `lobster configure <workdir>` and increase: | ||
| - `advanced.threshold_for_failure` | ||
| - `advanced.threshold_for_skipping` | ||
| - Set them high enough for the current counters: | ||
| - For failed units, retries resume when the threshold is at least the | ||
| current `failed` counter (units at equality are retried in isolation). | ||
| - For skipped files, threshold must be strictly greater than the current | ||
| `skipped` counter. | ||
|
|
||
| 4. **Resume processing from the same workdir** | ||
| - Start again with `lobster process <workdir>` (or `--foreground` while | ||
| debugging). | ||
|
|
||
| 5. **Validate progress** | ||
| - Use `lobster status <workdir>` to monitor failed/skipped summaries. | ||
|
|
||
| ## Which config should you edit? | ||
|
|
||
| Edit the **workdir copy** (`<workdir>/config.py`), not the original config file | ||
| you used for the first `lobster process` launch. | ||
|
|
||
| - `lobster configure <workdir>` opens exactly `<workdir>/config.py`. | ||
| - When resuming from an existing run, Lobster loads state from the workdir | ||
| (`config.pkl` / checkpointed config) rather than re-applying your original | ||
| startup config path. | ||
|
|
||
| ## Notes | ||
|
|
||
| - `threshold_for_failure` and `threshold_for_skipping` are runtime-mutable | ||
| options; changes are wired to `source.update_stuck` and can be applied to an | ||
| existing run. | ||
| - If you only need a clean wrap-up/merge after processing, `lobster process | ||
| --finalize <workdir>` forces both thresholds to `0` (no new retries). | ||
|
|
||
|
|
||
| ## Quick answer to the common sequence | ||
|
|
||
| Your sequence is close, with one tweak: | ||
|
|
||
| 1. Edit `<workdir>/config.py` (this is the right file). | ||
| 2. `lobster configure <workdir>` is just a convenience command to open that | ||
| same file in `$EDITOR`. | ||
| - So do **either** manual edit **or** `lobster configure`, not both. | ||
| 3. Don’t just hope: verify in logs/status. | ||
| - If `lobster process` is running, Lobster watches `<workdir>/config.py` | ||
| and applies updates after mtime changes. | ||
| - Then check `configure.log` / `process.log` and run | ||
| `lobster status <workdir>`. | ||
|
|
||
|
|
||
| ## When is `config.pkl` rewritten? | ||
|
|
||
| - **Not immediately on file edit.** Editing `<workdir>/config.py` alone does not | ||
| rewrite `config.pkl`. | ||
| - `config.pkl` is rewritten when Lobster processes configuration updates in the | ||
| running control loop (`Actions.update_configuration -> config.save()`). | ||
|
|
||
| Practical behavior: | ||
|
|
||
| - If `lobster process <workdir>` is already running, it will detect the modified | ||
| `config.py` (mtime change), apply the update, and then rewrite `config.pkl`. | ||
| - If Lobster is not running, start/restart with `lobster process <workdir>` so | ||
| the update loop runs and persists the new config. | ||
|
|
||
| So yes, you need a running `lobster process` instance for the change to be | ||
| applied and saved to `config.pkl`. |
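The threshold rules described under "Why retries stopped" and "Recovery workflow" can be sketched in a few lines. This is an illustrative model only, not Lobster's implementation: the `Unit` class and the helper names `unit_action`, `file_is_eligible`, and `minimum_thresholds` are hypothetical, and only the comparisons (`failed > threshold_for_failure`, `failed == threshold_for_failure`, `skipped < threshold_for_skipping`) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical stand-in for a unit's retry bookkeeping."""
    failed: int


def unit_action(unit: Unit, threshold_for_failure: int) -> str:
    """Classify a unit using the comparisons quoted in the document."""
    if unit.failed > threshold_for_failure:
        return "skip"      # over the limit: no longer retried
    if unit.failed == threshold_for_failure:
        return "isolate"   # at the limit: retried as an isolation task
    return "run"           # under the limit: retried normally


def file_is_eligible(skipped: int, threshold_for_skipping: int) -> bool:
    """A file is considered for new tasks only while skipped < threshold."""
    return skipped < threshold_for_skipping


def minimum_thresholds(failed_counts, skipped_counts):
    """Smallest thresholds that make every unit and file eligible again.

    threshold_for_failure must be >= the largest `failed` counter;
    threshold_for_skipping must be strictly greater than the largest
    `skipped` counter.
    """
    min_failure = max(failed_counts, default=0)
    min_skipping = max(skipped_counts, default=-1) + 1
    return min_failure, min_skipping
```

With this model, a unit whose `failed` counter equals the new threshold is only retried in isolation, so setting `threshold_for_failure` above the largest counter is the choice that restores normal retries.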
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| # `file://` output handling | ||
|
|
||
| When an output destination in `StorageConfiguration(output=[...])` starts with | ||
| `file://`, Lobster treats it as a **local filesystem stage-out target**. | ||
|
|
||
| ## Behavior | ||
|
|
||
| - Lobster strips the `file://` prefix and uses the remaining string as a local | ||
| directory prefix. For each workflow output file, it appends the per-task | ||
| remote filename and copies data with `shutil.copy2(...)`. | ||
| - A `file://` stage-out target is only attempted if the destination parent | ||
| directory already exists on the worker (`os.path.isdir(os.path.dirname(...))`). | ||
| It does not auto-create missing output directories during task stage-out. | ||
| - After copying, Lobster verifies transfer correctness via local `stat` size | ||
| comparison. If size checks fail, that stage-out method is considered failed. | ||
| - If no configured output method succeeds for a produced file, the task raises a | ||
| stage-out error. | ||
|
|
||
| ## Important limitation | ||
|
|
||
| Yes: with `file://`, Lobster can only stage out to storage that is directly | ||
| reachable from the worker as a local filesystem path. | ||
|
|
||
| If workers cannot access that path (or parent directories are missing), the | ||
| `file://` method fails and Lobster must succeed via another configured output | ||
| method (for example `root://`) or the task fails stage-out. | ||
|
|
||
| ## Practical implication | ||
|
|
||
| For values like: | ||
|
|
||
| - `file:///project01/ndcms//store/user/$USER/...` | ||
|
|
||
| Lobster will stage out by copying to local paths under: | ||
|
|
||
| - `/project01/ndcms//store/user/$USER/.../<remotename>` | ||
|
|
||
| (Extra slashes are tolerated by normal POSIX path handling.) |
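The `file://` behavior described above can be approximated with the standard library. This is a minimal sketch under stated assumptions, not Lobster's stage-out code: the function name `stage_out_local` and its parameters are hypothetical, and only the steps (strip the prefix, require an existing parent directory, copy with `shutil.copy2`, compare sizes via `stat`) come from the text.

```python
import os
import shutil


def stage_out_local(localname: str, output: str, remotename: str) -> bool:
    """Attempt a file:// stage-out; return True only if the copy verifies.

    Hypothetical helper: `output` is one configured output destination,
    `remotename` is the per-task remote filename appended to it.
    """
    prefix = "file://"
    if not output.startswith(prefix):
        return False  # not a local filesystem target

    # Strip the scheme and append the remote filename to the directory prefix.
    target = os.path.join(output[len(prefix):], remotename.lstrip("/"))

    # The parent directory must already exist; nothing is auto-created.
    if not os.path.isdir(os.path.dirname(target)):
        return False

    shutil.copy2(localname, target)

    # Verify the transfer by comparing file sizes on both sides.
    return os.stat(localname).st_size == os.stat(target).st_size
```

A caller would try each configured output in turn and treat the file as failed only when no method returns success, which matches the fallback to another method such as `root://` described above.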
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,6 +17,7 @@ | |
| from lobster.core.source import TaskProvider | ||
|
|
||
| import work_queue as wq | ||
| print('\n\n\nUsing WQ version:', wq.__version__, '\n\n\n') | ||
|
Collaborator
I agree that we should keep any "mandatory" print statements to a minimum. I believe this would only ever get printed once when the user does a |
||
|
|
||
| logger = logging.getLogger('lobster.core') | ||
|
|
||
|
|
@@ -136,7 +137,7 @@ def localkill(num, frame): | |
| process = psutil.Process() | ||
| preserved = [f.name for f in args.preserve] | ||
| preserved += [os.path.realpath(os.path.abspath(f)) for f in preserved] | ||
| openfiles = [f for f in process.open_files() if f.path not in preserved] | ||
| openfiles = [f for f in process.open_files() if f.path not in preserved and "vscode" not in f.path] | ||
| openconns = process.connections() | ||
|
|
||
| for c in openconns: | ||
|
|
@@ -147,7 +148,7 @@ def localkill(num, frame): | |
| logger.error("cannot daemonize due to open files") | ||
| for f in openfiles: | ||
| logger.error("open file: {}".format(f.path)) | ||
| raise RuntimeError("open files or connections") | ||
| raise RuntimeError(f"open files or connections {f.path}") | ||
|
|
||
| with daemon.DaemonContext( | ||
| detach_process=not args.foreground, | ||
|
|
||
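The new error message in the hunk above interpolates `f.path` after the loop over `openfiles`. The guarding condition is not visible in this hunk, so it is an assumption that the branch can also be reached with open connections and no open files; in that case `f` would be unbound. A sketch of one way to build the message from the whole list instead, using hypothetical helper names (`blocking_paths`, `daemonize_error`) and plain path strings rather than `psutil` objects:

```python
def blocking_paths(open_paths, preserved, ignored_substrings=("vscode",)):
    """Return open paths that are neither preserved nor matched by an ignore substring."""
    preserved = set(preserved)
    return [
        p for p in open_paths
        if p not in preserved and not any(s in p for s in ignored_substrings)
    ]


def daemonize_error(open_paths, n_connections):
    """Return an error message, or None when nothing blocks daemonizing."""
    if not open_paths and n_connections == 0:
        return None
    # Report every offending file rather than relying on a loop variable.
    details = ", ".join(open_paths) if open_paths else "none"
    return (
        "open files or connections "
        f"(files: {details}; connections: {n_connections})"
    )
```

The message then stays well defined whether the problem is open files, open connections, or both.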
What is `biglib` used for? We did our full Run 3 production without it.