Conversation

@matthewfeickert
Collaborator

@matthewfeickert matthewfeickert commented Jun 7, 2025

This PR adds the myst.md paper source for the SciPy 2025 proceedings for the tutorial "Reproducible Machine Learning Workflows for Scientists with Pixi".

This is currently in draft mode, as it is being opened in advance of the "Jun 13: Deadline to submit first draft by authors, as GitHub pull request" deadline to ensure that a draft gets in. This contribution will remain silent for most of the time leading up to the deadline as it will be developed in a personal repository where all authors can more easily work on it together.

@matthewfeickert matthewfeickert added the paper (This indicates that the PR in question is a paper) and draft (This triggers Curvenote Preview actions) labels Jun 7, 2025
@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from 4862252 to da6d6ca Compare June 7, 2025 07:16
@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from da6d6ca to 13f7fd1 Compare June 7, 2025 22:14
@github-actions

github-actions bot commented Jun 7, 2025

Curvenote Preview

| Directory | Preview | Checks | Updated (UTC) |
| --- | --- | --- | --- |
| papers/matthew_feickert | 🔍 Inspect | 53 checks passed (13 optional) | Oct 14, 2025, 6:41 PM |

@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from c3915d5 to 1eb0cdd Compare June 8, 2025 19:44
@matthewfeickert matthewfeickert changed the title from "Paper: Reproducible Machine Learning Workflows for Scientists with Pixi" to "Paper: Reproducible Machine Learning Workflows for Scientists" Jun 13, 2025
@ameyxd
Contributor

ameyxd commented Jun 23, 2025

Inviting reviewers: @Gift-Ojeabulu and @[email protected]

@matthewfeickert
Collaborator Author

Comment to reviewers: I'll move the edits I've been preparing elsewhere here tonight, so don't be alarmed that there's nothing to review right at this moment!

@ameyxd It seems that the second reviewer handle was their email, not their GitHub user name.

@Gift-Ojeabulu

Gift-Ojeabulu commented Jun 24, 2025 via email

@sanjaybk7

Commenting to confirm. Will review this week and provide feedback at the earliest. Thanks.

@ameyxd
Contributor

ameyxd commented Jul 2, 2025

Hi folks, I will serve as the editor for this paper when reviews are complete.

Editor: @ameyxd

@ameyxd ameyxd self-assigned this Jul 2, 2025
@Gift-Ojeabulu

I am currently reviewing this and will give detailed feedback tomorrow

@matthewfeickert
Collaborator Author

@Gift-Ojeabulu Sorry, I hadn't added the revised paper. I'll do that tonight, but I unfortunately am fully engaged for the full day.

@Gift-Ojeabulu

Gift-Ojeabulu commented Jul 9, 2025 via email

@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from 1eb0cdd to 53154ae Compare July 14, 2025 07:59
@matthewfeickert matthewfeickert marked this pull request as ready for review August 2, 2025 01:56
@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from 2dd14ca to c004b47 Compare August 2, 2025 02:03
@matthewfeickert
Collaborator Author

@Gift-Ojeabulu @sanjaybk7 I've added some revisions and rebased to get any changes from the upstream 2025 branch. I realized that I left this in "draft" mode for the last few weeks, so you might not have looked at it yet. If you have any questions please let me know.

@ameyxd
Contributor

ameyxd commented Aug 6, 2025

@sanjaybk7 - can you confirm your review here ASAP?

@ameyxd
Contributor

ameyxd commented Aug 20, 2025

@Gift-Ojeabulu - can you confirm your review here? Otherwise we will need to reassign reviewers.

Contributor

@ameyxd ameyxd left a comment

Had to chime in as a reviewer/editor. All-in-all, good write-up, I recommend a few minor edits. Great job.


Software that involves hardware acceleration on computing resources like GPUs requires additional information to be provided for full computational reproducibility.
In addition to the computer platform, information about the hardware acceleration device, its supported drivers, and compatible hardware accelerated versions of the software in the environment (GPU enabled builds) is required.
Traditionally this has been very difficult to do, but multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.
Contributor

Please mention why this has been difficult, concisely.

Collaborator Author

This is now

While this information is straightforward to collect, traditionally this has been difficult to make use of in practice given software access restrictions and the lack of declarative human interfaces for defining relationships between system-level drivers and user software.
Multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.

1. **Package management**: Pixi enables the user to install, update, and remove packages from these environments through the `pixi` command line.
1. **Task management**: Pixi has a task runner system built-in, which allows for tasks with custom logic and dependencies on other tasks to be created.

combined with robust behaviors
Contributor

Is this a stray phrase?

Collaborator Author

No, it was supposed to be a continuation of the text directly above it (with the enumerated points being treated as punctuated ideas) but I've revised this to now be

These features become powerful when combined with robust behaviors

1. **Parity of conda and Python packages**: Pixi allows for conda packages and Python packages to be used together seamlessly, and is unique in its ability to handle overlap in dependencies between them.
Pixi will first solve all conda package requirements for the target environment, lock the environment, and then solve all the dependencies of the Python packages for the environment, determine if there are any overlaps with the existing conda environment, and then only install the missing Python dependencies.
This allows for fully reproducible solves and for the two package ecosystems to complement each other rather than potentially cause conflicts.
1. **Efficient caching**: Pixi uses an extremely efficient global caching scheme.
Contributor

Mention why/how the caching system is efficient here, if possible, or exclude the adverb 'extremely'.

Collaborator Author

Ruben has added context on what exactly is happening now.
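
For readers following this thread, the two-stage behavior described in the parity item quoted above (solve conda first, then install only the Python dependencies that are still missing) can be pictured with a minimal Python sketch. This is an illustration only, not Pixi's implementation: the function names and package sets are hypothetical, and a real resolver also has to map conda package names to PyPI names and compare versions, which is ignored here.

```python
import re


def normalize(name: str) -> str:
    """Normalize a package name in the style of PEP 503 (lowercase, '-' separators)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pypi_packages_to_install(conda_locked: set[str], pypi_resolved: set[str]) -> set[str]:
    """Return the resolved PyPI packages that the locked conda environment lacks."""
    provided = {normalize(name) for name in conda_locked}
    return {name for name in pypi_resolved if normalize(name) not in provided}


# Hypothetical package sets, for illustration only.
conda_locked = {"python", "numpy", "Pillow"}
pypi_resolved = {"numpy", "pillow", "some-pypi-only-package"}

# Only the package that conda did not already provide remains to be installed.
print(pypi_packages_to_install(conda_locked, pypi_resolved))
```

The point of the sketch is the ordering: the conda solve is fixed first, and the PyPI step only fills in what is left.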

@matthewfeickert
Collaborator Author

Thanks @ameyxd! I am in transit currently but I will aim to revise things based off your review by early next week.

@ameyxd
Contributor

ameyxd commented Aug 27, 2025

@matthewfeickert Hi, just a reminder that the Final Author Revision Deadline is 9/4/2025. This means you shouldn't be making any changes after this date. If you want to make changes, please do so before this date!

Collaborator Author

@matthewfeickert matthewfeickert left a comment

Additional text beyond typo fixes and revisions asked for by reviewers was added to provide more conceptual clarity. These changes are outlined at a high level in these comments.

Comment on lines +4 to +5
For systems with shared filesystems (e.g. SLURM) it is possible to use Pixi workspaces in workflows in a similar manner to a local machine (e.g. laptop or workstation).
Other systems (e.g. HTCondor) do not have a shared filesystem, requiring that each worker node receive its own copy of the software environment.
Collaborator Author

This was made more explicit to give more context on why containerization would be needed.

Comment on lines 156 to 180
The same applies for the `lab` feature and environment, which additionally provides JupyterLab for interactive programming with notebooks.

```{code} toml
:filename: pixi.toml

...

[feature.inference.dependencies]
matplotlib = ">=3.10.3,<4"

[feature.lab.dependencies]
notebook = ">=7.4.5,<8"
jupyterlab = ">=4.4.7,<5"

[feature.lab.tasks.start]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[environments]
...
gpu = ["gpu"]
inference = ["gpu", "inference"]
lab = ["gpu", "inference", "lab"]
```

Composing multiple environments from Pixi features allows for separating conceptual steps of scientific analysis into bespoke software environments that contain only the necessary dependencies.
This allows for each step's environment to be better defined, potentially with radically different or conflicting dependencies from other steps, and for clean separation between interactive and non-interactive ("batch") computing models.
Collaborator Author

The lab feature and environment were added to show another conceptual example, and the additional text was added to make the use motivations more clear.

Collaborator Author

@matthewfeickert matthewfeickert left a comment

@scipy-conference/2025-proceedings 👋 I have some formatting questions (probably mostly for the Curvenote team members). I don't think any of them are critical, but for visual quality of the proceedings these would be interesting to know if they can be addressed. (Apologies for not asking these questions 2 months ago. 😬)

Comment on lines 36 to 40
```{literalinclude} code/ml-example/pixi.toml
:linenos:
:label: pixi-ml-example-workspace
:caption: Example of a multi-platform and multi-environment Pixi manifest with all required information and constraints to resolve and install CUDA accelerated conda packages.
```
Collaborator Author

There is an extra line that is included in the render

Image

though the included file only has 50 lines

$ wc -l papers/matthew_feickert/code/ml-example/pixi.toml 
50 papers/matthew_feickert/code/ml-example/pixi.toml

This doesn't happen with the Dockerfile later though, and I'm not sure why.

Contributor

I am at a loss. Listed this here #1082, we will fix before publishing.

Member

Looked into this a little bit: there is a trailing new line in pixi.toml (and Dockerfile). wc -l does not count this as an extra line (I also get 50):
image
(In github you don't see the ⊖ No newline symbol.)

Using :linenos: in literalinclude, the trailing newline is picked up; without :linenos: (as is the case with Dockerfile) it's ignored. All this said, I think the behavior in myst is correct... We could trim the included content... but I'd worry that could have unintended consequences, maybe...?

We can definitely fix this with :end-line: - in the case of pixi.toml, both :end-line: -2 and :end-line: 49 work (ugh, here the zero- vs one-indexing is dicey - both of these make 50 the actual end-line).

Anyway! I'll make this change for you now.
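
The mismatch described in this comment (`wc -l` reporting 50 while a numbered render shows 51) can be reproduced with a small Python sketch. It only illustrates the general mechanism of counting newline characters versus splitting on them; it is not the code used by MyST or its syntax highlighter, and the file contents are made up.

```python
# A 50-line file in which every line, including the last, ends with "\n".
text = "".join(f"line {i}\n" for i in range(1, 51))

# `wc -l` counts newline characters, so it reports 50 for this content.
newline_count = text.count("\n")

# Splitting on "\n" leaves an empty string after the final newline; a renderer
# that numbers every split element would display it as an extra line 51.
split_lines = text.split("\n")

# str.splitlines() treats the trailing newline as a terminator instead.
terminated_lines = text.splitlines()

print(newline_count, len(split_lines), len(terminated_lines))
```

Whether the empty trailing element is shown then depends on how the renderer handles it, which is consistent with the difference between numbered and unnumbered blocks reported here.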

Collaborator Author

Looked into this a little bit: there is a trailing new line in pixi.toml (and Dockerfile).
...
All this said, I think the behavior in myst is correct...

@fwkoch The point here is that only for the pixi.toml include was this not working. There are no issues with including papers/matthew_feickert/code/ml-example/Dockerfile or papers/matthew_feickert/example.lock, which have the same endline. So why is there different behavior between the pixi.toml and everything else?

Member

Hey @matthewfeickert - in case you are still following along. This is not related to the file, it's whether or not you use the :linenos: option. If you use this option, the trailing newline will show up numbered; if you do not, it will be trimmed.

This comes all the way down to the frontend package MyST uses for rendering code - there is an open issue for this exact thing here: react-syntax-highlighter/react-syntax-highlighter#443

Comment on lines 53 to 70
```{code} toml
:filename: pixi.toml

...

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.1,<3"
torchvision = ">=0.22.0,<0.23"

[feature.cpu.tasks.train-cpu]
description = "Train a PyTorch CNN on MNIST on CPU"
cmd = "python src/torch_MNIST.py --epochs 2 --save-model --data-dir data"

...

[environments]
cpu = ["cpu"]
```
Collaborator Author

If I want to show components of a file that I would normally include, but want to show only segments of it (like here where I separate segments with ...) is it possible to do this with MyST?

Contributor

I think it might be possible using the lines option of literalinclude here:
https://mystmd.org/guide/directives#directive-include

However, there isn't the ... option for sure (I am also not quite sure how the line numbers show up, likely continuous, which might not be what you want.)

Collaborator Author

I am going to be bad and make one change after the "pens down" deadline (#1085 (comment)) as I noticed a path inconsistency between the actual TOML file and the copy pasted examples.

Collaborator Author

I think it might be possible using the lines option of literalinclude here: https://mystmd.org/guide/directives#directive-include

... (I am also not quite sure how the line numbers show up, likely continuous, which might not be what you want.)

@rowanc1 Yeah, exactly. This is what I tried at first, but then indeed the line numbers are sequential, which for comparison would be more confusing than having no line numbers.

e.g.

```{literalinclude} code/ml-example/pixi.toml
:linenos:
:start-line: 34
:end-line: 40
:emphasize-lines: 4-6
```

gives

image

@matthewfeickert
Collaborator Author

As @ameyxd had moved this to "Editor signed-off" (thanks!) I'm going to squash all commits and rebase this off of the current HEAD of scipy-conference:2025 to make this easy and a clean commit for the future.

@matthewfeickert matthewfeickert force-pushed the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch from 98b00c3 to 0651b00 Compare September 10, 2025 21:07
Contributor

@ameyxd ameyxd left a comment

LGTM, ready to merge. Approved.

@matthewfeickert
Collaborator Author

Thank you for fixing things up @fwkoch (and Curvenote team)!

@fwkoch fwkoch added the approved (This triggers Curvenote Submission action) label and removed the draft (This triggers Curvenote Preview actions) label Oct 14, 2025
@fwkoch fwkoch merged commit b47e8db into scipy-conference:2025 Oct 14, 2025
16 checks passed
@matthewfeickert matthewfeickert deleted the feat/add-reproducible-ml-workflows-for-scientists-proceedings branch October 14, 2025 19:24

Labels

approved (This triggers Curvenote Submission action), paper (This indicates that the PR in question is a paper)

6 participants