Paper: Reproducible Machine Learning Workflows for Scientists #1085
4862252 to
da6d6ca
Compare
da6d6ca to
13f7fd1
Compare
|
Curvenote Preview
|
c3915d5 to
1eb0cdd
Compare
|
Inviting reviewers: @Gift-Ojeabulu and @[email protected] |
|
Comment to reviewers: I'll move the edits I've been preparing elsewhere here tonight, so don't be alarmed that there's nothing to review right at this moment! @ameyxd It seems that the second reviewer handle was their email, not their GitHub user name. |
|
Seen, I will get on that soon.
Sorry for the delay.
|
|
Commenting to confirm. Will review this week and provide feedback at the earliest. Thanks. |
|
Hi folk, I will serve as the editor for this paper when reviews are complete. Editor: @ameyxd |
|
I am currently reviewing this and will give detailed feedback tomorrow |
|
@Gift-Ojeabulu Sorry, I hadn't added the revised paper. I'll do that tonight, but I unfortunately am fully engaged for the full day. |
|
I understand that the conference is currently going on, so take your time.
I will keep checking.
I hope to get it soon.
|
1eb0cdd to
53154ae
Compare
2dd14ca to
c004b47
Compare
|
@Gift-Ojeabulu @sanjaybk7 I've added some revisions and rebased to get any changes from the upstream |
|
@sanjaybk7 - can you confirm your review here ASAP? |
|
@Gift-Ojeabulu - can you confirm your review here, or we would need to reassign reviewers. |
ameyxd
left a comment
Had to chime in as a reviewer/editor. All-in-all, good write-up, I recommend a few minor edits. Great job.
Software that involves hardware acceleration on computing resources like GPUs requires additional information to be provided for full computational reproducibility.
In addition to the computer platform, information about the hardware acceleration device, its supported drivers, and compatible hardware accelerated versions of the software in the environment (GPU enabled builds) is required.
Traditionally this has been very difficult to do, but multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.
Please mention why this has been difficult, concisely.
This is now
While this information is straightforward to collect, traditionally this has been difficult to make use of in practice given software access restrictions and the lack of declarative human interfaces for defining relationships between system-level drivers and user software.
Multiple recent technological advancements (made possible by social agreements and collaborations) in the scientific open source world now provide solutions to these problems.
papers/matthew_feickert/pixi.md
Outdated
1. **Package management**: Pixi enables the user to install, update, and remove packages from these environments through the `pixi` command line.
1. **Task management**: Pixi has a task runner system built-in, which allows for tasks with custom logic and dependencies on other tasks to be created.

combined with robust behaviors
Is this a stray phrase?
No, it was supposed to be a continuation of the text directly above it (with the enumerated points being treated as punctuated ideas) but I've revised this to now be
These features become powerful when combined with robust behaviors
papers/matthew_feickert/pixi.md
Outdated
1. **Parity of conda and Python packages**: Pixi allows for conda packages and Python packages to be used together seamlessly, and is unique in its ability to handle overlap in dependencies between them.
Pixi will first solve all conda package requirements for the target environment, lock the environment, and then solve all the dependencies of the Python packages for the environment, determine if there are any overlaps with the existing conda environment, and then only install the missing Python dependencies.
This allows for fully reproducible solves and for the two package ecosystems to complement each other rather than potentially cause conflicts.
1. **Efficient caching**: Pixi uses an extremely efficient global caching scheme.
Mention why/how the caching system is efficient here, if possible, or exclude the adverb 'extremely'.
Ruben has added context on what exactly is happening now.
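As a rough illustration of the conda-then-PyPI ordering described in the excerpt above, the final step (installing only the Python dependencies that the locked conda environment does not already provide) can be thought of as a set difference. This is a toy model and not Pixi's implementation: the function name and the package names are hypothetical, and a real resolver also has to map conda package names to PyPI names and compare versions.

```python
def missing_pypi_packages(conda_locked, pypi_required):
    """Return PyPI requirements not already provided by the locked conda environment.

    Toy model: packages are compared by name only, which ignores versions and
    any conda-to-PyPI name mapping a real resolver would need.
    """
    return sorted(set(pypi_required) - set(conda_locked))


# Hypothetical package sets, for illustration only.
conda_locked = {"python", "numpy", "pytorch"}
pypi_required = {"numpy", "some-pypi-only-package"}

print(missing_pypi_packages(conda_locked, pypi_required))
```

Here only the hypothetical `some-pypi-only-package` would be installed from PyPI, because `numpy` is already present in the conda environment.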
|
Thanks @ameyxd! I am in transit currently but I will aim to revise things based off your review by early next week. |
|
@matthewfeickert Hi, just a reminder that the Final Author Revision Deadline is 9/4/2025. This means you shouldn't be making any changes after this date. If you want to make changes, please do so before this date! |
matthewfeickert
left a comment
Additional text, beyond typo fixes and the revisions asked for by reviewers, was added to provide more conceptual clarity. These changes are outlined at a high level in these comments.
For systems with shared filesystems (e.g. SLURM) it is possible to use Pixi workspaces in workflows in a similar manner to a local machine (e.g. laptop or workstation).
Other systems (e.g. HTCondor) do not have a shared filesystem, requiring that each worker node receive its own copy of the software environment.
This was made more explicit to give more context on why containerization would be needed.
The same applies for the `lab` feature and environment, which additionally provides JupyterLab for interactive programming with notebooks.

```{code} toml
:filename: pixi.toml

...

[feature.inference.dependencies]
matplotlib = ">=3.10.3,<4"

[feature.lab.dependencies]
notebook = ">=7.4.5,<8"
jupyterlab = ">=4.4.7,<5"

[feature.lab.tasks.start]
description = "Launch JupyterLab"
cmd = "jupyter lab"

[environments]
...
gpu = ["gpu"]
inference = ["gpu", "inference"]
lab = ["gpu", "inference", "lab"]
```

Composing multiple environments from Pixi features allows for separating conceptual steps of scientific analysis into bespoke software environments that contain only the necessary dependencies.
This allows for each step's environment to be better defined, potentially with radically different or conflicting dependencies from other steps, and for clean separation between interactive and non-interactive ("batch") computing models.
The lab feature and environment were added to show another conceptual example, and the additional text was added to make the use motivations more clear.
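To make the composition idea concrete, here is a minimal sketch that mirrors the manifest excerpt above with plain Python dictionaries and merges the dependency tables of the features listed for each environment. It is a simplification under stated assumptions: the `compose` helper is hypothetical, the `gpu` feature's dependencies are elided in the excerpt and so are left empty, and Pixi's own handling (the default feature, constraint solving, locking) is not modeled.

```python
# Dependency tables copied from the pixi.toml excerpt. The "gpu" feature's
# dependencies are elided in the excerpt, so it is left empty here.
features = {
    "gpu": {},
    "inference": {"matplotlib": ">=3.10.3,<4"},
    "lab": {"notebook": ">=7.4.5,<8", "jupyterlab": ">=4.4.7,<5"},
}

# Environment definitions copied from the [environments] table.
environments = {
    "gpu": ["gpu"],
    "inference": ["gpu", "inference"],
    "lab": ["gpu", "inference", "lab"],
}


def compose(env_name):
    """Merge the dependency tables of every feature listed for an environment."""
    merged = {}
    for feature in environments[env_name]:
        merged.update(features[feature])
    return merged


for name in environments:
    print(name, sorted(compose(name)))
```

In this sketch the `lab` environment picks up `matplotlib` from the `inference` feature in addition to its own notebook tooling, while the `inference` environment stays free of JupyterLab.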
matthewfeickert
left a comment
@scipy-conference/2025-proceedings 👋 I have some formatting questions (probably mostly for the Curvenote team members). I don't think any of them are critical, but for visual quality of the proceedings these would be interesting to know if they can be addressed. (Apologies for not asking these questions 2 months ago. 😬)
```{literalinclude} code/ml-example/pixi.toml
:linenos:
:label: pixi-ml-example-workspace
:caption: Example of a multi-platform and multi-environment Pixi manifest with all required information and constraints to resolve and install CUDA accelerated conda packages.
```
I am at a loss. Listed this here #1082, we will fix before publishing.
Looked into this a little bit: there is a trailing new line in pixi.toml (and Dockerfile). wc -l does not count this as an extra line (I also get 50):

(In github you don't see the ⊖ No newline symbol.)
Using :linenos: in literalinclude, the trailing newline is picked up; without :linenos: (as is the case with Dockerfile) it's ignored. All this said, I think the behavior in myst is correct... We could trim the included content... but I'd worry that could have unintended consequences, maybe...?
We can definitely fix this with :end-line: - in the case of pixi.toml, both :end-line: -2 and :end-line: 49 work (ugh, here the zero- vs one-indexing is dicey - both of these make 50 the actual end-line).
Anyway! I'll make this change for you now.
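The counting discrepancy described above can be reproduced with a small sketch. It only illustrates why a newline count (what `wc -l` reports) and a naive split on newlines disagree by one for a file that ends with a trailing newline; it does not reproduce MyST's or the renderer's actual code, and the `line_counts` helper and the sample text are hypothetical.

```python
def line_counts(text):
    """Count lines three ways for text that may end with a trailing newline."""
    newline_chars = text.count("\n")                      # what `wc -l` reports
    naive_pieces = len(text.split("\n"))                  # keeps a trailing empty piece
    trimmed_pieces = len(text.rstrip("\n").split("\n"))   # trailing newline trimmed first
    return newline_chars, naive_pieces, trimmed_pieces


# Hypothetical two-line file ending in a newline.
sample = '[workspace]\nname = "demo"\n'
print(line_counts(sample))  # (2, 3, 2)
```

By the same arithmetic, a 50-line file with a trailing newline splits into 51 pieces, the last one empty, which is consistent with an extra numbered line appearing when the content is not trimmed.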
> Looked into this a little bit: there is a trailing new line in pixi.toml (and Dockerfile).
> ...
> All this said, I think the behavior in myst is correct...

@fwkoch The point here is that only for the pixi.toml include was this not working. There are no issues with including papers/matthew_feickert/code/ml-example/Dockerfile or papers/matthew_feickert/example.lock, which have the same endline. So why is there different behavior between the pixi.toml and everything else?
Hey @matthewfeickert - in case you are still following along. This is not related to the file, it's whether or not you use the :linenos: option. If you use this option, the trailing newline will show up numbered; if you do not, it will be trimmed.
This comes all the way down to the frontend package MyST uses for rendering code - there is an open issue for this exact thing here: react-syntax-highlighter/react-syntax-highlighter#443
```{code} toml
:filename: pixi.toml

...

[feature.cpu.dependencies]
pytorch-cpu = ">=2.7.1,<3"
torchvision = ">=0.22.0,<0.23"

[feature.cpu.tasks.train-cpu]
description = "Train a PyTorch CNN on MNIST on CPU"
cmd = "python src/torch_MNIST.py --epochs 2 --save-model --data-dir data"

...

[environments]
cpu = ["cpu"]
```
If I want to show components of a file that I would normally include, but want to show only segments of it (like here where I separate segments with ...) is it possible to do this with MyST?
I think it might be possible using the lines option of literalinclude here:
https://mystmd.org/guide/directives#directive-include
However, there isn't the ... option for sure (I am also not quite sure how the line numbers show up, likely continuous, which might not be what you want.)
I am going to be bad and make one change after the "pens down" deadline (#1085 (comment)) as I noticed a path inconsistency between the actual TOML file and the copy pasted examples.
> I think it might be possible using the lines option of literalinclude here: https://mystmd.org/guide/directives#directive-include ... (I am also not quite sure how the line numbers show up, likely continuous, which might not be what you want.)

@rowanc1 Yeah, exactly. This is what I tried at first, but then indeed the line numbers are sequential, which for comparison would be more confusing than having no line numbers.
e.g.
```{literalinclude} code/ml-example/pixi.toml
:linenos:
:start-line: 34
:end-line: 40
:emphasize-lines: 4-6
```
gives
|
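For comparison, the behavior asked about in this thread (showing only segments of a file, separated by `...`, while keeping each line's original number) can be sketched outside of MyST. This is not a MyST feature or option: the `excerpt` helper below is hypothetical and only illustrates the desired output, for example as a pre-processing step that generates the text to paste into a code block.

```python
def excerpt(text, ranges, marker="..."):
    """Return selected line ranges joined by a marker, numbered as in the source.

    `ranges` holds (start, end) pairs that are 1-indexed and inclusive.
    """
    lines = text.splitlines()
    width = len(str(len(lines)))  # pad numbers to the widest line number
    chunks = []
    for start, end in ranges:
        numbered = [f"{n:>{width}} {lines[n - 1]}" for n in range(start, end + 1)]
        chunks.append("\n".join(numbered))
    return f"\n{marker}\n".join(chunks)


# Hypothetical five-line file, showing lines 1-2 and line 5.
print(excerpt("a\nb\nc\nd\ne\n", [(1, 2), (5, 5)]))
```

Because the numbers come from the source file rather than the position in the excerpt, the last segment is labeled 5 and not 3, which avoids the sequential renumbering noted above.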
As @ameyxd had moved this to "Editor signed-off" (thanks!) I'm going to squash all commits and rebase this off of the current |
* SciPy 2025 proceedings for the tutorial given on 2025-07-07. - c.f. https://github.com/matthewfeickert-talks/reproducible-ml-for-scientists-with-pixi-scipy-2025 Co-authored-by: Ruben Arts <[email protected]>
98b00c3 to
0651b00
Compare
ameyxd
left a comment
LGTM, ready to merge. Approved.
|
Thank you for fixing things up @fwkoch (and Curvenote team)! |

This PR adds the myst.md paper source for the SciPy 2025 proceedings for the tutorial "Reproducible Machine Learning Workflows for Scientists with Pixi".
This is currently in draft mode as this is being opened in advance of the deadline (scipy_proceedings/README.md, line 108 in e53c3df) to ensure that a draft gets in. This contribution will remain silent for most of the time leading up to the deadline as it will be developed in a personal repository where all authors can more easily work on it together.