Add notebook image CI build (mdx2:1.0.2-1) and remove broken docker job #1

Merged: Abdelsalam-Abbas merged 8 commits into main from notebook-image-cicd on May 4, 2026

Conversation

Abdelsalam-Abbas (Collaborator) commented on May 2, 2026

Summary

Adds a CI build for diffuseproject/mdx2-jhub:1.0.2-1, the JupyterHub singleuser image used by the Diffuse SSH gateway. Reproduces the existing :test image's scientific stack via a captured conda lockfile, pins the mdx2 source to the same commit :test was built from, and adds openssh-client, openssh-sftp-server, and rsync so scp, sftp, and rsync work through the gateway.

Also removes the pre-existing broken docker: job from the workflow, strips dead weight from the new Dockerfile, and adds workflow safety guards to prevent silent tag overwrites.

Why

ssh into a JupyterHub notebook pod via the Diffuse SSH gateway works (shipped via webapp PRs #287–#294). But scp, rsync, and sftp fail because they exec the corresponding binaries inside the pod, and the running :test image has none of them:

$ kubectl -n jupyterhub exec jupyter-<user> -- sh -c "which scp; which rsync; ls /usr/lib/openssh/sftp-server"
which scp:           (empty)
ls /usr/lib/openssh/sftp-server: No such file or directory
type scp:            scp: not found  (exit 127)

Modern OpenSSH (≥9.0) defaults scp to the SFTP subsystem, which is why even basic scp started failing.
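
The probe above can be wrapped into a small reusable check. This is a minimal sketch and not part of the PR: the helper names missing_tools and transfer_tools_ok are hypothetical, and the default sftp-server path is the Debian location quoted above, which may differ on other base images.

```shell
# Hypothetical helper: print the name of every tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical helper: succeed only if the file-transfer tooling is complete.
# Default path is the Debian location quoted in the PR description.
transfer_tools_ok() {
  sftp_server="${1:-/usr/lib/openssh/sftp-server}"
  [ -z "$(missing_tools scp rsync)" ] && [ -x "$sftp_server" ]
}
```

Run inside the pod, missing_tools scp rsync prints one line per absent binary and nothing when both are present; transfer_tools_ok additionally requires the SFTP subsystem binary that scp on OpenSSH 9.0 and later relies on.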

Approach

The previous :test image (built 2026-02-26) was produced from a now-deleted feature branch (feat/jupyterhub-singleuser, PR #56) by docker build on someone's laptop. Its source Dockerfile was never committed anywhere. To rebuild faithfully:

  1. Conda env via explicit lockfile (jhub-env.lock, 385 packages from conda-forge). Captured with micromamba env export --explicit -n mdx2-dev from the live :test pod, so the new image gets byte-equivalent DIALS 3.24.1, dxtbx 3.24.1, hdf5 1.14.3, Python 3.10.19, and 381 other packages — same build hashes, same versions.
  2. mdx2 source pinned to 327bf6e1541e3e0b63a22c8aff100b92c4aa6e39 — PR #56's head commit. mdx2/VERSION = 1.0.2 and the package directory matches the live pod exactly (including the absence of io.py, which only exists on main). Reachable via refs/pull/56/head only since the branch was deleted; the Dockerfile fetches that ref explicitly.
  3. Pip versions pinned to match the live pod: jupyterhub==5.4.3 and jupyter-vscode-proxy==0.7.
  4. Apt additions: openssh-client, openssh-sftp-server, rsync.
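
Step 2 can be sketched as follows. This is an illustration under stated assumptions, not the actual contents of Dockerfile.jhub: the helper name fetch_pinned_commit and its argument order are hypothetical. It fetches a single ref into a fresh clone and refuses to continue if the ref no longer resolves to the pinned SHA.

```shell
# Hypothetical helper: fetch one ref into a fresh clone and verify the pin.
# Usage: fetch_pinned_commit <repo-url> <ref> <expected-sha> <dest-dir>
fetch_pinned_commit() {
  repo_url="$1"; ref="$2"; expected_sha="$3"; dest="$4"
  git init --quiet "$dest" || return 1
  git -C "$dest" remote add origin "$repo_url" || return 1
  # Fetch only the requested ref, e.g. refs/pull/56/head for a deleted branch.
  git -C "$dest" fetch --quiet origin "$ref" || return 1
  git -C "$dest" checkout --quiet --detach FETCH_HEAD || return 1
  actual_sha="$(git -C "$dest" rev-parse HEAD)"
  if [ "$actual_sha" != "$expected_sha" ]; then
    echo "ERROR: $ref resolved to $actual_sha, expected $expected_sha" >&2
    return 1
  fi
}
```

Comparing the resolved SHA against the pin guards against the PR ref moving; a mismatch fails the build instead of silently changing the source.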

The tag 1.0.2-1 follows Debian-revision style: upstream 1.0.2 (mdx2 version) + -1 (our first build with this upstream).

Future upgrade story

The lockfile decouples mdx2 source upgrades from scientific stack upgrades:

| Shape | What changes | Tag bump | Trigger |
|---|---|---|---|
| Source-only | MDX2_COMMIT env var in workflow | 1.0.3-1, 1.0.4-1, ... | mdx2 ships a new release |
| Stack-only | Regenerate jhub-env.lock | 1.0.2-2, 1.0.2-3, ... | DIALS bump, security patches, quarterly refresh |
| Both | Both files | <new-mdx2>-1 | New mdx2 needs new deps |
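
The tag arithmetic in the table above can be expressed as a few shell helpers. This is a sketch only; the function names are hypothetical and the workflow itself sets IMAGE_TAG by hand.

```shell
# Split a Debian-revision-style tag such as 1.0.2-1 into its two parts.
tag_upstream() { printf '%s\n' "${1%-*}"; }    # text before the last hyphen
tag_revision() { printf '%s\n' "${1##*-}"; }   # text after the last hyphen

# Stack-only rebuild: same upstream version, revision + 1.
bump_revision() {
  printf '%s-%d\n' "$(tag_upstream "$1")" "$(( $(tag_revision "$1") + 1 ))"
}

# Source bump (with or without a stack change): revision restarts at 1.
new_upstream_tag() { printf '%s-1\n' "$1"; }
```

For example, bump_revision 1.0.2-1 yields 1.0.2-2 (stack-only), while new_upstream_tag 1.0.3 yields 1.0.3-1 (source-only or both).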

Lockfile regeneration: spin up a temp container, run micromamba env export --explicit -n mdx2-dev > jhub-env.lock, and commit the result. Bump the IMAGE_TAG env var in the workflow at the same time; the immutability check will refuse to overwrite the existing tag if you forget.

Step-by-step recipes for each scenario (with concrete commands, smoke tests, rollback procedure, and common pitfalls): see docs/upgrading-the-jhub-image.md.

Workflow safety

The workflow has explicit guards against accidental tag overwrites and unnecessary builds:

  • Workflow-level env: IMAGE_NAME, IMAGE_TAG, MDX2_COMMIT are hoisted to a single review surface at the top of the workflow. Bumping a version is a single diff line.
  • Pre-push immutability check: before pushing on main, the workflow runs docker manifest inspect against Dockerhub. If ${IMAGE_NAME}:${IMAGE_TAG} already exists, the workflow fails with an explicit error message telling the maintainer to bump IMAGE_TAG. Prevents silent overwrites of published-and-deployed tags.
  • Path filter on push trigger: push: to main only fires when Dockerfile.jhub, jhub-env.lock, or the workflow itself changes. Docs-only PRs that get merged don't trigger an unnecessary rebuild. PR-CI stays unfiltered for predictable required-status-check behavior.
  • Least-privilege permissions: the workflow declares contents: read and actions: write (the latter only for type=gha cache writes). No packages: write, since we don't push to ghcr.
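
The path filter and the immutability check described above can be sketched in plain shell. This is an assumption-laden illustration, not the workflow's actual steps: the function names are hypothetical, and the path list is taken from the prose above. Note that docker manifest inspect also exits non-zero on network or authentication failures, so this sketch treats any failure as "not published"; a production guard may want to tell those cases apart.

```shell
# Paths that should trigger a build-and-push on main (from the PR text).
is_image_relevant() {
  case "$1" in
    Dockerfile.jhub|jhub-env.lock|.github/workflows/docker.yml) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical guard: fail if <image>:<tag> is already published.
assert_tag_unpublished() {
  image="$1"; tag="$2"
  if docker manifest inspect "${image}:${tag}" >/dev/null 2>&1; then
    echo "ERROR: ${image}:${tag} already exists on the registry; bump IMAGE_TAG." >&2
    return 1
  fi
  echo "OK: ${image}:${tag} is not published yet."
}
```

Calling assert_tag_unpublished "$IMAGE_NAME" "$IMAGE_TAG" before the push step makes a forgotten tag bump a hard failure rather than a silent overwrite.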

Files changed

  • New: Dockerfile.jhub (71 lines). 3-stage build: mambaorg/micromamba:1.5.5 (micromamba binary) → debian:stable-slim (mdx2 git clone, isolated stage so git stays out of final image) → debian:stable-slim (final, with conda env from lockfile + mdx2 editable + pip pins + ssh tooling).
  • New: jhub-env.lock (389 lines). Conda explicit lockfile; do not edit by hand.
  • New: docs/upgrading-the-jhub-image.md (304 lines). Operational runbook for the three upgrade scenarios (mdx2 source bump, conda env refresh, OS apt additions) with concrete commands, rollback procedure, and common pitfalls.
  • Modified: .github/workflows/docker.yml. The pre-existing docker: job is removed. The new notebook: job replaces it with: env-var-driven version, paths-filtered push trigger, conditional login, immutability check, and least-privilege permissions.

Why remove the existing docker: job

It was added 2026-03-25 with two pre-existing bugs that prevented it from ever working: file: dockerfile (lowercase, fatal on Linux runners) and unconditional login (failed every PR before Dockerhub creds existed). It has never successfully pushed an image. The image it would build (.github/Dockerfile, the :latest-flavored standalone Jupyter Lab launcher with CMD jupyter lab) is not used in the Diffuse deployment chain — only mdx2-workflows/Dockerfile itself references :1.0.0 as a base image, and that base already exists on Dockerhub from a manual push by jlee in March 2026.

.github/Dockerfile itself is kept around as a reference if anyone wants to revive :1.0.0 automation in a focused follow-up PR.

Pre-merge checklist

  • Maintainer adds the repo-level Dockerhub credentials:
    • vars.DOCKERHUB_USERNAME, a repository variable (verified working: login step succeeds in CI)
    • secrets.DOCKERHUB_TOKEN, a repository secret (same)
  • PR's CI build of Dockerfile.jhub passes (no push, just verifies the build succeeds in a clean ubuntu-latest runner)
  • On merge, main's CI run pushes diffuseproject/mdx2-jhub:1.0.2-1 to Dockerhub

Verification done locally

| Check | Result |
|---|---|
| docker buildx build | succeeded; image at sha256:4616b2f03c42…, 4.61 GB uncompressed (~50 MB smaller than :test after stripping unused python_stage and /opt/conda) |
| which scp; which rsync; ls /usr/lib/openssh/sftp-server | all present |
| jupyterhub-singleuser --version | 5.4.3 |
| jupyterhub-singleuser boot with realistic env | extensions load (jupyterhub, jupyterlab, jupyterlab_h5web, notebook_shim, jupyter_lsp, jupyter_server_terminals) |
| import mdx2; mdx2.__version__ | 1.0.2 (matches live pod) |
| mdx2.io attribute | absent (matches live pod; io.py only exists on main) |
| mdx2 source dir contents | matches live pod listing (VERSION, command_line, data.py, dxtbx_machinery.py, geometry.py, scaling.py, utils.py); .git removed |
| dxtbx.__version__ | 3.24.1 (matches live) |
| pip-pinned versions (jupyterhub, jupyter-vscode-proxy, mdx2) | match live |
| Lockfile parity (385 packages, name+version+build normalized) | md5 d5efd15da8bcbddebfa20b20ef2f2ff6 matches lockfile exactly |
| mdx2.import_data --help | functional, prints usage |
| sftp-server -h, scp (no args), rsync --version | all run |
| actionlint on workflow | passes (no warnings) |
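
The lockfile parity row can be reproduced with a normalization along these lines. This is a sketch of one plausible method, not necessarily the one used for the digest quoted above: the function names are hypothetical, it assumes explicit-lockfile lines are package URLs ending in .conda or .tar.bz2 with an optional #hash suffix, and it assumes GNU md5sum is available.

```shell
# Reduce an explicit lockfile to sorted "name version build" lines.
normalize_lock() {
  grep -E '^https?://' "$1" \
    | sed -E 's/#.*$//; s#.*/##; s/\.(conda|tar\.bz2)$//' \
    | sed -E 's/^(.*)-([^-]+)-([^-]+)$/\1 \2 \3/' \
    | sort
}

# Digest and package count of the normalized form.
lock_digest() { normalize_lock "$1" | md5sum | cut -d' ' -f1; }
lock_count()  { normalize_lock "$1" | wc -l | tr -d ' '; }
```

Two environments are in parity when lock_digest agrees for their exported lockfiles; because the lines are sorted and channel URLs are stripped, ordering and mirror differences do not affect the result.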

Cutover plan (after merge)

  1. CI on main pushes diffuseproject/mdx2-jhub:1.0.2-1 to Dockerhub. (One-time: the new Dockerhub repo diffuseproject/mdx2-jhub needs to exist as public and the vars.DOCKERHUB_USERNAME account needs push access. Lazy creation on first push usually works for orgs with auto-create-on-push enabled; eager creation in the Dockerhub UI is the boring/reliable path.)
  2. Webapp PR (separate, against diff-use/webapp) updates the JupyterHub singleuser image config from diffuseproject/mdx2:test to diffuseproject/mdx2-jhub:1.0.2-1. See app/config.py:217 plus the repo-name constant if the webapp keeps repo and tag separate.
  3. After webapp merge + deploy, restart the JupyterHub user notebook from the Hub Control Panel. KubeSpawner pulls fresh because imagePullPolicy: Always.
  4. Verify from a laptop:
    echo hello > /tmp/scp-smoke
    scp -P 30022 /tmp/scp-smoke <user>@jupyter-<cluster>.hub.diffuse.science:/home/jovyan/
    ssh -p 30022 <user>@jupyter-<cluster>.hub.diffuse.science 'cat /home/jovyan/scp-smoke'   # → "hello"
    

Rollback: revert the webapp PR (single-line). Old :test stays parked on Dockerhub.

Notes for future maintenance

  • 327bf6e lives only at refs/pull/56/head. If the PR is ever purged the commit becomes unreachable. Suggested follow-up: push that SHA as a durable git tag in diff-use/mdx2 (e.g. singleuser-source-pin) so future builds can pin to a tag instead of a PR ref.
  • .github/Dockerfile is kept in the repo despite no longer being referenced by any workflow. It's the (verified) source for what produced :1.0.0 and :latest on Dockerhub. Useful as a reference if the standalone Lab image needs reviving later.
  • The mdx2 v1.0.4 upgrade is not bundled here. Separate, deliberate PR once 1.0.2-1 is verified in prod, with its own rollback boundary.
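
The durable-tag follow-up suggested in the first note could look like the sketch below. The helper name pin_commit_as_tag is hypothetical; it creates a lightweight tag locally and verifies it, leaving the push as a separate, deliberate step.

```shell
# Hypothetical helper: tag a pinned commit locally and verify the tag.
# Usage: pin_commit_as_tag <repo-dir> <tag-name> <sha>
pin_commit_as_tag() {
  repo_dir="$1"; tag="$2"; sha="$3"
  git -C "$repo_dir" tag "$tag" "$sha" || return 1
  [ "$(git -C "$repo_dir" rev-parse "${tag}^{commit}")" = \
    "$(git -C "$repo_dir" rev-parse "${sha}^{commit}")" ]
}

# Publishing is done separately from a checkout that has the commit, e.g.:
#   git -C <mdx2-checkout> push origin refs/tags/singleuser-source-pin
```

Once the tag is pushed, builds can reference it instead of refs/pull/56/head, so the pin survives even if the PR ref is purged.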

Commit history

| Commit | Purpose |
|---|---|
| 1db4c76 | Add notebook image CI build (mdx2:1.0.2-1); the main change |
| cdf2221 | Remove broken docker: job from workflow |
| 14ec2e8 | Strip dead weight from notebook image (unused python_stage, /opt/conda, build-context .git); saves ~50 MB |
| 3249501 | Add workflow guards: env vars, immutability check, path filter, permissions |
| fc51467 | Document MDX2_COMMIT: what the SHA is and when to bump it |
| daaa8c1 | Add upgrade guide at docs/upgrading-the-jhub-image.md |
| 7443d20 | Rename image from diffuseproject/mdx2 to diffuseproject/mdx2-notebook to disambiguate from the standalone Lab image at :1.0.0/:latest |
| d002c88 | Rename image and supporting files from notebook to jhub (image, Dockerfile, lockfile, doc, workflow refs); operational-role naming matches the team's verbal shorthand |

Commit 1db4c76:
Reproduces diffuseproject/mdx2:test using a captured conda lockfile +
mdx2 pinned to 327bf6e (PR #56 head, the source :test was built from),
then adds openssh-client, openssh-sftp-server, and rsync so scp, sftp,
and rsync work through the SSH gateway.

The existing :1.0.0/:latest build job is left untouched.

Commit cdf2221:
The pre-existing docker: job was added 2026-03-25 (PR #57's revert era) and
has never successfully pushed an image to Dockerhub. Two pre-existing bugs:
file: dockerfile (lowercase, doesn't exist on Linux runners), and the login
step ran unconditionally with creds that weren't set until 2026-05-02.

The image it would build (.github/Dockerfile, the standalone Jupyter Lab
launcher tagged :latest / :1.0.0) is not used anywhere in the Diffuse
deployment chain — it's only referenced by mdx2-workflows/Dockerfile as a
base for local Prefect dev, and that base image already exists on Dockerhub
from a manual push in March 2026.

Removing the dead job leaves the workflow with one focused, working job
and clears the perpetual red X on every PR. The .github/Dockerfile is
kept around as a reference if anyone wants to revive :1.0.0 builds later
in a focused PR.

Abdelsalam-Abbas changed the title from "Add notebook image CI build (mdx2:1.0.2-1) with scp/rsync/sftp" to "Add notebook image CI build (mdx2:1.0.2-1) and remove broken docker job" on May 2, 2026

Commit 14ec2e8:
- Drop python_stage: nothing in the image invokes /usr/local/bin/python3.10;
  the conda env at /root/micromamba/envs/mdx2-dev/bin supplies the canonical
  Python (3.10.19, matching dxtbx 3.24.1)
- Drop /opt/conda mkdir and PATH entry: the directory was never populated
- Remove .git from cloned source so `git status` inside the running pod
  doesn't surface stale build-context state

Image shrinks ~50 MB (4.66 GB to 4.61 GB uncompressed). Lockfile parity
preserved: normalized md5 of conda list unchanged at
d5efd15da8bcbddebfa20b20ef2f2ff6.

Commit 3249501:
Hoist IMAGE_NAME, IMAGE_TAG, and MDX2_COMMIT to workflow-level env so
they're a single review surface for "what version are we publishing."
Add a pre-push step that fails if the target tag already exists on
Dockerhub, preventing silent overwrites of published immutable tags.

Add a paths filter to the push trigger so only image-relevant changes
trigger a build-and-push on main; pull_request stays unfiltered for
predictable required-status-check behavior.

Set permissions to least-privilege (contents:read for checkout,
actions:write for type=gha cache).

Abdelsalam-Abbas requested review from jlee733 and saada on May 3, 2026, 06:02

Commit daaa8c1:
Recipe-style operational doc under docs/ covering the three upgrade
scenarios (mdx2 source bump, conda env refresh, OS apt addition) with
concrete commands, rollback procedure, and common pitfalls.

The 1.0.2 -> 1.0.4 upgrade walkthrough doubles as the cleanup procedure
for the PR-ref scaffolding (MDX2_PR_REF + git fetch origin pull/56/head),
which becomes deletable as soon as we move off the 327bf6e pin.

Cluster SSH endpoint is referred to as <sampleworks-host>; actual value
lives in the diff-use/infra Pulumi config (private repo).

Commit 7443d20:
The shared diffuseproject/mdx2 Dockerhub repo currently holds two
unrelated images: :1.0.0 and :latest are the standalone Jupyter Lab
launcher (saada's image, base for mdx2-workflows/Dockerfile), while
:test (and now :1.0.2-1) are the JupyterHub singleuser image. Sharing
a repo across two semantically different images, distinguished only by
tag, is the kind of latent collision that bites during incidents.

Move the new image to its own repo (diffuseproject/mdx2-notebook). The
legacy :1.0.0/:latest tags stay where they are; this repo's Dockerfile
and docker-compose.yml continue to reference them unchanged.

Provenance comments referring to the legacy diffuseproject/mdx2:test
tag are kept verbatim, since they document historical fact about a
real Dockerhub image that still exists at the old name.

Commit d002c88:
Roll the previous notebook->jhub naming through every dependent surface in
one atomic commit:

- diffuseproject/mdx2-notebook  -> diffuseproject/mdx2-jhub  (workflow IMAGE_NAME)
- Dockerfile.notebook           -> Dockerfile.jhub           (git mv)
- notebook-env.lock             -> jhub-env.lock             (git mv)
- docs/upgrading-the-notebook-image.md -> docs/upgrading-the-jhub-image.md (git mv)
- workflow paths filter, file: arg, job key, display name, cache scope
  all updated to match.

Reasoning: image names that describe the operational role (jhub) are more
specific than artifact-form names (notebook), and the team's verbal
shorthand for this image is 'the jhub image'. Aligning the artifact name
with how it is referred to in conversation removes one translation step
between Slack/standup and the codebase.

Preserved verbatim: prose references to 'a notebook' (user's Jupyter
session, not the image), the kubernetes container name '-c notebook' (set
in the JupyterHub deployment config and not under this repo's control),
and 'jupyterhub_notebook_image_tag' (a field name in the diff-use/webapp
config; renaming that is a separate webapp-side decision).

jlee733 (Collaborator) left a comment:

Looks good to me

Comment thread .github/workflows/docker.yml Outdated
Abdelsalam-Abbas merged commit 75215c8 into main on May 4, 2026 (1 check passed)
2 participants