Skip to content

fetchFromGitHub: add archiveReproducibilityWorkaround option#428377

Draft
YorikSar wants to merge 2 commits intoNixOS:masterfrom
tweag:fetchfromgithub-archive-reproducibility
Draft

fetchFromGitHub: add archiveReproducibilityWorkaround option#428377
YorikSar wants to merge 2 commits intoNixOS:masterfrom
tweag:fetchfromgithub-archive-reproducibility

Conversation

@YorikSar
Copy link
Contributor

@YorikSar YorikSar commented Jul 25, 2025

This option fetches tarball from GitHub based on tree hash instead of the tag to get unprocessed data from the repo, and then applies a partial implementation of export-subst to replace certain format strings, but unlike Git uses reproducible values for them.
This includes an example of using this in a package affected by this non-reproducibility.

Related PR: #417859
Related issue: #84312
Inspired by discussion: https://discourse.nixos.org/t/fetchfromgithub-and-the-versioneer-fixing-source-reproducibility/66539

TODO

  • move script to a separate package
  • add docs
  • add release notes entry

Things done

  • Built on platform:
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • Tested, as applicable:
  • Ran nixpkgs-review on this PR. See nixpkgs-review usage.
  • Tested basic functionality of all binary files, usually in ./result/bin/.
  • Nixpkgs Release Notes
    • Package update: when the change is major or breaking.
  • NixOS Release Notes
    • Module addition: when adding a new NixOS module.
    • Module update: when the change is significant.
  • Fits CONTRIBUTING.md, pkgs/README.md, maintainers/README.md and other READMEs.

Add a 👍 reaction to pull requests you find important.

@YorikSar YorikSar requested a review from roberth July 25, 2025 15:44
@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/fetchfromgithub-and-the-versioneer-fixing-source-reproducibility/66539/5

@nixpkgs-ci nixpkgs-ci bot added 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 6.topic: python Python is a high-level, general-purpose programming language. 6.topic: fetch Fetchers (e.g. fetchgit, fetchsvn, ...) labels Jul 25, 2025
@emilazy
Copy link
Member

emilazy commented Jul 25, 2025

Can we get away with not doing the replacements at all for simplicity? A git clone won’t apply them, right? jq generating a sed script scares me somewhat.

Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good strategy to solve the export-subst problem.

Could you:

  • factor out the processing logic into a separate script package with a test, and/or
  • add a fetcher test like we have for fetchgit

fetchgit = recurseIntoAttrs (callPackages ../build-support/fetchgit/tests.nix { });

Those use https://nixos.org/manual/nixpkgs/unstable/#tester-invalidateFetcherByDrvHash to generate unique FODs per fetcher implementation "version". Otherwise you end up testing nothing at all after running it once.

Unfortunately, we don't have any for fetchFromGitHub yet. Yikes.

@roberth
Copy link
Member

roberth commented Jul 25, 2025

jq generating a sed script scares me somewhat.

We won't be able to change its behavior easily when this is in use, so there's some reality to that fear. I think it can be done, if we have tests. Future iterations of it may need to be versioned, because otherwise we'd break existing call sites' FODs.

@emilazy
Copy link
Member

emilazy commented Jul 25, 2025

Do we know for sure that we need the processing logic? Most things don’t use this and will build fine from a Git clone, which is what fetching the tree with no post‐processing would give us. We could have a separate hook, that runs outside the FOD (and is therefore not such a compatibility risk), but reuses information from fetchFromGitHub, if we find some things really need the substitutions. Just having the logic to fetch the tree seems like a better approach for now to me.

@YorikSar YorikSar force-pushed the fetchfromgithub-archive-reproducibility branch from 0a4d73c to c928acd Compare July 31, 2025 15:47
Copy link
Contributor Author

@YorikSar YorikSar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roberth, I've addressed all your comments, except I didn't move the script to a separate package yet, will do this later. I'll add a TODO list to the top post.

@emilazy While some packages don't really use this, packages with user interface (CLI, WebUI, etc) might display version string, and it will show the template instead of the actual version. While we could fix this package by package as we currently do, I agree with @roberth's comment that this warrants a more generic fix.

@emilazy
Copy link
Member

emilazy commented Aug 1, 2025

@emilazy While some packages don't really use this, packages with user interface (CLI, WebUI, etc) might display version string, and it will show the template instead of the actual version. While we could fix this package by package as we currently do, I agree with @roberth's comment that this warrants a more generic fix.

But there’s nothing stopping us only implementing the raw fetch inside fetchFromGitHub and doing the export-subst stuff as a hook outside of the FOD, to avoid having to keep byte‐stability for all eternity, right?

(Even then I’d prefer to wait to see if anything warrants it – we already do a lot of patching of packages for bad version checks and it’s not clear to me this would be a meaningful additional burden – but it wouldn’t have the serious issues of putting more logic in FODs, something we’re trying to get away from as it keeps causing us issues.)

@roberth
Copy link
Member

roberth commented Aug 1, 2025

@emilazy A hook would need the tag name to be passed as an environment variable. It would be nicer to have a function that produces a finished "git archive"-like store path. You're absolutely right to criticize FOD complexity and I probably shouldn't have made an exception for this.
So how about we split these changes into two derivations:

  • A non-FOD derivation that performs the replacements and calls fetchFromGitHub (or perhaps fetchgit)
  • A flag for pure fetching on fetchFromGitHub

These could then be created by a single function, to give it a nice interface, e.g. fetchArchiveFromGitHub

@YorikSar YorikSar force-pushed the fetchfromgithub-archive-reproducibility branch 2 times, most recently from 911a04a to c14210c Compare August 8, 2025 15:45
@nixpkgs-ci nixpkgs-ci bot added 2.status: merge conflict This PR has merge conflicts with the target branch and removed 2.status: merge conflict This PR has merge conflicts with the target branch labels Aug 8, 2025
This option fetches tarball from GitHub based on tree hash instead of
the tag to get unprocessed data from the repo, and then applies a
partial implementation of `export-subst` to replace certain format
strings, but unlike Git uses reproducible values for them.

This includes an example of using this in a package affected by this
non-reproducibility.

Related PR: NixOS#417859
Related issue: NixOS#84312
Inspired by discussion: https://discourse.nixos.org/t/fetchfromgithub-and-the-versioneer-fixing-source-reproducibility/66539
@YorikSar YorikSar force-pushed the fetchfromgithub-archive-reproducibility branch from c14210c to 72299b3 Compare August 8, 2025 16:02
@YorikSar
Copy link
Contributor Author

YorikSar commented Aug 8, 2025

I've extracted the substitution logic into a separate package and rebased the branch on top of master to resolve a conflict.

Could you please point me to the argument against adding logic to FOD? We seem to run rather complex computations in them anyway (e.g. any git clone), so what's one more sed going to do?

I'm not against moving this outside of FODs, but there would be a problem of getting data for export-subst. In the current state, we're fetching it from the GitHub API, but outside of FOD, we'd have to make users provide all this data somehow, which would force package maintainers to keep this set of parameters up to date somehow.
As for API, I'd prefer to keep it in fetchFromGitHub and add these layers of derivations inside of it. I think it would keep API for package maintainers simple.

@nixpkgs-ci nixpkgs-ci bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Aug 26, 2025
@philiptaron
Copy link
Contributor

I'm just scared of this PR in general. export-subst is pretty cursed. Archives which depend on it are cursed.

I would prefer (in general) that these sorts of features are turned off at some high level, and then when that choice to opt out becomes a problem, we make a specific fetcher for the specific problem at hand. That causes me much less fear.

@nixpkgs-ci nixpkgs-ci bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Feb 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2.status: merge conflict This PR has merge conflicts with the target branch 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md 6.topic: fetch Fetchers (e.g. fetchgit, fetchsvn, ...) 6.topic: python Python is a high-level, general-purpose programming language. 10.rebuild-darwin: 1-10 This PR causes between 1 and 10 packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants