[RFC 0017] Intensional Store #17
Changes from 4 commits
ae6d7e3
a7b3772
520e3e2
53baec8
ee3ed3a
3fda171
1c7f749
e9c3340
d4ad873
b0362a0
b0b655e
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
--- | ||
feature: intensional_store | ||
start-date: 2017-08-11 | ||
author: [email protected] | ||
co-authors: (find a buddy later to help out with the RFC) | ||
related-issues: (will contain links to implementation PRs) | ||
|
||
--- | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
One paragraph explanation of the feature. | ||
|
||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
|
||
* Better re-use of inputs between compiles | ||
* Faster updates via nix-channel | ||
* Less compiling | ||
* More benefit from reproducible compiles, so more reason to work on that | ||
|
||
# Detailed design | ||
[design]: #detailed-design | ||
|
||
Terms used: | ||
* derivation: a `nix-build` output product depending on some inputs and resulting in a file or directory under `/nix/store` | ||
* dependent derivation: a derivation built using the currently considered derivation | ||
* `$out`: name of the location where a derivation is installed first, e.g., `zyb6qaasr5yhh2r4484x01cy87xzddn7-unit-script-1.12` | ||
* calculated based on the hashes of all the inputs, including build tools | ||
* `$cas`: output hash, the total hash of all the files under `$out`, with the derivation name appended, e.g., `qqzyb6bsr5yhh2r5624x01cy87xzn7aa-unit-script-1.12` | ||
|
||
## Concept | ||
|
||
The basic concept is aliasing equivalent input derivations in such a way that dependent derivations won't need to change if only `$out` changes but not the input derivation contents. | ||
|
||
After building a derivation, `$cas` is calculated, and `$out` is renamed to `$cas`. Then, if another build requires the input `$out`, it gets `$cas` instead, and all references to that build input will be `$cas` instead of `$out`. That dependent derivation will also have its input hash calculated with the `$cas` instead of the `$out`. | ||
|
||
This means that if two different derivations of the same input have different `$out`s but the same `$cas`, dependent builds will not need to rebuild merely because the input paths differ. For example, the 12MB input `poppler-data` is often the same across multiple different input derivations, so many `$out`s for `poppler-data` all result in the same `$cas`. Similarly, a compiler flag change might leave most derivations unchanged. | ||
|
||
In order to know which `$out`s refer to a particular `$cas`, symlinks can be used (`$out` pointing to `$cas`), or that data can be stored in the store database. The database can help with doing reverse lookups from `$cas` to all the `$out`s. Using symlinks has the benefit of handling self-references that are not discoverable via grep (e.g. filtered by `xxd(1)`). | ||
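
The database variant of this mapping can be sketched as a pair of dictionaries. This is only an illustration: the class name `AliasTable` and its methods are hypothetical and do not correspond to anything in the Nix store schema.

```python
from collections import defaultdict


class AliasTable:
    """Hypothetical in-memory stand-in for the $out -> $cas mapping."""

    def __init__(self):
        self._cas_of = {}                 # $out name -> $cas name
        self._outs_of = defaultdict(set)  # $cas name -> set of $out names

    def add(self, out, cas):
        """Record that building `out` produced the contents stored as `cas`."""
        self._cas_of[out] = cas
        self._outs_of[cas].add(out)

    def resolve(self, out):
        """Return the $cas for an $out, or None if no mapping is known."""
        return self._cas_of.get(out)

    def outs_for(self, cas):
        """Reverse lookup: every $out known to alias this $cas."""
        return set(self._outs_of.get(cas, ()))
```

With symlinks, `resolve` would amount to reading the link target, while `outs_for` is the operation that a database makes cheap.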
|
||
|
||
## Calculating `$cas` | ||
|
||
There is one important corner case that needs special handling: if a derivation refers to itself, it will be referring to `$out`, because `$cas` is not known at the time of the build. This means that each `$out` of an otherwise equivalent build would have a different hash, due to the different `$out`s. | ||
|
||
|
||
To fix this, the `$cas` calculation has to replace all occurrences of `$out` with an equal-length string of (for example) NULL bytes. After that, `$out` is renamed to `$cas` and all occurrences of `$out` are replaced with `$cas`. | ||
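
A minimal sketch of that two-step procedure on a single blob of bytes follows. It assumes that `$out` and `$cas` have the same length, and it uses truncated hex SHA-256 purely as a stand-in for whatever hash Nix would use; `compute_cas` and `finalize` are hypothetical names.

```python
import hashlib


def compute_cas(data: bytes, out: str) -> str:
    """Hash `data` with every occurrence of `out` blanked to NUL bytes."""
    out_hash, _, name = out.partition("-")
    out_bytes = out.encode()
    blanked = data.replace(out_bytes, b"\0" * len(out_bytes))
    # Assumption: truncate the digest so that $cas is as long as $out.
    digest = hashlib.sha256(blanked).hexdigest()[: len(out_hash)]
    return f"{digest}-{name}"


def finalize(data: bytes, out: str):
    """Return ($cas, data with all self-references rewritten to $cas)."""
    cas = compute_cas(data, out)
    assert len(cas) == len(out), "equal length keeps byte offsets stable"
    return cas, data.replace(out.encode(), cas.encode())
```

Two builds whose contents differ only in their own `$out` yield the same `$cas` and the same rewritten bytes under this scheme.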
|
||
|
||
This also means that `$out` and `$cas` should have the same length. The easiest way to achieve that is to use the same hash function for the output hash as used for the input hashes. | ||
This does not work if the references are hidden, for example in an executable compressed with PEX or in a Java jar file.

For those edge cases symlinks would be nice.

Actually, I'd prefer that things fail so we can find the hidden reference and find a permanent fix for that particular problem. As for IPFS: ah yes, looks like there would have to be a second translation from $cas to $ipfs, but that would be trustless because you can check the downloaded $cas.

So basically I just assume that for any build, we'll find a way to make it build in a different location than the final install location.
||
|
||
To calculate `$cas` we need to include all the data that uniquely defines a derivation: the file contents, case-sensitive names, and the permission bits, traversed in a fixed order, no matter what the filesystem or platform. Not to be included are the owning `uid` of the store and timestamps. | ||
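
The traversal described above could look roughly like the following. This is a sketch under assumptions, not the serialization Nix uses: the record layout (type tag, relative path, octal mode, length) is invented for illustration, and `tree_hash` is a hypothetical name. Ownership and timestamps are deliberately never read into the hash.

```python
import hashlib
import os
import stat


def tree_hash(root: str) -> str:
    """Hash names, permission bits and contents under `root` in sorted order."""
    h = hashlib.sha256()

    def visit(path: str, rel: str) -> None:
        st = os.lstat(path)
        mode = format(stat.S_IMODE(st.st_mode), "o").encode()
        name = rel.encode()
        if stat.S_ISLNK(st.st_mode):
            target = os.readlink(path).encode()
            h.update(b"L\0" + name + b"\0" + target + b"\0")
        elif stat.S_ISDIR(st.st_mode):
            h.update(b"D\0" + name + b"\0" + mode + b"\0")
            # Sorting makes the order independent of the filesystem.
            for entry in sorted(os.listdir(path)):
                visit(os.path.join(path, entry), rel + "/" + entry)
        else:
            with open(path, "rb") as f:
                content = f.read()
            size = str(len(content)).encode()
            h.update(b"F\0" + name + b"\0" + mode + b"\0" + size + b"\0")
            h.update(content)

    visit(root, "")
    return h.hexdigest()
```

Including the length before each file's contents keeps record boundaries unambiguous, so two different trees cannot serialize to the same byte stream by accident.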
|
||
## Distributing derivations | ||
|
||
Since `$cas` is only known when `$out` is built, binary caches would need to retain that information. When you look up `$out` to see if it was built already, the response should be _"Yes, this is available as `$cas`"_. | ||
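
As a rough illustration of that exchange, a cache could keep an index from `$out` to `$cas` and answer lookups from it. The function name `lookup` and the shape of the response are assumptions for this sketch, not part of any existing binary cache protocol.

```python
def lookup(cache_index: dict, out: str) -> dict:
    """Answer 'was $out built?' with the $cas it is available as, if any."""
    cas = cache_index.get(out)
    if cas is None:
        return {"found": False}
    return {"found": True, "cas": cas}
```

The client would then fetch `$cas` and record the `$out` to `$cas` mapping locally.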
|
||
|
||
## Maintaining the Nix store | ||
|
||
When garbage collecting, the Nix store should also remove `$out` references (be they symlinks or db entries) when removing a `$cas`. | ||
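
For the database variant, that cleanup step can be sketched as a filter over the mapping. The names here are hypothetical; with symlinks the equivalent would be unlinking every `$out` symlink whose target has been removed.

```python
def drop_dead_aliases(aliases: dict, removed_cas: set) -> dict:
    """Return the $out -> $cas mapping without entries for removed $cas paths."""
    return {out: cas for out, cas in aliases.items() if cas not in removed_cas}
```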
|
||
## Micro-optimizations not worth considering | ||
|
||
|
||
* By stripping the version from `$cas`, it could be the same for multiple versions of the same derivation. | ||
* However, increased version numbers mean the derivation actually changed, so there is no point in doing that. | ||
* By stripping the name and the version from `$cas`, it could be the same for multiple different derivations. | ||
* However, this makes it hard to find out which derivation a certain `$cas` belongs to. | ||
|
||
* Furthermore, different inputs with the same contents are very unlikely, and there is no reduction in builds that need to be done. | ||
|
||
|
||
Finally, `nix-store` supports hardlinking duplicate files, so the above optimizations are useless. | ||
|
||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
|
||
* Extra code to maintain | ||
* Slightly more processing after a build | ||
|
||
|
||
# Alternatives | ||
[alternatives]: #alternatives | ||
|
||
* No change: this is only an optimization; it won't change the fundamental working of Nix in any way | ||
|
||
|
||
# Unresolved questions | ||
[unresolved]: #unresolved-questions | ||
|
||
* Whether to store mappings as symlinks or db entries | ||
* Exactly how the Hydra protocol needs to be changed | ||
|
||
|
||
# Future work | ||
[future]: #future-work | ||
|
||
## Input-agnostic derivations | ||
|
||
If a derivation with a new input is the same except that it has a changed reference to that input (e.g., a script referring to its interpreter, or a binary using a new library version), we call this an input-agnostic derivation for those two input versions (old and new input). | ||
|
||
* To detect this, calculate the hash over the derivation, replacing *all* input references with NULL bytes. If that resulting hash is the same as a previous derivation, it is input-agnostic for those versions. | ||
* This means that instead of being downloaded for installation, it could be patched together from the previous version, by replacing the old input `$cas`s with the new `$cas`s. | ||
* This could keep storage and network traffic for Hydra down, by storing the previous `$cas` and the strings that need to be patched. | ||
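
The detection and patching steps in this list can be sketched as follows, again on a single blob of bytes. It assumes old and new input references have equal length and do not overlap; `agnostic_hash` and `patch_inputs` are hypothetical names, and SHA-256 is only a stand-in.

```python
import hashlib


def agnostic_hash(data: bytes, input_refs: list) -> str:
    """Hash `data` with all input references blanked to NUL bytes."""
    for ref in input_refs:
        data = data.replace(ref, b"\0" * len(ref))
    return hashlib.sha256(data).hexdigest()


def patch_inputs(old_data: bytes, renames: dict) -> bytes:
    """Rewrite old input $cas references to their new $cas values."""
    for old, new in renames.items():
        if len(old) != len(new):
            raise ValueError("references must have equal length")
        old_data = old_data.replace(old, new)
    return old_data
```

If `agnostic_hash` matches for the old and new versions, `patch_inputs` applied to the old contents reproduces the new contents, so only the rename table would need to be stored or transferred.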
|
||
### …and beyond: | ||
|
||
Knowing this also could enable a building shortcut: If a dependent derivation needs rebuilding, and a previous version is available depending on an input-agnostic derivation, it could be generated by patching in the new `$cas`. | ||
|
||
This will not always work: for example, when the input-agnostic derivation copies data from the input it is agnostic over, a new input results in changes beyond the input reference. | ||
|
||
Therefore, this optimization should be optional, defaulting to off. | ||
|
||
## Reproducible builds | ||
|
||
If two derivations are the same except for some irrelevant build-environment changes, they won't get the same `$cas`. Since this impacts rebuilds, there is more incentive to have fully reproducible builds. | ||
|
||
Hopefully this means we'll have it at some point, so we can crowd-source `$out` to `$cas` mappings by trusting many systems that get the same result. | ||
|