[RFC 0152] local-overlay store #152

---
feature: local-overlay-store
start-date: 2023-06-14
author: John Ericson (@Ericson2314)
co-authors: Ben Radford (@benradf)
shepherd-team: @RaitoBezarius @tomberek @arianvp
shepherd-leader: @tomberek
related-issues: https://github.com/NixOS/nix/pull/8397
---

# Summary
[summary]: #summary

Add a new `local-overlay` store implementation to Nix.
This will be a local store that is layered upon another local filesystem store (local store or daemon).
This allows locally extending a shared store that is periodically updated with additional store objects.

# Motivation
[motivation]: #motivation

## Technical motivation

Many organizational users of Nix have a large collection of Nix store objects they wish to share with a number of consumers, be they human users, build farm workers, etc.
The existing ways of doing this are:

- Share nothing: Copy store objects to each consumer

- Share everything: Use a single mutable NFS store, carefully synchronizing updates.

- Overlay everything: Mount all of `/nix` (store and DB) with OverlayFS.

Each has serious drawbacks:

- "Share nothing" wastes tons of space as many duplicate store objects are stored separately.

- "Share everything" incurs major overhead from synchronization, even if consumers are making store objects they don't intend any other consumer to use.
It also imposes an inflexible security model where the actions of one consumer affect all of them.

- Overlay everything cannot take advantage of new store objects added to the lower store, because its "fork" of the DB covers up the lower store's.
(Furthermore, separate files from the DB proper, like an out-of-date SQLite Write-Ahead-Logging (WAL) file, *could* leak through, causing chaos.)

The new `local-overlay` store also uses OverlayFS, but just for the store directory.
The database still starts out as a regular, fresh, empty one; instead, Nix explicitly knows about the lower store, so it can manually fetch from it any information the DB needs.
This avoids all 3 downsides:

- Store objects are never duplicated by the overlay store.
OverlayFS ensures that one copy in either layer is enough, and we are careful never to wastefully include a store object in the upper layer if it is already in the lower layer.

- No excess synchronization.
Local changes are just that: local, not shared with any other consumer.
The lower store is never written to (no modifications or even filesystem locks) so we should not encounter any slow write and sync paths in filesystem implementations like NFS.

- No rigidity of the lower store.
Since Nix sees both the `local-overlay` store's DB and the lower store, it is possible for it to be aware of and utilize objects that were added to the lower store after the overlay store was created.

This gives us a "best of all three worlds" solution.

## Marketing motivation

It is quite common for organizations using Nix to first adopt it behind the scenes.
That is to say, Nix is used to prepare some artifacts which are then presented to a consumer that need not be aware they were made with Nix.
Later though, because of Nix's gaining popularity, there may be a desire to reveal its usage so consumers can use Nix themselves.
Rather than Nix being a controversial tool worth hiding, it can be a popular tool worth exposing.
Nix-unaware usage can still work, but Nix-aware usage can do additional things.

The `local-overlay` store can serve as a crucial tool to bridge these two modes of using Nix.
The lower store can be served as before
--- however the artifacts were disseminated in the "hidden Nix" first phase of adoption
--- perhaps with only a small tweak to expose the DB / daemon socket if it wasn't before.
Review comment (Member):

Is exposing the socket used anywhere in the proposal, or is it just mentioned as a separate possibility (with relevant metadata sharing done via reading the underlying SQLite DB)?

Reply (Member Author, @Ericson2314, Jun 27, 2023):

The local overlay store is potentially a consumer of the socket provided by another Nix daemon. A Nix daemon can also be spun up using the local overlay store instead of the local store.

Basically, no new socket code is needed for this. As far as I can tell, everything one would want with sockets already works without limitation.

Review comment (Member):

A consumer — doing what with the socket?

Reply (Member Author):

Acting as a regular client, like anything else using the daemon. It will in fact only use it to read metadata; the lower store can be read-only.

Review comment (Member):

I have reread the paragraph, and even with your explanations I am not sure what process the paragraph as written describes.

Reply (Member Author):

  1. Serve Nix store dir, clients don't use Nix but do use Nix-built things
  2. Serve either socket or SQLite database, clients can use Nix with that store but with restrictions (e.g. perhaps only read-only)
  3. Use local overlay store (the writable upper layer makes the read-only lower layer less of an issue)

Does that make sense?

Review comment (Member):

So in step 1 each snapshot gets its own store path, and the rest of the store is visible? (I guess one could also share a fixed-path directory with links to all the currently-relevant snapshots?)

The `local-overlay` store is new, but purely local, separate for each user that wants to use Nix, and with no impact at all on any user that doesn't.

By providing the `local-overlay` store, we are essentially completing a reusable step-by-step guide for Nix users to "Nixify their workplace" in a very conscientious and non-disruptive manner.

## Motivation in action

See [Replit's own announcement](https://blog.replit.com/super-colliding-nix-stores) of this feature (in its current, non-upstreamed form), aimed at its users.
This covers many of the same points as above, but from the perspective of Replit users who would like to use Nix, rather than that of the Nix community.

# Detailed design
[design]: #detailed-design

## Basic principle

`local-overlay` is a store representing the extension of a lower store with a collection of additional store objects.
(We don't refer to the upper layer as an "upper store" because it is not self-contained
--- it doesn't abide by the closure property because objects in the upper layer can refer to objects that are only in the lower layer.)

## Class hierarchy, configuration settings, and initialization

```mermaid
flowchart TD
subgraph LocalFSStore-instance
LS
LM
end
LS(store directory)
LM(Abstract metadata source)
subgraph LocalOverlayStore-instance
US
UM
end
US(store directory)
UM(SQLite DB)
UD(Directory for additional store objects)
LocalOverlayStore-instance -->|`lower-store` configuration option| LocalFSStore-instance
LocalOverlayStore-instance -->|`upper-layer` configuration option| UD
US -->|OverlayFS lower layer| LS
US -->|OverlayFS upper layer| UD
```

`LocalOverlayStore` is a subclass of `LocalStore` implementing the `local-overlay` store.
It has additional configuration items for:

- `lower-store`: The lower store, which must be a `LocalFSStore`

This is specified with an escaped URL just like the `remote-store` setting of the two SSH store types.

- `upper-layer`: The directory used as the upper layer of the OverlayFS

- `check-mount`: Whether to check the filesystem mount configuration

Here is an example of how the nix.conf configuration might look:
```
store = local-overlay?lower-store=/mnt/lower/%3Fread-only%3Dtrue&upper-layer=/mnt/scratch/nix/upper/store
```

With `check-mount` enabled, on initialization the store checks that an OverlayFS mount exists matching these parameters (an example mount invocation is sketched after this list):

- The lower layer must be the lower store's "real store directory"

- The upper layer must be the directory specified for this purpose
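
For illustration, here is roughly what a matching mount invocation could look like for the example configuration above. All paths are illustrative: the lower store's real store directory is assumed to be `/mnt/lower/nix/store`, and the work directory is a generic OverlayFS requirement rather than anything Nix-specific.

```
# Sketch of an OverlayFS mount matching the example nix.conf above.
mount -t overlay overlay \
  -o lowerdir=/mnt/lower/nix/store,upperdir=/mnt/scratch/nix/upper/store,workdir=/mnt/scratch/nix/work \
  /nix/store
```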

The database for the `local-overlay` store is exactly like that for a regular local store:

- Same schema, including foreign key constraints

- Created empty on opening the store if it doesn't exist

- Opened existing one otherwise

## Data structure

These are diagrams for the possible states of a single store object.
Except for the closure property mandating that present store objects must have all their references also be present, store objects are independent.
We can therefore to a large extent get away with considering only a single store object.

### Graph Key

All the graphs below follow these conventions:

#### Nodes

- Right angle corners: Absent store object
- Rounded corners: Present store object

#### Edges

- Solid edge: "adding" or "deduplicating" direction
- Dotted edge: "deleting" direction

The graph when reduced to just one type of edge is always acyclic, and thus represents a partial order.

### Lower Store

While in use as a lower store by one or more overlay stores, a store must only grow "monotonically".

Review comment:

In Linux, changes to the underlying storage of overlay devices have undefined behaviour, so a store used as a lower store cannot grow at all.

That is to say, the only way it is allowed to change is by extending it with additional store objects.

```mermaid
graph TD
A -->|Add in store| B
A[Absent Store Object]
B("Present store Object")
```

### Overlay Store logical view

The overlay store, by contrast, allows almost arbitrary operations.

```mermaid
graph TD
A -->|Add in store| B
B -.->|Delete from store| A
A[Absent Store Object]
B("Present store Object")
```

The exception is objects that are part of the lower store.
They cannot be logically deleted, but are always part of the overlay store.

### Both stores simplified view

We can take the [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product_of_graphs) of these two graphs,
and additionally tweak it to cover the exception from above:

```mermaid
flowchart TD
A -->|Create in lower store| B
A -->|Create in overlay store| C
B -->|Create in overlay store| Both
C -->|Create in lower store| Both
C -.->|Delete in overlay store| A
Both -.->|Faux-delete in overlay store| B
A[Absent Store Object]
B("Lower Layer Present")
C("Upper Layer Present")
Both("Both layers present")
```

In particular, "Faux-delete" refers to the exception.
While the upper layer can "forget" about the store object, since the lower layer still contains it, the overlay store does also.
In particular, any of the 3 "present" nodes mean the store object is logically present.

### Both stores accurate physical view

The above graph doesn't cover duplication vs deduplication in the "both" state.
It also doesn't distinguish between store directories and metadata.
Let's now make a more complex graph which covers both of these things:

```mermaid
flowchart TD
A -->|Create in lower store| B
A -->|Create in overlay store| C
B -->|Create in overlay store after created in lower store| D1
C -->|Create in lower store after created in overlay store| D0
D0 -->|Manual deduplication| D1
subgraph Both
%%"Both layers metadata"
D0
D1
end
C -.->|Delete in overlay store| A
Both -.->|Faux-delete in overlay store| B
A[Absent Store Object]
B("Lower layer filesystem data<br>Lower store metadata")
C("Upper layer filesystem data<br>Overlay DB present")
D0("Both layers filesystem data<br>Lower store metadata<br>Overlay DB Metadata")
D1("Lower layer filesystem data<br>Lower store metadata<br>Overlay DB metadata")
```

Whenever the filesystem part of a store object resides in a layer, the metadata must also reside in that layer.
I.e. a store object whose filesystem data is in the lower layer must have lower store metadata (the exact mechanism is abstract and unspecified; it could be a SQLite DB or a daemon), and a store object whose filesystem data is in the upper layer must have an overlay DB entry.
However, when the store object is only in the lower layer, we may additionally have metadata in the upper layer.
That means there are two cases where the metadata is in both layers: the duplicated case (filesystem data in both layers) and the deduplicated case (filesystem data in just the lower layer).

## Operation

As discussed in the motivation, store objects are never knowingly duplicated.
The `local-overlay` store while in operation ensures that store objects are stored exactly once:
either in the lower store or the upper layer directory.
No file system data should ever be duplicated by `local-overlay` itself.

Non-filesystem data, i.e. what goes in the DB (references, signatures, etc.), is duplicated.
Any store object from the lower store that the `local-overlay` store needs has that information copied into the `local-overlay` store's DB.
This includes information for the closure of any such store object, because the normal closure property enforced by the DB's foreign key constraints is upheld.

Store objects can still end up duplicated if the lower store later gains a store object the overlay store already had.
This is because when the `local-overlay` store is remounted, it doesn't know how the lower store may have changed, and when the lower store is added to, any overlay store directories / DBs are not in general visible to it either.
We can, however, have an "fsck"-like operation that manually scans for missing / duplicated objects.

## Read-only `local` Store

*This was [already approved](https://github.com/NixOS/nix/pull/8356#event-9483342493) by the Nix team on an experimental basis, as an experimental feature that is trivial enough to be approved without requiring an RFC.
It is still included here just to provide context.*

In order to facilitate using `local-overlay` where the lower store is entirely read only (read only SQLite files too, not just store directory), it is useful to also implement a new "read-only" setting on the `local` store.
The main thing this does is use SQLite's [immutable mode](https://www.sqlite.org/c3ref/open.html).

This is a separate feature;
it is perfectly possible to implement the `local-overlay` without this or vice-versa.
But for maximum usability, we want to do both.
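
For example, the lower store from the earlier configuration example could be opened directly in this mode with something like the following (a sketch; the path is illustrative and the relevant experimental features may need to be enabled):

```
# Query the lower store without taking locks or creating WAL files.
nix path-info --all --store '/mnt/lower/?read-only=true'
```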

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

Because the `local-overlay` store is a completely separate store implementation, the interactions with the rest of Nix are fairly minimal and well-defined.
In particular, users of other stores and not the `local-overlay` store will not be impacted at all.

# Drawbacks
[drawbacks]: #drawbacks

## Read-only local store is delicate

SQLite's immutable mode doesn't mean "we promise not to change the database".
Review comment (Member):

I found this paragraph very surprising. The rest of the RFC is written as if the lower store can monotonically grow without being a problem for the overlays and that sounded really cool! And then suddenly this paragraph says that none of that is possible due to technical issues.

With this technical issue in place I am not so sure if a local-overlay store is that attractive or useful. It sounds like a deal-breaker that makes the abstraction kind of useless. If the sqlite database in the lower store can't be updated online, then what is the point?

How is it different from a lower store containing a nix-registration-paths text file with all valid paths that we nix-store --load-db on startup into the overlay sqlite database? This is something we already have today and it is well-tested (the ISO depends on it). And it seems to give the same benefits to me. The lower store can be updated offline, and if you remount it then new paths can be added to the overlay sqlite database.

Review comment (Member):

Ah we address this later in https://github.com/NixOS/rfcs/pull/152/files#diff-36321d79e9a5897b43eaf0f9c622b4a14978b61641f934bc4d33cdc9a49f6669R345

I think we could put way more focus on "This is an optimisation for --load-db", as that is the main selling point here.

Review comment (Member):

and it links to a comment I made on discourse haha. Great that I'm arguing with myself from the past. :')

It relies on the database not just being *logically* immutable (the meaning doesn't change), but *physically* immutable (no bytes of the on-disk files change).
This is because it forgoes synchronization altogether and thus relies on nothing being rearranged, e.g. in the middle of a query (which would invalidate in-progress traversals of data structures, etc.).
This means there is no hope of, say, an "append only" mode where "publishers" only add new store objects to a local store, while read-only "subscribers" are able to deal with the other store objects just fine.

This is an inconvenience for rolling out new versions of the lower store, but not a show stopper.
One solution is "multi version concurrency control" where consumers get an immutable snapshot of the lower store for the duration of each login.
New snapshots can only be obtained when consumers log in again, and old snapshots can only be retired once every consumer viewing them logs out.
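
A minimal sketch of what that could look like on the consumer side, assuming the publisher exports dated, immutable snapshots; the snapshot layout and all paths here are assumptions for illustration, not part of this proposal:

```
# At login, pin this session to the newest snapshot and use it as the
# lower store (ideally mounted read-only), then mount the overlay on top.
snapshot=$(ls /exports/nix-snapshots | sort | tail -n 1)
mount --bind "/exports/nix-snapshots/$snapshot" /mnt/lower
mount -t overlay overlay \
  -o lowerdir=/mnt/lower/nix/store,upperdir=/mnt/scratch/nix/upper/store,workdir=/mnt/scratch/nix/work \
  /nix/store
# The publisher may retire this snapshot only once every session using it
# has logged out.
```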

## `local-overlay` lacks normal form for the database

A slight drawback of the architecture is the lack of a normal form.
A store object in the lower store may or may not have a DB entry in the `local-overlay` store.
These are the two "both" nodes in the last diagram.

This introduces some flexibility in the system: the same "logical" layered store can be represented in multiple different "physical" configurations.
This isn't a problem *per se*, but does mean there is a bit more complexity to consider during testing and system administration.

## Deleting isn't intuitive

For "deleting" lower store objects in the `local-overlay` store,
we don't actually remove them but just remove the upper DB entry.
This is somewhat surprising, but reflects the fact that the lower store is logically immutable (even when it isn't a `local` store opened in read-only mode).
By deleting the upper DB entry, we are not removing the object from the `local-overlay` store, but we are still resetting it to its initial state (present only via the lower layer).
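
For instance, "deleting" a path that is backed by the lower layer is intended to behave roughly like this (a sketch of the semantics described above; the store path is hypothetical):

```
# Removes only the overlay DB entry: the filesystem data stays in the
# lower layer, so the object remains logically present in the store.
nix store delete /nix/store/<hash>-example
# The object can be registered in the overlay DB again later, when the
# overlay store next needs it.
```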

# Alternatives
[alternatives]: #alternatives

## Stock Nix with OverlayFS ("overlay everything" from motivation)

It is possible to use OverlayFS on the entire Nix store with stock Nix.
Just as added store objects would appear in the upper layer, so would the SQLite DB after any modification.

There are a few problems with this:

1. Once the lower store is "forked" in this way, there is no "merge".
The modified DB in the upper layer will completely cover up the lower store's DB.
If any new store objects are added to the lower store, the OverlayFS-combined local store will never know.

2. The database can be very large.
For example, Replit's 16 TB store has a 634 MB database.
OverlayFS doesn't know how to duplicate only part of a SQLite database, so we have to duplicate the whole thing per user/consumer.
This will waste lots of space with information the consumer may not care about.
And for a many-user product like Replit's, that will be wasting precious disk quota from the user's perspective too.

## Use `nix-store --load-db` with the above

As suggested [on discourse](https://discourse.nixos.org/t/super-colliding-nix-stores/28462/7), it is possible to augment the DB with additional paths after it has been created.
Indeed, this is how the ISO installer for NixOS works.
Running this periodically can allow the consumer to pick up any new paths added to the lower store since the last run.
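
Concretely, such a periodic refresh might look roughly like this (a sketch; `nix-store --dump-db` and `nix-store --load-db` are existing commands, but the paths and the use of `--store` to address the lower store are illustrative):

```
# Dump registration info (paths, hashes, references, ...) from the lower store.
nix-store --store /mnt/lower --dump-db > /tmp/lower-registration
# Register all of those paths in the consumer's own store DB.
nix-store --load-db < /tmp/lower-registration
```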

This does solve the first problem, but at a significant performance cost.
We don't know what paths have changed, so we have to slurp up the entire DB.
With Replit's 16 TB store's DB, this took 10 minutes.

We could try to track "revisions" out of band so that only new paths are added, but then we are adding a new source of complexity versus the "statelessness" of accessing the lower store on demand (and relying on monotonicity).

## Bind mounts instead of overlayfs for `local-overlay` Store

Instead of mounting the entire lower store dir underneath ours via OverlayFS, we could bind mount individual store objects as we need them.
The bind-mounting store would no longer be the union of the lower store with the additional store objects; instead the lower store acts more as a special optimized substituter.
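
For a single (hypothetical) store object, the mechanism would look roughly like the following; in practice Nix itself would have to issue such mounts on demand, which is where the elevated-permission concern below comes from:

```
# Make one lower-store object visible in the consumer's store without
# copying its data; one mount like this is needed per shared store object.
mkdir -p /nix/store/<hash>-example
mount --bind /mnt/lower/nix/store/<hash>-example /nix/store/<hash>-example
```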

This gives a normal form in that objects are bind-mounted if and only if they have a DB entry in the bind-mounting store;
there is no "have store object, don't yet have DB entry" middle state to worry about.

The downside of this is that Nix needs elevated permissions in order to create those bind mounts, and the impact of having arbitrarily many bind mounts is unknown.
Even if this design works fine once set up, the imposition of an O(n) initialization setting up each bind mount is prohibitive for many use-cases.

## Store implementations using FUSE

We could have a single FUSE mount that could manually implement the "bind on demand" semantics described above without cluttering the mount namespace with an entry for each shared store object.
FUSE however is quite primitive, in that every read needs to be shuffled through the FUSE server.
There is nothing like a "control plane vs data plane" separation where Nix could tell the OS "this directory is that other directory", and the OS can do the rest without involving the FUSE server.
Review comment (Member):

This is false; FUSE passthrough can be used to bypass the FUSE server.
There are multiple folks in the Linux filesystem ecosystem who are working on bringing FUSE filesystems to relatively on-par performance with in-kernel filesystems, or on reusing the existing filesystems.

https://lwn.net/Articles/932060/
https://github.com/extfuse/extfuse
https://lpc.events/event/16/contributions/1339/

I am not really convinced of not pursuing this alternative, as I feel like this brings the maximum flexibility and compatibility for all the use cases, instead of making it a very limited thing based on OverlayFS.

If you are interested in chatting more about how to make this alternative possible, feel free to ping me; I know quite a bit about FUSE filesystems and I am planning to write a nixstorefs at some point, which will rely on FUSE semantics first.

I also challenge the claim of "worse performance" than in-kernel mounting solutions; it would be good to bring data on that. If you only pay the open cost, this is quite cheap.

Review comment (Member):

Anyway, I have mixed feelings about the RFC because I feel like the FUSE approach is a much better route than this one; I can say that I am not satisfied and feel like it should be more ambitious if it is going to take the RFC route.

Reply (Member Author):

Oh that is fantastic! That is getting at the heart of the problems with FUSE. I wish I had known about this earlier.

However, even if I did, I would have advocated starting with the OverlayFS approach. That is because all these things are not yet mainline, and kernel development / trying out bleeding edge features is vastly more expensive.

IMO the right thing to do is

  1. Accept what we have on an experimental basis, using it to drum up interest and move us towards being able to pool resources on this.
  2. Get in communication with these Kernel devs; indeed I was already emailing back and forth with Amir Goldstein about some restrictions in OverlayFS.
  3. Try to be an early adopter of this stuff as it matures; nudge its development so it better meets our needs.

Also CC @flokli, because Tvix may be better positioned to be at the vanguard of trying this stuff out, as they are already exploring FUSE.

Review comment (Member):

Disclosure: I am also a tvix-store developer ;)

Review comment (Member):

Alright, your argument is sound and convinced me :).

Reply (Member Author):

Oh my bad I didn't realize that. @RaitoBezarius you should shepherd this :).

Review comment (Member):

Aaaaaaaaaaa, OK for the nerdsnipe.

Comment (@ballit6782, Oct 31, 2023):

There may be another approach with FUSE: using MergerFS with symlinkify=true and cache.symlinks=true. This way, FUSE is only used to resolve the symlink:

  - for a lot of applications, the opened directory fd will be reused, so most operations will run directly on the underlying store
  - later uses of the symlink would be cached by the kernel, so even without fuse-ebpf, we'd need the jump to userspace only once per symlink

This also relies on lower stores only growing monotonically when used, so that links would not go stale.

That means the performance of FUSE is potentially worse than that of these in-kernel mounting solutions.

Also, this would require a good deal more code than the other solutions.
Perhaps this is a good thing; the status quo of Nix having to keep the OS and DB in sync is not as elegant as Nix squarely being the "source of truth", providing both filesystem and non-filesystem data about each store object directly.
But it represents a significantly greater departure from the status quo.

# Unresolved questions
[unresolved]: #unresolved-questions

None at this time.

# Future work
[future]: #future-work

## Alternative to read-only local store

We are considering a hybrid between the `local` store and `file://` store.
This would *not* use NARs, but would use "NAR info" files instead of a SQLite database.
This would side-step the concurrency issues of SQLite's read-only mode, and make "append only" usage of the store not require much synchronization.
(This is because the internal structure of the filesystem, unlike the internal structure of SQLite, is not visible to clients.)

It is true that this is much slower used directly --- that is why Nix switched to using SQLite in the first place --- but in combination with a `local-overlay` store this doesn't matter.
Since non-filesystem data is copied into the `local-overlay` store's DB, it will effectively act as a cache, speeding up future queries.
Each NAR info file only needs to be read once.

## "Pivoting" between lower store

Suppose a lower store wants to garbage collect some paths that overlay stores use.
As written above, this is illegal.
But if there was a "migration period" where the overlay stores could know what store objects were going to go away, they could copy them to the upper layer.

A way to implement that is by exposing both the old and new versions of the lower store.
(This is simple enough to do with file systems that expose snapshots; it is out of scope for Nix itself.)
Nix would then look at both the old and new lower stores, compute the diff, and then copy things over.
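
One conceivable sketch of that diff-and-copy step, assuming the old and new snapshots of the lower store are mounted at illustrative paths and that `nix copy` into the overlay store materializes data in the upper layer (both assumptions, not part of this proposal):

```
# Store paths present in the old lower store but absent from the new one.
comm -23 \
  <(ls /mnt/lower-old/nix/store | sort) \
  <(ls /mnt/lower-new/nix/store | sort) |
while read -r p; do
  # Copy each disappearing object into the overlay store before the old
  # snapshot goes away. (In practice one would restrict this to paths the
  # overlay store still references.)
  nix copy --from /mnt/lower-old "/nix/store/$p"
done
```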

This goes well with the "fsck" operation, which also needs to check the entire upper layer for dangling references.