
Conversation

@CodyCBakerPhD commented Nov 11, 2025

NESE Tape analog/duplicate of #2627

Based only on rough estimates at this point; need to follow up and get a more concrete commitment before proceeding.

TL;DR: somewhere between 1/4 and 1/3 of the cost of the S3 Replication/Deep Glacier approach.

mvandenburgh and others added 14 commits November 3, 2025 17:30
- We are no longer considering the use of a bucket in a different region.
- We are expecting a bulk of 6 PB over the next 5 years, not 30 PB.
  Co-authored-by: Cody Baker <[email protected]>
- When the AWS docs say "GB", they mean 10^9 bytes, not 2^30.
  Co-authored-by: Cody Baker <[email protected]>
- Clarify purpose of calculating the expected bucket storage cost covered by AWS already.
- Updated the S3 backup design document to include NESE-specific considerations and costs, clarified backup requirements, and refined limitations and future cost projections.
- Added a cost comparison table for NESE Tape and Deep Glacier approaches over five years.
- Updated the cost projections for the DANDI Archive's data volume ramp-up and revised the cost table for NESE and Deep Glacier approaches.

## Cost

Note that unlike the [Deep Glacier approach](https://github.com/dandi/dandi-archive/blob/b3e0a9df4188533723fb2ad4a95506aa724fc089/doc/design/s3-backup.md), no egress cost would be incurred by the data transfer, since all operations stay strictly within the open data bucket.
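For a sense of scale, here is a back-of-envelope sketch of what internet egress could cost if it were not covered; the ~6 PB volume comes from the projections above, while the flat $0.09/GB figure is an assumed first-tier rate that ignores tiered discounts:

```python
# Hypothetical back-of-envelope only: assumes ~6 PB total and a flat
# $0.09/GB first-tier internet egress rate (real AWS billing is tiered).
total_gb = 6 * 1_000_000                   # 6 PB in 10^9-byte GB, per AWS convention
rate_per_gb = 0.09                         # USD; assumed first-tier rate
print(f"~${total_gb * rate_per_gb:,.0f}")  # ~$540,000
```

Even as a rough figure, this is why whether egress is covered dominates the comparison.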
Member

Do we know this for sure? AWS would probably take notice if our data transfer costs suddenly started spiking on a recurring basis due to a backup script downloading the entire multi-PB bucket.

As far as I understand, this isn't a technical limitation of the open data account; from our perspective, the billing for the entire AWS account is paid for by the AWS Open Data program. So, we could apply the same thinking to the Deep Glacier approach - if we put the backup bucket in the open data AWS account, the storage costs would also theoretically be covered by the open data program. In practice, it's more of a policy issue - is AWS willing to pay for backup? Framed that way, I don't think there is a clear winner between the two approaches in terms of AWS transfer costs.

@CodyCBakerPhD (Author) commented Nov 12, 2025

> Do we know this for sure?

Yes.

Historically we have executed identical backup strategies (not in theory; in practice) for Dartmouth (Dropbox) and MIT (ORCD/Engaging).

> AWS would probably take notice if our data transfer costs suddenly started spiking on a recurring basis due to a backup script downloading the entire multi-PB bucket.

I don't know whether they would notice or care; but what would they do? Back-charge the cost to us after the fact? I don't think they can even do that (legally).

In general, this is 'fair use' of the open data program.

> As far as I understand, this isn't a technical limitation of the open data account; from our perspective, the billing for the entire AWS account is paid for by the AWS Open Data program. So, we could apply the same thinking to the Deep Glacier approach - if we put the backup bucket in the open data AWS account, the storage costs would also theoretically be covered by the open data program.

However, our sponsorship (or at least other sponsorships, such as EMBER/BossDB) comes with expectations and allowed total sizes; it is not a free account for us to do whatever we want with. [If it were, why wouldn't we run the Hub and all other compute through that account?]

I haven't seen the DANDI contract stipulations, but from what Satra says about it and from what EMBER/BossDB and others have said, the main restriction is total size. For example, for EMBER, we currently cannot exceed 2.6 PB without approval.

Using the same account to make a bucket of equivalent size (regardless of storage tier, which again, they did not seem to care much about) would double the total used size, or in other words, halve the maximum possible size of the archive.

> In practice, it's more of a policy issue - is AWS willing to pay for backup? Framed that way, I don't think there is a clear winner between the two approaches in terms of AWS transfer costs.

If AWS said they were willing to pay for it all (consider me doubtful; keep in mind that for them such costs are largely notional, since they set their own prices), it would be the best option by far: just a couple of buttons for us to press one time, with no extra cost thereafter.

Comment on lines 46 to 50
## Limitations and Considerations

- **Upper bound on total storage**: No quota for an upper limit of long-term storage has yet been provided, aside from the 70 PB grand total of the entire NESE store; the providers claim expansion is possible as need arises.
- **Replication is eventually consistent**: No guarantees about replication speed (the time between an object finishing upload into the primary bucket and when it is available in the backup bucket) are provided. Based on previous bandwidth experience with the old Dropbox backup, multi-gigabit speeds should be possible and are expected to keep up with ingest rates on the primary S3 bucket.
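As a rough sanity check on that expectation (illustrative figures only; the ~6 PB over 5 years volume comes from the commit notes above, and the 2 Gbit/s sustained throughput is an assumption, not a measurement):

```python
# Illustrative arithmetic: can multi-gigabit replication keep up with ingest?
ingest_tb_per_day = 6_000 / (5 * 365)                         # ~3.3 TB/day average
sustained_gbit_s = 2                                          # assumed sustained link speed
replicate_tb_per_day = sustained_gbit_s / 8 * 86_400 / 1_000  # ~21.6 TB/day
print(f"{ingest_tb_per_day:.1f} vs {replicate_tb_per_day:.1f} TB/day")
```

Under these assumptions, replication throughput exceeds the average ingest rate by roughly a factor of six.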

Member

I think there are two additional considerations worth thinking about here.

1. If we're considering cost purely in terms of the monthly bills for cloud services, the NESE backup approach is clearly cheaper. But, as mentioned further up in the document, there is likely going to be some manual work to perform the transfer. This will consume engineering time and add to the maintenance burden of the system much more than the Deep Glacier approach; outside of the initial work to set it up and do the initial transfer, the Deep Glacier approach is largely hands-off and automatic.

2. One of the motivations for backup is obviously data security: the data collected on DANDI is the most valuable part of the project, and any data-loss bug could be catastrophic for the project as a whole. This is one of the reasons #2367 ("Add Asset garbage collection design doc") is currently on hold; we want to get an ironclad backup system in place before we start deleting "garbage" from the bucket, in case there is a bug in our code. With the Deep Glacier design, there is essentially no hand-rolled code being run: it is simply configuring various options in AWS to facilitate backup and then deferring the actual backup/replication logic to AWS. With the NESE design, there would presumably need to be at least some custom code/scripts written to facilitate the download of data from S3 followed by the upload to Dartmouth's system; this theoretically leaves more room for bugs to sneak in.

Note, I'm not necessarily arguing for or against either design at this point (I haven't read this enough times or thought it through enough yet); I just want to make sure these points are discussed.

@CodyCBakerPhD (Author)

> If we're considering cost purely in terms of the monthly bills for cloud services, the NESE backup approach is clearly cheaper. But, as mentioned further up in the document, there is likely going to be some manual work to perform the transfer. This will consume engineering time and add to the maintenance burden of the system much more than the Deep Glacier approach; outside of the initial work to set it up and do the initial transfer, the Deep Glacier approach is largely hands-off and automatic.

As mentioned in certain parts of the document and PR, we do need to get a solid commitment from Dartmouth IT, but in conversations they indicated the transfer would be covered by them, so it would be 'automatic for us'.

At a minimum, it would be comparable to the work I've done with the MIT backup (a similar approach: move the data from S3 to a local HPC, and from there the tape copy process can begin). It is not a huge engineering effort, and it has been done before, so we would not be starting from scratch; a minimal sketch of the staging step is below.
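To make the scale of that custom code concrete, here is a minimal sketch of the S3-to-HPC staging step; the bucket name, staging path, and resume check are all hypothetical, not the actual DANDI or MIT configuration:

```python
# Minimal sketch of the S3 -> local HPC staging step (hypothetical names
# throughout; assumes AWS credentials are configured in the environment).
import pathlib

import boto3

BUCKET = "example-open-data-bucket"                # hypothetical
STAGING = pathlib.Path("/scratch/backup-staging")  # hypothetical HPC path

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        target = STAGING / obj["Key"]
        target.parent.mkdir(parents=True, exist_ok=True)
        # Crude resume logic: skip objects already staged at the same size.
        if target.exists() and target.stat().st_size == obj["Size"]:
            continue
        s3.download_file(BUCKET, obj["Key"], str(target))
# From here, the tape copy process (handled on the NESE/Dartmouth side)
# would pick up the staged files.
```

A production version would parallelize the downloads and verify checksums, but the moving parts are modest.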

> One of the motivations for backup is obviously data security: the data collected on DANDI is the most valuable part of the project, and any data-loss bug could be catastrophic for the project as a whole. This is one of the reasons #2367 is currently on hold; we want to get an ironclad backup system in place before we start deleting "garbage" from the bucket, in case there is a bug in our code. With the Deep Glacier design, there is essentially no hand-rolled code being run: it is simply configuring various options in AWS to facilitate backup and then deferring the actual backup/replication logic to AWS. With the NESE design, there would presumably need to be at least some custom code/scripts written to facilitate the download of data from S3 followed by the upload to Dartmouth's system; this theoretically leaves more room for bugs to sneak in.

Yes, this is true; I'll add it to the doc.

However, has anyone ever rigorously tested the claimed S3 restoration services? One might speculate that bugs could also exist there, but (a) no one outside of Amazon could know about them ahead of time, and (b) we would not find out until it is 'too late', i.e., when we actually need a restore.
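For comparison, a hedged sketch of what a minimal restoration spot check on the Deep Glacier side could look like; the bucket and key names are hypothetical, and a rigorous test would restore a sample of objects and verify their checksums against the originals:

```python
# Hypothetical spot check of S3 Deep Archive restoration (names illustrative).
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-bucket"  # hypothetical
KEY = "blobs/example-object"      # hypothetical

# Initiate a bulk-tier restore; the temporary copy stays available for 1 day.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Bulk"}},
)
# The Restore header reads 'ongoing-request="true"' until the copy is ready
# (bulk restores from Deep Archive typically take up to ~48 hours).
print(s3.head_object(Bucket=BUCKET, Key=KEY).get("Restore"))
```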

- Updated the document to reflect changes in the S3 Backup design, including storage capacities and service details.
- Clarified maximum file size limitation and added OSN storage cost details.
@CodyCBakerPhD (Author) commented Jan 2, 2026

Replaced by #2684, which has all final details solidified with NESE (via Dartmouth) and ORCD.

Quick link: https://github.com/CodyCBakerPhD/dandi-archive/blob/all_backup/doc/design/all_backup_options.md
