
Conversation

@CodyCBakerPhD commented Nov 11, 2025

NESE Tape analog/duplicate of #2627

Based only on rough estimates at this point; need to follow up and get a more concrete commitment before proceeding.

TL;DR: somewhere between 1/4 and 1/3 of the cost of the S3 Replication/Deep Glacier approach.

mvandenburgh and others added 14 commits November 3, 2025 17:30
- We are no longer considering the use of a bucket in a different region.
- We are expecting a bulk of 6 PB over the next 5 years, not 30 PB.
  Co-authored-by: Cody Baker <[email protected]>
- When the AWS docs say "GB", they mean 10^9 bytes, not 2^30.
  Co-authored-by: Cody Baker <[email protected]>
- Clarify purpose of calculating the expected bucket storage cost covered by AWS already.
- Updated the S3 backup design document to include NESE-specific considerations and costs, clarified backup requirements, and refined limitations and future cost projections.
- Added a cost comparison table for NESE Tape and Deep Glacier approaches over five years.
- Updated the cost projections for the DANDI Archive's data volume ramp-up and revised the cost table for NESE and Deep Glacier approaches.

## Cost

Note that unlike the [Deep Glacier approach](https://github.com/dandi/dandi-archive/blob/b3e0a9df4188533723fb2ad4a95506aa724fc089/doc/design/s3-backup.md), no egress cost would be incurred by the data transfer, since all operations stay strictly within the open data bucket.
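For a sense of scale, here is a back-of-envelope sketch of what internet egress could cost if it were not covered; the ~6 PB volume comes from the projections above, while the flat $0.09/GB figure is an assumed first-tier rate that ignores tiered discounts:

```python
# Hypothetical back-of-envelope only: assumes ~6 PB total and a flat
# $0.09/GB first-tier internet egress rate (real AWS billing is tiered).
total_gb = 6 * 1_000_000                   # 6 PB in 10^9-byte GB, per AWS convention
rate_per_gb = 0.09                         # USD; assumed first-tier rate
print(f"~${total_gb * rate_per_gb:,.0f}")  # ~$540,000
```

Even as a rough figure, this is why whether egress is covered dominates the comparison.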
Member

Do we know this for sure? AWS would probably take notice if our data transfer costs suddenly started spiking on a recurring basis due to a backup script downloading the entire multi-PB bucket.

As far as I understand, this isn't a technical limitation of the open data account; from our perspective, the billing for the entire AWS account is paid for by the AWS Open Data program. So, we could apply the same thinking to the Deep Glacier approach - if we put the backup bucket in the open data AWS account, the storage costs would also theoretically be covered by the open data program. In practice, it's more of a policy issue - is AWS willing to pay for backup? Framed that way, I don't think there is a clear winner between the two approaches in terms of AWS transfer costs.

@CodyCBakerPhD (Author) commented Nov 12, 2025

> Do we know this for sure?

Yes.

Historically we have executed identical backup strategies (not in theory; in practice) for Dartmouth (Dropbox) and MIT (ORCD/Engaging).

> AWS would probably take notice if our data transfer costs suddenly started spiking on a recurring basis due to a backup script downloading the entire multi-PB bucket.

I don't know whether they would notice or care; but what would they do? Back-charge the cost to us after the fact? I don't think they can even do that (legally).

In general, this is 'fair use' of the open data program.

> As far as I understand, this isn't a technical limitation of the open data account; from our perspective, the billing for the entire AWS account is paid for by the AWS Open Data program. So, we could apply the same thinking to the Deep Glacier approach - if we put the backup bucket in the open data AWS account, the storage costs would also theoretically be covered by the open data program.

However, our sponsorship (or at least other sponsorships, such as EMBER/BossDB) comes with expectations and allowed total sizes; it is not a free account for us to do whatever we want with. [If it were, why wouldn't we run the Hub and all other compute through that account?]

I haven't seen the DANDI contract stipulations, but from what Satra says about it and from what EMBER/BossDB and others have said, the main restriction is total size. For example, for EMBER, we currently cannot exceed 2.6 PB without approval.

Using the same account to make a bucket of equivalent size (regardless of storage tier, which again, they did not seem to care much about) would double the total used size, or in other words, halve the maximum possible size of the archive.

> In practice, it's more of a policy issue - is AWS willing to pay for backup? Framed that way, I don't think there is a clear winner between the two approaches in terms of AWS transfer costs.

If AWS said they were willing to pay for it all (consider me doubtful; keep in mind that for them such costs are largely notional, since they set their own prices), it would be the best option by far: just a couple of buttons for us to press one time, with no extra cost thereafter.

Comment on lines 46 to 50
## Limitations and Considerations

- **Upper bound on total storage**: No quota for an upper limit of long-term storage has yet been provided, aside from the 70 PB grand total of the entire NESE store; the providers claim expansion is possible as need arises.
- **Replication is eventually consistent**: No guarantees about replication speed (the time between an object finishing upload into the primary bucket and when it is available in the backup bucket) are provided. Based on previous bandwidth experience with the old Dropbox backup, multi-gigabit speeds should be possible and are expected to keep up with ingest rates on the primary S3 bucket.
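As a rough sanity check on that expectation (illustrative figures only; the ~6 PB over 5 years volume comes from the commit notes above, and the 2 Gbit/s sustained throughput is an assumption, not a measurement):

```python
# Illustrative arithmetic: can multi-gigabit replication keep up with ingest?
ingest_tb_per_day = 6_000 / (5 * 365)                         # ~3.3 TB/day average
sustained_gbit_s = 2                                          # assumed sustained link speed
replicate_tb_per_day = sustained_gbit_s / 8 * 86_400 / 1_000  # ~21.6 TB/day
print(f"{ingest_tb_per_day:.1f} vs {replicate_tb_per_day:.1f} TB/day")
```

Under these assumptions, replication throughput exceeds the average ingest rate by roughly a factor of six.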

Member

I think there are two additional considerations worth thinking about here.

1. If we're considering cost purely in terms of the monthly bills for cloud services, the NESE backup approach is clearly cheaper. But, as mentioned further up in the document, there is likely going to be some manual work to perform the transfer. This will consume engineering time and add to the maintenance burden of the system much more than the Deep Glacier approach; outside of the initial work to set it up and do the initial transfer, the Deep Glacier approach is largely hands-off and automatic.

2. One of the motivations for backup is obviously data security: the data collected on DANDI is the most valuable part of the project, and any data-loss bug could be catastrophic for the project as a whole. This is one of the reasons #2367 ("Add Asset garbage collection design doc") is currently on hold; we want to get an ironclad backup system in place before we start deleting "garbage" from the bucket, in case there is a bug in our code. With the Deep Glacier design, there is essentially no hand-rolled code being run: it is simply configuring various options in AWS to facilitate backup and then deferring the actual backup/replication logic to AWS. With the NESE design, there would presumably need to be at least some custom code/scripts written to facilitate the download of data from S3 followed by the upload to Dartmouth's system; this theoretically leaves more room for bugs to sneak in.

Note, I'm not necessarily arguing for or against either design at this point (I haven't read this enough times or thought it through enough yet); I just want to make sure these points are discussed.

@CodyCBakerPhD (Author)

> If we're considering cost purely in terms of the monthly bills for cloud services, the NESE backup approach is clearly cheaper. But, as mentioned further up in the document, there is likely going to be some manual work to perform the transfer. This will consume engineering time and add to the maintenance burden of the system much more than the Deep Glacier approach; outside of the initial work to set it up and do the initial transfer, the Deep Glacier approach is largely hands-off and automatic.

As mentioned in certain parts of the document and PR, we do need to get a solid commitment from Dartmouth IT, but in conversations they indicated the transfer would be covered by them, so it would be 'automatic for us'.

At a minimum, it would be comparable to the work I've done with the MIT backup (a similar approach: move the data from S3 to a local HPC, and from there the tape copy process can begin). It is not a huge engineering effort, and it has been done before, so we would not be starting from scratch; a minimal sketch of the staging step is below.
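To make the scale of that custom code concrete, here is a minimal sketch of the S3-to-HPC staging step; the bucket name, staging path, and resume check are all hypothetical, not the actual DANDI or MIT configuration:

```python
# Minimal sketch of the S3 -> local HPC staging step (hypothetical names
# throughout; assumes AWS credentials are configured in the environment).
import pathlib

import boto3

BUCKET = "example-open-data-bucket"                # hypothetical
STAGING = pathlib.Path("/scratch/backup-staging")  # hypothetical HPC path

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        target = STAGING / obj["Key"]
        target.parent.mkdir(parents=True, exist_ok=True)
        # Crude resume logic: skip objects already staged at the same size.
        if target.exists() and target.stat().st_size == obj["Size"]:
            continue
        s3.download_file(BUCKET, obj["Key"], str(target))
# From here, the tape copy process (handled on the NESE/Dartmouth side)
# would pick up the staged files.
```

A production version would parallelize the downloads and verify checksums, but the moving parts are modest.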

> One of the motivations for backup is obviously data security: the data collected on DANDI is the most valuable part of the project, and any data-loss bug could be catastrophic for the project as a whole. This is one of the reasons #2367 is currently on hold; we want to get an ironclad backup system in place before we start deleting "garbage" from the bucket, in case there is a bug in our code. With the Deep Glacier design, there is essentially no hand-rolled code being run: it is simply configuring various options in AWS to facilitate backup and then deferring the actual backup/replication logic to AWS. With the NESE design, there would presumably need to be at least some custom code/scripts written to facilitate the download of data from S3 followed by the upload to Dartmouth's system; this theoretically leaves more room for bugs to sneak in.

Yes, this is true; I'll add it to the doc.

However, has anyone ever rigorously tested the claimed S3 restoration services? One might speculate that bugs could also exist there, but (a) no one outside of Amazon could know about them ahead of time, and (b) we would not find out until it is 'too late', i.e., when we actually need a restore.
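For comparison, a hedged sketch of what a minimal restoration spot check on the Deep Glacier side could look like; the bucket and key names are hypothetical, and a rigorous test would restore a sample of objects and verify their checksums against the originals:

```python
# Hypothetical spot check of S3 Deep Archive restoration (names illustrative).
import boto3

s3 = boto3.client("s3")
BUCKET = "example-backup-bucket"  # hypothetical
KEY = "blobs/example-object"      # hypothetical

# Initiate a bulk-tier restore; the temporary copy stays available for 1 day.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Bulk"}},
)
# The Restore header reads 'ongoing-request="true"' until the copy is ready
# (bulk restores from Deep Archive typically take up to ~48 hours).
print(s3.head_object(Bucket=BUCKET, Key=KEY).get("Restore"))
```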

- Updated the document to reflect changes in the S3 Backup design, including storage capacities and service details.
- Clarified maximum file size limitation and added OSN storage cost details.
@CodyCBakerPhD (Author) commented Jan 2, 2026

Replaced by #2684, which has all final details solidified with NESE (via Dartmouth) and ORCD.

Quick link: https://github.com/CodyCBakerPhD/dandi-archive/blob/all_backup/doc/design/all_backup_options.md
