Conversation

@mvandenburgh
Member

This PR lays out a design for S3 backup using S3 Replication and the Glacier Deep Archive storage class. Related: #524
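
As a rough illustration of the mechanism (not the design doc's actual configuration), a replication rule that lands replicas directly in Deep Archive might look like the following boto3 sketch; the bucket names and IAM role ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical sketch: replicate all objects from the primary bucket into a
# backup bucket, storing the replicas in Glacier Deep Archive. Both buckets
# must have versioning enabled for S3 Replication to work.
s3.put_bucket_replication(
    Bucket="dandi-primary-example",  # placeholder bucket name
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role",  # placeholder
        "Rules": [
            {
                "ID": "backup-to-deep-archive",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = replicate every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::dandi-backup-example",  # placeholder
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    },
)
```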

@kabilar requested review from satra and yarikoptic on November 3, 2025 19:57

The DANDI Archive is expecting a ramp-up in data volume of 6 PB of new data in each of the next five years, culminating in a total of 30 PB.

Scaling up the previous analysis, the monthly costs are projected to rise to a total of **~$31,000/month** once all of that data is stored. While $1,000/month may be a feasible ongoing cost, $30,000/month is not.
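
For reference, that headline number is consistent with the published Glacier Deep Archive rate of roughly $0.00099/GB-month (treating 1 PB as 10^6 decimal GB; request and replication fees would add a little on top):

$$
30{,}000{,}000\ \text{GB} \times \$0.00099/\text{GB-month} \approx \$29{,}700/\text{month}
$$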


With numbers on this scale, it would be useful to compare the total cost to the equivalent effort of simply expanding the MIT Engaging partition.

Also worth pointing out the long-term sustainability costs: all AWS costs are time-dependent (a cost per month, in perpetuity), whereas the MIT backup, running on owned (equity) hardware, is less so.



The numbers are not as bad now that we've corrected the data volume assumption. It may still be a good idea to compare to MIT Engaging, though if we do, we will want to take non-financial costs into account: engineering time, differences in reliability, and the ongoing cost, in those terms, of maintaining that backup long-term.



How are we currently storing data on that partition? DANDI uses S3 (and, in theory, MinIO for local development or non-standard cloud deployment); what kind of storage system do the Engaging backups use?


The output of `df -hT .`:

```
Filesystem                      Type
hstor004-n1:/group/dandi/001    nfs4
```


@kabilar Do you recall the quote from MIT for expanding that storage? Were there long-term costs or just a one-time deal?

We are no longer considering the use of a bucket in a different region.
@waxlamp force-pushed the s3-backup-design-doc branch 6 times, most recently from dd89dbd to 333137a on November 4, 2025 15:34
We are expecting a bulk of 6PB over the next 5 years, not 30PB.
@waxlamp force-pushed the s3-backup-design-doc branch from 333137a to da16c20 on November 4, 2025 15:35
@waxlamp force-pushed the s3-backup-design-doc branch from 2efa373 to 7429734 on November 4, 2025 15:54
@waxlamp force-pushed the s3-backup-design-doc branch from 7429734 to 3e74365 on November 4, 2025 15:55
Comment on lines 81 to 82
Scaling up the previous analysis, the monthly costs are projected to rise to a total of **~$6,100/month** once all of that data is stored.
The worst-case disaster recovery cost would similarly scale up to a total of **~$16,000**.
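
For context, the ~$16,000 figure is roughly what a bulk retrieval from Deep Archive would cost at the commonly cited ~$0.0025/GB rate (an assumption; current pricing should be confirmed), before per-request fees and any restored-copy storage:

$$
6{,}000{,}000\ \text{GB} \times \$0.0025/\text{GB} = \$15{,}000
$$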


Would appreciate seeing a table of cost estimates per year, assuming a 1 PB increase per year (plus perhaps an extra 500 TB worst-case jump in the next year due to Kabi's latest LINC estimate), with a grand total after 5 years in the last column.
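
A hedged sketch of how such a table could be generated, assuming a flat 1 PB/year growth and the ~$0.00099/GB-month Deep Archive rate (request/replication fees and the possible 500 TB LINC jump excluded):

```python
# Hypothetical cost projection, not official figures.
RATE = 0.00099        # USD per GB-month for Deep Archive (assumption)
PB_IN_GB = 1_000_000  # AWS uses decimal units: 1 PB = 10^6 GB

total_gb = 0
grand_total = 0.0
for year in range(1, 6):
    total_gb += 1 * PB_IN_GB  # +1 PB/year; adjust for the worst-case LINC jump
    monthly = total_gb * RATE
    yearly = monthly * 12
    grand_total += yearly
    print(f"Year {year}: {total_gb / PB_IN_GB:.0f} PB stored, "
          f"${monthly:,.0f}/mo, ${yearly:,.0f}/yr")
print(f"5-year grand total: ${grand_total:,.0f}")
```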

waxlamp and others added 3 commits November 4, 2025 14:30
Co-authored-by: Cody Baker <[email protected]>
When the AWS docs say "GB", they mean 10^9 bytes, not 2^30 bytes.

Co-authored-by: Cody Baker <[email protected]>
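
A one-liner makes the commit's point concrete; the 6 PB figure is simply the growth estimate discussed in this thread:

```python
# Decimal GB (10^9 bytes, what AWS bills against) vs binary GiB (2^30 bytes,
# what tools like `df -h` often report) at DANDI scale.
total_bytes = 6 * 10**15                  # 6 PB, decimal
print(f"{total_bytes / 10**9:,.0f} GB")   # 6,000,000 GB
print(f"{total_bytes / 2**30:,.0f} GiB")  # ~5,587,935 GiB
```
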
waxlamp and others added 3 commits November 4, 2025 14:33
Clarify purpose of calculating the expected bucket storage cost covered
by AWS already.

$$
6{,}000{,}000\ \text{GB} \times \$0.021/\text{GB-month} = \$126{,}000/\text{month}
$$

while the associated backup costs would represent only an additional $`\$5{,}900 / \$126{,}000 \approx 4.7\%`$ of the cost of the storage itself.
To help provide a significant level of safety to an important dataset, AWS may be willing to cover such a low marginal cost.


Suggested change
To help provide a significant level of safety to an important dataset, AWS may be willing to cover such a low marginal cost.
To help provide a significant level of safety to an important database, it may be worth reaching out to see if AWS may be willing to cover such a low marginal cost.

The original wording sounds as if we are speaking for AWS.


Although, given their previous apparent lack of concern about applying Glacier to the main archive contents (to "save ephemeral costs"), I am guessing their perspective is less about the monetary aspect (which is being waived either way) than about the actual additional storage at the data center (essentially doubling the size of the archive, even as it grows).

@CodyCBakerPhD

@satra Two things relevant to this discussion:

  • in your current discussion with AWS cloud architects / open data team, have they had any thoughts or suggestions on the topic of backup?
  • has the MIT Engaging cluster team ever given you a quote for storage expansion (i.e., one-time cost, any recurring costs, etc.)? @kabilar was unaware of any

