DRS write support #416

base: develop

Conversation
@patmagee pointed out that the DELETE endpoint doesn't support passports! We need to rethink this; maybe add a POST /drs/object/{id}/delete endpoint? From ChatGPT 5: Yes, the HTTP/1.1 spec does not forbid a request body on a DELETE request, but it assigns such a body no defined semantics, so servers and intermediaries may ignore or reject it.

Key implications:

If you want, I can give you examples of DELETE requests with a JSON body and how popular frameworks like Express, Flask, and Spring Boot handle them, since the real-world behavior is inconsistent. Would you like me to do that?
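Given the undefined semantics of a DELETE body, one option raised above is a POST-based delete endpoint so the passports can travel in a well-defined request body. A minimal sketch of what such a request could look like, assuming a hypothetical /objects/{object_id}/delete path:

```python
import json

def build_delete_request(object_id, passports):
    """Build a POST-based delete request so GA4GH passports travel in the
    body; HTTP/1.1 gives a DELETE body no defined semantics, so many stacks
    drop or reject it. The endpoint path here is hypothetical, not spec."""
    return {
        "method": "POST",
        "path": f"/ga4gh/drs/v1/objects/{object_id}/delete",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"passports": passports}),
    }

req = build_delete_request("3fa85f64-5717-4562-b3fc-2c963f66afa6", ["eyJhbGciOi..."])
```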
Hi @briandoconnor - Thanks for the PR, which has stimulated me to respond with some details of a different implementation that we've been working on for a use case in the GEL context. To summarise our approach: GEL forms part of the UK NHS Genomic Medicine Service (GMS), and we are exploring DRS as a standard to share genomic files with partners in the GMS. A new use case for us is to enable genomic labs to share genomic data with GEL such that it is then available over a DRS API. As part of our DRS implementation we already added support for POST requests, with a request body like:

```json
{
  "objects": [
    {
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "my FASTQ",
      "aliases": [
        "string"
      ]
    }
  ]
}
```

indicating that the client would like to upload a file of 12345 bytes with the supplied name and checksum. The server responds with a response that looks like:

```json
{
  "objects": {
    "my_data.fastq": {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "self_uri": "string",
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "string",
      "aliases": [
        "string"
      ],
      "upload_methods": [
        {
          "type": "s3",
          "access_url": {
            "url": "s3://bucket/some/prefix",
            "headers": [
              "string"
            ]
          },
          "region": "string",
          "credentials": {
            "AccessKeyId": "string",
            "SecretAccessKey": "string",
            "SessionToken": "string"
          }
        },
        {
          "type": "https",
          "access_url": {
            "url": "https://pre.signed.url/?X-Aws...",
            "headers": [
              "string"
            ]
          }
        }
      ]
    }
  }
}
```

The client can then select their preferred upload_method and upload the data using it, after which the object is available over the DRS API. A feature we like about this implementation is the upload_methods mechanism. Comments on this suggestion are very welcome, and I'm happy to share OpenAPI specs and/or example client and server code from our prototype implementation of this scheme if that would be useful. If so, please let me know whether a separate PR would make sense. I write this comment without completely reviewing your suggested implementation, just so that you're aware of how we're thinking about this before I get distracted again ;) I will review your PR in more detail and revert with more thoughts on how we could potentially combine these approaches if that would be useful.
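Given a response like the one above, the client-side selection step might look like this sketch (field names are taken from the example response; the preference logic is illustrative):

```python
def pick_upload_method(drs_object, preference=("s3", "https")):
    """Return the first advertised upload method matching the client's
    preference order, falling back to whatever the server offered first."""
    by_type = {m["type"]: m for m in drs_object.get("upload_methods", [])}
    for kind in preference:
        if kind in by_type:
            return by_type[kind]
    return drs_object["upload_methods"][0]

obj = {
    "upload_methods": [
        {"type": "s3", "access_url": {"url": "s3://bucket/some/prefix"}},
        {"type": "https", "access_url": {"url": "https://pre.signed.url/?X-Aws..."}},
    ]
}
chosen = pick_upload_method(obj, preference=("https", "s3"))
```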
Hey @briandoconnor thanks for the PR! DRS write APIs have been something we've been excited to use in our Git DRS file tracking system, as well as for other use cases like our TES implementation. We had a couple of discussion points we've identified; some are blockers for our use case, others are clarifying questions.

**1. Multipart Upload Support**

To +1 @grsr's point, our use case also covers pushing large files to S3, where multipart upload becomes necessary. The current design provides only a single upload URL via POST /objects, which doesn't support multipart uploads for large objects (e.g., S3's multipart upload API). Following this thought, would the server determine the multipart strategy based on file size and backend capabilities, and return multiple upload URLs?

**2. All-or-Nothing Deletion**

To my understanding, deleting a DRS object removes it from all storage locations simultaneously. Would this mean that you need access to all buckets to delete a DRS object? Or is only the subset of files in your accessible buckets deleted, with the DRS record itself staying intact if not all file locations are deleted?

**3. Storage Location Registration Scope**

Is storage bucket registration handled through the API or exclusively via server-side configuration? This is more of a clarifying question to understand the scope of an admin's abilities via the API.

**4. GA4GH Passport Integration Architecture**

It seems like passports are currently embedded directly into POST calls rather than following a standardized authentication flow. Could you clarify the rationale behind that? This tight coupling makes it difficult to reuse / implement the same authentication mechanism across other GA4GH APIs (e.g. TES).

Appreciate the docs / PR / design created around this; it's been really helpful when thinking through our implementation of these new APIs! Cheers
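On the multipart question in point 1, "server determines multipart strategy based on file size" could reduce to something like this sketch (part size and decision logic are illustrative; the 5 GiB single-PUT limit and 10,000-part cap are S3's documented limits):

```python
def plan_upload(size_bytes, part_size=64 * 1024 * 1024):
    """Decide single vs. multipart upload from object size; a server could
    run this before minting one or many signed URLs."""
    single_put_limit = 5 * 1024**3       # S3 caps a single PUT at 5 GiB
    if size_bytes <= single_put_limit:
        return {"strategy": "single", "parts": 1}
    parts = -(-size_bytes // part_size)  # ceiling division
    if parts > 10000:                    # S3 caps multipart uploads at 10,000 parts
        raise ValueError("part_size too small for this object")
    return {"strategy": "multipart", "parts": parts}
```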
**💬 Review Comment: DRS Write Support / Issue #415**

Excellent progress on extending DRS for write operations.

**1️⃣ Multipart Upload via Signed URLs (Explicit Flow)**

Normatively define a storage-agnostic multipart upload flow that maps to S3/GCS/Azure semantics while remaining provider-agnostic. Include required fields, expected status codes, and example request/response pairs for each step.

*Proposed endpoint sequence*

*Checksum semantics*
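Checksum semantics matter here because multipart uploads change what the store reports: S3's ETag for a multipart object is a composite (the MD5 of the concatenated per-part MD5 digests plus a part count), not the MD5 of the whole object. A sketch contrasting that with a portable whole-object digest:

```python
import hashlib

def s3_multipart_etag(parts):
    """S3-style composite ETag for a multipart upload: MD5 of the
    concatenated per-part MD5 digests, suffixed with the part count."""
    digests = [hashlib.md5(p).digest() for p in parts]
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"

def whole_object_sha256(parts):
    """A whole-object sha-256, computable incrementally per part, is one
    portable value the spec could require in the checksums array."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p)
    return h.hexdigest()
```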
**2️⃣ DELETE Semantics: Per-Location Lifecycle**

The current draft implies "DELETE = purge object + all locations."

*Use cases*

*Proposal*
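A per-location lifecycle could behave like this sketch (the access_methods/region fields mirror existing DRS metadata; the purge-on-empty rule is an assumed semantic, not settled spec):

```python
def delete_location(drs_object, caller_region):
    """Remove only the access methods in the caller's region; purge the
    DRS record only once no storage locations remain (assumed semantics)."""
    remaining = [m for m in drs_object["access_methods"]
                 if m.get("region") != caller_region]
    return {"access_methods": remaining, "record_purged": not remaining}

obj = {"access_methods": [{"type": "s3", "region": "eu-west-2"},
                          {"type": "s3", "region": "us-east-1"}]}
step1 = delete_location(obj, "eu-west-2")    # one location left, record kept
step2 = delete_location(step1, "us-east-1")  # no locations left, record purged
```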
**3️⃣ Authentication: Passport Exchange Sidecar**

Keep DRS APIs authentication-method-agnostic. Optionally provide POST /ga4gh/aai/v1/passport:exchange to exchange a GA4GH Passport JWT for a short-lived, repository-scoped bearer token; document the required claims and returned scope/TTL.
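Such an exchange could follow OAuth 2.0 Token Exchange (RFC 8693). The request body below is a sketch with assumed parameter values; in particular, the passport token-type URN is a placeholder, not a registered identifier:

```python
def passport_exchange_request(passport_jwt):
    """Form body for the proposed POST /ga4gh/aai/v1/passport:exchange,
    modeled on RFC 8693 token exchange; token-type URNs are assumptions."""
    return {
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": passport_jwt,
        "subject_token_type": "urn:ga4gh:params:oauth:token-type:passport",
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
    }

body = passport_exchange_request("eyJhbGciOi...")
```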
**4️⃣ Search / Discovery**

DRS core is ID-based only, but discovery is required in real deployments. We could deploy GA4GH Data Connect or extend DRS with a search API.

*Data Connect*

Provide an optional GA4GH Data Connect-compatible search endpoint that indexes DRS metadata (id, path, url, checksums, tags). Specify the query parameters, supported filters, result schema, and performance expectations for checksum lookups.

```sql
SELECT id, path, url
FROM drs_objects
WHERE checksum_type = 'sha-256' AND checksum_value = 'abc...'
```

*Search*
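The checksum lookup above would travel to a Data Connect /search endpoint as a JSON query. A parameterized sketch (the drs_objects table is from the example; the exact request shape should be checked against the Data Connect specification):

```python
def checksum_search_request(checksum_type, checksum_value):
    """Data Connect-style POST /search body for the checksum lookup above;
    the positional-parameter shape is an assumption, check the spec."""
    return {
        "query": ("SELECT id, path, url FROM drs_objects "
                  "WHERE checksum_type = ? AND checksum_value = ?"),
        "parameters": [checksum_type, checksum_value],
    }

req = checksum_search_request("sha-256", "abc...")
```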
**Additional High-Level Gaps to Address**

If the comments above are agreeable, the following items should be addressed before committing to development:

**✅ PR Review Checklist**
Re. #416 (comment) This approach is attractive — client-side tools like the AWS CLI and SDKs already support multipart uploads, retries, and error recovery. However, AFAIK support for STS-style temporary credentials is uneven across S3 implementations:
Because of that, I’d suggest:
That preserves full SDK compatibility where STS exists, while remaining portable across other S3-compatible stores.
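The suggested fallback order can be expressed as a small capability check (the credentials field name mirrors the upload_methods example earlier in the thread; the decision logic is illustrative):

```python
def choose_transfer(upload_methods, sdk_available=True):
    """Prefer STS-style temporary credentials when the store advertises them
    and an SDK is on hand; otherwise fall back to a presigned URL, which any
    S3-compatible store can issue."""
    has_sts = any("credentials" in m for m in upload_methods)
    if sdk_available and has_sts:
        return "sdk_with_temporary_credentials"
    return "presigned_url"

methods = [{"type": "s3", "credentials": {"AccessKeyId": "..."}},
           {"type": "https", "access_url": {"url": "https://..."}}]
```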
Hi @bwalsh - Thanks very much for the comments. Yes, that is the intent of our proposal: we supply a presigned POST URL, which is the simplest fallback option for clients. On your last bullet, do you mean to present the capability of the server to return such temporary credentials? On the note on an inline policy, that is indeed how we implement the scoping of the credentials we return. Further comments very welcome! I will be presenting our proposal at the Cloud WS call on November 10th if you'd like to discuss any of this further there?
Yes.
Thanks, looking forward to it. I'll get it on the calendar.
Hi all - I have just created a separate PR for our proposal here: #418. It is sufficiently different from Brian's proposal that I think it is cleaner to have a separate PR rather than try to make my changes over the top of Brian's. Further discussion very welcome, and I hope the proposal is documented sufficiently to allow others to review it and to understand the motivation for this different approach.
Overview
This pull request updates the Data Repository Service (DRS) OpenAPI specification to enhance functionality in several key ways without (hopefully) creating breaking changes. Key changes include: 1) the ability to create new objects; 2) a mechanism for callers to identify write-back locations (e.g. clouds+regions, on-premise systems); and 3) a mechanism for callers to see which locations they are authorized to write to. Claude Code was used to generate some of these changes.
Related issues
Related Standards
This implementation aligns with:
eLwazi-hosted GA4GH Hackathon
The eLwazi-hosted GA4GH hackathon (7/28-8/1) is working on this issue, given the need expressed by various groups attending the session. For more info, see the agenda.
Built Documentation
The human-readable documentation: https://ga4gh.github.io/data-repository-service-schemas/preview/feature/issue-415-write-support/docs/
Issues/questions for discussion
Key Benefits
Workflow Examples
Simple single-location upload:
Multi-location replication:
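Since the endpoint shapes are still under discussion, here is a hypothetical end-to-end sketch of the simple single-location workflow; all paths, fields, and the FakeClient stub are illustrative stand-ins, not the PR's exact schema:

```python
class FakeClient:
    """In-memory stand-in for an HTTP client so the flow runs offline."""
    def __init__(self):
        self.uploads = {}

    def post(self, path, body):
        # Pretend the server registered the object and minted an upload URL.
        return {"id": "drs-123", "upload_methods": [
            {"type": "https", "access_url": {"url": "https://upload.example/x"}}]}

    def put(self, url, data):
        self.uploads[url] = data

def single_location_upload(client, local_file, destination):
    meta = {"name": local_file["name"], "size": len(local_file["bytes"]),
            "destination": destination}                           # 1. declare intent
    created = client.post("/ga4gh/drs/v1/objects", meta)
    method = created["upload_methods"][0]                         # 2. pick a method
    client.put(method["access_url"]["url"], local_file["bytes"])  # 3. upload bytes
    return created["id"]

client = FakeClient()
drs_id = single_location_upload(
    client, {"name": "my_data.fastq", "bytes": b"ACGT"}, "aws:us-east-1")
```

Multi-location replication would repeat steps 2-3 once per authorized destination before the record is finalized.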
See Ticket for More Information
Ticket #415 has much more information including approaches implemented and what that looks like from a request/response perspective.