
Conversation

@briandoconnor (Contributor) commented Aug 1, 2025

Overview

This pull request updates the Data Repository Service (DRS) OpenAPI specification to add write support in several key ways, hopefully without introducing breaking changes. Key changes include: 1) the ability to create new objects; 2) a mechanism for callers to identify write-back locations (e.g. cloud regions, on-premise systems); and 3) a mechanism for callers to see which locations they are authorized to write to. Claude Code was used to generate some of these changes.

Related issues

Related Standards

This implementation aligns with:

eLwazi-hosted GA4GH Hackathon

The eLwazi-hosted GA4GH hackathon (7/28-8/1) worked on this issue, given the need expressed by various groups attending the session. For more info, see the agenda.

Built Documentation

The human-readable documentation: https://ga4gh.github.io/data-repository-service-schemas/preview/feature/issue-415-write-support/docs/

Issues/questions for discussion

  • do we want bulk options here as well?
  • how would a URL work for write back in an on-premise solution? I guess the DRS server implementation would need to handle this?
  • does the upload URL support more advanced upload techniques like multi-threading?
  • we need to think about a shared-filesystem option in addition to the upload URL option, for systems that might want to use that approach
  • do we need a cleaner way to say DRS write is optional? Could valid 1.6 implementations completely lack upload endpoints, or do we want to rely on error codes like 501 Not Implemented?

Key Benefits

  • Multi-Location Support: Users can upload to multiple cloud regions/providers
  • Authorization Aware: Check permissions before attempting uploads
  • Efficient URL Management: Request upload URLs on-demand to avoid expiration
  • Flexible Replication: Upload to additional locations before finalizing
  • Resource Management: Quota tracking per location
  • Discovery: Service capabilities clearly advertised

Workflow Examples

Simple single-location upload:

  1. Check service-info → see available locations
  2. POST /objects with target_storage_location
  3. Upload to provided URL
  4. POST /objects/{id}/finalize
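As a rough sketch, the four steps above might look like the following in Python. The endpoint paths and field names (such as `target_storage_location` and `upload_url`) follow the PR's examples; the exact request schema is an assumption, not settled spec.

```python
import hashlib


def create_object_request(name, data, storage_location):
    """Build a POST /objects body for a single-location upload.

    Field names are assumptions based on the workflow described in the PR.
    """
    return {
        "name": name,
        "size": len(data),
        "checksums": [
            {"type": "sha-256", "checksum": hashlib.sha256(data).hexdigest()}
        ],
        "target_storage_location": storage_location,
    }


# Sketch of the four steps using a generic HTTP client (not executed here):
#   1. info = GET {base}/service-info                  -> discover available locations
#   2. resp = POST {base}/objects, json=create_object_request(...)
#   3. PUT resp["upload_url"] with the object bytes    -> upload to provided URL
#   4. POST {base}/objects/{resp["id"]}/finalize
```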

Multi-location replication:

  1. Check write authorizations → see permitted locations
  2. POST /objects with primary location
  3. Upload to primary location
  4. POST /objects/{id}/upload-urls for secondary location
  5. Upload to secondary location
  6. POST /objects/{id}/finalize with both locations

See Ticket for More Information

Ticket #415 has much more information, including the approaches implemented and what they look like from a request/response perspective.

@briandoconnor briandoconnor changed the base branch from master to develop August 1, 2025 12:37
@briandoconnor (Contributor, Author) commented Aug 11, 2025

@patmagee pointed out that the DELETE endpoint doesn't support passports! We need to rethink this; maybe add a POST /drs/object/id/delete endpoint?

From ChatGPT 5:

Yes — the HTTP/1.1 spec does not forbid a DELETE request from having a body (payload), but the semantics are tricky:

  • RFC 7231, Section 4.3.5 (the most relevant modern HTTP/1.1 spec) says:

    A payload within a DELETE request has no defined semantics; sending a payload body on a DELETE request might cause some existing implementations to reject the request.

  • In practice, many servers, proxies, and frameworks ignore a DELETE request body entirely, or even reject it, because historically most DELETE requests have no payload.

  • REST guidelines generally assume that the URI identifies the resource to delete and that any necessary parameters go in the path or query string, not in the body.

Key implications:

  • You can send a payload in a DELETE request from a client (e.g., using curl -X DELETE -d '{...}'), but you can’t rely on it being supported unless you know the server is explicitly designed to read and use it.

  • If you need to send additional instructions (e.g., conditional delete, batch delete), it’s often safer to:

    • Encode parameters in query strings:
      DELETE /items?type=old
    • Or use a different method (like POST to a /delete endpoint) if you must pass complex JSON.

If you want, I can give you examples of DELETE requests with a JSON body and how popular frameworks like Express, Flask, and Spring Boot handle them — since the real-world behavior is inconsistent. Would you like me to do that?
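To make the trade-off concrete, here is a minimal Python sketch of the two delete styles discussed above. The POST /objects/{id}/delete path and the `passports` field name are hypothetical, following the idea floated in the comment; they are not part of the current spec.

```python
def delete_via_query(object_id):
    """DELETE with everything in the URI: safe with all servers and proxies,
    but leaves nowhere to carry a passport list."""
    return ("DELETE", f"/ga4gh/drs/v1/objects/{object_id}")


def delete_via_post(object_id, passports):
    """POST to a /delete sub-resource (hypothetical path) so the passport
    list can travel in the request body, avoiding a DELETE payload."""
    return (
        "POST",
        f"/ga4gh/drs/v1/objects/{object_id}/delete",
        {"passports": passports},
    )
```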

@grsr (Collaborator) commented Aug 11, 2025

Hi @briandoconnor - Thanks for the PR, which has stimulated me to respond with some details of a different implementation that we've been working on for a use case in the GEL context. To summarise our approach: GEL forms part of the UK NHS Genomic Medicine Service (GMS), and we are exploring DRS as a standard for sharing genomic files with partners in the GMS. A new use case for us is to enable genomic labs to share genomic data with GEL such that it is then available over a DRS API.

As part of our DRS implementation we already added support for POST requests on the /objects endpoint which simply writes a fully constructed DRS object to the database that backs our existing implementation. This solves the metadata upload problem in a straightforward way, but we also wanted to support some means of negotiating where the files themselves should be uploaded to (for us this is currently a GEL managed AWS S3 bucket) that supports multiple cloud storage suppliers and potentially on-prem systems as well. To this end we have implemented a separate /upload-request endpoint where a client POSTs a payload that looks like:

{
  "objects": [
    {
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "my FASTQ",
      "aliases": [
        "string"
      ]
    }
  ]
}

indicating that the client would like to upload a file of 12345 bytes with the supplied name and checksum. (objects is an array so multiple files can be requested at once; this is useful so that the server can choose to co-locate related objects, such as a CRAM file and its index, or two FASTQ files from a paired-end sequencing run, though this would be implementation dependent.)

The server responds with a response that looks like:

{
  "objects": {
    "my_data.fastq": {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "self_uri": "string",
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "string",
      "aliases": [
        "string"
      ],
      "upload_methods": [
        {
          "type": "s3",
          "access_url": {
            "url": "s3://bucket/some/prefix",
            "headers": [
              "string"
            ]
          },
          "region": "string",
          "credentials": {
            "AccessKeyId": "string",
            "SecretAccessKey": "string",
            "SessionToken": "string"
          }
        },
        {
          "type": "https",
          "access_url": {
            "url": "https://pre.signed.url/?X-Aws...",
            "headers": [
              "string"
            ]
          }
        }
      ]
    }
  }
}

The client can then select their preferred upload_method (intended as the analogue of the existing DRS access_method) from those offered by the server, and upload the data using the details supplied. For general HTTPS support we return an S3 pre-signed URL which can be POSTed to directly. For large genomic data files, however, we want to take advantage of multi-part uploads and other optimisations already implemented in AWS tools, so we also implement an s3 upload_method where we supply time-limited AWS credentials that allow the client to use native AWS libraries to upload data to S3 using the bucket and prefix supplied. This mechanism can naturally be extended to support additional cloud providers, and also additional protocols such as SFTP, which might form one way to upload data to on-prem file systems if they support an SFTP interface (or indeed an S3 interface, as is increasingly common). Note that the spec for the optional credentials field on an upload_method is simply that it is a JSON dictionary, so the spec for this endpoint is completely neutral with respect to implementation-specific approaches to auth.
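A client consuming this response might choose among the offered upload_methods with something like the following sketch. The type values s3 and https follow the example payload above; the preference order is the client's own choice, not mandated by the proposal.

```python
def select_upload_method(upload_methods, preferred=("s3", "https")):
    """Pick the first offered upload_method matching the client's
    preference order; e.g. prefer native S3 (multipart-capable) when the
    server offers it, falling back to a pre-signed HTTPS URL."""
    by_type = {m["type"]: m for m in upload_methods}
    for t in preferred:
        if t in by_type:
            return by_type[t]
    raise ValueError("no supported upload_method offered by server")
```

If the s3 method is selected, the returned credentials dictionary would be fed to a native AWS client (e.g. a boto3 session); if https, the client simply POSTs the bytes to the pre-signed URL.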

Once the data is uploaded using the selected upload_method the client then writes a full DRS object to the DRS server using a POST request to the /objects endpoint, and we make no further changes to the existing DRS spec. It would also be possible to use a new endpoint to POST the DRS object itself if minimal interference with existing implementations is desirable.

A feature we like about this implementation is that the /upload-request endpoint is completely separate from the rest of the DRS endpoints and so can be entirely optional. We have tried to keep the payloads as close to the eventual DRS object as possible, but this is not a hard requirement, and there is lots of scope to extend the payloads to support negotiating additional constraints, such as the object size limit that you include in your implementation.

Comments on this suggestion are very welcome, and I'm happy to share OpenAPI specs and/or example client and server code from our prototype implementation of this scheme if that would be useful. If so, please let me know whether a separate PR would make sense.

I write this comment without having completely reviewed your suggested implementation, just so that you're aware of how we're thinking about this before I get distracted again ;) I will review your PR in more detail and revert with more thoughts on how we could potentially combine these approaches, if that would be useful.

@quinnwai commented Oct 24, 2025

Hey @briandoconnor, thanks for the PR! DRS write APIs have been something we've been excited to use in our Git DRS file tracking system, as well as for other use cases like our TES implementation. We have a couple of discussion points: some are blockers for our use case, others are clarifying questions.

1. Multipart Upload Support

To +1 @grsr's point, our use case also covers pushing large files to S3, where multipart upload becomes necessary. The current design provides only a single upload URL via POST /objects, which doesn't support multipart uploads for large objects (e.g., S3's multipart upload API). Following this thought, would the server determine the multipart strategy based on file size and backend capabilities, and return multiple upload URLs?

2. All-or-Nothing Deletion

To my understanding, deleting a DRS object removes it from all storage locations simultaneously. Would this mean that you need access to all buckets to delete a DRS object? Or are only the files in your accessible buckets deleted, with the DRS record itself staying intact if not all file locations are deleted? Curious whether a DELETE /objects/{id}/locations/{storage_location_id} would make sense here, allowing more granular deletes for submitters managing their own buckets within the subset of available buckets.

3. Storage Location Registration Scope

Is storage bucket registration handled through the API or exclusively via server-side configuration? This is more a clarifying question to understand the scope of an admin's abilities via the API.

4. GA4GH Passport Integration Architecture

It seems like passports are currently embedded directly in POST calls rather than following a standardized authentication flow. Could you clarify the rationale behind that? This tight coupling makes it difficult to reuse/implement the same authentication mechanism across other GA4GH APIs (e.g., TES).

Appreciate the docs / PR / design created around this; it's been really helpful when thinking through our implementation of these new APIs! Cheers

@bwalsh (Member) commented Oct 27, 2025

💬 Review Comment — DRS Write Support / Issue #415

Excellent progress on extending DRS for write operations.
Below are a few clarifications and additions that would make the proposal production-ready and interoperable across implementations.


1️⃣ Multipart Upload via Signed URLs (Explicit Flow)

Normatively define a storage-agnostic multipart upload flow that maps to S3/GCS/Azure semantics while remaining provider-agnostic. Include required fields, expected status codes, and example request/response pairs for each step.

Proposed endpoint sequence

  1. Initialize upload: POST /ga4gh/drs/v1/uploads. Create a new upload session. Body includes {filename, size, checksum, part_size}. Returns upload_id, part_size, expires_at.
  2. Get signed URL for part: GET /ga4gh/drs/v1/uploads/{upload_id}/parts/{part_number}. Returns a time-limited signed_url for direct upload of part n. Response may include expected checksum headers.
  3. Report uploaded part: POST /ga4gh/drs/v1/uploads/{upload_id}/parts. Client reports {part_number, etag, checksum} after upload.
  4. Complete upload: POST /ga4gh/drs/v1/uploads/{upload_id}:complete. Server validates all parts, computes/verifies the full-object checksum, and promotes the object to DrsObject.
  5. Abort upload: DELETE /ga4gh/drs/v1/uploads/{upload_id}. Cancels an incomplete session and cleans up temporary parts.
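Before requesting per-part signed URLs in step 2, the client needs to partition the object into numbered parts. A minimal sketch of that planning step follows; part numbering is 1-indexed, matching S3 multipart conventions, and the rest of the flow (init, per-part PUT, report, complete) would loop over the plan.

```python
def plan_parts(size, part_size):
    """Split an object of `size` bytes into (part_number, offset, length)
    tuples. All parts are `part_size` bytes except possibly the last."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < size:
        length = min(part_size, size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```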

Checksum semantics

  • DrsObject.checksums[] MUST contain at least one checksum (e.g., sha-256) computed over the entire object bytes.
  • Provider-specific composite hashes (e.g., s3-checksum-sha256-composite, etag) may be stored as additional checksums but cannot replace the canonical full-object hash.
  • Servers SHOULD validate each part checksum during upload and re-compute the full-object checksum at completion, rejecting if mismatched with any client-supplied expected value.
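The distinction between the canonical full-object hash and a provider-specific composite hash can be made concrete with a small sketch. The composite here mimics S3's scheme of hashing the concatenated per-part digests; actual provider formats vary, which is exactly why the composite cannot replace the full-object hash.

```python
import hashlib


def full_object_sha256(parts):
    """Canonical checksum: sha-256 over the entire object bytes,
    regardless of how the object was split into parts."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part)
    return h.hexdigest()


def composite_sha256(parts):
    """S3-style composite: sha-256 over the concatenation of each part's
    sha-256 digest. Depends on the part boundaries, so it differs from
    the full-object hash for the same bytes."""
    h = hashlib.sha256()
    for part in parts:
        h.update(hashlib.sha256(part).digest())
    return h.hexdigest()
```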

2️⃣ DELETE Semantics — Per-Location Lifecycle

Current draft implies “DELETE = purge object + all locations.”
In practice, finer control is needed.

Use cases

  • Decommission one region/bucket while retaining others.
  • GDPR / data-residency enforcement.
  • Tier migration or corruption isolation.

Proposal

  • Assign stable location_ids within DrsObject.locations[].

  • Add lifecycle endpoints:

    • DELETE /objects/{id}/locations/{location_id} → hard-delete one replica.
    • PATCH /objects/{id} (op remove) → detach only.
    • POST /objects/{id}/locations/{location_id}:retire → soft-delete / tombstone.
  • Define GC and audit behavior to ensure consistency.


3️⃣ Authentication — Passport Exchange Sidecar

Keep DRS APIs authentication-method-agnostic. Optionally provide POST /ga4gh/aai/v1/passport:exchange to exchange a GA4GH Passport JWT for a short-lived repository-scoped bearer token; document required claims and returned scope/ttl.

POST /ga4gh/aai/v1/passport:exchange
  • Accepts a GA4GH Passport JWT.
  • Returns a short-lived, repository-scoped bearer token usable across DRS/WES/TRS.
  • Advertise this endpoint in service-info.capabilities or via OPTIONS issuer hints.

4️⃣ Search / Discovery

DRS core is ID-based only, but discovery is required in real deployments. We could deploy GA4GH Data Connect or extend DRS with a search API.

Data Connect

Provide an optional GA4GH Data Connect-compatible search endpoint that indexes DRS metadata (id, path, url, checksums, tags). Specify the query parameters, supported filters, result schema, and performance expectations for checksum lookups.

SELECT id, path, url
FROM drs_objects
WHERE checksum_type='sha-256' AND checksum_value='abc...'

Search

GET /ga4gh/drs/v1/objects/search?checksum=sha-256:HEX&path=/...&url=...
Advertise via service-info.capabilities.
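A client building that query might assemble it as follows; the search path and parameter names are taken from the proposal above and remain hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base, checksum=None, path=None, url=None):
    """Assemble the proposed GET /objects/search query string,
    including only the filters the caller supplied."""
    params = {
        k: v
        for k, v in {"checksum": checksum, "path": path, "url": url}.items()
        if v is not None
    }
    return f"{base}/ga4gh/drs/v1/objects/search?{urlencode(params)}"
```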


Additional High-level gaps to address

If the comments above are agreeable, the following items should be addressed before committing to development:

  • Add request/response JSON schemas and example payloads for each endpoint.
  • Specify HTTP status codes, idempotency and retry semantics (especially for part uploads and complete).
  • Clarify lifecycle timing: how long parts are retained, GC policies, and abort/cleanup guarantees.
  • Define auth scopes/claims required for each operation and error cases for auth/authorization failures.
  • Describe concurrency and consistency guarantees for multipart completion (e.g., locking or version checks).
  • Include explicit validation/error responses for checksum mismatches and part-reporting failures.
  • Call out metrics / audit logging expectations and service limits (part size range, max parts).

✅ PR Review Checklist

[ ] Define full-object checksum semantics (hash of entire payload): add normative text
[ ] Permit but label provider-specific composite hashes: e.g., etag, s3-checksum-sha256-composite
[ ] Add explicit multipart upload path & sequence: /uploads init/parts/complete/abort
[ ] Clarify finalize behavior: verify and record full-object checksum
[ ] Add per-location lifecycle endpoints: DELETE/PATCH/retire
[ ] Document GC vs detach: add short subsection
[ ] Reference passport exchange sidecar: /ga4gh/aai/v1/passport:exchange
[ ] Add optional search/discovery capability: prefer Data Connect; extension OK
[ ] Expose all new capabilities in service-info: multipart-upload, per-location-delete, passport-exchange, search

@bwalsh (Member) commented Oct 28, 2025

Re. #416 (comment)

This approach is attractive — client-side tools like the AWS CLI and SDKs already support multipart uploads, retries, and error recovery.
By supplying short-lived, scoped credentials, the server offloads multipart logic to the client and simplifies development.

However, AFAIK support for STS-style temporary credentials is uneven across S3 implementations:

  • AWS S3, MinIO, and Ceph RGW (≥ Quincy) support AssumeRole / GetSessionToken and accept inline policies that can scope permissions to a single bucket/key and a minimal action set (s3:PutObject*, s3:AbortMultipartUpload, etc.). These systems can safely issue short-lived upload credentials.
  • Wasabi, Backblaze, DigitalOcean Spaces, Cloudflare R2, and most other clones do not expose STS; they only accept long-lived keys. In those environments, the only safe delegation mechanism is signed URLs.

Because of that, I’d suggest:

  1. Keep signed URLs as the normative, lowest-common-denominator upload method (upload_methods.signed_url).

  2. Allow the "s3" credentials variant optionally, but require:

    • SessionToken present → STS-only
    • explicit expires_at
    • inline policy scoped to the specific object/prefix
    • discovery via service-info.capabilities (e.g., drs:upload-methods.s3.credentials)

That preserves full SDK compatibility where STS exists, while remaining portable across other S3-compatible stores.
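Requirement 2 above could be enforced with a simple check like this sketch. The credential field names follow the example payload earlier in the thread, and expires_at is the proposed addition; the inline-policy and discovery requirements would need server-side checks beyond what a payload validator can see.

```python
REQUIRED_S3_CREDENTIAL_FIELDS = (
    "AccessKeyId",
    "SecretAccessKey",
    "SessionToken",   # STS-only: long-lived keys have no session token
    "expires_at",     # proposed explicit expiry
)


def validate_s3_credentials(cred):
    """Return (ok, missing_fields) for an s3 upload_method's credentials
    dictionary, per the constraints proposed above."""
    missing = [k for k in REQUIRED_S3_CREDENTIAL_FIELDS if k not in cred]
    return (len(missing) == 0, missing)
```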

@grsr (Collaborator) commented Oct 28, 2025

Hi @bwalsh - Thanks very much for the comments. Yes, that is the intent of our proposal: we supply a presigned POST URL as the simplest fallback option for clients (upload_method.type = "https"), and optionally an upload_method with type s3 for clients that want to use native S3 libraries. We could include an expires_at field along with the rest of the credentials, but DRS does not currently do this when it returns a presigned GET URL from the /access endpoint, so if we think we need it here then perhaps we should also add it to the AccessURL schema? We could make it optional in any case, to accommodate servers that don't need to set time-limited credentials.

On your last bullet, do you mean advertising the capability of the server to return s3 upload_methods as well as https? If so, I agree we should include that. Or do you mean that you can fetch credentials from there? In that case, for our application we want the flexibility to supply different credentials for different files (e.g. because we store different file types in different services; in our case we want to use S3 for general files but AWS sequence store for specific genomic files), so I think this needs to be file specific.

On the note about an inline policy, that is indeed how we implement the s3 upload_method on our server: we generate an AWS IAM session policy at runtime that is scoped to the specific prefix (plus some other permissions required to support multipart uploads), but we don't expose that in the DRS response because it is server-side configuration.

Further comments very welcome! I will be presenting our proposal at the Cloud WS call on November 10th if you'd like to discuss any of this further there?

@bwalsh (Member) commented Oct 28, 2025

@grsr

On your last bullet, do you mean to present the capability of the server to return s3 upload_methods as well as https? If so I agree we should indeed include that, ...

Yes. The https method must always be there; the s3 method should be there if the backend has a secure way to vend minimally scoped credentials.

I will be presenting our proposal at the Cloud WS call on November 10th

Thanks, looking forward to it. I'll get it on the calendar.

@grsr (Collaborator) commented Nov 2, 2025

Hi all - I have just created a separate PR for our proposal here: #418. It is sufficiently different from Brian's proposal that I think it is cleaner to have a separate PR rather than try to make my changes over the top of Brian's. Further discussion very welcome, and I hope the proposal is documented sufficiently to allow others to review it and to understand the motivation for this different approach.
