Skip to content

@201 Add RO-Crate entity filters for targeted lookups#273

Merged
daniel-thom merged 6 commits intomainfrom
201-optimize-checking-for-existing-records
May 1, 2026
Merged

@201 Add RO-Crate entity filters for targeted lookups#273
daniel-thom merged 6 commits intomainfrom
201-optimize-checking-for-existing-records

Conversation

@arjlai221
Copy link
Copy Markdown
Collaborator

Summary

This MR fixes the RO-Crate existence-check path by extending the existing
GET /workflows/{id}/ro_crate_entities API with optional filters for:

  • file_id
  • entity_id

Instead of repeatedly fetching the full RO-Crate entity list and scanning it
client-side, callers can now perform targeted lookups against the server.

Problem

The previous implementation relied on listing all RO-Crate entities for a
workflow and then filtering in client code to determine whether a matching
record already existed.

That has two problems:

  • it does unnecessary work for large workflows
  • it does not scale well for repeated existence checks

What Changed

Server / API

  • Added optional file_id and entity_id query parameters to
    GET /workflows/{id}/ro_crate_entities
  • Plumbed those filters through the live router, server contract, HTTP server,
    and RO-Crate transport
  • Updated the RO-Crate list query to apply those filters in SQL
  • Applied the same filters to the count query so pagination metadata stays
    consistent

Client

  • Updated the generated Rust client to pass the new query parameters
  • Updated the checked-in Python and Julia generated clients for parity
  • Added Rust helper methods for filtered lookup:
    • list_ro_crate_entities_with_filters
    • find_ro_crate_entity_by_file_id
    • find_ro_crate_entity_by_entity_id
  • Switched find_entity_for_file() to use a targeted filtered lookup instead of
    full-list fetch + scan
  • Added warning behavior if the Rust find_* helpers observe multiple matches
    and return the first result

Data Integrity

  • Added a migration creating a partial unique index on
    (workflow_id, file_id) where file_id IS NOT NULL
  • Migration deduplicates existing conflicting (workflow_id, file_id) rows
    before creating the unique index

Error Handling

  • Duplicate RO-Crate inserts for the same workflow/file link now return a
    structured 422 response instead of falling through as a generic server error

Tests

Added and strengthened RO-Crate coverage for:

  • filtering by file_id
  • filtering by entity_id
  • combined filters
  • empty / non-matching filtered results
  • pagination metadata on filtered listing
  • duplicate workflow/file link rejection with explicit 422 assertion

Validation

Ran successfully:

  • cargo fmt --all --check
  • cargo check --tests
  • cargo clippy --all --all-targets --all-features -- -D warnings
  • dprint check

Note: full integration test execution is still limited in this environment
because sandboxed test server port binding is restricted.

Notes / Risk

  • The new migration is intentionally data-shaping: if duplicate
    (workflow_id, file_id) rows already exist, it keeps the newest row before
    adding the unique index.
  • The transport trait signature changed to carry the new optional filters, so
    any out-of-tree implementors will need to update accordingly.

@arjlai221 arjlai221 self-assigned this Apr 22, 2026
@arjlai221 arjlai221 requested a review from daniel-thom April 22, 2026 17:24
@arjlai221 arjlai221 changed the title Add RO-Crate entity filters for targeted lookups #203 Add RO-Crate entity filters for targeted lookups Apr 23, 2026
@arjlai221 arjlai221 changed the title #203 Add RO-Crate entity filters for targeted lookups #201 Add RO-Crate entity filters for targeted lookups Apr 23, 2026
@arjlai221 arjlai221 changed the title #201 Add RO-Crate entity filters for targeted lookups @201 Add RO-Crate entity filters for targeted lookups Apr 23, 2026
@@ -0,0 +1,112 @@
# RO-Crate Filter Staged Review
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete?

configuration,
workflow_id,
Some(0),
Some(2),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 2?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if for whatever reason multiple docs wiith the same file_id exist we can log a warning

{
Ok(result) => result,
Err(e) => {
let error_string = e.to_string();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not an explicit need, just covers an error case. This bit of code just converts the error message into something more human readable

"message": message
}));
return Ok(
CreateRoCrateEntityResponse::UnprocessableContentErrorResponse(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the same error be applied to the update-crate function?

@@ -0,0 +1,13 @@
DELETE FROM ro_crate_entity
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is going here? Deleting rows without warning?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is just there to de-duplicate rows that have thee same (workflow_id, file_id) so the index creation below runs successfully. No information is lost; if there are two rows with the same file_id, both rows point to the same file.

Comment thread src/server/api/ro_crate.rs Outdated
("RO-Crate entity already exists", "unknown")
};

warn!(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude:

Log format inconsistency. src/server/api/ro_crate.rs:390 uses file_id={:?} for
  Option<i64>, yielding file_id=Some(123) or file_id=None. CLAUDE.md asks for file_id=123 style   for parser scripts. Easy fix: split into two log statements based on body.file_id.is_some(),
   or format as file_id={} after unwrap_or(-1) + a separate has_file=true/false.

@daniel-thom daniel-thom merged commit a4ea726 into main May 1, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants