
New work item: crate r2c2_statement #6


Open: wants to merge 5 commits into main

Conversation

Collaborator

@pchampin commented Mar 19, 2025

The idea of this crate is to be the first component of the "common API".

It would focus on RDF terms, triples and quads, and would provide

  • lightweight wrapper types (either defined or imported from utility crates) to guarantee the syntactic validity of some building blocks (IRI, language tags...)
  • traits for different term types (Subject, Predicate, Object, GraphName)
  • possibly other smaller traits that would be shared by those above (something like MaybeIri, MaybeLiteral...)

Also, since triple terms will force us to define a notion of Triple, it might make sense to also define Quad in this crate, although this stretches the scope of the crate a little bit. Should we name it r2c2_term_statement instead? That would be more accurate, but a little verbose...


(edited) The current proposal contains some code, mostly to illustrate the intended content of the proposed Work Item -- please do not focus on the code itself, but on the general spirit.

Also following discussions with @Tpt, the proposal is now to have two complementary crates

  • one crate defining only traits and simple types, but providing no code to enforce the contracts of those traits and types (this is the responsibility of implementers)
  • one crate providing validation code (as a helper for implementers who want to use it -- but they may instead use validation code from their own implementation)

@pchampin added the new-work-item label (Must label PRs proposing a new work item for the CG.) Mar 19, 2025
Collaborator

@Tpt left a comment


Thank you! It's definitely the most important goal of our CG but sadly likely one of the trickiest to get right. We need to find a compromise between ease of use and versatility and I fear it won't be easy.

term/src/lib.rs Outdated
//! 1. define or import simple wrapper types for building blocks
//! (IRIs, language tags...)
//! 2. define traits for different kinds of terms
//! (Subject, Predicate, Object, GraphName)

Collaborator

This is imho going a bit too much into the "how" direction. It does not sound obvious that these should be traits and not enums.

Collaborator Author

I'll argue in favor of traits here:

What I aim for is to avoid as much data transformation as possible when communicating between two implementations. That's why I try to favor lightweight wrapper types and traits.

Imagine I want to consume some triples produced by oxttl to canonicalize them with sophia_c14n. (I'll focus on subjects, but of course the same would apply to predicates and objects.) If Subject were an enum, I would have to transform the subjects produced by oxttl into that enum. And then sophia_c14n would have to transform this enum again into its own internal representation.

If OTOH Subject is a trait, which the types of oxttl implement, and which sophia_c14n accepts as input, then the data produced by oxttl can be passed directly to sophia_c14n, which then will transform it directly into its own internal representation. That's one transformation less.
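For illustration, here is a minimal sketch of that argument; all names below (Subject, ParserSubject, canonicalize) are hypothetical stand-ins, not the actual oxttl or sophia_c14n APIs:

```rust
// All names below are hypothetical stand-ins.
trait Subject {
    /// Returns the IRI if this subject is an IRI, `None` otherwise.
    fn as_iri(&self) -> Option<&str>;
}

// Stand-in for a type that a parser crate (like oxttl) might produce.
struct ParserSubject(String);

impl Subject for ParserSubject {
    fn as_iri(&self) -> Option<&str> {
        Some(&self.0)
    }
}

// Stand-in for a consumer (like a canonicalization crate): it accepts any
// `Subject` and converts it once, directly into its own internal form.
fn canonicalize(subject: &impl Subject) -> String {
    match subject.as_iri() {
        Some(iri) => format!("<{iri}>"),
        None => "_:b0".to_owned(),
    }
}

fn main() {
    // The parser's type is passed directly to the consumer:
    // one transformation instead of two.
    let s = ParserSubject("http://example.org/a".to_owned());
    assert_eq!(canonicalize(&s), "<http://example.org/a>");
}
```

With an enum in the middle, the same flow would require converting ParserSubject into the shared enum first, and then the enum into the consumer's representation.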

Collaborator

On the other side, having an enum makes manipulation easier. I tend to think this is a compromise to be made when we know more about how we represent IRIs/blank nodes/... and should not be set in stone at the beginning of this work item.

Collaborator Author

I'm happy to defer this discussion, the goal was not to set anything in stone. I've just pushed a commit to clarify that the proposed design was just an example.

Collaborator

Thank you! Perfect!

Tpt previously approved these changes Mar 19, 2025
@pchampin
Collaborator Author

Thank you! It's definitely the most important goal of our CG but sadly likely one of the trickiest to get right. We need to find a compromise between ease of use and versatility and I fear it won't be easy.

Agreed. I tried not to be too specific in the PR, but on the other hand, keeping things too abstract makes them insubstantial. I don't think it would make sense to agree on a very abstract work item if we don't have some agreement on what it will contain.

But of course, we don't need to figure out all the details up-front.

@Tpt
Collaborator

Tpt commented Mar 19, 2025

I don't think it would make sense to agree an a very abstract work-item if we don't have some agreement on what it will contain.

Yes! What about something along the lines of "It would provide types to encode and manipulate RDF concepts like IRI, blank node, literal, term and triple", making the scope clear while leaving the struct-vs-trait question undefined?

Should we name it instead r2c2_term_statement, which is more accurate, but a little verbose...

I would tend to prefer r2c2_model, along the lines of RDF/JS DataModel, or r2c2_concepts, along the lines of RDF Concepts & Abstract Syntax. I agree that Quad is likely in scope.

@pchampin
Collaborator Author

Re. terminology:

  • I consider, maybe wrongly, that "type" encompasses "struct" and "enum" (as well as atomic types), but not "trait". I believe this is consistent with the use of the keyword type in Rust, but I can see how traits are a kind of (higher-level) type as well.

  • I would expect a crate named r2c2_model or r2c2_concepts to also include the notions of Graph and Datatype, which is not the goal here. That's why I didn't go for those. r2c2_foundation?

Add comment to clarify that the proposed design can be challenged.
@pchampin
Collaborator Author

  • I would expect a crate named r2c2_model or r2c2_concepts to also include the notion of Graph and Datatype, which is not the goal here. That's why I didn't go for that. r2c2_foundation ?

thinking a little more about this... r2c2_statement would also work, IMO. I would understand that a crate named "statement" also includes the building blocks of statements (i.e. terms), while the opposite sounds like scope creep.

@pchampin
Collaborator Author

@Tpt

On the other side having an enum makes manipulation easier.

I've been giving this more thought, and I believe there is a way to have the best of both worlds (traits and enums). More specifically, a Subject trait would provide one main method (let's call it subject_info() as a working title), whose result would be a lightweight enum similar to oxrdf::SubjectRef -- and similarly, of course, for the other traits Predicate and Object.

With that enum providing everything there is to know about the subject (resp. predicate, object), any other method that the traits may provide could have a default impl based on the result of subject_info(). So implementers would generally only need to implement that one method to implement the trait.
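A minimal sketch of this idea, where all names (SubjectInfo, subject_info, MySubject) are hypothetical working titles and the enum is far simpler than a real oxrdf::SubjectRef:

```rust
// Hypothetical "best of both worlds" sketch: one required method returning
// a lightweight enum, plus default methods built on top of it.
pub enum SubjectInfo<'a> {
    Iri(&'a str),
    BlankNode(&'a str),
}

pub trait Subject {
    /// The only method implementers must provide (working title).
    fn subject_info(&self) -> SubjectInfo<'_>;

    /// Default methods can all be derived from `subject_info`.
    fn is_iri(&self) -> bool {
        matches!(self.subject_info(), SubjectInfo::Iri(_))
    }
    fn is_blank_node(&self) -> bool {
        matches!(self.subject_info(), SubjectInfo::BlankNode(_))
    }
}

// An implementer only needs to write one method.
struct MySubject(String);

impl Subject for MySubject {
    fn subject_info(&self) -> SubjectInfo<'_> {
        SubjectInfo::Iri(&self.0)
    }
}

fn main() {
    let s = MySubject("http://example.org/a".to_owned());
    assert!(s.is_iri());
    assert!(!s.is_blank_node());
}
```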

@pchampin changed the title from "New work item: crate r2c2_term" to "New work item: crate r2c2_statement" on Apr 21, 2025
in addition to defining the core traits and types,
it proposes 2 proof-of-concept implementations, for oxrdf and rdf_types,
(behind the feature gate 'poc_impl')
and demonstrates interoperability between the two
by testing roundtripping of both implementations via the other
@Tpt dismissed their stale review April 21, 2025 15:21

Significant changes

@Tpt
Collaborator

Tpt commented Apr 21, 2025

Thank you so much for pushing this.

Some major pain points I have with it as a starting point:

  • It contains what I consider to be details like the ground method. Imho this should be the topic of a v2 after we get a first design ready and is definitely not something we should have in a starting point that enables fast iteration.
  • It enforces data structures for Literal/IRI/LangTag with validation. Imho this should not be part of a basic interoperability crate; implementations might want to be more or less lenient. The starting point should only be a set of traits and enums, without significant algorithms in it. This is an interop crate, not an implementation. But happy to be challenged on it.

@pchampin
Collaborator Author

  • It contains what I consider to be details like the ground method. Imho this should be the topic of a v2 after we get a first design ready and is definitely not something we should have in a starting point that enables fast iteration.

Absolutely agreed. The ground methods were mostly here as an example of additional methods that could be provided (as a convenience for users) with default implementation (as a convenience for implementers). Which methods are or are not included there is indeed to be discussed later.

I'm happy to comment out the ground method for the moment.


  • It enforces data structures for Literal/IRI/LangTag with validation. Imho this should not be part of a basic interoperability crate; implementations might want to be more or less lenient. The starting point should only be a set of traits and enums, without significant algorithms in it. This is an interop crate, not an implementation.

I hear your point, but I still have mixed feelings about this...

But happy to be challenged on it.

Here we go :)

If IRIs in R2C2 (the same reasoning applies to language tags) did not provide any guarantee of validity, it would mean that

  • there would never be any cost for producers: just ship whatever you have as an R2C2 IRI;
  • there would always be a cost for consumers: always check the IRIs that you get, you never know.

This is not ideal, in particular because many producers will actually produce valid IRIs (hopefully!), but consumers will still need to check them every time.

With the proposed design:

  • lenient producers must pay the cost of checking their data beforehand (Iri::new)
  • conservative producers can still ship whatever they have without any additional cost (Iri::new_unchecked)
  • lenient consumers can accept whatever they get without checking it, confident that it satisfies RFC 3987
  • conservative consumers must pay the cost of checking their additional constraints

As you can see, the sweet spot is for implementations following Postel's law: conservative producers and lenient consumers. The burden of additional checks falls only on the implementations that depart from Postel's law.

For this to work, we need to ensure that the guarantees provided by R2C2 correspond to the MUSTs in the spec, nothing more, nothing less. That's why, for example, blank node labels are not constrained (while several implementations, including mine, constrain them to be valid SPARQL bnode labels).
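A minimal sketch of the proposed design, assuming a hypothetical Iri wrapper; the validation shown is a trivial placeholder (it only rejects a missing scheme), not real RFC 3987 checking:

```rust
use std::borrow::Cow;

/// Hypothetical wrapper guaranteeing (by contract) a valid IRI.
pub struct Iri<'a>(Cow<'a, str>);

#[derive(Debug)]
pub struct InvalidIri;

impl<'a> Iri<'a> {
    /// For lenient producers: validate, then wrap.
    /// (A real implementation would check RFC 3987; this placeholder
    /// only rejects the obviously broken case of a missing scheme.)
    pub fn new(value: impl Into<Cow<'a, str>>) -> Result<Self, InvalidIri> {
        let value = value.into();
        if value.contains(':') {
            Ok(Iri(value))
        } else {
            Err(InvalidIri)
        }
    }

    /// For conservative producers: no cost, the caller guarantees validity.
    pub fn new_unchecked(value: impl Into<Cow<'a, str>>) -> Self {
        Iri(value.into())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    // Lenient producer pays the validation cost up front...
    assert!(Iri::new("http://example.org/a").is_ok());
    assert!(Iri::new("no-scheme").is_err());
    // ...while a conservative producer ships its data as-is.
    let iri = Iri::new_unchecked("http://example.org/a");
    assert_eq!(iri.as_str(), "http://example.org/a");
}
```

Consumers then rely on the wrapper's contract instead of re-validating every IRI they receive.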

@pchampin
Collaborator Author

I just had a long discussion about this with @labra and an idea came up to hide all validation code behind a feature gate:

  • without the feature, the wrapper types provided by the API would only have a new_unchecked method; the crate would therefore remain very lean, but the responsibility of producing valid data would be entirely left to the user
  • with the feature, the wrapper types would also provide a new method that would perform some validation.

In my story above, conservative producers would probably use the crate without the feature (they only need the new_unchecked method), while lenient producers would enable it in order to use the new method.
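A sketch of how such a feature gate could look; the feature name `validation` and the InvalidIri type are hypothetical, and the cfg-gated method is only compiled when a dependent crate enables that cargo feature:

```rust
use std::borrow::Cow;

pub struct Iri<'a>(Cow<'a, str>);

#[derive(Debug)]
pub struct InvalidIri;

impl<'a> Iri<'a> {
    /// Always available: the caller guarantees validity,
    /// and the crate stays very lean.
    pub fn new_unchecked(value: impl Into<Cow<'a, str>>) -> Self {
        Iri(value.into())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }

    /// Compiled only when the (hypothetical) `validation` cargo feature
    /// is enabled by the dependent crate.
    #[cfg(feature = "validation")]
    pub fn new(value: impl Into<Cow<'a, str>>) -> Result<Self, InvalidIri> {
        let value = value.into();
        // real RFC 3987 validation would go here
        Ok(Iri(value))
    }
}

fn main() {
    // Without the feature, only the unchecked constructor exists.
    let iri = Iri::new_unchecked("http://example.org/a");
    assert_eq!(iri.as_str(), "http://example.org/a");
}
```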

@Tpt would that be an acceptable middle-ground for you?

@Tpt
Collaborator

Tpt commented Jun 17, 2025

@pchampin Sorry for the delayed answer. This sounds much better. However, I am still a bit scared that mixing the two features in the same crate will make versioning more painful: it is likely we will want the traits crate to be as stable as possible, whereas it is more acceptable for implementations to see breaking changes.

@pchampin
Collaborator Author

@pchampin Sorry for the delayed answer. This sounds much better. However, I am still a bit scared that mixing the two features in the same crate will make versioning more painful: it is likely we will want the traits crate to be as stable as possible, whereas it is more acceptable for implementations to see breaking changes.

that's a very valid point, thanks.
Another option, then, would be to have two crates:

  • r2c2_statement contains traits and wrapper types but no validation code (only new_unchecked constructors for the wrapper types)
  • r2c2_statement_validation defines extension traits for the wrapper types, providing the validating constructors (new)

This way, we could keep the versioning of the API independent from the versioning of the validation code.
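A sketch of how the two-crate split could work, using Rust's extension-trait pattern. The crate names follow the proposal above, but the code is illustrative only (the names IriExt and InvalidIri are made up, and the validation is a trivial placeholder); both "crates" are shown in one file for brevity:

```rust
use std::borrow::Cow;

// --- in r2c2_statement (hypothetical): wrapper types, no validation ---
pub struct Iri<'a>(Cow<'a, str>);

impl<'a> Iri<'a> {
    pub fn new_unchecked(value: impl Into<Cow<'a, str>>) -> Self {
        Iri(value.into())
    }
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// --- in r2c2_statement_validation (hypothetical): an extension trait ---
#[derive(Debug)]
pub struct InvalidIri;

pub trait IriExt<'a>: Sized {
    /// Validating constructor, provided as an extension.
    fn new(value: impl Into<Cow<'a, str>>) -> Result<Self, InvalidIri>;
}

impl<'a> IriExt<'a> for Iri<'a> {
    fn new(value: impl Into<Cow<'a, str>>) -> Result<Self, InvalidIri> {
        let value = value.into();
        // Placeholder check standing in for real RFC 3987 validation.
        if value.contains(':') {
            Ok(Iri::new_unchecked(value))
        } else {
            Err(InvalidIri)
        }
    }
}

fn main() {
    // With `IriExt` in scope, `Iri::new` resolves to the extension method.
    assert!(Iri::new("http://example.org/a").is_ok());
    assert!(Iri::new("oops").is_err());
}
```

The orphan rule permits this split: the validation crate can implement its own (local) trait for the foreign Iri type, so the two crates really can be versioned independently.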

* statement contains only traits and simple wrapper types
* statement_validation contains code for validating the contract
  of the wrapper types
@pchampin
Collaborator Author

@Tpt for clarity, I updated my proposed code with the split between statement and statement_validation.

@GordianDziwis

@pchampin you asked for some feedback:

I think you could replace traits like trait Object with the Rust std traits From, Into, TryFrom, TryInto and AsRef.

This would also free up the name Object, so ObjectProxy can be named Object.

This would result in an idiomatic and intuitive API:

let literal: Literal = object.try_into()?;
let object: Object = literal.into();

fn foo(term: impl Into<Object>) {}

@pchampin
Collaborator Author

pchampin commented Jun 27, 2025

@GordianDziwis thanks a lot for this feedback. That's an interesting perspective, I have to think about it more.

As I wrote at the top of this thread: the first step is to adopt the work item -- then we can discuss design choices (with dedicated issues for them). I'm taking your comment as support for the work item, right?

@GordianDziwis

@pchampin Sorry for the noise.

Concerning the actual work item of having two crates:

This way, we could keep the versioning of the API independant from the versioning of the validation code.

Isn't this covered by semantic versioning? While the validation code could change faster, the API should be very stable, so dependents wouldn't have to do anything if a change in the validation code causes a minor version bump.

one crate providing validation code (as a helper for implementers who want to use it -- but they may instead use validation code from their own implementation)

Why would I want to implement my own validation code?

Also, I think this is problematic:

pub trait IriValid<'a> {

If there is a trait IriValid, this implies an Iri could be an invalid IRI.

An Iri should be a valid IRI by default; being able to create invalid RDF terms should be optional. How about making the validation code a feature that is enabled by default?

@pchampin
Collaborator Author

@pchampin Sorry for the noise.

no problem :)

I'm responding to your questions in a different order, hopefully to make things clearer.

one crate providing validation code (as a helper for implementers who want to use it -- but they may instead use validation code from their own implementation)

Why would I want to implement my own validation code?

You need to remember that this group does not aim to provide "yet another RDF implementation in Rust", but to provide a thin interoperability layer between existing implementations (such as Oxigraph, rdf-types, Rudof or Sophia). Each of these implementations already has its own code to validate IRIs, language tags, etc.

Now, if you wanted to make your own RDF implementation, and make it compliant with R2C2 from the start, then of course, you might want to reuse the validation code provided by R2C2. But our goal is not to erase diversity of implementations, merely to help interoperability between them.

Concerning the actual work item of having two crates:

This way, we could keep the versioning of the API independant from the versioning of the validation code.

Isn't this covered by semantic versioning? While the validation code could change faster, the API should be very stable, so dependents wouldn't have to do anything if a change in the validation code causes a minor API bump.

Semantic versioning dictates that any breaking change bumps the major version number.

So if we make a breaking change in the validation code, this will create a new major version, even if the "pure API" (i.e. the traits and types currently in the r2c2_statement crate in my example code) did not change. From then on, projects using major version N+1 of R2C2 will no longer interoperate with projects using version N, even though the API on which they both rely for interoperability has not changed.

Also, I think this is problematic:

pub trait IriValid<'a> {

If there is a trait IriValid, this implies an Iri could be an invalid IRI.

An Iri should be a valid IRI by default; being able to create invalid RDF terms should be optional. How about making the validation code a feature that is enabled by default?

Granted, that name is confusing and should be changed. But again, I'd rather defer this kind of discussion until the work item is accepted.

@GordianDziwis

You need to remember that this group does not aim to provide "yet another RDF implementation in Rust", but to provide a thin interoperability layer between existing implementations (such as Oxigraph, rdf-types, Rudof or Sophia). Each of these implementation already has its own code to validate IRIs, language tags, etc.

Now, if you wanted to make your own RDF implementation, and make it compliant with R2C2 from the start, then of course, you might want to reuse the validation code provided by R2C2. But our goal is not to erase diversity of implementations, merely to help interoperability between them.

I follow the reasoning that stems from the goal of diversity, and I am absolutely on board with a diversity of implementations for concepts like a graph or a dataset. But do we want diversity in IRI, Subject, Triple, etc. implementations?

My instinct would be to get rid of multiple implementations of the same validation logic: first because of duplicated effort, second because of better interoperability, since using the same implementation guarantees interoperability.

So if we make breaking change in the validation code, this will create a new major version, even if the "pure API" (i.e. the traits and types currently in the r2c2_statement crate in my example code) did not change.

My assumption was that breaking changes to the API of the validation code would be infrequent, because what a valid x is is very well-defined and won't change, and the API is basically a single function for each x.

Please feel free to ignore my comment in favor of GSD. I want just to give input, not block progress with my opinions. So, thumbs up to the proposal.

@pchampin
Collaborator Author

pchampin commented Jul 4, 2025

My instinct would be to get rid of multiple implementations of the same validation logic.

mandatory XKCD reference: https://xkcd.com/927/ 😈

More seriously:

  • I hope that future implementations would depend on the validation code of R2C2 rather than deploying their own, and maybe future versions of the existing projects will drop their own validation code and rely on R2C2, to reduce duplicate effort.
  • However, some implementations may want to keep their own validation code, for example because they want to be stricter than the standard itself. Jena, for instance, "knows" the specific rules of the most common IRI schemes and complains about <http:x>. Such detailed validation is not required by the RDF spec, so it should not be part of our validation code IMO, but I sympathize with the will to be stricter.

My assumption was that breaking changes to the API of the validation code are infrequent, because it is very well-defined what a valid x is it won't change what a valid x is, and the API is basically a single function for each x.

Fair enough. One might argue, still, that "it is very well-defined what a valid x is" is overly optimistic, because standards are sometimes interpreted differently by different people, and because some of those standards are moving targets (see the history of BCP47, for example).

I want just to give input, not block progress with my opinions. So, thumbs up to the proposal.

Great, thank you 👍

@KonradHoeffner

KonradHoeffner commented Jul 9, 2025

Thank you for all the work!
Unfortunately I'm the type of person who has to use stuff in my own projects to see if it fits or not. The definitions look very clean and easy to use, but it's impossible for me to decide whether these traits are perfectly defined or whether there are problems, especially as I'm also not that well versed in the Rust type system.

Given that RickView is based on Sophia, I can't use those traits directly until they get adopted by Sophia. The only candidate where I can test them is HDT; I will try to find the time to check whether it is possible to refactor it using those traits in an experimental branch, to find problems in adopting them.

However first I hope it's OK if I state my general preferences, past experiences with Rust RDF APIs (mostly Sophia) and requirements:

  1. Nothing is more frustrating than a trait that cannot be made into a trait object. For example, RickView supports different kinds of graphs based on which file format is loaded, but because the graph trait was not object safe(?), there is boilerplate code using an enum with cases for the different variants, with exactly the same code in each of the match arms. So my number one concern is that all traits are object safe wherever that could make sense in practice (I don't know if it makes sense to have e.g. Triples from different implementations in the same collection). But I'm not sure whether those traits here are object safe right now.
  2. Similarly, traits whose implementations cannot be safely shared across threads are also a huge hassle, as with RDF you are often operating on large amounts of data where parallelization makes sense. So rather than supporting the theoretical one-in-a-million use case where someone has, for example, an IRI based on a string type that is somehow not thread-safe, I would prefer that this be settled at the trait level (I guess this means adding Send + Sync everywhere?).
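For illustration, here is a sketch of what these two requirements could look like at the trait level; the Term trait and the implementing types are hypothetical, not part of the proposal:

```rust
// Hypothetical sketch: a term trait that is dyn-compatible (object safe)
// and whose implementations are guaranteed thread-safe via supertraits.
pub trait Term: Send + Sync {
    fn value(&self) -> &str;
}

struct IriTerm(String);
impl Term for IriTerm {
    fn value(&self) -> &str {
        &self.0
    }
}

struct BlankTerm(String);
impl Term for BlankTerm {
    fn value(&self) -> &str {
        &self.0
    }
}

fn main() {
    // Dyn-compatible: terms from different implementations in one collection.
    let terms: Vec<Box<dyn Term>> = vec![
        Box::new(IriTerm("http://example.org/a".into())),
        Box::new(BlankTerm("b0".into())),
    ];
    assert_eq!(terms.len(), 2);

    // Send + Sync: the boxed terms can be moved to another thread.
    let handle = std::thread::spawn(move || {
        terms.iter().map(|t| t.value().len()).sum::<usize>()
    });
    assert_eq!(handle.join().unwrap(), 22);
}
```

Note that `Send + Sync` as supertraits would forbid implementations backed by, for example, `Rc<str>`; that is exactly the trade-off raised in point 2.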

Similar to @GordianDziwis, I do not prefer diversity of implementations at all; my preferred state would be to have just one that everybody uses. So I would rather have an opinionated compromise that every implementation can achieve than support everything that already exists.

@KonradHoeffner

KonradHoeffner commented Jul 9, 2025

As for the string type Cow<'a, str> used for IRIs and literals (and I guess, by extension, for triples): I'm generally a fan of immutable variables, so for example in a Java implementation I would not want a method I call with my IRI to change that IRI without me knowing.
However, given that Rust requires the explicit mut specifier, I guess this is not a problem in this language.

However, I have used your MownStr in the past, which is described as more efficient, but I guess it would not work at the level of a high-level interoperability trait?

Also I have problems finding the right String type when querying RDF.
For example let's assume I am querying a SPARQL endpoint or some other graph and I get a collection of let's say a few million triples back.
To not get an out of memory error, each IRI should only be in memory once.
Is that supported by the current API?

  • Strings waste space on duplication
  • Pointers are unsafe
  • References cannot be returned out of functions in Rust (also the graph may be compressed e.g. HDT so there is nothing to point to)
  • MownStr (and I guess Cow is the same way) works great for something like a specific SP? triple pattern query, because the query function can just return the given subject and property, but not when e.g. querying all triples from a triple store, where some of them share the same subject IRIs and you can't return a bunch of triples where some are references to others.
  • Rc is not thread-safe
  • Arc has synchronization overhead

The current state has Iri defined as pub struct Iri<'a>(Cow<'a, str>);. Would that be compatible with that use case? Or am I totally off base here and that has nothing to do with it? I need to think about it some more, but those are my first thoughts.
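For what it's worth, here is a sketch of how the Cow-based wrapper behaves in the borrowed and owned cases; it only illustrates the borrow-vs-own distinction, not a solution to IRI deduplication (which would need something like interning or Arc sharing on top):

```rust
use std::borrow::Cow;

// Wrapper as in the proposal: Cow lets the same type either borrow
// from an underlying buffer or own its data.
pub struct Iri<'a>(Cow<'a, str>);

impl<'a> Iri<'a> {
    pub fn new_unchecked(value: impl Into<Cow<'a, str>>) -> Self {
        Iri(value.into())
    }
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    // Borrowed: zero-copy view into a buffer that outlives the Iri
    // (e.g. answering an SP? pattern by handing back the given subject).
    let buffer = String::from("http://example.org/a");
    let borrowed = Iri::new_unchecked(buffer.as_str());
    assert!(matches!(&borrowed.0, Cow::Borrowed(_)));

    // Owned: needed when there is no long-lived buffer to point into
    // (e.g. a term decompressed on the fly from an HDT file).
    let owned: Iri<'static> = Iri::new_unchecked(String::from("http://example.org/a"));
    assert!(matches!(&owned.0, Cow::Owned(_)));

    assert_eq!(borrowed.as_str(), owned.as_str());
}
```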

@pchampin
Collaborator Author

pchampin commented Jul 17, 2025

@KonradHoeffner thanks for this feedback :)
The discussions about dyn-compatible traits, thread-safe traits, and different kinds of str are, in my opinion, design questions that do not call the acceptance of this work item into question.

The discussion about the general goal of the crate, however, is more fundamental. The goal is deliberately to provide a thin (or as thin as possible) interface between different implementations, not a new implementation to replace them. In other words, the goal is more to embrace diversity than to try to eliminate it.

Note that this work item does not prevent the group from also proposing a "reference implementation" aiming to replace the others. But this is not the goal that this work item aims to achieve.

@KonradHoeffner

KonradHoeffner commented Jul 18, 2025

Ah, sorry if I'm misunderstanding, feel free to go ahead with this PR if this is out of scope.

I'm in favor of the idea of defining traits for RDF terms and triples (quads I don't need, but maybe others do).
However, I still don't think embracing diversity in implementations is a good goal.
It's a good thing for cultures and languages and so on, but with programming libraries I don't think they should be as different as possible.
If there are ten different existing libraries and one of them makes a strange choice, should the other nine libraries adopting this API, and all of their users, suffer for that? Or would it be better to just write to that library's maintainer and ask whether they could conform to a compromise in their next version, or else not use this API?
So for me the priorities would be ease of use, interoperability, conformity to the standard documents, and that the trait implementations can be optimized for memory and CPU efficiency for different tasks.
Then, library maintainers can adopt this API, which guarantees that they are compatible with the standards, and maybe put a badge in their README saying so.

4 participants