Position Statement: The RDF* and SPARQL* Approach to Annotate Statements in RDF and to Reconcile RDF and Property Graphs

This post presents my position statement for the W3C Workshop on Web Standardization for Graph Data.

Update (November 18, 2020): Work on a spec that defines the approach has started.

Update (August 6, 2020): Native support for RDF* and SPARQL* has found its way into the following
RDF-related programming libraries: Eclipse RDF4J, Apache Jena, RDF.rb, and N3.js.

Update (April 23, 2020): In addition to the two RDF graph database systems mentioned in
the blog post (Blazegraph and AnzoGraph), two more such systems have added support
for RDF* and SPARQL* in the meantime; these systems are Stardog and GraphDB.

Update (June 27, 2019): We now have a W3C mailing list to discuss questions related to RDF* and SPARQL*.

Update (June 9, 2019): In the meantime, I have defined SPARQL* Update.

The lack of a convenient way to annotate RDF triples and to query such annotations has been a long-standing issue for RDF. Such annotations are a native feature in other contemporary graph data models (e.g., edge properties in the Property Graph model) and there exist a number of popular use cases, including the annotation of statements with certainty scores, weights, temporal restrictions, and provenance information. To mitigate the inherent lack of native support for such annotations in the purely triple-based data model of RDF, there exist several proposals to capture such annotations in the RDF context (e.g., RDF reification as proposed in the RDF specifications, singleton properties, single-triple named graphs). However, these proposals have a number of shortcomings and none of them has yet been adopted as a (de facto) standard.

We are proposing an alternative approach that is based on nesting of RDF triples and of query patterns. This approach has already attracted interest not only in the RDF and Semantic Web research community (as indicated by some blog posts and by winning the People’s Choice Best Poster Award at ISWC 2017) but also among RDF system vendors. In fact, the approach is already supported in two commercial RDF graph database systems (Blazegraph and AnzoGraph) and in an extension of the popular Open Source framework Apache Jena. Important properties of the approach are that

  1. it allows for a compact representation of data and queries,
  2. it is backwards-compatible with the aforementioned existing approaches,
  3. it can serve naturally as a foundation for achieving interoperability between the RDF and the Property Graphs world, and
  4. it can be employed as a common conceptual framework to capture more specific annotation-related extensions of RDF and SPARQL (such as temporal or probabilistic extensions).

The goal of this position statement is to bring the approach to the attention of the workshop attendees and to put on the workshop agenda a discussion regarding standardization opportunities for this approach.

In the remainder of this position statement we outline the approach and elaborate more on its properties.

Overview of the Approach

The basis of the proposed approach is to extend RDF with a notion of nested triples. More precisely, with this extension, called RDF*, any triple that represents metadata about another triple may directly contain this other triple as its subject or its object. For instance, suppose we want to capture a statement indicating the age of Bob together with the metadata fact that we are 90% certain about this statement. RDF* allows us to represent both the data and the metadata by using a nested triple as follows.

   <<:bob foaf:age 23>> ex:certainty 0.9 .

Notice that we write the nested triple using an extension of the RDF Turtle syntax that captures the notion of nested triples by enclosing any embedded triple using the strings ‘<<’ and ‘>>’. This extended syntax is called Turtle* and it is specified in Section 3.3 of our technical report.
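To make the nesting idea concrete, here is a minimal Python sketch of a data structure in which a triple may appear as the subject or object of another triple, together with a serializer that wraps embedded triples in `<<` and `>>`. The names `Triple`, `to_turtle_star`, and `statement` are hypothetical, plain strings stand in for IRIs and literals, and the output only imitates the Turtle* notation shown above; it is not a parser or serializer for the syntax specified in the technical report.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical term type: plain strings stand in for IRIs and literals.
Term = Union[str, "Triple"]


@dataclass(frozen=True)
class Triple:
    subject: Term
    predicate: str
    obj: Term


def to_turtle_star(term: Term) -> str:
    """Serialize a term; embedded triples are wrapped in << and >>."""
    if isinstance(term, Triple):
        return (f"<<{to_turtle_star(term.subject)} {term.predicate} "
                f"{to_turtle_star(term.obj)}>>")
    return term


def statement(triple: Triple) -> str:
    """Serialize a top-level triple as one Turtle*-style line."""
    return (f"{to_turtle_star(triple.subject)} {triple.predicate} "
            f"{to_turtle_star(triple.obj)} .")


age = Triple(":bob", "foaf:age", "23")
annotated = Triple(age, "ex:certainty", "0.9")
line = statement(annotated)
print(line)
```

Because `Triple` is recursive, the same two functions handle arbitrarily deep nesting, for example metadata about the certainty statement itself.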

Given the outlined notion of RDF* which supports (arbitrarily deep) nesting of triples, the crux of the proposed approach is to extend the RDF query language SPARQL accordingly. That is, in the extended query language, called SPARQL*, triple patterns may also be nested, which gives users a query syntax in which accessing specific metadata about a triple is just a matter of mentioning the triple in the subject (or object) position of a metadata-related triple pattern. For instance, by adopting the aforementioned syntax for nesting, we may query for all age-statements and their respective certainty as follows (prefix declarations omitted).

   SELECT ?p ?a ?c WHERE {
     <<?p foaf:age ?a>> ex:certainty ?c .
   }

Notice that the query is represented in a very compact form; in particular, in contrast to the corresponding queries for other proposals (e.g., RDF reification, singleton properties), this compact syntax does not require users to write verbose patterns or other constructs whose only purpose is to match artifacts that these proposals introduce to establish the relationship between a triple and the metadata about it.
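For comparison, the following sketch generates the kind of pattern that a reification-based encoding would need for the same question. It uses the standard RDF reification properties `rdf:subject`, `rdf:predicate`, and `rdf:object`; the function name `reify_pattern` and the helper variable `?_t1` are hypothetical, and this is only one possible rewriting, not necessarily the mapping defined in the technical report.

```python
from itertools import count


def reify_pattern(s, p, o, meta_p, meta_o, fresh=None):
    """Build plain SPARQL triple patterns for `<<s p o>> meta_p meta_o`.

    Assumption: the data encodes annotated statements via RDF reification.
    The rdf:type rdf:Statement pattern is omitted to keep the sketch short.
    """
    if fresh is None:
        fresh = count(1)
    t = f"?_t{next(fresh)}"  # hypothetical variable standing for the statement
    return [
        f"{t} rdf:subject {s} .",
        f"{t} rdf:predicate {p} .",
        f"{t} rdf:object {o} .",
        f"{t} {meta_p} {meta_o} .",
    ]


lines = reify_pattern("?p", "foaf:age", "?a", "ex:certainty", "?c")
query = "SELECT ?p ?a ?c WHERE {\n" + "\n".join("  " + l for l in lines) + "\n}"
print(query)
```

The single nested pattern of the SPARQL* query turns into four patterns plus an extra variable whose only role is to connect the statement to its metadata.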

In addition to nested triple patterns, SPARQL* introduces a new type of BIND clauses that allows us to express the example query in the following, semantically equivalent form.

   SELECT ?p ?a ?c WHERE {
     BIND (<<?p foaf:age ?a>> AS ?t)
     ?t ex:certainty ?c .
   }

The latter example also highlights the fact that in SPARQL*, variables in query results may be bound not only to IRIs, literals, or blank nodes, but also to full RDF* triples. For a detailed formalization of SPARQL*, including the complete extension of the full W3C specification of SPARQL, refer to Sections 4-5 of the technical report.
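The idea that a variable can be bound to a whole triple can be illustrated with a small pattern matcher. This is a simplified sketch and not the formal SPARQL* semantics from the technical report: triples are modeled as (possibly nested) Python tuples, variables are strings starting with `?`, patterns are matched only against the top-level triples of the graph, and the names `is_var`, `match`, and `evaluate` are hypothetical.

```python
def is_var(x):
    """Variables are strings that start with '?'."""
    return isinstance(x, str) and x.startswith("?")


def match(pattern, term, binding):
    """Extend `binding` so that `pattern` matches `term`, or return None."""
    if is_var(pattern):
        if pattern in binding:
            return binding if binding[pattern] == term else None
        return {**binding, pattern: term}
    if isinstance(pattern, tuple):
        # A nested triple pattern only matches a (nested) triple.
        if not isinstance(term, tuple) or len(term) != 3:
            return None
        for sub_pattern, sub_term in zip(pattern, term):
            binding = match(sub_pattern, sub_term, binding)
            if binding is None:
                return None
        return binding
    return binding if pattern == term else None


def evaluate(pattern, graph):
    """Return one binding per top-level triple that matches the pattern."""
    results = []
    for triple in graph:
        binding = match(pattern, triple, {})
        if binding is not None:
            results.append(binding)
    return results


graph = [
    ((":bob", "foaf:age", "23"), "ex:certainty", "0.9"),
    ((":alice", "foaf:age", "31"), "ex:certainty", "0.7"),
    (":bob", "foaf:name", '"Bob"'),
]

# Nested triple pattern, as in the first example query.
results = evaluate((("?p", "foaf:age", "?a"), "ex:certainty", "?c"), graph)

# A variable in subject position ends up bound to a full triple.
triple_bindings = evaluate(("?t", "ex:certainty", "?c"), graph)
print(results)
print(triple_bindings[0]["?t"])
```

In the second call, `?t` is bound to the tuple `(":bob", "foaf:age", "23")`, which mirrors the triple-valued bindings described above.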

Properties of the Approach

We emphasize three orthogonal perspectives on the proposed approach:

  1. On one hand, RDF* and SPARQL* may be understood–and used–simply as syntactic sugar on top of RDF and SPARQL. That is, any RDF*-specific syntax such as Turtle* may be parsed directly into plain RDF data that uses RDF reification or any of the other approaches to annotate statements in RDF. Likewise, SPARQL* queries may be rewritten into ordinary SPARQL queries. Based on such conversions, RDF* and SPARQL* may be supported easily by implementing wrappers on top of existing RDF triple stores. Then, users can query either RDF* data or RDF data with other forms of statement annotations, both by using SPARQL*. The formal mappings necessary as a foundation of such wrapper-based implementations have already been defined and studied, and there exists an initial set of conversion tools.
  2. On the other hand, the proposal may also be conceived of as a new abstract data model in its own right. As such, it may be implemented by developing techniques to execute SPARQL* queries directly on a physical storage model that is designed to support RDF* natively. The formal foundations of this perspective exist; that is, we have defined the RDF* data model and a formal semantics of SPARQL*. Moreover, the RDF graph database systems Blazegraph and AnzoGraph provide native support for RDF* and SPARQL*, and so does the aforementioned extension of Apache Jena.
  3. A third perspective on the approach is that it presents a step towards closing the gap between the RDF and the Property Graphs world. That is, by extending RDF and SPARQL with a feature that is similar to the notion of edge properties in Property Graphs, the approach may serve as an abstraction for integrating RDF data and Property Graphs. In fact, in addition to the aforementioned RDF*-to-RDF mappings, there already exist formal definitions of direct mappings from RDF* to Property Graphs and vice versa, and these mappings have been implemented in conversion tools.
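The first and third perspectives can be sketched in a few lines of Python. The two functions below are illustrative assumptions and not the formal mappings referred to above: `to_reification` replaces each nested triple by a blank node described with the standard reification properties, and `to_property_graph_edges` turns a triple about an embedded triple into an edge property. Triples are modeled as tuples, all names are hypothetical, and the Property Graph part handles only one level of nesting and treats every plain triple as an edge.

```python
def is_triple(term):
    return isinstance(term, tuple)


def to_reification(graph):
    """Replace each nested triple by a blank node with rdf:subject etc."""
    out, ids = [], {}

    def node_for(term):
        if not is_triple(term):
            return term
        if term not in ids:
            ids[term] = f"_:t{len(ids) + 1}"
            s, p, o = term
            out.append((ids[term], "rdf:subject", node_for(s)))
            out.append((ids[term], "rdf:predicate", p))
            out.append((ids[term], "rdf:object", node_for(o)))
        return ids[term]

    for s, p, o in graph:
        out.append((node_for(s), p, node_for(o)))
    return out


def to_property_graph_edges(graph):
    """Map `<<s p o>> key value` to an edge s -[p]-> o with {key: value}.

    Simplification: one level of nesting only; plain triples become edges
    without properties.
    """
    edges = {}
    for s, p, o in graph:
        if is_triple(s) and not is_triple(o):
            edges.setdefault(s, {})[p] = o
        elif not is_triple(s) and not is_triple(o):
            edges.setdefault((s, p, o), {})
    return [
        {"source": s, "label": p, "target": o, "properties": props}
        for (s, p, o), props in edges.items()
    ]


graph = [((":bob", "foaf:knows", ":alice"), "ex:certainty", "0.75")]
reified = to_reification(graph)
edges = to_property_graph_edges(graph)
print(reified)
print(edges)
```

A wrapper in the spirit of the first perspective would store the output of something like `to_reification` in an ordinary triple store, while the edge list shows how the same annotation reads as an edge property.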

33 Replies to “Position Statement: The RDF* and SPARQL* Approach to Annotate Statements in RDF and to Reconcile RDF and Property Graphs”

  1. Hi Peter, I have several responses for you.

    SPARQL* is an extension of SPARQL. The main new feature that SPARQL* adds to SPARQL is the notion of nested triple patterns, which are meant to match the nested triples that are possible with RDF*. Such nested triple patterns are not defined to match RDF reification triples as in your example. If you want to match such reification triples, the features of SPARQL are sufficient; i.e., you do not need any SPARQL*-specific features for that. In this sense, it is incorrect to say that SPARQL* cannot do that; it can because it is an extension of SPARQL and, thus, has all the features of SPARQL.

  2. SPARQL* query question: Suppose I have the reified triples

    :r1 rdf:subject :S1; rdf:predicate :P1; rdf:object :O1 .
    :r2 rdf:subject :S2; rdf:predicate :P2; rdf:object :O2; a rdf:Statement .

    and I want to discover the reified triples, I believe the following SPARQL* graph pattern would be syntactically incorrect:

    select * { <> }

    However this would be syntactically correct:

    select * { <> a rdf:Statement }

    So this means that :O1 cannot be matched using SPARQL*, but it can using SPARQL:

    select * { :S1 ^rdf:subject [ rdf:predicate :P1; rdf:object ?o ] }

  3. Hi Matthias, given this snippet of N3 syntax, it is not clear whether the graph { ex:bob foaf:age 23 } has an uncertainty of 0.9 or the triple inside that graph. I bet the N3 spec tells you it is the graph. In more general terms, the question for the approach that you outline is: How would this approach be able to distinguish between statements about individual triples versus statements about whole graphs where such graphs may happen to contain a single triple only?

  4. Thank you for your quick response, Olaf.
    I can understand that there is a problem if someone wants to emulate RDF* with named graphs. But N3 is not about named graphs but about quoted graphs. In addition, when handling N3 we see a triple as itself being already a graph. Therefore, nesting of quoted graphs is no problem. Sticking to your NLP problem, someone could write:

    {
    { ex:bob foaf:age 23 } ex:certainty 0.9 .
    { ex:bob foaf:knows ex:alice } ex:certainty 0.75 .
    } ex:derivedFrom "23-year-old Bob is on a date with Alice." .

    Believe me, I don’t want to invalidate your work. I just want to understand the exact reason why N3 doesn’t meet your needs.

    PS: I hope the HTML annotations work.

  5. Hi Matthias, as you mention yourself, this feature of N3 focuses on quoted graphs rather than individual triples. Of course, you may decide to use this feature only for graphs that contain a single triple and, then, interpret the metadata to be about the triple rather than the graph. However, by doing so, you will run into the same issues that I described in my response to Mark’s comment below and, in particular, in the email at https://lists.w3.org/Archives/Public/public-rdf-star/2020Feb/0013.html

  6. Hi Olaf,

    In which way are RDF* and SPARQL* different from Notation 3 (https://www.cambridge.org/core/product/identifier/S1471068407003213/type/journal_article)?

    N3 introduces quoted graphs to be able to write triples about graphs, e.g.

    { :bob foaf:age 23 } ex:certainty 0.9 .

    In fact, I would describe RDF* as a subset of N3. Would you agree?

    What was your reason to create RDF* instead of building on the already existing standard from Tim Berners-Lee?

  7. Hi Mark, thanks for your comments and sorry for the late reply; my world is a bit upside down at the moment.

    I agree that some use cases in which people want to attach properties to statements should better be modeled differently, and the n-ary relations approach is a more suitable way in that context. However, the lack of a good approach to annotate triples has been a recurring issue over the 20+ years that RDF has existed; for me, this is an indication that people have use cases in which it seems to be more natural to have triples associated with additional properties instead of modeling things in terms of n-ary relations. Another indication is the increasing popularity of the Property Graph model supported by graph database frameworks and systems such as Apache TinkerPop and Neo4j.

    To your question regarding named graphs: Within an application you may want to capture both metadata/annotations of individual statements and metadata about graphs. For instance, assume you have an NLP tool that analyzes text documents and produces RDF triples together with a certainty score for each triple. Hence, you have annotations on the statement level, which you may want to capture in your dataset; such annotations are the focus of the RDF* approach but, of course, you may also use (single-triple) named graphs for this purpose. However, assume now that you may also want to capture metadata about the whole graph (including the statement-specific annotations) that came out of your NLP tool; e.g., you may want to record that this graph was produced from a particular text document using a particular version of your NLP tool. Such graph-level metadata may also be captured by using named graphs (and this use case is the actual focus of the notion of named graphs). However, the latter is not possible anymore if you are using the concept of named graphs already to capture the statement-level annotations. That’s an example of what I meant when I wrote that using named graphs to capture statement-level annotations/metadata “inhibits an application of named graphs for other use cases.” For another comment on a related question, refer to https://lists.w3.org/Archives/Public/public-rdf-star/2020Feb/0013.html

  8. The use case(s) for the RDF* model are already well-supported by existing standards. Most requirements for attaching properties to statements are better handled by more careful modeling (e.g., creating a record that captures all of the desired information for a relationship). Examples here https://www.w3.org/TR/swbp-n-aryRelations/. I’ve seen such approaches characterized as work-arounds, but proposed alternatives, when it comes to actually implementing storage and querying, reduce to special cases of these approaches. That is to say that existing models could realize the desired gains by specialized handling at the storage layer or through custom optimizations. The remaining cases are, broadly, dealing with metadata about the statements as statements, and, for those, named graphs (a.k.a. contexts) already provide an answer. Moreover, a named graph contains a *set* of statements so that metadata is readily extended beyond just one statement. I notice in your article that you discount the use of named graphs by suggesting that using them for annotating individual statements “inhibits an application of named graphs for other use cases”, but you do not say which other uses are inhibited and how.

    I don’t have much to say about the syntactic aspect of RDF*/SPARQL*. There’s a risk of confusing the two broad classes of use-cases above with this additional syntax. The risk is there already, but may be increased by making it easier to make such statements without really distinguishing between the very distinct semantics associated with each.

  9. Cory, support of RDFS and/or OWL requires a definition of a model-theoretic semantics of the RDF* data model. I have started drafting one: https://lists.w3.org/Archives/Public/public-rdf-star/2019Aug/0013.html

    For the domain/range of such metadata properties, I think rdf:Statement could be used.

    Discussions towards creating a specification as a W3C Community Group Report happen on the following mailing list: https://lists.w3.org/Archives/Public/public-rdf-star/

  10. How do you see this supported, or not, by RDFS and/or OWL? What would be the domain of these properties and is there active work on the specification side?
    Thanks for taking this on!

  11. Cory, in principle, the answer to your question is the same as if your question had been concerned with standard (non-nested) triples only. In other words, your question is actually not specific to RDF* and statement-level metadata, but it is related to the meaning of named graphs. Then, in my opinion, a triple (nested or not) in a graph is to be considered within the context of that graph.

  12. Do you see these statements as bound to or independent of a graph? E.g., where ?t ex:certainty ?c . is based on the certainty within that graph, another graph may have another certainty for the same triple (but a different quad).

  13. Paul, I see what you mean. RDF* can be used for both. Similarly, I have seen instances of both of these cases being captured via edge properties in Property Graphs.

  14. Olaf, thanks for your response. I understand your answer, but I think I did not articulate my question well. In particular, when I said “metadata which describes the predicate”, I probably should have said “metadata which describes the instantiation of the predicate”. To take a specific comparison, consider the two following RDF* statements:
    1) <> created 01/01/2019
    2) <> at 01/01/2019

    (1) is a piece of metadata about the triple, stating when it was created. (2) is an annotation adding more specific information to the triple. These cases seem quite different to me – but perhaps I am making a false distinction here. I am assuming you would use RDF* in both situations?

  15. Paul, I recognize this distinction and I think it is an important one to make. However, I disagree with your statement that “the latter is more typical of edge properties in property graphs.” In contrast, I would say that the former is more like edge properties in PGs. To me the predicate of an RDF triple serves a purpose similar to the label of an edge in a PG (that is, it represents the type of relationship captured by the edge/triple), and I do not see the edge properties to represent descriptions of the edge labels. Instead, edge properties represent additional information about the relationship captured by the edge, which may be metadata but also other sorts of annotations. So, to answer your question: The feature that RDF* adds to RDF (nested triples) is appropriate only for the former (capturing metadata and other forms of annotations about triples). You can do the latter (adding descriptions about the predicates in triples) in RDF, and thus in RDF*, by adding triples whose subject is a URI that is the predicate in other triples. Notice that this is not possible in PGs; I mean, there is no way to provide additional information about edge labels in a PG.

  16. Olaf,
    To my mind, there is a subtle (or not so subtle) distinction between metadata which describes a triple, e.g. its provenance, degree of certainty, or a datestamp for when it was created; and metadata which describes the predicate within the triple. The latter is more typical of edge properties in property graphs. An example might be the duration for which the predicate is true. Assuming you recognise my distinction, do you see RDF* as being appropriate to both cases, or only to the former?

  17. Robert, conceptually, the primary difference is that the Singleton Properties approach requires the creation of URIs to be used as triple identifiers whereas the RDF*/SPARQL* approach does not need such artificial artifacts; instead, in the RDF*/SPARQL* approach, a triple itself is used directly in the metadata about it. As a consequence, queries do not need to contain additional patterns whose only purpose is to access the triple identifiers in order to then find the corresponding metadata based on these triple identifiers.

  18. In which way is this approach and/or implementation different to the Singleton Property?

  19. Hi Steffen, it is very similar. The only conceptual difference is that RDF* does not use explicit statement identifiers; instead, a triple itself is its identifier.

  20. Olaf,
    This is very cool, and something I’ve actually been hoping to see for a while. The notion that a triple is a resource is an obvious one, but one that has been very difficult to put into practice, and its correspondence to attributional information on specific predicates does a very nice job of bridging the gap between PGs and RDF.

    I’m hoping that you’ll follow through on this to make this a member note. This, along with the notion of fully implementing predicate variable paths would make me a happy camper indeed.
