Update (November 18, 2020): Work on a spec that defines the approach has started.
Update (August 6, 2020): Native support for RDF* and SPARQL* has found its way into the following
RDF-related programming libraries: Eclipse RDF4J, Apache Jena, RDF.rb, and N3.js.
Update (April 23, 2020): In addition to the two RDF graph database systems mentioned in
the blog post (Blazegraph and AnzoGraph), two more such systems have added support
for RDF* and SPARQL* in the meantime; these systems are Stardog and GraphDB.
Update (June 27, 2019): We now have a W3C mailing list to discuss questions related to RDF* and SPARQL*.
Update (June 9, 2019): In the meantime, I have defined SPARQL* Update.
The lack of a convenient way to annotate RDF triples and to query such annotations has been a long-standing issue for RDF. Such annotations are a native feature in other contemporary graph data models (e.g., edge properties in the Property Graph model) and there exist a number of popular use cases, including the annotation of statements with certainty scores, weights, temporal restrictions, and provenance information. To mitigate the inherent lack of native support for such annotations in the purely triple-based data model of RDF, there exist several proposals to capture such annotations in the RDF context (e.g., RDF reification as proposed in the RDF specifications, singleton properties, single-triple named graphs). However, these proposals have a number of shortcomings and none of them has yet been adopted as a (de facto) standard.
We are proposing an alternative approach that is based on nesting of RDF triples and of query patterns. This approach has already attracted interest not only in the RDF and Semantic Web research community (as indicated by some blog posts and by winning the People’s Choice Best Poster Award at ISWC 2017) but also among RDF system vendors. In fact, the approach is already supported in two commercial RDF graph database systems (Blazegraph and AnzoGraph) and in an extension of the popular Open Source framework Apache Jena. Important properties of the approach are that
The goal of this position statement is to bring the approach to the attention of the workshop attendees and to put on the workshop agenda a discussion regarding standardization opportunities for this approach.
In the remainder of this position statement we outline the approach and elaborate more on its properties.
The basis of the proposed approach is to extend RDF with a notion of nested triples. More precisely, with this extension, called RDF*, any triple that represents metadata about another triple may directly contain this other triple as its subject or its object. For instance, suppose we want to capture a statement indicating the age of Bob together with the metadata fact that we are 90% certain about this statement. RDF* allows us to represent both the data and the metadata by using a nested triple as follows.
<<:bob foaf:age 23>> ex:certainty 0.9 .
Notice that we write the nested triple using an extension of the RDF Turtle syntax that captures the notion of nested triples by enclosing any embedded triple using the strings ‘<<’ and ‘>>’. This extended syntax is called Turtle* and it is specified in Section 3.3 of our technical report.
Given the outlined notion of RDF* which supports (arbitrarily deep) nesting of triples, the crux of the proposed approach is to extend the RDF query language SPARQL accordingly. That is, in the extended query language, called SPARQL*, triple patterns may also be nested, which gives users a query syntax in which accessing specific metadata about a triple is just a matter of mentioning the triple in the subject (or object) position of a metadata-related triple pattern. For instance, by adopting the aforementioned syntax for nesting, we may query for all age-statements and their respective certainty as follows (prefix declarations omitted).
SELECT ?p ?a ?c WHERE { <<?p foaf:age ?a>> ex:certainty ?c . }
Notice that the query is represented in a very compact form; in particular, in contrast to the corresponding queries for other proposals (e.g., RDF reification, singleton properties), this compact syntax does not require users to write verbose patterns or other constructs whose only purpose is to match artifacts that these proposals introduce to establish the relationship between a triple and the metadata about it.
In addition to nested triple patterns, SPARQL* introduces a new type of BIND clauses that allows us to express the example query in the following, semantically equivalent form.
SELECT ?p ?a ?c WHERE { BIND (<<?p foaf:age ?a>> AS ?t) ?t ex:certainty ?c . }
The latter example also highlights the fact that in SPARQL*, variables in query results may be bound not only to IRIs, literals, or blank nodes, but also to full RDF* triples. For a detailed formalization of SPARQL*, including the complete extension of the full W3C specification of SPARQL, refer to Sections 4-5 of the technical report.
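To make the idea of nested triple patterns more concrete, here is a minimal sketch in Python. It is not the formal SPARQL* semantics from the technical report, only an illustration under simplifying assumptions: terms are plain Python strings and numbers, an RDF* triple is a 3-tuple whose subject or object may itself be a 3-tuple, and variables are strings that start with "?". The function names and the second age statement (about :alice) are made up for this example.

```python
def is_var(term):
    """A variable is, by assumption of this sketch, a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, term, binding):
    """Try to extend `binding` so that `pattern` matches `term`.

    Returns the extended binding, or None if there is no match.
    """
    if is_var(pattern):
        if pattern in binding:
            return binding if binding[pattern] == term else None
        # A variable may be bound to any term, including a whole nested triple.
        return {**binding, pattern: term}
    if isinstance(pattern, tuple):
        # A nested triple pattern only matches a nested triple.
        if not isinstance(term, tuple) or len(term) != len(pattern):
            return None
        for p, t in zip(pattern, term):
            binding = match(p, t, binding)
            if binding is None:
                return None
        return binding
    return binding if pattern == term else None

def query(graph, pattern):
    """Return one binding per triple in `graph` that matches `pattern`."""
    results = []
    for triple in graph:
        binding = match(pattern, triple, {})
        if binding is not None:
            results.append(binding)
    return results

# Illustrative data: two annotated age statements and one ordinary triple.
graph = [
    ((":bob", "foaf:age", 23), "ex:certainty", 0.9),
    ((":alice", "foaf:age", 30), "ex:certainty", 0.5),
    (":bob", "foaf:knows", ":alice"),
]

# Counterpart of: SELECT ?p ?a ?c WHERE { <<?p foaf:age ?a>> ex:certainty ?c . }
rows = query(graph, (("?p", "foaf:age", "?a"), "ex:certainty", "?c"))

# A variable in the subject position gets bound to a full nested triple.
triples = query(graph, ("?t", "ex:certainty", 0.9))
```

In this sketch, `rows` contains one binding per annotated age statement, and `triples` shows a variable (`?t`) bound to a whole triple, which mirrors the observation above about SPARQL* query results.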
We emphasize three orthogonal perspectives on the proposed approach:
We have studied this language in terms of a number of typical computation-related questions that people focus on when analyzing query languages. In Theoretical Computer Science such types of questions are captured in the form of formally-defined problems and the subject of study is to achieve a mathematically-proven understanding of how difficult it is to solve any possible instance of a given problem. A typical example of such a problem in the context of database query languages is to decide for any given answer that may be in the result of a given query over a given database whether this answer is indeed in the query result. You may notice that this decision problem is related to the task of producing the query result. In fact, the difficulty of this problem (for a given query language) is one of the most commonly used formal measures to identify how computationally complex a query language is. The problem is usually called the evaluation problem of a query language.
One of our contributions in our paper has been to show that the evaluation problem of the GraphQL language is “very simple”. More precisely, we show that the problem is NL-complete. It is well-known that NL-complete problems can be solved in practice by programs that use a high degree of parallelism. Hence, this gives us our first positive observation about the GraphQL language. Admittedly, this is a very abstract finding. However, to appreciate this property of the GraphQL language we may look at other well-known query languages. For instance, we may consider the Relational Algebra, which is the formal foundation of SQL. It has been shown that the complexity of the evaluation problem of the Relational Algebra is PSPACE-complete. The same has been shown for version 1.0 of the RDF query language SPARQL. Problems in this class are highly intractable. As another example, for general conjunctive queries the evaluation problem is NP-complete. NP-complete problems are commonly believed to be less complex than PSPACE-complete problems but still intractable. In contrast, there is a class called PTIME which consists of problems that are known to be tractable; that is, for each such problem there exists an algorithm whose runtime is polynomial in the size of the input. This is the case for SPARQL basic graph patterns (which comprise the conjunctive fragment of SPARQL) for which the evaluation problem is in PTIME. How does this compare to the NL-completeness of the evaluation problem of GraphQL? The good news for GraphQL fans is that NL is an even lower complexity class than PTIME.
After having done this initial theoretical comparison of the GraphQL language to other query languages, we have looked at another, more practical computation-related problem. This problem is called the enumeration problem and is concerned with how difficult it is to produce the complete result of a query. To study this problem we made our life a bit easier by considering only a particular class of GraphQL queries. We call the queries in this class “non-redundant queries in ground-typed normal form”. Informally, these are queries for which the process of field collection (as specified in Section 6.3.2 of the GraphQL spec) is not necessary. For these queries, and any arbitrary set of data, we have shown that the respective query results can be produced symbol-by-symbol with only constant time delay between symbols. This property is the best that one can hope for in a query language and not many query languages possess this property!
At this point it has to be emphasized that focusing on non-redundant queries in ground-typed normal form is not a limitation of our work. On the contrary! As we also show in our paper, every GraphQL query can be rewritten into a non-redundant, ground-typed query that is guaranteed to produce the same result. For the necessary rewriting rules refer to Proposition 3.9 in our paper.
Given our positive finding regarding the constant time delay, we can also infer the following property of the GraphQL language: The time required to produce the complete query result (for any non-redundant query in ground-typed normal form) depends linearly on the size of this result. This is another highly desirable property of a query language!
The only caveat regarding the latter finding is that there are cases in which the size of the result of a GraphQL query depends exponentially on the size of the query. For a simple example, consider querying data with a GraphQL schema whose query type has a field that allows us to retrieve data about a person, say Alice; this data has a scalar field name with value Alice and another field knows with an array of two data objects about two other persons; each of these data objects also has a knows field referring back to Alice. Then, for a query of the form
start { knows { knows { ... { knows { name } } ... } } }
in which 2xN knows fields are nested, the result contains the value Alice 2^N times! In other words, increasing the size of the query linearly (adding two more levels of nesting in each step) will cause the size of the query result to increase exponentially. This is not just a theoretical issue. In the introduction of our paper we describe an experiment in which we have observed the issue when querying the public GraphQL API of GitHub. As a sidenote, and in all fairness to the GraphQL language, such an exponential result size blow-up is possible in many query languages (all that is needed is the possibility of expressing conjunctive queries).
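The blow-up can be reproduced with a few lines of Python. The following is only a sketch of the example above, not a GraphQL implementation: the data is a hand-written dictionary, the identifiers `p1` and `p2` and their names are hypothetical stand-ins for the two other persons, and `count_names` is a made-up helper that counts how many name values a query with a given number of nested knows fields would return.

```python
# Hypothetical data mirroring the example: Alice knows two persons,
# and each of them knows Alice again.
data = {
    "alice": {"name": "Alice", "knows": ["p1", "p2"]},
    "p1": {"name": "P1", "knows": ["alice"]},
    "p2": {"name": "P2", "knows": ["alice"]},
}

def count_names(node_id, depth):
    """Count the name values in the result when `depth` knows fields are
    nested below the object `node_id` and the innermost field is name."""
    if depth == 0:
        return 1
    return sum(count_names(nxt, depth - 1) for nxt in data[node_id]["knows"])

# With 2*N nested knows fields, every returned name is Alice's.
for n in range(1, 6):
    print(2 * n, count_names("alice", 2 * n))
```

Each additional pair of knows levels doubles the count, so the printed counts are the powers of two. Note that this naive counting function itself takes time proportional to the number of values it counts.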
While we are the first to formally quantify the possible extent of the result size blow-up of GraphQL, the basic issue of potentially huge results has been recognized before in the GraphQL community. Existing approaches to address this issue are to restrict queries i) based on their nesting depth, ii) based on a calculation of the maximum possible number of result nodes at each level of the nesting depth, or iii) based on some cost estimation or complexity estimation for which users have to provide various cost factors for different elements of GraphQL schemas. Notice that all these approaches use heuristics and none of them takes into account the actual data being queried. As a consequence, these approaches fall short in providing a robust solution for the issue as they can fail in both directions: discarding requests for which the response may be of a manageable size, and allowing requests for which producing the complete response is too resource intensive.
Instead of relying on data-agnostic heuristics and estimates, we introduce an approach which is accurate. This approach comes in the form of an algorithm that returns the exact size of the result of any given non-redundant GraphQL query in ground-typed normal form (see above) over any possible data set (without actually producing this query result). We use this algorithm to prove that the result size can be computed in an amount of time that depends only linearly on the product of the query size times the data set size. Hence, our algorithm achieves this complexity bound. The idea of the algorithm is to recursively consider every subquery and add up the sizes of results of the subquery for every part of the queried data for which the subquery has to be evaluated. During the recursive process, data objects are labeled with the result sizes of subqueries evaluated at them. These labels are then used to avoid the repeated calculation of subquery result sizes.
As an example, consider some simple example data that may be illustrated as the following graph of objects (where the node named r in the graph corresponds to the query object that would have to be specified in a GraphQL schema for this data).
Moreover, consider the following GraphQL query.
query { start { advisor { univ { name } } friend { univ { name } } } }
Apparently, the result of this query over our example data would be the following JSON object.
start: { advisor: { univ: { name: UCh } } friend: { univ: { name: UCh } } }
Now, instead of immediately starting to produce this result, suppose we first want to calculate its size. We measure the size of GraphQL query results in terms of symbols, where every field name counts as a symbol, and so does every scalar value and every special character (such as colons and curly braces). For instance, the aforementioned result of our example query happens to be of size 26. However, we want to calculate this size without already having produced the query result. To this end, we may first look at the root field start of the query object, and we observe that the number of result symbols that would be produced from this field is 4+x where the number 4 covers the symbols in the first and the last line of the example result and x is the sum of the result symbols obtained from going to the data object u and evaluating each of the two subqueries, q1 = advisor{univ{name}} and q2 = friend{univ{name}}. To make the following explanation more concise, let us denote the size of these two subquery results by writing size(q1,u) and size(q2,u), respectively. Thus, we know that the size of the complete query result is equivalent to
4 + size(q1,u) + size(q2,u)
What remains is to calculate the sizes of the two subquery results, which we can do in a recursive fashion using a similar reasoning as before: First, let’s focus on size(q1,u). By looking at the data object u, we observe that the root field advisor of subquery q1 is present in u. Therefore, we know that the subquery result will contain 4 result symbols from the advisor field plus the symbols obtained from going to data object v and evaluating the subquery q3 = univ{name}. Hence, we have that
size(q1,u) = 4 + size(q3,v)
and we need to enter the recursion again to calculate size(q3,v). In this case, we have that
size(q3,v) = 4 + size(q4,w)
with subquery q4 = name, and another recursion step is needed for size(q4,w). This step now brings us to a base case of the recursion because, by looking at data object w, we see that the result of subquery q4 is name:UCh and, thus, contains exactly 3 symbols. At this point, our algorithm adds a label to data object w to indicate that the result size of subquery q4 at this object is 3. Next, the recursion comes back to calculating the value of size(q3,v), for which it now becomes clear that this value is 7. So, data object v is labeled to indicate that the result size of subquery q3 at this object is 7. Returning now to the calculation of size(q1,u), it becomes clear that size(q1,u)=11 and the data object u is labeled accordingly.
Being back at the initial step of the recursion now, we still have to calculate the value of size(q2,u). Observe that q2 contains the subquery q3 that we have seen before and it also has to be evaluated at the same data object as before (that is, object v). Therefore, to calculate
size(q2,u) = 4 + size(q3,v)
it is now possible to obtain the value of size(q3,v) by reading the corresponding label at object v instead of going into the recursion again for size(q3,v). The remaining steps of the calculation should not be difficult to guess at this point.
As can be observed from this example, labeling the data objects during the execution of our algorithm enables the algorithm to avoid repeating the same calculations and, thus, to achieve the polynomial runtime (even in cases in which actually producing the query result would have an exponential runtime). For a pseudo-code representation of the complete algorithm refer to our paper. We also have a prototypical JavaScript implementation of the algorithm.
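The worked example can be replayed with a small Python sketch. This is not the algorithm from the paper nor the JavaScript prototype; it is a simplified illustration under several assumptions: every object field points to a single object (no arrays), a requested field that is absent from a data object contributes nothing to the size, and queries are written as nested tuples of the form `(field_name, subfields)` with `None` for scalar fields. The dictionaries `scalars` and `edges` and the function `field_size` are names invented for this sketch.

```python
# Example data graph: r -start-> u, u -advisor-> v, u -friend-> v,
# v -univ-> w, and w has the scalar field name with value UCh.
scalars = {"w": {"name": "UCh"}}
edges = {
    "r": {"start": "u"},
    "u": {"advisor": "v", "friend": "v"},
    "v": {"univ": "w"},
}

def field_size(field, node, labels):
    """Number of result symbols produced by `field` at data object `node`.

    `labels` maps (subquery, object) pairs to sizes that were already
    computed, so no pair is ever calculated twice.
    """
    key = (field, node)
    if key in labels:
        return labels[key]
    name, selection = field
    if selection is None:
        # Scalar field: field name, colon, value (assumed 0 if absent).
        size = 3 if name in scalars.get(node, {}) else 0
    else:
        target = edges.get(node, {}).get(name)
        if target is None:
            size = 0  # assumption of this sketch: absent fields add nothing
        else:
            # Field name, colon, and two curly braces, plus the subresults.
            size = 4 + sum(field_size(f, target, labels) for f in selection)
    labels[key] = size
    return size

# The subqueries from the text.
q4 = ("name", None)
q3 = ("univ", (q4,))
q1 = ("advisor", (q3,))
q2 = ("friend", (q3,))
root = ("start", (q1, q2))

labels = {}
total = field_size(root, "r", labels)
print(total)  # 26
```

After the call, `labels` holds the sizes 3 for (q4, w), 7 for (q3, v), and 11 for both (q1, u) and (q2, u). The value for (q3, v) is computed once and then read from its label when q2 is processed, which is the reuse step described above.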
Given this algorithm, we propose that GraphQL servers should perform the following steps for any given query:
Before wrapping up, I would like to also highlight some of the conceptual contributions that we have made in addition to all the aforementioned technical contributions. As a foundation of our work, we have developed a formally more precise definition of the GraphQL language. The main reason why this was necessary is that the semantics of GraphQL queries—i.e., the definition of what the expected result of any given query is—is given in the GraphQL specification by means of a recursive program specified by pseudo code. This recursion is based on an operation to resolve any field in a query. Surprisingly, this operation is not fully specified and, instead, simply assumes access to an “internal function […] for determining the […] value of [the] field.” While the lack of a more precise definition of this internal function is likely to be intentional (to allow for implementations of GraphQL on top of arbitrary database back-ends), it makes a systematic analysis of the GraphQL language unworkable. As a basis for providing a well-defined formalization of the language, we had to introduce a formal model that defines the notion of a so-called GraphQL graph. This notion is an artifact of realizing that GraphQL enables users to query an underlying data set in terms of a virtual, graph-like view of this data set. The form of this view depends on the GraphQL schema being used, and the resolver functions that implement the schema can be understood to establish the view. Our notion of a GraphQL graph is an abstraction to conceptualize such a view independent of its relationship to the underlying data set. As a sidenote for readers who are familiar with the notion of Property Graphs as supported by popular graph database systems like Neo4J and by the Apache Tinkerpop graph processing framework, a GraphQL graph is very similar to a Property Graph (with a few additional, GraphQL-specific features). The technical details of all these definitions can be found in our paper.
As a final remark, I want to thank the GraphQL community (in particular, the initial creators of the original GraphQL spec) for the excellent documentation of GraphQL which has made our work of formalizing the language a pleasant journey.
While the experimental evaluations in the various LDF-related research papers have provided us with a comprehensive elementary understanding of the existing proposals and their respective trade-offs, I strongly believe there is much more interesting work to be done regarding LDFs.
However, you know what I always thought would be great to have in this context? Since the beginning of the LDF work, I was looking for a way that allows us to achieve a more fundamental understanding of possible LDF interfaces, including interfaces that have not yet been implemented! In particular, I was after a formal framework that allows us to organize LDF interfaces into some kind of a lattice, or perhaps multiple lattices, based on the fundamental trade-offs that the interfaces entail. Such lattices would not only provide us with a more complete picture of how different interfaces compare to each other, they would also be a basis for making more informed decisions about whether it is worth spending the time to implement and study a possible interface experimentally.
As you likely have guessed by now, such a formal framework is not just an idea anymore. Together with Jorge Pérez and Ian Letter at the Universidad de Chile, we have developed an abstract machine model for which we have shown that it is a suitable foundation for the type of formal framework described above. From a computer science point of view, the most exciting part of this work is that our abstract machine model presents a basis for defining new complexity measures that allow us to capture many more aspects of computation in a client-server setting than what is captured by the classical measure of computational complexity. We will present this work next week at the 16th International Semantic Web Conference (ISWC). If you are interested in reading about our machine model and how we applied it to study various existing types of LDF interfaces, refer to our research paper about it (and, yes, we have actual lattices in that paper).
During the last week of June, I co-organized a Dagstuhl seminar on Federated Semantic Data Management together with Maria-Esther Vidal and Johann-Christoph Freytag. It was a very intense week with a packed schedule and almost no time to catch some breath (exactly how a Dagstuhl seminar should be, I guess).
To start with, we had scheduled a few short, survey-style talks on a number of topics related to the seminar. In particular, these talks covered:
While these talks were meant to establish a common understanding of key concepts and terminology, the major focus of the seminar was on discussions and working groups. To this end, we had invited a good mix of participants from the Semantic Web field, from Databases, as well as from application areas. Due to this mix, we ended up on several occasions and in different constellations discussing and reflecting in depth on the fundamental assumptions and the core ideas of federated semantic data management. These general discussions and reflections kept re-emerging not only during the sessions, but also during the meals, the coffee breaks, and the evenings in Dagstuhl’s wine cellar. In my opinion, clearly articulating and repeatedly arguing about these assumptions and ideas was a long-needed discussion to be had in the community. After this week, I would guess that many of the participants have a much clearer understanding of what federated semantic data management can and should be, and I am certain that this understanding will be reflected in the reports that the working groups are preparing.
Speaking of working groups, the seminar was structured around four topics addressed by four separate working groups who came together occasionally to report on their progress and obtain feedback from the other groups. The topics were:
Each of the working groups is currently preparing a summary of their discussions and results. These summaries will become part of our Dagstuhl report (to be published some time in August if all goes well). In addition to this report, we are planning to document the discussions and the results of the seminar in a collection of more detailed publications.
What’s next? We have some ideas to keep the momentum going and to advance the discussions around the seminar topics in a more continuous community process. Stay tuned.