- UNFINISHED DRAFT -
Semantic Data Federation
(Previously entitled "Thoughts on RDF Aggregation and SPARQL Query Distribution")
Views expressed herein are those of the author and do not necessarily reflect the official position of any entity affiliated with the author.
Abstract
This document captures some thoughts on RDF data federation and query
distribution, which are particularly relevant to SPARQL but may also be
relevant to other query languages.
Introduction
I have previously written about the notion of an aggregator [
http://dbooth.org/2008/stc/slides.ppt ] whose purpose is to provide
uniform access to multiple data sources whose data may be represented in
formats, vocabularies or ontologies different from those the consumers
need, as illustrated below.

For example, a consumer -- typically some kind of client
application -- might make SPARQL queries across information that was
derived from multiple data sources, and the aggregator could use a set
of ontologies and rules to transform the incoming data to the data
needed by the consumer.
The aggregator may provide:
- a REST-based SPARQL interface for consumer queries;
- transformation to/from the source data;
- query distribution.
For reading and writing, the aggregator may provide, for each data
source:
- syntactic transformation from the data source's native format to
RDF and vice versa;
- ontological transformation (or inferencing) between the data source's
native ontology and one or more intermediate ontologies;
- ontological transformation (or inferencing) between the intermediate
ontologies and the ontologies used in the query.
For writing, the transformations or inferences must be performed in the
opposite direction. For example, if read access involves
inference rules that transform data expressed in ontology X to ontology
Y, then write access involves inference rules that transform from Y back to X.
The SPARQL CONSTRUCT facility [@@sparql@@] is helpful in query
distribution, since it returns another RDF graph.
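As a rough illustration (a minimal sketch assuming the Python rdflib library; the src: and tgt: vocabularies and the sample data are made up for this example), a CONSTRUCT query can rewrite data expressed in a source ontology into the ontology the consumer expects:

    # Sketch only: rdflib is assumed; the src:/tgt: vocabularies and the
    # example data are hypothetical, not taken from any real data source.
    from rdflib import Graph

    source_data = """
    @prefix src: <http://ds1.example/ns#> .
    <http://ds1.example/item/1> src:familyName "Smith" .
    """

    g = Graph()
    g.parse(data=source_data, format="turtle")

    # CONSTRUCT yields a new RDF graph, here expressed in the consumer's ontology.
    construct_query = """
    PREFIX src: <http://ds1.example/ns#>
    PREFIX tgt: <http://consumer.example/ns#>
    CONSTRUCT { ?s tgt:surname ?name . }
    WHERE     { ?s src:familyName ?name . }
    """

    transformed = Graph()
    for triple in g.query(construct_query):
        transformed.add(triple)

    print(transformed.serialize(format="turtle"))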
Data source adaptors
Each data source is connected to the aggregator by an adaptor, which
performs:
- syntactic transformation between the data source's native format
and RDF;
- ontological transformation between the native ontology and one or
more intermediate ontologies; and
- caching of the resulting RDF.
If data from the data source is cached, the adaptor should faithfully
implement HTTP caching [@@RFC2616@@].
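To give a rough idea of what faithful HTTP caching means for an adaptor, here is a minimal sketch using conditional requests (the Python requests library is assumed, and a real adaptor would also honor Cache-Control and Expires, not just ETag validation):

    # Sketch only: simplified caching policy; requests library assumed.
    import requests

    _cache = {}  # url -> (etag, body)

    def fetch_rdf(url):
        headers = {"Accept": "text/turtle"}
        cached = _cache.get(url)
        if cached:
            # Ask the data source to answer 304 Not Modified if nothing changed.
            headers["If-None-Match"] = cached[0]
        response = requests.get(url, headers=headers)
        if response.status_code == 304 and cached:
            return cached[1]                  # serve the cached copy unchanged
        etag = response.headers.get("ETag")
        if etag:
            _cache[url] = (etag, response.text)
        return response.text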
Query transformation rules
In general, query distribution involves transforming a given query into
n subqueries to be applied to n data sources. Such transformations can
be expressed as query transformation rules (or inference rules). Given a
consumer query and metadata about the available data sources, these
rules transform the query into a query distribution recipe consisting of:
- a set of subqueries for the n data sources that will be used to
fulfill the consumer's query;
- for each of the n subqueries, a data-source-specific set of
post-processing rules for transforming the results of that data source
into either an intermediate or final ontology;
- a global set of post-processing rules for combining the results
of the n subqueries.
A write query will involve transformations in the opposite direction.
Query distribution rules may be based on URI patterns, query patterns,
or other criteria. A sketch of one possible recipe structure is shown below.
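Here is a minimal sketch of what a query distribution recipe might look like as a data structure; the class and field names are illustrative assumptions, not a defined format:

    # Sketch only: class and field names are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SubqueryPlan:
        data_source_uri: str
        subquery: str                              # SPARQL sent to this source
        post_processing_rules: List[str] = field(default_factory=list)

    @dataclass
    class QueryDistributionRecipe:
        subqueries: List[SubqueryPlan]
        # Rules applied after the per-source results have been merged.
        global_post_processing_rules: List[str] = field(default_factory=list)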
Cost estimation and optimization
Rules may also be provided to perform cost estimation prior to
executing the query, and these cost estimates may be used in selecting
a preferred query distribution. Actual costs may also be
aggregated from the measured actual costs of the individual subqueries,
and this cost information may be recorded for use in optimizing future
queries. Costs may be expressed in terms of computational expense
(e.g., CPU time), real time, currency or in other ways.
Queries that are more costly than expected can be prevented by use of a
cost approval threshold.
If the estimated cost of the query exceeds the cost approval threshold,
then the query may be rejected or the user may be prompted for a
confirmation before the query is processed. ("Your query is
estimated to require 2 hours 23 minutes to complete. Are you sure
you want to proceed?") Multiple thresholds could also be
used, associated with multiple levels of approval.
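A minimal sketch of applying such thresholds follows; the threshold values and the returned labels are illustrative assumptions:

    # Sketch only: thresholds and labels are illustrative.
    def cost_decision(estimated_seconds, auto_limit=60, confirm_limit=3600):
        if estimated_seconds <= auto_limit:
            return "run"        # cheap enough to run without asking
        if estimated_seconds <= confirm_limit:
            return "ask-user"   # e.g. "Estimated 2 hours 23 minutes. Proceed?"
        return "reject"         # exceeds even the confirmable threshold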
Consumer-side query distribution
As an optimization, in some cases it may be more efficient for the
subqueries to be
performed directly by the consumer, rather than being performed by the
aggregator. For example, if there is a high transmission cost
involved in transferring data from the data sources to the aggregator
and thence to the consumer, then it may be more efficient to transfer
the data directly from the data sources to the consumer. In such
cases, instead of the aggregator performing the transformation and
aggregation functions described above, the query distribution recipe
can be returned to the consumer, and the consumer can simply execute
this recipe to obtain the desired results. This may proceed as
follows:
- Consumer query is sent to the aggregator.
- HTTP content negotiation is used to decide whether the aggregator
should perform the query distribution itself or return the query
distribution recipe (see the sketch after this list).
- Aggregator applies a query distribution strategy to create or
select a query distribution recipe.
- If consumer-side query distribution is requested, the query
distribution recipe is returned. Otherwise, the aggregator executes the
recipe and returns the query results.
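The content negotiation step might look roughly like the following sketch; the aggregator URL and the recipe media type are illustrative assumptions (only application/sparql-results+json is a registered media type here), and the Python requests library is assumed:

    # Sketch only: aggregator URL and recipe media type are made up.
    import requests

    AGGREGATOR = "http://aggregator.example/sparql"

    def ask_aggregator(query, want_recipe=False):
        accept = ("application/x-query-distribution-recipe+json"
                  if want_recipe
                  else "application/sparql-results+json")
        return requests.get(AGGREGATOR,
                            params={"query": query},
                            headers={"Accept": accept})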
URI translation
Data sources can be identified by URIs, though one should bear in mind
that the URI that names a data source is not necessarily the URI at
which its data should be accessed: naming and locating are two different
things, though in many cases the same URI will be used both for naming
and for access. In the event that a different URI should be used for
access, a URI translating proxy [ http://dbooth.org/2008/irsw/#proxy ]
can conveniently provide transparent access, so that consumers can
always use the data source URIs as though they are access URIs. For
example, a data source may be identified by the URI http://ds1.example
but its data might actually be accessed from http://foo1.example.
The aggregator should also provide URI translation information (Apache
URL rewrite rules, for example), so that this information need not be
maintained by consumers. If consumer-side query distribution is used,
these rewrite rules could be included with the query distribution recipe.
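On the consumer side, the translation information could be as simple as a lookup table. This minimal sketch reuses the example URIs above and falls back to the naming URI when no access URI is known:

    # Sketch only: a real deployment might ship these mappings as Apache
    # rewrite rules instead of a lookup table.
    ACCESS_URI = {
        "http://ds1.example": "http://foo1.example",
    }

    def access_uri_for(data_source_uri):
        # Fall back to the naming URI when no translation is defined.
        return ACCESS_URI.get(data_source_uri, data_source_uri)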
Abstractly, a data source is similar to a named graph, except:
- the content of a data source may change, i.e., it does not
necessarily hold a fixed set of RDF triples; and
- a data source is accessible via an HTTP SPARQL query (see the sketch below).
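For example, querying a data source over HTTP might look like this sketch, assuming the Python SPARQLWrapper library and an illustrative endpoint URI:

    # Sketch only: the endpoint URI is made up; SPARQLWrapper is assumed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://foo1.example/sparql")
    endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])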
Optimization / distribution strategies
A query optimization or distribution strategy can be viewed in various
ways:
- as a set of query transformation rules, as described above;
- as an algorithm that, given a consumer query and data source
metadata, either:
- returns a query distribution recipe; or
- returns the consumer query results.
The aggregator should support a pluggable strategy interface, so that
different optimization/distribution strategies can be plugged in.
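A minimal sketch of such a pluggable interface follows; the class and method names are illustrative assumptions, and the "naive broadcast" strategy is only meant to show the shape of the interface:

    # Sketch only: names are illustrative assumptions, not a defined API.
    from abc import ABC, abstractmethod
    from typing import Dict, List

    class DistributionStrategy(ABC):
        @abstractmethod
        def plan(self, consumer_query: str, sources: List[Dict]) -> List[Dict]:
            """Return a recipe: one {'source': ..., 'subquery': ...} per source."""

    class NaiveBroadcastStrategy(DistributionStrategy):
        def plan(self, consumer_query, sources):
            # Simplest possible strategy: send the whole query to every source.
            return [{"source": s["uri"], "subquery": consumer_query}
                    for s in sources]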
Data source metadata
Data source metadata may include, for example:
- access control requirements (expressed in a named graph, for example);
- query cost estimation information;
- provenance and data freshness information; and
- information about what data is contained in that data source.
Provenance and named graphs
Named graphs [@@add ref for RDF named graphs@@] are useful for tracking
provenance. Each named graph can have associated provenance
information. When new RDF assertions are generated from inference rules,
the entailments can be bundled into a named graph, and the provenance
information can indicate the dependencies that were used to create it,
such as other named graphs or data sources and the particular set of
inference rules that produced the entailments. If the inference rules are themselves
expressed in RDF, they can also be bundled into a named graph and
identified by URI. This is one reason why an RDF representation
of RIF[@@] would be useful.
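The following sketch, assuming the Python rdflib library, shows the general idea of bundling entailments into a named graph and recording their provenance; the graph URIs and provenance predicates are illustrative assumptions, not a standard vocabulary:

    # Sketch only: URIs and predicates are made up for illustration.
    from rdflib import Dataset, URIRef, Literal

    ds = Dataset()
    entailments = URIRef("http://aggregator.example/graph/entailments-42")

    # Bundle the inferred triples into their own named graph.
    g = ds.graph(entailments)
    g.add((URIRef("http://ds1.example/item/1"),
           URIRef("http://consumer.example/ns#surname"),
           Literal("Smith")))

    # Record what the entailments were derived from, in a provenance graph.
    prov = ds.graph(URIRef("http://aggregator.example/graph/provenance"))
    prov.add((entailments,
              URIRef("http://aggregator.example/ns#derivedFrom"),
              URIRef("http://ds1.example")))
    prov.add((entailments,
              URIRef("http://aggregator.example/ns#usedRuleSet"),
              URIRef("http://aggregator.example/rules/map-v1")))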
Caching and cache re-generation
As described in slides 27-35 of "RDF as a Lingua Franca: Key
Architectural Strategies" [
http://dbooth.org/2009/stc/dbooth-SemTechConf-2009.ppt ] it is helpful
to be able to cache particular named graphs of data. This may be
data that came directly from one of the data sources, or it may be data
that has been derived or "distilled" from other data, using a set of
rules and ontologies that are plugged in to the aggregator. These
named graphs may also be used (with rules and ontologies) to create
even more named graphs, as illustrated on slide 33.
Since these named graphs are generated from rules and ontologies that
can be recorded and applied again, it is useful to think of them as
being cached data that depend on the datasets, rules and ontologies
from which they are derived. Along with the rules and ontologies
that are used to generate a particular named graph, additional metadata
can be recorded (associated with that named graph) to indicate such
things as characteristics of the cache (e.g., when it was last updated)
and policies for re-generating it when it is stale (e.g., on demand,
eagerly whenever any of its dependencies changes, or periodically based
on a schedule). The dependency graph of these named graphs (or
"caches") is analogous to a dependency graph used by Make [
http://www.gnu.org/software/make/ ] or Ant [ http://ant.apache.org/
]. The aggregator should automatically (re-)generate these named
graphs as needed, based on this metadata.
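A minimal sketch of this Make-style behaviour follows; the class and field names are illustrative assumptions, timestamps stand in for whatever freshness metadata the aggregator records, and real metadata would also carry the regeneration policies described below:

    # Sketch only: names and the freshness test are illustrative.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class CachedGraph:
        uri: str
        last_updated: float = 0.0                      # POSIX timestamp
        dependencies: List["CachedGraph"] = field(default_factory=list)
        regenerate: Callable[[], None] = lambda: None  # runs the rules

    def refresh(cache):
        # Depth-first, Make-style: freshen every dependency first.
        for dep in cache.dependencies:
            refresh(dep)
        if any(dep.last_updated > cache.last_updated
               for dep in cache.dependencies):
            cache.regenerate()
            cache.last_updated = max(dep.last_updated
                                     for dep in cache.dependencies)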
In some cases the aggregator may actually generate new named graphs
that did not previously exist. For example, a rule may indicate how to
generate a monthly report having a new URI that includes the year and month.
One advantage of this approach is that the working set [
http://en.wikipedia.org/wiki/Working_set ] of data that must be
considered by the inference engine at any point is limited to the
dependencies of the named graph that is to be generated, thus improving
efficiency and scalability.
Key information associated with each cache includes:
- rules used to generate or refresh that cache;
- policies that indicate when the cache should be regenerated, such as
eager (whenever a dependency changes), lazy (when data from the cache is
requested) or periodic (such as daily, hourly or weekly); and
- dependencies: the named graphs or data sources on which this cache
depends. This (either explicitly or implicitly) should include the rules
that generate the target cache, since a change in the rules would render
the target cache invalid.
Of course, other information may also be associated with a cache (a
combined sketch follows this list), such as:
- timestamp of when it was last refreshed;
- description of its purpose;
- who last refreshed it;
- who owns it;
- permissions required for access.
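Pulling these items together, the metadata for a cache might be modelled roughly as follows; the names and the choice of fields are illustrative assumptions:

    # Sketch only: field names and the policy vocabulary are illustrative.
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class RefreshPolicy(Enum):
        EAGER = "eager"        # whenever a dependency changes
        LAZY = "lazy"          # when data from the cache is next requested
        PERIODIC = "periodic"  # on a schedule, e.g. hourly or daily

    @dataclass
    class CacheMetadata:
        graph_uri: str
        rule_set_uri: str                  # rules that (re)generate the cache
        policy: RefreshPolicy
        dependencies: List[str]            # URIs of graphs or data sources
        last_refreshed: Optional[str] = None
        owner: Optional[str] = None
        required_permissions: List[str] = field(default_factory=list)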
Mapping rules (mapcar)
Usually a cache is generated by applying the regeneration rules to the
combined data contained in its dependencies. This is the common,
general case, and in this case the regeneration policies would normally
be applied at the granularity of each dependency as a whole. For
example, if A is dependent on B and C, and A has an eager regeneration
policy, then all of A would be regenerated each time either B or C
changes.
However, if A and B consist of corresponding members a1..an and b1..bn,
and each ai depends only on bi and uses the same regeneration rules R,
then the regeneration of A as a whole can be performed more efficiently
by regenerating only each ai
for which bi has changed. Thus, given B and R, A is regenerated by
mapping R across all members of B and collecting the results as A, just
like the classic "mapcar" function from Lisp or the "map" function in
Perl. This is illustrated in figure @@.

This mapcar function can also be generalized to cases where A depends
on multiple caches, B[1]..B[m], such that A[i] = R(union from j=1..m of
B[j][i]).
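A minimal sketch of this mapcar-style regeneration follows; R, A, B and the set of changed indices are placeholders, and in the generalized case each B[j][i] is assumed to be a set so that their union is well defined:

    # Sketch only: R, A, B and `changed` are placeholders.
    def regenerate_mapcar(R, A, B, changed):
        """A[i] depends only on B[i]; recompute only the members that changed."""
        return [R(B[i]) if i in changed else A[i] for i in range(len(B))]

    # Generalization to multiple dependency caches B[0]..B[m-1], where
    # A[i] = R(union over j of B[j][i]); each B[j][i] is assumed to be a set.
    def regenerate_member_multi(R, caches, i):
        members = set()
        for B in caches:
            members |= B[i]
        return R(members)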
Bidirectional cache access
Thus far the discussion has focused on read-only access to a
cache. But life gets much more interesting when cache users can
also write to the cache, thus causing data to be propagated back into
that cache's dependencies. Of course, the ability to do this
depends on the rules that were used to generate that cache. If
information is lost when the rules are applied, then in general the data
cannot be propagated back to the dependencies. For example, if the user
tries to change a value 5 to 6 in A, but that value was computed by
adding 2 and 3 from dependencies B and C, then the change cannot be
propagated back without more information. However, some
transformations are information preserving and can be
bidirectional. @@Add reference to Ben Grossoff paper@@
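As a trivial sketch of an information-preserving transformation that can run in both directions (the unit conversion is purely illustrative), a write to the cached value can be pushed back to the dependency by running the rule in reverse:

    # Sketch only: a toy invertible rule between a dependency and its cache.
    def to_cache(metres):
        return metres * 100.0        # dependency value -> cached value (cm)

    def to_dependency(centimetres):
        return centimetres / 100.0   # cached value -> dependency value (m)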
@@To Do@@
To search:
- federated SPARQL query
- DARQ
- RDFCube query processor/Sesame
References
@@TBD@@
http://www-bd.lip6.fr/roses/doku.php?id=but_du_projet
Change history
18-May-2010: Minor editorial improvements.
20-Oct-2009: Added mapcar diagram and explanation. Highlighted key things associated with a cache: rules, policies, dependencies.
16-Jul-2009: Changed the title from "Thoughts on RDF Aggregation and SPARQL Query Distribution" to "Semantic Data Federation" and added more on caching.
10-Apr-2009: Initial draft published.