- UNFINISHED DRAFT -
Semantic Data Federation
(Previously entitled "Thoughts on RDF Aggregation and SPARQL Query Distribution")
Views expressed herein are those of the author and do not necessarily reflect the official position of any entity affiliated with the author.
Abstract
This document captures some thoughts on RDF data federation and query
distribution, which are particularly relevant to SPARQL but may also be
relevant to other query languages.
Introduction
I have previously written about the notion of an aggregator [
http://dbooth.org/2008/stc/slides.ppt ] whose purpose is to provide
uniform access to multiple data sources whose data may be represented in
formats, vocabularies or ontologies different from those the consumers
need, as illustrated below.

For example, a consumer -- typically some kind of client
application -- might make SPARQL queries across information that was
derived from multiple data sources, and the aggregator could use a set
of ontologies and rules to transform the incoming data to the data
needed by the consumer.
The aggregator may provide:
- a REST-based SPARQL interface for consumer queries;
- transformation to/from the source data;
- query distribution.
For reading and writing, the aggregator may provide, for each data
source:
- syntactic transformation from the data source's native format to
RDF and vice versa;
- ontological transformation (or inferencing) between the data source's
native ontology and one or more intermediate ontologies;
- ontological transformation (or inferencing) between the intermediate
ontologies and the ontologies used in the query.
For writing, the transformations or inferences must be performed in the
opposite direction. For example, if read access involves
inference rules that transform data expressed in ontology X to ontology
Y, then write access involves inference rules that transform from Y back to X.
The SPARQL CONSTRUCT facility [@@sparql@@] is helpful in query
distribution, since it returns another RDF graph.
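As a rough illustration (a minimal sketch assuming the Python rdflib library; the src: and tgt: vocabularies and the sample data are made up for this example), a CONSTRUCT query can rewrite data expressed in a source ontology into the ontology the consumer expects:

    # Sketch only: rdflib is assumed; the src:/tgt: vocabularies and the
    # example data are hypothetical, not taken from any real data source.
    from rdflib import Graph

    source_data = """
    @prefix src: <http://ds1.example/ns#> .
    <http://ds1.example/item/1> src:familyName "Smith" .
    """

    g = Graph()
    g.parse(data=source_data, format="turtle")

    # CONSTRUCT yields a new RDF graph, here expressed in the consumer's ontology.
    construct_query = """
    PREFIX src: <http://ds1.example/ns#>
    PREFIX tgt: <http://consumer.example/ns#>
    CONSTRUCT { ?s tgt:surname ?name . }
    WHERE     { ?s src:familyName ?name . }
    """

    transformed = Graph()
    for triple in g.query(construct_query):
        transformed.add(triple)

    print(transformed.serialize(format="turtle"))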
Data source adaptors
Each data source is connected to the aggregator by an adaptor, which
performs:
- syntactic transformation between the data source's native format
and RDF;
- ontological transformation between the native ontology and one or
more intermediate ontologies; and
- caching of the resulting RDF.
If data from the data source is cached, the adaptor should faithfully
implement HTTP caching [@@RFC2616@@].
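To give a rough idea of what faithful HTTP caching means for an adaptor, here is a minimal sketch using conditional requests (the Python requests library is assumed, and a real adaptor would also honor Cache-Control and Expires, not just ETag validation):

    # Sketch only: simplified caching policy; requests library assumed.
    import requests

    _cache = {}  # url -> (etag, body)

    def fetch_rdf(url):
        headers = {"Accept": "text/turtle"}
        cached = _cache.get(url)
        if cached:
            # Ask the data source to answer 304 Not Modified if nothing changed.
            headers["If-None-Match"] = cached[0]
        response = requests.get(url, headers=headers)
        if response.status_code == 304 and cached:
            return cached[1]                  # serve the cached copy unchanged
        etag = response.headers.get("ETag")
        if etag:
            _cache[url] = (etag, response.text)
        return response.text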
Query transformation rules
In general, query distribution involves transforming a given query into
n subqueries to be applied to n data sources. Such transformations can
be expressed as query transformation rules (or inference rules). Given a
consumer query and metadata about the available data sources, these
rules transform the query into a query distribution recipe consisting of:
- a set of subqueries for the n data sources that will be used to
fulfill the consumer's query;
- for each of the n subqueries, a data-source-specific set of
post-processing rules for transforming the results of that data source
into either an intermediate or final ontology;
- a global set of post-processing rules for combining the results
of the n subqueries.
A write query will involve transformations in the opposite direction.
Query distribution rules may be based on URI patterns, query patterns,
or other criteria. A sketch of one possible recipe structure is shown below.
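Here is a minimal sketch of what a query distribution recipe might look like as a data structure; the class and field names are illustrative assumptions, not a defined format:

    # Sketch only: class and field names are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SubqueryPlan:
        data_source_uri: str
        subquery: str                              # SPARQL sent to this source
        post_processing_rules: List[str] = field(default_factory=list)

    @dataclass
    class QueryDistributionRecipe:
        subqueries: List[SubqueryPlan]
        # Rules applied after the per-source results have been merged.
        global_post_processing_rules: List[str] = field(default_factory=list)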
Cost estimation and optimization
Rules may also be provided to perform cost estimation prior to
executing the query, and these cost estimates may be used in selecting
a preferred query distribution. Actual costs may also be
aggregated from the measured actual costs of the individual subqueries,
and this cost information may be recorded for use in optimizing future
queries. Costs may be expressed in terms of computational expense
(e.g., CPU time), real time, currency or in other ways.
Queries that are more costly than expected can be prevented by use of a
cost approval threshold.
If the estimated cost of the query exceeds the cost approval threshold,
then the query may be rejected or the user may be prompted for a
confirmation before the query is processed. ("Your query is
estimated to require 2 hours 23 minutes to complete. Are you sure
you want to proceed?") Multiple thresholds could also be
used, associated with multiple levels of approval.
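A minimal sketch of applying such thresholds follows; the threshold values and the returned labels are illustrative assumptions:

    # Sketch only: thresholds and labels are illustrative.
    def cost_decision(estimated_seconds, auto_limit=60, confirm_limit=3600):
        if estimated_seconds <= auto_limit:
            return "run"        # cheap enough to run without asking
        if estimated_seconds <= confirm_limit:
            return "ask-user"   # e.g. "Estimated 2 hours 23 minutes. Proceed?"
        return "reject"         # exceeds even the confirmable threshold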
Consumer-side query distribution
As an optimization, in some cases it may be more efficient for the
subqueries to be
performed directly by the consumer, rather than being performed by the
aggregator. For example, if there is a high transmission cost
involved in transferring data from the data sources to the aggregator
and thence to the consumer, then it may be more efficient to transfer
the data directly from the data sources to the consumer. In such
cases, instead of the aggregator performing the transformation and
aggregation functions described above, the query distribution recipe
can be returned to the consumer, and the consumer can simply execute
this recipe to obtain the desired results. This may proceed as
follows:
- Consumer query is sent to the aggregator.
- HTTP content negotiation is used to decide whether the aggregator
should perform the query distribution itself or return the query
distribution recipe (see the sketch after this list).
- Aggregator applies a query distribution strategy to create or
select a query distribution recipe.
- If consumer-side query distribution is requested, the query
distribution recipe is returned. Otherwise, the aggregator executes the
recipe and returns the query results.
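The content negotiation step might look roughly like the following sketch; the aggregator URL and the recipe media type are illustrative assumptions (only application/sparql-results+json is a registered media type here), and the Python requests library is assumed:

    # Sketch only: aggregator URL and recipe media type are made up.
    import requests

    AGGREGATOR = "http://aggregator.example/sparql"

    def ask_aggregator(query, want_recipe=False):
        accept = ("application/x-query-distribution-recipe+json"
                  if want_recipe
                  else "application/sparql-results+json")
        return requests.get(AGGREGATOR,
                            params={"query": query},
                            headers={"Accept": accept})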
URI translation
Data sources can be identified by URIs, though one should bear in mind
that the URI that names a data source is not necessarily the URI at
which its data should be accessed: naming and locating are two different
things, though in many cases the same URI will be used both for naming
and for access. In the event that a different URI should be used for
access, a URI translating proxy [ http://dbooth.org/2008/irsw/#proxy ]
can conveniently provide transparent access, so that consumers can
always use the data source URIs as though they are access URIs. For
example, a data source may be identified by the URI http://ds1.example
but its data might actually be accessed from http://foo1.example.
The aggregator should also provide URI translation information (Apache
URL rewrite rules, for example), so that this information need not be
maintained by consumers. If consumer-side query distribution is used,
these rewrite rules could be included with the query distribution recipe.
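On the consumer side, the translation information could be as simple as a lookup table. This minimal sketch reuses the example URIs above and falls back to the naming URI when no access URI is known:

    # Sketch only: a real deployment might ship these mappings as Apache
    # rewrite rules instead of a lookup table.
    ACCESS_URI = {
        "http://ds1.example": "http://foo1.example",
    }

    def access_uri_for(data_source_uri):
        # Fall back to the naming URI when no translation is defined.
        return ACCESS_URI.get(data_source_uri, data_source_uri)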
Abstractly, a data source is similar to a named graph, except:
- the content of a data source may change, i.e., it does not
necessarily hold a fixed set of RDF triples; and
- a data source is accessible via an HTTP SPARQL query (see the sketch below).
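For example, querying a data source over HTTP might look like this sketch, assuming the Python SPARQLWrapper library and an illustrative endpoint URI:

    # Sketch only: the endpoint URI is made up; SPARQLWrapper is assumed.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://foo1.example/sparql")
    endpoint.setQuery("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
    endpoint.setReturnFormat(JSON)
    results = endpoint.query().convert()
    for row in results["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])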
Optimization / distribution strategies
A query optimization or distribution strategy can be viewed in various
ways:
- as a set of query transformation rules, as described above;
- as an algorithm that, given a consumer query and data source
metadata, either:
- returns a query distribution recipe; or
- returns the consumer query results.
The aggregator should support a pluggable strategy interface, so that
different optimization/distribution strategies can be plugged in.
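A minimal sketch of such a pluggable interface follows; the class and method names are illustrative assumptions, and the "naive broadcast" strategy is only meant to show the shape of the interface:

    # Sketch only: names are illustrative assumptions, not a defined API.
    from abc import ABC, abstractmethod
    from typing import Dict, List

    class DistributionStrategy(ABC):
        @abstractmethod
        def plan(self, consumer_query: str, sources: List[Dict]) -> List[Dict]:
            """Return a recipe: one {'source': ..., 'subquery': ...} per source."""

    class NaiveBroadcastStrategy(DistributionStrategy):
        def plan(self, consumer_query, sources):
            # Simplest possible strategy: send the whole query to every source.
            return [{"source": s["uri"], "subquery": consumer_query}
                    for s in sources]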
Data source metadata
Data source metadata may include, for example:
- access control requirements (expressed in a named graph, for example);
- query cost estimation information;
- provenance and data freshness information; and
- information about what data is contained in that data source.
Provenance and named graphs
Named graphs [@@add ref for RDF named graphs@@] are useful for tracking
provenance. Each named graph can have associated provenance
information. When new RDF assertions are generated from inference rules,
the entailments can be bundled into a named graph, and the provenance
information can indicate the dependencies that were used to create it,
such as other named graphs or data sources and the particular set of
inference rules that produced the entailments. If the inference rules are themselves
expressed in RDF, they can also be bundled into a named graph and
identified by URI. This is one reason why an RDF representation
of RIF[@@] would be useful.
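The following sketch, assuming the Python rdflib library, shows the general idea of bundling entailments into a named graph and recording their provenance; the graph URIs and provenance predicates are illustrative assumptions, not a standard vocabulary:

    # Sketch only: URIs and predicates are made up for illustration.
    from rdflib import Dataset, URIRef, Literal

    ds = Dataset()
    entailments = URIRef("http://aggregator.example/graph/entailments-42")

    # Bundle the inferred triples into their own named graph.
    g = ds.graph(entailments)
    g.add((URIRef("http://ds1.example/item/1"),
           URIRef("http://consumer.example/ns#surname"),
           Literal("Smith")))

    # Record what the entailments were derived from, in a provenance graph.
    prov = ds.graph(URIRef("http://aggregator.example/graph/provenance"))
    prov.add((entailments,
              URIRef("http://aggregator.example/ns#derivedFrom"),
              URIRef("http://ds1.example")))
    prov.add((entailments,
              URIRef("http://aggregator.example/ns#usedRuleSet"),
              URIRef("http://aggregator.example/rules/map-v1")))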
Caching and cache re-generation
As described in slides 27-35 of "RDF as a Lingua Franca: Key
Architectural Strategies" [
http://dbooth.org/2009/stc/dbooth-SemTechConf-2009.ppt ] it is helpful
to be able to cache particular named graphs of data. This may be
data that came directly from one of the data sources, or it may be data
that has been derived or "distilled" from other data, using a set of
rules and ontologies that are plugged in to the aggregator. These
named graphs may also be used (with rules and ontologies) to create
even more named graphs, as illustrated on slide 33.
Since these named graphs are generated from rules and ontologies that
can be recorded and applied again, it is useful to think of them as
being cached data that depend on the datasets, rules and ontologies
from which they are derived. Along with the rules and ontologies
that are used to generate a particular named graph, additional metadata
can be recorded (associated with that named graph) to indicate such
things as characteristics of the cache (e.g., when it was last updated)
and policies for re-generating it when it is stale (e.g., on demand,
eagerly whenever any of its dependencies changes, or periodically based
on a schedule). The dependency graph of these named graphs (or
"caches") is analogous to a dependency graph used by Make [
http://www.gnu.org/software/make/ ] or Ant [ http://ant.apache.org/
]. The aggregator should automatically (re-)generate these named
graphs as needed, based on this metadata.
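A minimal sketch of this Make-style behaviour follows; the class and field names are illustrative assumptions, timestamps stand in for whatever freshness metadata the aggregator records, and real metadata would also carry the regeneration policies described below:

    # Sketch only: names and the freshness test are illustrative.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class CachedGraph:
        uri: str
        last_updated: float = 0.0                      # POSIX timestamp
        dependencies: List["CachedGraph"] = field(default_factory=list)
        regenerate: Callable[[], None] = lambda: None  # runs the rules

    def refresh(cache):
        # Depth-first, Make-style: freshen every dependency first.
        for dep in cache.dependencies:
            refresh(dep)
        if any(dep.last_updated > cache.last_updated
               for dep in cache.dependencies):
            cache.regenerate()
            cache.last_updated = max(dep.last_updated
                                     for dep in cache.dependencies)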
In some cases the aggregator may actually generate new named graphs
that did not previously exist. For example, a rule may indicate how to
generate a monthly report having a new URI that includes the year and month.
One advantage of this approach is that the working set [
http://en.wikipedia.org/wiki/Working_set ] of data that must be
considered by the inference engine at any point is limited to the
dependencies of the named graph that is to be generated, thus improving
efficiency and scalability.
Key information associated with each cache includes:
- rules used to generate or refresh that cache;
- policies that indicate when the cache should be regenerated, such as
eager (whenever a dependency changes), lazy (when data from the cache is
requested) or periodic (such as daily, hourly or weekly); and
- dependencies: the named graphs or data sources on which this cache
depends. This (either explicitly or implicitly) should include the rules
that generate the target cache, since a change in the rules would render
the target cache invalid.
Of course, other information may also be associated with a cache (a
combined sketch follows this list), such as:
- timestamp of when it was last refreshed;
- description of its purpose;
- who last refreshed it;
- who owns it;
- permissions required for access.
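Pulling these items together, the metadata for a cache might be modelled roughly as follows; the names and the choice of fields are illustrative assumptions:

    # Sketch only: field names and the policy vocabulary are illustrative.
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class RefreshPolicy(Enum):
        EAGER = "eager"        # whenever a dependency changes
        LAZY = "lazy"          # when data from the cache is next requested
        PERIODIC = "periodic"  # on a schedule, e.g. hourly or daily

    @dataclass
    class CacheMetadata:
        graph_uri: str
        rule_set_uri: str                  # rules that (re)generate the cache
        policy: RefreshPolicy
        dependencies: List[str]            # URIs of graphs or data sources
        last_refreshed: Optional[str] = None
        owner: Optional[str] = None
        required_permissions: List[str] = field(default_factory=list)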
Mapping rules (mapcar)
Usually a cache is generated by applying the regeneration rules to the
combined data contained in its dependencies. This is the common,
general case, and in this case the regeneration policies would normally
be applied at the granularity of each dependency as a whole. For
example, if A is dependent on B and C, and A has an eager regeneration
policy, then all of A would be regenerated each time either B or C
changes.
However, if A and B consist of corresponding members a1..an and b1..bn,
and each ai depends only on bi and uses the same regeneration rules R,
then the regeneration of A as a whole can be performed more efficiently
by regenerating only each ai
for which bi has changed. Thus, given B and R, A is regenerated by
mapping R across all members of B and collecting the results as A, just
like the classic "mapcar" function from Lisp or the "map" function in
Perl. This is illustrated in figure @@.

This mapcar function can also be generalized to cases where A depends
on multiple caches, B[1]..B[m], such that A[i] = R(union from j=1..m of
B[j][i]).
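A minimal sketch of this mapcar-style regeneration follows; R, A, B and the set of changed indices are placeholders, and in the generalized case each B[j][i] is assumed to be a set so that their union is well defined:

    # Sketch only: R, A, B and `changed` are placeholders.
    def regenerate_mapcar(R, A, B, changed):
        """A[i] depends only on B[i]; recompute only the members that changed."""
        return [R(B[i]) if i in changed else A[i] for i in range(len(B))]

    # Generalization to multiple dependency caches B[0]..B[m-1], where
    # A[i] = R(union over j of B[j][i]); each B[j][i] is assumed to be a set.
    def regenerate_member_multi(R, caches, i):
        members = set()
        for B in caches:
            members |= B[i]
        return R(members)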
Bidirectional cache access
Thus far the discussion has focused on read-only access to a
cache. But life gets much more interesting when cache users can
also write to the cache, thus causing data to be propagated back into
that cache's dependencies. Of course, the ability to do this
depends on the rules that were used to generate that cache. If
information is lost when the rules are applied, then in general the data
cannot be propagated back to the dependencies. For example, if the user
tries to change a value 5 to 6 in A, but that value was computed by
adding 2 and 3 from dependencies B and C, then the change cannot be
propagated back without more information. However, some
transformations are information preserving and can be
bidirectional. @@Add reference to Ben Grossoff paper@@
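As a trivial sketch of an information-preserving transformation that can run in both directions (the unit conversion is purely illustrative), a write to the cached value can be pushed back to the dependency by running the rule in reverse:

    # Sketch only: a toy invertible rule between a dependency and its cache.
    def to_cache(metres):
        return metres * 100.0        # dependency value -> cached value (cm)

    def to_dependency(centimetres):
        return centimetres / 100.0   # cached value -> dependency value (m)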
@@To Do@@
To search:
- federated SPARQL query
- DARQ
- RDFCube query processor/Sesame
References
@@TBD@@
http://www-bd.lip6.fr/roses/doku.php?id=but_du_projet
Change history
18-May-2010: Minor editorial improvements.
20-Oct-2009: Added mapcar diagram and explanation. Highlighted key things associated with a cache: rules, policies, dependencies.
16-Jul-2009: Changed the title from "Thoughts on RDF Aggregation and SPARQL Query Distribution" to "Semantic Data Federation" and added more on caching.
10-Apr-2009: Initial draft published.