David Booth
HP Software
david@dbooth.org
This document:
http://dbooth.org/2006/identity/
The views expressed herein are
those of the author, and not necessarily those of HP. These
views are evolving, and comments are invited.
In one sense, the question is not "What does this URI mean?".
The question is "What can you do
with it?". The problem of
resource identity is the problem of locating appropriate
descriptive information about the resource -- information that
enables you to make use of that URI in a particular
application. On the other hand, the value of a URI is
increased when descriptive information uniquely identifies the
URI's resource, even if the resource is not completely described,
because it encourages the network effect.
Table of Contents
- Introduction
- The Problem of Resource Identity
- Descriptive Information
- Descriptive Information Is Inherently Partial
- OO Analogy
- Identifying Versus Describing: Distinguishing One Resource from All Others
- Naming Versus Identifying
- Resource Identity and the Network Effect
  - Algorithm for Locating Authoritative Descriptive Information
- Should a URI Identify Both a Person and an Information Resource?
- Information Resources Versus Non-Information Resources
  - Clarifying the definition of information resource
  - Proposed definition of information resource
  - Should an information resource be inseparable from its URI?
- Conclusion
- References
The question is not "What does this URI mean?". It has no
intrinsic meaning. The question is "What can you do with it?".
Well, what do you want to
do with it?
- "I want to retrieve Web documents from it, and comment on their
content."
http://documents.example.com/foo.html
(HTTP GET, 200 response)
- "I want to use it to find Molly's friend Jack's phone
number."
http://foaf.example.com/mollys-contacts.rdf
(HTTP GET, 200 response; FOAF ontology)
- "I want to use it to retrieve my mobile phone status, model,
configuration and settings, from a browser."
http://acme-mobile.example.com/phones/id9988765/
(HTTP GET, 200 response)
- "I want to use it in a mobile phone service provider
application, to unambiguously identify a particular phone, so that
I know who to bill for the calls that are made from it."
http://acme-mobile.example.com/phones/id9988765/
- "I want to use it to unambiguously identify the person who was
videotaped robbing the Willow Street Bank at 10:08AM on March 15,
2005. We do not yet know the person's name, but we know other
things about the person."
http://crimes.example.org/perpetrators/2005/willow-street-bank-robbery/
- "I want to use it as a Web address for retrieving a description
of the person who was videotaped robbing the Willow Street Bank at
10:08AM on March 15, 2005."
http://crimes.example.org/perpetrators/2005/willow-street-bank-robbery/
The association between a URI and its resource is contrived and
imaginary. It exists only because we follow certain
conventions and we believe it exists. We do so because it is
a useful fiction: it allows us to symbolically manipulate that
resource by proxy. We
use the URI to manipulate a model of that resource instead of
manipulating the actual resource.
Some URIs are merely Web addresses, and are used only to identify
"information
resources". Other URIs are used as names for things such as
concepts in an ontology, or mobile phones, or people.
In general, a URI is used as a globally unique name for a "resource", but
what resource? The
problem of resource identity is the problem of understanding what a
particular URI "means". Specifically, it is the problem of
understanding what resource
that URI names.
Suppose I own
a URI that
is used as a globally unique name for a concept in an
ontology.[1] How can I tell someone else what resource that
URI identifies? I can provide descriptive
information about it.
In RDF I can provide a set of assertions involving that URI; or in
HTML I might provide an English description of the
resource.
For example, the URI http://t-d-b.org?http://dbooth.org/2005/dbooth/
identifies me, David Booth, the author of this paper. You can
verify that by dereferencing it to obtain authoritative descriptive information
that I have provided (after following the
303-redirect). This descriptive information is
expressed as a set of
assertions about the URI. It tells you that I
work for HP Software, I have email address dbooth@hp.com (as of
1-January-2005), etc. In this case the assertions are
expressed in English in an HTML document, though I could also have
expressed them in RDF or some other language. In conjunction
with your knowledge of English and some shared background
knowledge, this may be enough information to enable you to use my
URI in some applications as an unambiguous identifier for me.
In essence, these assertions that I have provided give you a
model that I have endorsed,
to enable you to make use of my URI in applications, whether this
model is expressed formally in RDF or informally in English.
Your application manipulates this model as a proxy for
me.
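One way to picture such a model is as a small set of subject-predicate-object assertions about the URI. The following Python sketch is illustrative only: the `Assertion` type, the `describe` helper and the predicate names (`worksFor`, `emailAsOf2005-01-01`) are assumptions made for this example, not RDF vocabulary and not anything actually published at the URI.

```python
from typing import Dict, Iterable, NamedTuple


class Assertion(NamedTuple):
    """One statement about a URI (hypothetical structure for illustration)."""
    subject: str
    predicate: str
    obj: str


DBOOTH = "http://t-d-b.org?http://dbooth.org/2005/dbooth/"

# The model the URI owner endorses. Predicate names are invented here.
MODEL = {
    Assertion(DBOOTH, "worksFor", "HP Software"),
    Assertion(DBOOTH, "emailAsOf2005-01-01", "dbooth@hp.com"),
}


def describe(uri: str, assertions: Iterable[Assertion]) -> Dict[str, str]:
    """Collect what the model says about `uri`, keyed by predicate."""
    return {a.predicate: a.obj for a in assertions if a.subject == uri}


info = describe(DBOOTH, MODEL)
print(info["worksFor"])
print(info.get("bloodType"))  # the model is silent on this point
```

An application works only with what `describe` returns; anything the model does not mention, such as a blood type, is simply unavailable to it.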
If the URI is used to name a real-life entity or other
non-idealized entity, such
as an actual person, then it would be impossible to describe that
resource completely: it is not possible to describe everything
about a person. (In contrast, an idealized entity would be
something that can be completely characterized in a finite
description, such as a mathematical formula.)
For example, it is not possible to completely describe David Booth,
the person, because such a description would have to include all
possible information about me, and that isn't feasible to collect
or anticipate. If the descriptive information
includes my name, email address, phone number and physical address,
that might be enough for one application, but it won't be enough
for other applications that need my blood type, my mass, my
educational history, or my philosophical viewpoints.
Thus, descriptive information is
inherently partial.[1]
Fortunately, a resource does not need to be completely described in
order to productively use its URI in a model as a proxy for the
resource. What matters is that you are able to find sufficient
descriptive information about the resource for the purpose of your application.
Thus, the problem of trying to completely describe or understand
the resource is not necessarily relevant.
However, if I give you descriptive information that you cannot
relate to anything else you already know, then you will not be able
to make use of the URI, even if that descriptive information may be
perfectly adequate for some other application. For example,
if you receive a set of assertions expressed using the FOAF
ontology[2], but you are unable to relate that ontology to the
ontology you are using, the URI will be meaningless: you will not
be able to make use of it as a proxy for its associated
resource. Thus, in one sense the
problem of resource identity is the
problem of locating appropriate descriptive
information. The identity of a URI is not always
relevant except as a convenient shorthand for the set of ways that
URI can be used.
Side note: Does the descriptive information describe the resource,
or does it really just describe the use of the URI? The
descriptive information is expressed in terms of the URI, since it
has no way of directly referring to the resource. For
example, it may provide a set of assertions involving that
URI. You can think of those assertions as indirectly
conveying information about the associated resource, but in some
sense that is also a useful fiction: the assertions are really just
telling you how you may use the URI in manipulating a model that
acts as a proxy for the resource, i.e., they tell you what you can
do with that URI.
In essence, when I (authoritatively) give you descriptive
information about me, I am endorsing the use of my URI within a
particular model (or application context). That model is a
crude symbolic representation of certain aspects of the real
me.
Another way to put it is that the authoritative descriptive
information that I publish licenses the use of my URI in certain
models. This is analogous to publishing some interfaces for
an object in an object oriented system. You can never be sure
that I won't (monotonically) publish an additional interface at
some point in the future, just as you cannot be sure I won't
publish more descriptive information about me.
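The analogy can be sketched in Python using structural interfaces. Everything named here (`Contactable`, `Employed`, and the two description classes) is hypothetical and exists only to illustrate the point that a later publication adds an interface without withdrawing an earlier one.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Contactable(Protocol):
    """Interface licensed by the first published description."""
    def email(self) -> str: ...


@runtime_checkable
class Employed(Protocol):
    """Interface licensed only by a later publication."""
    def employer(self) -> str: ...


class DescriptionV1:
    """What the URI owner has published so far."""
    def email(self) -> str:
        return "dbooth@hp.com"


class DescriptionV2(DescriptionV1):
    """A later, monotonic extension: nothing from V1 is taken away."""
    def employer(self) -> str:
        return "HP Software"


v1, v2 = DescriptionV1(), DescriptionV2()
print(isinstance(v1, Contactable), isinstance(v1, Employed))
print(isinstance(v2, Contactable), isinstance(v2, Employed))
```

Code written against `Contactable` keeps working when the second description appears, which is the sense in which additional publication is monotonic.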
Incidentally,
notions of "domain" and "range"
pertain to models -- not to reality.
On the other hand, even if it is not possible to completely
describe a resource, it
may be possible to unambiguously identify
that resource, in the sense
of conveying what resource it is, as distinct from all other
possible resources.
For example, if I provide descriptive information telling you that
the URI http://t-d-b.org?http://dbooth.org/2005/dbooth/
identifies all-and-only the actual, living person with email
address dbooth@hp.com as of 1-Jan-2005, that is sufficient to
unambiguously identify me, distinct from all other possible
resources. There is a finite number of actual, living people
and only one with that email address on that date. Assuming
we have a shared understanding of what an "actual, living person"
means, this information allows you to understand exactly which
resource I meant in the universe of all possible
resources.
7. Naming Versus Identifying
Uniquely identifying a resource is
different from merely naming that resource. The URI
http://t-d-b.org?http://dbooth.org/2005/dbooth/
acts as a recognizable name for me (in a global namespace), but by
itself it gives you no information about what resource it
names. You need descriptive information to know what resource
it identifies. Caution: I may slip sometimes and use the word
"identify" when I mean "name".
The ability to uniquely identify a resource -- in the sense of
conveying the distinction between this resource and all other
resources -- is important because it enables others to publish
additional descriptive information about the resource, beyond what
the URI owner provides. The
Semantic Web is all about the network
effect created by the use of
URIs as universal names. When a URI's resource
is uniquely identified, it enables "anyone to say anything" about
that resource[4].
If a URI's resource is not uniquely identified -- if others must
rely solely on authoritative descriptive information about that
resource -- then those who wish to make independent statements about it
run the
risk that they may have guessed wrong about what resource the URI
owner was intending to name. This hampers others' ability
to make statements about that resource, thus diminishing the value
of that URI. This is analogous to connecting, to the
telephone network, a telephone that nobody wants to call: it
consumes resources without contributing anything to the network
effect.
For example, if I only said that http://t-d-b.org?http://dbooth.org/2005/dbooth/
is a globally unique name for an individual who works at HP with
the name "David Booth", then it could be referring to any of four
"David Booth"s who currently work for HP. And if I didn't say
that it refers to an actual, living person then it could have been
referring to a fictional person who bears my name. And if I
didn't say that it refers to "all and only" me, then in theory it
might be referring to only a part of me, or the combination of me and
something else. (Of course, in colloquial language
people are not usually so legalistic as to say that they are
referring to "all and only" a particular person. So in
practice it is generally assumed that if I publish information
saying that my URI names "the person who works at HP with
email address dbooth@hp.com as of 1-Jan-2005", I am referring to an
actual person and all and only that person. It would be
misleading to mean otherwise.) In any case, if others are not
sure which "David Booth" is intended then they are less able and less
likely to publish other statements about me using that URI, thus
inhibiting the network effect.
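The difference between a description that merely narrows the field and one that uniquely identifies can be sketched as a filter over a universe of candidates. The records below are invented for illustration (only the address dbooth@hp.com comes from the text; the other addresses are placeholders), and `Candidate`, `candidates_matching` and `uniquely_identifies` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    employer: str
    email: str            # taken to be the address as of 1-Jan-2005
    living_person: bool


# Hypothetical universe: four real "David Booth"s at HP plus a fictional one.
UNIVERSE = [
    Candidate("David Booth", "HP", "dbooth@hp.com", True),
    Candidate("David Booth", "HP", "booth-2@example.com", True),
    Candidate("David Booth", "HP", "booth-3@example.com", True),
    Candidate("David Booth", "HP", "booth-4@example.com", True),
    Candidate("David Booth", "HP", "booth-5@example.com", False),
]


def candidates_matching(universe: List[Candidate], **constraints) -> List[Candidate]:
    """Return every candidate consistent with the given description."""
    return [c for c in universe
            if all(getattr(c, key) == value for key, value in constraints.items())]


def uniquely_identifies(universe: List[Candidate], **constraints) -> bool:
    """A description identifies a resource only if exactly one candidate fits."""
    return len(candidates_matching(universe, **constraints)) == 1


print(len(candidates_matching(UNIVERSE, name="David Booth", employer="HP")))
print(uniquely_identifies(UNIVERSE, email="dbooth@hp.com", living_person=True))
```

Name and employer alone leave several candidates, so a third party cannot safely attach new statements to the URI; the email address and date narrow the field to one.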
8.1 Algorithm for Locating Authoritative Descriptive Information
URI users need a well-defined algorithm for locating authoritative
descriptive information so that the identity of the URI’s resource can
potentially be determined. I believe the algorithm implied by the
WebArch is the following. Given an http URI u:
IF u contains a fragment identifier,
    then dereference the racine. (The racine is the URI resulting
    from stripping off the fragment identifier.)
ELSE
    Dereference u.
    IF the response code is 2xx,
        then conclude that the resource is an “information resource”.
    ELSE IF the response code is a 303 redirect to another URI,
        then dereference that.
    ELSE
        Out of luck.
This algorithm will be relevant in the next section.
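The steps above can be sketched in Python as follows. This is one reading of the algorithm, not a definitive implementation: the function name `locate_description`, the outcome labels, and the injected `fetch` callable (which is assumed to perform a single HTTP GET without following redirects and to return the status code and any Location header) are all assumptions made for this example.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urldefrag, urljoin

# Hypothetical fetcher signature: uri -> (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def locate_description(u: str, fetch: Fetch) -> Tuple[str, Optional[str]]:
    """Return (outcome, uri to dereference for authoritative information)."""
    racine, fragment = urldefrag(u)
    if fragment:
        # Fragment case: the description is sought at the racine.
        return ("dereference-racine", racine)
    status, location = fetch(u)
    if 200 <= status < 300:
        # 2xx: conclude that u names an "information resource".
        return ("information-resource", u)
    if status == 303 and location:
        # 303: the description is sought at the redirect target.
        return ("dereference-redirect", urljoin(u, location))
    return ("out-of-luck", None)


# A stand-in for the network, so the sketch runs without HTTP access.
RESPONSES = {
    "http://example.org/doc": (200, None),
    "http://example.org/person": (303, "/about/person"),
    "http://example.org/missing": (404, None),
}


def fake_fetch(uri: str) -> Tuple[int, Optional[str]]:
    return RESPONSES[uri]


for uri in ("http://example.org/onto#Concept", "http://example.org/doc",
            "http://example.org/person", "http://example.org/missing"):
    print(uri, locate_description(uri, fake_fetch))
```

Injecting `fetch` keeps the decision logic separate from the HTTP client; a real client would need to be configured not to follow the 303 automatically, otherwise the distinction the algorithm relies on is lost.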
Addendum 29-May-2007: This example is wrong, because there is no such thing as
"the union of me and my Web page", since union is a set operation and I
and my Web page are individuals, not sets.
Since anything can be a resource, and http URIs can be used for any
purpose [6], I could mint a URI, u1,
that names the union of me and
my Web page.
Clearly this URI may not be usable by your particular application,
which may need to discriminate more finely between me and my Web
page. But it may meet my needs perfectly well, and it is,
after all, my URI.
Would this be anti-Web or anti-social? It
depends.
First of all, the WebArch[7] says that a person is not an “information
resource”, and the TAG’s httpRange-14 decision also says[6] that if an
HTTP GET on URI u1 results
in a 2xx response code, then the associated resource is an “information
resource” and thus is not also supposed to be a person. Thus, I
would be violating the WebArch by minting a URI u1 that names both me and my
Web page.
Is this dichotomy between people and “information resources” really
necessary in the Web Architecture[7]? Given that I can also mint
additional URIs u2 and u3 to specifically name me and my
Web page individually, why should the WebArch prohibit u1 from naming the union?
What would break if u1 were permitted? One thing that breaks is
the user’s ability to determine the identity of the resource associated
with u1: if dereferencing u1 yields a 2xx response, then the
user has no further algorithm for locating additional authoritative
descriptive information about u1’s
resource.
If we accept the TAG’s[9] decision on the httpRange-14 issue[6], then
an HTTP 2xx response must not be returned when a URI that names a
non-“information resource” is dereferenced. To abide by this
rule, we therefore need to know whether a given resource is or is not
an “information resource”. What exactly is the difference between
an “information resource” and a non-“information resource”?
Unfortunately the WebArch[7] is not very clear on this point.
This recently arose as a practical question in the W3C Semantic Web
Best Practices and Deployment Working Group[8] as the group was trying
to decide what URIs to use to name the individual words in
WordNet[10], and how to serve metadata associated with those
URIs. In particular: Is a WordNet word an “information resource”
or not? If so, then the URI for it can directly return an HTTP
2xx response when the URI is dereferenced. If not, then the URI
should either contain a fragment identifier or dereferencing it should
result in an HTTP 303 redirect instead of a 2xx response.
Given the ambiguity of the TAG’s guidance on this point, my personal
advice on minting URIs is to adopt a conservative strategy: if there is
any doubt whatsoever about whether the resource you wish to name should
be considered an “information resource”, then assume it is not an
“information resource” and either use a fragment identifier in the URI
or return an HTTP 303 redirect when the URI is dereferenced. By
this strategy, unless the TAG outright reverses itself, you will be
assured of remaining in conformance even if the TAG further clarifies
the distinction between an “information resource” and a
non-“information resource”.
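On the server side, this conservative strategy amounts to a very small decision rule. The sketch below is an assumption-laden illustration: `plan_response` is a hypothetical helper, `None` stands for "in doubt", and the alternative of minting a URI with a fragment identifier is not shown because it is decided when the URI is chosen rather than when it is dereferenced.

```python
from typing import Dict, Optional, Tuple


def plan_response(is_information_resource: Optional[bool],
                  description_uri: str) -> Tuple[int, Dict[str, str]]:
    """Pick a status code and headers for a dereferenced URI.

    is_information_resource: True, False, or None when there is any doubt.
    description_uri: where the descriptive information is served
                     (hypothetical parameter for this sketch).
    """
    if is_information_resource is True:
        # Only a resource known to be an "information resource" answers 2xx.
        return 200, {}
    # Known non-"information resource", or any doubt at all: redirect.
    return 303, {"Location": description_uri}


print(plan_response(True, "http://example.org/about/doc"))
print(plan_response(None, "http://example.org/about/word"))
```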
10.1 Clarifying the definition of information resource
If the TAG[9] further clarifies the difference between an “information
resource” and a non-“information resource”, what should it say?
How should “information resource” be defined more precisely? In
my opinion, it should be:
- Clear and unambiguous.
URI owners should be able to clearly determine whether the resource
that they wish to name is or is not an “information resource”.
(However, this does not mean that others who use a URI will necessarily
be able to determine what resource it names or whether it names an
“information resource”.)
- Based on objective criteria.
Determination cannot be based on the whims of a URI owner who wishes to
name that resource. (Otherwise the same resource could be viewed
by one URI owner as an “information resource” and by another owner as a
non-“information resource”.) Observable characteristics are
helpful.
Furthermore, the class of “information resources” should be:
- Disjoint from everything
else. If a resource is known to be an “information
resource”, one should be able to conclude that it is not also any other
kind of resource.
If an HTTP 2xx response is received upon dereferencing a URI, one
should be able to conclude that the URI only names an “information
resource”, rather than being left in limbo about what other kind of
entity the resource might also be, with no clear algorithm for locating
authoritative descriptive information about it.
10.2 Proposed definition of information resource
Given these requirements, I suggest adopting a definition that views an
“information resource” as being only a network source/sink of
representations and nothing more: conceptually, a function
from time and requests to representations. ("Requests" can take
HTTP request input, cookies, content
negotiation, etc. into account.)
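Read literally, this definition is a function type. The Python sketch below shows only the "source" side of it, under stated assumptions: `Request`, `Representation` and `homepage` are hypothetical, and a single `accept` field stands in for the full HTTP request input (headers, cookies, and so on).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass
class Request:
    """A drastically simplified stand-in for an HTTP request."""
    method: str = "GET"
    accept: str = "text/html"


@dataclass
class Representation:
    media_type: str
    body: str


# The proposed view: a function from time and requests to representations.
InformationResource = Callable[[datetime, Request], Representation]


def homepage(when: datetime, request: Request) -> Representation:
    """One hypothetical information resource; its output varies with both inputs."""
    text = f"Home page as of {when.year}"
    if request.accept == "text/plain":
        return Representation("text/plain", text)
    return Representation("text/html", f"<p>{text}</p>")


resource: InformationResource = homepage
print(resource(datetime(2006, 5, 23), Request(accept="text/plain")))
print(resource(datetime(2007, 6, 4), Request()))
```

Nothing in the type says what the resource "is" beyond this behaviour, which is the point of the "nothing more" clause in the definition.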
Addendum 4-June-2007: Nick Gall of Gartner Group pointed out that this
definition is remarkably similar to Roy Fielding's definition of
"resource" in his PhD dissertation[11], though Roy does not distinguish
between "information resource" and the broader term "resource".
10.3 Should an information resource be inseparable from its URI?
Should a URI be inextricably linked to the “information resource” that
it names? Or, the other way around, would it be reasonable
to say that two different URIs, both yielding 2xx responses when
dereferenced, might name the same “information resource”? I
don’t know. I am concerned that this may mean that receiving an
HTTP 2xx response code would not be sufficient to uniquely identify the
“information resource” associated with a URI, and thus a user of that
URI would have no well-defined algorithm for determining the identity
of the resource. Thus, this may mean that an “information
resource” should instead be defined as being "only a URI-named network
source/sink of representations". I would be interested in hearing
other people’s thoughts on this question.
Addendum 29-May-2007: Feedback at the WWW2006 workshop at which this paper was presented was
uniformly negative toward the idea that an information resource must
include its URI. However, when I read the proposed charter for POWDER, in which URI
patterns would be used to make statements about groups of resources --
resources named by URIs matching those patterns -- this question again
came to mind.
Although the identity of a URI's resource is a useful fiction, in
one sense what really matters is what you can do with the URI.
Dereferencing an http URI should (indirectly) yield authoritative
information about its resource, but many resources inherently can
never be fully described. Fortunately, most applications only
need sufficient
descriptive information for their task at hand. On the other
hand, a URI is more useful to others (by the network effect) if its
resource is unambiguously identified
(i.e., distinguished from
all other possible resources), even if that resource is not
completely described,
because this enables others to independently provide descriptive
information about that resource.
12. Acknowledgements
This work was performed while the author was working for HP.
1. Pat Hayes, on Semantic Web Best Practices mailing list, 26 January 2006: http://lists.w3.org/Archives/Public/public-swbp-wg/2006Jan/0153
2. Friend Of A Friend (FOAF) ontology project: http://www.foaf-project.org/
3. "Architecture of the World Wide Web Volume 1", W3C Technical Architecture Group, http://www.w3.org/TR/webarch/#uri-benefits
4. "Resource Description Framework (RDF): Concepts and Abstract Data Model", W3C Working Draft 29 August 2002: http://www.w3.org/TR/2002/WD-rdf-concepts-20020829/
5. "Four Uses of a URL: Name, Concept, Web Location and Document Instance", David Booth, http://www.w3.org/2002/11/dbooth-names/dbooth-names_clean.htm
6. Roy Fielding, on W3C Technical Architecture Group (TAG) mailing list, 18 June 2005: http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html
7. Definition of "information resource": "Architecture of the World Wide Web Volume 1" (WebArch), W3C Technical Architecture Group, http://www.w3.org/TR/webarch/#def-information-resource
8. W3C Semantic Web Best Practices and Deployment Working Group: http://www.w3.org/2001/sw/BestPractices/
9. W3C Technical Architecture Group (TAG): http://www.w3.org/2001/tag/
10. WordNet: http://wordnet.princeton.edu/
11. Roy Fielding, "Architectural Styles and the Design of Network-based Software Architectures", PhD dissertation, University of California, Irvine, 2000: http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
19-May-2009: Updated my email address.
4-Jun-2007: Added reference to Roy Fielding's dissertation.
29-May-2006: Added "sink" to proposed definition of information resource. Changed "identify" to "name" in several places where I had been sloppy. Added workshop feedback about an information resource including its URI.
23-May-2006: Presented at WWW2006 workshop