URIs and the Myth of Resource Identity

David Booth
HP Software
david@dbooth.org
This document: http://dbooth.org/2006/identity/

The views expressed herein are those of the author, and not necessarily those of HP. These views are evolving, and comments are invited.

Abstract

In one sense, the question is not "What does this URI mean?". The question is "What can you do with it?". The problem of resource identity is the problem of locating appropriate descriptive information about the resource -- information that enables you to make use of that URI in a particular application. On the other hand, the value of a URI is increased when descriptive information uniquely identifies the URI's resource, even if the resource is not completely described, because it encourages the network effect.

Introduction
The Problem of Resource Identity
Descriptive Information
Descriptive Information Is Inherently Partial
Object Oriented Programming Analogy
Identifying Versus Describing: Distinguishing One Resource from All Others
Naming Versus Identifying
Resource Identity and the Network Effect

Algorithm for Locating Authoritative Descriptive Information

Should a URI Name Both a Person and an Information Resource?
Information Resources Versus Non-Information Resources

Clarifying the definition of information resource
Proposed definition of information resource
Should an information resource be inseparable from its URI?

Conclusion
Acknowledgements
References

1. Introduction

The question is not "What does this URI mean?". It has no intrinsic meaning. The question is "What can you do with it?".

Well, what do you want to do with it?

"I want to retrieve Web documents from it, and comment on their content."
http://documents.example.com/foo.html (HTTP GET, 200 response)
"I want to use it to find Molly's friend Jack's phone number."
http://foaf.example.com/mollys-contacts.rdf (HTTP GET, 200 response; FOAF ontology)
"I want to use it to retrieve my mobile phone status, model, configuration and settings, from a browser."
http://acme-mobile.example.com/phones/id9988765/ (HTTP GET, 200 response)
"I want to use it in a mobile phone service provider application, to unambiguously identify a particular phone, so that I know who to bill for the calls that are made from it."
http://acme-mobile.example.com/phones/id9988765/
"I want to use it to unambiguously identify the person who was videotaped robbing the Willow Street Bank at 10:08AM on March 15, 2005. We do not yet know the person's name, but we know other things about the person."
http://crimes.example.org/perpetrators/2005/willow-street-bank-robbery/
"I want to use it as a Web address for retrieving a description of the person who was videotaped robbing the Willow Street Bank at 10:08AM on March 15, 2005."
http://crimes.example.org/perpetrators/2005/willow-street-bank-robbery/

2. The Problem of Resource Identity

The association between a URI and its resource is contrived and imaginary. It exists only because we follow certain conventions and we believe it exists. We do so because it is a useful fiction: it allows us to symbolically manipulate that resource by proxy. We use the URI to manipulate a model of that resource instead of manipulating the actual resource.

Some URIs are merely Web addresses, and are used only to identify "information resources". Other URIs are used as names for things such as concepts in an ontology, or mobile phones, or people. In general, a URI is used as a globally unique name for a "resource", but what resource? The problem of resource identity is the problem of understanding what a particular URI "means". Specifically, it is the problem of understanding what resource that URI names.

3. Descriptive Information

Suppose I own a URI that is used as a globally unique name for a concept in an ontology.[1] How can I tell someone else what resource that URI identifies? I can provide descriptive information about it. In RDF I can provide a set of assertions involving that URI; or in HTML I might provide an English description of the resource.

For example, the URI http://t-d-b.org?http://dbooth.org/2005/dbooth/ identifies me, David Booth, the author of this paper. You can verify that by dereferencing it to obtain authoritative descriptive information that I have provided (after following the 303-redirect). This descriptive information is expressed as a set of assertions about the URI. It tells you that I work for HP Software, I have email address dbooth@hp.com (as of 1-January-2005), etc. In this case the assertions are expressed in English in an HTML document, though I could also have expressed them in RDF or some other language. In conjunction with your knowledge of English and some shared background knowledge, this may be enough information to enable you to use my URI in some applications as an unambiguous identifier for me. In essence, these assertions that I have provided give you a model that I have endorsed, to enable you to make use of my URI in applications, whether this model is expressed formally in RDF or informally in English. Your application manipulates this model as a proxy for me.
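As a rough illustration, a set of assertions like the one just described could be held as subject-predicate-object triples. This is only a sketch: the predicate names ("worksFor", "hasEmailAsOf:2005-01-01") and the describe helper are made up for this example and are not taken from any real ontology.

```python
# Hypothetical sketch: descriptive information as a set of assertions (triples).
ME = "http://t-d-b.org?http://dbooth.org/2005/dbooth/"

# Predicate names are invented for illustration only.
assertions = {
    (ME, "worksFor", "HP Software"),
    (ME, "hasEmailAsOf:2005-01-01", "dbooth@hp.com"),
}

def describe(uri, triples):
    """Return the (predicate, object) pairs asserted about uri, sorted by predicate."""
    return sorted((p, o) for s, p, o in triples if s == uri)
```

An application never touches the person; it only reads and manipulates pairs such as those returned by describe(ME, assertions).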

4. Descriptive Information Is Inherently Partial

If the URI is used to name a real life entity or other non-idealized entity, such as an actual person, then it would be impossible to describe that resource completely: it is not possible to describe everything about a person. (In contrast, an idealized entity would be something that can be completely characterized in a finite description, such as a mathematical formula.)

For example, it is not possible to completely describe David Booth, the person, because such a description would have to include all possible information about me, and that isn't feasible to collect or anticipate. If the descriptive information includes my name, email address, phone number and physical address, that might be enough for one application, but it won't be enough for other applications that need my blood type, my mass, my educational history, or my philosophical viewpoints.

Thus, descriptive information is inherently partial.[1]

Fortunately, a resource does not need to be completely described in order to productively use its URI in a model as a proxy for the resource. What matters is that you are able to find sufficient descriptive information about me for the purpose of your application. Thus, the problem of trying to completely describe or understand the resource is not necessarily relevant.

However, if I give you descriptive information that you cannot relate to anything else you already know, then you will not be able to make use of the URI, even if that descriptive information may be perfectly adequate for some other application. For example, if you receive a set of assertions expressed using the FOAF ontology[2], but you are unable to relate that ontology to the ontology you are using, the URI will be meaningless: you will not be able to make use of it as a proxy for its associated resource. Thus, in one sense the problem of resource identity is the problem of locating appropriate descriptive information. The identity of a URI is not always relevant except as a convenient shorthand for the set of ways that URI can be used.
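The point about relating ontologies can be sketched as a vocabulary mapping. The FOAF property URIs below are real FOAF terms, but the subject URI, the "app:" vocabulary and the usable_assertions helper are assumptions made for this example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

def usable_assertions(triples, mapping):
    """Translate triples into the application's own vocabulary.

    Assertions whose predicate cannot be related to that vocabulary are
    dropped: for this application they carry no usable meaning.
    """
    return [(s, mapping[p], o) for s, p, o in triples if p in mapping]

# Hypothetical FOAF description of Jack.
triples = [
    ("http://example.org/jack", FOAF + "name", "Jack"),
    ("http://example.org/jack", FOAF + "mbox", "mailto:jack@example.org"),
]

# Hypothetical application that can relate only foaf:name to its own terms.
mapping = {FOAF + "name": "app:displayName"}
```

With an empty mapping the function returns nothing at all, which is the case in which the URI is effectively meaningless to the application.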

Side note: Does the descriptive information describe the resource, or does it really just describe the use of the URI? The descriptive information is expressed in terms of the URI, since it has no way of directly referring to the resource. For example, it may provide a set of assertions involving that URI. You can think of those assertions as indirectly conveying information about the associated resource, but in some sense that is also a useful fiction: the assertions are really just telling you how you may use the URI in manipulating a model that acts as a proxy for the resource, i.e., they tell you what you can do with that URI.

5. Object Oriented Programming Analogy

In essence, when I (authoritatively) give you descriptive information about me, I am endorsing the use of my URI within a particular model (or application context). That model is a crude symbolic representation of certain aspects of the real me.

Another way to put it is that the authoritative descriptive information that I publish licenses the use of my URI in certain models. This is analogous to publishing some interfaces for an object in an object oriented system. You can never be sure that I won't (monotonically) publish an additional interface at some point in the future, just as you cannot be sure I won't publish more descriptive information about me.
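The interface analogy might be sketched with Python protocols. The interface names, the method names and the DavidBoothModel class are all hypothetical; the sketch only shows that a consumer can rely on whichever interfaces have been published so far without knowing what may be added later.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Contactable(Protocol):
    """Hypothetical interface licensed by publishing an email address."""
    def email(self) -> str: ...

@runtime_checkable
class Employee(Protocol):
    """Hypothetical interface licensed by publishing an employer."""
    def employer(self) -> str: ...

class DavidBoothModel:
    """A proxy model of the person, not the person himself."""
    def email(self) -> str:
        return "dbooth@hp.com"

    def employer(self) -> str:
        return "HP Software"
```

Publishing a further interface later would not invalidate code written against Contactable or Employee, which is the sense in which additions are monotonic.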

Incidentally, notions of "domain" and "range" pertain to models -- not to reality.

6. Identifying Versus Describing: Distinguishing One Resource from All Others

On the other hand, even if it is not possible to completely describe a resource, it may be possible to unambiguously identify that resource, in the sense of conveying what resource it is, as distinct from all other possible resources.
For example, if I provide descriptive information telling you that the URI http://t-d-b.org?http://dbooth.org/2005/dbooth/ identifies all-and-only the actual, living person with email address dbooth@hp.com as of 1-Jan-2005, that is sufficient to unambiguously identify me, distinct from all other possible resources. There is a finite number of actual, living people and only one with that email address on that date. Assuming we have a shared understanding of what an "actual, living person" means, this information allows you to understand exactly which resource I meant in the universe of all possible resources.

7. Naming Versus Identifying

Uniquely identifying a resource is different from merely naming that resource. The URI http://t-d-b.org?http://dbooth.org/2005/dbooth/ acts as a recognizable name for me (in a global namespace), but by itself it gives you no information about what resource it names. You need descriptive information to know what resource it identifies. Caution: I may slip sometimes and use the word "identify" when I meant "name".

8. Resource Identity and the Network Effect

The ability to uniquely identify a resource -- in the sense of conveying the distinction between this resource and all other resources -- is important because it enables others to publish additional descriptive information about the resource, beyond what the URI owner provides. The Semantic Web is all about the network effect created by the use of URIs as universal names. When a URI's resource is uniquely identified, it enables "anyone to say anything" about that resource[4].

If a URI's resource is not uniquely identified -- if others must rely solely on authoritative descriptive information about that resource -- then those who wish to make independent statements about it run the risk that they may have guessed wrong about what resource the URI owner was intending to name. This hampers others' ability to make statements about that resource, thus diminishing the value of that URI. This is analogous to connecting, to the telephone network, a telephone that nobody wants to call: it consumes resources without contributing anything to the network effect.

For example, if I only said that http://t-d-b.org?http://dbooth.org/2005/dbooth/ is a globally unique name for an individual who works at HP with the name "David Booth", then it could be referring to any of four "David Booth"s who currently work for HP. And if I didn't say that it refers to an actual, living person then it could have been referring to a fictional person who bears my name. And if I didn't say that it refers to "all and only" me, then in theory it might be referring to only a part of me, or the combination of me and something else. (Of course, in colloquial language people are not usually so legalistic as to say that they are referring to "all and only" a particular person. So in practice it is generally assumed that if I publish information saying that my URI names "the person who works at HP with email address dbooth@hp.com as of 1-Jan-2005", I am referring to an actual person and all and only that person. It would be misleading to mean otherwise.) In any case, if others are not sure which "David Booth" is intended then they are less able, and less likely, to publish other statements about me using that URI, thus inhibiting the network effect.
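The difference between an ambiguous description and an identifying one can be sketched as a predicate applied to a universe of candidates. The universe below is toy data invented for this example (the colleague entries and their example.com addresses are placeholders), and identify is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    label: str     # bookkeeping label for this sketch only
    kind: str      # e.g. "living person" or "fictional person"
    name: str
    employer: str
    email: str     # email address as of 1-Jan-2005

# Toy universe: four living "David Booth"s at HP plus a fictional namesake.
UNIVERSE = [
    Candidate("author", "living person", "David Booth", "HP", "dbooth@hp.com"),
    Candidate("colleague-a", "living person", "David Booth", "HP", "a@example.com"),
    Candidate("colleague-b", "living person", "David Booth", "HP", "b@example.com"),
    Candidate("colleague-c", "living person", "David Booth", "HP", "c@example.com"),
    Candidate("character", "fictional person", "David Booth", "HP", "dbooth@hp.com"),
]

def identify(universe, predicate):
    """Return the one candidate matching predicate, or None if zero or several match."""
    matches = [c for c in universe if predicate(c)]
    return matches[0] if len(matches) == 1 else None

def by_name(c):
    return c.name == "David Booth" and c.employer == "HP"

def by_email(c):
    return c.kind == "living person" and c.email == "dbooth@hp.com"
```

Here by_name matches several candidates and so identifies nobody, while by_email singles out exactly one. Dropping the "living person" condition from by_email would let the fictional namesake match as well.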

8.1 Algorithm for Locating Authoritative Descriptive Information

URI users need a well-defined algorithm for locating authoritative descriptive information so that the identity of the URI’s resource can potentially be determined. I believe the algorithm implied by the WebArch is the following. Given an http URI u:

IF u contains a fragment identifier, then dereference the racine. (The racine is the URI resulting from stripping off the fragment identifier.)
ELSE
    Dereference u.
    IF the response code is 2xx, then conclude that the resource is an “information resource”.
    ELSE IF the response code is a 303 redirect to another URI, then dereference that.
    ELSE Out of luck.
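The steps above can be sketched in Python. This is one reading of the algorithm, not a definitive implementation: the function name, the verdict strings, and the fetch callable (assumed to return a status code and a Location value without following redirects itself) are all assumptions of this sketch.

```python
from urllib.parse import urldefrag

def locate_description(u, fetch):
    """Decide where to look for authoritative descriptive information about u.

    fetch(uri) -> (status_code, location_or_None) is supplied by the caller
    and is assumed not to follow redirects on its own.
    Returns a (verdict, uri) pair.
    """
    racine, fragment = urldefrag(u)
    if fragment:
        # u has a fragment identifier: dereference the racine instead.
        return ("described-by", racine)
    status, location = fetch(u)
    if 200 <= status < 300:
        # 2xx: conclude that u names an "information resource".
        return ("information-resource", u)
    if status == 303 and location:
        # 303 redirect: the description lives at the redirect target.
        return ("described-by", location)
    return ("out-of-luck", None)
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library and lets the branches be exercised with canned responses.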

This algorithm will be relevant in the next section.

9. Should a URI Name Both a Person and an Information Resource?

Addendum 29-May-2007: This example is wrong, because there is no such thing as "the union of me and my Web page", since union is a set operation and I and my Web page are individuals, not sets.

Since anything can be a resource, and http URIs can be used for any purpose [6], I could mint a URI, u1, that names the union of me and my Web page. Clearly this URI may not be usable by your particular application, which may need to discriminate more finely between me and my Web page. But it may meet my needs perfectly well, and it is, after all, my URI. Would this be anti-Web or anti-social? It depends.

First of all, the WebArch[7] says that a person is not an “information resource”, and the TAG’s httpRange-14 decision also says[6] that if an HTTP GET on URI u1 results in a 2xx response code, then the associated resource is an “information resource” and thus is not also supposed to be a person. Thus, I would be violating the WebArch by minting a URI u1 that names both me and my Web page.

Is this dichotomy between people and “information resources” really necessary in the Web Architecture[7]? Given that I can also mint additional URIs u2 and u3 to specifically name me and my Web page individually, why should the WebArch prohibit u1 from naming the union? What would break if u1 were permitted? One thing that breaks is the user’s ability to determine the identity of the resource associated with u1: if dereferencing u1 yields a 2xx response, then the user has no further algorithm for locating additional authoritative descriptive information about u1’s resource.

10. Information Resources Versus Non-Information Resources

If we accept the TAG’s[9] decision on the httpRange-14 issue[6], then an HTTP 2xx response must not be returned when a URI that names a non-“information resource” is dereferenced. To abide by this rule, we therefore need to know whether a given resource is or is not an “information resource”. What exactly is the difference between an “information resource” and a non-“information resource”? Unfortunately the WebArch[7] is not very clear on this point. This recently arose as a practical question in the W3C Semantic Web Best Practices and Deployment Working Group[8] as the group was trying to decide what URIs to use to name the individual words in WordNet[10], and how to serve metadata associated with those URIs. In particular: Is a WordNet word an “information resource” or not? If so, then the URI for it can directly return an HTTP 2xx response when the URI is dereferenced. If not, then the URI should either contain a fragment identifier or dereferencing it should result in an HTTP 303 redirect instead of a 2xx response.

Given the ambiguity of the TAG’s guidance on this point, my personal advice on minting URIs is to adopt a conservative strategy: if there is any doubt whatsoever about whether the resource you wish to name should be considered an “information resource”, then assume it is not an “information resource” and either use a fragment identifier in the URI or return an HTTP 303 redirect when the URI is dereferenced. By this strategy, unless the TAG outright reverses itself, you will be assured of remaining in conformance even if the TAG further clarifies the distinction between an “information resource” and a non-“information resource”.
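On the serving side, the 303 half of this conservative strategy might look like the following sketch. The function name and its three-valued argument are hypothetical; the alternative of minting a URI with a fragment identifier is a naming decision and is not shown.

```python
def choose_response(is_information_resource, description_uri):
    """Pick a response for a dereferenced URI under the conservative strategy.

    is_information_resource: True, False, or None when the owner is in doubt.
    description_uri: hypothetical location of the authoritative description.
    Returns a (status, location) pair.
    """
    if is_information_resource is True:
        # Only a clear-cut information resource is served directly.
        return (200, None)
    # A non-information resource, or any doubt whatsoever: redirect instead.
    return (303, description_uri)
```

Treating "in doubt" exactly like "no" is what keeps the URI owner in conformance if the distinction is later clarified.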

10.1 Clarifying the definition of information resource

If the TAG[9] further clarifies the difference between an “information resource” and a non-“information resource”, what should it say? How should “information resource” be defined more precisely? In my opinion, it should be:

Clear and unambiguous. URI owners should be able to clearly determine whether the resource that they wish to name is or is not an “information resource”. (However, this does not mean that others who use a URI will necessarily be able to determine what resource it names or whether it names an “information resource”.)
Based on objective criteria. Determination cannot be based on the whims of a URI owner who wishes to name that resource. (Otherwise the same resource could be viewed by one URI owner as an “information resource” and by another owner as a non-“information resource”.) Observable characteristics are helpful.

Furthermore, the class of “information resources” should be:

Disjoint from everything else. If a resource is known to be an “information resource”, one should be able to conclude that it is not also any other kind of resource. If an HTTP 2xx response is received upon dereferencing a URI, one should be able to conclude that the URI only names an “information resource”, rather than being left in limbo about what other kind of entity the resource might also be, with no clear algorithm for locating authoritative descriptive information about it.

10.2 Proposed definition of information resource

Given these requirements, I suggest adopting a definition that views an “information resource” as being only a network source/sink of representations and nothing more: conceptually, a function from time and requests to representations. ("Requests" can take HTTP request input, cookies, content negotiation, etc. into account.)
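Read as a type signature, the proposed definition might be sketched as follows. The Request and Representation types and the greeting_page example are assumptions made for illustration, and the sketch covers only the "source" side of the definition (producing representations), not the "sink" side.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Mapping

@dataclass(frozen=True)
class Request:
    method: str
    headers: Mapping[str, str]   # cookies, Accept headers, etc.

@dataclass(frozen=True)
class Representation:
    media_type: str
    body: bytes

# An "information resource", conceptually: (time, request) -> representation.
InformationResource = Callable[[datetime, Request], Representation]

def greeting_page(when: datetime, request: Request) -> Representation:
    """Hypothetical resource whose representation depends on time and on Accept."""
    accept = request.headers.get("Accept", "text/html")
    if "text/plain" in accept:
        return Representation("text/plain", f"Hello in {when.year}".encode())
    return Representation("text/html", f"<p>Hello in {when.year}</p>".encode())
```

Under this view nothing else about the resource is modeled: two calls that differ in time or in request input may legitimately yield different representations of the same resource.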

Addendum 4-June-2007: Nick Gall of Gartner Group pointed out that this definition is remarkably similar to Roy Fielding's definition of "resource" in his PhD dissertation[11], though Roy does not distinguish between "information resource" and the broader term "resource".

10.3 Should an information resource be inseparable from its URI?

Should a URI be inextricably linked to the “information resource” that it names? Or, the other way around, would it be reasonable to say that two different URIs, both yielding 2xx responses when dereferenced, might name the same “information resource”? I don’t know. I am concerned that this may mean that receiving an HTTP 2xx response code would not be sufficient to uniquely identify the “information resource” associated with a URI, and thus a user of that URI would have no well-defined algorithm for determining the identity of the resource. Thus, this may mean that an “information resource” should instead be defined as being "only a URI-named network source/sink of representations". I would be interested in hearing other people’s thoughts on this question.

Addendum 29-May-2007: Feedback at the WWW2006 workshop at which this paper was presented was uniformly negative toward the idea that an information resource must include its URI. However, when I read the proposed charter for POWDER, in which URI patterns would be used to make statements about groups of resources -- resources named by URIs matching those patterns -- this question again came to mind.

11. Conclusion

Although the identity of a URI's resource is a useful fiction, in one sense what really matters is what you can do with the URI. Dereferencing an http URI should (indirectly) yield authoritative information about its resource, but many resources inherently can never be fully described. Fortunately, most applications only need sufficient descriptive information for their task at hand. On the other hand, a URI is more useful to others (by the network effect) if its resource is unambiguously identified (i.e., distinguished from all other possible resources), even if that resource is not completely described, because this enables others to independently provide descriptive information about that resource.

12. Acknowledgements

This work was performed while the author was working for HP.

13. References

1. Pat Hayes, on Semantic Web Best Practices mailing list, 26 January 2006: http://lists.w3.org/Archives/Public/public-swbp-wg/2006Jan/0153

2. Friend Of A Friend (FOAF) ontology project: http://www.foaf-project.org/

3. "Architecture of the World Wide Web Volume 1", W3C Technical Architecture Group: http://www.w3.org/TR/webarch/#uri-benefits

4. "Resource Description Framework (RDF): Concepts and Abstract Data Model", W3C Working Draft 29 August 2002: http://www.w3.org/TR/2002/WD-rdf-concepts-20020829/

5. "Four Uses of a URL: Name, Concept, Web Location and Document Instance", David Booth: http://www.w3.org/2002/11/dbooth-names/dbooth-names_clean.htm

6. Roy Fielding, on W3C Technical Architecture Group (TAG) mailing list, 18 June 2005: http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html

7. Definition of "information resource": "Architecture of the World Wide Web Volume 1" (WebArch), W3C Technical Architecture Group, http://www.w3.org/TR/webarch/#def-information-resource

8. W3C Semantic Web Best Practices and Deployment Working Group: http://www.w3.org/2001/sw/BestPractices/

9. W3C Technical Architecture Group (TAG): http://www.w3.org/2001/tag/

10. WordNet: http://wordnet.princeton.edu/

11. Roy Fielding, "Architectural Styles and the Design of Network-based Software Architectures", PhD dissertation, University of California, Irvine, 2000: http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm