ORE User Guide - Resource Map Discovery

Abstract

Crawlers or harvesters must discover Resource Maps (ReMs) before the aggregations described by them can be understood. ReMs can be discovered in any number of ways and this document discusses some of the recommended discovery mechanisms. Other discovery mechanisms may evolve over time and vary based on the practices of particular communities. This user guide is one of several documents comprising the OAI-ORE specification and user guide.

1. Introduction

Resource Map (ReM) discovery is a precondition of use. There is no single, best method for discovering ReMs. This document covers a variety of suggested ReM discovery mechanisms, grouped into three categories: Batch Discovery, Resource Embedding, and Response Embedding. Examples are explored for each category. Additional categories and examples are expected to evolve over time.

1.1 Notational Conventions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Batch Discovery

Batch discovery exists so agents can discover ReMs en masse. Note that ReMs are not limited to describing aggregations on the server where the ReMs reside. Although ReMs can be serialized in a number of formats, the initial serialization is in the Atom Syndication Format [RFC4287]. Thus, each section below provides a table that maps the concepts of identification and datestamps in the transport protocol or format to those in the Resource Map Profile of Atom [ReMProfileofAtom].

2.1 ReMs in OAI-PMH

It is possible to define a new metadataPrefix in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [OAI-PMH] whose records contain ReMs. For example, this OAI-PMH request:

http://foo.edu/oai2?verb=GetRecord&identifier=oai:foo.edu:object1&metadataPrefix=oai_rem

Would yield this response:

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" identifier="oai:foo.edu:object1"
           metadataPrefix="oai_rem">http://foo.edu/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:foo.edu:object1</identifier>
        <datestamp>2007-01-06</datestamp>
      </header>
      <metadata>
        <!-- Insert ReM here -->
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>

Table 1: Atom ReMs Discovered via OAI-PMH
Identification	OAI-PMH `record/header/identifier` MUST NOT equal either ReM Atom `/feed/id` or `/feed/link[@rel="self"]/@href`
Datestamp	OAI-PMH `record/header/datestamp` MUST be equal to ReM Atom `/feed/updated`
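As an illustration of the checks in Table 1, the following Python sketch pulls the identifier, the datestamp, and the embedded ReM out of a GetRecord response and reports any violations. It is a minimal sketch rather than a reference implementation: the function names, the sample ReM, and its `urn:uuid:hypothetical-rem-id` value are our assumptions, as is the choice to compare datestamps at the granularity of the OAI-PMH datestamp.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ATOM = "{http://www.w3.org/2005/Atom}"

def extract_record(oai_xml):
    """Return (identifier, datestamp, rem_element) from a GetRecord response."""
    root = ET.fromstring(oai_xml)
    record = root.find(f"{OAI}GetRecord/{OAI}record")
    if record is None:
        raise ValueError("response contains no GetRecord/record element")
    identifier = record.findtext(f"{OAI}header/{OAI}identifier")
    datestamp = record.findtext(f"{OAI}header/{OAI}datestamp")
    metadata = record.find(f"{OAI}metadata")
    # The ReM is taken to be the first child element of <metadata>.
    rem = metadata[0] if metadata is not None and len(metadata) > 0 else None
    return identifier, datestamp, rem

def check_table_1(identifier, datestamp, rem):
    """Return a list of Table 1 violations (empty if none were found)."""
    problems = []
    feed_id = rem.findtext(f"{ATOM}id")
    self_links = [link.get("href") for link in rem.findall(f"{ATOM}link")
                  if link.get("rel") == "self"]
    updated = rem.findtext(f"{ATOM}updated") or ""
    if identifier == feed_id or identifier in self_links:
        problems.append("OAI-PMH identifier equals the ReM id or self link")
    # Assumption: compare at the granularity of the OAI-PMH datestamp,
    # so "2007-01-06" matches "2007-01-06T00:00:00Z".
    if not updated.startswith(datestamp):
        problems.append("OAI-PMH datestamp differs from the ReM updated value")
    return problems

# Hypothetical response: the embedded Atom feed stands in for a real ReM.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:foo.edu:object1</identifier>
        <datestamp>2007-01-06</datestamp>
      </header>
      <metadata>
        <feed xmlns="http://www.w3.org/2005/Atom">
          <id>urn:uuid:hypothetical-rem-id</id>
          <link rel="self" href="http://www.foo.edu/objects/object1.atom"/>
          <updated>2007-01-06T00:00:00Z</updated>
        </feed>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

identifier, datestamp, rem = extract_record(SAMPLE)
print(identifier, datestamp, check_table_1(identifier, datestamp, rem))
```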

2.2 ReMs in SiteMaps

It is possible to construct a SiteMap [SiteMap] that consists of just ReMs, or possibly includes ReMs in its list of regular resources. For example, dereferencing this SiteMap URI:

http://www.foo.edu/sitemap-rem.xml

Would yield this response:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.foo.edu/objects/object1.atom</loc>
      <lastmod>2007-01-06</lastmod>
   </url>
   <url>
      <loc>http://www.foo.edu/objects/object2.atom</loc>
      <lastmod>2007-08-11</lastmod>
      <changefreq>weekly</changefreq>
   </url>
   <url>
      <loc>http://www.foo.edu/objects/object3.atom</loc>
      <lastmod>2007-03-15T18:30:02Z</lastmod>
      <priority>0.3</priority>
   </url>
...
</urlset>

Note that SiteMaps have a URI path hierarchy limitation on the resources they can describe. For example, this SiteMap:

http://www.foo.edu/a/b/sitemap-rem.xml

Can list the ReMs:

http://www.foo.edu/a/b/bar2.atom

and

http://www.foo.edu/a/b/c/bar3.atom

But not:

http://www.foo.edu/bar1.atom
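The path hierarchy rule illustrated above can be sketched as a small Python check. The function name `sitemap_can_list` is hypothetical, and the requirement that the scheme and host also match reflects our reading of the Sitemaps protocol rather than anything stated in this guide.

```python
import posixpath
from urllib.parse import urlsplit

def sitemap_can_list(sitemap_uri, rem_uri):
    """Return True if rem_uri lies at or below the directory of sitemap_uri."""
    sitemap, rem = urlsplit(sitemap_uri), urlsplit(rem_uri)
    # Assumption: both URIs must share the same scheme and host.
    if (sitemap.scheme, sitemap.netloc) != (rem.scheme, rem.netloc):
        return False
    # Directory that contains the SiteMap file, with a trailing slash.
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    return rem.path.startswith(base_dir)

SITEMAP = "http://www.foo.edu/a/b/sitemap-rem.xml"
for candidate in ("http://www.foo.edu/a/b/bar2.atom",
                  "http://www.foo.edu/a/b/c/bar3.atom",
                  "http://www.foo.edu/bar1.atom"):
    print(candidate, sitemap_can_list(SITEMAP, candidate))
```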

Table 2: Atom ReMs Discovered via SiteMap
Identification	SiteMap `/urlset/url/loc` MUST equal `/feed/link[@rel="self"]/@href` for corresponding ReM, but MUST NOT equal `/feed/id`
Datestamp	When present, SiteMap `/urlset/url/lastmod` MUST be equal to ReM Atom `/feed/updated`

2.3 ReMs in Syndication Feeds

Even though the preliminary serialization of ReMs is in the Atom Syndication Format, nothing prevents the use of syndication formats such as Atom or RSS [RSS] for ReM discovery. However, care must be taken to keep the Resource Map conceptually separate from the syndication file that lists the Resource Maps. In particular, the id of an Atom entry listing the URI of a Resource Map MUST be neither the URI of the Resource Map nor the Atom feed id of the Resource Map. Furthermore, an explicit distinction must be made between the Atom feed used for discovery and the Atom feed that is the ReM. For example, this Atom Feed:

http://www.foo.edu/all-rems.atom

When dereferenced would yield:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>ReMs at www.foo.edu</title>
 <link href="http://www.foo.edu/" />
 <link href="http://www.foo.edu/all-rems.atom" rel="self"/>
 <updated>2007-08-15T18:30:02Z</updated>
 <author>
   <name>John Doe</name>
   <email>johndoe@foo.edu</email>
 </author>
 <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>

 <entry>
   <title>ReM For Object1</title>
   <link href="http://www.foo.org/objects/object1.atom"/>
   <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
   <updated>2007-01-06T00:00:00Z</updated>
 </entry>

 <entry>
   <title>ReM For Object2</title>
   <link href="http://www.foo.org/objects/object2.atom"/>
   <id>urn:uuid:9a2cc699-ccba-9e8b-132e-91da394e9a5c</id>
   <updated>2007-08-11T00:00:00Z</updated>
 </entry>

 <entry>
   <title>ReM For Object3</title>
   <link href="http://www.foo.org/objects/object3.atom"/>
   <id>urn:uuid:5225c895-cab8-8ebb-baaa-90da9d4efa6b</id>
   <updated>2007-03-15T18:30:02Z</updated>
 </entry>

</feed>

Table 3: Atom ReMs Discovered via Atom
Identification	Syndication Atom `/feed/entry/id` MUST NOT equal ReM Atom `/feed/id`; Syndication Atom `/feed/entry/link/@href` MUST equal ReM Atom `/feed/link[@rel="self"]/@href`
Datestamp	Syndication Atom `/feed/entry/updated` MUST equal ReM Atom `/feed/updated`
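A harvester consuming such a discovery feed might proceed along the lines of the Python sketch below, which lists the ReM URIs advertised by the feed and checks one entry against the constraints of Table 3. The function names are hypothetical, and the ReM values passed to the check (including `urn:uuid:hypothetical-rem-id`) are assumed to have been read from the dereferenced ReM beforehand.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def list_rem_entries(feed_xml):
    """Return one dict per entry of a discovery feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        entries.append({
            "entry_id": entry.findtext(f"{ATOM}id"),
            "rem_uri": link.get("href") if link is not None else None,
            "updated": entry.findtext(f"{ATOM}updated"),
        })
    return entries

def table_3_problems(entry, rem_id, rem_self, rem_updated):
    """Compare a discovery entry with values taken from the ReM itself."""
    problems = []
    if entry["entry_id"] in (rem_id, rem_self):
        problems.append("entry id equals the ReM id or the ReM URI")
    if entry["rem_uri"] != rem_self:
        problems.append("entry link does not equal the ReM self link")
    if entry["updated"] != rem_updated:
        problems.append("entry updated does not equal the ReM updated value")
    return problems

# Abbreviated copy of the discovery feed, reduced to a single entry.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>ReMs at www.foo.edu</title>
  <entry>
    <title>ReM For Object1</title>
    <link href="http://www.foo.org/objects/object1.atom"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2007-01-06T00:00:00Z</updated>
  </entry>
</feed>"""

entries = list_rem_entries(FEED)
print(entries[0]["rem_uri"])
print(table_3_problems(entries[0],
                       rem_id="urn:uuid:hypothetical-rem-id",
                       rem_self="http://www.foo.org/objects/object1.atom",
                       rem_updated="2007-01-06T00:00:00Z"))
```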

The same ReMs could be exposed via RSS 2.0. For example, this RSS feed:

http://www.foo.edu/all-rems.rss

When dereferenced would yield:

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>ReMs at www.foo.edu</title>
    <link>http://www.foo.edu/</link>
    <description>All of the Resource Maps for resources at www.foo.edu</description>
  
    <item>
      <title>ReM for Object 1</title>
      <link>http://www.foo.org/objects/object1.atom</link>
      <description>ReM for Object 1</description>
      <pubDate>Sat, 06 Jan 2007 00:00:00 GMT</pubDate>
    </item>
  
    <item>
      <title>ReM for Object 2</title>
      <link>http://www.foo.org/objects/object2.atom</link>
      <description>ReM for Object 2</description>
      <pubDate>Sat, 11 Aug 2007 00:00:00 GMT</pubDate>
    </item>

    <item>
      <title>ReM for Object 3</title>
      <link>http://www.foo.org/objects/object3.atom</link>
      <description>ReM for Object 3</description>
      <pubDate>Thu, 15 Mar 2007 18:30:02 GMT</pubDate>
    </item>
   
  </channel>
</rss>

Table 4: Atom ReMs Discovered via RSS
Identification	RSS 2.0 `/rss/channel/item/link` MUST NOT equal ReM Atom `/feed/id`; RSS 2.0 `/rss/channel/item/link` MUST equal ReM Atom `/feed/link[@rel="self"]/@href`
Datestamp	RSS 2.0 `/rss/channel/item/pubDate` MUST equal ReM Atom `/feed/updated` (after conversion from RFC 822 format to ISO 8601 format)
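The date conversion mentioned in Table 4 can be sketched in Python with the standard library. The function names are hypothetical, and treating a date without usable timezone information as UTC is our assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_pubdate(pub_date):
    """Parse an RSS 2.0 pubDate (RFC 822 style) into an aware UTC datetime."""
    parsed = parsedate_to_datetime(pub_date)
    if parsed.tzinfo is None:
        # Assumption: a date without usable timezone information is UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

def pubdate_to_atom(pub_date):
    """Render a pubDate in the form used by Atom updated elements."""
    return parse_pubdate(pub_date).strftime("%Y-%m-%dT%H:%M:%SZ")

def same_instant(pub_date, atom_updated):
    """Return True if an RSS pubDate and an Atom updated value coincide."""
    atom_time = datetime.fromisoformat(atom_updated.replace("Z", "+00:00"))
    return parse_pubdate(pub_date) == atom_time

print(pubdate_to_atom("Sat, 06 Jan 2007 00:00:00 GMT"))
print(same_instant("Sat, 11 Aug 2007 00:00:00 GMT", "2007-08-11T00:00:00Z"))
```

Comparing parsed instants rather than strings avoids false mismatches when the two formats express the same time with different timezone offsets.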

2.4 Combining OAI-PMH with Other Approaches

Resource Map Documents [ORE Model] can be included as metadata records in an OAI-PMH response. However, the OAI-PMH constructs must be removed before the Resource Map Document can be used as such. This has implications with respect to embedding the Resource Map in a resource (discussed below). OAI-PMH repositories issue OAI-PMH responses of MIME type text/xml or application/xml. These OAI-PMH responses must be processed into ReM responses (currently in Atom Syndication Format and of MIME type application/atom+xml). We envision gateway services that perform this processing, taking an OAI-PMH GetRecord request as an argument, such as:

http://some.gateway.org/pmh2ore?=http://foo.edu/oai2?verb=GetRecord&metadataPrefix=oai_rem&identifier=oai:foo.edu:object1

OCLC has already developed one such service. It takes an OAI-PMH GetRecord URI as an argument and strips out the OAI-PMH elements, leaving only the child element of the OAI-PMH <metadata> element. For example, this OAI-PMH GetRecord request:

http://alcme.oclc.org/oaicat/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:oaicat.oclc.org:2002/ocm11992160

When submitted as an argument to the OCLC service, produces just the <oai_dc> element:

http://purl.org/OAIUtil?getRecordURL=http://alcme.oclc.org/oaicat/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:oaicat.oclc.org:2002/ocm11992160

The values of the OAI-PMH <responseDate> and <request> elements are retained as HTTP response headers. The above example could also be combined with syndication formats. For example, if a repository has its ReMs in OAI-PMH, it could export the ReMs in an Atom Feed for applications that are not OAI-PMH aware:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>ReMs at www.foo.edu</title>
 <link href="http://www.foo.edu/" />
 <link href="http://www.foo.edu/all-rems.atom" rel="self"/>
 <updated>2007-08-15T18:30:02Z</updated>
 <author>
   <name>John Doe</name>
   <email>johndoe@foo.edu</email>
 </author>
 <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>

 <entry>
   <title>ReM For Object1</title>
   <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object1"/>
   <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
   <updated>2007-01-06T00:00:00Z</updated>
 </entry>

 <entry>
   <title>ReM For Object2</title>
   <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object2"/>
   <id>urn:uuid:9a2cc699-ccba-9e8b-132e-91da394e9a5c</id>
   <updated>2007-08-11T00:00:00Z</updated>
 </entry>

 <entry>
   <title>ReM For Object3</title>
   <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object3"/>
   <id>urn:uuid:5225c895-cab8-8ebb-baaa-90da9d4efa6b</id>
   <updated>2007-03-15T18:30:02Z</updated>
 </entry>

</feed>
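The gateway URIs in the feed above nest one URI inside the query string of another. The Python sketch below shows one way such URIs could be constructed. It percent-encodes the nested GetRecord URI so that its own "?" and "&" characters are not mistaken for part of the outer query. Whether a given gateway expects the argument encoded in this way is an assumption on our part, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GATEWAY = "http://purl.org/OAIUtil"

def get_record_uri(base_url, identifier, metadata_prefix="oai_rem"):
    """Build an OAI-PMH GetRecord request URI."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    })
    return f"{base_url}?{query}"

def gateway_uri(record_uri, gateway=GATEWAY):
    """Wrap a GetRecord URI as the getRecordURL argument of a gateway."""
    # urlencode percent-encodes the nested URI (assumed to be acceptable
    # to the gateway).
    return f"{gateway}?{urlencode({'getRecordURL': record_uri})}"

inner = get_record_uri("http://foo.edu/oai2", "oai:foo.edu:object1")
outer = gateway_uri(inner)
print(inner)
print(outer)

# Decoding the outer query string recovers the original GetRecord URI.
recovered = parse_qs(urlsplit(outer).query)["getRecordURL"][0]
print(recovered == inner)
```

When such a URI is written into an Atom link element, any remaining "&" characters must still be escaped as "&amp;", as in the feed above.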

3. Resource Embedding

A common scenario for ReM discovery is for a human readable page in an aggregation to link to its corresponding ReM. This is most commonly accomplished using the HTML link element [HTML]. Alternatively, HTML A and IMG elements may point to ReMs, or the URI of the ReM can be exposed as an opaque string for human agents to paste into ORE-aware utilities.

We also envision the future availability of browser utilities such as Mozilla plugins that detect the presence of corresponding ReMs when embedded in resources and help guide the user in the (re)use of the aggregated resources.

3.1 HTML Link Element

The HTML link element can be used to direct agents from the aggregated HTML file to a corresponding ReM that describes the aggregation of which the HTML file is part. While this is a common case, there are actually four different scenarios regarding members of an aggregation and knowledge about their corresponding ReMs:

Full knowledge: the ReM is linked to by all resources in the aggregation.
Indirect knowledge: all but one of the resources in the aggregation link to a single, unique resource in the aggregation, which in turn links to the ReM.
Limited knowledge: only a subset of the resources in the aggregation (typically just a single resource) link to the ReM, and the remainder of the resources have no links at all.
Zero knowledge: none of the resources in the aggregation link to a ReM.

Note that the above scenarios are relative to a particular ReM. It is possible for aggregated resources to simultaneously have full knowledge about one ReM (typically authored by the creators of the resources) and zero knowledge about third-party ReMs that describe aggregations of the same resources. Below is an example of how an HTML page could link to its corresponding ReM. Assuming this HTML page and its associated JPEGs form the aggregation, and the JPEGs do not use HTTP headers to link to the corresponding ReM (see below), this is an example of a limited knowledge scenario, since only this HTML page links to the ReM.

<html>
<head>
<title>Hello World.</title>
<link href="http://example.net/hw.atom" type="application/atom+xml" rel="resourcemap" >
</head>
<body>
<img src="hello.jpeg">
<img src="world.jpeg">
</body>
</html>

In the above example, the HTML page links to only a single ReM. It could link to multiple ReMs, in which case it is the responsibility of the agent to differentiate the aggregations. Next we consider an example where an HTML page is aware that it is aggregated, but does not know the location of its ReM. Instead, it links to a page that does know the location of the ReM. There could be any number of these redirections. It is up to the author or maintainer of the resources and ReMs to choose which scenario best fits their usage profile.

<html>
<head>
<title>Chapter Twelve.</title>
<link href="http://mybook.com/toc.html" type="text/html" rel="indirectresourcemap" >
</head>
<body>
Welcome to chapter twelve... 
</body>
</html>

Since the HTML specification defines the values of rel attributes to be CDATA, we can use values of "resourcemap" and "indirectresourcemap" and still have valid XHTML.
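An ORE-aware agent could detect these link elements along the lines of the following Python sketch, which uses the standard library HTML parser. The class name `ReMLinkFinder` is hypothetical, and matching rel values case-insensitively is our assumption.

```python
from html.parser import HTMLParser

class ReMLinkFinder(HTMLParser):
    """Collect hrefs of link elements that point, directly or not, to a ReM."""

    def __init__(self):
        super().__init__()
        self.direct = []    # rel="resourcemap"
        self.indirect = []  # rel="indirectresourcemap"

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel holds a space-separated list of link types.
        rels = (attributes.get("rel") or "").lower().split()
        if not href:
            return
        if "resourcemap" in rels:
            self.direct.append(href)
        if "indirectresourcemap" in rels:
            self.indirect.append(href)

PAGE = """<html>
<head>
<title>Hello World.</title>
<link href="http://example.net/hw.atom" type="application/atom+xml" rel="resourcemap" >
<link href="http://mybook.com/toc.html" type="text/html" rel="indirectresourcemap" >
</head>
<body></body>
</html>"""

finder = ReMLinkFinder()
finder.feed(PAGE)
print(finder.direct, finder.indirect)
```

An agent that finds only indirect links would dereference each of them and repeat the search on the page returned.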

3.2 HTML A and IMG Elements

HTML does not provide appropriate attributes in the A and IMG elements to link to a Resource Map as well as the target resource. This section suggests either the addition of extra attributes to the A and IMG elements (which would make otherwise valid HTML documents invalid), or re-purposing an existing attribute.

A similar but different scenario is when it is desirable to acknowledge relationships to other Aggregations [ORE Model]. In this scenario, we wish to cite not the ReM that describes the aggregation containing the current HTML page, but rather the ReM that describes the aggregation where the resource we are linking to (with the A or IMG elements) was originally discovered. This is accomplished using a separate attribute for the A or IMG elements. The example below shows how an HTML page cites the ReMs used to discover a PDF document about frogs and toads as well as example images of each.

<html> 
...   
Here is a helpful reference for distinguishing 
<a href="http://example.org/pics/f-t.pdf" 
resourcemap="http://example.org/amphibians.atom">frogs vs. toads</a>.  
<p> 
Here is a frog
<img src="http://weluvfrogs.org/imgs/frog12.jpeg"
resourcemap="http://frogs.org/frogs.atom"> 
and here is a toad <img src="http://toadsrule.org/toad.gif"
resourcemap="http://toadsrule.org/toads.atom">.  
...  
</html>

This approach uses the non-standard attribute resourcemap. This can be used to provide hints to the ORE-aware user-agent, but is not guaranteed to be recognized, and is not valid XHTML. The only way to unambiguously link to other Aggregations or ReMs is to create a new ReM. See [ORE User Guide Resource Map] for how to do this.

Another approach to specifying the appropriate Resource Map without introducing a non-standard HTML attribute would be to place the Resource Map URI in an existing HTML attribute. For example, the rel attribute for the A element takes a space-separated list of values in which we could place the Resource Map, but the IMG element does not share this attribute. Below is an example of how the Resource Map URI could be placed in the rel attribute, with the IMG elements placed inside an A element (with no href attribute).

<html> 
...   
Here is a helpful reference for distinguishing 
<a href="http://example.org/pics/f-t.pdf" 
rel="resourcemap=http://example.org/amphibians.atom">frogs vs. toads</a>.  
<p> 
Here is a frog
<a rel="resourcemap=http://frogs.org/frogs.atom"> 
<img src="http://weluvfrogs.org/imgs/frog12.jpeg">
</a> and here is a toad 
<a rel="resourcemap=http://toadsrule.org/toads.atom">
<img src="http://toadsrule.org/toad.gif">
</a>.
...  
</html>
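Both conventions shown in this section could be read by an agent along the lines of the Python sketch below. The class name `CitedReMFinder` is hypothetical. The sketch assumes that an IMG element without its own resourcemap attribute inherits the ReM named by the enclosing A element.

```python
from html.parser import HTMLParser

PREFIX = "resourcemap="

class CitedReMFinder(HTMLParser):
    """Collect (target URI, ReM URI) pairs from A and IMG elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._enclosing_rem = None  # ReM named by the currently open A element

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "img"):
            return
        attributes = dict(attrs)
        target = attributes.get("href") if tag == "a" else attributes.get("src")
        # First convention: the non-standard resourcemap attribute.
        rem = attributes.get("resourcemap")
        if rem is None and tag == "a":
            # Second convention: a "resourcemap=URI" token in rel.
            for token in (attributes.get("rel") or "").split():
                if token.startswith(PREFIX):
                    rem = token[len(PREFIX):]
            self._enclosing_rem = rem
        if rem is None and tag == "img":
            rem = self._enclosing_rem
        if target and rem:
            self.pairs.append((target, rem))

    def handle_endtag(self, tag):
        if tag == "a":
            self._enclosing_rem = None

# Sample mixing both conventions, built from the examples in this section.
PAGE = """<html>
<a href="http://example.org/pics/f-t.pdf"
rel="resourcemap=http://example.org/amphibians.atom">frogs vs. toads</a>.
<a rel="resourcemap=http://frogs.org/frogs.atom">
<img src="http://weluvfrogs.org/imgs/frog12.jpeg">
</a>
<img src="http://toadsrule.org/toad.gif"
resourcemap="http://toadsrule.org/toads.atom">
</html>"""

finder = CitedReMFinder()
finder.feed(PAGE)
for target, rem in finder.pairs:
    print(target, rem)
```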

3.3 Non-HTML Resources

It may be possible to embed links to ReMs in non-HTML resources, such as PDF or images, but these methods are considered too preliminary to discuss at this time.

3.4 Showing ReMs in HTML Pages

We propose exposing ReM URIs as opaque strings to facilitate future usage scenarios in which people copy and paste ReM URIs into applications such as blogs, forums or repository systems. This is commonly done on sites such as YouTube and Photobucket and in classified listings, where strings are provided to the user to facilitate reuse (i.e., copy-and-paste) of the components in email, instant messaging systems, forums and HTML pages. We provide an example of how this could look, using an arXiv pre-print.

4. Response Embedding

If we wish to have resources link to their corresponding ReMs, but not all of the aggregated resources are HTML, and thus cannot use the HTML link element, we can embed the link of the ReM in the response. For the moment, this means putting the URI of the ReM in an HTTP response header.

4.1 HTTP Link Header

The concept of a link HTTP response header existed in earlier versions of the HTTP protocol [RFC2068], but the lack of a compelling use case probably led to it being removed from the current HTTP specification. A recent Internet Draft proposes a method for converting HTML link element semantics into HTTP Link response headers [HTTP Header Linking]. Although this draft has yet to be promoted to an RFC, the approach is straightforward. If we wanted to promote the hello world example above from limited knowledge to full knowledge, the JPEGs could link to their corresponding ReM with the HTTP link response header. The example below shows an HTTP request and response with the ReM in a link header.

(request)       HEAD http://www.example.net/hello.jpeg HTTP/1.1
                Host: www.example.net
                Connection: close

(response)      HTTP/1.1 200 OK
                Date: Sat, 26 May 2007 22:43:10 GMT
                Server: Apache/2.2.0
                Last-Modified: Sat, 26 May 2007 19:32:04 GMT
                ETag: "c3596-816-92123500"
                Accept-Ranges: bytes
                Content-Length: 2070
                Link: <http://example.net/hw.atom>; type="application/atom+xml"; rel="resourcemap"
                Content-Type: image/jpeg
                Connection: close
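A client could extract the ReM URI from such a Link header value along the lines of the Python sketch below. The function names are hypothetical, and the regular expressions cover only the simple `<URI>; name="value"` form shown above rather than the full grammar of the Internet Draft.

```python
import re

# One link value: <URI> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')

def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    links = []
    for link in LINK_RE.finditer(value):
        params = {}
        for param in PARAM_RE.finditer(link.group(2)):
            quoted, bare = param.group(2), param.group(3)
            params[param.group(1).lower()] = quoted if quoted is not None else bare
        links.append((link.group(1), params))
    return links

def resource_maps(value):
    """Return the URIs whose rel parameter includes "resourcemap"."""
    return [uri for uri, params in parse_link_header(value)
            if "resourcemap" in params.get("rel", "").lower().split()]

HEADER = '<http://example.net/hw.atom>; type="application/atom+xml"; rel="resourcemap"'
print(parse_link_header(HEADER))
print(resource_maps(HEADER))
```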

5. Methods Not Recommended for ReM Discovery

5.1 ReMs in Simple Files

It is possible to create an HTML page consisting of ReMs and link it from a web site for robots to discover, such as:

<a href="http://www.foo.edu/objects/object1.atom">ReM 1</a>
<a href="http://www.foo.edu/objects/object2.atom">ReM 2</a>
<a href="http://www.foo.edu/objects/object3.atom">ReM 3</a>
...

While this would not be incorrect and would result in exposing ReMs to web crawlers, it could lead to confusion if human agents were to accidentally load this page. Attempts to hide such a page from human agents and present it only to crawlers would likely be detected as link spam.

5.2 URI Conflation

The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable "splash page", either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a "splash page" for an object:

(ReM)                   http://www.foo.edu/objects/object1.atom
(Splash Page)           http://www.foo.edu/objects/object1.html
(Conflated URI)         http://www.foo.edu/objects/object1

Similarly, clients MUST NOT refer to the ReM using the conflated URI constructed along the lines of HTTP 303 redirection [DFKI TM-07-01]:

(ReM)                   http://www.foo.edu/data/objects/object1
(Splash Page)           http://www.foo.edu/page/objects/object1
(Conflated URI)         http://www.foo.edu/resource/objects/object1

The purpose of these restrictions is to allow URI-R to be an unambiguous identifier for the ReM and not be conflated with identifiers for other resources (especially resources that are likely to be a member of the aggregation described by the ReM, such as human readable splash pages).

Note that these restrictions do not prevent a ReM from being used as the basis or "ingredient" of a splash page. Servers MAY choose to include stylesheets with ReMs to make them suitable for use by human agents. Although this is an option, clients should note that there is no requirement for ReMs and splash pages to be transformable from one to another; a ReM may not have the same URIs as a splash page and vice versa.

6. References

[DFKI TM-07-01]: Cool URIs for the Semantic Web, L. Sauermann, R. Cyganiak, M. Völkel, DFKI Technical Memo TM-07-01, 2007. Available at http://www.dfki.uni-kl.de/dfkidok/publications/TM/07/01/tm-07-01.pdf.
[HTML]: HTML 4.01 Specification, D. Raggett, A. Le Hors, I. Jacobs (eds.), W3C Recommendation 24 December 1999. Available at http://www.w3.org/TR/html4/.
[HTTP Header Linking]: HTTP Header Linking, M. Nottingham, IETF Draft, 2006. Available at http://www.mnot.net/drafts/draft-nottingham-http-link-header-00.txt.
[OAI-PMH]: The Open Archives Initiative Protocol for Metadata Harvesting, C. Lagoze, H. Van de Sompel, M. Nelson, S. Warner, 2002. Available at http://www.openarchives.org/OAI/openarchivesprotocol.html.
[ORE Model]: ORE Specification - Abstract Data Model, C. Lagoze, H. Van de Sompel, M. Nelson, R. Sanderson, S. Warner, 2007. Available at http://www.openarchives.org/ore/datamodel.
[ORE User Guide Resource Map]: ORE User Guide - Resource Map Implementation in Atom, C. Lagoze, H. Van de Sompel, M. Nelson, R. Sanderson, S. Warner, 2007. Available at http://www.openarchives.org/ore/atom-implementation.
[ReMProfileofAtom]: ORE Specification - Resource Map Profile of Atom, C. Lagoze, H. Van de Sompel, M. Nelson, R. Sanderson, S. Warner, 2007. Available at http://www.openarchives.org/ore/atom.
[RFC2068]: IETF RFC 2068: Hypertext Transfer Protocol - HTTP/1.1, R. Fielding, J. Gettys, J. Mogul, H. Frystyk, T. Berners-Lee, January 1997. Available at http://www.ietf.org/rfc/rfc2068.txt.
[RFC2119]: IETF RFC 2119: Key words for use in RFCs to Indicate Requirement Levels, S. Bradner, March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
[RFC4287]: IETF RFC 4287: The Atom Syndication Format, M. Nottingham, R. Sayre, December 2005. Available at http://www.ietf.org/rfc/rfc4287.txt.
[RSS]: RSS, 2007. Available at http://en.wikipedia.org/wiki/RSS_(file_format).
[SiteMap]: Sitemaps XML format, 2007. Available at http://www.sitemaps.org/.

Date	Editor	Description
2007-12-10	mln	public alpha release
2007-10-15	mln	alpha release to ORE-TC