DO NOT USE, SEE CURRENT ResourceSync SPECIFICATIONS

ResourceSync Framework Specification - Alpha Draft

13 August 2012

This version:
http://www.openarchives.org/rs/0.1/resourcesync
Latest version:
http://www.openarchives.org/rs/resourcesync
Previous version:
none
Editors (in alphabetical order):
Bernhard Haslhofer, Cornell University Information Science
Martin Klein, Los Alamos National Laboratory
Carl Lagoze, University of Michigan
Michael Nelson, Old Dominion University
Robert Sanderson, Los Alamos National Laboratory
Herbert Van de Sompel, Los Alamos National Laboratory
Simeon Warner, Cornell University

Abstract

Need abstract.

Status of this Document

This is an alpha draft distributed for public comment.

Table of Contents

1. Introduction
    1.1 Motivating Examples
    1.2 Notational Conventions
2. ResourceSync Basics
    2.1 Walkthrough
    2.2 Overview
        2.2.1 Destination Perspective
        2.2.2 Source Perspective
3. Describing Content
    3.1 Sitemap
        3.1.1 loc
        3.1.2 lastmod and expires
        3.1.3 rs:fixity
        3.1.4 rs:size
        3.1.5 rs:mimetype
        3.1.6 rs:contentencoding
        3.1.7 xhtml:meta and xhtml:link
    3.2 Large Sitemaps
4. Transferring Content
    4.1 HTTP Content Transfer
    4.2 Dump
        4.2.1 Manifest
    4.3 Alternate Content Transfer
        4.3.1 Alternate Content Location
        4.3.2 Partial Content
        4.3.3 Alternate Interpretation
5. Communicating Change Events
    5.1 Change Sets
    5.2 Pushing Change Sets
        5.2.1 XMPP
        5.2.2 HTTP Callback
6. Providing Access to Versions
    6.1 Historical Change Sets
    6.2 Historical Content
        6.2.1 Link to Version
        6.2.2 Link to Memento TimeGate
7. Advertising Capabilities
    7.1 robots.txt
    7.2 Discovery Links
        7.2.1 xhtml:link Element
        7.2.2 HTTP Link Headers
        7.2.3 HTML Link Headers
    7.3 host-meta Description
8. References

Appendices

A. XML Element Overview
B. Alternate Dump Formats: WARC
C. Acknowledgements
D. Change Log

1. Introduction

The Web is highly dynamic, with resources continuously being created, updated, and deleted. As a result, the use of resources from a remote server involves the challenge of remaining in step with its changing content. In many cases, there is no need to perfectly reflect a server's evolving content and therefore well-established resource discovery techniques, such as recurrent Web harvesting, suffice as an updating mechanism. However, there are significant use cases that require low latency and high accuracy in reflecting a remote server's changing content. These requirements have typically been addressed by ad hoc technical approaches implemented within a small group of collaborating servers. There have been no widely adopted, Web-based approaches.

This ResourceSync specification introduces a range of easy-to-implement capabilities that a server may support in order to enable remote servers to remain more tightly in sync with its evolving resources. It also describes how a server can advertise the capabilities it supports. Remote servers can inspect this information to determine how best to remain aligned with evolving content.

Each capability provides a different synchronization functionality, such as a list of a server's resources or its recently changed resources, including what the nature of the change was: create, update, or delete. Most capabilities are based on extensions for Sitemaps and new ways to use them. Capabilities can be combined to achieve varying levels of functionality and hence meet different local or community requirements. This modularity provides flexibility and makes ResourceSync suitable for a broad range of use cases.

This document is structured as follows:

1.1. Motivating Examples

Many projects and services have synchronization needs and have implemented ad hoc solutions. ResourceSync provides a standard synchronization method that will reduce implementation effort and facilitate easier reuse. This section describes four motivating examples with differing needs and complexities.

Consider first the case of a website for a small museum collection. The website may contain just a few dozen static web pages. With standard tools the maintainer can create a Sitemap to enhance harvesting by commodity search engines. In doing so the information is also available to services using ResourceSync.

When building services over Linked Data it is often desirable to maintain a local copy of key data for improved access and availability. Harvesting can be enabled by publishing a ResourceSync Sitemap for the collection. In many cases Linked Data records are small and so harvesting via individual HTTP GET requests is slow because of the large number of round-trips for a small amount of content. Publishing a dump in which content is aggregated in a ZIP file in a standard way makes this more efficient for the client and less burdensome for the server. Continued synchronization is enabled by either updating the Sitemap or, more efficiently, by publishing change sets listing only the changed resources and/or content dumps.

The arXiv.org archive of scientific articles has used a custom mirroring solution to propagate resource changes to a set of mirror sites and interacting services on a daily basis. There are about 2.4 million resource files with about 1600 changes (creates, updates) per day. The mirroring system currently in place uses HTTP with custom change descriptions, and occasional rsync to verify the copies and to cope with any errors in the incremental updates. It would be desirable to have a solution that allows any interested third-party service to synchronize with arXiv using standard software. Both accuracy and low implementation barrier are important. Within ResourceSync, arXiv.org could publish each metadata and full-text record as a separate web resource with its own URI. In this one-to-many scenario multiple clients (such as the mirror lanl.arXiv.org or any third party) could stay accurately in synchronization with either all or a portion of arXiv.org. This would extend the article metadata sharing (currently provided via OAI-PMH) to full-text in a web friendly fashion.

It is important to have access to the most recent versions of data resources in order to maintain efficient and accurate computation. DBpedia is a frequently used set of Linked Data, and is updated up to twice a second. While it may not be important to maintain second-granularity synchronization, there are millions of resources changing at a very high rate and existing solutions are unable to provide acceptable latency. The ResourceSync framework supports a push-based mechanism for alerting interested clients about changes using a publish and subscribe methodology. This builds upon ResourceSync's pull-based approaches, simply changing the network transport layer to a more appropriate technique for high throughput. The resources may be synchronized by a simple HTTP GET call, or by transferring only the changes using more advanced techniques.

1.2. Notational Conventions

This specification uses the terms "resource", "request", "response", "entity", "entity-body", "entity-header", "content negotiation", "client", "user agent", and "server" as described in [RFC 2616].

Throughout this document, the following namespace prefix bindings are used:

Prefix  Namespace URI                                Description
(none)  http://www.sitemaps.org/schemas/sitemap/0.9  Sitemap XML elements defined in the Sitemap protocol
xhtml   http://www.w3.org/1999/xhtml                 Elements introduced in the XHTML namespace
xmpp    http://jabber.org/protocol/pubsub            Elements of the PubSub extension to the XMPP protocol
rs      http://www.openarchives.org/rs/terms/        Elements introduced and defined in this specification

Table 1.1: Namespace prefix bindings used in this document

2. ResourceSync Basics

This section provides an overview of the various ResourceSync capabilities that a server may support in order to enable remote servers to become and remain synchronized with its evolving resources. The following terms are introduced:

2.1. Walkthrough

Let's assume a Source, http://example.com/, that wants to make it easy for Destinations to follow its changing content. A very basic first step towards that goal is for this Source to publish a Sitemap like many servers already do. A Sitemap lists the URIs of resources that a Source wants Destinations to know about, as shown in Example 2.1.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://example.com/res1</loc>
   </url>
   <url>
     <loc>http://example.com/res2</loc>
   </url>
</urlset>

Example 2.1: A basic Sitemap

A Destination can find out about the existence of a Sitemap in the Source's robots.txt file, published at the conventional location: http://example.com/robots.txt. Example 2.2 shows a robots.txt file that indicates the Source's Sitemap is available at http://example.com/sitemap.xml. The Destination can use the information in the Sitemap to start collecting the Source's content by issuing HTTP GET requests against the listed URIs.

User-agent: *
Sitemap: http://example.com/sitemap.xml

Example 2.2: A robots.txt file pointing to a Sitemap
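The discovery step can be sketched in a few lines of Python. This is a minimal illustration, not part of the specification: the helper name sitemap_urls is hypothetical, and the sketch assumes the robots.txt body has already been retrieved as text.

```python
def sitemap_urls(robots_txt):
    """Return the URIs named on 'Sitemap:' lines of a robots.txt body.

    Hypothetical helper for illustration; not defined by ResourceSync.
    """
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        # Skip blank lines and whole-line comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so the URI's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


robots = """User-agent: *
Sitemap: http://example.com/sitemap.xml
"""
print(sitemap_urls(robots))  # ['http://example.com/sitemap.xml']
```

A Destination would then issue an HTTP GET against each returned URI to obtain the Sitemap itself.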

The Source can provide additional information in the Sitemap to help the Destination with optimizing the process of collecting content. For example, if a Destination has previously acted upon a Source's Sitemap, it would be good to allow it to determine whether the Sitemap itself has changed since its last visit or whether specific resources have changed since then. Also, the Destination may not be interested in all of the Source's content but only content with a certain topic. A Source can express such information in a Sitemap using existing Sitemap elements or extension elements introduced by ResourceSync. Example 2.3 shows a Sitemap in which its time of publication was added as well as the last modification date and categories for the listed resources. A Destination can use such information to minimize the number of HTTP requests it needs to issue in order to remain up-to-date with the content it requires.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
     <loc>http://example.com/res1</loc>
     <lastmod>2012-08-08T08:15:00Z</lastmod>
     <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
   </url>
   <url>
     <loc>http://example.com/res2</loc>
     <lastmod>2012-08-08T13:22:00Z</lastmod>
     <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
   </url>
</urlset>

Example 2.3: A Sitemap with additional information
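As a rough sketch of how a Destination might use lastmod to reduce the number of requests, the Python fragment below lists only the resources modified after the Destination's previous visit. The function names, and the policy of fetching any resource that lacks a lastmod, are assumptions made for this illustration. The sketch also assumes the complete UTC datetime form used in the examples.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_utc(value):
    """Parse the complete UTC form YYYY-MM-DDThh:mm:ssZ used in the examples."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def changed_since(sitemap_xml, last_visit):
    """Return URIs whose lastmod is later than last_visit (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    uris = []
    for url in root.findall(SM + "url"):
        loc = url.findtext(SM + "loc")
        lastmod = url.findtext(SM + "lastmod")
        # Without a lastmod the Destination cannot rule out a change, so fetch.
        if lastmod is None or parse_utc(lastmod) > last_visit:
            uris.append(loc)
    return uris


SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
     <loc>http://example.com/res1</loc>
     <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
     <loc>http://example.com/res2</loc>
     <lastmod>2012-08-08T13:22:00Z</lastmod>
   </url>
</urlset>"""

last_visit = datetime(2012, 8, 8, 12, 0, tzinfo=timezone.utc)
print(changed_since(SITEMAP, last_visit))  # ['http://example.com/res2']
```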

In order to describe its changing content in a more timely manner, a Source can increase the frequency at which it publishes an up-to-date Sitemap. But changes may be so frequent or the size of the content collection so vast that updating a complete Sitemap may be impractical. In such cases, a Source can implement an additional capability that focuses on communicating information about changes only. To this end, ResourceSync introduces Change Sets. A Change Set is a special-purpose Sitemap that lists only recently changed resources as well as the nature of their change: create, update, delete. It is up to a Source to decide what the temporal interval is that is covered by a Change Set, for example, listing all changes that occurred during the previous hour, the current day, or since the most recent publication of a Sitemap. Example 2.4 shows a Change Set that lists two change events, one update and one deletion. It also contains some of the additional information that was described above.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset" 
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
   </url>
</urlset>

Example 2.4: A Change Set
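The following Python sketch shows one way a Destination might read such a Change Set into a list of change events. It is illustrative only: the function name change_events, the tuple layout, and the "unknown" fallback for a lastmod without rs:type are assumptions of this sketch rather than definitions of the specification.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"


def change_events(changeset_xml):
    """Return (uri, change_type, datetime) tuples, one per url element."""
    root = ET.fromstring(changeset_xml)
    events = []
    for url in root.findall(SM + "url"):
        loc = url.findtext(SM + "loc")
        expires = url.find(SM + "expires")
        lastmod = url.find(SM + "lastmod")
        if expires is not None:
            # An expires element signals a deletion.
            events.append((loc, "deleted", expires.text))
        elif lastmod is not None:
            # rs:type is a namespaced attribute; it may be absent.
            events.append((loc, lastmod.get(RS + "type", "unknown"), lastmod.text))
    return events


CHANGESET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
   </url>
</urlset>"""

for event in change_events(CHANGESET):
    print(event)
```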

One way by which a Destination can find out whether a Source supports the Change Set capability is by inspecting its Sitemap. The Sitemap in Example 2.5 shows a link to the Source's current Change Set that is available at http://example.com/changesets/most_recent.xml. A Destination can recurrently issue an HTTP GET request against this URI to obtain information about recent changes that occurred at the Source, compare those with changes it already acted upon, and process the remaining ones. In order to allow a Destination to remain even more tightly synchronized with a Source, ResourceSync also introduces a capability that consists of a Source recurrently pushing Change Sets that describe new change events to a Destination via publish/subscribe technology.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/changesets/most_recent.xml" 
               rel="current http://www.openarchives.org/rs/changeset"/>
   <url>
     <loc>http://example.com/res1</loc>
   </url>
   <url>
     <loc>http://example.com/res2</loc>
   </url>
</urlset>

Example 2.5: A basic Sitemap with a pointer to a Change Set

It may occur that a Destination is not always able to process the current Change Set before the Source replaces it with a new one, for example, because it goes off-line. When becoming operational again, the Destination may want to catch up with changes that occurred and would likely do so by obtaining the Source's current Change Set. However, while this Change Set will contain information about the recent changes that occurred at the Source, it may not cover all changes for the entire period during which the Destination was unavailable.

To address this problem, a Source may implement a memory capability that allows a Destination to obtain a historical overview of changes going back to before those listed in the current Change Set. This overview is made available as one or more interlinked historical Change Sets, each covering changes that occurred in a given time interval. Example 2.6 shows the Source's current Change Set but this time with the inclusion of a link to a historical Change Set, which is available at http://example.com/changesets/20120807.xml. This historical Change Set may link to a prior Change Set using the same mechanism. The example also shows that the Change Set includes a link to itself, expressing that it is the current one. With this memory capability in place, a Destination can collect one or more historical Change Sets, moving backwards in time, following the links that have both the "prev" and "http://www.openarchives.org/rs/changeset" relation types. Once a historical Change Set is obtained that includes a change that the Destination already acted upon, it can stop collecting even older changes and start acting upon the unprocessed ones.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset" 
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/changesets/20120807.xml" 
               rel="prev http://www.openarchives.org/rs/changeset"/>
   <xhtml:link href="http://example.com/changesets/most_recent.xml" 
               rel="current http://www.openarchives.org/rs/changeset"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
   </url>
</urlset>

Example 2.6: A Change Set with a link to a historical Change Set
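The backward walk over historical Change Sets can be sketched as follows. This is a minimal illustration under stated assumptions: a change is identified by the pair (URI, datetime), the fetch argument is a caller-supplied function returning the bytes of a Change Set, and all function names (including the demo_changeset builder and the res3 and res4 resources) are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
XHTML = "{http://www.w3.org/1999/xhtml}"
CHANGESET_REL = "http://www.openarchives.org/rs/changeset"


def prev_link(root):
    """Return the href of the link with both 'prev' and the changeset relation."""
    for link in root.findall(XHTML + "link"):
        rels = link.get("rel", "").split()
        if "prev" in rels and CHANGESET_REL in rels:
            return link.get("href")
    return None


def change_keys(root):
    """Identify each change by (URI, datetime); an assumption of this sketch."""
    keys = []
    for url in root.findall(SM + "url"):
        loc = url.findtext(SM + "loc")
        when = url.findtext(SM + "lastmod") or url.findtext(SM + "expires")
        keys.append((loc, when))
    return keys


def collect_unprocessed(start_uri, fetch, seen):
    """Walk backwards from start_uri; return unprocessed changes, newest set first."""
    pending, visited, uri = [], set(), start_uri
    while uri and uri not in visited:
        visited.add(uri)  # guard against link loops
        root = ET.fromstring(fetch(uri))
        keys = change_keys(root)
        pending.extend(k for k in keys if k not in seen)
        if any(k in seen for k in keys):
            break  # reached a change already acted upon; stop going back
        uri = prev_link(root)
    return pending


def demo_changeset(prev_uri, changes):
    """Build a tiny Change Set document for the demonstration below."""
    urls = "".join(
        f"<url><loc>{loc}</loc><lastmod>{when}</lastmod></url>" for loc, when in changes
    )
    return (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
        'xmlns:xhtml="http://www.w3.org/1999/xhtml">'
        f'<xhtml:link href="{prev_uri}" rel="prev {CHANGESET_REL}"/>'
        f"{urls}</urlset>"
    ).encode("utf-8")


BASE = "http://example.com/changesets/"
DOCS = {
    BASE + "most_recent.xml": demo_changeset(
        BASE + "20120807.xml", [("http://example.com/res1", "2012-08-08T08:15:00Z")]
    ),
    BASE + "20120807.xml": demo_changeset(
        BASE + "20120806.xml",
        [("http://example.com/res3", "2012-08-07T10:00:00Z"),
         ("http://example.com/res4", "2012-08-07T09:00:00Z")],
    ),
}
seen = {("http://example.com/res4", "2012-08-07T09:00:00Z")}
pending = collect_unprocessed(BASE + "most_recent.xml", DOCS.__getitem__, seen)
print(pending)
```

In the demonstration the walk stops at the 20120807.xml Change Set because it contains a change the Destination has already processed, so the older 20120806.xml document is never requested. A real Destination would typically order the pending changes by time before acting upon them.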

2.2. Overview

The previous section provides a concrete walkthrough of some capabilities that a Source can implement and describes how a Destination can leverage those capabilities to remain aligned with the Source's changing content. This section provides a high-level overview of the various ResourceSync capabilities and shows how these fit in a Destination's processes aimed at remaining in step with changes. This overview is summarized in Table 2.1, which lists Destination processes as columns and Source capabilities as rows, with cells indicating the usability of a capability for a given process. The next sections provide technical details about each ResourceSync capability.

Source Capabilities                Destination Processes
                                   Baseline Synchronization  Incremental Synchronization  Audit
Describing Content
     Sitemaps                      X                         -                            X
Transferring Content
     HTTP GET                      X                         X                            -
     Dump                          X                         -                            -
     Alternate Content Transfer    X                         X                            -
Communicating Change Events
     Change Sets                   -                         X                            X
     Pushing Change Sets           -                         X                            X
Providing Access to Versions
     Historical Change Sets        -                         X                            X
     Historical Content            -                         X                            -

Table 2.1: Source capabilities versus Destination processes

2.2.1. Destination Perspective

From the perspective of a Destination, three key processes are enabled by the ResourceSync capabilities:

Baseline Synchronization - In order to become synchronized with a Source, the Destination must make an initial copy of the content of a Source. This requires a list of resources hosted by a Source (Sitemap) and obtaining those resources (Dump, HTTP GET, Alternate Content Transfer).

Incremental Synchronization - A Destination may remain in sync with a Source by repeatedly performing a Baseline Synchronization but this will be inefficient in many situations. To increase efficiency, a Source may communicate information about change events that involve its resources (Change Sets, Pushing Change Sets). This allows a Destination to only obtain new and updated resources (HTTP GET, Alternate Content Transfer). In order to cope with outages, or changes at the Source that occur more frequently than the Destination attempts to synchronize, the Source may keep a historical record of change events and/or versions of resources as they change over time (historical Change Sets, historical Content).

Audit - In order to verify whether it is in sync with the Source, a Destination must be able to check that the content it obtained matches the current resources hosted by the Source. This requires a list of resources hosted by the Source (Sitemap, Change Set, historical Change Set), and metadata that characterizes the resources' most recent state, such as last modification time, size, and fixity.

2.2.2. Source Perspective

From the perspective of a Source, the ResourceSync capabilities that can be supported to enable Destinations to remain in sync with changing content can be grouped into four categories:

Describing Content - In order to describe its content, a Source can recurrently make an up-to-date Sitemap available. A basic Sitemap provides the URIs of resources that the Source wants Destinations to know about. But additional information can be added to the Sitemap to optimize the Destination's process of obtaining a Source's resources. Such information includes the Sitemap's publication time and the last modification time and categories for resources.

Transferring Content - The default mechanism to obtain a resource is to issue an HTTP GET against its URI. But the Source may support two additional content transfer capabilities:

Communicating Change Events - In order to achieve low synchronization latency, a Source may communicate information about change events that involve its resources:

Providing Access to Versions - In order to allow a Destination to catch up with missed changes that occurred at the Source, the Source may keep a historical record of change events and/or versions of resources as they change over time:

3. Describing Content

A Source may publish a description of its content in order to allow Destinations to keep track of the content state. This information enables a Destination to make a copy of all or part of the content, or to update a local copy to remain synchronized with changes at a Source. The Sitemap format was created to improve the efficiency and reliability of web harvesting and is the basis of content description within ResourceSync. Optional extensions provide facilities for improved synchronization and verification.

3.1. Sitemap

ResourceSync leverages the widespread adoption and tool support of the Sitemaps XML format. Destinations can discover a Sitemap via a Source's robots.txt file, as shown in Example 3.1.

   
   User-agent: *
   Sitemap: http://example.com/sitemap.xml

Example 3.1: Minimal robots.txt file

A minimal Sitemap is simply a list of all of the resources provided by a Source. The structure of a Sitemap is shown in Example 3.2. It must have the urlset root element and information about each resource is contained within a url element. This example shows a single resource http://example.com/res1. It is recommended that a last modification time for the entire Sitemap be included using an xhtml:meta element with date and time conforming to the W3C Datetime syntax.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
     <loc>http://example.com/res1</loc>
   </url>
</urlset> 

Example 3.2: Minimal Sitemap structure and xhtml:meta

Information about each resource described by a Sitemap is conveyed within a url element. At minimum the location of the resource must be specified using the loc element. All other information is optional. Other elements from the Sitemaps format, or from other schemas, are permitted. Consuming applications should ignore unrecognized content or elements. Elements that are useful for ResourceSync are summarized in Table 3.1, Table 3.2, and Table 3.3 and described in the sections that follow.

Element                      Use       Description
<loc>                        required  URL of the resource as defined in the Sitemaps protocol.
<lastmod rs:type="created">  optional  Date of last modification of the resource as defined in the Sitemaps protocol and expressed as a W3C Datetime. If the rs:type attribute has the value "created", the modification is the creation of the resource.
<lastmod rs:type="updated">  optional  Date of last modification of the resource as defined in the Sitemaps protocol and expressed as a W3C Datetime. If the rs:type attribute has the value "updated", the modification is an update of the resource.
<expires>                    optional  Date of deletion of the resource. This date must be in the past. Expressed as a W3C Datetime.

Table 3.1: Child elements of the url element to identify the resource and express change types.

Element               Use                   Description
<rs:fixity>           optional, repeatable  Digest of the entity-body of a resource representation, computed using one of several algorithms. For most applications the MD5 digest defined in RFC 2616, Sec. 14.15 is recommended.
<rs:size>             optional              Size of the entity-body of a resource representation. The value must be equal to the value of the Content-Length entity-header in the HTTP response and must be computed as defined in RFC 2616, Sec. 4.4.
<rs:mimetype>         optional              MIME-Type of the entity-body of a resource representation. The value must be equal to the value of the Content-Type entity-header in the HTTP response as defined in RFC 2616, Sec. 14.17.
<rs:contentencoding>  optional              Content encoding of the entity-body of a resource representation. The value must be equal to the value of the Content-Encoding entity-header in the HTTP response as defined in RFC 2616, Sec. 14.11.

Table 3.2: Child elements of the url element to express representation specific information.

Element                     Use                   Description
<xhtml:meta>, <xhtml:link>  optional, repeatable  Keyword or term assigned to a resource, which may originate from existing controlled vocabularies. This element may be repeated to indicate multiple categories.

Table 3.3: Child element of the url element expressing keywords usable for filtering.

3.1.1. loc

The loc element is used to convey the location of each resource described. Within each url element there should be exactly one loc element as defined in the Sitemaps XML format. It should contain a dereferenceable URI from which a client may download content. Example 3.3 below shows a minimal Sitemap describing the locations of two resources.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
   </url>
   <!-- one url element for each resource ... -->
</urlset> 

Example 3.3: Simple Sitemap describing two resource locations

With the location information alone, a Destination can retrieve content from the listed resources. By doing this repeatedly the Destination can check whether the content has changed. Such use may be sufficient for some small-scale use cases but would be an inefficient way to synchronize large collections, or collections that change frequently.

3.1.2. lastmod and expires

The lastmod and expires elements may be used to convey the last modification or deletion time of the resource. This information allows a client to determine whether or not there is new content to download. It is recommended that the last modification or deletion time be included with each url element.

The content of lastmod is defined by the Sitemaps XML format and must conform to the W3C Datetime syntax. The use of a complete date and time expressed in UTC with the form YYYY-MM-DDThh:mm:ss[.s]Z is recommended. Note that UTC indication or time zone offset specification is mandatory if time information is included.
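The syntactic rule that a time zone designator is mandatory once a time is given can be illustrated with a small Python check. This is a sketch only: the name is_w3c_datetime is hypothetical, and the pattern validates the shape of the value, not the ranges of its fields (a month value of 13 would pass).

```python
import re

# Shape of a W3C Datetime: YYYY[-MM[-DD[Thh:mm[:ss[.s]]TZD]]]
W3C_DATETIME = re.compile(
    r"\d{4}"                            # YYYY
    r"(-\d{2}"                          # -MM
    r"(-\d{2}"                          # -DD
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?"   # Thh:mm[:ss[.s]]
    r"(Z|[+-]\d{2}:\d{2})"              # TZD, mandatory once a time is present
    r")?)?)?"
)


def is_w3c_datetime(value):
    """Return True if value has the shape of a W3C Datetime."""
    return W3C_DATETIME.fullmatch(value) is not None


print(is_w3c_datetime("2012-08-08T16:30:00Z"))  # True
print(is_w3c_datetime("2012-08-08T16:30:00"))   # False: time without TZD
```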

The last modification information can be enhanced with an indication of the resource change type. For an updated or created resource the lastmod element can be given the attribute rs:type with the value "updated" or "created" accordingly. For a deleted resource the expires element, which is already commonly used in Sitemaps, should be used instead. The value of the expires element must conform to the W3C Datetime syntax.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="created">2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
   </url>
</urlset> 

Example 3.4: Use of lastmod and expires elements.

Addition of the last modification information allows a client to check for updates without accessing each resource individually. A Destination may compare the last modification time with that of a local copy and thus determine whether there has been a change and perhaps new content should be downloaded. In case of expires, the local copy of the corresponding content should be removed.
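The comparison with a local copy can be sketched as a small decision function in Python. The names action_for and parse_utc, the dictionary layout of an entry, and the choice to download when no lastmod is available are assumptions of this illustration. The sketch handles only the complete UTC datetime form used in the examples.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse the complete UTC form YYYY-MM-DDThh:mm:ssZ used in the examples."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def action_for(entry, local_lastmod):
    """Decide what to do for one url entry.

    entry: dict with optional 'lastmod' and 'expires' strings.
    local_lastmod: datetime of the local copy, or None if there is no copy.
    Returns 'download', 'delete', or 'skip'.
    """
    if entry.get("expires"):
        # The resource was deleted at the Source: remove any local copy.
        return "delete" if local_lastmod is not None else "skip"
    if local_lastmod is None:
        return "download"
    lastmod = entry.get("lastmod")
    # Without a lastmod a change cannot be ruled out, so download.
    if lastmod is None or parse_utc(lastmod) > local_lastmod:
        return "download"
    return "skip"


local = datetime(2012, 8, 8, 9, 0, tzinfo=timezone.utc)
print(action_for({"lastmod": "2012-08-08T08:15:00Z"}, local))  # skip
print(action_for({"expires": "2012-08-08T13:22:00Z"}, local))  # delete
```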

3.1.3. rs:fixity

The rs:fixity element may be used to convey fixity information in the form of a digest of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:fixity element should not be used.

The rs:fixity element has a mandatory type attribute that specifies the type and format of the digest as shown in Table 3.4. The md5 digest requires little effort to compute, is small to transfer, and is likely adequate for most change detection scenarios. It is thus recommended that the md5 digest be used as the default. However, md5 digests are not strong and therefore should not be used to guarantee authenticity. For this purpose, digests such as sha-256 would be appropriate. Multiple rs:fixity elements may be used to convey multiple digests using different algorithms.

type     Description
md5      MD5 digest of the entity-body encoded in base64 as defined for the Content-MD5 header in [RFC 2616, Sec. 14.15] and [RFC 1864], e.g. Q2hlY2sgSW50ZWdyaXR5IQ==.
sha-1    SHA-1 digest of the entity-body encoded in base64 according to [RFC 4648].
sha-256  SHA-256 digest of the entity-body encoded in base64 according to [RFC 4648].

Table 3.4: Defined values for fixity type.

Example 3.5 shows use of the rs:fixity element to convey MD5 entity-body digests.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt==</rs:fixity>
   </url>
</urlset> 

Example 3.5: Use of the rs:fixity element

Fixity information may be used as a supplement or alternative to last modification time, as a means to allow clients to detect whether content has changed as compared to a local copy. Fixity information provides a much better means to detect corruption of a downloaded copy than other descriptive information, and thus supports checking of a downloaded copy without having to download it again.
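As an illustration of how the base64-encoded digests of Table 3.4 might be computed, the following Python sketch uses the standard hashlib and base64 modules. The function name fixity and the ALGORITHMS mapping are hypothetical; the input is assumed to be the entity-body bytes exactly as transferred.

```python
import base64
import hashlib

# Map the rs:fixity type values of Table 3.4 to hashlib constructors.
ALGORITHMS = {
    "md5": hashlib.md5,
    "sha-1": hashlib.sha1,
    "sha-256": hashlib.sha256,
}


def fixity(entity_body, digest_type="md5"):
    """Return the base64-encoded binary digest of entity_body."""
    digest = ALGORITHMS[digest_type](entity_body).digest()
    return base64.b64encode(digest).decode("ascii")


# The MD5 digest of an empty entity-body, base64 encoded.
print(fixity(b""))  # 1B2M2Y8AsgTpgAmY7PhCfg==
```

Note that the digest is encoded from its 16 raw bytes (for MD5), not from its hexadecimal string form, which is why an md5 value is always 24 characters long including padding.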

3.1.4. rs:size

The rs:size element may be used to convey the size of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:size element should not be used.

The value of the rs:size element should be equal to the value of the Content-Length entity-header in the HTTP response (if present) and must be computed as defined in RFC 2616, Sec. 4.4. Example 3.6 shows use of the rs:size element.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
      <rs:size>15672</rs:size>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt==</rs:fixity>
      <rs:size>93660664</rs:size>
   </url>
</urlset> 

Example 3.6: Use of rs:size.

3.1.5. rs:mimetype

The rs:mimetype element may be used to convey the MIME-Type of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:mimetype element should not be used.

The value of the rs:mimetype element should be equal to the value of the Content-Type entity-header in the HTTP response (if present) and the value should be defined in the IANA MIME Media Types registry. The use is optional and not repeatable. Example 3.7 shows use of the rs:mimetype element.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
      <rs:size>15672</rs:size>
      <rs:mimetype>text/html; charset=utf-8</rs:mimetype>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt=</rs:fixity>
      <rs:size>93660664</rs:size>
      <rs:mimetype>application/pdf</rs:mimetype>
   </url>
</urlset> 

Example 3.7: Use of rs:mimetype.

3.1.6. rs:contentencoding

The rs:contentencoding element may be used to convey the type of encoding used on the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:contentencoding element should not be used.

The value of the rs:contentencoding element should be equal to the value of the Content-Encoding entity-header in the HTTP response (if present). The use is optional and not repeatable.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
      <rs:size>15672</rs:size>
      <rs:contentencoding>gzip</rs:contentencoding>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt=</rs:fixity>
      <rs:size>93660664</rs:size>
      <rs:contentencoding>compress</rs:contentencoding>
   </url>
</urlset> 

Example 3.8: Use of rs:contentencoding.
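As an illustration of how a Destination might read these optional elements, the following Python sketch collects loc, lastmod, rs:size, rs:mimetype, and rs:contentencoding for each url element. It uses only the standard library; the function name describe_resources and the dictionary keys are choices made for this sketch and are not defined by ResourceSync.

```python
import xml.etree.ElementTree as ET

# Namespaces as used in the examples of this section.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def describe_resources(sitemap_xml):
    """Return one dict per url element; absent optional elements map to None."""
    root = ET.fromstring(sitemap_xml)
    resources = []
    for url in root.findall("sm:url", NS):
        size = url.findtext("rs:size", default=None, namespaces=NS)
        resources.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "size": int(size) if size is not None else None,
            "mimetype": url.findtext("rs:mimetype", default=None, namespaces=NS),
            "contentencoding": url.findtext(
                "rs:contentencoding", default=None, namespaces=NS),
        })
    return resources
```

A Destination could, for example, compare the reported size with the length of a downloaded entity-body as a first, inexpensive consistency check.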

3.1.7. xhtml:meta and xhtml:link

The xhtml:meta and xhtml:link elements may be used to convey information useful to filter or select resources of interest. Typical uses would be to indicate grouping or classification of resources where some groups or classifications might be selected by a Destination. If the information to be conveyed includes a URI, the xhtml:link element should be used; otherwise the xhtml:meta element should be used. Example 3.9 shows how both elements can be used.

No restrictions are placed on the grouping scheme or the form of the category strings. However, the use of URIs from web ontologies or other controlled vocabularies will likely make this information more useful and is thus recommended.

Both elements are optional and may be repeated to indicate multiple categories or tags that apply to a resource.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
      <xhtml:meta name="DC.subject" content="Crocodiles"/>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
   </url>
</urlset>

Example 3.9: Use of the xhtml:link and xhtml:meta elements to enable filtering of content descriptions
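A Destination-side filter over such a Sitemap might look like the following Python sketch. It treats each xhtml:link as a (rel, href) pair and each xhtml:meta as a (name, content) pair and keeps the resources that carry at least one wanted pair. The function names and the pair representation are assumptions of this sketch, not part of the framework.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def entry_tags(entry):
    """Collect (rel, href) and (name, content) pairs found in one url element."""
    tags = set()
    for link in entry.findall("xhtml:link", NS):
        tags.add((link.get("rel"), link.get("href")))
    for meta in entry.findall("xhtml:meta", NS):
        tags.add((meta.get("name"), meta.get("content")))
    return tags


def select_locs(sitemap_xml, wanted):
    """Return the loc of every resource carrying at least one wanted pair."""
    wanted = set(wanted)
    root = ET.fromstring(sitemap_xml)
    return [
        url.findtext("sm:loc", default="", namespaces=NS).strip()
        for url in root.findall("sm:url", NS)
        if entry_tags(url) & wanted
    ]
```

For the Sitemap in Example 3.9, asking for ("DC.subject", "Crocodiles") would select only http://example.com/res1.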

3.2. Large Sitemaps

The Sitemaps XML format specifies that a single Sitemap must not include more than 50,000 url elements and must not be larger than 10MB in uncompressed format. A Sitemap Index may be used to list up to 50,000 individual Sitemap files and thus extend the format to up to 2.5 billion resources.
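The arithmetic behind these limits can be illustrated with a small Python sketch that splits a list of resource URIs into groups of at most 50,000, one group per Sitemap. It is a simplification: it only enforces the url count limits and does not check the 10MB size limit. The function name split_into_sitemaps is hypothetical.

```python
MAX_URLS_PER_SITEMAP = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000


def split_into_sitemaps(uris):
    """Split URIs into chunks that each fit in one Sitemap (count limit only)."""
    uris = list(uris)
    chunks = [
        uris[start:start + MAX_URLS_PER_SITEMAP]
        for start in range(0, len(uris), MAX_URLS_PER_SITEMAP)
    ]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many resources for a single Sitemap Index")
    return chunks
```

With both limits at 50,000, a single Sitemap Index can cover 50,000 x 50,000 = 2,500,000,000 resources.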

ResourceSync does not change the Sitemap Index format; examples are included here for convenience. A Sitemap Index has a format very similar to a Sitemap. The root element is sitemapindex and each Sitemap is described in a sitemap element. For each Sitemap the location is specified with the loc element (cf. 3.1.1 loc) and, optionally, the last modification time for the Sitemap may be specified with the lastmod element (cf. 3.1.2 lastmod). It is recommended that a last modification time for the entire Sitemap Index be included using an xhtml:meta element with date and time conforming to the W3C Datetime syntax. Example 3.10 shows a Sitemap Index listing two individual Sitemaps.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <sitemap>
      <loc>http://example.com/sitemap1.xml</loc>
      <lastmod>2012-08-08T10:00:00Z</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://example.com/sitemap2.xml</loc>
      <lastmod>2012-08-08T15:00:00Z</lastmod>
   </sitemap>
</sitemapindex>

Example 3.10: Sitemap Index with two Sitemaps.

A Source may provide in the Sitemap Index an indication of the xhtml:meta and xhtml:link values associated with resources in each individual Sitemap. It does this by aggregating the set of element values in each Sitemap and including them with xhtml:meta or xhtml:link elements inside the corresponding sitemap element of the Sitemap Index. This allows Destinations to filter and retrieve only those Sitemaps that match their selection criteria. This mechanism is not intended to override the specification of xhtml:meta or xhtml:link values for each resource, and so each included Sitemap must still list the corresponding elements for each resource. Example 3.11 shows a Sitemap Index with aggregated categories.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <sitemap>
      <loc>http://example.com/sitemap1.xml</loc>
      <lastmod>2012-08-08T10:00:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
      <xhtml:meta name="DC.subject" content="Crocodiles"/>
   </sitemap>
   <sitemap>
      <loc>http://example.com/sitemap2.xml</loc>
      <lastmod>2012-08-08T15:00:00Z</lastmod>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Animals" title="Animals"/>
   </sitemap>
</sitemapindex>

Example 3.11: Use of aggregated xhtml:link and xhtml:meta values with a Sitemap Index

4. Transferring Content

When a Destination detects that it is out of sync with a Source, the next step towards synchronization is the transfer of newly created and updated content. No content transfer is needed for deletions. ResourceSync supports several methods to accomplish this process.

4.1. HTTP Content Transfer

The default method for a Destination to obtain changed content from a Source is to issue an HTTP GET request against the changed resource. The resource's URI is taken from the loc element in the retrieved Sitemap. Each such request transfers a single resource representation, so a method for batch content transfer is desirable, especially for Baseline Synchronization.

4.2. Dump

To reduce the number of HTTP GET requests necessary to transfer content, a Source can publish Dumps that package its content. The Source's capability to publish Dumps needs to be advertised to Destinations. If the Source publishes Sitemaps, the way to make a Dump discoverable is to include an xhtml:link element in the Sitemap. An example of a Dump discovery link is shown in Example 4.1.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/dump/dump.zip"
               rel="http://www.openarchives.org/rs/dump"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
   </url>
</urlset> 

Example 4.1: Dump discovery link.
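A Destination could locate such a discovery link with a sketch like the following, which returns the href of every xhtml:link child of the root element whose rel attribute contains a given relation. Treating rel as a whitespace-separated list of relation types is an assumption of this sketch, as is the function name find_links.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

DUMP_REL = "http://www.openarchives.org/rs/dump"


def find_links(sitemap_xml, rel):
    """Return hrefs of top-level xhtml:link elements whose rel includes `rel`."""
    root = ET.fromstring(sitemap_xml)
    return [
        link.get("href")
        for link in root.findall("xhtml:link", NS)
        # rel is assumed to be a whitespace-separated list of relation types
        if rel in (link.get("rel") or "").split()
    ]
```

For the Sitemap in Example 4.1, find_links(sitemap_xml, DUMP_REL) would yield the single URI http://example.com/dump/dump.zip.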

A Dump is a package that contains content hosted by a Source. A Dump may be used to transfer resources from a Source in bulk, without a Destination having to request the resources separately. A Baseline Synchronization is a typical scenario for a Destination to obtain a Dump.

The default Dump format for ResourceSync is the Zip file format. However, it is possible for a Source to publish Dumps in other formats such as WARC. Appendix B provides guidelines to implement a Dump in the WARC format.

4.2.1. Manifest

Each Dump must contain a manifest.xml file. The manifest describes the content of the Dump. It is formatted as a Sitemap with additional descriptive elements. For each resource, described within the url element, a Manifest must include the element rs:path describing the mapping between the resource's URI and its relative file path in the Dump. Example 4.2 shows a simple manifest.xml file for a Dump containing two resources.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
      <rs:size>15672</rs:size>
      <rs:mimetype>text/html; charset=utf-8</rs:mimetype>
      <rs:contentencoding>gzip</rs:contentencoding>
      <rs:path>resources/res1</rs:path>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
      <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt=</rs:fixity>
      <rs:size>93660664</rs:size>
      <rs:mimetype>application/pdf</rs:mimetype>
      <rs:contentencoding>compress</rs:contentencoding>
      <rs:path>resources/res2</rs:path>
   </url>
</urlset> 

Example 4.2: A Dump Manifest.

The requirements for the use of all ResourceSync Sitemap elements, summarized in Table 3.1, Table 3.2, Table 3.3, and Table 3.4, apply to Dump manifest files as well. In particular, the use of the rs:mimetype and rs:contentencoding elements is recommended here. Table 4.1 summarizes the XML element required in Dump manifests.

Element      Use        Description
<rs:path>    required   Relative resource file path within a Dump.

Table 4.1: Dump Manifest rs:path element.
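The role of rs:path can be illustrated with a Python sketch that opens a Zip Dump, reads its manifest, and maps each resource URI to the bytes stored at the corresponding path. The sketch assumes that manifest.xml sits at the top level of the Zip file, which this draft does not state explicitly; the function name read_dump is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_dump(dump_bytes):
    """Map each resource URI in the manifest to the content found at rs:path."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(dump_bytes)) as dump:
        # Assumption: the manifest is stored at the top level of the Zip file.
        manifest = ET.fromstring(dump.read("manifest.xml"))
        for url in manifest.findall("sm:url", NS):
            loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
            path = url.findtext("rs:path", default=None, namespaces=NS)
            if path is None:
                raise ValueError("manifest entry for %r lacks rs:path" % loc)
            contents[loc] = dump.read(path.strip())
    return contents
```

A Destination would typically combine this with the rs:fixity and rs:size values in the manifest to check each extracted file before using it.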

4.3. Alternate Content Transfer

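Certain scenarios may require a Source to offer alternate methods of content transfer. ResourceSync recognizes the following cases: content offered at an alternate location (Section 4.3.1), transfer of partial content (Section 4.3.2), and a URI in the loc element that requires an alternate interpretation (Section 4.3.3).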

This section describes how these scenarios can be addressed in the ResourceSync framework.

4.3.1. Alternate Content Location

Where a Source promotes an alternate location for its content, it needs to advertise the proper URIs to Destinations. It can do so by including an xhtml:link element as a child of each url element. The xhtml:link element can contain a reference to the alternate location of the resource and the proper relation type, and thereby convey the information required for the Destination to obtain the content. Example 4.3 shows how a link can be included in a ResourceSync modified Sitemap.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
       <loc>http://example.com/res1</loc>
       <lastmod>2012-08-08T08:15:00Z</lastmod>
       <xhtml:link rel="alternate http://www.openarchives.org/rs/mirror"
                   href="http://example.com/example-com-mirror/res1"/>
   </url>
</urlset>

Example 4.3: Alternate Content Transfer from a mirror site.

4.3.2. Partial Content

Scenarios exist where it is more efficient for a Destination to transfer only the part of a resource that has actually changed instead of the entire resource. Minor changes to large resources, such as fixed typos or small additions, may not justify the transfer of the entire document, especially if these kinds of changes occur frequently. ResourceSync supports the transfer of partial content. A Source can include an xhtml:link element as a child of each url element. The element can contain a reference to the partial content, a protocol that specifies the details of the partial content transfer between the Source and the Destination, and the proper relation type.

However, the implementation of this capability is left up to the Source and in general implementation will be media type specific. Whichever protocol the Source uses, it needs to be understood by the Destination in order to complete the partial resource transfer. Example 4.4 shows an xhtml:link element containing information needed by a Destination for partial content transfer.

Note that partial content transfer is only applicable in Change Sets (introduced in Section 5.1) but not in Sitemaps.

<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
       <loc>http://example.com/res1</loc>
       <lastmod>2012-08-08T08:15:00Z</lastmod>
       <xhtml:link rel="http://www.openarchives.org/rs/partial" 
                   rs:protocol="http://example.com/protocols/changesonly"
                   href="http://example.com/res1/diff251"/>
   </url>
</urlset>

Example 4.4: Partial Content Transfer in a Change Set.

Note this is a forward reference. At this point we have not introduced Change Sets yet.

4.3.3. Alternate Interpretation

For alternate content transfer it is essential for a Destination to understand what to expect when dereferencing the URI provided in the loc element. Example 4.5 shows a case where the loc element carries further information about the resource that has changed. The URI shown is a baseURL of an OAI-PMH repository, and the rs:protocol attribute points a Destination to the appropriate protocol specification. This additional pointer enables a Destination to understand that the given URI conforms to the OAI-PMH protocol.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
       <loc rs:protocol="http://www.openarchives.org/OAI/openarchivesprotocol.html">
            http://example.com/oaipmh</loc>
       <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
</urlset>

Example 4.5: Alternate Content Transfer in an OAI-PMH repository.

5. Communicating Change Events

A Source may support communication of changes in its content as a way to enable Destinations to efficiently follow those changes. A Source may publish a description of recent changes, or may use XMPP PubSub or HTTP Callback to push changes to a subscribing Destination.

5.1. Change Sets

The ResourceSync framework introduces the notion of a Change Set that describes changes at a Source. The Change Set is a special-purpose Sitemap that lists only recently changed resources as well as the nature of their change. A Change Set is identified by a URI and if a Destination dereferences this URI, it can expect a set of recent changes to be returned.

In order to keep up with the Source's changes, Destinations need to become aware of whether Change Sets are provided. If a Source implements Sitemaps to describe its content, it can include a discovery link to a Change Set. The xhtml:link element can be used for that purpose. Example 5.1 shows a Sitemap including an xhtml:link element enabling the Destination to discover the Change Set provided by a Source.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/changesets/most_recent.xml" 
               rel="current http://www.openarchives.org/rs/changeset"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
   </url>
</urlset> 

Example 5.1: Change Set discovery link in a Sitemap.

The frequency of content change on a Source's end as well as the acceptable latency for synchronization on a Destination's end may vary between scenarios. In the ResourceSync framework it is up to a Source to decide what the temporal interval is that is covered by a Change Set. It may list all changes that occurred during the previous hour, the current day, or since the most recent publication of a Sitemap. A recent Change Set published by one Source may very well cover a much smaller or much larger temporal interval than a recent Change Set of another Source. It is up to a Source to define what "recent" means for its individual scenario.

Change Sets are based on the Sitemap format which means that each Change Set:

The recommended addition of the attribute rs:type="changeset" to the urlset root element helps to distinguish between Change Sets and Sitemaps. Sitemaps do not have this attribute. The recommended last modification time of the entire Change Set in the xhtml:meta element must be a date and time conforming to the W3C Datetime syntax. This time stamp provides one way for Destinations to determine whether a Change Set is new.

Since the purpose of Change Sets is to convey information about changes in content hosted by a Source, it is essential to indicate the nature of the change. Three types of content change are defined in the ResourceSync framework: created, updated, and deleted. Each url element must include one and only one of the following child elements to indicate the change type and when it occurred:

For all three options, the date and time of the change must be included conforming to the W3C Datetime syntax. Table 5.1 summarizes the three change types and their corresponding XML elements.

Change Type   XML Element
Create        <lastmod rs:type="created">2012-07-17T19:22:00Z</lastmod>
Update        <lastmod rs:type="updated">2012-07-17T19:22:00Z</lastmod>
Delete        <expires>2012-07-17T19:22:00Z</expires>

Table 5.1: XML elements expressing content change in Change Sets.

Example 5.2 shows the content of a Change Set with three url elements each of which describes one change event. This example shows one update, one deletion, and one creation. The urlset root element contains the attribute rs:type="changeset" and an xhtml:meta time stamp is also included.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset" 
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
   </url>
   <url>
      <loc>http://example.com/res3</loc>
      <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
   </url>
</urlset>

Example 5.2: Change Set describing three content changes: an update, a deletion, and a creation.
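The mapping in Table 5.1 can be sketched as a small Python function that turns a Change Set into a list of (URI, change type, datetime) tuples. The tuple layout, the function name change_events, and the decision to raise an error for malformed url elements are assumptions of this sketch rather than requirements of the framework.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Qualified name of the rs:type attribute as ElementTree exposes it.
RS_TYPE = "{http://www.openarchives.org/rs/terms/}type"


def change_events(changeset_xml):
    """Return (loc, change type, datetime) for each url element."""
    root = ET.fromstring(changeset_xml)
    events = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.find("sm:lastmod", NS)
        expires = url.find("sm:expires", NS)
        if expires is not None and lastmod is None:
            events.append((loc, "deleted", (expires.text or "").strip()))
        elif lastmod is not None and expires is None:
            kind = lastmod.get(RS_TYPE)
            if kind not in ("created", "updated"):
                raise ValueError("unexpected rs:type %r for %s" % (kind, loc))
            events.append((loc, kind, (lastmod.text or "").strip()))
        else:
            # Exactly one of lastmod or expires is expected per url element.
            raise ValueError("ambiguous change description for %s" % loc)
    return events
```

Applied to Example 5.2, the sketch yields an "updated" event for res1, a "deleted" event for res2, and a "created" event for res3, in document order.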

As seen in previous sections, a Source can add several optional child elements to each url element. This is also applicable for Change Sets. For example, the elements rs:size and rs:fixity become particularly important for the Destination process Audit. Example 5.3 shows a Change Set with multiple optional elements for one change event.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
      <rs:size>15672</rs:size>
      <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
      <rs:mimetype>application/pdf</rs:mimetype>
      <rs:contentencoding>gzip</rs:contentencoding>
      <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
      <xhtml:meta name="DC.subject" content="Crocodiles"/>
   </url>
</urlset>

Example 5.3: Change Set with multiple optional elements.

A unique identifier for each change event might be useful in some use cases. This specification does not define a dedicated event identifying element. However, the ResourceSync framework recommends the combination of the values of the loc and of the lastmod elements to be used for this purpose. It is important to note that the framework considers it to be the Source's responsibility to provide a sufficient granularity for the lastmod value to ensure a truly unique identifier.

5.2. Pushing Change Sets

In the previous section a Source publishes Change Sets at a self-defined frequency. A Destination periodically needs to check for updates by pulling the Change Set. This setup implies a latency since the publication interval is usually unknown to the Destination.

For scenarios where this latency is unacceptable, or where Destinations simply cannot continuously pull Change Sets, the ResourceSync framework features push-based approaches. These approaches are suitable, for example, for environments with high-frequency content changes at the Source's end and high synchronization demands at the Destination's end. Since change events can rapidly and continuously be pushed to Destinations, the latency inflicted by the Destination's "guessing" of when to pull a new Change Set is eliminated.

Two push based approaches are described below: one based on XMPP and one based on HTTP Callback.

5.2.1. XMPP

The Extensible Messaging and Presence Protocol (XMPP), more specifically, its PubSub extension, allows a Source to support subscription to Change Sets communicated via XMPP messaging infrastructure.

A Destination here also needs to become aware of this capability being offered by a Source. Similar to the previous section, Example 5.4 shows a Sitemap including an xhtml:link element. It enables Destinations to discover all necessary information to receive the push-based Change Set provided by a Source.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="xmpp:pubsub.example.com" 
               rs:protocol="http://xmpp.org/extensions/xep-0060.html" 
               rs:pubsubnode="Example_Node_Name"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
   </url>
</urlset> 

Example 5.4: Discovery link for pushing Change Sets via XMPP PubSub.

An XMPP message sent by a Source is encapsulated in an xmpp:iq element. This element contains, among other things, the addresses of the sender and the recipient. The protocol's PubSub extension adds the xmpp:pubsub and xmpp:publish elements. The latter contains the name of the XMPP PubSub node the message is published to.

The body of the XMPP PubSub message is contained in an xmpp:item element. As shown in Example 5.5, the message itself is a Change Set encapsulated by the urlset element and each change event contained within a url element. All elements, required and optional, as introduced in the previous section, apply here too.

Example 5.5 shows the same change events as seen in Example 5.2 but in the form of an XMPP PubSub message. It is up to the Source to decide whether to bundle more than one change event into one XMPP PubSub message (as seen in Example 5.5) or to send one message per change event, in which case the encapsulating urlset would include only one url element.

<xmpp:iq from="sender@example.com" type="set" to="destination.com" id="liAJUz3S"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xmpp="http://jabber.org/protocol/pubsub"
         xmlns:rs="http://www.openarchives.org/rs/terms/">
   <xmpp:pubsub>
      <xmpp:publish node="PubSub_NodeName">
         <xmpp:item id="3294">
            <urlset rs:type="changeset">
               <url>
                  <loc>http://example.com/res1</loc>
                  <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
               </url>
               <url>
                  <loc>http://example.com/res2</loc>
                  <expires>2012-08-08T13:22:00Z</expires>
               </url>
               <url>
                  <loc>http://example.com/res3</loc>
                  <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
               </url>
            </urlset>
         </xmpp:item>
      </xmpp:publish>
   </xmpp:pubsub>
</xmpp:iq>

Example 5.5: Push-based XMPP message containing a Change Set.

The xmpp:item element contains an identifier that is used within XMPP to distinguish between messages and, for example, to purge individual (persistent) messages from an XMPP server.

5.2.2. HTTP Callback

HTTP callback allows Sources to directly push Change Sets to registered Destinations without the need for other infrastructure.

Example 5.6 shows how a Source can advertise the availability of HTTP callback in its Sitemap using the xhtml:link element. The rs:protocol attribute indicates the protocol that this capability conforms to, and the href attribute gives the location of the subscription interface. The subscription interface provided by the Source allows Destinations to register their corresponding HTTP callback URIs.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/subscribe" 
               rs:protocol="http://example.com/protocol/callback"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2012-08-08T13:22:00Z</lastmod>
   </url>
</urlset> 

Example 5.6: Discovery link for pushing Change Sets via HTTP Callback.

With this method a Source can push Change Sets to the specified URIs of registered Destinations. It is again up to the Source to decide whether to push Change Sets containing only one change event or bundle multiple change events into one Change Set. Example 5.7 shows the same three change events as seen in Example 5.5 but communicated via the HTTP callback method in one Change Set.

>> Subscription Request << 
POST /subscribe HTTP/1.1
Host: example.com

callbackURI=http://aggregator.org/callback


>> Change Notification << 
POST /callback HTTP/1.1
Host: aggregator.org

<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <url>
       <loc>http://example.com/res1</loc>
       <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    </url>
    <url>
       <loc>http://example.com/res2</loc>
       <expires>2012-08-08T13:22:00Z</expires>
    </url>
    <url>
       <loc>http://example.com/res3</loc>
       <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
    </url>
</urlset>

Example 5.7: Push-based HTTP Callback method communicating a Change Set.
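The subscription step of Example 5.7 might be prepared as in the following Python sketch. This draft does not define the callback protocol itself, so everything here is an assumption modeled on the example: the callbackURI parameter name, the form-encoded request body (which percent-encodes the callback URI, unlike the literal form shown in the example), and the function name build_subscription_request. The sketch only builds the request object and does not send it.

```python
import urllib.parse
import urllib.request


def build_subscription_request(subscribe_uri, callback_uri):
    """Build (but do not send) a POST registering a Destination's callback URI."""
    # Assumption: the subscription interface accepts a form-encoded body
    # with a single callbackURI parameter, as suggested by Example 5.7.
    body = urllib.parse.urlencode({"callbackURI": callback_uri}).encode("ascii")
    return urllib.request.Request(
        subscribe_uri,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

A Destination would pass the returned object to urllib.request.urlopen to perform the registration, using the href of the discovery link from Example 5.6 as subscribe_uri.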

6. Providing Access to Versions

6.1. Historical Change Sets

As mentioned in Section 5.1, a Destination has no control over how many change events a Source includes in one Change Set. For scenarios with a relatively low resource change frequency, for example, a Change Set generated over the course of one day might not contain many changes and hence be of very reasonable size. However, in cases with high change frequencies, the same Change Set may grow to an extent that is unreasonable to be communicated to Destinations (again, this is at the Source's discretion to decide), or may provide unacceptable latency.

In order to enable Sources in high change frequency scenarios to communicate all changes, without having to accumulate all of them into one Change Set, a Source may provide historical Change Sets. These historical Change Sets can be seen as digests of past change events, covering a time span prior to the one covered by the current Change Set.

A Destination can access the historical Change Sets by following a link that is included in the current Change Set. Such a link can be seen in Example 6.1. The first link points to the URI of the current Change Set with the relation "current". The second link, with the relation "prev", points to a historical Change Set that covers changes that occurred in a time span previous and adjacent to the one covered by the current Change Set. This historical Change Set can in its turn include a link with a "prev" relation pointing at an even earlier historical Change Set, etc. A Destination can follow these links with a "prev" relation to collect all needed or available historical Change Sets. By analyzing the change events listed in the gathered Change Sets, for example looking at the datetime of each change, a Destination can determine whether it already processed a change. As soon as a Change Set is encountered that lists a previously processed change, there is no need to collect even more Change Sets.

With historical Change Sets a Destination has yet another option to "catch up" with a Source in case it has missed Change Sets and the Source has not yet generated a new Sitemap and a new Dump.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <xhtml:link href="http://example.com/changesets/most_recent.xml" 
               rel="current http://www.openarchives.org/rs/changeset"/>
   <xhtml:link href="http://example.com/changesets/20120807.xml" 
               rel="prev http://www.openarchives.org/rs/changeset"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
   </url>
   <url>
      <loc>http://example.com/res3</loc>
      <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
   </url>
</urlset>

Example 6.1: Change Set with links to historical Change Sets.
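
The traversal described above can be sketched in Python. This is a minimal illustration, not part of the specification: the function names, the caller-supplied fetch callable (which returns the XML text for a URI), and the assumption that a Destination remembers the datetime of the last change it processed are all hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def parse_time(text):
    # Assumes UTC datetimes in the form used by the examples, e.g. 2012-08-08T08:15:00Z.
    parsed = datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_change_set(xml_text):
    """Return (events, prev_uri); events is a list of (resource URI, datetime)."""
    root = ET.fromstring(xml_text)
    prev_uri = None
    for link in root.findall("xhtml:link", NS):
        # The rel attribute holds space-separated relation types.
        if "prev" in link.get("rel", "").split():
            prev_uri = link.get("href")
    events = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        stamp = (url.findtext("sm:lastmod", namespaces=NS)
                 or url.findtext("sm:expires", namespaces=NS))
        if loc and stamp:
            events.append((loc, parse_time(stamp)))
    return events, prev_uri


def collect_unprocessed(current_uri, last_processed, fetch):
    """Follow "prev" links until a Change Set lists an already processed change."""
    new_events = []
    visited = set()
    uri = current_uri
    while uri and uri not in visited:
        visited.add(uri)
        events, prev_uri = parse_change_set(fetch(uri))
        fresh = [event for event in events if event[1] > last_processed]
        new_events.extend(fresh)
        if len(fresh) < len(events):
            break  # this Change Set contains a change that was already processed
        uri = prev_uri
    # Oldest first, so that changes can be applied in order.
    return sorted(new_events, key=lambda event: event[1])
```

A Destination would call collect_unprocessed with the URI of the current Change Set, the datetime of the last change it handled, and its own HTTP retrieval function. Passing the retrieval function in keeps the sketch independent of any particular HTTP library.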

6.2. Historical Content

A Source may implement a capability that allows a Destination to obtain prior versions of resources. This capability is useful where Destinations need to obtain all versions of a resource, not just the current one. The ResourceSync framework offers two implementation alternatives.

6.2.1. Link to Version

In addition to having a generic URI that applies to all versions of a resource, a Source may mint a URI that is associated with each particular version. When communicating about the resource, its generic URI is provided in the loc element whereas the URI of the specific version of the resource (the historical content) can be provided using an xhtml:link element that has a relation type of "self" and of "memento". It is up to the Source to decide for how long the version resource remains accessible.

Example 6.2 shows a Change Set with version URIs included. In this example the version URIs are minted from the values of the lastmod and expires elements.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
      <xhtml:link href="http://example.com/20120808081500/res1" 
                  rel="memento"/>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
      <xhtml:link href="http://example.com/20120808132200/res2" 
                  rel="memento"/>
   </url>
   <url>
      <loc>http://example.com/res3</loc>
      <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
      <xhtml:link href="http://example.com/20120808145700/res3" 
                  rel="memento"/>
   </url>
</urlset>

Example 6.2: Change Set with links to a resource version.
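
As an illustration of how a Destination might read these links, the following Python sketch maps each generic URI in a Change Set to its version URI. The function name version_links is hypothetical, and the sketch assumes that the relation type "memento" appears as one of the space-separated tokens of the rel attribute, as in Example 6.2.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def version_links(xml_text, rel="memento"):
    """Map each generic resource URI (loc) to the URI of its listed version."""
    root = ET.fromstring(xml_text)
    versions = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        for link in url.findall("xhtml:link", NS):
            # rel may carry several space-separated relation types.
            if rel in link.get("rel", "").split():
                versions[loc] = link.get("href")
    return versions
```

Resources without a version link are simply absent from the returned mapping, so a Destination can fall back to the generic URI for them.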

6.2.2. Link to Memento TimeGate

In addition to having a generic URI that applies to all versions of a resource, a Source can associate a TimeGate with the resource, as per the Memento protocol [Memento Internet Draft]. A TimeGate supports negotiation in the datetime dimension to obtain a version of the resource as it existed at a specified moment in time, for example, the time provided in lastmod. When communicating about the resource, its generic URI is provided in the loc element whereas the URI of the TimeGate associated with the resource can be provided using an xhtml:link element that has a relation type of "timegate". It is up to the Source to decide for how long version resources remain accessible.

An example of a Change Set with links to a Memento TimeGate is shown in Example 6.3.

<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
   <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
   <url>
      <loc>http://example.com/res1</loc>
      <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
      <xhtml:link href="http://example.com/timegate/http://example.com/res1" 
                  rel="timegate"/>
   </url>
   <url>
      <loc>http://example.com/res2</loc>
      <expires>2012-08-08T13:22:00Z</expires>
      <xhtml:link href="http://example.com/timegate/http://example.com/res2" 
                  rel="timegate"/>
   </url>
   <url>
      <loc>http://example.com/res3</loc>
      <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
      <xhtml:link href="http://example.com/timegate/http://example.com/res3" 
                  rel="timegate"/>
   </url>
</urlset>

Example 6.3: Change Set with links to Memento TimeGate.
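
Datetime negotiation with a TimeGate can be sketched as follows. In the Memento protocol the desired moment is conveyed in an Accept-Datetime request header, formatted as an HTTP date. The helper names accept_datetime and timegate_request are hypothetical, and the sketch assumes lastmod values are UTC datetimes in the form used in Example 6.3. It only builds the request and does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime(lastmod):
    """Convert a value such as 2012-08-08T08:15:00Z to an HTTP date string."""
    moment = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ")
    moment = moment.replace(tzinfo=timezone.utc)
    return format_datetime(moment, usegmt=True)


def timegate_request(timegate_uri, lastmod):
    """Build (but do not send) a request asking the TimeGate for the version at lastmod."""
    headers = {"Accept-Datetime": accept_datetime(lastmod)}
    return Request(timegate_uri, headers=headers)


request = timegate_request(
    "http://example.com/timegate/http://example.com/res1",
    "2012-08-08T08:15:00Z",
)
print(accept_datetime("2012-08-08T08:15:00Z"))  # Wed, 08 Aug 2012 08:15:00 GMT
```

Passing the resulting request to urllib.request.urlopen would perform the negotiation. How the TimeGate responds is governed by the Memento protocol and is outside the scope of this sketch.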

7. Advertising Capabilities

7.1. robots.txt

Example 7.1 shows how Destinations can discover a Sitemap via a Source's robots.txt file.


   User-agent: *
   Sitemap: http://example.com/sitemap.xml

Example 7.1: robots.txt
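
A Destination could extract the advertised Sitemap URIs from a retrieved robots.txt file along the following lines. This is a minimal sketch: the function name sitemap_uris is hypothetical, and the parsing is deliberately simple (field names are matched case-insensitively and full-line comments are skipped).

```python
def sitemap_uris(robots_txt):
    """Return the URIs listed in Sitemap lines of a robots.txt document."""
    uris = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        # Split on the first colon only, since the URI itself contains colons.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap" and value.strip():
            uris.append(value.strip())
    return uris
```

Applied to the content of Example 7.1, the function returns a list with the single entry http://example.com/sitemap.xml.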

7.2. Discovery Links

7.2.1. xhtml:link Element

7.2.2. HTTP Link Headers

7.2.3. HTML Link Headers

7.3. host-meta Description

Based on Web Host Metadata specifications [RFC 6415]

8. References

[Web Architecture]
Architecture of the World Wide Web, Volume One, I. Jacobs and N. Walsh, Editors, World Wide Web Consortium, 15 January 2004.
[RFC 1864]
IETF RFC 1864: The Content-MD5 Header Field, J. Myers and M. Rose, October 1995.
[RFC 2616]
IETF RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, et al., June 1999.
[RFC 4287]
IETF RFC 4287: The Atom Syndication Format, M. Nottingham, R. Sayre, December 2005.
[RFC 4648]
IETF RFC 4648: The Base16, Base32, and Base64 Data Encodings, S. Josefsson, October 2006.
[RFC 5988]
IETF RFC 5988: Web Linking, M. Nottingham, October 2010.
[RFC 6120]
IETF RFC 6120: Extensible Messaging and Presence Protocol (XMPP): Core, P. Saint-Andre, March 2011.
[RFC 6415]
IETF RFC 6415: Web Host Metadata, E. Hammer-Lahav, B. Cook, October 2011.
[Sitemaps]
Sitemaps XML format and protocol, sitemaps.org, 27 February 2008.
[W3C Datetime]
Date and Time Formats, Misha Wolf, Charles Wicksteed, 15 September 1997.
[The Open Archives Initiative Protocol for Metadata Harvesting]
The Open Archives Initiative Protocol for Metadata Harvesting, C. Lagoze, H. Van de Sompel et al., December 2008.
[Memento Internet Draft]
Memento Internet Draft, H. Van de Sompel, M. L. Nelson, R. D. Sanderson, May 2012.
[XEP-0060: Publish-Subscribe]
XEP-0060: Publish-Subscribe, Peter Millard, Peter Saint-Andre, Ralph Meijer, July 2010.
[WARC]
WARC File Format, June 2006.

A. XML Element Overview

to come: brief introduction to Table A.1, which lists all XML elements introduced in this specification and the technologies in which they can be used.

to come: incorporate Dump somehow and show that a Manifest is required for it.

XML Element                      Technology
                                 Sitemap    Sitemap Index   Manifest   Change Set
<sitemap>                        -          required        -          -
<sitemapindex>                   -          required        -          -
<urlset>                         required   -               required   required
<url>                            required   -               required   required
<loc>                            required   required        required   required
<lastmod rs:type="updated"> or
<lastmod rs:type="created"> or
<expires>                        optional   optional        optional   required
<rs:fixity>                      optional   optional        optional   optional
<rs:size>                        optional   optional        optional   optional
<rs:mimetype>                    optional   optional        optional   optional
<rs:contentencoding>             optional   optional        optional   optional
<rs:path>                        -          -               required   -
<xhtml:meta>                     optional   optional        optional   optional
<xhtml:link>                     optional   optional        optional   optional

Table A.1: All covered XML elements and the technologies they are used for.

B. Alternate Dump Formats: WARC

to come

C. Acknowledgements

This specification is the work of NISO and the Open Archives Initiative. Funding for ResourceSync is provided by the Alfred P. Sloan Foundation. UK participation is supported by the JISC.

This specification is based on the meetings of the ResourceSync Technical Committee. The Technical Committee includes the editors and (in alphabetical order): Manuel Bernhardt (Delving B.V.), Richard Jones (Cottage Labs), Graham Klyne (University of Oxford), Stuart Lewis (University of Edinburgh), Kevin Ford (Library of Congress), David Rosenthal (LOCKSS), Christian Sadilek (Red Hat), Shlomo Sanders (Ex Libris, Inc.), Sjoerd Siebinga (Delving B.V.), Ed Summers (Library of Congress), and Jeff Young (Online Computer Library Center).

Check participant status and affiliations

D. Change Log

Date         Editor                              Description
2012-08-13   martin, herbert, simeon, bernhard   first alpha-spec draft

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.