Open Archives Initiative ResourceSync Framework Specification
1. Introduction
1.1 Motivating Examples
1.2 Notational Conventions
2. ResourceSync Basics
2.1 Walkthrough
2.2 Overview
2.2.1 Destination Perspective
2.2.2 Source Perspective
3. Describing Content
3.1 Sitemap
3.1.1 loc
3.1.2 lastmod and expires
3.1.3 rs:fixity
3.1.4 rs:size
3.1.5 rs:mimetype
3.1.6 rs:contentencoding
3.1.7 xhtml:meta and xhtml:link
3.2 Large Sitemaps
4. Transferring Content
4.1 HTTP Content Transfer
4.2 Dump
4.2.1 Manifest
4.3 Alternate Content Transfer
4.3.1 Alternate Content Location
4.3.2 Partial Content
4.3.3 Alternate Interpretation
5. Communicating Change Events
5.1 Change Sets
5.2 Pushing Change Sets
5.2.1 XMPP
5.2.2 HTTP Callback
6. Providing Access to Versions
6.1 Historical Change Sets
6.2 Historical Content
6.2.1 Link to Version
6.2.2 Link to Memento TimeGate
7. Advertising Capabilities
7.1 robots.txt
7.2 Discovery Links
7.2.1 xhtml:link Element
7.2.2 HTTP Link Headers
7.2.3 HTML Link Headers
7.3 host-meta Description
8. References
A. XML Element Overview
B. Alternate Dump Formats: WARC
C. Acknowledgements
D. Change Log
The Web is highly dynamic, with resources continuously being created, updated, and deleted. As a result, the use of resources from a remote server involves the challenge of remaining in step with its changing content. In many cases, there is no need to perfectly reflect a server's evolving content and therefore well established resource discovery techniques, such as recurrent Web harvesting, suffice as an updating mechanism. However, there are significant use cases that require low latency and high accuracy in reflecting a remote server's changing content. These requirements have typically been addressed by ad-hoc technical approaches implemented within a small group of collaborating servers. There have been no widely adopted, Web-based approaches.
This ResourceSync specification introduces a range of easy to implement capabilities that a server may support in order to enable remote servers to remain more tightly in sync with its evolving resources. It also describes how a server can advertise the capabilities it supports. Remote servers can inspect this information to determine how to best remain aligned with evolving content.
Each capability provides a different synchronization functionality, such as a list of a server's resources or its recently changed resources, including what the nature of the change was: create, update, or delete. Most capabilities are based on extensions for Sitemaps and new ways to use them. Capabilities can be combined to achieve varying levels of functionality and hence meet different local or community requirements. This modularity provides flexibility and makes ResourceSync suitable for a broad range of use cases.
This document is structured as follows:
Many projects and services have synchronization needs and have implemented ad hoc solutions. ResourceSync provides a standard synchronization method that will reduce implementation effort and facilitate easier reuse. This section describes four motivating examples with differing needs and complexities.
Consider first the case of a website for a small museum collection. The website may contain just a few dozen static web pages. With standard tools the maintainer can create a Sitemap to enhance harvesting by commodity search engines. In doing so the information is also available to services using ResourceSync.
When building services over Linked Data it is often desirable to maintain a local copy of key data for improved access and availability. Harvesting can be enabled by publishing a ResourceSync Sitemap for the collection. In many cases Linked Data records are small and so harvesting via individual HTTP GET requests is slow because of the large number of round-trips for a small amount of content. Publishing a dump in which content is aggregated in a ZIP file in a standard way makes this more efficient for the client and less burdensome for the server. Continued synchronization is enabled by either updating the Sitemap or, more efficiently, by publishing change sets listing only the changed resources and/or content dumps.
The arXiv.org archive of scientific articles has used a custom mirroring solution to propagate resource changes to a set of mirror sites and interacting services on a daily basis. There are about 2.4 million resource files with about 1600 changes (creates, updates) per day. The mirroring system currently in place uses HTTP with custom change descriptions, and occasional rsync to verify the copies and to cope with any errors in the incremental updates. It would be desirable to have a solution that allows any interested third-party service to synchronize with arXiv using standard software. Both accuracy and low implementation barrier are important. Within ResourceSync, arXiv.org could publish each metadata and full-text record as a separate web resource with its own URI. In this one-to-many scenario multiple clients (such as the mirror lanl.arXiv.org or any third party) could stay accurately in synchronization with either all or a portion of arXiv.org. This would extend the article metadata sharing (currently provided via OAI-PMH) to full-text in a web friendly fashion.
It is important to have access to the most recent versions of data resources in order to maintain efficient and accurate computation. DBPedia is a frequently used set of Linked Data, and is updated up to twice a second. While it may not be important to maintain second-granularity synchronization, there are millions of resources changing at a very high rate and existing solutions are unable to provide acceptable latency. The ResourceSync framework enables a push-based framework for alerting interested clients about changes using a publish and subscribe methodology. This builds upon ResourceSync's pull-based approaches, simply changing the network transport layer to a more appropriate technique for high throughput. The resources may be synchronized by a simple HTTP GET call, or by transferring the changes only using more advanced techniques.
This specification uses the terms "resource", "request", "response", "entity", "entity-body", "entity-header", "content negotiation", "client", "user agent", and "server" as described in [RFC 2616].
Throughout this document, the following namespace prefix bindings are used:
Prefix | Namespace URI | Description |
---|---|---|
(none) | http://www.sitemaps.org/schemas/sitemap/0.9 | Sitemap XML elements defined in the Sitemap protocol |
xhtml | http://www.w3.org/1999/xhtml | Elements introduced in the XHTML namespace |
xmpp | http://jabber.org/protocol/pubsub | Elements of the PubSub extension to the XMPP protocol |
rs | http://www.openarchives.org/rs/terms/ | Elements introduced and defined in this specification |
Table 1.1: Namespace prefix bindings used in this document
This section provides an overview of the various ResourceSync capabilities that a server may support in order to enable remote servers to become and remain synchronized with its evolving resources. The following terms are introduced:
Let's assume a Source, http://example.com/, that wants to make it easy for Destinations to follow its changing content. A very basic first step towards that goal is for this Source to publish a Sitemap like many servers already do. A Sitemap lists the URIs of resources that a Source wants Destinations to know about, as shown in Example 2.1.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
Example 2.1: A basic Sitemap
A Destination can find out about the existence of a Sitemap in the Source's robots.txt file, published at the conventional location: http://example.com/robots.txt. Example 2.2 shows a robots.txt file that indicates the Source's Sitemap is available at http://example.com/sitemap.xml. The Destination can use the information in the Sitemap to start collecting the Source's content by issuing HTTP GET requests against the listed URIs.
User-agent: *
Sitemap: http://example.com/sitemap.xml
Example 2.2: A robots.txt file pointing to a Sitemap
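The discovery step described above can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function name `sitemap_urls` is hypothetical, and the robots.txt body is passed in as a string so that no network access is needed.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs given on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split the field name from its value.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# The robots.txt content of Example 2.2.
robots = "User-agent: *\nSitemap: http://example.com/sitemap.xml\n"
print(sitemap_urls(robots))
```

A Destination would then issue an HTTP GET against each returned URI to obtain the Sitemap itself.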
The Source can provide additional information in the Sitemap to help the Destination with optimizing the process of collecting content. For example, if a Destination has previously acted upon a Source's Sitemap, it would be good to allow it to determine whether the Sitemap itself has changed since its last visit or whether specific resources have changed since then. Also, the Destination may not be interested in all of the Source's content but only content with a certain topic. A Source can express such information in a Sitemap using existing Sitemap elements or extension elements introduced by ResourceSync. Example 2.3 shows a Sitemap in which its time of publication was added as well as the last modification date and categories for the listed resources. A Destination can use such information to minimize the number of HTTP requests it needs to issue in order to remain up-to-date with the content it requires.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
  </url>
</urlset>
Example 2.3: A Sitemap with additional information
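As a rough illustration of how a Destination might use this additional information, the following Python sketch selects only those resources that carry a given subject and were modified after the Destination's last visit. The names `parse_w3c` and `select_resources` are hypothetical, the embedded document is an abbreviated copy of Example 2.3, and the sketch assumes that timestamps use the full UTC form with a trailing Z.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Abbreviated copy of Example 2.3, embedded so the sketch is self-contained.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
  </url>
</urlset>
"""


def parse_w3c(value):
    """Parse a complete W3C Datetime; 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_resources(sitemap_xml, subject_uri, since):
    """Return URIs tagged with subject_uri and modified after `since`."""
    wanted = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        subjects = {
            link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "DCTERMS.subject"
        }
        # Without a lastmod the Destination cannot rule out a change.
        changed = lastmod is None or parse_w3c(lastmod) > since
        if subject_uri in subjects and changed:
            wanted.append(loc)
    return wanted


last_visit = parse_w3c("2012-08-08T00:00:00Z")
frogs = "http://en.wikipedia.org/wiki/Category:Frogs"
print(select_resources(SITEMAP, frogs, last_visit))
```

Only the selected URIs then need an HTTP GET, which is how the extra Sitemap information reduces the number of requests.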
In order to describe its changing content in a more timely manner, a Source can increase the frequency at which it publishes an up-to-date Sitemap. But changes may be so frequent or the size of the content collection so vast that updating a complete Sitemap may be impractical. In such cases, a Source can implement an additional capability that focuses on communicating information about changes only. To this end, ResourceSync introduces Change Sets. A Change Set is a special-purpose Sitemap that lists only recently changed resources as well as the nature of their change: create, update, delete. It is up to a Source to decide what the temporal interval is that is covered by a Change Set, for example, listing all changes that occurred during the previous hour, the current day, or since the most recent publication of a Sitemap. Example 2.4 shows a Change Set that lists two change events, one update and one deletion. It also contains some of the additional information that was described above.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
  </url>
</urlset>
Example 2.4: A Change Set
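To make the structure of a Change Set concrete, the following Python sketch turns one into a list of change events. It is an illustration under stated assumptions: the function name `change_events` and the label "unspecified" for a lastmod without an rs:type attribute are choices made here, not terms from this specification, and the embedded document is an abbreviated copy of Example 2.4.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
NS = {"sm": SM, "rs": RS}

# Abbreviated copy of Example 2.4.
CHANGESET = """\
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
</urlset>
"""


def change_events(changeset_xml):
    """Return (uri, change type, timestamp) tuples in document order."""
    events = []
    for url in ET.fromstring(changeset_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        expires = url.find("sm:expires", NS)
        lastmod = url.find("sm:lastmod", NS)
        if expires is not None:
            # An expires element signals that the resource was deleted.
            events.append((loc, "deleted", expires.text))
        elif lastmod is not None:
            # rs:type is a namespaced attribute on lastmod.
            kind = lastmod.get(f"{{{RS}}}type", "unspecified")
            events.append((loc, kind, lastmod.text))
    return events


for event in change_events(CHANGESET):
    print(event)
```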
One way by which a Destination can find out whether a Source supports the Change Set capability is by inspecting its Sitemap. The Sitemap in Example 2.5 shows a link to the Source's current Change Set that is available at http://example.com/changesets/most_recent.xml. A Destination can recurrently issue an HTTP GET request against this URI to obtain information about recent changes that occurred at the Source, compare those with changes it already acted upon, and process the remaining ones. In order to allow a Destination to remain even more tightly synchronized with a Source, ResourceSync also introduces a capability that consists of a Source recurrently pushing Change Sets that describe new change events to a Destination via publish/subscribe technology.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <xhtml:link href="http://example.com/changesets/most_recent.xml" rel="current http://www.openarchives.org/rs/changeset"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
Example 2.5: A basic Sitemap with a pointer to a Change Set
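A Destination could locate the Change Set pointer along the following lines. This Python sketch is illustrative only: the function name `changeset_links` is hypothetical, and it assumes that the rel attribute holds space-separated relation types, one of which is the Change Set relation URI shown in Example 2.5.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}
CHANGESET_REL = "http://www.openarchives.org/rs/changeset"

# Abbreviated copy of Example 2.5.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:link href="http://example.com/changesets/most_recent.xml"
              rel="current http://www.openarchives.org/rs/changeset"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
</urlset>
"""


def changeset_links(document_xml):
    """Map each companion relation (e.g. 'current') to a Change Set URI."""
    links = {}
    root = ET.fromstring(document_xml)
    # Only links that are direct children of urlset describe the document
    # itself; links inside url elements describe individual resources.
    for link in root.findall("xhtml:link", NS):
        rels = link.get("rel", "").split()
        if CHANGESET_REL in rels:
            for rel in rels:
                if rel != CHANGESET_REL:
                    links[rel] = link.get("href")
    return links


print(changeset_links(SITEMAP))
```

An empty result would indicate that the Sitemap does not advertise a Change Set in this way.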
It may occur that a Destination is not always able to process the current Change Set before the Source replaces it with a new one, for example, because it goes off-line. When becoming operational again, the Destination may want to catch up with changes that occurred and would likely do so by obtaining the Source's current Change Set. However, while this Change Set will contain information about the recent changes that occurred at the Source, it may not cover all changes for the entire period during which the Destination was unavailable.
To address this problem, a Source may implement a memory capability that allows a Destination to obtain a historical overview of changes going back to before those listed in the current Change Set. This overview is made available as one or more interlinked historical Change Sets, each covering changes that occurred in a given time interval. Example 2.6 shows the Source's current Change Set but this time with the inclusion of a link to a historical Change Set, which is available at http://example.com/changesets/20120807.xml. This historical Change Set may link to a prior Change Set using the same mechanism. The example also shows that the Change Set includes a link to itself, expressing that it is the current one. With this memory capability in place, a Destination can collect one or more historical Change Sets, moving backwards in time, following the links that have both the "prev" and "http://www.openarchives.org/rs/changeset" relation types. Once a historical Change Set is obtained that includes a change that the Destination already acted upon, it can stop collecting even older changes and start acting upon the unprocessed ones.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <xhtml:link href="http://example.com/changesets/20120807.xml" rel="prev http://www.openarchives.org/rs/changeset"/>
  <xhtml:link href="http://example.com/changesets/most_recent.xml" rel="current http://www.openarchives.org/rs/changeset"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
    <xhtml:link rel="DCTERMS.subject" href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
  </url>
</urlset>
Example 2.6: A Change Set with a link to a historical Change Set
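The catch-up procedure described above, walking backwards over "prev" links until a known change is found, might be sketched as follows in Python. Everything here is an assumption made for illustration: the helper names (`make_changeset`, `events_in`, `prev_url`, `catch_up`), the in-memory `STORE` that stands in for HTTP GET requests, the resource res3, and the content of the historical Change Set are all hypothetical. A change event is identified by its URI and timestamp.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML = "http://www.w3.org/1999/xhtml"
NS = {"sm": SM, "xhtml": XHTML}
CHANGESET_REL = "http://www.openarchives.org/rs/changeset"


def make_changeset(prev, entries):
    """Build a tiny Change Set for the demo; entries are (loc, tag, time)."""
    link = f'<xhtml:link href="{prev}" rel="prev {CHANGESET_REL}"/>' if prev else ""
    urls = "".join(
        f"<url><loc>{loc}</loc><{tag}>{when}</{tag}></url>"
        for loc, tag, when in entries
    )
    return f'<urlset xmlns="{SM}" xmlns:xhtml="{XHTML}">{link}{urls}</urlset>'


CURRENT = "http://example.com/changesets/most_recent.xml"
OLDER = "http://example.com/changesets/20120807.xml"
STORE = {  # hypothetical stand-in for the Source's published documents
    CURRENT: make_changeset(OLDER, [
        ("http://example.com/res1", "lastmod", "2012-08-08T08:15:00Z"),
        ("http://example.com/res2", "expires", "2012-08-08T13:22:00Z"),
    ]),
    OLDER: make_changeset(None, [
        ("http://example.com/res1", "lastmod", "2012-08-07T09:00:00Z"),
        ("http://example.com/res3", "lastmod", "2012-08-07T10:00:00Z"),
    ]),
}


def events_in(root):
    """Return (uri, timestamp) pairs for every url element."""
    events = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        when = (url.findtext("sm:expires", namespaces=NS)
                or url.findtext("sm:lastmod", namespaces=NS))
        events.append((loc, when))
    return events


def prev_url(root):
    """Return the href of the link typed both 'prev' and Change Set."""
    for link in root.findall("xhtml:link", NS):
        rels = link.get("rel", "").split()
        if "prev" in rels and CHANGESET_REL in rels:
            return link.get("href")
    return None


def catch_up(fetch, current_url, processed):
    """Collect unprocessed events, stopping at the first known change."""
    pending, url, visited = [], current_url, set()
    while url is not None and url not in visited:
        visited.add(url)
        events = events_in(ET.fromstring(fetch(url)))
        pending.extend(e for e in events if e not in processed)
        if any(e in processed for e in events):
            break  # this Change Set overlaps with what was already handled
        url = prev_url(root=ET.fromstring(fetch(url)))
    # All timestamps share the same UTC form, so string order is time order.
    return sorted(pending, key=lambda event: event[1])


done = {("http://example.com/res1", "2012-08-07T09:00:00Z")}
for event in catch_up(STORE.__getitem__, CURRENT, done):
    print(event)
```

The `visited` set guards against a loop of "prev" links; a real Destination would replace `STORE.__getitem__` with a function that performs the HTTP GET.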
The previous section provides a concrete walkthrough of some capabilities that a Source can implement and it describes how a Destination can leverage those capabilities to remain aligned with the Source's changing content. This section provides a high-level overview of the various ResourceSync capabilities and it shows how these fit in a Destination's processes aimed at remaining in step with changes. This overview is summarized in Table 2.1, which lists Destination processes as columns and Source capabilities as rows, with cells indicating the usability of a capability for a given process. The next sections provide technical details about each ResourceSync capability.
Source Capabilities | Baseline Synchronization | Incremental Synchronization | Audit |
---|---|---|---|
Describing Content | | | |
Sitemaps | X | | X |
Transferring Content | | | |
HTTP GET | X | X | |
Dump | X | | |
Alternate Content Transfer | X | X | |
Communicating Change Events | | | |
Change Sets | | X | X |
Pushing Change Sets | | X | X |
Providing Access to Versions | | | |
Historical Change Sets | | X | X |
Historical Content | | X | |
Table 2.1: Source capabilities versus Destination processes
From the perspective of a Destination, three key processes are enabled by the ResourceSync capabilities:
Baseline Synchronization - In order to become synchronized with a Source, the Destination must make an initial copy of the content of a Source. This requires a list of resources hosted by a Source (Sitemap) and obtaining those resources (Dump, HTTP GET, Alternate Content Transfer).
Incremental Synchronization - A Destination may remain in sync with a Source by repeatedly performing a Baseline Synchronization but this will be inefficient in many situations. To increase efficiency, a Source may communicate information about change events that involve its resources (Change Sets, Pushing Change Sets). This allows a Destination to only obtain new and updated resources (HTTP GET, Alternate Content Transfer). In order to cope with outages, or changes at the Source that occur more frequently than the Destination attempts to synchronize, the Source may keep a historical record of change events and/or versions of resources as they change over time (historical Change Sets, historical Content).
Audit - In order to verify whether it is in sync with the Source, a Destination must be able to check that the content it obtained matches the current resources hosted by the Source. This requires a list of resources hosted by the Source (Sitemap, Change Set, historical Change Set), and metadata that characterizes the resources' most recent state, such as last modification time, size, and fixity.
From the perspective of a Source, the ResourceSync capabilities that can be supported to enable Destinations to remain in sync with changing content can be grouped into four categories:
Describing Content - In order to describe its content, a Source can recurrently make an up-to-date Sitemap available. A basic Sitemap provides the URIs of resources that the Source wants Destinations to know about. But additional information can be added to the Sitemap to optimize the Destination's process of obtaining a Source's resources. Such information includes the Sitemap's publication time and the last modification time and categories for resources.
Transferring Content - The default mechanism to obtain a resource is to issue an HTTP GET against its URI. But the Source may support two additional content transfer capabilities:
Communicating Change Events - In order to achieve low synchronization latency, a Source may communicate information about change events that involve its resources:
Providing Access to Versions - In order to allow a Destination to catch up with missed changes that occurred at the Source, the Source may keep a historical record of change events and/or versions of resources as they change over time:
A Source may publish a description of its content in order to allow Destinations to keep track of the content state. This information enables a Destination to make a copy of all or part of the content, or to update a local copy to remain synchronized with changes at a Source. The Sitemap format was created to improve the efficiency and reliability of web harvesting and is the basis of content description within ResourceSync. Optional extensions provide facilities for improved synchronization and verification.
ResourceSync leverages the wide-spread adoption and tool-support of the Sitemaps XML format.
Destinations can discover a Sitemap via a Source's robots.txt file as, for example, shown in Example 3.1.
User-agent: *
Sitemap: http://example.com/sitemap.xml
Example 3.1: Minimal robots.txt file
A minimal Sitemap is simply a list of all of the resources provided by a Source. The structure of a Sitemap is shown in Example 3.2. It must have the urlset root element and information about each resource is contained within a url element. This example shows a single resource http://example.com/res1.
It is recommended that a last modification time for the entire Sitemap be included using an xhtml:meta element with date and time conforming to the W3C Datetime syntax.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
</urlset>
Example 3.2: Minimal Sitemap structure and xhtml:meta
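A Destination can use the Sitemap-level modification time to decide whether a previously processed Sitemap needs to be looked at again. The Python sketch below illustrates one way to do so. The function names `sitemap_modified` and `needs_processing` are hypothetical, and the sketch assumes a complete timestamp in UTC with a trailing Z, as used in Example 3.2.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Copy of Example 3.2 without the XML declaration.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
</urlset>
"""


def sitemap_modified(sitemap_xml):
    """Return the Sitemap's DCTERMS.modified time, or None if absent."""
    root = ET.fromstring(sitemap_xml)
    for meta in root.findall("xhtml:meta", NS):
        if meta.get("name") == "DCTERMS.modified":
            # Rewrite 'Z' so that older Python versions accept the value.
            return datetime.fromisoformat(meta.get("content").replace("Z", "+00:00"))
    return None


def needs_processing(sitemap_xml, last_visit):
    """True if the Sitemap changed after last_visit or gives no time."""
    modified = sitemap_modified(sitemap_xml)
    return modified is None or modified > last_visit


last_visit = datetime.fromisoformat("2012-08-08T12:00:00+00:00")
print(needs_processing(SITEMAP, last_visit))
```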
Information about each resource described by a Sitemap is conveyed within a url element. At minimum the location of the resource must be specified using the loc element. All other information is optional. Other elements from the Sitemaps format, or from other schemas, are permitted. Consuming applications should ignore unrecognized content or elements. Elements that are useful for ResourceSync are summarized in Table 3.1, Table 3.2, and Table 3.3 and described in the sections that follow.
Element | Use | Description |
---|---|---|
<loc> | required | URL of the resource as defined in the Sitemaps protocol. |
<lastmod rs:type="created"> | optional | Date of last modification of the resource as defined in the Sitemaps protocol and expressed as a W3C Datetime. If attribute "created" is given, the type of modification equals a creation of the resource. |
<lastmod rs:type="updated"> | optional | Date of last modification of the resource as defined in the Sitemaps protocol and expressed as a W3C Datetime. If attribute "updated" is given, the type of modification equals an update of the resource. |
<expires> | optional | Date of deletion of the resource. This date must be in the past. Expressed as a W3C Datetime. |
Table 3.1: Child elements of the url element to identify the resource and express change types.
Element | Use | Description |
---|---|---|
<rs:fixity> | optional, repeatable | Digest of the entity-body of a resource representation, computed using one of several algorithms. For most applications the MD5 digest defined in RFC 2616, Sec. 14.15 is recommended. |
<rs:size> | optional | Size of the entity-body of a resource representation. The value must be equal to the value of the Content-Length entity-header in the HTTP response and must be computed as defined in RFC 2616, Sec. 4.4 |
<rs:mimetype> | optional | MIME-Type of the entity-body of a resource representation. The value must be equal to the value of the Content-Type entity-header in the HTTP response as defined in RFC 2616, Sec. 14.17 |
<rs:contentencoding> | optional | Content encoding of the entity-body of a resource representation. The value must be equal to the value of the Content-Encoding entity-header in the HTTP response as defined in RFC 2616, Sec. 14.11 |
Table 3.2: Child elements of the url element to express representation specific information.
Element | Use | Description |
---|---|---|
<xhtml:meta> | optional, repeatable | Keyword or term assigned to a resource, which may originate from existing controlled vocabularies. This element may be repeated to indicate multiple categories. |
Table 3.3: Child element of the url element expressing keywords usable for filtering.
The loc element is used to convey the location of each resource described. Within each url element there should be exactly one loc element as defined in the Sitemaps XML format. It should contain a dereferenceable URI from which a client may download content. Example 3.3 below shows a minimal Sitemap describing the locations of two resources.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
  <!-- one url element for each resource ... -->
</urlset>
Example 3.3: Simple Sitemap describing two resource locations
With the location information alone, a Destination can retrieve content from the listed resources. By doing this repeatedly the Destination can check whether the content has changed. Such use may be sufficient for some small-scale use cases but would be an inefficient way to synchronize large collections, or collections that change frequently.
The lastmod and expires elements may be used to convey the last modification or deletion time of the resource. This information allows a client to determine whether or not there is new content to download. It is recommended that the last modification or deletion time be included with each url element.
The content of lastmod is defined by the Sitemaps XML format and must conform to the W3C Datetime syntax. The use of a complete date and time expressed in UTC with the form YYYY-MM-DDThh:mm:ss[.s]Z is recommended. Note that UTC indication or time zone offset specification is mandatory if time information is included.
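A Source that generates Sitemaps programmatically could produce the recommended form as in the following Python sketch. The function name `w3c_utc` is hypothetical, fractional seconds are omitted for brevity, and timestamps without time zone information are rejected rather than guessed at.

```python
from datetime import datetime, timedelta, timezone


def w3c_utc(moment: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if moment.tzinfo is None:
        # A time without zone information cannot be converted reliably.
        raise ValueError("timestamp must carry time zone information")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# 18:30 at an offset of +02:00 is 16:30 in UTC.
local = datetime(2012, 8, 8, 18, 30, tzinfo=timezone(timedelta(hours=2)))
print(w3c_utc(local))  # 2012-08-08T16:30:00Z
```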
The last modification information can be enhanced with an indication about the resource change type. For an updated or created resource the lastmod element can be given the attribute rs:type with the value "updated" or "created" accordingly. For a deleted resource the expires element, which is already commonly used in Sitemaps, should be used instead. The value of the expires element must conform to the W3C Datetime syntax.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="created">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
</urlset>
Example 3.4: Use of lastmod and expires elements.
Addition of the last modification information allows a client to check for updates without accessing each resource individually. A Destination may compare the last modification time with that of a local copy and thus determine whether there has been a change and perhaps new content should be downloaded.
In case of expires, the local copy of the corresponding content should be removed.
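The comparison described above might look as follows in Python. This is a sketch under assumptions: the function name `plan_sync` and the shape of `local` (a mapping from URI to the modification time of the local copy) are choices made for illustration, and the embedded document is an abbreviated copy of Example 3.4.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Abbreviated copy of Example 3.4.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="created">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
</urlset>
"""


def parse_w3c(value):
    """Parse a complete W3C Datetime given in UTC with a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def plan_sync(sitemap_xml, local):
    """Return (uris to download, uris to remove) for the local copies."""
    download, remove = [], []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if url.find("sm:expires", NS) is not None:
            # The resource was deleted; drop the local copy if one exists.
            if loc in local:
                remove.append(loc)
            continue
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Download when there is no local copy, no lastmod to compare
        # against, or the Source's copy is newer than the local one.
        if loc not in local or lastmod is None or parse_w3c(lastmod) > local[loc]:
            download.append(loc)
    return download, remove


local = {
    "http://example.com/res1": parse_w3c("2012-08-01T00:00:00Z"),
    "http://example.com/res2": parse_w3c("2012-08-01T00:00:00Z"),
}
print(plan_sync(SITEMAP, local))
```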
The rs:fixity element may be used to convey fixity information in the form of a digest of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:fixity element should not be used.
The rs:fixity element has a mandatory type attribute that specifies the type and format of the digest as shown in Table 3.4. The md5 digest requires little effort to compute, is small to transfer, and is likely adequate for most change detection scenarios. It is thus recommended that the md5 digest be used as the default. However, md5 digests are not strong and therefore should not be used to guarantee authenticity. For this purpose, digests such as sha-256 would be appropriate. Multiple rs:fixity elements may be used to convey multiple digests using different algorithms.
type | Description |
---|---|
md5 | MD5 digest of the entity-body encoded in base64 as defined for the Content-MD5 header in [RFC 2616, Sec. 14.15] and [RFC 1864], e.g. Q2hlY2sgSW50ZWdyaXR5IQ==. |
sha-1 | SHA-1 digest of the entity-body encoded in base64 according to [RFC 4648]. |
sha-256 | SHA-256 digest of the entity-body encoded in base64 according to [RFC 4648]. |
Table 3.4: Defined values for fixity type.
Example 3.5 shows use of the rs:fixity element to convey MD5 entity-body digests.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt==</rs:fixity>
  </url>
</urlset>
Example 3.5: Use of the rs:fixity element
Fixity information may be used as a supplement or alternative to last modification time, as a means to allow clients to detect whether content has changed as compared to a local copy. Fixity information provides a much better means to detect corruption of a downloaded copy than other descriptive information, and thus supports checking of a downloaded copy without having to download it again.
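The digest values of Table 3.4 can be computed with standard library routines. The Python sketch below shows one way to do this; the function name `fixity` and the `ALGORITHMS` mapping are hypothetical, and the input is assumed to be the entity-body exactly as returned by the Source.

```python
import base64
import hashlib

# Maps the rs:fixity type values of Table 3.4 to hashlib constructors.
ALGORITHMS = {
    "md5": hashlib.md5,
    "sha-1": hashlib.sha1,
    "sha-256": hashlib.sha256,
}


def fixity(body: bytes, kind: str = "md5") -> str:
    """Return the base64-encoded digest of an entity-body."""
    digest = ALGORITHMS[kind](body).digest()  # raw bytes, not hexadecimal
    return base64.b64encode(digest).decode("ascii")


body = b"<html><body>Hello</body></html>"
for kind in ALGORITHMS:
    print(kind, fixity(body, kind))
```

Note that the digest is taken over the raw digest bytes rather than their hexadecimal rendering, which is why an MD5 value is always 24 characters long including padding.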
The rs:size element may be used to convey the size of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:size element should not be used.
The value of the rs:size element should be equal to the value of the Content-Length entity-header in the HTTP response (if present) and must be computed as defined in RFC 2616, Sec. 4.4. Example 3.6 shows use of the rs:size element.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
    <rs:size>15672</rs:size>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt==</rs:fixity>
    <rs:size>93660664</rs:size>
  </url>
</urlset>
Example 3.6: Use of rs:size.
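Size and fixity together support a simple audit of a local copy: the inexpensive size check comes first, and the digest is computed only if the sizes agree. The Python sketch below illustrates the idea. The function name `audit` and its return strings are hypothetical, and the stored bytes are assumed to be the entity-body exactly as it was transferred.

```python
import base64
import hashlib


def audit(body: bytes, size=None, md5_b64=None) -> str:
    """Compare a local copy against rs:size and rs:fixity (md5) values."""
    if size is not None and len(body) != size:
        return "size mismatch"
    if md5_b64 is not None:
        actual = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
        if actual != md5_b64:
            return "digest mismatch"
    return "ok"


copy = b"Hello, ResourceSync"
expected = base64.b64encode(hashlib.md5(copy).digest()).decode("ascii")

print(audit(copy, size=len(copy), md5_b64=expected))
print(audit(copy + b"!", size=len(copy), md5_b64=expected))
print(audit(copy.lower(), size=len(copy), md5_b64=expected))
```

The three calls cover a matching copy, a copy whose length differs, and a copy of the right length whose content differs.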
The rs:mimetype element may be used to convey the MIME-Type of the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:mimetype element should not be used.
The value of the rs:mimetype element should be equal to the value of the Content-Type entity-header in the HTTP response (if present) and the value should be defined in the IANA media type registry. The use is optional and not repeatable.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
    <rs:size>15672</rs:size>
    <rs:mimetype>text/html; charset=utf-8</rs:mimetype>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt==</rs:fixity>
    <rs:size>93660664</rs:size>
    <rs:mimetype>application/pdf</rs:mimetype>
  </url>
</urlset>
Example 3.7: Use of rs:mimetype.
The rs:contentencoding element may be used to convey the type of encoding used on the entity-body obtained when the resource's URL is dereferenced. This information is only useful in the case that there is a consistent representation returned. If that is not the case, for example because of varying or content-negotiated content, then the rs:contentencoding element should not be used.
The value of the rs:contentencoding element should be equal to the value of the Content-Encoding entity-header in the HTTP response (if present). The use is optional and not repeatable.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
    <rs:size>15672</rs:size>
    <rs:contentencoding>gzip</rs:contentencoding>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt=</rs:fixity>
    <rs:size>93660664</rs:size>
    <rs:contentencoding>compress</rs:contentencoding>
  </url>
</urlset>
Example 3.8: Use of rs:contentencoding.
The xhtml:meta and xhtml:link elements may be used to convey information useful to filter or select resources of interest. Typical uses would be to indicate grouping or classification of resources where some groups or classifications might be selected by a Destination. If the information to be conveyed includes a URI, the xhtml:link element should be used; otherwise the xhtml:meta element should be used. Example 3.9 shows how both elements can be used.
No restrictions are placed on the grouping scheme or the form of the category strings. However, the use of URIs from web ontologies or other controlled vocabularies will likely make this information more useful and is thus recommended.
Both elements are optional and may be repeated to indicate multiple categories or tags that apply to a resource.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs"
                title="Frogs"/>
    <xhtml:meta name="DC.subject" content="Crocodiles"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Fish"
                title="Fish"/>
  </url>
</urlset>
Example 3.9: Use of the xhtml:link and xhtml:meta elements to enable filtering of content descriptions.
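A Destination-side filter over these elements could look roughly like the sketch below. It selects the resources whose xhtml:link href or xhtml:meta content matches one of the wanted categories. The function name `select_by_subject` and the exact-match rule are assumptions for illustration; this specification does not prescribe how a Destination evaluates categories.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Per-resource part of Example 3.9 (XML declaration omitted).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://example.com/res1</loc>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs" title="Frogs"/>
    <xhtml:meta name="DC.subject" content="Crocodiles"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Fish" title="Fish"/>
  </url>
</urlset>"""


def select_by_subject(sitemap_xml, wanted):
    """Return the loc of every resource tagged with at least one wanted category.

    A category matches either the href of an xhtml:link child or the
    content of an xhtml:meta child of the url element (exact string match).
    """
    wanted = set(wanted)
    root = ET.fromstring(sitemap_xml)
    selected = []
    for url in root.findall("sm:url", NS):
        tags = {link.get("href") for link in url.findall("xhtml:link", NS)}
        tags |= {meta.get("content") for meta in url.findall("xhtml:meta", NS)}
        if tags & wanted:
            selected.append(url.findtext("sm:loc", namespaces=NS))
    return selected


frogs_or_crocodiles = select_by_subject(SAMPLE, ["Crocodiles"])
```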
The Sitemaps XML format specifies that a single Sitemap must not include more than 50,000 url elements and must not be larger than 10MB in uncompressed format. A Sitemap Index may be used to list up to 50,000 individual Sitemap files and thus extend the format to up to 2.5 billion resources.
ResourceSync does not change the Sitemap Index format; examples are included here for convenience. A Sitemap Index has a format very similar to a Sitemap. The root element is sitemapindex and each Sitemap is described in a sitemap element. For each Sitemap the location is specified with the loc element (cf. 3.1.1 loc) and, optionally, the last modification time for the Sitemap may be specified with the lastmod element (cf. 3.1.2 lastmod and expires).
It is recommended that a last modification time for the entire Sitemap Index be included using an xhtml:meta element with date and time conforming to the W3C Datetime syntax.
The following example shows a Sitemap index listing two individual Sitemaps.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <sitemap>
    <loc>http://example.com/sitemap1.xml</loc>
    <lastmod>2012-08-08T10:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap2.xml</loc>
    <lastmod>2012-08-08T15:00:00Z</lastmod>
  </sitemap>
</sitemapindex>
Example 3.10: Sitemap Index with two Sitemaps.
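The limits stated above (50,000 url elements per Sitemap, 50,000 Sitemaps per Sitemap Index) imply simple arithmetic for a Source that has to split a large resource list. The following sketch works that out; the function names are hypothetical, and the 10MB size limit is deliberately not modelled here.

```python
MAX_URLS_PER_SITEMAP = 50_000    # limit from the Sitemaps XML format
MAX_SITEMAPS_PER_INDEX = 50_000  # limit for a single Sitemap Index


def sitemaps_needed(resource_count):
    """Return how many Sitemap files are needed for resource_count resources."""
    # Ceiling division using integers only.
    count = (resource_count + MAX_URLS_PER_SITEMAP - 1) // MAX_URLS_PER_SITEMAP
    if count > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("too many resources for a single Sitemap Index")
    return count


def split_into_sitemaps(locs, per_sitemap=MAX_URLS_PER_SITEMAP):
    """Split a list of resource URIs into consecutive chunks, one per Sitemap."""
    return [locs[i:i + per_sitemap] for i in range(0, len(locs), per_sitemap)]


# 50,000 Sitemaps of 50,000 entries each give the 2.5 billion upper bound.
upper_bound = MAX_URLS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX
```

A Source with 120,000 resources would, under these assumptions, publish three Sitemaps and one Sitemap Index listing them.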
A Source may provide in the Sitemap Index an indication of the xhtml:meta and xhtml:link values associated with resources in each individual Sitemap. It does this by aggregating the set of element values in each Sitemap and including them with xhtml:meta or xhtml:link elements inside the corresponding sitemap element of the Sitemap Index. This allows Destinations to filter and retrieve only those Sitemaps that match their selection criteria. This mechanism is not intended to override the specification of xhtml:meta or xhtml:link values for each resource, so each included Sitemap must still list the corresponding elements for each resource. Example 3.11 shows a Sitemap Index with aggregated categories.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <sitemap>
    <loc>http://example.com/sitemap1.xml</loc>
    <lastmod>2012-08-08T10:00:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs"
                title="Frogs"/>
    <xhtml:meta name="DC.subject" content="Crocodiles"/>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs"
                title="Frogs"/>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap2.xml</loc>
    <lastmod>2012-08-08T15:00:00Z</lastmod>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Animals"
                title="Animals"/>
  </sitemap>
</sitemapindex>
Example 3.11: Use of aggregated xhtml:link and xhtml:meta values with a Sitemap Index.
When a Destination detects that it is out of sync with a Source, the next step towards synchronization is the transfer of newly created and updated content. No content transfer is needed for deletions. ResourceSync supports several methods to accomplish this process.
The default method for a Destination to obtain changed content from a Source is to issue an HTTP GET request against the changed resource. A resource's URI can be taken from the loc element in the retrieved Sitemap.
Each such request transfers a single resource representation, so a method for batch content transfer is desirable, especially for Baseline Synchronization.
To reduce the number of HTTP GET requests necessary to transfer content, a Source can publish Dumps, which package its content.
The Source's capability to publish Dumps needs to be advertised to Destinations. If the Source publishes Sitemaps, the way to make a Dump discoverable is to include an xhtml:link element in the Sitemap. An example of a Dump discovery link is shown in Example 4.1.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<xhtml:link href="http://example.com/dump/dump.zip"
rel="http://www.openarchives.org/rs/dump"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2012-08-08T13:22:00Z</lastmod>
</url>
</urlset>
Example 4.1: Dump discovery link.
A Dump is a package that contains content hosted by a Source. A Dump may be used to transfer resources from a Source in bulk, without a Destination having to request the resources separately. A Baseline Synchronization is a typical scenario for a Destination to obtain a Dump.
The default Dump format for ResourceSync is the Zip file format. However, it is possible for a Source to publish Dumps in other formats such as WARC. Appendix B provides guidelines to implement a Dump in the WARC format.
Each Dump must contain a manifest.xml file. The manifest describes the content of the Dump. It is formatted as a Sitemap with additional descriptive elements. For each resource, described within the url element, a Manifest must include the element rs:path describing the mapping between the resource's URI and its relative file path in the Dump. Example 4.2 shows a simple manifest.xml file for a Dump containing two resources.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
    <rs:size>15672</rs:size>
    <rs:mimetype>text/html; charset=utf-8</rs:mimetype>
    <rs:contentencoding>gzip</rs:contentencoding>
    <rs:path>resources/res1</rs:path>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
    <rs:fixity type="md5">A7kjY2sgSW50ZWdyaX6sgt=</rs:fixity>
    <rs:size>93660664</rs:size>
    <rs:mimetype>application/pdf</rs:mimetype>
    <rs:contentencoding>compress</rs:contentencoding>
    <rs:path>resources/res2</rs:path>
  </url>
</urlset>
Example 4.2: A Dump Manifest.
The requirements for the use of all ResourceSync Sitemap elements, summarized in Table 3.1, Table 3.2, Table 3.3, and Table 3.4, apply to Dump manifest files as well. In particular, the use of the rs:mimetype and rs:contentencoding elements is recommended here.
Table 4.1 summarizes the XML element required in Dump manifests.
Element | Use | Description |
---|---|---|
<rs:path> | required | Relative resource file path within a Dump. |
Table 4.1: Dump Manifest rs:path element.
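To make the role of rs:path concrete, the sketch below reads manifest.xml from a Zip Dump held in memory and builds the mapping from resource URI to relative file path. It assumes that manifest.xml sits at the top level of the Zip file, which this specification does not state explicitly. The function names and the tiny example Dump are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def map_uris_to_paths(dump_bytes):
    """Return {resource URI: relative path in the Dump} read from manifest.xml."""
    with zipfile.ZipFile(io.BytesIO(dump_bytes)) as dump:
        # Assumption: the manifest is stored at the root of the Zip file.
        root = ET.fromstring(dump.read("manifest.xml"))
    mapping = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        mapping[loc] = url.findtext("rs:path", namespaces=NS)
    return mapping


def build_example_dump():
    """Create a small in-memory Zip Dump with one resource, for demonstration."""
    manifest = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
        'xmlns:rs="http://www.openarchives.org/rs/terms/">'
        "<url><loc>http://example.com/res1</loc>"
        "<rs:path>resources/res1</rs:path></url>"
        "</urlset>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as dump:
        dump.writestr("manifest.xml", manifest)
        dump.writestr("resources/res1", b"hello")
    return buffer.getvalue()


paths = map_uris_to_paths(build_example_dump())
```

With the mapping in hand, a Destination can read each packaged file by its path and associate it with the URI it would otherwise have dereferenced.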
Certain scenarios may require a Source to offer alternate methods of content transfer. ResourceSync recognizes the following cases:
Where a Source promotes an alternate location for its content, it needs to advertise the proper URIs to Destinations. It can do so by including an xhtml:link element as a child of each url element. The xhtml:link element can contain a reference to the alternate location of the resource and the proper relation type, and thereby conveys the information required for the Destination to obtain the content. Example 4.3 shows how a link can be included in a ResourceSync modified Sitemap.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
<xhtml:link rel="alternate http://www.openarchives.org/rs/mirror"
href="http://example.com/example-com-mirror/res1"/>
</url>
</urlset>
Example 4.3: Alternate Content Transfer from a mirror site.
Scenarios exist where it is more efficient for a Destination to transfer only the part of a resource that has actually changed instead of the entire resource. Minor changes to large resources, such as fixed typos or small additions, may not justify the transfer of the entire document, especially if these kinds of changes occur frequently.
ResourceSync supports the transfer of partial content. A Source can include an xhtml:link element as a child of each url element. It can contain a reference to the partial content, a protocol specifying the details of the partial content transfer between the Source and the Destination, and the proper relation. However, the implementation of this capability is left up to the Source, and in general implementations will be media type specific. Whichever protocol the Source uses, it needs to be understood by the Destination in order to complete the partial resource transfer.
Example 4.4 shows an xhtml:link element containing information needed by a Destination for partial content transfer. Note that partial content transfer is only applicable in Change Sets (introduced in Section 5.1) but not in Sitemaps.
<urlset rs:type="changeset"
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
<xhtml:link rel="http://www.openarchives.org/rs/partial"
rs:protocol="http://example.com/protocols/changesonly"
href="http://example.com/res1/diff251"/>
</url>
</urlset>
Example 4.4: Partial Content Transfer in a Change Set.
Note this is a forward reference. At this point we have not introduced Change Sets yet.
For alternate content transfer it is essential for a Destination to understand what to expect when dereferencing the URI provided in the loc element. Example 4.5 shows a case where the loc element carries further information about the resource that has changed. The URI shown is a baseURL of an OAI-PMH repository and the rs:protocol attribute points a Destination to the appropriate protocol specification. This additional pointer enables a Destination to understand that the given URI conforms to the OAI-PMH protocol.
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<url>
<loc rs:protocol="http://www.openarchives.org/OAI/openarchivesprotocol.html">
http://example.com/oaipmh</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
</url>
</urlset>
Example 4.5: Alternate Content Transfer in an OAI-PMH repository.
A Source may support communication of changes in its content as a way to enable Destinations to efficiently follow those changes. A Source may publish a description of recent changes, or may use XMPP PubSub or HTTP Callback to push changes to a subscribing Destination.
The ResourceSync framework introduces the notion of a Change Set that describes changes at a Source. The Change Set is a special-purpose Sitemap that lists only recently changed resources as well as the nature of their change. A Change Set is identified by a URI and if a Destination dereferences this URI, it can expect a set of recent changes to be returned.
In order to keep up with the Source's changes, Destinations need to become aware of whether Change Sets are provided. If a Source implements Sitemaps to describe its content, it can include a discovery link to a Change Set. The xhtml:link element can be used for that purpose. Example 5.1 shows a Sitemap including an xhtml:link element enabling the Destination to discover the Change Set provided by a Source.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<xhtml:link href="http://example.com/changesets/most_recent.xml"
rel="current http://www.openarchives.org/rs/changeset"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2012-08-08T13:22:00Z</lastmod>
</url>
</urlset>
Example 5.1: Change Set discovery link in a Sitemap.
The frequency of content change on a Source's end as well as the acceptable latency for synchronization on a Destination's end may vary between scenarios. In the ResourceSync framework it is up to a Source to decide what the temporal interval is that is covered by a Change Set. It may list all changes that occurred during the previous hour, the current day, or since the most recent publication of a Sitemap. A recent Change Set published by one Source may very well cover a much smaller or much larger temporal interval than a recent Change Set of another Source. It is up to a Source to define what "recent" means for its individual scenario.
Change Sets are based on the Sitemap format, which means that each Change Set has:
- a urlset root element with the recommended attribute rs:type="changeset",
- an xhtml:meta element as a child of the urlset element expressing the last modification time of the Change Set,
- a url entry as a child of the urlset element for each changed resource, and
- a loc element as a child of the url element.
The recommended addition of the attribute rs:type="changeset" to the urlset root element helps to distinguish between Change Sets and Sitemaps. Sitemaps do not have this attribute.
The recommended last modification time of the entire Change Set in the xhtml:meta element must be a date and time conforming to the W3C Datetime syntax. This time stamp provides one way for Destinations to determine whether a Change Set is new.
Since the purpose of Change Sets is to convey information about changes in content hosted by a Source, it is essential to indicate the nature of the change. Three types of content change are defined in the ResourceSync framework: created, updated, and deleted. Each url element must include one and only one of the following child elements to indicate the change type and when it occurred:
- <lastmod rs:type="created">datetime</lastmod>
- <lastmod rs:type="updated">datetime</lastmod>
- <expires>datetime</expires>
For all three options, the date and time of the change must be included conforming to the W3C Datetime syntax. Table 5.1 summarizes the three change types and their corresponding XML elements.
Change Type | XML Element |
---|---|
Create | <lastmod rs:type="created">2012-07-17T19:22:00Z</lastmod> |
Update | <lastmod rs:type="updated">2012-07-17T19:22:00Z</lastmod> |
Delete | <expires>2012-07-17T19:22:00Z</expires> |
Table 5.1: XML elements expressing content change in Change Sets.
Example 5.2 shows the content of a Change Set with three url elements, each of which describes one change event. This example shows one update, one deletion, and one creation. The urlset root element contains the attribute rs:type="changeset" and an xhtml:meta time stamp is also included.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
  </url>
</urlset>
Example 5.2: Change Set describing three content changes: an update, a deletion, and a creation.
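The mapping in Table 5.1 can be applied mechanically when a Destination processes a Change Set. The sketch below returns, for each url element, the resource URI, the change type, and the datetime of the change. The function name `list_changes` and the fallback to "updated" for a lastmod without an rs:type attribute are assumptions of this example.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
NS = {"sm": SM, "rs": RS}

# The three change events of Example 5.2 (XML declaration omitted).
SAMPLE = """<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
  </url>
</urlset>"""


def list_changes(changeset_xml):
    """Return a list of (loc, change_type, datetime) tuples in document order."""
    root = ET.fromstring(changeset_xml)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        expires = url.find("sm:expires", NS)
        lastmod = url.find("sm:lastmod", NS)
        if expires is not None:
            # An expires element signals a deletion.
            changes.append((loc, "deleted", expires.text))
        elif lastmod is not None:
            # Namespaced attributes are keyed as "{namespace}name".
            # Assumption: a lastmod without rs:type is treated as an update.
            kind = lastmod.get("{%s}type" % RS, "updated")
            changes.append((loc, kind, lastmod.text))
    return changes


changes = list_changes(SAMPLE)
```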
As seen in previous sections, a Source can add several optional child elements to each url element. This also applies to Change Sets. For example, the elements rs:size and rs:fixity are particularly important for the Destination's Audit process. Example 5.3 shows a Change Set with multiple optional elements for one change event.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    <rs:size>15672</rs:size>
    <rs:fixity type="md5">Q2hlY2sgSW50ZWdyaXR5IQ==</rs:fixity>
    <rs:mimetype>application/pdf</rs:mimetype>
    <rs:contentencoding>gzip</rs:contentencoding>
    <xhtml:link rel="DCTERMS.subject"
                href="http://en.wikipedia.org/wiki/Category:Frogs"
                title="Frogs"/>
    <xhtml:meta name="DC.subject" content="Crocodiles"/>
  </url>
</urlset>
Example 5.3: Change Set with multiple optional elements.
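An Audit step based on rs:size and rs:fixity might be sketched as follows. The sketch assumes that an rs:fixity value of type md5 is the Base64 encoding of the binary MD5 digest, as the form of the example values suggests; that encoding is an assumption of this example and should be checked against the definition of rs:fixity. The function name `matches_description` is hypothetical.

```python
import base64
import hashlib


def matches_description(content, size, md5_base64):
    """Return True if content has the stated size and Base64-encoded MD5 digest."""
    if len(content) != size:
        return False
    # Assumption: the fixity value is base64(md5 digest bytes), not a hex string.
    digest = base64.b64encode(hashlib.md5(content).digest()).decode("ascii")
    return digest == md5_base64


body = b"example"
expected = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
in_sync = matches_description(body, 7, expected)
```

Checking the size first avoids computing a digest for content that is already known to differ.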
A unique identifier for each change event might be useful in some use cases. This specification does not define a dedicated event identifying element. However, the ResourceSync framework recommends the combination of the values of the loc and of the lastmod elements to be used for this purpose. It is important to note that the framework considers it to be the Source's responsibility to provide a sufficient granularity for the lastmod value to ensure a truly unique identifier.
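One way a Destination could use this recommended combination is as a key for de-duplicating change events it has already handled. The sketch below keeps the two values as a pair rather than joining them into one string, which avoids choosing a separator; the function name `unseen_events` is hypothetical. For deletions, the datetime would come from the expires element instead of lastmod, which is an extension of the recommendation made for this example.

```python
def unseen_events(events, seen):
    """Return the (loc, datetime) pairs not yet in seen, and record them in seen.

    events is an iterable of (loc, datetime) pairs taken from a Change Set;
    seen is a set that persists between calls.
    """
    fresh = []
    for loc, changed_at in events:
        key = (loc, changed_at)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


seen = set()
first = unseen_events(
    [("http://example.com/res1", "2012-08-08T08:15:00Z"),
     ("http://example.com/res3", "2012-08-08T14:57:00Z")],
    seen,
)
second = unseen_events(
    [("http://example.com/res1", "2012-08-08T08:15:00Z"),
     ("http://example.com/res1", "2012-08-08T09:00:00Z")],
    seen,
)
```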
In the previous section a Source publishes Change Sets at a self-defined frequency. A Destination periodically needs to check for updates by pulling the Change Set. This setup implies a latency since the publication interval is usually unknown to the Destination.
For scenarios where this latency is unacceptable or Destinations simply cannot continuously pull for Change Sets, the ResourceSync framework features push-based approaches. These approaches are suitable, for example, for environments with high frequency content changes at the Source's end and high synchronization demands at the Destination's end. Since change events can rapidly and continuously be pushed to Destinations, the latency inflicted by the Destination's "guessing" of when to pull for a new Change Set is eliminated.
Two push-based approaches are described below: one based on XMPP and one based on HTTP Callback.
The Extensible Messaging and Presence Protocol (XMPP), more specifically, its PubSub extension, allows a Source to support subscription to Change Sets communicated via XMPP messaging infrastructure.
A Destination here also needs to become aware of this capability being offered by a Source. Similar to the previous section, Example 5.4 shows a Sitemap including an xhtml:link element. It enables Destinations to discover all necessary information to receive the push-based Change Set provided by a Source.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<xhtml:link href="xmpp:pubsub.example.com"
rs:protocol="http://xmpp.org/extensions/xep-0060.html"
rs:pubsubnode="Example_Node_Name"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2012-08-08T13:22:00Z</lastmod>
</url>
</urlset>
Example 5.4: Discovery link for pushing Change Sets via XMPP PubSub.
An XMPP message sent by a Source is encapsulated in an xmpp:iq element. This element contains, among other things, the addresses of the sender and the recipient. The protocol's PubSub extension adds the xmpp:pubsub and xmpp:publish elements. The latter contains the name of the XMPP PubSub node the message is published to.
The body of the XMPP PubSub message is contained in an xmpp:item element. As shown in Example 5.5, the message itself is a Change Set encapsulated by the urlset element, with each change event contained within a url element. All elements, required and optional, as introduced in the previous section, apply here too.
Example 5.5 shows the same change events as seen in Example 5.2 but in the form of an XMPP PubSub message. It is up to the Source to decide whether to bundle more than one change event into one XMPP PubSub message (as seen in Example 5.5) or to send one message per change event, in which case the encapsulating urlset would only include one url element.
<xmpp:iq from="sender@example.com" type="set" to="destination.com" id="liAJUz3S"
         xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xmpp="http://jabber.org/protocol/pubsub"
         xmlns:rs="http://www.openarchives.org/rs/terms/">
  <xmpp:pubsub>
    <xmpp:publish node="PubSub_NodeName">
      <xmpp:item id="3294">
        <urlset rs:type="changeset">
          <url>
            <loc>http://example.com/res1</loc>
            <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
          </url>
          <url>
            <loc>http://example.com/res2</loc>
            <expires>2012-08-08T13:22:00Z</expires>
          </url>
          <url>
            <loc>http://example.com/res3</loc>
            <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
          </url>
        </urlset>
      </xmpp:item>
    </xmpp:publish>
  </xmpp:pubsub>
</xmpp:iq>
Example 5.5: Push-based XMPP message containing a Change Set.
The xmpp:item element contains an identifier that is used within XMPP to distinguish between messages and, for example, to purge individual (persistent) messages from an XMPP server.
HTTP callback allows Sources to directly push Change Sets to registered Destinations without the need for other infrastructure. Example 5.6 shows how a Source can advertise the availability of HTTP callback in its Sitemap using the xhtml:link element. The rs:protocol attribute indicates the protocol that this capability conforms to, and the href attribute gives the location of the subscription interface. The subscription interface provided by the Source allows Destinations to register their corresponding HTTP callback URIs.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/"
xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
<xhtml:link href="http://example.com/subscribe"
rs:protocol="http://example.com/protocol/callback"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2012-08-08T08:15:00Z</lastmod>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2012-08-08T13:22:00Z</lastmod>
</url>
</urlset>
Example 5.6: Discovery link for pushing Change Sets via HTTP Callback.
With this method a Source can push Change Sets to the specified URIs of registered Destinations. It is again up to the Source to decide whether to push Change Sets containing only one change event or bundle multiple change events into one Change Set. Example 5.7 shows the same three change events as seen in Example 5.5 but communicated via the HTTP callback method in one Change Set.
>> Subscription Request <<
POST /subscribe HTTP/1.1
Host: example.com

callbackURI=http://aggregator.org/callback

>> Change Notification <<
POST /callback HTTP/1.1
Host: aggregator.org

<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
  </url>
</urlset>
Example 5.7: Push-based HTTP Callback method communicating a Change Set.
As mentioned in Section 5.1, a Destination has no control over how many change events a Source includes in one Change Set. For scenarios with a relatively low resource change frequency, for example, a Change Set generated over the course of one day might not contain many changes and hence be of very reasonable size. However, in cases with high change frequencies, the same Change Set may grow to an extent that is unreasonable to be communicated to Destinations (again, this is at the Source's discretion to decide), or may provide unacceptable latency.
In order to enable Sources in high change frequency scenarios to communicate all changes, without having to accumulate all of them into one Change Set, a Source may provide historical Change Sets. These historical Change Sets can be seen as digests of past change events, covering a time span prior to the one covered by the current Change Set.
A Destination can access the historical Change Sets by following a link that is included in the current Change Set. Such a link can be seen in Example 6.1. The first link points to the URI of the current Change Set with the relation "current". The second link, with the relation "prev", points to a historical Change Set that covers changes that occurred in a time span previous and adjacent to the one covered by the current Change Set. This historical Change Set can in its turn include a link with a "prev" relation pointing at an even earlier historical Change Set, etc. A Destination can follow these links with a "prev" relation to collect all needed or available historical Change Sets. By analyzing the change events listed in the gathered Change Sets, for example looking at the datetime of each change, a Destination can determine whether it already processed a change. As soon as a Change Set is encountered that lists a previously processed change, there is no need to collect even more Change Sets.
With historical Change Sets a Destination has yet another option to "catch up" with a Source in case it has missed Change Sets and the Source has not yet generated a new Sitemap and a new Dump.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <xhtml:link href="http://example.com/changesets/most_recent.xml"
              rel="current http://www.openarchives.org/rs/changeset"/>
  <xhtml:link href="http://example.com/changesets/20120807.xml"
              rel="prev http://www.openarchives.org/rs/changeset"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
  </url>
</urlset>
Example 6.1: Change Set with links to historical Change Sets.
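The catch-up procedure described above, following "prev" links until a previously processed change is encountered, can be sketched as shown below. Retrieval is passed in as a `fetch` callable so that the sketch stays independent of any HTTP library; `fetch`, `is_processed`, and the function names are hypothetical and not part of this specification.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def prev_link(root):
    """Return the href of the top-level xhtml:link whose rel includes "prev"."""
    for link in root.findall("xhtml:link", NS):
        if "prev" in (link.get("rel") or "").split():
            return link.get("href")
    return None


def collect_unprocessed(start_uri, fetch, is_processed):
    """Walk back through Change Sets and return unprocessed (loc, datetime) pairs.

    fetch(uri) returns the Change Set XML as a string.
    is_processed(loc, datetime) tells whether the Destination already
    handled that change. The walk stops after the first Change Set that
    lists an already processed change, or when no "prev" link is left.
    """
    pending = []
    visited = set()
    uri = start_uri
    while uri is not None and uri not in visited:
        visited.add(uri)
        root = ET.fromstring(fetch(uri))
        caught_up = False
        for url in root.findall("sm:url", NS):
            loc = url.findtext("sm:loc", namespaces=NS)
            stamp = (url.findtext("sm:lastmod", namespaces=NS)
                     or url.findtext("sm:expires", namespaces=NS))
            if is_processed(loc, stamp):
                caught_up = True
            else:
                pending.append((loc, stamp))
        if caught_up:
            break
        uri = prev_link(root)
    return pending
```

The `visited` set guards against a chain of "prev" links that loops back on itself.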
A Source may implement a capability that allows a Destination to obtain prior versions of resources. Where Destinations need to obtain all versions of a resource, not just the current one, this capability becomes very useful. The ResourceSync framework features two implementation alternatives.
In addition to having a generic URI that applies to all versions of a resource, a Source may mint a URI that is associated with each particular version. When communicating about the resource, its generic URI is provided in the loc element whereas the URI of the specific version of the resource (the historical content) can be provided using an xhtml:link element that has a relation type of "self" and of "memento". It is up to the Source to decide for how long the version resource remains accessible.
Example 6.2 shows a Change Set with version URIs included. In this example the URIs are minted with the help of the value of the lastmod elements.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    <xhtml:link href="http://example.com/20120808081500/res1" rel="memento"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
    <xhtml:link href="http://example.com/20120808132200/res2" rel="memento"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
    <xhtml:link href="http://example.com/20120808145700/res3" rel="memento"/>
  </url>
</urlset>
Example 6.2: Change Set with links to a resource version.
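The URI pattern used in Example 6.2 can be reproduced with a few lines of code. The sketch below is only one possible minting scheme, inferred from the example; a Source is free to mint version URIs in any way it likes. It handles only datetimes with second precision and a "Z" suffix, and the function name `version_uri` is hypothetical.

```python
from datetime import datetime
from urllib.parse import urlsplit, urlunsplit


def version_uri(loc, changed_at):
    """Insert a YYYYMMDDhhmmss path segment derived from changed_at into loc."""
    # Assumption: changed_at looks like 2012-08-08T08:15:00Z.
    stamp = datetime.strptime(changed_at, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y%m%d%H%M%S")
    parts = urlsplit(loc)
    path = "/" + stamp + parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


minted = version_uri("http://example.com/res1", "2012-08-08T08:15:00Z")
```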
In addition to having a generic URI that applies to all versions of a resource, a Source can associate a TimeGate with the resource, as per the Memento protocol [Memento Internet Draft]. A TimeGate supports negotiation in the datetime dimension to obtain a version of the resource as it existed at a specified moment in time, for example, the time provided in lastmod. When communicating about the resource, its generic URI is provided in the loc element whereas the URI of the TimeGate associated with the resource can be provided using an xhtml:link element that has a relation type of "timegate". It is up to the Source to decide for how long version resources remain accessible.
An example of a Change Set with links to a Memento TimeGate is shown in Example 6.3.
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:meta name="DCTERMS.modified" content="2012-08-08T16:30:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
    <xhtml:link href="http://example.com/timegate/http://example.com/res1" rel="timegate"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <expires>2012-08-08T13:22:00Z</expires>
    <xhtml:link href="http://example.com/timegate/http://example.com/res2" rel="timegate"/>
  </url>
  <url>
    <loc>http://example.com/res3</loc>
    <lastmod rs:type="created">2012-08-08T14:57:00Z</lastmod>
    <xhtml:link href="http://example.com/timegate/http://example.com/res3" rel="timegate"/>
  </url>
</urlset>
Example 6.3: Change Set with links to Memento TimeGate.
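To make the datetime negotiation concrete, the sketch below builds, without sending, a request to the TimeGate of res1 from Example 6.3 for the time given in its lastmod. It assumes the `Accept-Datetime` request header defined by the Memento protocol, which is not part of this specification; the function name `timegate_request` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_uri, lastmod):
    """Build (but do not send) a datetime negotiation request for a TimeGate."""
    # The datetimes in the examples are UTC values with a trailing "Z".
    when = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # Assumption: Memento expects the desired datetime as an RFC 1123 date in GMT.
    accept_datetime = format_datetime(when, usegmt=True)
    return Request(timegate_uri, headers={"Accept-Datetime": accept_datetime})


request = timegate_request(
    "http://example.com/timegate/http://example.com/res1",
    "2012-08-08T08:15:00Z",
)
```

Sending the request, for instance with `urllib.request.urlopen`, would be expected to lead to the version of the resource that was current at the given time. How the TimeGate responds is governed by the Memento protocol and by how long the Source keeps its version resources accessible.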
Example 7.1 shows how Destinations can discover a Sitemap via a Source's robots.txt file.
User-agent: *
Sitemap: http://example.com/sitemap.xml
Example 7.1: robots.txt
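A Destination could locate the advertised Sitemap in a robots.txt file such as Example 7.1 roughly as follows. This is a minimal Python sketch that operates on the text of the file; fetching the file over HTTP is left out, and the function name `sitemap_urls` is an assumption made for the example.

```python
ROBOTS_TXT = """User-agent: *
Sitemap: http://example.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the URLs of all Sitemap directives in a robots.txt document (hypothetical helper)."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the colon inside the URL is preserved.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


discovered = sitemap_urls(ROBOTS_TXT)
```

The field name is compared case-insensitively, and a file may carry several Sitemap lines, so the helper returns a list rather than a single URL.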
Based on Web Host Metadata specifications [RFC 6415]
to come: text with brief intro of Table A.1 containing all here introduced XML elements and the technologies they can be used in.
to come: incorporate Dump somehow and show that a Manifest is required for it.
XML Element | Sitemap | Sitemap Index | Manifest | Change Set |
---|---|---|---|---|
<sitemap> | | required | | |
<sitemapindex> | | required | | |
<urlset> | required | | required | required |
<url> | required | | required | required |
<loc> | required | required | required | required |
<lastmod rs:type="updated"> or <lastmod rs:type="created"> or <expires> | optional | optional | optional | required |
<rs:fixity> | optional | optional | optional | optional |
<rs:size> | optional | optional | optional | optional |
<rs:mimetype> | optional | optional | optional | optional |
<rs:contentencoding> | optional | optional | optional | optional |
<rs:path> | | | required | |
<xhtml:meta> | optional | optional | optional | optional |
<xhtml:link> | optional | optional | optional | optional |
Table A.1: All covered XML elements and the technologies they are used for.
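The Change Set column of Table A.1 can be read as a simple conformance check: every url entry needs a loc and one of lastmod or expires. The Python sketch below applies only those two rules and is not a full validator; the function name `change_set_problems` and the sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def change_set_problems(xml_text):
    """List violations of the 'required' entries for a Change Set (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != "{%s}urlset" % SITEMAP_NS:
        problems.append("root element is not <urlset>")
    for index, url in enumerate(root.findall("sm:url", NS), start=1):
        if not url.findtext("sm:loc", default="", namespaces=NS).strip():
            problems.append("url %d: missing <loc>" % index)
        # Elements are compared against None because an Element without
        # children is falsy.
        if url.find("sm:lastmod", NS) is None and url.find("sm:expires", NS) is None:
            problems.append("url %d: missing <lastmod> or <expires>" % index)
    return problems


GOOD = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res2</loc><expires>2012-08-08T13:22:00Z</expires></url>
</urlset>"""

BAD = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc></url>
</urlset>"""

good_report = change_set_problems(GOOD)
bad_report = change_set_problems(BAD)
```

An empty list means that no violation of these two rules was found; it does not imply that the document is valid in any other respect.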
to come
This specification is the work of NISO and the Open Archives Initiative. Funding for ResourceSync is provided by the Alfred P. Sloan Foundation. UK participation is supported by the JISC.
This specification is based on the meetings of the ResourceSync Technical Committee. The Technical Committee includes the editors and (in alphabetical order): Manuel Bernhardt (Delving B.V.), Richard Jones (Cottage Labs), Graham Klyne (University of Oxford), Stuart Lewis (University of Edinburgh), Kevin Ford (Library of Congress), David Rosenthal (LOCKSS), Christian Sadilek (Red Hat), Shlomo Sanders (Ex Libris, Inc.), Sjoerd Siebinga (Delving B.V.), Ed Summers (Library of Congress), and Jeff Young (Online Computer Library Center).
Check participant status and affiliations
Date | Editor | Description |
---|---|---|
2012-08-13 | martin, herbert, simeon, bernhard | first alpha-spec draft |
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.