Open Archives Initiative ResourceSync Framework Specification
This ResourceSync specification describes a synchronization framework for the web consisting of various capabilities that allow third party systems to remain synchronized with a server's evolving resources. The capabilities can be combined in a modular manner to meet local or community requirements. The specification also describes how a server can advertise the synchronization capabilities it supports and how third party systems can discover this information. The specification repurposes the document formats defined by the Sitemap protocol and introduces extensions for them.
This specification is a beta draft released for public comment. Feedback is most welcome on the ResourceSync Google Group.
Editors' Note: The current version of this specification only details ZIP as the format to package content. While it is the intention to stick with ZIP as the recommended packaging format, discussions are ongoing regarding the use of other packaging formats.
Editors' Note: This specification only details pull-based approaches that allow a Destination to remain informed about a Source's evolving resources. Discussions are ongoing about augmenting those with push-based (notification) approaches, for example, based on publish/subscribe technology. These push-based approaches would allow a Source to communicate, among others, the availability of a new Resource List or Resource Dump, the updating of a Change List, the change of a particular resource, etc.
1. Introduction
1.1 Motivating Examples
1.2 Notational Conventions
2. ResourceSync Basics
2.1 Walkthrough
2.2 Overview
2.2.1 Source Perspective
2.2.2 Destination Perspective
2.2.3 Discovery Perspective
2.2.4 Overview Summary
3. Sitemap Document Formats
4. Describing Resources
4.1 Resource List
4.2 Resource List Index
5. Packaging Resources
5.1 Resource Dump
5.1.1 Resource Dump Manifest
5.1.2 Resource Dump Manifest Index
6. Describing Changes
6.1 Change List
7. Packaging Changes
7.1 Change Dump
7.1.1 Change Dump Manifest
7.1.2 Change Dump Manifest Index
8. Linking to Related Resources
8.1 Mirrored Content
8.2 Alternate Representations
8.3 Patching Content
8.4 Resources and Metadata about Resources
8.5 Prior Versions of Resources
8.6 Republishing Resources
9. Providing Historical Data
9.1 Resource Dump Archives
9.2 Change List Archives
9.3 Change Dump Archives
10. Advertising Capabilities
10.1 Capability List
10.2 Capability List Index
10.3 Discovery
10.3.1 ResourceSync Well-Known URI
10.3.2 X/HTML Link Element
10.3.3 HTTP Link Header
11. References
The web is highly dynamic, with resources continuously being created, updated, and deleted. As a result, using resources from a remote server involves the challenge of remaining in step with its changing content. In many cases, there is no need to reflect a server's evolving content perfectly, and therefore well-established resource discovery techniques, such as crawling, suffice as an updating mechanism. However, there are significant use cases that require low latency and high accuracy in reflecting a remote server's changing content. These requirements have typically been addressed by ad hoc technical approaches implemented within a small group of collaborating systems. There have been no widely adopted, web-based approaches.
This ResourceSync specification introduces a range of easy-to-implement capabilities that a server may support in order to enable remote systems to remain more tightly in step with its evolving resources. It also describes how a server can advertise the capabilities it supports. Remote systems can inspect this information to determine how best to remain aligned with the evolving data.
Each capability provides a different synchronization functionality, such as a list of the server's resources or its recently changed resources, including what the nature of the change was: create, update, or delete. All capabilities are implemented on the basis of the document formats introduced by the Sitemap protocol. Capabilities can be combined to achieve varying levels of functionality and hence meet different local or community requirements. This modularity provides flexibility and makes ResourceSync suitable for a broad range of use cases.
This document is structured as follows:
Many projects and services have synchronization needs and have implemented ad hoc solutions. ResourceSync provides a standard synchronization method that will reduce implementation effort and facilitate easier reuse of resources. This section describes motivating examples with differing needs and complexities.
Consider first the case of a website for a small museum collection. The website may contain just a few dozen static web pages. The maintainer can create a Resource List of these web pages and expose it to services that leverage ResourceSync.
When building services over Linked Data it is often desirable to maintain a local copy of data for improved access and availability. Harvesting can be enabled by publishing a Resource List for the Dataset. In many cases resource representations exposed as Linked Data are small and so retrieving them via individual HTTP GET requests is slow because of the large number of round-trips for a small amount of content. Publishing a Resource Dump that points to content packaged and described in ZIP files makes this more efficient for the client and less burdensome for the server. Continued synchronization is enabled by recurrently publishing an up-to-date Resource List or Resource Dump, or, more efficiently, by publishing a Change List that provides information about resource changes only.
For many years now, the arXiv.org collection of scientific articles has used a custom mirroring solution to propagate resource changes to a set of mirror sites and interacting services on a daily basis. The collection contains about 2.4 million files and there are about 1,600 changes (creates, updates) per day. The mirroring system currently in place uses HTTP with custom change descriptions, and occasionally rsync to verify the copies and to cope with any errors in the incremental updates. The approach assumes a tight connection between arXiv.org and its mirrors. It would be desirable to have a solution that allows any interested third-party system to accurately synchronize with arXiv.org using commodity software. arXiv.org could publish both metadata records and full-text content as separate web resources, each with its own URI. Leveraging ResourceSync capabilities including Resource Lists, Resource Dumps, Change Lists, and Change Dumps, both existing mirrors, such as lanl.arXiv.org, and new parties could remain accurately in sync with the arXiv.org collection. This would extend the openly available metadata sharing capability provided by arXiv.org, currently implemented via OAI-PMH, to full-text sharing in a web-friendly fashion.
This specification uses the terms "resource", "representation", "request", "response", "content negotiation", "client", and "server" as described in [Architecture of the World Wide Web].
Throughout this document, the following namespace prefix bindings are used:
| Prefix | Namespace URI | Description |
|---|---|---|
| | http://www.sitemaps.org/schemas/sitemap/0.9 | Sitemap XML elements defined in the Sitemap protocol |
| rs | http://www.openarchives.org/rs/terms/ | Namespace for elements and attributes introduced in this specification |
Table 1.1: Namespace prefix bindings used in this document
This section provides an overview of the various ResourceSync capabilities that a server may support in order to enable remote systems to become and remain synchronized with its evolving resources. The following terms are introduced:
Let's assume a Source, http://example.com/, that exposes changing content that others would like to remain synchronized with.
A first step towards making this easy for Destinations is for the Source to publish a Resource List
that conveys the URIs of resources available for synchronization. This Resource List is
expressed as a Sitemap. As shown in Example 2.1, the Source conveys the URI of
each resource as the value of the
<loc> child element of a
<url> element. Note the
<rs:md> child element of the <urlset>
root element. It expresses that the Sitemap implements ResourceSync's Resource List capability and conveys
the datetime of the Resource List's most recent update, allowing a Destination to quickly determine whether
it has previously processed this specific Resource List.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
  </url>
</urlset>
Example 2.1: A Resource List
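A Destination might read such a Resource List along the following lines. This is a minimal sketch using Python's standard library; the function name `read_resource_list`, the `sm` prefix, and the embedded copy of Example 2.1 are illustrative choices and not part of this specification.

```python
import xml.etree.ElementTree as ET

# Namespace bindings; "sm" is a local shorthand chosen for this sketch.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource List of Example 2.1, embedded so the sketch runs offline.
RESOURCE_LIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>
  <url><loc>http://example.com/res1</loc></url>
  <url><loc>http://example.com/res2</loc></url>
</urlset>"""


def read_resource_list(xml_bytes):
    """Return (modified, uris) for a Resource List document."""
    root = ET.fromstring(xml_bytes)
    # The document-level <rs:md> is a direct child of the root element.
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "resourcelist":
        raise ValueError("document is not a Resource List")
    uris = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
    return md.get("modified"), uris


modified, uris = read_resource_list(RESOURCE_LIST)
print(modified, uris)
```

Remembering the returned `modified` value lets a Destination skip a Resource List it has already processed.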
The Source can provide additional information in the Resource List to
help the Destination optimize the process of collecting
content and verifying its accuracy. For example,
when the Source expresses the datetime of the most recent modification
for a resource, a Destination can determine whether or not it already
holds the current version, minimizing the number of HTTP requests it
needs to issue in order to remain up-to-date. Example 2.2 shows this information
conveyed using Sitemap's <lastmod> element.
When the Source also conveys a hash for a specific bitstream, a Destination can verify whether
the process of obtaining it was successful.
Example 2.2 shows this information conveyed using the hash
attribute on the <rs:md> element.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
  </url>
</urlset>
Example 2.2: A Resource List with additional information
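The use of modification times to avoid unnecessary requests could be sketched as follows. The sketch assumes the Destination records, per URI, the `<lastmod>` value it saw at its last fetch; the names `parse_utc` and `uris_to_fetch` and the `held` bookkeeping are hypothetical and not defined by this specification.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse the UTC form YYYY-MM-DDThh:mm:ss[.s]Z into an aware datetime."""
    layout = "%Y-%m-%dT%H:%M:%S.%fZ" if "." in value else "%Y-%m-%dT%H:%M:%SZ"
    return datetime.strptime(value, layout).replace(tzinfo=timezone.utc)


def uris_to_fetch(listed, held):
    """Return URIs that are unknown locally or newer at the Source.

    listed: {uri: lastmod} taken from the Resource List.
    held:   {uri: lastmod} recorded by the Destination at its last fetch.
    """
    stale = []
    for uri, lastmod in listed.items():
        known = held.get(uri)
        if known is None or parse_utc(lastmod) > parse_utc(known):
            stale.append(uri)
    return stale


# Values taken from Example 2.2; the Destination already holds res1.
listed = {
    "http://example.com/res1": "2013-01-02T13:00:00Z",
    "http://example.com/res2": "2013-01-02T14:00:00Z",
}
held = {"http://example.com/res1": "2013-01-02T13:00:00Z"}
print(uris_to_fetch(listed, held))
```

Only `res2` needs an HTTP request in this case, because the held copy of `res1` carries the same modification time as the one listed.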
In order to describe its changing content in a more timely manner, the Source can increase the frequency at which it publishes an up-to-date Resource List. However, changes may be so frequent or the size of the content collection so vast that regularly updating a complete Resource List may be impractical. In such cases, the Source can implement an additional capability that communicates information about changes only. To this end, ResourceSync introduces Change Lists. A Change List lists resources as they change, along with the nature of the change (create, update, or delete) and the time that the change occurred. A Destination can recurrently obtain a Change List from the Source, inspect the listed changes to discover those it has already acted upon, and process the remaining ones. Changes in a Change List are provided in chronological order, making it straightforward for a Destination to determine which changes it already processed. The longer that Change Lists are maintained by the Source, the better the odds are for a Destination to catch up on changes it missed because it was offline, for example.
Example 2.3 shows a Change List.
The value of the capability attribute of the <rs:md> child element of <urlset> makes it clear
that, this time, the Sitemap is a Change List and not a Resource List.
The Change List conveys two
resource changes, one being an update and the other a deletion, as can be
seen from the value of the change attribute of the
<rs:md> element. The example also shows the use of the
<lastmod> element to convey the time of the changes. Note that these times are used to
order the Change List chronologically.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2.pdf</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3.tiff</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
Example 2.3: A Change List
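One way a Destination could process a Change List such as Example 2.3 is sketched below. The checkpoint parameter `processed_until`, the `fetch` callback, and the in-memory `local_copy` dictionary are assumptions made for illustration. The sketch also assumes that all `<lastmod>` values use the same fixed-width UTC format, so that they can be compared as plain strings.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Change List of Example 2.3, embedded so the sketch runs offline.
CHANGE_LIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" modified="2013-01-03T11:00:00Z"/>
  <url>
    <loc>http://example.com/res2.pdf</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/res3.tiff</loc>
    <lastmod>2013-01-02T18:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>"""


def pending_changes(xml_bytes, processed_until=None):
    """Return (uri, lastmod, change) tuples newer than the checkpoint."""
    root = ET.fromstring(xml_bytes)
    changes = []
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", namespaces=NS).strip()
        change = url.find("rs:md", NS).get("change")
        # Same-format UTC timestamps sort correctly as strings.
        if processed_until is None or lastmod > processed_until:
            changes.append((uri, lastmod, change))
    return changes


def apply_changes(local_copy, changes, fetch):
    """Apply changes in document (chronological) order to local_copy."""
    for uri, _lastmod, change in changes:
        if change == "deleted":
            local_copy.pop(uri, None)
        else:  # "created" or "updated"
            local_copy[uri] = fetch(uri)


local_copy = {"http://example.com/res3.tiff": b"old image"}
changes = pending_changes(CHANGE_LIST, processed_until="2013-01-02T12:00:00Z")
# A stand-in for an HTTP GET, so the sketch needs no network access.
apply_changes(local_copy, changes, fetch=lambda uri: b"content of " + uri.encode())
print(sorted(local_copy))
```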
A Destination can issue HTTP GET requests against each resource URI listed in a Resource List. For
large Resource Lists, issuing all of these requests may be cumbersome. Therefore, ResourceSync introduces a
capability that a Source can use to
make packaged content available. A Resource Dump, implemented as a Sitemap, contains pointers to packaged content.
Each content package referenced in a Resource Dump is a ZIP file that contains the Source's bitstreams along with a Resource Dump Manifest
that describes each. The Resource Dump Manifest itself is also implemented as a Sitemap.
A Destination can retrieve a Resource Dump, obtain content packages by dereferencing the contained pointers, and unpack the retrieved packages.
Since the Resource Dump Manifest also lists the URI the Source associates with each bitstream, a Destination is able to achieve
the same result as obtaining the data by dereferencing the URIs listed in a Resource List.
Example 2.4 shows a Resource Dump that points at a single content package. Dereferencing the URI of that package leads to a ZIP file
that contains the Resource Dump Manifest shown in
Example 2.5. It indicates that the Source's ZIP file contains two bitstreams.
The path attribute of the <rs:md> element conveys
the file path of the bitstream in the ZIP file (the relative file system path where the bitstream
would reside if the ZIP were unpacked), whereas the <loc> element conveys the URI associated with the bitstream at the Source.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/resourcedump.zip</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </url>
</urlset>
Example 2.4: A Resource Dump
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest"
         modified="2013-01-03T19:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-03T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-03T14:00:00Z</lastmod>
    <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
           path="/resources/res2"/>
  </url>
</urlset>
Example 2.5: A Resource Dump Manifest detailing the content of a ZIP file
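Unpacking a content package guided by its Resource Dump Manifest might look like the sketch below. It builds a small ZIP file in memory instead of downloading one. The manifest file name `manifest.xml`, the bitstream contents, and the stripping of the leading slash from the path attribute are assumptions made for this illustration; this section does not prescribe them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# A reduced manifest in the style of Example 2.5 (hashes omitted).
MANIFEST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" modified="2013-01-03T19:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md path="/resources/res1"/>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <rs:md path="/resources/res2"/>
  </url>
</urlset>"""


def build_demo_package():
    """Create an in-memory ZIP that stands in for resourcedump.zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("manifest.xml", MANIFEST)  # assumed file name
        package.writestr("resources/res1", b"first bitstream")
        package.writestr("resources/res2", b"second bitstream")
    return buffer.getvalue()


def unpack_dump(zip_bytes, manifest_name="manifest.xml"):
    """Return {uri: content} for every bitstream listed in the manifest."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        root = ET.fromstring(package.read(manifest_name))
        contents = {}
        for url in root.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS).strip()
            path = url.find("rs:md", NS).get("path")
            # ZIP member names carry no leading slash.
            contents[uri] = package.read(path.lstrip("/"))
        return contents


unpacked = unpack_dump(build_demo_package())
print(sorted(unpacked))
```

The resulting mapping from URI to content is what a Destination would also obtain by dereferencing each URI of a Resource List individually.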
ResourceSync also introduces a Capability List, which is a way for the Source to describe the capabilities it supports.
Example 2.6 shows an example of such a description.
It indicates that the Source supports the Resource List, Resource Dump, and Change List capabilities and it lists their respective URIs.
Note the inclusion of a <rs:ln> child element of <urlset> that links to a description of
the data that the Source makes available.
There are various ways for a Destination to discover a Source's Capability List. The recommended approach
leverages the well-known URI specification [RFC5785] and consists of the Source making the Capability List
available at /.well-known/resourcesync.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml"
         rel="describedby"
         type="application/xml"/>
  <rs:md capability="capabilitylist"
         modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
Example 2.6: A Capability List with the description of the ResourceSync capabilities of a Source
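A Destination could derive the well-known URI for a Source and read the resulting Capability List roughly as follows. The sketch embeds Example 2.6 rather than issuing an HTTP request; the function names `well_known_uri` and `read_capability_list` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Capability List of Example 2.6, embedded so the sketch runs offline.
CAPABILITY_LIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln href="http://example.com/info-about-source.xml"
         rel="describedby" type="application/xml"/>
  <rs:md capability="capabilitylist" modified="2013-01-02T14:00:00Z"/>
  <url>
    <loc>http://example.com/dataset1/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/resourcedump.xml</loc>
    <rs:md capability="resourcedump"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>"""


def well_known_uri(source_uri):
    """Return the ResourceSync well-known URI on the Source's host."""
    return urljoin(source_uri, "/.well-known/resourcesync")


def read_capability_list(xml_bytes):
    """Return ({capability: uri}, [describedby hrefs])."""
    root = ET.fromstring(xml_bytes)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "capabilitylist":
        raise ValueError("document is not a Capability List")
    capabilities = {}
    for url in root.findall("sm:url", NS):
        capability = url.find("rs:md", NS).get("capability")
        capabilities[capability] = url.findtext("sm:loc", namespaces=NS).strip()
    described_by = [
        ln.get("href")
        for ln in root.findall("rs:ln", NS)
        if ln.get("rel") == "describedby"
    ]
    return capabilities, described_by


print(well_known_uri("http://example.com/dataset1/"))
capabilities, described_by = read_capability_list(CAPABILITY_LIST)
print(capabilities["changelist"], described_by)
```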
In many cases, there is a need to group the documents described so far. For example,
the Sitemap protocol prescribes a maximum of 50,000 resources per Sitemap and a Source may easily have more resources
that are subject to synchronization. In this case, multiple Resource Lists are published as well as a Resource List Index that points
to each of them. The Resource List Index is expressed using Sitemap's <sitemapindex> document format. Similarly, in
order to group the current and previous Change Lists, a Change List Archive is published. It too is expressed using
Sitemap's <sitemapindex> document format. This pattern is used throughout the synchronization framework.
Example 2.7 shows a Resource List Index that points at two Resource Lists.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <sitemap>
    <loc>http://example.com/resourcelist-part2.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://example.com/resourcelist-part1.xml</loc>
    <lastmod>2013-01-03T09:00:00Z</lastmod>
  </sitemap>
</sitemapindex>
Example 2.7: A Resource List Index expressed using the <sitemapindex> document format
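Because a Resource List and a Resource List Index differ in their root element, a Destination can handle both with one routine, as in the sketch below. The `load` callback stands in for an HTTP GET, and the index URI `resourcelist-index.xml` and the resources `res1` to `res3` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
NS = {"sm": SM, "rs": RS}


def make_resource_list(*uris):
    """Build a minimal Resource List for the given URIs (demo helper)."""
    entries = "".join("<url><loc>%s</loc></url>" % uri for uri in uris)
    document = (
        '<urlset xmlns="%s" xmlns:rs="%s">'
        '<rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>'
        "%s</urlset>" % (SM, RS, entries)
    )
    return document.encode("utf-8")


# Hypothetical documents keyed by URI, standing in for the Source.
DOCUMENTS = {
    "http://example.com/resourcelist-index.xml": (
        '<sitemapindex xmlns="%s" xmlns:rs="%s">'
        '<rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>'
        "<sitemap><loc>http://example.com/resourcelist-part1.xml</loc></sitemap>"
        "<sitemap><loc>http://example.com/resourcelist-part2.xml</loc></sitemap>"
        "</sitemapindex>" % (SM, RS)
    ).encode("utf-8"),
    "http://example.com/resourcelist-part1.xml": make_resource_list(
        "http://example.com/res1", "http://example.com/res2"
    ),
    "http://example.com/resourcelist-part2.xml": make_resource_list(
        "http://example.com/res3"
    ),
}


def collect_resource_uris(uri, load):
    """Return all resource URIs reachable from a list or an index."""
    root = ET.fromstring(load(uri))
    if root.tag == "{%s}sitemapindex" % SM:
        uris = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            uris.extend(collect_resource_uris(loc.text.strip(), load))
        return uris
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


all_uris = collect_resource_uris(
    "http://example.com/resourcelist-index.xml", DOCUMENTS.__getitem__
)
print(all_uris)
```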
The previous section provides a concrete walkthrough of some capabilities that a Source can implement and describes how a Destination can leverage those capabilities to remain synchronized with the Source's changing data. This section provides a high-level overview of the various ResourceSync capabilities and shows how these fit into a Destination's processes aimed at remaining in step with changes.
From the perspective of a Source, the ResourceSync capabilities that can be supported to enable Destination processes to remain in sync with its changing data can be summarized as follows:
Describing Content - In order to describe its data, a Source can maintain an up-to-date Resource List. A basic Resource List minimally provides the URIs of resources that the Source makes available for synchronization. However, additional information can be added to the Resource List to optimize the Destination's process of obtaining the Source's resources, including the most recent modification time of resources and fixity information such as content-based checksum or hash and length. Figure 1 shows a Source publishing up-to-date Resource Lists at times t2 and t4. At t4, too many resources need to be listed to fit in a single Resource List and hence multiple Resource Lists are published and grouped in a Resource List Index.
Packaging Content - In order to make its data available for download, a Source can recurrently make an up-to-date Resource Dump of its content available. A Resource Dump points at one or more packages, each of which contains bitstreams associated with resources hosted by the Source. Each package also contains a Resource Dump Manifest that provides metadata about the bitstreams contained in the package, minimally including their associated URI and their file path in the ZIP file. Figure 1 shows that the Source's most recent Resource Dump was published at time t3. It also shows the availability of a Resource Dump Archive that additionally points at a Resource Dump that was published at time t1.
Describing Changes - In order to achieve lower synchronization latency and/or to improve transfer efficiency, a Source may publish a Change List that provides information about changes to its resources. It is up to the Source to decide what the temporal interval is that is covered by a Change List, for example, expressing all the changes that occurred during the previous hour, the current day, or since the most recent publication of a Resource List. Per resource change, a Change List minimally conveys the URI of the changed resource as well as the datetime and nature of the change (create, update, delete). Since a Change List is organized on the basis of changes, it may list the same resource multiple times, once per change. Figure 1 shows that the Source's most current Change List covers resource changes that occurred between times t8 and t10. It also shows the availability of a Change List Archive that leads to Change Lists that cover prior temporal intervals.
Packaging Changes - In order to make content changes available for download, a Source can publish a Change Dump. A Change Dump points at one or more packages, each of which contains bitstreams that correspond to changes that occurred to a Source's resources. Each package also contains a Change Dump Manifest that provides metadata about the bitstreams provided in the Change Dump. Per bitstream, the Change Dump Manifest minimally includes the associated URI, the datetime when the change that resulted in the bitstream occurred, the nature of the change (create, update, delete) and the file path of the bitstream in the ZIP file. It is up to a Source to decide the temporal interval covered by a Change Dump, for example, covering all the resource changes that occurred during the previous hour, the current day, or since the most recent publication of a Dump. Since a Change Dump is organized on the basis of changes, the package(s) it points at may contain multiple bitstreams associated with any given resource, one per change. Figure 1 shows that the Source's most current Change Dump covers resource changes that occurred between times t9 and t11. It also shows the availability of a Change Dump Archive that leads to Change Dumps that cover prior temporal intervals.
Linking to Related Resources - There are several reasons to provide additional links from a resource subject to synchronization to related resources:
Archived Resource Dumps, Change Lists, and Change Dumps - The Source can make available Resource Dumps, Change Lists, and Change Dumps that were published prior to the current ones. To that end, it can publish a Resource Dump Archive, a Change List Archive, and a Change Dump Archive, respectively. For example, Figure 1 shows a Resource Dump Archive that points at the current Resource Dump published at t3 but also at a prior one published at t1. A Resource Dump Archive allows a Destination to obtain not only the current but also prior versions of a Source's resources. Change List Archives and Change Dump Archives allow a Destination to catch up on changes it may have missed, for example, because it went offline.
From the perspective of a Destination, three key processes are enabled by the ResourceSync capabilities; Figure 2 provides an overview:
Baseline Synchronization - In order to become synchronized with a Source, the Destination must make an initial copy of the Source's data. A Destination can obtain the Resource List that conveys the URIs of the Source's resources, and subsequently dereference those URIs one by one. A Destination can also obtain a Resource Dump that conveys the URIs of one or more content packages each of which contains bitstreams associated with the Source's resources. A Destination can dereference those URIs and subsequently unpack the retrieved content packages, guided by the contained Resource Dump Manifest.
Incremental Synchronization - A Destination can remain in sync with a Source by repeatedly performing a Baseline Synchronization. To increase efficiency and decrease latency, a Source may communicate information about changes to its resources via Change Lists. This allows a Destination to obtain up-to-date content by dereferencing the URIs of newly created and updated resources listed in the Change List. It also allows a Destination to remove its copies of deleted resources, if needed. A Source can also make a Change Dump available that points at one or more packages, each of which contains bitstreams that correspond to changes that occurred to a Source's resources. In this case the Destination first obtains the Change Dump, then obtains the package(s) by dereferencing the URI(s) listed in the Change Dump, and subsequently unpacks those, guided by the contained Change Dump Manifest. In order to allow a Destination to obtain not only the current version of a resource but also prior versions, a Source may provide mechanisms to discover and obtain archival copies. These include Resource Dump Archives and Change Dump Archives as well as links to resource versions.
Audit - In order to verify whether it is in sync with the Source, a Destination must be able to check that the content it obtained matches the current resources hosted by the Source both regarding coverage and accuracy. This requires an up-to-date list of resources hosted by the Source, which can be compiled on the basis of a Resource List and Change Lists. It also requires these Lists to contain metadata per resource that characterizes its most recent state, such as last modification time, length, and content-based hash.
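The Audit process described above could be sketched as follows. The sketch works on already parsed data, uses the last modification time as the only per-resource metadata, and assumes changes are supplied in chronological order. The function names and the report layout are hypothetical.

```python
def expected_state(resource_list, change_lists):
    """Compile the Source's current {uri: lastmod} view.

    resource_list: {uri: lastmod} from the most recent Resource List.
    change_lists:  Change Lists published since then, each a list of
                   (uri, lastmod, change) tuples in chronological order.
    """
    state = dict(resource_list)
    for change_list in change_lists:
        for uri, lastmod, change in change_list:
            if change == "deleted":
                state.pop(uri, None)
            else:  # "created" or "updated"
                state[uri] = lastmod
    return state


def audit(local, expected):
    """Compare the Destination's {uri: lastmod} view with the expected one."""
    shared = set(local) & set(expected)
    return {
        "missing": sorted(set(expected) - set(local)),  # coverage gaps
        "extra": sorted(set(local) - set(expected)),    # should be removed
        "stale": sorted(u for u in shared if local[u] != expected[u]),
    }


resource_list = {
    "http://example.com/res1": "2013-01-02T13:00:00Z",
    "http://example.com/res2": "2013-01-02T14:00:00Z",
}
change_lists = [[
    ("http://example.com/res2", "2013-01-03T10:00:00Z", "updated"),
    ("http://example.com/res1", "2013-01-03T11:00:00Z", "deleted"),
    ("http://example.com/res4", "2013-01-03T12:00:00Z", "created"),
]]
local = {
    "http://example.com/res1": "2013-01-02T13:00:00Z",
    "http://example.com/res2": "2013-01-02T14:00:00Z",
}
report = audit(local, expected_state(resource_list, change_lists))
print(report)
```

A Destination that also records length and hash values could extend the `stale` comparison to cover them.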
In order to advertise the capabilities it supports, a Source publishes a Capability List. Such a list has an entry per supported capability, and the URI where the capability can be accessed as well as the capability type is conveyed for each. For example, Figure 3 depicts a Capability List for a Source that supports the following capabilities: Resource List, Resource Dump, Change List, and Change List Archive. Because these capabilities are conveyed in the same Capability List, they uniformly apply to the set of the Source's resources covered by that Capability List. For example, if a given resource appears in the Resource List then it must also appear in a Resource Dump and changes to the resource must also be reported in the Change List.
The distinction between a Change List and a Change List Archive is made clear by the use of a <urlset> or a
<sitemapindex> document, respectively. Each of the Change Lists provides a link with an up relation type pointing
to the Change List Archive.
The Capability List itself is typically made discoverable by a Source by publishing it at the ResourceSync
well-known URI /.well-known/resourcesync.
Links with a resourcesync relation type
in HTML pages or HTTP headers can also be used, in which case the linked Capability List must pertain to the
resource that provides the link; this means that the resource must be covered by all capabilities listed in that
Capability List. The various capability documents can also include a link
with a resourcesync relation type pointing at the Capability List they fall under.
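Discovery through a link in an HTML page might be implemented along these lines. This is only a sketch based on Python's standard `html.parser` module; the class name `CapabilityLinkFinder`, the sample page, and the Capability List URI it points at are invented for illustration.

```python
from html.parser import HTMLParser


class CapabilityLinkFinder(HTMLParser):
    """Collect href values of <link> elements with rel="resourcesync"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several whitespace-separated relation types.
        relations = (attributes.get("rel") or "").lower().split()
        if "resourcesync" in relations and attributes.get("href"):
            self.hrefs.append(attributes["href"])


# A hypothetical page for a resource covered by a Capability List.
PAGE = """<html>
  <head>
    <link rel="stylesheet" href="/style.css"/>
    <link rel="resourcesync"
          href="http://example.com/dataset1/capabilitylist.xml"/>
  </head>
  <body>A resource subject to synchronization.</body>
</html>"""

finder = CapabilityLinkFinder()
finder.feed(PAGE)
print(finder.hrefs)
```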
Table 2.1 provides a summary of this Overview section. The table lists Destination processes as columns and Source capabilities as rows, with cells indicating the applicability of a capability for a given process.
| Source Capabilities | Destination Processes | ||
|---|---|---|---|
| Baseline Synchronization | Incremental Synchronization | Audit | |
| Describing Resources | |||
| Resource List | X | X | |
| Packaging Resources | |||
| Resource Dump | X | ||
| Describing Changes | |||
| Change List | X | X | |
| Packaging Changes | |||
| Change Dump | X | ||
| Linking to Related Resources | |||
| Mirrored Content | X | X | X |
| Alternate Representations | X | X | X |
| Patching Content | X | X | |
| Resources and Metadata about Resources | X | X | X |
| Prior Versions of Resources | X | X | |
| Republishing Resources | X | X | X |
| Providing Historical Data | |||
| Resource Dump Archive | X | ||
| Change List Archive | X | ||
| Change Dump Archive | X | ||
| Advertising Capabilities | |||
| Capability List | X | X | X |
Table 2.1: Source capabilities versus Destination processes
In order to convey information pertaining to resources in the ResourceSync framework, the Sitemap
(root element <urlset>) and Sitemap index (root element <sitemapindex>)
document formats introduced by the Sitemap protocol are used for a variety of purposes.
The <sitemapindex> document format is used when it is necessary to
group multiple documents of the <urlset> format.
The document formats, as well as their ResourceSync extension elements, are shown in Table 3.1.
The <rs:md> and <rs:ln> elements are introduced to express metadata and links, respectively.
Both are in the ResourceSync XML Namespace and can have attributes.
The attributes defined in this namespace are listed in Table 3.2 and detailed below.
The <rs:ln> element as well as several of the ResourceSync attributes are based upon other
specifications and in those cases inherit the semantics defined there; the "RFC" column of Table 3.2 refers to those specifications.
Communities can introduce additional attributes when needed but must use an XML Namespace other than that of ResourceSync.
Sitemap (root element <urlset>):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md />
  <rs:ln />
  <url>
    <loc />
    <lastmod />
    <rs:md />
    <rs:ln />
  </url>
  <url>
    ...
  </url>
</urlset>
Sitemap Index (root element <sitemapindex>):
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md />
  <rs:ln />
  <sitemap>
    <loc />
    <lastmod />
    <rs:md />
    <rs:ln />
  </sitemap>
  <sitemap>
    ...
  </sitemap>
</sitemapindex>
Table 3.1: The Sitemap document formats including the ResourceSync extensions
The overall structure of the ResourceSync documents is as follows:
<urlset> or <sitemapindex> - These elements are the root elements of ResourceSync documents; they have one mandatory and one optional child element:
<rs:md> - In this context, the element conveys information about the document itself. Its use is
mandatory and it has two mandatory attributes:
capability - The value of the attribute conveys the nature of the document, e.g. whether the document is a Resource List, a Change List, a Manifest, etc.
Defined values are resourcelist, changelist, resourcedump, changedump, resourcedump-manifest,
changedump-manifest, and capabilitylist.modified - The value of the attribute is the last modification time of the document, expressed as a W3C Datetime;
the use of a complete date and time expressed in UTC using the format
YYYY-MM-DDThh:mm:ss[.s]Z is recommended.<rs:ln> - An optional and repeatable element used to support discovery of other documents,
for example, the Source's Capability List or a description of the nature of the Source's data.
The URI of such a document is provided in a mandatory href attribute. The rel attribute to express a relationship is also mandatory.
Other attributes can be used, such as type to express the media type of the document.<url> or <sitemap> - The <urlset> element should have zero or more <url> child elements, and the
<sitemapindex> element has zero or more <sitemap> child elements. Each such child element is used to convey information about
a resource that plays a role in the ResourceSync framework. They can have the following child elements:
<loc> - A mandatory element that conveys the URI of the resource that plays a role in the ResourceSync framework.<lastmod> - An element that conveys the last modification time of the resource with the URI provided in <loc>,
expressed as a W3C Datetime as described above.
Its use is optional in some, and mandatory in other documents.<changefreq> - An optional element that provides a hint about the change frequency
of the resource with the URI provided in <loc>. Defined values are always, hourly,
daily, weekly, monthly, yearly, and never.
The value always should be used for resources that change each time they are accessed.
The value never should be used for archived resources.<rs:md> - In this context, the element conveys metadata pertaining to the resource with the URI provided in <loc>.
The element is not repeatable, and is mandatory for some documents and optional for others. It can have several attributes and the ones defined in the
ResourceSync XML Namespace are as follows:
capability - When the attribute is used, its value indicates the nature of
that resource, e.g. whether it is a Resource List, a Change List, a Change Dump, etc. Defined values are listed in the above
description of the capability attribute. When the attribute is not used, this signifies that the resource
is subject to synchronization.
change - The value of the attribute conveys the type of change that a resource underwent. Defined values are
created, updated, and deleted to convey the creation, update, and deletion of a resource, respectively.
This attribute is used in Change Lists and Change Dump Manifests.
hash - The value of the attribute conveys fixity information for a resource representation returned when the URI in <loc> is dereferenced.
The attribute value is expressed in the form of a whitespace-delimited list of hash values.
Each hash value is represented by a hex-encoded digest and is preceded by a token that identifies the utilized hash algorithm, e.g. md5:, sha-256:.
length - The value of the attribute conveys the content length of a resource representation returned when the URI in <loc> is dereferenced.
The value of the length attribute should be equal to the value of the Content-Length header in the HTTP response
and must be computed as defined in RFC 2616, Sec. 4.4.
path - The attribute is only used in Resource Dump Manifests and Change Dump Manifests.
Its value conveys the file path of the bitstream associated with the URI in <loc> in the ZIP file. That is
the relative file system path where the bitstream would reside if the ZIP were unpacked.
type - The value of the attribute conveys the media type of a resource representation returned when the URI in <loc> is dereferenced.
Registered values are listed in the IESG MIME-Type registry.
<rs:ln> - In this context, an optional and repeatable element used to link to resources related to the one with the URI provided in <loc>, such as
a copy on a mirror site, a prior version of the resource, etc. (see Linking to Related Resources in Section 2.2.1).
It can have several attributes and the ones defined in the ResourceSync XML Namespace are as follows:
href - A mandatory attribute to convey the URI of the related resource.
rel - A mandatory attribute to convey the relationship between the resource with the URI in <loc>
and the one with the URI in href.
hash, length, modified, path, type - Optional
attributes with meanings as described above and pertaining to the related resource.
pri - An optional attribute used to express a priority among links with the same relation type.
The attribute value is an integer between 1 and 999,999, with a lower integer
indicating a higher priority and the absence of the attribute indicating a value of 999,999.
Table 3.2 lists the elements used in ResourceSync documents and for each shows the attributes in the ResourceSync XML Namespace that can be used with them. The "Specification" column refers to the specification where elements or attributes were introduced that ResourceSync equivalents are based upon and inherit their semantics from. A mark in the "Representation" column for an attribute indicates that it can only be used when a specific representation of a resource is concerned, whereas a mark in the "Resource" column indicates it is usable for a resource in general.
| Element/Attribute | Specification | Resource | Representation |
|---|---|---|---|
| <urlset> or <sitemapindex> | Sitemap protocol | | |
| <rs:md> | This specification | | |
| capability | This specification | | |
| modified | Atom Link Extensions | | |
| <rs:ln> | RFC4287 | | |
| rel | RFC4287 | | |
| href | RFC4287 | | |
| <url> or <sitemap> | Sitemap protocol | | |
| <loc> | Sitemap protocol | | |
| <lastmod> | Sitemap protocol | | |
| <changefreq> | Sitemap protocol | | |
| <rs:md> | This specification | | |
| capability | This specification | | |
| change | This specification | X | X |
| hash | Atom Link Extensions | X | |
| length | RFC4287 | X | |
| path | This specification | X | |
| type | RFC4287 | X | |
| <rs:ln> | This specification | | |
| hash | Atom Link Extensions | X | |
| href | RFC4287 | X | X |
| length | RFC4287 | X | |
| modified | Atom Link Extensions | X | X |
| path | This specification | X | |
| rel | RFC4287 | X | X |
| pri | RFC6249 | X | X |
| type | RFC4287 | X | |
Table 3.2: Elements and associated attributes defined for the ResourceSync documents
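The hash and length attributes described in this section lend themselves to a simple fixity check on the Destination side. The following Python sketch illustrates one way to do this. The function names and the token-to-algorithm mapping are illustrative assumptions and not part of this specification; only the md5 and sha-256 tokens mentioned above are recognized, and the length comparison assumes the bitstream was received without any content encoding.

```python
import hashlib

# Assumed mapping from ResourceSync hash tokens to hashlib algorithm names.
ALGORITHMS = {"md5": "md5", "sha-256": "sha256"}


def parse_hash_attribute(value):
    """Split a whitespace-delimited hash attribute into {token: hex digest}."""
    digests = {}
    for item in value.split():
        token, _, digest = item.partition(":")
        digests[token.lower()] = digest.lower()
    return digests


def verify_bitstream(data, hash_value, length=None):
    """Return True if data matches the length (if given) and every known digest."""
    if length is not None and len(data) != int(length):
        return False
    for token, expected in parse_hash_attribute(hash_value).items():
        name = ALGORITHMS.get(token)
        if name is None:
            continue  # unknown algorithm token: ignored in this sketch
        if hashlib.new(name, data).hexdigest() != expected:
            return False
    return True


# Build a hash attribute value for a small payload and check it.
payload = b"hello"
attr = ("md5:" + hashlib.md5(payload).hexdigest()
        + " sha-256:" + hashlib.sha256(payload).hexdigest())
print(verify_bitstream(payload, attr, length="5"))
```

A Destination could run such a check after dereferencing the URI in <loc> and decide to retry or report an error when it fails.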
A Source may publish a description of the resources it makes available for synchronization. This information enables a Destination to make an initial copy of some or all of those resources, or to update a local copy to remain synchronized with changes.
A Resource List is introduced to list and describe the resources that a Source makes available for synchronization. It presents a snapshot of a Source's resources at a particular point in time.
A Resource List is based on the <urlset> document format introduced by the Sitemap protocol.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute that has a value of resourcelist and it must have a modified attribute that conveys the Resource List's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per resource. This element does not have attributes, but uses child elements to convey information about the resource. The <url> element has the following child elements:
The <loc> child element provides the URI of the resource.
A <lastmod> child element and an optional <changefreq> element with semantics as described in Section 3.
The <rs:md> child element provides further metadata about the resource. It can have the attributes hash, length, and type, as described in Section 3.
<rs:ln> child elements link to related resources as detailed in Section 8.
Example 4.1 shows a Resource List with two resources.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcelist"
modified="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2013-01-02T14:00:00Z</lastmod>
<rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e
sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
length="14599"
type="application/pdf"/>
</url>
</urlset>
Example 4.1: A Resource List
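As an illustration of how a Destination might read such a document, the following Python sketch parses a shortened copy of Example 4.1 with the standard library. The function name read_resource_list and the returned data layout are assumptions made for this sketch, not something this specification defines.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the lookups below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# A shortened copy of Example 4.1 (first resource only).
RESOURCE_LIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="resourcesync"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876"
           type="text/html"/>
  </url>
</urlset>"""


def read_resource_list(xml_bytes):
    """Return (modified, resources) for a Resource List document."""
    root = ET.fromstring(xml_bytes)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "resourcelist":
        raise ValueError("not a Resource List")
    resources = []
    for url in root.findall("sm:url", NS):
        entry = {
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        }
        resource_md = url.find("rs:md", NS)
        if resource_md is not None:
            entry.update(resource_md.attrib)  # hash, length, type
        resources.append(entry)
    return md.get("modified"), resources


modified, resources = read_resource_list(RESOURCE_LIST)
print(modified, resources[0]["loc"], resources[0]["length"])
```

Attribute values are returned as strings; a real client would likely convert length to an integer and the datetimes to a comparable type.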
The Sitemap protocol has a limit of 50,000 resources per Sitemap. It introduces the Sitemap index to group up to 50,000 Sitemaps thus increasing the limit to 2.5 billion resources. The ResourceSync framework adopts this approach and introduces a Resource List Index that points to up to 50,000 Resource Lists. The union of the Resource Lists referred to in the Resource List Index represents the entire set of resources that a Source makes available for synchronization. This set of resources, regardless of whether it is conveyed in a single Resource List or in multiple Resource Lists via a Resource List Index, represents the state of the Source's data at a particular point in time - the creation time of the Resource List(s).
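The limits above can be checked with a few lines of arithmetic. The Python sketch below also shows one possible way for a Source to split a large set of URIs into Resource Lists of at most 50,000 entries each; the function name and the slicing strategy are assumptions of this sketch, and a Source is free to partition its resources differently.

```python
MAX_ENTRIES = 50_000  # limit per Sitemap and per Sitemap index


def split_into_resource_lists(uris, limit=MAX_ENTRIES):
    """Partition a list of URIs into chunks that each fit in one Resource List."""
    return [uris[i:i + limit] for i in range(0, len(uris), limit)]


# 50,000 lists of 50,000 resources each give the 2.5 billion upper bound.
print(MAX_ENTRIES * MAX_ENTRIES)

# A hypothetical Source with 120,001 resources needs three Resource Lists.
uris = ["http://example.com/res%d" % n for n in range(120_001)]
chunks = split_into_resource_lists(uris)
print([len(chunk) for chunk in chunks])
```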
A Resource List Index is based on the <sitemapindex> document format introduced by the Sitemap protocol.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of resourcelist and it must have a modified attribute that conveys the Resource List Index's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Resource List. This element does not have attributes, but uses child elements to convey information about the Resource List. The <sitemap> element has the following child elements:
The <loc> child element provides the URI of the Resource List.
A <lastmod> child element with semantics as described in Section 3.
The Destination can determine whether it has reached a Resource List or a Resource List Index based
on whether the root element is <urlset> or <sitemapindex>
respectively. A Resource List Index is shown in Example 4.2.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcelist"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>http://example.com/resourcelist3.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/resourcelist2.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/resourcelist1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
</sitemapindex>
Example 4.2: A Resource List Index
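The root-element test described above can be sketched in Python as follows. The function names and the "list"/"index" return values are illustrative choices for this sketch only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def document_kind(xml_bytes):
    """Return "list" for a <urlset> root and "index" for a <sitemapindex> root."""
    root = ET.fromstring(xml_bytes)
    if root.tag == f"{{{SITEMAP_NS}}}urlset":
        return "list"
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        return "index"
    raise ValueError("unexpected root element: " + root.tag)


def child_locations(xml_bytes):
    """Return the text of every <loc> element, in document order."""
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


INDEX = b"""<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/resourcelist3.xml</loc></sitemap>
  <sitemap><loc>http://example.com/resourcelist2.xml</loc></sitemap>
</sitemapindex>"""

print(document_kind(INDEX), child_locations(INDEX))
```

When the result is "index", a Destination would fetch each listed location and process it as a Resource List.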
Example 4.2 refers to three Resource Lists identified by:
http://example.com/resourcelist3.xml, http://example.com/resourcelist2.xml, and http://example.com/resourcelist1.xml.
Example 4.3 shows the content of the Resource List identified by the URI
http://example.com/resourcelist3.xml.
Structurally, it is identical to the Resource List shown in Example 4.1 but it contains an additional
<rs:ln> child element of <urlset>
that provides a navigational link with the relation type up to the parent Resource List Index
shown in Example 4.2.
This link is meant to ease navigation for Destinations and its adoption is therefore strongly recommended.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:ln rel="up"
href="http://example.com/dataset1/resourcelist-index.xml"/>
<rs:md capability="resourcelist"
modified="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res3</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c8753"
length="4385"
type="application/pdf"/>
</url>
<url>
<loc>http://example.com/res4</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md hash="md5:4556abdf8ebdc9802ac0c6a7402c9881"
length="883"
type="image/png"/>
</url>
</urlset>
Example 4.3: A Resource List with a navigational link to its parent Resource List Index
In order to provide Destinations with an efficient way to copy a Source's resources using a small number of HTTP requests, a Source may provide packaged bitstreams for its resources.
A Source can publish a Resource Dump, which provides links to packages of the resources' bitstreams. Each package is a ZIP file that contains the bitstreams of the Source's resources. The Resource Dump represents the Source's state at a particular point in time. It may be used to transfer resources from the Source in bulk, rather than the Destination having to make many separate requests. A typical scenario in which a Destination would obtain a Resource Dump is the Baseline Synchronization process.
A Resource Dump is based on the <urlset> document format introduced by the Sitemap protocol.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute that has a value of resourcedump and it must have a modified attribute that conveys the Resource Dump's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per ZIP package. This element does not have attributes, but uses child elements to convey information about the package. The <url> element has the following child elements:
The <loc> child element provides the URI of the package.
A <lastmod> child element with semantics as described in Section 3.
A <rs:md> child element with the type attribute to convey the MIME-Type of the package and the length attribute to convey the length of the package. The child element may further have attributes such as hash, as described in Section 3.
Content packages made discoverable by a Resource Dump use the ZIP file format. It is recommended to convey the
application/zip media type of the ZIP file as well as its length by means of the type and length
attributes of the <rs:md> child element, respectively.
Example 5.1 shows a Resource Dump document that points to three ZIP files.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcedump"
modified="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/resourcedump-part3.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="4765"/>
</url>
<url>
<loc>http://example.com/resourcedump-part2.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="9875"/>
</url>
<url>
<loc>http://example.com/resourcedump-part1.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="2298"/>
</url>
</urlset>
Example 5.1: A Resource Dump document
Each content package referred to from a Resource Dump must contain a Resource Dump Manifest
file that describes the package's constituent bitstreams. The file must be named manifest.xml and must be located at the top level of the ZIP package.
The Resource Dump Manifest is based on the <urlset> format.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute with a value of resourcedump-manifest and it must have a modified attribute that conveys the Resource Dump Manifest's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per bitstream. This element does not have attributes, but uses child elements to convey information about the bitstream. The <url> element has the following child elements:
The <loc> child element provides the URI which the Source associates with the bitstream.
A <lastmod> child element and an optional <changefreq> element with semantics as described in Section 3.
The <rs:md> child element must have a path attribute to convey the location of the bitstream within the package. It can further have the attributes hash, length, and type, as described in Section 3.
<rs:ln> child elements link to related resources as detailed in Section 8.
Providing the URI of the bitstream enables a Destination to achieve the same result as obtaining the data by dereferencing the URIs listed in a Resource List.
The value of the path attribute is relative to the root of the package and it is expressed with a leading slash (/) as demonstrated in
Example 5.2.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcedump-manifest"
modified="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"
path="/resources/res1"/>
</url>
<url>
<loc>http://example.com/res2</loc>
<lastmod>2013-01-02T14:00:00Z</lastmod>
<rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e
sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
length="14599"
type="application/pdf"
path="/resources/res2"/>
</url>
</urlset>
Example 5.2: A Resource Dump Manifest
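To illustrate how a Destination might unpack a content package using its manifest, the following Python sketch builds a tiny ZIP file in memory and then maps each URI in manifest.xml to its bitstream. The helper names are hypothetical. The sketch assumes that a path such as /resources/res1 corresponds to the ZIP member resources/res1 (that is, the leading slash is stripped), and it handles only a Resource Dump Manifest, not a Resource Dump Manifest Index.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

MANIFEST = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest"
         modified="2013-01-03T09:00:00Z"/>
  <url>
    <loc>http://example.com/res1</loc>
    <rs:md path="/resources/res1"/>
  </url>
</urlset>"""


def build_demo_package():
    """Create an in-memory ZIP with a top-level manifest.xml and one bitstream."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("manifest.xml", MANIFEST)
        package.writestr("resources/res1", b"<html>res1</html>")
    return buffer.getvalue()


def read_dump_package(zip_bytes):
    """Return {URI: bitstream bytes} for every <url> entry in the manifest."""
    bitstreams = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        root = ET.fromstring(package.read("manifest.xml"))
        for url in root.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            path = url.find("rs:md", NS).get("path")
            # Assumption: the leading slash is dropped to obtain the member name.
            bitstreams[uri] = package.read(path.lstrip("/"))
    return bitstreams


print(read_dump_package(build_demo_package()))
```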
Since a Resource Dump Manifest is implemented as a Sitemap, it cannot
contain more than 50,000 <url> elements. However, it is sometimes convenient to be able to include more than
50,000 bitstreams in a single package. This is accommodated by a Resource Dump Manifest Index, based on the <sitemapindex> format.
The Resource Dump Manifest Index file must be named manifest.xml.
It is based on the <sitemapindex> document format. It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of resourcedump-manifest and it must have a modified attribute that conveys the Resource Dump Manifest Index's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Resource Dump Manifest. This element does not have attributes, but uses child elements to convey information about the Resource Dump Manifest. The <sitemap> element has the following child elements:
The <loc> child element provides the location of the Resource Dump Manifest within the ZIP package.
A <lastmod> child element with semantics as described in Section 3.
The <rs:md> child element must have a path attribute to convey the location of the Resource Dump Manifest within the package.
Destinations can determine whether a package contains a Resource Dump Manifest or a Resource Dump Manifest Index by inspecting the root element of the manifest.xml file. Example 5.3 shows a Resource Dump Manifest Index.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcedump-manifest"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>/manifests/part3.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part3.xml"/>
</sitemap>
<sitemap>
<loc>/manifests/part2.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part2.xml"/>
</sitemap>
<sitemap>
<loc>/manifests/part1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part1.xml"/>
</sitemap>
</sitemapindex>
Example 5.3: A Resource Dump Manifest Index
A Source may publish a record of the changes to its content over a period of time. This enables Destinations to efficiently follow the changes, and in doing so supports incremental synchronization.
A Change List is a document that contains a description of changes to the resources at a Source. Unlike a Resource List, if a resource underwent multiple changes, it will be listed multiple times in the Change List. It is up to the Source to determine the frequency with which it publishes or updates Change Lists and also the time period that the Change List covers. A Source may choose to publish some number of recent changes, or only the changes from a particular period, such as the last day or week.
A Change List is based on the <urlset> document format introduced by the Sitemap protocol.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute that has a value of changelist and it must have a modified attribute that conveys the Change List's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per changed resource. This element does not have attributes, but uses child elements to convey information about the changed resource. The <url> element has the following child elements:
The <loc> child element provides the URI of the changed resource.
A <lastmod> child element with semantics as described in Section 3.
The <rs:md> child element must have the attribute change to convey the nature of the change. It may take values created, updated, and deleted. It can further have attributes hash, length, and type, as described in Section 3.
<rs:ln> child elements link to related resources as detailed in Section 8.
The datetime of the resource change can be used by Destinations to determine if it has already been processed. A Destination can walk through the Change List until it reaches a datetime before it last requested the Change List, and then start processing the new changes in order of their occurrence. In the same manner as processing a Resource List, the Destination can retrieve a representation of the resource by dereferencing its URI.
All entries in a Change List must be in chronological order. In particular, the least recently changed resource must be listed at the beginning of the Change List, while the most recently changed resource must be listed at the end of the document. This ordering supports Destinations in processing the changes. However, sophisticated Destinations may reorder the Change List to avoid unnecessary processing, for example, by processing only the most recent change to a resource. Example 6.1 shows the content of a Change List with changes to three resources. The example shows one creation, one update, and one deletion, and the changes are in chronological order.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res1.html</loc>
<lastmod>2013-01-02T11:00:00Z</lastmod>
<rs:md change="created"/>
</url>
<url>
<loc>http://example.com/res2.pdf</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
</url>
<url>
<loc>http://example.com/res3.tiff</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="deleted"/>
</url>
</urlset>
Example 6.1: A Change List describing three resource changes
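The processing described in this section can be sketched in Python as follows. The sketch works on (URI, datetime, change) tuples that are assumed to have been extracted from a Change List already, handles only the recommended whole-second UTC datetime form, and treats a change whose datetime equals the last synchronization time as still pending. The function names and the latest_only option are illustrative assumptions.

```python
from datetime import datetime, timezone


def parse_utc(value):
    """Parse the whole-second UTC form, e.g. 2013-01-02T11:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def pending_changes(changes, last_sync, latest_only=False):
    """Return changes at or after last_sync, in chronological order.

    With latest_only=True, only the most recent change per URI is kept.
    """
    fresh = [c for c in changes if parse_utc(c[1]) >= last_sync]
    if not latest_only:
        return fresh
    latest = {}
    for loc, lastmod, change in fresh:
        latest.pop(loc, None)            # re-insert so ordering stays chronological
        latest[loc] = (loc, lastmod, change)
    return list(latest.values())


# The three changes of Example 6.1.
CHANGES = [
    ("http://example.com/res1.html", "2013-01-02T11:00:00Z", "created"),
    ("http://example.com/res2.pdf", "2013-01-02T13:00:00Z", "updated"),
    ("http://example.com/res3.tiff", "2013-01-02T18:00:00Z", "deleted"),
]

last_sync = parse_utc("2013-01-02T12:00:00Z")
for loc, lastmod, change in pending_changes(CHANGES, last_sync):
    print(change, loc)
```

Note that collapsing a creation followed by an update leaves only the update entry, so a Destination using latest_only would need to treat an update to an unknown resource as a creation.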
A unique identifier for each change might be useful in some situations. No explicit identity is defined
in this specification, but the combination of the content in the <loc> and of the <lastmod> elements of
the <url> element
is recommended for this purpose. The Source is responsible for providing a sufficiently granular time for the content of the <lastmod>
element to ensure that this combination results in a truly unique identifier.
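A minimal Python sketch of this recommendation follows. It treats the pair of <loc> and <lastmod> values as the change identifier and uses it to skip changes that were already processed. The function names are hypothetical, and the comparison is on the literal strings, so the same instant written in two different forms would count as two identifiers.

```python
def change_id(loc, lastmod):
    """Combine the <loc> and <lastmod> content into one identifier."""
    return (loc, lastmod)


def unseen_changes(changes, seen):
    """Return changes whose identifier is not in seen, and record them."""
    fresh = []
    for loc, lastmod, change in changes:
        key = change_id(loc, lastmod)
        if key not in seen:
            seen.add(key)
            fresh.append((loc, lastmod, change))
    return fresh


seen = set()
first = [("http://example.com/res1.html", "2013-01-02T11:00:00Z", "created")]
second = first + [("http://example.com/res1.html", "2013-01-02T14:00:00Z", "updated")]

print(len(unseen_changes(first, seen)))   # first retrieval of the Change List
print(len(unseen_changes(second, seen)))  # later retrieval with one extra entry
```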
In order to reduce the number of requests required to obtain resource changes, a Source may provide packaged bitstreams for changed resources.
To make content changes available for download, a Source can publish Change Dumps that refer to packages of the changed bitstreams. Similar to Change Lists, it is up to the Source to determine the time period a Change Dump covers or how many bitstreams are contained in each package. Each package is a ZIP file that contains bitstreams of the resources after each change.
A Change Dump is based on the <urlset> document format introduced by the Sitemap protocol.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute that has a value of changedump and it must have a modified attribute that conveys the Change Dump's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per ZIP package of changed content. This element does not have attributes, but uses child elements to convey information about the package of changed content. The <url> element has the following child elements:
The <loc> child element provides the URI of the package.
A <lastmod> child element with semantics as described in Section 3.
A <rs:md> child element with a type attribute to convey the MIME-Type of the package and a length attribute to convey the length of the package. It may further have the hash attribute, as described in Section 3.
Content packages made discoverable by a Change Dump use the ZIP file format. It is recommended to convey the
application/zip media type of the ZIP file as well as its length by means of the type and length
attributes of the <rs:md> child element, respectively.
Example 7.1 shows a Change Dump document with three pointers to packages of changed content.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changedump"
modified="2013-01-03T09:00:00Z"/>
<url>
<loc>http://example.com/changedump-part3.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="3109"/>
</url>
<url>
<loc>http://example.com/changedump-part2.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="6629"/>
</url>
<url>
<loc>http://example.com/changedump-part1.zip</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md type="application/zip"
length="8124"/>
</url>
</urlset>
Example 7.1: A Change Dump
Each package of changed content referred to from a Change Dump must contain a Change Dump Manifest file that describes
the file's constituent bitstreams. The file has to be named manifest.xml and has to be packaged at the top level of the ZIP
package.
Like in a Change List, all entries in a Change Dump Manifest must be in chronological order, meaning the document starts with a reference to
the least recently changed bitstream and ends with a reference to the most recently changed bitstream.
The Change Dump Manifest is based on the <urlset> format.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute with a value of changedump-manifest and it must have a modified attribute that conveys the Change Dump Manifest's last modification time.
The <rs:ln> child element of <urlset> points to the Capability List with the relation type resourcesync (see Section 10).
One <url> child element of <urlset> per changed bitstream. This element does not have attributes, but uses child elements to convey information about the bitstream. The <url> element has the following child elements:
The <loc> child element provides the URI which the Source associates with the changed bitstream.
A <lastmod> child element with semantics as described in Section 3.
The <rs:md> child element must have a change attribute to convey the type of change to the resource. It may take values created, updated, and deleted. It also must have a path attribute to convey the location of the bitstream within the ZIP package. It can further have the attributes hash, length, and type, as described in Section 3.
<rs:ln> child elements link to related resources as detailed in Section 8.
Providing the URI of the changed bitstream enables a Destination to achieve the same result as obtaining the data by dereferencing the URIs listed in a Change List. The path is relative to the root of the package and it is expressed with a leading slash (/) as demonstrated in Example 7.2.
A Change Dump Manifest is shown below. It shows a total of four changes to three resources.
The resource identified by the URI http://example.com/res1.html is included twice. It was first created and later
updated, which accounts for the two changes. While the URI in the <loc> child element is the same, the path attribute
of the <rs:md> child element refers to a different bitstream for each.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changedump-manifest"
modified="2013-01-03T21:00:00Z"/>
<url>
<loc>http://example.com/res1.html</loc>
<lastmod>2013-01-01T05:00:00Z</lastmod>
<rs:md change="created"
hash="md5:1c1b0e264fa9b7e1e9aa6f9db8d6362b"
length="4339"
type="text/html"
path="/changes/res1.html"/>
</url>
<url>
<loc>http://example.com/res2.pdf</loc>
<lastmod>2013-01-01T09:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:f906610c3d4aa745cb2b986f25b37c5a
sha-256:f138185cddef488264a0323aee56e7647e89cd7a4d6e45ba28b3be26234a6d09"
length="38297"
type="application/pdf"
path="/changes/res2.pdf"/>
</url>
<url>
<loc>http://example.com/res3.tiff</loc>
<lastmod>2013-01-02T11:00:00Z</lastmod>
<rs:md change="deleted"/>
</url>
<url>
<loc>http://example.com/res1.html</loc>
<lastmod>2013-01-02T14:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:0988647082c8bc51778894a48ec3b576"
length="5426"
type="text/html"
path="/changes/res1-v2.html"/>
</url>
</urlset>
Example 7.2: A Change Dump Manifest
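Because the entries are in chronological order, a Destination can replay them one by one against its local copy. The following Python sketch does so for the four changes of Example 7.2, using a dictionary as a stand-in for local storage and a second dictionary as a stand-in for the unpacked ZIP contents. The function name, the data layout, the placeholder bitstream contents, and the pre-existing copy of res3.tiff are assumptions made for illustration.

```python
def apply_changes(store, entries, bitstreams):
    """Replay manifest entries in order; store maps URI to bitstream bytes."""
    for entry in entries:
        if entry["change"] == "deleted":
            store.pop(entry["loc"], None)   # nothing to unpack for a deletion
        else:                               # created or updated
            store[entry["loc"]] = bitstreams[entry["path"]]
    return store


# The four entries of Example 7.2, reduced to the fields used here.
ENTRIES = [
    {"loc": "http://example.com/res1.html", "change": "created",
     "path": "/changes/res1.html"},
    {"loc": "http://example.com/res2.pdf", "change": "updated",
     "path": "/changes/res2.pdf"},
    {"loc": "http://example.com/res3.tiff", "change": "deleted"},
    {"loc": "http://example.com/res1.html", "change": "updated",
     "path": "/changes/res1-v2.html"},
]

# Placeholder contents standing in for the unpacked package.
BITSTREAMS = {
    "/changes/res1.html": b"res1 v1",
    "/changes/res2.pdf": b"res2",
    "/changes/res1-v2.html": b"res1 v2",
}

local = {"http://example.com/res3.tiff": b"old tiff"}
apply_changes(local, ENTRIES, BITSTREAMS)
print(sorted(local))
```

After the replay, the local copy holds the second version of res1.html, because the later entry overwrites the earlier one.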
A Change Dump Manifest is based on the <urlset> format, which means it cannot contain more than 50,000 <url>
elements. If a Source wishes to package more bitstreams, it must implement a Change Dump Manifest Index.
The Change Dump Manifest Index file must be named manifest.xml and points to Change Dump Manifests included within the package.
A Change Dump Manifest Index is based on the <sitemapindex> document format introduced by the Sitemap protocol.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of changedump-manifest and it must have a modified attribute that conveys the Change Dump Manifest Index's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Change Dump Manifest. This element does not have attributes, but uses child elements to convey information about the Change Dump Manifest. The <sitemap> element has the following child elements:
The <loc> child element provides the location of the Change Dump Manifest within the package.
A <lastmod> child element with semantics as described in Section 3.
The <rs:md> child element must have a path attribute to convey the location of the Change Dump Manifest within the package.
Destinations can determine whether a package contains a Change Dump Manifest or a Change Dump Manifest Index by inspecting the root element of the
manifest.xml file. Example 7.3 shows a Change Dump Manifest Index pointing to three Change Dump Manifests.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changedump-manifest"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>/manifests/part3.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part3.xml"/>
</sitemap>
<sitemap>
<loc>/manifests/part2.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part2.xml"/>
</sitemap>
<sitemap>
<loc>/manifests/part1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
<rs:md path="/manifests/part1.xml"/>
</sitemap>
</sitemapindex>
Example 7.3: A Change Dump Manifest Index
In order to facilitate alternative approaches to obtain content for a resource that is subject to synchronization, a Source may provide links from that resource to related resources. The following cases are considered, and detailed in the remainder of this section:
As usual, the <loc> child element of <url> conveys the URI of the
resource that is subject to synchronization. The information about a related resource is provided in a
<rs:ln> child element of <url>. The possible attributes for
<rs:ln> are described in Section 3.
In case a Destination is not able to adequately interpret the information conveyed in
a <rs:ln> element, it should refrain from accessing the related resource and rather
use the URI provided in <loc> to retrieve the resource.
In order to reduce the load on its primary access mechanism, a Source may convey one or more mirror locations for a resource.
A <rs:ln> element
is introduced to express each mirror location for the resource. This element has the following attributes:
A rel attribute with a value of duplicate.
An href attribute that conveys the URI of the mirrored resource.
A pri attribute to express a prioritization among multiple mirror locations, each expressed by means of an individual <rs:ln> element. The use of pri is detailed in Section 3.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
Example 8.1 shows how a Source conveys information about prioritized mirror locations for a resource.
Since the three locations conveyed by <rs:ln> elements point to duplicates
of the resource specified in
<loc>, the values for each of the attributes of <rs:md> are expected
to be identical for the resource and its mirrors. Hence, they should be omitted from the <rs:ln> elements.
The last <rs:ln> element points to a mirror location where the resource is accessible
via a protocol other than HTTP as can be seen from the URI scheme. Even though the resources are duplicates, their last modified datetimes may vary.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T17:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
<rs:ln rel="duplicate"
pri="1"
href="http://mirror1.example.com/res1"
modified="2013-01-02T18:00:00Z"/>
<rs:ln rel="duplicate"
pri="2"
href="http://mirror2.example.com/res1"
modified="2013-01-02T18:30:00Z"/>
<rs:ln rel="duplicate"
pri="3"
href="gsiftp://gridftp.example.com/res1"
modified="2013-01-02T18:30:00Z"/>
</url>
</urlset>
Example 8.1: Mirrored content
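The following Python sketch illustrates how a Destination might order the locations from which to retrieve a mirrored resource. It is a sketch under stated assumptions, not a normative algorithm: the function name candidate_uris and the supported_schemes parameter are hypothetical, a lower pri value is assumed to mean a higher priority, and mirrors are assumed to be tried before the URI in <loc>. Links that the Destination cannot interpret (here, unsupported URI schemes) are skipped, so that retrieval falls back to <loc>.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def candidate_uris(url_elem, supported_schemes=("http", "https")):
    """Return URIs to try for one <url> entry: usable mirrors first, then <loc>."""
    mirrors = []
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") != "duplicate":
            continue
        href = ln.get("href")
        if not href or urlsplit(href).scheme not in supported_schemes:
            continue  # link cannot be interpreted; ignore it
        try:
            pri = int(ln.get("pri", ""))
        except ValueError:
            pri = float("inf")  # mirrors without a usable pri come last
        mirrors.append((pri, href))
    mirrors.sort(key=lambda item: item[0])  # assumption: lower pri is preferred
    loc = url_elem.findtext("sm:loc", namespaces=NS)
    return [href for _, href in mirrors] + [loc]
```

Applied to the <url> entry of Example 8.1 with the default schemes, the GridFTP mirror is skipped and the two HTTP mirrors precede the URI from <loc>.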
A resource may have multiple representations available from different URIs.
A resource may, for example, be identified by a generic URI such as http://example.com/res1. After performing content
negotiation with the server, a client may, for example, obtain the resource's HTML representation available from the specific URI
http://example.com/res1.html. Another client may ask for and retrieve the PDF representation of the
resource from the specific URI http://example.com/res1.pdf.
The representation a client obtains can depend on, among other things, its
preferences in terms of media type and language, its geographical location, and its device type.
A Source can express that a resource is subject to synchronization by conveying its generic URI in <loc>. In this case, one <rs:ln> element is introduced for each alternate representation that the Source wants to advertise. This element has the following attributes:
A rel attribute with a value of alternate.
An href attribute that conveys the specific URI of the alternate representation of the resource.
A type attribute that conveys the media type of the alternate representation.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
Cases exist in which there is no generic URI for a resource, only specific URIs.
This may occur, for example, when a resource has different representations available for different devices.
In this case the URI in <loc> will be a specific URI, and
<rs:ln> elements with an alternate relation type are still used to refer
to alternate representations available from other specific URIs.
Example 8.2 shows how to promote a generic URI in <loc>
while also pointing to alternate representations available from specific URIs, for example, through content
negotiation.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="updated"/>
<rs:ln rel="alternate"
href="http://example.com/res1.html"
modified="2013-01-02T18:00:00Z"
type="text/html"/>
<rs:ln rel="alternate"
href="http://example.com/res1.pdf"
modified="2013-01-02T18:00:00Z"
type="application/pdf"/>
</url>
</urlset>
Example 8.2: Generic URI and alternates with specific URIs
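As an illustration of how a Destination might use such links, the Python sketch below selects the specific URI of a preferred media type and falls back to the generic URI in <loc> when no advertised alternate matches. The function name pick_representation and the preferred_types parameter are hypothetical and only serve this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def pick_representation(url_elem, preferred_types):
    """Return the URI of the first preferred alternate, else the URI in <loc>."""
    alternates = {}
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == "alternate" and ln.get("type") and ln.get("href"):
            # Keep the first link seen for each media type.
            alternates.setdefault(ln.get("type"), ln.get("href"))
    for media_type in preferred_types:
        if media_type in alternates:
            return alternates[media_type]
    return url_elem.findtext("sm:loc", namespaces=NS)
```

For the entry in Example 8.2, a preference list of application/pdf followed by text/html would yield http://example.com/res1.pdf.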
In cases where a particular representation is considered the subject of synchronization, its specific URI is provided in <loc>. The associated generic URI, if one exists, can be provided using a <rs:ln> element. This element has the following attributes:
A rel attribute with a value of canonical.
An href attribute that conveys the generic URI associated with the specific URI provided in <loc>.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
This approach might be most appropriate for Resource Dump Manifests and Change Dump Manifests that describe bitstreams contained in a ZIP file.
Example 8.3 shows a Source promoting a specific URI in <loc>
while also pointing to the
resource's generic URI by means of an <rs:ln> element.
Metadata pertaining to the representation available from that specific URI is
conveyed by means of
attributes of the <rs:md> element.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res1.html</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"/>
<rs:ln rel="canonical"
href="http://example.com/res1"
modified="2013-01-02T18:00:00Z"/>
</url>
</urlset>
Example 8.3: Specific URI and alternate with generic URI
In order to increase the efficiency of updating a resource, a Source may make available a description of the changes that the resource underwent, in addition to the entire changed resource. Such an approach may be especially attractive when changes are frequent and minor, or when large resources are concerned. It does, however, require an unambiguous description of the changes, such that a Destination can construct the most recent version of the resource by patching the previous version with the described changes.
A Source can express that it makes a description of resource changes available
by providing the URI of the resource in <loc>, as usual, and by
introducing a <rs:ln> element with the following attributes:
A rel attribute with a value of http://www.openarchives.org/rs/terms/patch.
An href attribute that conveys the URI of the description of the resource changes.
A type attribute that conveys the media type of the change description. That media type must allow the described changes to be applied unambiguously to the previous version of the resource in order to construct the current one.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
Example 8.4 shows a Source that conveys the changes that a JSON resource underwent, expressed using the application/json-patch media type introduced in JSON Patch. It also shows the Source conveying changes to a large TIFF file using an experimental media type that may, for example, be described in a community specification. A Destination that does not understand the media type should ignore the description of changes and use the URI in <loc> to obtain the most recent version of the resource.
Another example of a well-specified media type for expressing changes to XML documents is
application/patch-ops-error+xml, as specified in RFC 5261.
Expressing resource changes in this manner is only applicable to Change Lists (as in Example 8.4)
and Change Dumps.
When doing so for a Change Dump, the <rs:ln> element in the Change Dump Manifest entry that points to the change description
must have the path attribute, which locates the change description
within the content package.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res4</loc>
<lastmod>2013-01-02T17:00:00Z</lastmod>
<rs:md change="updated"
hash="sha-256:f4OxZX_x_DFGFDgghgdfb6rtSx-iosjf6735432nklj"
length="56778"
type="application/json"/>
<rs:ln rel="http://www.openarchives.org/rs/terms/patch"
href="http://example.com/res4-json-patch"
modified="2013-01-02T17:00:00Z"
hash="sha-256:y66dER_t_HWEIKpesdkeb7rtSc-ippjf9823742opld"
length="73"
type="application/json-patch"/>
</url>
<url>
<loc>http://example.com/res5-full.tiff</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="updated"
hash="sha-256:f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk"
length="9788456778"
type="image/tiff"/>
<rs:ln rel="http://www.openarchives.org/rs/terms/patch"
href="http://example.com/res5-diff"
modified="2013-01-02T18:00:00Z"
hash="sha-256:h986gT_t_87HTkjHYE76G558hY-jdfgy76t55sadJUYT"
length="4533"
type="application/x-tiff-diff"/>
</url>
</urlset>
Example 8.4: A Change List with links to documents that detail how to patch resources
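The fallback behavior described for patch links can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the function name choose_update and the understood_patch_types parameter are hypothetical. The sketch returns the patch URI only when the Destination understands the patch media type, and otherwise returns the URI in <loc> so that the full resource is retrieved.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
PATCH_REL = "http://www.openarchives.org/rs/terms/patch"


def choose_update(url_elem, understood_patch_types):
    """Return ("patch", uri) if a usable patch link exists, else ("full", loc)."""
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == PATCH_REL and ln.get("type") in understood_patch_types:
            return ("patch", ln.get("href"))
    # Unknown or missing patch media type: ignore the link and fetch the resource.
    return ("full", url_elem.findtext("sm:loc", namespaces=NS))
```

Applying a patch additionally presupposes that the Destination holds the previous version of the resource; that check is outside the scope of this sketch.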
Cases exist where both resources and metadata about those resources must be synchronized.
From the ResourceSync perspective, both the resource and the metadata about it are regarded as resources
with distinct URIs that are subject to synchronization. As usual, each gets its distinct
<url> block and each URI is conveyed in a <loc> child element of the
respective block. If required, the inter-relationship between both resources is expressed by means of
a <rs:ln> element with appropriate relation types added to each block.
The <rs:ln> element has the following attributes:
A rel attribute. When pointing from a resource to metadata that describes it, its value is describedby; when pointing from metadata to the resource described by the metadata, its value is describes.
An href attribute. When pointing from a resource to metadata that describes it, its value is the URI of the metadata resource; when pointing from metadata to the resource described by it, the value is the URI of the described resource.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
Example 8.5 shows how a Source can express this inter-relationship between the two resources.
Since the <rs:ln> child element can contain all optional attributes introduced in
Section 3, a Destination can, for example by analyzing the last modification time, determine whether it needs to
synchronize with any of the linked resources.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res2.pdf</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="updated"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="application/pdf"/>
<rs:ln rel="describedby"
href="http://example.com/res2_dublin-core_metadata.xml"
modified="2013-01-02T18:00:00Z"
type="application/xml"/>
</url>
<url>
<loc>http://example.com/res2_dublin-core_metadata.xml</loc>
<lastmod>2013-01-02T19:00:00Z</lastmod>
<rs:md change="updated"
type="application/xml"/>
<rs:ln rel="describes"
href="http://example.com/res2.pdf"
modified="2013-01-02T18:00:00Z"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="application/pdf"/>
</url>
</urlset>
Example 8.5: Linking between a resource and metadata about a resource in a Change List
A Source may provide access to prior versions of a resource to allow Destinations to obtain a historical perspective, rather than just remaining synchronized with the most recent version. The approach to do so leverages a common resource versioning paradigm that consists of:
When communicating about the resource, its time-generic URI is provided in <loc>.
A first approach consists of conveying the time-specific URI of the resource for the moment the communication
about it takes place. This is achieved by introducing a <rs:ln> element with
the following attributes:
A rel attribute with a value of memento.
An href attribute that conveys the time-specific URI of the resource at the moment of communication. This URI allows a Destination to obtain that specific version during a catch-up operation, for example because it had been offline, even if the resource has meanwhile changed again.
Other attributes as described for the <rs:ln> child element of <url> in Section 3. It is recommended to include the last modification and fixity information for both the time-generic and the time-specific URI, as doing so unambiguously conveys the tight temporal relationship between both.
A second approach consists of pointing to a TimeGate associated with the time-generic resource. A TimeGate supports negotiation in the datetime dimension, as introduced in the Memento protocol [Memento Internet Draft], to obtain a version of the resource as it existed at a specified moment in time. This allows a Destination to obtain the version as it existed at the moment of communication about the resource by using the <lastmod> value for datetime negotiation, but it also allows obtaining other versions by using different datetime values. A pointer to a TimeGate is introduced by using a <rs:ln> element with the following attributes:
A rel attribute with a value of timegate.
An href attribute that conveys the URI of the TimeGate associated with the time-generic resource.
The other attributes described for the <rs:ln> child element of <url> in Section 3 should not be used, as they are meaningless for TimeGates.
Example 8.6 shows a Change List with a link to a prior version of a resource as well as a link to a TimeGate.
Note that the values of the hash, length, and type attributes are identical between the
<rs:md> child element and the <rs:ln> child element that points to the prior version.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res1</loc>
<lastmod>2013-01-03T07:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"
change="updated"/>
<rs:ln rel="memento"
href="http://example.com/20130103070000/res1"
modified="2013-01-03T07:00:00Z"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
<rs:ln rel="timegate"
href="http://example.com/timegate/http://example.com/res1"/>
</url>
</urlset>
Example 8.6: Links to a resource version and a Memento TimeGate
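Datetime negotiation with a TimeGate uses the Accept-Datetime request header of the Memento protocol, whose value is an HTTP-style date. The Python sketch below converts a <lastmod> value into such a header. The function name timegate_headers is hypothetical, and the sketch assumes that <lastmod> uses the UTC form shown in the examples (YYYY-MM-DDThh:mm:ssZ); other W3C datetime forms would need additional parsing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_headers(lastmod):
    """Build request headers for datetime negotiation from a <lastmod> value."""
    # Assumption: lastmod is expressed in UTC with a trailing "Z".
    moment = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ")
    moment = moment.replace(tzinfo=timezone.utc)
    return {"Accept-Datetime": format_datetime(moment, usegmt=True)}
```

A Destination would send these headers in a GET or HEAD request to the URI given in the timegate link, for example http://example.com/timegate/http://example.com/res1 in Example 8.6.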
A special kind of Destination, henceforth called an Aggregator,
may retrieve content from a Source, republish it, and in its turn act as a Source for
the republished content. In such an Aggregator scenario, it may be important for a Destination that
synchronizes with the Aggregator to
understand the provenance of the content and to be able to verify its accuracy with the original Source from which the
Aggregator obtained content. When communicating about a republished resource, the Aggregator can
provide such provenance
information by introducing a <rs:ln> element with the following attributes:
A rel attribute with a value of via.
An href attribute that conveys the URI of the resource at the Source from which the Aggregator obtained the content.
Other attributes as described for the <rs:ln> child element of <url> in Section 3.
If a chain of such aggregations takes place, existing via links should be maintained and additional ones should be added in order to allow tracing the entire provenance chain. This is shown in Examples 8.7, 8.8, and 8.9, which illustrate a process that starts with an original Source that publishes a Change List. A first Aggregator consumes that Change List, obtains the changed resource, and integrates it into its own collection. It then publishes its own Change List that includes the description of the change and also credits the Source where the change originated. The cycle repeats as a second Aggregator consumes the Change List from the first. Note the datetimes in all three examples.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T11:00:00Z"/>
<url>
<loc>http://original.example.com/res1.html</loc>
<lastmod>2013-01-03T07:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"
change="updated"/>
</url>
</urlset>
Example 8.7: An original Source publishes
The example below shows a primary Aggregator's Change List that refers to the original Source's resource.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://aggregator1.example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T21:00:00Z"/>
<url>
<loc>http://aggregator1.example.com/res1.html</loc>
<lastmod>2013-01-03T20:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"
change="updated"/>
<rs:ln rel="via"
href="http://original.example.com/res1.html"
modified="2013-01-03T07:00:00Z"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
</urlset>
Example 8.8: A primary aggregator republishes
A second Aggregator obtains the changed resource as it consumes the Change List of the primary Aggregator. It then republishes the change in its own Change List, adding
yet another <rs:ln> child element that points to the resource at the primary Aggregator, which is the Source from its perspective.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://aggregator2.example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-04T15:00:00Z"/>
<url>
<loc>http://aggregator2.example.com/res1.html</loc>
<lastmod>2013-01-04T09:00:00Z</lastmod>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"
change="updated"/>
<rs:ln rel="via"
href="http://original.example.com/res1.html"
modified="2013-01-03T07:00:00Z"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
<rs:ln rel="via"
href="http://aggregator1.example.com/res1.html"
modified="2013-01-03T20:00:00Z"
hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="text/html"/>
</url>
</urlset>
Example 8.9: A second aggregator republishes
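A Destination that wants to trace the provenance chain could reconstruct it from the via links, as in the Python sketch below. The function name provenance_chain is hypothetical. Ordering the links by their modified attribute is an assumption of this sketch, not a rule stated in the specification; it relies on all datetimes using the same UTC form, so that they sort correctly as strings.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def provenance_chain(url_elem):
    """Return URIs from the earliest known Source to the republished resource."""
    vias = []
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == "via" and ln.get("href"):
            vias.append((ln.get("modified", ""), ln.get("href")))
    vias.sort(key=lambda item: item[0])  # assumption: older modified = earlier hop
    loc = url_elem.findtext("sm:loc", namespaces=NS)
    return [href for _, href in vias] + [loc]
```

For the entry in Example 8.9 this yields the URI at the original Source, then the URI at the first Aggregator, then the URI at the second Aggregator.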
In order for a Source to provide historical data, it can implement an Archive.
An Archive is implemented based on the <sitemapindex> format.
It points to Sitemaps representing the implementation of the corresponding capability.
The Resource Dump, Change List, and Change Dump capabilities each can have an archive.
As part of the regular maintenance of its data, a Source might maintain old Resource Dumps. For a Destination that wishes to compare or archive versions of the data over time, access to these Resource Dumps allows the packaged historical data to be downloaded all at once, rather than requiring the Source to support access to the individual resource versions, and for the Destination to collect them one at a time.
As shown in Figure 1, a Source can provide a Resource Dump Archive. It not only points to the current Resource Dump but also to previously created and published Resource Dumps. Each of these Resource Dumps represents a snapshot of the Source's data at a certain point in time - the creation time of the Resource Dump.
A Resource Dump Archive is based on the <sitemapindex> document format introduced by the Sitemap protocol.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of resourcedump and it must have a modified attribute that conveys the Resource Dump Archive's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Resource Dump. This element does not have attributes, but uses child elements to convey information about the Resource Dump. The <sitemap> element has the following child elements:
The <loc> child element provides the URI of the Resource Dump.
A <lastmod> child element with semantics as described in Section 3.
The Destination can determine whether it has reached a Resource Dump or a Resource Dump Archive based on the root element, either
<urlset> or <sitemapindex> respectively.
Example 9.1 shows a Resource Dump Archive that points to the current Resource Dump http://example.com/resourcedump1.xml
and two Resource Dumps created in the two previous months. The Resource Dump documents referred to in Example 9.1 will have a
navigational top-level <rs:ln> element with the relation type up (as seen in Example 4.3) that points
to the Resource Dump Archive.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcedump"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>http://example.com/resourcedump3.xml</loc>
<lastmod>2012-11-03T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/resourcedump2.xml</loc>
<lastmod>2012-12-03T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/resourcedump1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
</sitemapindex>
Example 9.1: A Resource Dump Archive
A Change List describes the changes in a Source's resources over a certain period of time. The Source determines the length of that time interval. If a Source wishes to offer Change Lists of prior temporal intervals, it can provide a Change List Archive. A Change List Archive refers to individual Change Lists as depicted in Figure 1.
A Change List Archive is based on the <sitemapindex> document format introduced by the Sitemap protocol.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of changelist and it must have a modified attribute that conveys the Change List Archive's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Change List. This element does not have attributes, but uses child elements to convey information about the Change List. The <sitemap> element has the following child elements:
The <loc> child element provides the URI of the Change List.
A <lastmod> child element with semantics as described in Section 3.
The Destination can determine whether it has reached a Change List or a Change List Archive based on the root element, either
<urlset> or <sitemapindex> respectively.
All pointers in a Change List Archive must be in chronological order. The associated datetime can be used by Destinations
to determine if new changes have to be processed.
Example 9.2 shows a Change List Archive that points to three Change Lists created on consecutive days.
To ease navigation for Destinations, the Change Lists referred to in the example will have a top-level <rs:ln> element with the relation
type up that points to the Change List Archive.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>http://example.com/changelist3.xml</loc>
<lastmod>2013-01-01T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/changelist2.xml</loc>
<lastmod>2013-01-02T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/changelist1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
</sitemapindex>
Example 9.2: A Change List Archive
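Because the entries of a Change List Archive carry a datetime, a Destination can select the Change Lists it has not yet processed. The Python sketch below illustrates this; the function name unprocessed_entries and the last_processed parameter are hypothetical, and datetimes are assumed to use the UTC form shown in the examples. The same approach would apply to a Change Dump Archive.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
UTC_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumption: datetimes are UTC with a trailing "Z"


def unprocessed_entries(archive_xml, last_processed):
    """Return URIs of archive entries whose <lastmod> is later than last_processed."""
    cutoff = datetime.strptime(last_processed, UTC_FORMAT)
    root = ET.fromstring(archive_xml)
    pending = []
    for entry in root.findall("sm:sitemap", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if loc and lastmod and datetime.strptime(lastmod.strip(), UTC_FORMAT) > cutoff:
            pending.append(loc.strip())
    return pending  # document order, which the archive keeps chronological
```

With the archive of Example 9.2 and a last processed datetime of 2013-01-01T09:00:00Z, the sketch returns the two Change Lists dated 2013-01-02 and 2013-01-03.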
If a Source decides to offer Change Dumps of prior temporal intervals, it can provide a Change Dump Archive. A Change Dump Archive, as seen in Figure 1, points to Change Dumps.
A Change Dump Archive is based on the <sitemapindex> document format introduced by the Sitemap protocol.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of changedump and it must have a modified attribute that conveys the Change Dump Archive's last modification time.
The <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type resourcesync (see Section 10).
One <sitemap> child element of <sitemapindex> per Change Dump. This element does not have attributes, but uses child elements to convey information about the Change Dump. The <sitemap> element has the following child elements:
The <loc> child element provides the URI of the Change Dump.
A <lastmod> child element with semantics as described in Section 3.
The Destination can determine whether it has reached a Change Dump or a Change Dump Archive based on the root element, either
<urlset> or <sitemapindex> respectively.
The pointers to Change Dumps must be in chronological order, and each must have an associated last modification time, in order to support Destinations in identifying
unprocessed Change Dumps.
Example 9.3 shows a Change Dump Archive that points to three Change Dumps that were created in consecutive weeks.
The referred Change Dumps will have a top-level <rs:ln> element with the relation type up that points to
the Change Dump Archive.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="resourcesync"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changedump"
modified="2013-01-03T09:00:00Z"/>
<sitemap>
<loc>http://example.com/changedump3.xml</loc>
<lastmod>2012-12-20T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/changedump2.xml</loc>
<lastmod>2012-12-27T09:00:00Z</lastmod>
</sitemap>
<sitemap>
<loc>http://example.com/changedump1.xml</loc>
<lastmod>2013-01-03T09:00:00Z</lastmod>
</sitemap>
</sitemapindex>
Example 9.3: A Change Dump Archive
In order to make use of the capabilities that a Source provides, it is first necessary to determine which capabilities are supported and the URIs
of the corresponding capability documents.
In the ResourceSync framework pointers to the component resources that provide the capabilities are included in a document
that can be made discoverable in various ways, including a well-known URI pattern, HTML or XHTML <link> elements from web pages,
or HTTP link headers from other resources that are to be synchronized.
A Capability List is a document that points to component resources that provide the Source's capabilities.
The four possible capabilities a Source can point to are: resourcelist, resourcedump, changelist, and changedump.
All values have previously been introduced in Example 4.1, Example 5.1, Example 6.1, and
Example 7.1. A Capability List can only contain one entry per capability.
A set of resources exposed via one capability must also be exposed via all other capabilities a Source offers. For example, a resource cannot be
exposed by a Resource List but be missing from the Change List when it undergoes a change. This ensures that a Destination does not have to consume
all capabilities to perform an Audit; an up-to-date Resource List, for example, is sufficient.
The Capability List is based on the <urlset> format.
It has the <urlset> root element and the following structure:
The <rs:md> child element of <urlset> must have a capability attribute with a value of capabilitylist and it must have a modified attribute that conveys the Capability List's last modification time.
A <rs:ln> child element of <urlset> with the relation type describedby points to a document that provides information about the Source offering the capabilities, and its resources, which are subject to synchronization.
One <url> child element of <urlset> per capability offered by the Source. This element does not have attributes, but uses child elements to convey information about the capabilities. The <url> element has the following child elements:
The <loc> child element provides the URI of the capability document.
The <rs:md> child element must have a capability attribute to convey the type of the linked capability.
The <lastmod> elements should be omitted from the Capability List unless the Source updates the Capability List every time it updates one of the capability documents referenced.
Example 10.1 shows a Capability List where the Source offers four capabilities: a Resource List, a Resource Dump, a Change List, and a Change Dump. A Destination cannot determine from the Capability List whether a Source provides, for example, a Resource List Index or a single Resource List. The capability document must be downloaded and parsed to make this determination.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="describedby"
href="http://example.com/dataset1/info_about_source.xml"/>
<rs:md capability="capabilitylist"
modified="2013-01-02T14:00:00Z"/>
<url>
<loc>http://example.com/dataset1/resourcelist.xml</loc>
<rs:md capability="resourcelist"/>
</url>
<url>
<loc>http://example.com/dataset1/resourcedump.xml</loc>
<rs:md capability="resourcedump"/>
</url>
<url>
<loc>http://example.com/dataset1/changelist.xml</loc>
<rs:md capability="changelist"/>
</url>
<url>
<loc>http://example.com/dataset1/changedump.xml</loc>
<rs:md capability="changedump"/>
</url>
</urlset>
Example 10.1: A Capability List
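As a sketch of how a Destination might read a Capability List, the Python function below maps each capability name to the URI of its capability document. The function name capabilities is hypothetical. Since a Capability List contains at most one entry per capability, a plain dictionary is sufficient; as noted above, the sketch cannot tell whether a URI leads to a single document or to an index.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def capabilities(capability_list_xml):
    """Map capability names (e.g. "changelist") to capability document URIs."""
    root = ET.fromstring(capability_list_xml)
    found = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        if loc and md is not None and md.get("capability"):
            found[md.get("capability")] = loc.strip()
    return found
```

For Example 10.1 the result contains four keys: resourcelist, resourcedump, changelist, and changedump.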
If a Source wishes to offer more than one Capability List, for example, to split up its resources into different sets, it should implement a Capability List Index. A Capability List Index refers to individual Capability Lists. A Source may decide to offer distinct Capability Lists, for example, for content of different MIME-Types, for content tailored towards different Destinations, or with different access control mechanisms.
A Capability List Index is based on the <sitemapindex> format.
It has the <sitemapindex> root element and the following structure:
The <rs:md> child element of <sitemapindex> must have a capability attribute with a value of capabilitylist and it must have a modified attribute that conveys the Capability List Index's last modification time.
A <rs:ln> child element of <sitemapindex> with the relation type describedby points to a document that provides information about the Source and its Capability Lists.
One <sitemap> child element of <sitemapindex> per Capability List offered by the Source. This element does not have attributes, but uses child elements to convey information about the Capability Lists. The <sitemap> element has the following child elements:
The <loc> child element provides the URI of the Capability List.
The <lastmod> child elements should be omitted from the Capability List Index unless the Source updates the Capability List Index every time it updates one of the Capability Lists referenced.
Example 10.2 shows a Capability List Index with pointers to three different Capability Lists.
The referred Capability Lists will have a top level <rs:ln> element with the relation type up that points to
the Capability List Index.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_source.xml"/>
  <rs:md capability="capabilitylist"
         modified="2013-01-02T14:00:00Z"/>
  <sitemap>
    <loc>http://example.com/dataset1/capabilitylist.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/dataset2/capabilitylist.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/dataset3/capabilitylist.xml</loc>
  </sitemap>
</sitemapindex>
Example 10.2: A Capability List Index
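The structure in Example 10.2 could be processed roughly as follows. This is an illustrative sketch rather than a reference implementation: the function name `capability_list_uris` is hypothetical, and the checks simply mirror the requirements stated above (a `<sitemapindex>` root, and an `<rs:md>` element with capability `capabilitylist` and a `modified` attribute).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS, "rs": "http://www.openarchives.org/rs/terms/"}

# The document from Example 10.2.
CAPABILITY_LIST_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_source.xml"/>
  <rs:md capability="capabilitylist"
         modified="2013-01-02T14:00:00Z"/>
  <sitemap>
    <loc>http://example.com/dataset1/capabilitylist.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/dataset2/capabilitylist.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/dataset3/capabilitylist.xml</loc>
  </sitemap>
</sitemapindex>
"""


def capability_list_uris(xml_text):
    """Return the URIs of the Capability Lists referenced by an index."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "{%s}sitemapindex" % SITEMAP_NS:
        raise ValueError("root element is not <sitemapindex>")
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "capabilitylist":
        raise ValueError("missing <rs:md capability=\"capabilitylist\">")
    if md.get("modified") is None:
        raise ValueError("<rs:md> lacks the modified attribute")
    # One <sitemap> child per Capability List; each carries a <loc>.
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]


if __name__ == "__main__":
    for uri in capability_list_uris(CAPABILITY_LIST_INDEX):
        print(uri)
```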
This section describes approaches to support discovery of Capability List or Capability List Index documents.
The well-known URI [RFC 5785] /.well-known/resourcesync is defined for the ResourceSync framework.
When dereferenced, the representation obtained will be either a Capability List document or a Capability List Index, and thus provide a mechanism for
Destinations to discover the capabilities offered by the Source without any prior knowledge.
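As a rough sketch of how a Destination might use the well-known URI, the code below derives it from any URI on the Source's host and then decides, from the root element alone, which kind of document came back. The function names `well_known_uri` and `document_kind` and the sample URI are hypothetical, and no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def well_known_uri(source_uri):
    """Build the /.well-known/resourcesync URI for the host of source_uri."""
    parts = urlsplit(source_uri)
    # Well-known URIs hang off the root of the host, so the original
    # path, query, and fragment are dropped.
    return urlunsplit(
        (parts.scheme, parts.netloc, "/.well-known/resourcesync", "", "")
    )


def document_kind(xml_bytes):
    """Classify the dereferenced document by its root element only."""
    root = ET.fromstring(xml_bytes)
    if root.tag == "{%s}urlset" % SITEMAP_NS:
        return "Capability List"
    if root.tag == "{%s}sitemapindex" % SITEMAP_NS:
        return "Capability List Index"
    raise ValueError("unexpected root element: %s" % root.tag)


if __name__ == "__main__":
    print(well_known_uri("http://example.com/dataset1/resource1.html?view=full"))
```

A Destination could pass the bytes obtained from, say, `urllib.request.urlopen(well_known_uri(...)).read()` to `document_kind`. A more careful client would also verify the `<rs:md>` capability attribute rather than rely on the root element.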
A Capability List can be made discoverable by means of an X/HTML link.
In order to do so, a <link> element is introduced in the <head> of the HTML page
that points to a Capability List.
This <link> element must have the rel attribute with the value resourcesync.
The Capability List that is made discoverable in this way must pertain to
the resource that provides the link. This means that the resource must be covered by the capabilities listed in the linked Capability List.
In case the Source also provides a Capability List Index, it should be made discoverable from this Capability List by means of a
<rs:ln> child element of the <urlset> element that has an up relation type.
Example 10.3 shows the structure of a web page that contains a link to a Capability List.
<html>
  <head>
    <link rel="resourcesync"
          href="http://www.example.com/datasets/capabilitylist.xml"/>
    ...
  </head>
  <body>...</body>
</html>
Example 10.3: X/HTML link discovery syntax
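A Destination could look for such a link with a small HTML parser, sketched below with Python's standard `html.parser` module. The class name `ResourceSyncLinkFinder`, the function `find_capability_lists`, and the page URI used to resolve relative links are hypothetical. For brevity the sketch does not check that the `<link>` element actually sits inside `<head>`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceSyncLinkFinder(HTMLParser):
    """Collect href values of <link> elements whose rel includes resourcesync."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "resourcesync" in rels and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_capability_lists(html_text, page_uri):
    """Return absolute Capability List URIs advertised by the page."""
    finder = ResourceSyncLinkFinder()
    finder.feed(html_text)
    finder.close()
    return [urljoin(page_uri, href) for href in finder.hrefs]


# The page from Example 10.3, with the elided parts left out.
PAGE = """<html>
  <head>
    <link rel="resourcesync"
          href="http://www.example.com/datasets/capabilitylist.xml"/>
  </head>
  <body>...</body>
</html>"""

if __name__ == "__main__":
    print(find_capability_lists(PAGE, "http://www.example.com/page.html"))
```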
A Capability List can be made discoverable by means of an HTTP Link header that can be included with a representation of a resource of any content-type.
In order to do so, an entry in the HTTP Link header is introduced
that has as Target IRI the URI of the Capability List and as relation type resourcesync.
The Capability List that is made discoverable in this way must pertain to
the resource that provides the link. This means that the resource must be covered by the capabilities listed in the linked Capability List.
In case the Source also provides a Capability List Index, it should be made discoverable from this Capability List by means of a
<rs:ln> child element of the <urlset> element that has an up relation type.
Example 10.4 contains part of an HTTP response header. It includes an HTTP Link header with the
relation type resourcesync, which makes discoverable the Capability List that pertains to the resource providing the link.
HTTP/1.1 200 OK
Date: Thu, 21 Jan 2010 00:02:12 GMT
Server: Apache
Link: <http://www.example.com/datasets/capabilitylist.xml>;
      rel="resourcesync"
...
Example 10.4: HTTP link discovery syntax
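Extracting the Capability List URI from a Link header value might look like the sketch below. It is a deliberately simplified parser built on regular expressions, not a full implementation of the Link header grammar: the function name `resourcesync_targets` is hypothetical, and the code assumes that target URIs contain no `>` and that quoted parameter values contain no escaped quotes.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK = re.compile(
    r'<([^>]*)>((?:\s*;\s*[^\s=;,]+(?:\s*=\s*(?:"[^"]*"|[^\s";,]*))?)*)'
)
# One parameter: ;name="quoted value" or ;name=token
_PARAM = re.compile(r';\s*([^\s=;,]+)(?:\s*=\s*(?:"([^"]*)"|([^\s";,]*)))?')


def resourcesync_targets(link_header_value):
    """Return target URIs whose rel parameter includes resourcesync."""
    targets = []
    for link in _LINK.finditer(link_header_value):
        uri, params = link.group(1), link.group(2)
        for param in _PARAM.finditer(params):
            name = param.group(1).lower()
            if param.group(2) is not None:
                value = param.group(2)
            else:
                value = param.group(3) or ""
            # rel may hold several space-separated relation types.
            if name == "rel" and "resourcesync" in value.lower().split():
                targets.append(uri)
                break
    return targets


# The Link header value from Example 10.4 (folded across two lines).
HEADER_VALUE = (
    '<http://www.example.com/datasets/capabilitylist.xml>;\n'
    ' rel="resourcesync"'
)

if __name__ == "__main__":
    print(resourcesync_targets(HEADER_VALUE))
```

Because a single Link header can carry several comma-separated links, the function returns a list rather than a single URI.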
| Capability Attribute Value | Section |
|---|---|
| <urlset> or <sitemapindex> | |
| <rs:md capability="..."> | |
| resourcelist | 4.1. Resource List; 4.2. Resource List Index |
| resourcedump | 5.1. Resource Dump; 9.1. Resource Dump Archive |
| resourcedump-manifest | 5.1.1. Resource Dump Manifest; 5.1.2. Resource Dump Manifest Index |
| changelist | 6.1. Change List; 6.2. Change List Archive |
| changedump | 7.1. Change Dump; 9.3. Change Dump Archive |
| changedump-manifest | 7.1.1. Change Dump Manifest; 7.1.2. Change Dump Manifest Index |
Table A.1: ResourceSync values for the capability attribute of the <rs:md> child element of the <urlset> or <sitemapindex> element
| Relation Type | Specification | Example |
|---|---|---|
| <urlset> | | |
| <rs:ln rel="..."> | | |
| resourcesync | This specification | 4.1 and following |
| up | RFC5988 | 4.3 |
| describedby | Protocol for Web Description Resources (POWDER): Description Resources | 10.1 |
| <url> | | |
| <rs:ln rel="..."> | | |
| duplicate | RFC6249 | 8.1 |
| alternate | HTML 5 | 8.2 |
| canonical | RFC6596 | 8.3 |
| http://www.openarchives.org/rs/terms/patch | This specification | 8.4 |
| describedby | Protocol for Web Description Resources (POWDER): Description Resources | 8.5 |
| describes | The 'describes' Link Relation Type | 8.5 |
| memento | Memento Internet Draft | 8.6 |
| timegate | Memento Internet Draft | 8.6 |
| via | RFC4287 | 8.8, 8.9 |
| <sitemapindex> | | |
| <rs:ln rel="..."> | | |
| describedby | Protocol for Web Description Resources (POWDER): Description Resources | 10.2 |
Table A.2: ResourceSync relation types for the <rs:ln> child element of the <urlset> or <sitemapindex> element and of the <rs:ln> child element of the <url> element
This specification is the collaborative work of NISO and the Open Archives Initiative. Funding for ResourceSync is provided by the Alfred P. Sloan Foundation. UK participation is supported by Jisc.
The names of individual contributors will be listed here when the final specification is released.
| Date | Editor | Description |
|---|---|---|
| 2012-08-13 | martin, herbert, simeon, bernhard | first alpha spec draft |
| 2013-02-01 | martin, herbert, rob, simeon | beta spec draft |
| 2013-02-06 | simeon, herbert, martin | typo fixes |

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.