Editors
The OAI Executive:
Carl Lagoze <lagoze@cs.cornell.edu>
-- Cornell University - Computer Science
Herbert Van de Sompel <herbertv@lanl.gov>
-- Los Alamos National Laboratory - Research
Library
From the OAI Technical Committee:
Michael Nelson
<m.l.nelson@larc.nasa.gov>
-- NASA - Langley Research Center
Simeon Warner
<simeon@cs.cornell.edu>
-- Cornell University - Computer Science
This document is one part of the Implementation Guidelines that accompany the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
The OAI-PMH is designed to provide a low barrier to implementation for repositories. As such, there are many constructs that are optional in the OAI-PMH. This document first lists the requirements for a minimal repository implementation and then describes best practices relevant for repository implementers and maintainers.
It is important to stress that there are many optional concepts in the OAI-PMH. The aim is to allow for high fidelity communication between repositories and harvesters when available and desirable. However, they remain optional for situations when these constructs are not needed or implementation would be too difficult. A list and discussion of optional features of the OAI-PMH is given below. Repository implementers may implement all, some or none of these features.
Dublin Core (DC) is the resource discovery lingua franca for metadata. Since all DC fields are optional and repeatable, most repositories should have no trouble creating at least a minimal mapping of their native metadata to unqualified DC. Repositories are not required to store their metadata in DC; DC is something that is frequently "converted to", rather than "stored in". Many repositories store their metadata in some other format, and dynamically convert to DC in response to harvester requests. Repositories may, of course, choose to store DC directly if it is desirable in a particular implementation.
Although repositories are strongly encouraged to expose richer, possibly community-specific metadata formats, there is no requirement to do so. Repository implementers should consider exporting DC the first and most important step toward OAI-PMH interoperability. The addition of facilities to export other formats may then be added at a later date.
<about> Containers
Records in OAI-PMH are particular disseminations of metadata from
items in the repository. <about> containers
allow the repository to include self-referential metadata about
the records themselves. While this may prove to be an invaluable
tool for encoding rights and provenance statements, it is not required.
It is recommended that implementers defer the use of
<about> containers if they are uncertain of
their use or ramifications.
Sets provide a method of exposing a partitioning of a repository's contents to harvesters. While this allows for sophisticated harvesting, not all harvesters will exploit this capability. Like non-DC metadata formats, sets are most likely to be useful within specific communities. It is recommended that implementers defer the implementation of sets until there is a particular community-specific situation or deployment scenario that needs sets. Implementation of sets is optional; repository implementers may choose not to implement sets (harvesters may also ignore sets).
The ability to send compressed responses to harvesters can be a significant performance optimization. However, this depends on: 1) harvesters being able to accept compressed responses, and 2) sufficiently lengthy responses are available to warrant the additional time required to compress and uncompress the responses by the repository and harvester, respectively. It is recommended that repositories defer the implementation of response compression unless they have large collections (e.g., several thousand records) and until the time when harvesters widely adopt the compression mechanism.
All repositories must support identity (uncompressed)
encoding and must reply with that unless some other recognized
encoding is parsed from the Accept-Encoding: header of
the request.
Flow control, through the use of incomplete list responses and
resumptionToken elements, is a powerful mechanism that allows a
repository to throttle incoming requests from harvesters.
However, depending on the repository, this can be one of the more
complex parts of the OAI-PMH to implement.
While there are no definite guidelines regarding the size of a list response
that should result in an incomplete-list responses, it is reasonable that
repositories with less than 1000 items reply with a single complete-list
responses (i.e. do not use resumptionToken elements).
datestamp and responseDate values
The OAI-PMH permits two datestamp granularities:
days and seconds. All datestamps must be expressed in UTC and therefore
the day-to-day transition occurs at midnight UTC and not, in general, at
the local midnight of the repository. It is expected that most repositories
will use the seconds granularity. However, repositories with particular
update schedules or based on existing software may choose to use day
granularity if that reflects the underlying processes.
Regardless of the granularity used, metadata must be available from
new or modified items during the period of the item's new
datestamp. If this is not the case then that update may
be missed by an incremental harvest. For example, if an item is added
to a repository with datestamp 2002-03-01
and but this shows up in an index created at
2002-04-01T02:00Z. Then then an incremental harvest
which runs at 2002-04-01T01:00Z will miss the update
from the previous day, and the follow-on harvest specifying
from=2002-04-01 will also miss the update.
The datestamp values used for OAI are for the purpose of
incremental harvesting. They should be changed to reflect any change
in the item which includes any change to any of the disseminations.
For example, even if the underlying master-record within a repository
is not changed, a change to the
way that some formats are generated on the fly should result in updates to
the datestamps of all items for which that format may be disseminated. One way
that this may be implemented internally when building a datestamp
index is:
if (internalItemDatestamp > disseminationInterfaceDatestamp) {
datestamp = internalItemDatestamp
} else {
datestamp = disseminationInterfaceDatestamp
}
Harvesters may use the responseDate to obtain the time
(expressed in UTC) from the repository's clock corresponding to the start
of a list request as a way to implement incremental harvesting.
By using the time returned by the repository, harvesters avoid problems
with clock skew. It is thus recommended that the responseDate
reflect the time of the repository's clock at the start of any database
query or search function necessary to answer the list request, rather than
when the output is written. This will ensure that, even if the query
takes some time, the next incremental harvest will catch any updates
which occurred during the query.
Sets provide optional support for selective harvesting. Repositories need not implement them (see Minimal Implementation: Sets) and harvesters may ignore any set structure that is exposed.
Use of the colon (:) character in setSpec values
implies a hierarchy of sets. Repositories that do not implement a
hierarchy of sets must not include colons in their setSpec
values.
Example ListSets response:
<ListSets ...>
<set>
<setSpec>A</setSpec>
<setName>set A</setName>
</set>
<set>
<setSpec>A:B</setSpec>
<setName>set A:B</setName>
</set>
<set>
<setSpec>B</setSpec>
<setName>set B</setName>
</set>
<set>
<setSpec>B:C</setSpec>
<setName>set B:C</setName>
</set>
<set>
<setSpec>B:D</setSpec>
<setName>set B:D</setName>
</set>
<set>
<setSpec>B:D:E</setSpec>
<setName>set B:D:E</setName>
</set>
</ListSets>
This response implies the following hierarchy of sets:
A A:B B B:C B:D B:D:EThe sets A:B and B are distinct, however
the repeated use of node name B is not recommended because
it may lead to confusion. If the header for item1 is:
<header>
<identifier>item1</identifier>
<setSpec>A:B</setSpec>
...
</header>
then item1 is organized in sets
A and A:B.
If the header for item2 is:
<header>
<identifier>item2</identifier>
<setSpec>A</setSpec>
<setSpec>B:D:E</setSpec>
...
</header>
then item2 is organized in sets
A, B, B:D and B:D:E.
|
Set descriptions may be included in the setDescription of the
ListSets response. If sets correspond to some notion of
"collections" then this allows collection description.
If the whole repository represents one collection then it may be more
appropriate to use the description container in the
Identify response to describe the collection.
Communities are able to develop their own collection description XML
schemas for use within the description container and
setDescription elements. If all that is desired is the ability to
include an unstructured textual description then it is recommended that
repositories use the Dublin Core description element.
Seven existing schemes are:
| Dublin Core | The 15 DCMES elements (unqualified) |
http://www.dublincore.org/documents/dces/ OAI-PMH: Dubin Core, oai_dc |
| Encoded Archival Description (EAD) | A standard for encoding archival finding aids. EAD instances used effectively for this purpose would probably be limited to information about the collection as a whole and significant high-level components. | http://www.loc.gov/ead/ |
The eprints schema |
A schema for describing eprint archives. | OAI-PMH Guidelines: eprints |
| RSLP collection description schema | An application of qualified Dublin Core built on an underlying model of collections and their descriptions. | http://www.ukoln.ac.uk/metadata/rslp/schema/ |
| UDDI/WSDL | The Universal Description Discovery and Integration specification provides a mechanism for describing services on the Internet. | http://www.uddi.org/ |
| MARC21 | Commonly used in library catalogs to describe entire collections in a single record. |
http://www.loc.gov/marc/ OAI-PMH Guidelines: marcxml
|
The branding schema |
A schema for including branding information: icon, link etc. |
OAI-PMH Guidelines: branding
|
Note that appropriate XML s may not yet exist for EAD, RSLP and UDDI. Also note that mappings between the DC, EAD, RSLP and MARC21 schemas are available.
The OAI community is encouraged to track the work of the DCMI Collection Description working group, who are planning to develop a simple collection description schema based on qualified DC.
resumptionToken Elements
The OAI-PMH does not specify the syntax or even suggest an implementation
strategy for resumptionToken elements. This is to allow maximum
flexibility for individual repositories to adapt this mechanism to their
own deployment profile.
Harvesters are required to treat resumptionToken values
as opaque data objects. A repository may choose to make the syntax of a
resumptionToken obvious, such as:
<resumptionToken>resultSet=157&nextRange=1001-2000</resumptionToken>
or may choose to encode the resumptionToken
so the syntax is not visible, such as:
<resumptionToken>9023A210CD007</resumptionToken>
The OAI-PMH was designed to provide a mechanism whereby a harvester can
incrementally harvest from a repository and not miss any changes
(additions, modifications, and deleted reported with deleted
status). The implementation of incomplete list responses and
resumptionToken elements should not change this.
Whether a change is harvested during a particular harvest or on the
subsequent harvest is probably not critically important, provided it
is not missed in the sequence of harvests.
For example, consider the case of an item modified during the sequence of
requests necessary to complete a ListRecords request. The
new datestamp will correspond to the time after the start,
but before the end, of the sequence of requests. It is
quite reasonable for this change to be picked up by the current harvest
or for it not to be picked up until the following harvest.
There are two main implementation strategies that a repository may
choose when implementing incomplete-list responses and
resumptionToken elements: to encode the state in the
resumptionToken; or to cache the result set.
These approaches are described in the next two sections.
resumptionToken
By encoding all state in the resumptionToken, a repository
can remain stateless. This implementation strategy trades the
management overhead required to cache results for possible increased
computation time in regenerating the state necessary for the next
response. Consider the following example:
http://an.oai.org/?verb=ListRecords&set=227&from=1999-02-03
A resumptionToken may be similar to (wrapped for clarity):
<resumptionToken>set=227&from=1999-02-03&until=2002-04-01
&range=751-1500&metadataPrefix=oai_dc</resumptionToken>
This example is for a ListRecords request and so the resumptionToken
includes the metadataPrefix as part of the state; the metadataPrefix
part would not be required for a ListIdentifiers resumptionToken.
Harvesters may omit either or both of the from and until
arguments to say `from the earliest datestamp' and/or `until the last datestamp'.
It is recommended that any resumptionToken constucted from a request
such as this have at least the part that corresponds to the until
argument filled in with the current repository datestamp. While this will not
avoid the incomplete-list requests including new items with the same datestamp
as the until value (likely if day granularity is used), it will
avoid the incomplete-list requests including new or modified items with later
datestamps.
One way to implement this form of resumptionToken is to encode
the original arguments of the list request and some offset information in
the resumptionToken. When the repository later receives the resumptionToken
it can then repeat the original query to regenerate the state. This may be
computationally expensive. However, there is no cache management overhead
associated with the resumptionToken values. For this reason alone,
it is recommended that unless implementers are comfortable with managing a
cache, repositories should use a scheme whereby all necessary state is encoded
in the resumptionToken values.
Note that some additional information may need to be encoded in the resumptionToken
to cope with the case where items move outside the datestamp
range of the initial request before the list request sequence is complete.
A repository may cache the results of a harvester's list request and then
return segments of this as incomplete list responses (accessible through further
requests with the appropriate resumptionToken values). A typical
solution would be to encode both a session identifier and a cursor in the
resumptionToken. If harvester issued a request:
http://an.oai.org/?verb=ListRecords&set=227&from=1999-02-03A
resumptionToken in the response could look similar to:
<resumptionToken>searchId=4234&range=751-1500</resumptionToken>
Subsequent requests from the harvester would involve the repository only sending the next 750 records (for example) from the cached result set. A repository could also check the result set cache for duplicate searches and if two different harvesters issued the same request, they could be served from the same cached result set. Depending on the nature of the repository and the community it serves, this scenario could turn out to be common, and caching results could be a significant optimization for the repository.
A downside of this approach is that the repository has the additional responsibility of maintaining and purging the cache. Given the limited amount of knowledge the OAI community currently has regarding harvesting profiles, it is not known which cache replacement algorithm would be best for a given repository. The wide range of repository deployment profiles may make a general recommendation impossible.
Repositories implementing this sort of resumptionToken should
use the expirationDate attribute to indicate how long the resumptionToken
values will remain valid. If only the identifiers are cached for ListRecords
and ListIdentifers requests then it may also be necessary to
check that items have not been deleted or that datestamps have
not moved out of the range of the request by the time that the metadata is
actually disseminated.
The OAI-PMH requires a weak form of idempotency for both complete list requests
and individual incomplete list requests using resumptionToken
arguments:
resumptionToken)
must return the same result. A session identifier cannot be used
as a resumptionToken without additional state information because
it must be possible to re-use a resumptionToken to re-issue
an incomplete list request. It is, however, permitted for repositories to
expire resumptionToken values; the time at which expiration
occurs may be specified in the expirationDate attribute.resumptionToken most recently received by a harvester
may be re-issued. The implication for repositories is that they must support
the above idempotency requirement for the most recent incomplete list request
that was answered. They must accept both the most recent resumptionToken
issued and the previous resumptionToken issued. There are scenarios where different semantics may result from the two implementation
strategies described above. If a repository changes rapidly compared with
the rate at which a harvester issues follow-on requests with resumptionToken
arguments, the two different implementations can yield different results.
Consider a repository that adds N new records midway through a harvester's
session, and these records match the criteria specified in the initial request
(set, timestamp, etc.). Implementations relying
on state stored by the repository are likely to give the results relative
to the time of the initial harvest. Implementations transferring state in
the resumptionToken values are likely to give the results relative
to the time of the last request (of some intermediate time). Sophisticated
implementations of either scheme could accommodate for this difference, but
it is recommended that implementers consider the anticipated accession rate
of the repository and set the optional resumptionToken attribute
expirationDate accordingly.
Idempotency of resumptionTokens will aid harvesters in recovering
from the unavoidable error conditions that can occur when harvesting large
repositories (e.g., lost responses, crashed harvesters, etc.).
expirationDate The expirationDate attribute may be used to indicate when a
resumptionToken expires. It is recommended that repositories
use resumptionToken values that remain valid for at least tens
of minutes.
If a repository uses resumptionToken values that do not expire,
then the expirationDate attribute should not be used.
If, for some reason, a repository cannot continue to complete a request
then it may issue a badResumptionToken (with suitable text explanation
for human readers) to stop the sequence of requests. A harvester would be
expected to start again from the initial list request.
completeListSize and cursor attributes The completeListSize and cursor attributes of
the resumptionToken element may optionally be included to provide
extra information about the size of the result set and the number of results
so far disseminated.
The cursor attribute is simply the number of results returned
so far in the complete list response, thus it is always "0" in the first incomplete
list response. It should only be specified if it is consistently used in all
responses.
The completeListSize attribute provides a place where the
estimated number of results in the complete list response may be announced.
This is likely to be used for status monitoring by harvesting software and
implementation is recommended especially in repositories with large numbers
of records. The value of completeListSize can be reliably accurate
only in the case of a system where the result set is cached. In other cases,
it is permissible for repositories to revise the estimate during a list
request sequence.
The related but orthogonal concepts of flow control and load
balancing allow a repository to throttle incoming requests. Flow control,
as available through incomplete lists and resumptionToken elements,
can be viewed as an intra-repository technique, and the semantics
of flow control are defined in the OAI-PMH. Load balancing can be viewed
as an inter-repository technique to redirect the harvesting load
from busy repositories to lightly loaded repositories with the same contents.
Load balancing semantics are not addressed in the OAI-PMH and are expected
to be handled at the transport layer.
Both concepts are optional, and both may be used independently or in combination.
It is easy to imagine a scenario in which a harvester is redirected to a
lightly loaded "mirror" repository, and the resulting requests and responses
between that repository and the harvester are throttled through the use
of resumptionToken elements.
The semantics of resumptionToken elements are defined in
the OAI-PMH, and possible implementation strategies are discussed above.
The rest of this section addresses transport-level flow control methods.
The HTTP protocol defines a status code of 503 - "Service Unavailable". If a repository is heavily loaded, instead of processing the request (and returning HTTP status code 200), it can issue a HTTP 503 response with a "Retry-After" header. This mechanism might also be used to delay requests during database updates or other short outages.
If a repository wants to slow a sequence of requests then it can issue
a HTTP 503 response, which informs the harvester that its request was not
processed and it should wait the specified amount of time before retrying.
At the repository's discretion, it may issue HTTP 403 "Forbidden" responses
to harvesters that do not adhere to the delay specified in the Retry-After
header of the 503 response. A 403 is more punitive than a 503, and should
be issued only in extreme situations.
It is recommended that a waterfall model of flow control be adopted, with repositories moving to the next level only after a harvester appears unresponsive at the current level:
resumptionToken.Retry-After header set to
an appropriate value if subsequent request follows too quickly or if the
server is heavily loaded.Retry-After delays.Although not addressed in the OAI-PMH, it is possible to build hierarchies of cooperating repositories. Especially popular or heavily loaded repositories might set up mirror repositories. Load balancing can then be accomplished through the use of HTTP 302 or 303 status codes, or even through the use of CNAMES in DNS. Load balancing is powerful, yet subtle. It is recommended that repositories implement load balancing only when: 1) the canonical repository is clearly overloaded, and 2) they are comfortable with details of OAI-PMH.
For example, consider the request:
http://an.oai.org/?verb=ListRecords&set=227&from=1999-02-03
If an.oai.org is especially busy, it may choose to issue
a 302 with the Location header of:
http://another.oai.org/?verb=ListRecords&set=227&from=1999-02-03
Which implies that another.oai.org is a mirror of (or "slave"
to) the canonical an.oai.org. If this request generates an
incomplete list, the successive harvesting requests will be relative to
another.oai.org. However, all future harvesting requests must
be directed to an.oai.org. To this end, if an Identify
request is issued to another.oai.org, the response must contain
an.oai.org in the baseURL. Similarly, the contents
of the request element in the header of the response must be
the baseURL (an.oai.org).
Note that another.oai.org is free to load balance as well,
issuing a 302 with the Location header of:
http://yetanother.oai.org/?verb=ListRecords&set=227&from=1999-02-03
It is up to the implementers to ensure there are no circular redirections for the harvester.
A simpler, but slightly less powerful load balancing strategy is to use
a DNS rotor. In this case, an.oai.org might resolve to one
of: a.oai.org, b.oai.org or c.oai.org.
Which presumably all map to physically separate machines. The harvester
resolving the name an.oai.org would randomly receive one of
the addresses and use that address for the harvesting requests. Repositories
wishing to implement this form of load balancing must ensure tight synchronization
between the machines answering the requests. They must also use an.oai.org
for baseURL (in Identify responses) and request
(in response headers).
Compression in the OAI-PMH is handled at the HTTP
level, with the added restriction that both repositories and harvesters
must support the identity encoding (no compression). Repositories
should advertise the compression they support in compression
element(s) in the Identify response.
Response from Identify:
<Identify ...> ... <compression>gzip</compression> ... </Identify>Next request includes: Accept-encoding: gzip;q=1.0, identity;q=0.5Response then gzip encoded with header:
Content-Encoding: gzip |
While compression is optional, it is encouraged to minimize network traffic
involved in lengthy harvests. It is recommended that repositories offer
at least one of the compression types defined in HTTP:
RFC 2616 section 3.5 "Content Codings": gzip, compress,
or deflate.
Errors are reported using one or more error elements in the
response. Although not required, it is strongly recommended that repository
implementers include detailed and helpful error messages (content of the
error elements) in addition to the mandatory code
attributes.
Repositories are permitted to include just a single error
element even if multiple error conditions are identified. However, to aid
debugging, it is recommended that repositories report as many errors as
they identify. Multiple error elements may be used and it is
preferable to separate even multiple errors with the same code
into separate error elements. For example,
<error code="badArgument">Illegal argument 'arg1'</error> <error code="badArgument">Illegal argument 'arg2'</error>
is to be preferred over
<error code="badArgument">Illegal arguments 'arg1', 'arg2'</error>
Support for the development of the OAI-PMH and for other Open Archives Initiative activities comes from the Digital Library Federation, the Coalition for Networked Information, and from the National Science Foundation through Grant No. IIS-9817416. Individuals who have played a significant role in the development of OAI-PMH version 2.0 are acknowledged in the protocol document.
2002-06-14: Release of this document, combined with the release of OAI-PMH version 2.0.