[OAI-implementers] Proposed corrections/fixes to OAI-PMH protocol document and schema

Simeon Warner simeon at cs.cornell.edu
Tue Jun 15 15:18:13 EDT 2004


I've been collecting problems/corrections to the OAI-PMH protocol document
and schema for some time. I detail proposed corrections and fixes below.
Please reply either directly to me or to the list with comments or
additional issues/corrections/errors.

Cheers,
Simeon



PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
--------------------------------------------------

1) Correct protocol document and schema definition of setSpec to be
consistent, and also to permit the use of URL encoding.

1.1) Motivation

First, the protocol document and the schema simply do not agree. The use
of the wording "any characters that are safe in a query component of a
URI" is unclear and cannot be construed to agree with the schema. Second,
many repositories are using URL-like encoding to create setSpecs so it
seems better to permit the recognized URL encoding. The practical change
to meet both of these criteria is very small: the schema regular
expression should be changed to remove $ and +, and to add ~ and %xx (URL
encoding). This will bring the protocol document in line with the terms
"escaped" and "unreserved" as used in the URI RFC.

1.2) Impact

The only conforming repository that we know of using setSpecs affected by
this change is Jeff Young's OpenURL repository
(http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
an encoding for space. Jeff agrees that a change would be sensible and
that he could replace '+' with '%20'. Repositories using URL-like
encodings will not be affected although they may choose to change to use
real URL encoding. All OAI software maintainers should, however, review
the change and update their parsing code accordingly.
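
As a rough illustration of the '+' to '%20' change, the following Python
sketch percent-encodes the elements of a setSpec and joins them with
colons. The helper names encode_set_element and build_set_spec are
hypothetical; only urllib.parse.quote comes from the standard library.
Note that quote escapes more than the proposal strictly requires (for
example '!' and '*'), which is still valid because escaped characters
are permitted.

```python
from urllib.parse import quote

def encode_set_element(name: str) -> str:
    """Percent-encode one path element of a setSpec (hypothetical helper)."""
    # safe="" forces ':' and '+' to be escaped as well, so the result can
    # later be joined with ':' without ambiguity.
    return quote(name, safe="")

def build_set_spec(path) -> str:
    """Join already-separated set names into a colon-delimited setSpec."""
    return ":".join(encode_set_element(part) for part in path)

if __name__ == "__main__":
    print(build_set_spec(["institution", "new york"]))
```

This sketch does not guard against empty elements, which would not
produce a valid setSpec.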

1.3) Changes

1.3.1) Change wording in protocol document
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
from:

a setSpec -- a colon [:] separated list indicating the path from the root
of the set hierarchy to the respective node.  Each element in the list is
a string consisting of any characters that are safe in a query component
of a URI , which must not contain any colons [ :].  Since a setSpec forms
a unique identifier for the set within the repository, it must be unique
for each set.  Flat set organizations have only sets with setSpec that do
not contain any colons [ :].

to:

a setSpec -- a colon [:] separated list indicating the path from the root
of the set hierarchy to the respective node. Each element in the list is a
string consisting of any valid URI "unreserved" and "escaped" characters.
A setTag must not contain URI "reserved" characters, for example the colon
[:] which is used to delimit setTags. Since a setSpec forms a unique
identifier for the set within the repository, it must be unique for each
set. Flat set organizations have only sets with setSpec that do not
contain any colons [:].

The corresponding parts of the specification of allowed characters in URIs
are:

unreserved    = alphanum | mark
mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                "(" | ")"
escaped       = "%" hex hex
hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                "a" | "b" | "c" | "d" | "e" | "f"


1.3.2) Change definition of setSpecType in the schema to match the definition
from:

  <simpleType name="setSpecType">
    <restriction base="string">
      <pattern value=
       "([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
    </restriction>
  </simpleType>

to:

  <simpleType name="setSpecType">
    <restriction base="string">
      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
    </restriction>
  </simpleType>
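
For maintainers updating parsing code, here is a minimal Python sketch of
the proposed setSpecType pattern. The names SET_SPEC_RE and
is_valid_set_spec are my own and not part of any OAI software. XML Schema
patterns implicitly match the whole string, so the sketch uses fullmatch
rather than search.

```python
import re

# One setSpec element: URI "unreserved" characters or "%xx" escapes.
_ELEMENT = r"(?:[A-Za-z0-9\-_.!~*'()]|%[A-Fa-f0-9]{2})+"

# A setSpec is one element followed by zero or more ":element" parts.
SET_SPEC_RE = re.compile(_ELEMENT + r"(?::" + _ELEMENT + r")*")

def is_valid_set_spec(value: str) -> bool:
    """Return True if value matches the proposed setSpecType pattern."""
    return SET_SPEC_RE.fullmatch(value) is not None

if __name__ == "__main__":
    for candidate in ["institution:florida", "open%20url", "open+url"]:
        print(candidate, is_valid_set_spec(candidate))
```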


2) Correct protocol document and schema definition for metadataPrefix to
be consistent, and also to match the revised setSpec definition.

2.1) Motivation

The protocol document uses the same imprecise wording for metadataPrefix
as it does for setSpec ("any characters that are safe in a query
component of a URI") and the schema does not even follow a reasonable
interpretation of this wording. It seems sensible to use the same
character restrictions in a consistent fashion. This will bring the
protocol document in line with the terms "escaped" and "unreserved" as
used in the URI RFC.

2.2) Impact

This change is not expected to impact any known repository.  All OAI
software maintainers should, however, review the change and update their
parsing code accordingly.

2.3) Changes

2.3.1) Change wording in protocol document
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
from:

The metadataPrefix - a string to specify the metadata format in OAI-PMH
requests issued to the repository. metadataPrefix consists of any
characters that are safe in a query component of a URI. metadataPrefix
arguments are used in ListRecords, ListIdentifiers, and GetRecord
requests to retrieve records, or the headers of records that include
metadata in the format specified by the metadataPrefix;

to:

The metadataPrefix - a string to specify the metadata format in OAI-PMH
requests issued to the repository. metadataPrefix consists of any valid
URI "unreserved" and "escaped"  characters. A metadataPrefix must not
contain URI "reserved" characters. metadataPrefix arguments are used in
ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
or the headers of records that include metadata in the format specified
by the metadataPrefix;

2.3.2) Change definition of metadataPrefixType in schema to match the
definition from:

  <simpleType name="metadataPrefixType">
    <restriction base="string">
      <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
    </restriction>
  </simpleType>

to:

  <simpleType name="metadataPrefixType">
    <restriction base="string">
      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
    </restriction>
  </simpleType>
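
The same check for metadataPrefix is simpler because no colon delimiter
is involved. Again this is only a sketch; METADATA_PREFIX_RE and
is_valid_metadata_prefix are illustrative names, not part of any existing
library.

```python
import re

# URI "unreserved" characters or "%xx" escapes, one or more times.
METADATA_PREFIX_RE = re.compile(r"(?:[A-Za-z0-9\-_.!~*'()]|%[A-Fa-f0-9]{2})+")

def is_valid_metadata_prefix(value: str) -> bool:
    """Return True if value matches the proposed metadataPrefixType pattern."""
    # fullmatch mirrors the whole-string matching of XML Schema patterns.
    return METADATA_PREFIX_RE.fullmatch(value) is not None

if __name__ == "__main__":
    for candidate in ["oai_dc", "oai:dc", "oai+dc"]:
        print(candidate, is_valid_metadata_prefix(candidate))
```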


3) Change protocol document to explicitly state that sets may be empty

3.1) Motivation

This issue has been raised a number of times in discussion and is not
made explicit in the protocol document.

3.2) Impact

None (except clarity)

3.3) Change: add additional bullet to the final set of bullets in protocol
document sec2.6
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set

Introduction of bullets should say "Five issues should be noted here"
instead of "Four issues should be noted here".

Additional bullet: The set hierarchy of a repository may include sets
that are empty.


4) Correct typo in sec2.6
(http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set)

Section 2.6 says:

If a repository supports sets then it must include set membership
information in the GetRecord request. The list of setSpec should include
only the minimum number of setSpec required to specify the set
membership. Using the previous example of a set hierarchy, the header for
an item organized in set institution:florida should not include setSpec
institution since that is implied by the setSpec institution:florida .

when it should say 'in response to ListIdentifiers, ListRecords and
GetRecord requests' instead of 'in the GetRecord request'. The corrected
paragraph reads:

If a repository supports sets then it must include set membership
information in the headers returned in response to ListIdentifiers,
ListRecords and GetRecord requests. The list of setSpec elements should
include only the minimum number of setSpec elements required to specify
the set membership. Using the previous example of a set hierarchy, the
header for an item organized in set institution:florida should not
include setSpec institution since that is implied by the setSpec
institution:florida.

[problem pointed out by Mike Taylor <mike at indexdata.com>]
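
One possible way for a repository to compute the minimum list of setSpec
elements is to drop every setSpec that is an ancestor of another one in
the list. The Python sketch below assumes set membership is available as
a plain list of setSpec strings; the function name minimal_set_specs is
hypothetical.

```python
def minimal_set_specs(set_specs):
    """Drop every setSpec that is implied by a more specific one."""
    # De-duplicate while keeping the original order.
    unique = list(dict.fromkeys(set_specs))
    # "institution" is implied by "institution:florida", so it is removed.
    # The trailing ":" avoids treating "inst" as an ancestor of "institution".
    return [
        spec for spec in unique
        if not any(other.startswith(spec + ":") for other in unique)
    ]

if __name__ == "__main__":
    print(minimal_set_specs(["institution", "institution:florida"]))
```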


5) Correct typo in the second example in protocol document sec5
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#dublincore

The namespace prefix is incorrectly given in the declaration as 'oai'
when it should be 'oai_dc', so the example is not schema valid. The
following correction is required:

3c3
<     xmlns:oai="http://www.openarchives.org/OAI/2.0/oai_dc/"
---
>     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"

[problem pointed out by Terry Harrison <1maniac at cox.net>]


6) Schema should more tightly validate UTCdatetime

6.1) Motivation

While the specification is quite clear about what datetime formats must
be used, the fact that the schema does not enforce the restriction to Z
notation means that there are errors not found during schema validation.
The additional check is easy to add to the schema and would likely
improve interoperability.

6.2) Impact

No conforming repository should be affected. The responses from some
non-conforming implementations will no longer schema validate. It is
hoped that this will encourage maintainers to correct them.

6.3) Change schema definition from:

  <simpleType name="UTCdatetimeType">
    <union memberTypes="date dateTime"/>
  </simpleType>

to:

  <simpleType name="UTCdatetimeType">
    <union memberTypes="date oai:UTCdateTimeZType"/>
  </simpleType>

  <simpleType name="UTCdateTimeZType">
    <restriction base="dateTime">
      <pattern value=".*Z"/>
    </restriction>
  </simpleType>
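
Harvesters or validators that do not rely on schema validation could
apply a similar check in code. The Python sketch below accepts only the
two forms described in the specification (YYYY-MM-DD and
YYYY-MM-DDThh:mm:ssZ). It is deliberately stricter than the schema
change above, which would still admit other values permitted by the
schema types date and dateTime (for example fractional seconds followed
by Z). The function name is_utc_datetime is hypothetical.

```python
import re
from datetime import datetime

_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_DATETIME_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z")

def is_utc_datetime(value: str) -> bool:
    """Return True for YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ with valid fields."""
    # The regular expressions fix the shape; strptime checks field ranges.
    if _DATE_RE.fullmatch(value):
        fmt = "%Y-%m-%d"
    elif _DATETIME_RE.fullmatch(value):
        fmt = "%Y-%m-%dT%H:%M:%SZ"
    else:
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True

if __name__ == "__main__":
    for candidate in ["2004-06-15T15:18:13Z", "2004-06-15T15:18:13-04:00"]:
        print(candidate, is_utc_datetime(candidate))
```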


7) Add section on deleted records to the protocol document

7.1) Motivation

Deleted records are an issue that causes confusion. This is not helped by
information about them being distributed over the protocol document.

7.2) Impact

none (except greater comprehension!)

7.3) Change protocol document to include:

2.5.1 Deleted records

If a record is no longer available then it is said to be 'deleted'.
Repositories may or may not keep track of deletions. If a repository does
not keep track of deletions then such records will simply vanish from
responses and there will be no way for a harvester to discover deletions
through continued incremental harvesting. If a repository does keep track
of deletions then the datestamp of the deleted record _must_ be the date
and time that it was deleted. Responses to a GetRecord request for a
deleted record _must_ then include a header with the attribute
status="deleted" and no metadata or about parts. Similarly, responses to
selective harvesting requests with set membership and date range criteria
that include deleted records _must_ include the headers of these records.
Incremental harvesting will thus discover deletions from repositories
that keep track of them. Repositories must indicate their level of
support for deletions in the deletedRecord element of the Identify
response.

Note that deleted status is a property of individual records. Like a
normal record, a deleted record is identified by a unique identifier, a
metadataPrefix and a datestamp. Other records, with different
metadataPrefix but the same unique identifier, may remain available for
the item.
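
To illustrate how a harvester might act on this, the Python sketch below
separates deleted from available records using the status attribute of
each header. The SAMPLE response fragment and its identifiers are
invented for the example, and split_headers is a hypothetical helper; a
real response would be wrapped in the full OAI-PMH envelope.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's "{uri}" notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Invented fragment: one available record and one deleted record.
SAMPLE = """<ListIdentifiers xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1</identifier>
    <datestamp>2004-06-01</datestamp>
  </header>
  <header status="deleted">
    <identifier>oai:example.org:2</identifier>
    <datestamp>2004-06-10</datestamp>
  </header>
</ListIdentifiers>"""

def split_headers(xml_text):
    """Return (available, deleted) identifier lists from header elements."""
    root = ET.fromstring(xml_text)
    available, deleted = [], []
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        # A deleted record carries status="deleted" on its header.
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            available.append(identifier)
    return available, deleted

if __name__ == "__main__":
    print(split_headers(SAMPLE))
```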


8) Change schema to define a type for protocolVersion instead of using
an anonymous definition (ALREADY DONE, 2004-03-29)

8.1) Motivation

To allow reuse as part of the Static Repository schema
http://www.openarchives.org/OAI/2.0/OAI-PMH-static-repository.xsd

8.2) Impact

None except as noted in motivation. Schema semantics unchanged; all
validating instances will still validate.

8.3) Change to OAI-PMH.xsd

98,104c100
<       <element name="protocolVersion">
<         <simpleType>
<           <restriction base="string">
<             <enumeration value="2.0"/>
<           </restriction>
<         </simpleType>
<       </element>
---
>       <element name="protocolVersion" type="oai:protocolVersionType"/>
192a189
>
253a251,256
>
>   <simpleType name="protocolVersionType">
>     <restriction base="string">
>       <enumeration value="2.0"/>
>     </restriction>
>   </simpleType>

(old version available as
http://www.openarchives.org/OAI/2.0/OAI-PMH.2002-06-13.xsd)


