[OAI-implementers] Part II: Proposed corrections/fixes to OAI-PMH protocol document and schema

Simeon Warner simeon at cs.cornell.edu
Thu Sep 16 19:31:29 EDT 2004


I'd like to solicit further comment regarding issues 1 and 2 of the set of
proposed corrections and fixes to the OAI-PMH protocol document and schema
that I sent back in June (copied below, alternatively see:
http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
These are really the same issue repeated for both setSpec and
metadataPrefix. Both cases involve the same two parts which I describe
below: part A I assume is not controversial; part B Hussein commented on.
A lack of other comments presumably indicates lack of other objections but
I'd like to confirm that since the proposal will involve minor changes in
some implementations.


A) The values of setSpec and metadataPrefix permitted by the protocol
document and by the schema simply do not agree. This should be corrected.

The meaning of the current wording "any characters that are safe in a
query component of a URI" is unclear and cannot be construed to agree with
the schema.  I suggest the simplest way to clarify and fix this is to
rephrase as "a string consisting of any valid URI 'unreserved' characters"
which would give the following changes in allowed values (both of which
add ~ and disallow $ and + ):

setSpec from:
<pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
to:
<pattern value="([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"/>

metadataPrefix from:
<pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
to:
<pattern value="[A-Za-z0-9\-_\.!~\*'\(\)]+"/>

The setSpec pattern is more complicated because elements are separated by
colons [:].
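
As a quick illustration of part A, here is a minimal Python sketch of the two
proposed "unreserved only" patterns. The function and constant names are my
own and purely illustrative; the character class is transcribed from the
proposed schema patterns above. XML Schema patterns are implicitly anchored
at both ends, so the sketch uses fullmatch() to mimic that.

```python
import re

# URI "unreserved" characters (alphanum | mark), transcribed from the
# proposed schema patterns. Names in this sketch are illustrative only.
UNRESERVED = r"[A-Za-z0-9\-_.!~*'()]"

# setSpec: one or more elements separated by colons.
SET_SPEC_RE = re.compile(rf"{UNRESERVED}+(?::{UNRESERVED}+)*")
# metadataPrefix: a single element, no colons.
METADATA_PREFIX_RE = re.compile(rf"{UNRESERVED}+")


def is_valid_set_spec(value: str) -> bool:
    """True if the whole value matches the proposed setSpec pattern."""
    return SET_SPEC_RE.fullmatch(value) is not None


def is_valid_metadata_prefix(value: str) -> bool:
    """True if the whole value matches the proposed metadataPrefix pattern."""
    return METADATA_PREFIX_RE.fullmatch(value) is not None
```

With these definitions a value such as "physics:hep-th" or "a~b" is accepted
as a setSpec, while "a+b" and "a$b" are rejected, which is the practical
effect of the change described above.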


B) There should be some standard way to permit straightforward use,
perhaps via escaping, of setSpec and metadataPrefix values native to
repositories.

The suggestion is to permit URI "escaped" characters (%xx where xx are two
hex digits). I note that a number of repositories have already adopted
encoding using hex but that in most cases the escape character is simply
omitted; in a few cases another escape character has been chosen (e.g. *)
because % is not permitted. The fact that implementers are already doing
this demonstrates a desire to encode values native to other systems.
Permitting URI "escaped" characters is a simple way to standardize this
using a well-known escaping mechanism without significantly increasing
complexity.
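
To make part B concrete, here is a rough Python sketch of how a data provider
might map native set names to setSpec values using URI "escaped" characters.
It is only one possible approach, not part of the proposal: the function names
are hypothetical, and it assumes Python 3.9+ and that non-ASCII characters are
escaped via their UTF-8 bytes (the default behaviour of urllib.parse.quote),
a choice the proposal itself does not address. Empty element names are not
handled.

```python
from urllib.parse import quote, unquote

# URI "mark" characters that quote() would otherwise escape. quote() never
# escapes letters, digits, "_", ".", "-" and (since Python 3.7) "~".
EXTRA_SAFE = "!*'()"


def encode_set_element(native: str) -> str:
    """Escape one native set name to unreserved characters plus %xx."""
    return quote(native, safe=EXTRA_SAFE)


def encode_set_spec(path: list[str]) -> str:
    """Join escaped elements with the OAI-PMH hierarchy delimiter ':'."""
    return ":".join(encode_set_element(element) for element in path)


def decode_set_spec(set_spec: str) -> list[str]:
    """Split on ':' first, then undo the %xx escaping of each element."""
    return [unquote(element) for element in set_spec.split(":")]
```

A native colon inside an element becomes %3A and so cannot be confused with
the hierarchy delimiter, and a native percent sign becomes %25 so that
decoding is unambiguous.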

Alternatives include:

1) Use another escaping mechanism. Another obvious choice would be to use
XML numeric entities (e.g. '&#58;' (decimal) or '&#x3A;' (hex) for a
colon).  These entities would themselves have to be escaped in
XML responses (otherwise you have alternative 2) so responses might
include XML of the form <setSpec>&amp;#x3A;</setSpec> to encode a setSpec
which is internally a colon [:]. One might also want to restrict to
just-decimal or just-hex to reduce complexity. It seems to me that one
ends up with a complex set of restrictions on XML entity encoding which
largely negate any benefit of adopting that standard. Perhaps there is
another good option?

2) Permit a much larger character set in the first place (the limit being
"anything" - the XML schema "string" type). I see three issues with this.
First, when OAI-PMH was first designed we decided on a limited character
set to make implementation easier; I think this still has some merit.
Second, in the setSpec there will always be a potential need to escape a
colon [:], since that has special meaning in OAI-PMH (which may not
correspond to use in values native to a repository). Third, this would be
a significant change requiring updates to most harvesting software.
Significant extension of the character set is beyond the scope of the
present proposal.

3) Do not include a standard way to permit the use of setSpec and
metadataPrefix values native to repositories (simply make the protocol
document and schema agree as described in A).
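
To illustrate the point in alternative 1 about entities needing to be escaped
themselves, the following small Python sketch (using the standard library's
ElementTree parser; the helper name is made up) shows what a harvester would
see in each case. An unescaped numeric entity is resolved by the XML parser
and simply arrives as the character itself, so only the doubly escaped form
carries the entity text through as the setSpec value.

```python
import xml.etree.ElementTree as ET


def set_spec_text(xml_fragment: str) -> str:
    """Return the character data a parser reports for one setSpec element."""
    return ET.fromstring(xml_fragment).text or ""


# The parser resolves the numeric character reference to a plain colon.
resolved = set_spec_text("<setSpec>a&#x3A;b</setSpec>")

# Escaping the ampersand preserves the entity text as the setSpec value.
preserved = set_spec_text("<setSpec>a&amp;#x3A;b</setSpec>")
```

Here resolved is "a:b", indistinguishable from a two-element setSpec, while
preserved is the literal string "a&#x3A;b".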

Note that this issue is quite separate from URL-encoding of OAI requests
made over HTTP. Characters used in any escaping mechanism for setSpec and
metadataPrefix may need to be further escaped when used in URLs.
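
For example, the two layers of escaping might look like the following Python
sketch. The base URL and function name are hypothetical, and the setSpec is
assumed to already contain a %xx escape chosen by the data provider; the
harvester treats it as an opaque token and URL-encodes it like any other
argument value.

```python
from urllib.parse import parse_qs, urlencode

BASE_URL = "http://example.org/oai"  # hypothetical repository base URL


def list_records_url(set_spec: str, metadata_prefix: str) -> str:
    """Build a ListRecords request, URL-encoding each argument value."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{BASE_URL}?{query}"


# The setSpec already contains an escaped colon (%3A) as part of its value.
url = list_records_url("physics:hep%3Ath", "oai_dc")

# The repository's HTTP layer undoes only the outer (URL) encoding.
received = parse_qs(url.split("?", 1)[1])["set"][0]
```

The "%" of the setSpec's own escape is itself encoded as %25 in the request
URL, and the repository recovers the setSpec exactly as it was listed,
including its %3A.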

On Mon, 21 Jun 2004, Hussein Suleman wrote:
...
> 1/2: i have some reservations about us requiring URL-encoding within
> XML. this mixes syntax with intended semantics of use and further
> entrenches the implicit support for URL-encoding, which is irrelevant
> if, for example, OAI-PMH makes the jump to using a SOAP request/response
> model. the model and abstractions must be clean and separable, they
> arent quite so already and i would prefer they didnt get more complicated.

In response, I don't think the proposal was to _require_ URL-encoding. It
was to allow it at a data-provider's choice; service providers should (in
the absence of other information, e.g. oai_dc is special) treat both
setSpec and metadataPrefix values as opaque tokens. OAI-PMH's special use
of the colon means that this issue would not entirely go away even if
OAI-PMH used an XML-clean transport such as SOAP, and we were no longer
concerned about the burden on harvesters of permitting any string to be
used.


Ug, that got longer than I hoped...

Cheers,
Simeon


> Simeon Warner wrote:
> > ...
> > PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
> > --------------------------------------------------
> >
> > 1) Correct protocol document and schema definition of setSpec to be
> > consistent, and also to permit the use of URL encoding.
> >
> > 1.1) Motivation
> >
> > First, the protocol document and the schema simply do not agree. The use
> > of the wording "any characters that are safe in a query component of a
> > URI" is unclear and cannot be construed to agree with the schema. Second,
> > many repositories are using URL-like encoding to create setSpecs so it
> > seems better to permit the recognized URL encoding. The practical change
> > to meet both of these criteria is very small: the schema regular
> > expression should be changed to remove $ and +, and to add ~ and %xx (URL
> > encoding). This will bring the protocol document in line with the terms
> > "escaped" and "unreserved" as used in the URI RFC.
> >
> > 1.2) Impact
> >
> > The only conforming repository that we know of using setSpecs affected by
> > this change is Jeff Young's OpenURL repository
> > (http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
> > an encoding for space. Jeff agrees that a change would be sensible and
> > that he could replace '+' with '%20'. Repositories using URL-like
> > encodings will not be affected although they may choose to change to use
> > real URL encoding. All OAI software maintainers should, however, review
> > the change and update their parsing code accordingly.
> >
> > 1.3) Changes
> >
> > 1.3.1) Change wording in protocol document
> > http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
> > from:
> >
> > a setSpec -- a colon [:] separated list indicating the path from the root
> > of the set hierarchy to the respective node.  Each element in the list is
> > a string consisting of any characters that are safe in a query component
> > of a URI , which must not contain any colons [ :].  Since a setSpec forms
> > a unique identifier for the set within the repository, it must be unique
> > for each set.  Flat set organizations have only sets with setSpec that do
> > not contain any colons [ :].
> >
> > to:
> >
> > a setSpec -- a colon [:] separated list indicating the path from the root
> > of the set hierarchy to the respective node. Each element in the list is a
> > string consisting of any valid URI "unreserved" and "escaped" characters.
> > A setTag must not contain URI "reserved" characters, for example the colon
> > [:] which is used to delimit setTags. Since a setSpec forms a unique
> > identifier for the set within the repository, it must be unique for each
> > set. Flat set organizations have only sets with setSpec that do not
> > contain any colons [:].
> >
> > The corresponding parts of the specification of allowed characters in URIs
> > are:
> >
> > unreserved    = alphanum | mark
> > mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
> >                 "(" | ")"
> > escaped       = "%" hex hex
> > hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> >                 "a" | "b" | "c" | "d" | "e" | "f"
> >
> >
> > 1.3.2) Change definition of setSpecType in the schema to match the definition
> > from:
> >
> >  <simpleType name="setSpecType">
> >     <restriction base="string">
> >       <pattern value=
> >        "([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> >     </restriction>
> >   </simpleType>
> >
> > to:
> >
> >   <simpleType name="setSpecType">
> >     <restriction base="string">
> >       <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
> >     </restriction>
> >   </simpleType>
> >
> >
> > 2) Correct protocol document and schema definition for metadataPrefix to
> > be consistent, and also to match the revised setSpec definition.
> >
> > 2.1) Motivation
> >
> > The protocol document uses the same imprecise wording for metadataPrefix
> > as it does for setSpec ("any characters that are safe in a query
> > component of a URI") and the schema does not even follow a reasonable
> > interpretation of this wording. It seems sensible to use the same
> > character restrictions in a consistent fashion. This will bring the
> > protocol document in line with the terms "escaped" and "unreserved" as
> > used in the URI RFC.
> >
> > 2.2) Impact
> >
> > This change is not expected to impact any known repository.  All OAI
> > software maintainers should, however, review the change and update their
> > parsing code accordingly.
> >
> > 2.3) Changes
> >
> > 2.3.1) Change wording in protocol document
> > http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
> > from:
> >
> > The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > requests issued to the repository. metadataPrefix consists of any
> > characters that are safe in a query component of a URI. metadataPrefix
> > arguments are used in ListRecords, ListIdentifiers, and GetRecord
> > requests to retrieve records, or the headers of records that include
> > metadata in the format specified by the metadataPrefix;
> >
> > to:
> >
> > The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > requests issued to the repository. metadataPrefix consists of any valid
> > URI "unreserved" and "escaped"  characters. A metadataPrefix must not
> > contain URI "reserved" characters. metadataPrefix arguments are used in
> > ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
> > or the headers of records that include metadata in the format specified
> > by the metadataPrefix;
> >
> > 2.3.2) Change definition of metadataPrefixType in schema to match the
> > definition from:
> >
> >   <simpleType name="metadataPrefixType">
> >     <restriction base="string">
> >       <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> >     </restriction>
> >   </simpleType>
> >
> > to:
> >
> >   <simpleType name="metadataPrefixType">
> >     <restriction base="string">
> >       <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
> >     </restriction>
> >   </simpleType>


----------------------------------------------------------
Simeon Warner                 Email: simeon at cs.cornell.edu
Cornell Information Science              Tel: 607-254-8605
301 College Ave                          Fax: 607-255-5196
Ithaca, NY 14850-4623, USA



