[OAI-implementers] Part II: Proposed corrections/fixes to OAI-PMH protocol document and schema

Tim Brody tdb01r at ecs.soton.ac.uk
Tue Sep 21 08:01:12 EDT 2004


I would remove the protocol definition of the "magic colon" hierarchy in
Sets, and make the Prefix and Set arguments anyString.

The use of structured data in request arguments is confusing and
unnecessary.

When/if an official SOAP definition of OAI is released, I would recommend
replacing Prefix with the schema URL. There also needs to be a solution to
the record-moving-out-of-set problem ...

All the best,
Tim.

----- Original Message ----- 
From: "Hussein Suleman" <hussein at cs.uct.ac.za>
To: <oai-implementers at oaisrv.nsdl.cornell.edu>
Cc: "Simeon Warner" <simeon at cs.cornell.edu>
Sent: Monday, September 20, 2004 5:13 PM
Subject: Re: [OAI-implementers] Part II: Proposed corrections/fixes to
OAI-PMHprotocol document and schema


> hi Simeon (et al)
>
> to follow on, i agree that we will always need to escape ":" because of
> PMH semantics.
>
> the clean solution is to propose the use of a special OAI escape
> character, say "!". then, we could use the forward mapping:
>    : -> !:
>    ! -> !!
> then, specify that setSpecs and mdps are simply unrestricted Unicode,
> with service providers having to apply URL-encoding when submitting
> requests involving setSpecs and mdps, and data providers having to apply
> XML encoding when returning such information (with reverse
> transformation as needed). there are a few other issues here - like
> Unicode use in URLs, but lets punt on that for now ...
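
A minimal sketch of the "!" escaping proposed above, written in Python. The
function names are illustrative only and are not part of any OAI-PMH
specification. Accepting an escape character followed by any character (not
just ":" or "!") is an assumption of this sketch.

```python
ESCAPE = "!"      # the proposed OAI escape character
SEPARATOR = ":"   # the setSpec hierarchy separator


def escape_element(native):
    """Forward mapping: ':' -> '!:' and '!' -> '!!'."""
    out = []
    for ch in native:
        if ch in (ESCAPE, SEPARATOR):
            out.append(ESCAPE)
        out.append(ch)
    return "".join(out)


def join_set_spec(elements):
    """Build a setSpec from native element names, escaping each one."""
    return SEPARATOR.join(escape_element(e) for e in elements)


def split_set_spec(set_spec):
    """Reverse mapping: split on unescaped colons and undo the escaping."""
    elements, current = [], []
    chars = iter(set_spec)
    for ch in chars:
        if ch == ESCAPE:
            # The character after the escape is taken literally.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling escape character")
            current.append(nxt)
        elif ch == SEPARATOR:
            elements.append("".join(current))
            current = []
        else:
            current.append(ch)
    elements.append("".join(current))
    return elements


print(join_set_spec(["physics", "a:b!c"]))   # physics:a!:b!!c
print(split_set_spec("physics:a!:b!!c"))     # ['physics', 'a:b!c']
```

URL-encoding of requests and XML-encoding of responses would still be applied
on top of this, as separate layers.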
>
> now, i know this proposes to change semantics - i believe we are already
> on the slippery slope of trying to patch things up by introducing more
> complexity and greater reliance on basic HTTP.
>
> practically, in the short term, i support option 3, to tackle only issue
> A and not issue B. in the long term, maybe when we consider SOAP, we
> really should clean up the data model.
>
> ttfn,
> ----hussein
>
>
> Simeon Warner wrote:
>
> > I'd like to solicit further comment regarding issues 1 and 2 of the set of
> > proposed corrections and fixes to the OAI-PMH protocol document and schema
> > that I sent back in June (copied below, alternatively see:
> > http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
> > These are really the same issue repeated for both setSpec and
> > metadataPrefix. Both cases involve the same two parts which I describe
> > below: part A I assume is not controversial; part B Hussein commented on.
> > A lack of other comments presumably indicates lack of other objections but
> > I'd like to confirm that since the proposal will involve minor changes in
> > some implementations.
> >
> >
> > A) The values of setSpec and metadataPrefix permitted by the protocol
> > document and by the schema simply do not agree. This should be corrected.
> >
> > The meaning of the current wording "any characters that are safe in a
> > query component of a URI" is unclear and cannot be construed to agree with
> > the schema.  I suggest the simplest way to clarify and fix this is to
> > rephrase as "a string consisting of any valid URI 'unreserved' characters"
> > which would give the following changes in allowed values (both of which
> > add ~ and disallow $ and +):
> >
> > setSpec from:
> > <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> > to:
> > <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"/>
> >
> > metadataPrefix from:
> > <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> > to:
> > <pattern value="[A-Za-z0-9\-_\.!~\*'\(\)]+"/>
> >
> > The setSpec pattern is more complicated because elements are separated by
> > colons [:].
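
One way to check the claim that part A only adds ~ and drops $ and + is to
compare the two metadataPrefix character classes directly. This is a rough
sketch in Python; it assumes the schema patterns carry over unchanged to
Python's re syntax, and uses fullmatch because XML Schema patterns are
implicitly anchored.

```python
import re
import string

# Character classes copied from the current and proposed metadataPrefix patterns.
OLD_PREFIX = re.compile(r"[A-Za-z0-9_!'$\(\)\+\-\.\*]+")
NEW_PREFIX = re.compile(r"[A-Za-z0-9\-_\.!~\*'\(\)]+")


def allowed(pattern):
    """Return the printable ASCII characters a pattern accepts on their own."""
    return {c for c in string.printable if pattern.fullmatch(c)}


old_chars = allowed(OLD_PREFIX)
new_chars = allowed(NEW_PREFIX)

added = sorted(new_chars - old_chars)
removed = sorted(old_chars - new_chars)

print(added)    # ['~']
print(removed)  # ['$', '+']
```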
> >
> >
> > B) There should be some standard way to permit straightforward use,
> > perhaps via escaping, of setSpec and metadataPrefix values native to
> > repositories.
> >
> > The suggestion is to permit URI "escaped" characters (%xx where xx are two
> > hex digits). I note that a number of repositories have already adopted
> > encoding using hex but that in most cases the escape character is simply
> > omitted; in a few cases another escape character has been chosen (e.g. *)
> > because % is not permitted. The fact that implementers are already doing
> > this demonstrates a desire to encode values native to other systems.
> > Permitting URI "escaped" characters is a simple way to standardize this
> > using a well-known escaping mechanism without significantly increasing
> > complexity.
> >
> > Alternatives include:
> >
> > 1) Use another escaping mechanism. Another obvious choice would be to use
> > XML numeric entities (e.g. '&#58;' (decimal) or '&#x3A;' (hex) for a
> > colon).  These entities would themselves have to be escaped in
> > XML responses (otherwise you have alternative 2) so responses might
> > include XML of the form <setSpec>&amp;#x3A;</setSpec> to encode a setSpec
> > which is internally a colon [:]. One might also want to restrict to
> > just-decimal or just-hex to reduce complexity. It seems to me that one
> > ends up with a complex set of restrictions on XML entity encoding which
> > largely negate any benefit of adopting that standard. Perhaps there is
> > another good option?
> >
> > 2) Permit a much larger character set in the first place (the limit being
> > "anything" - the XML schema "string" type). I see three issues with this.
> > First, when OAI-PMH was first designed we decided on a limited character
> > set to make implementation easier; I think this still has some merit.
> > Second, in the setSpec there will always be a potential need to escape a
> > colon [:], since that has special meaning in OAI-PMH (which may not
> > correspond to use in values native to a repository). Third, this would be
> > a significant change requiring updates to most harvesting software.
> > Significant extension of the character set is beyond the scope of the
> > present proposal.
> >
> > 3) Do not include a standard way to permit the use of setSpec and
> > metadataPrefix values native to repositories (simply make the protocol
> > document and schema agree as described in A).
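
The double escaping described under alternative 1 can be illustrated with a
short Python sketch. It only demonstrates the layering; html.unescape is used
here as a stand-in for whatever OAI-specific decoding step a harvester would
need, and it also decodes named entities, which the alternative as described
would probably want to forbid.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

native = ":"  # a set name that is internally just a colon

# The setSpec value carries the numeric entity as literal text.
set_spec = "&#x{:X};".format(ord(native))

# In the XML response the ampersand itself must be escaped.
response_xml = "<setSpec>{}</setSpec>".format(escape(set_spec))
print(response_xml)  # <setSpec>&amp;#x3A;</setSpec>

# An XML parser undoes only the outer layer and returns the entity text...
parsed = ET.fromstring(response_xml).text

# ...so a second decoding step is needed to reach the native value.
decoded = html.unescape(parsed)
```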
> >
> > Note that this issue is quite separate from URL-encoding of OAI requests
> > made over HTTP. Characters used in any escaping mechanism for setSpec and
> > metadataPrefix may need to be further escaped when used in URLs.
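
The two layers can be seen in a small Python sketch using urllib.parse. The
base URL and the native set name are made up for illustration, and encoding
the native name as UTF-8 before escaping is an assumption of this sketch, not
something the proposal specifies.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

BASE_URL = "http://example.org/oai"  # hypothetical repository base URL

# URI "mark" characters that quote() would otherwise escape; letters, digits
# and "-_.~" are already left alone by quote().
MARKS = "!*'()"


def to_set_spec_element(native):
    """Layer 1: turn a native name into 'unreserved' and 'escaped' characters."""
    return quote(native, safe=MARKS)


set_spec = to_set_spec_element("High Energy: Theory")
print(set_spec)  # High%20Energy%3A%20Theory

# Layer 2: URL-encoding of the request escapes the '%' characters again.
query = urlencode(
    {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
)
request_url = BASE_URL + "?" + query
print(query)  # verb=ListRecords&metadataPrefix=oai_dc&set=High%2520Energy%253A%2520Theory

# The data provider's HTTP layer undoes layer 2 and sees the setSpec token.
received = parse_qs(query)["set"][0]

# Only the data provider needs to undo layer 1 to reach its native name.
native = unquote(received)
```

A harvester would stop at the setSpec token and treat it as opaque.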
> >
> > On Mon, 21 Jun 2004, Hussein Suleman wrote:
> > ...
> >
> >>1/2: i have some reservations about us requiring URL-encoding within
> >>XML. this mixes syntax with intended semantics of use and further
> >>entrenches the implicit support for URL-encoding, which is irrelevant
> >>if, for example, OAI-PMH makes the jump to using a SOAP request/response
> >>model. the model and abstractions must be clean and separable, they
> >>arent quite so already and i would prefer they didnt get more complicated.
> >
> >
> > In response, I don't think the proposal was to _require_ URL-encoding. It
> > was to allow it at a data-provider's choice; service providers should (in
> > the absence of other information, e.g. oai_dc is special) treat both
> > setSpec and metadataPrefix values as opaque tokens. OAI-PMH's special use
> > of the colon means that this issue would not entirely go away even if
> > OAI-PMH used an XML-clean transport such as SOAP, and we were no longer
> > concerned about the burden on harvesters of permitting any string to be
> > used.
> >
> >
> > Ug, that got longer than I hoped...
> >
> > Cheers,
> > Simeon
> >
> >
> >
> >>Simeon Warner wrote:
> >>
> >>>...
> >>>PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
> >>>--------------------------------------------------
> >>>
> >>>1) Correct protocol document and schema definition of setSpec to be
> >>>consistent, and also to permit the use of URL encoding.
> >>>
> >>>1.1) Motivation
> >>>
> >>>First, the protocol document and the schema simply do not agree. The use
> >>>of the wording "any characters that are safe in a query component of a
> >>>URI" is unclear and cannot be construed to agree with the schema. Second,
> >>>many repositories are using URL-like encoding to create setSpecs so it
> >>>seems better to permit the recognized URL encoding. The practical change
> >>>to meet both of these criteria is very small: the schema regular
> >>>expression should be changed to remove $ and +, and to add ~ and %xx (URL
> >>>encoding). This will bring the protocol document in line with the terms
> >>>"escaped" and "unreserved" as used in the URI RFC.
> >>>
> >>>1.2) Impact
> >>>
> >>>The only conforming repository that we know of using setSpecs affected by
> >>>this change is Jeff Young's OpenURL repository
> >>>(http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
> >>>an encoding for space. Jeff agrees that a change would be sensible and
> >>>that he could replace '+' with '%20'. Repositories using URL-like
> >>>encodings will not be affected although they may choose to change to use
> >>>real URL encoding. All OAI software maintainers should, however, review
> >>>the change and update their parsing code accordingly.
> >>>
> >>>1.3) Changes
> >>>
> >>>1.3.1) Change wording in protocol document
> >>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
> >>>from:
> >>>
> >>>a setSpec -- a colon [:] separated list indicating the path from the root
> >>>of the set hierarchy to the respective node.  Each element in the list is
> >>>a string consisting of any characters that are safe in a query component
> >>>of a URI, which must not contain any colons [:].  Since a setSpec forms
> >>>a unique identifier for the set within the repository, it must be unique
> >>>for each set.  Flat set organizations have only sets with setSpec that do
> >>>not contain any colons [:].
> >>>
> >>>to:
> >>>
> >>>a setSpec -- a colon [:] separated list indicating the path from the root
> >>>of the set hierarchy to the respective node. Each element in the list is a
> >>>string consisting of any valid URI "unreserved" and "escaped" characters.
> >>>A setTag must not contain URI "reserved" characters, for example the colon
> >>>[:] which is used to delimit setTags. Since a setSpec forms a unique
> >>>identifier for the set within the repository, it must be unique for each
> >>>set. Flat set organizations have only sets with setSpec that do not
> >>>contain any colons [:].
> >>>
> >>>The corresponding parts of the specification of allowed characters in
> >>>URIs are:
> >>>
> >>>unreserved    = alphanum | mark
> >>>mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
> >>>                "(" | ")"
> >>>escaped       = "%" hex hex
> >>>hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> >>>                "a" | "b" | "c" | "d" | "e" | "f"
> >>>
> >>>
> >>>1.3.2) Change definition of setSpecType in the schema to match the
> >>>definition from:
> >>>
> >>> <simpleType name="setSpecType">
> >>>    <restriction base="string">
> >>>      <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> >>>    </restriction>
> >>>  </simpleType>
> >>>
> >>>to:
> >>>
> >>>  <simpleType name="setSpecType">
> >>>    <restriction base="string">
> >>>      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
> >>>    </restriction>
> >>>  </simpleType>
> >>>
> >>>
> >>>2) Correct protocol document and schema definition for metadataPrefix to
> >>>be consistent, and also to match the revised setSpec definition.
> >>>
> >>>2.1) Motivation
> >>>
> >>>The protocol document uses the same imprecise wording for metadataPrefix
> >>>as it does for setSpec ("any characters that are safe in a query
> >>>component of a URI") and the schema does not even follow a reasonable
> >>>interpretation of this wording. It seems sensible to use the same
> >>>character restrictions in a consistent fashion. This will bring the
> >>>protocol document in line with the terms "escaped" and "unreserved" as
> >>>used in the URI RFC.
> >>>
> >>>2.2) Impact
> >>>
> >>>This change is not expected to impact any known repository.  All OAI
> >>>software maintainers should, however, review the change and update their
> >>>parsing code accordingly.
> >>>
> >>>2.3) Changes
> >>>
> >>>2.3.1) Change wording in protocol document
> >>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
> >>>from:
> >>>
> >>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
> >>>requests issued to the repository. metadataPrefix consists of any
> >>>characters that are safe in a query component of a URI. metadataPrefix
> >>>arguments are used in ListRecords, ListIdentifiers, and GetRecord
> >>>requests to retrieve records, or the headers of records that include
> >>>metadata in the format specified by the metadataPrefix;
> >>>
> >>>to:
> >>>
> >>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
> >>>requests issued to the repository. metadataPrefix consists of any valid
> >>>URI "unreserved" and "escaped"  characters. A metadataPrefix must not
> >>>contain URI "reserved" characters. metadataPrefix arguments are used in
> >>>ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
> >>>or the headers of records that include metadata in the format specified
> >>>by the metadataPrefix;
> >>>
> >>>2.3.2) Change definition of metadataPrefixType in schema to match the
> >>>definition from:
> >>>
> >>>  <simpleType name="metadataPrefixType">
> >>>    <restriction base="string">
> >>>      <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> >>>    </restriction>
> >>>  </simpleType>
> >>>
> >>>to:
> >>>
> >>>  <simpleType name="metadataPrefixType">
> >>>    <restriction base="string">
> >>>      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
> >>>    </restriction>
> >>>  </simpleType>
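
For implementers updating their parsing code, the two proposed patterns can be
approximated in Python as below. This is a sketch, not a substitute for schema
validation: the function names are illustrative, and the patterns are
restatements of the schema patterns above in Python's re syntax, with
fullmatch standing in for the implicit anchoring of XML Schema patterns.

```python
import re

# One element: URI "unreserved" characters or "escaped" (%xx) sequences.
ELEMENT = r"(?:[A-Za-z0-9\-_\.!~\*'\(\)]|%[A-Fa-f0-9]{2})+"

# A setSpec is one or more elements separated by colons.
SET_SPEC = re.compile(ELEMENT + r"(?::" + ELEMENT + r")*")

# A metadataPrefix is a single element with no colons.
METADATA_PREFIX = re.compile(ELEMENT)


def is_valid_set_spec(value):
    """True if value matches the proposed setSpecType pattern."""
    return SET_SPEC.fullmatch(value) is not None


def is_valid_metadata_prefix(value):
    """True if value matches the proposed metadataPrefixType pattern."""
    return METADATA_PREFIX.fullmatch(value) is not None


for candidate in ["physics:hep%20th", "a+b", "a::b", "50%"]:
    print(candidate, is_valid_set_spec(candidate))
```

Under these patterns "physics:hep%20th" is accepted, while "a+b" (the plus
sign is no longer allowed), "a::b" (empty element) and "50%" (incomplete
escape) are rejected.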
> >
> >
> >
> > ----------------------------------------------------------
> > Simeon Warner                 Email: simeon at cs.cornell.edu
> > Cornell Information Science              Tel: 607-254-8605
> > 301 College Ave                          Fax: 607-255-5196
> > Ithaca, NY 14850-4623, USA
> >
> >
> > _______________________________________________
> > OAI-implementers mailing list
> > List information, archives, preferences and to unsubscribe:
> > http://openarchives.org/mailman/listinfo/oai-implementers
> >
>
> -- 
> =====================================================================
> hussein suleman ~ hussein at cs.uct.ac.za ~ http://www.husseinsspace.com
> =====================================================================
>
>