[OAI-implementers] Part II: Proposed corrections/fixes to OAI-PMH protocol document and schema

Simeon Warner simeon at cs.cornell.edu
Tue Sep 21 09:29:48 EDT 2004


On Tue, 21 Sep 2004, Tim Brody wrote:
> I would remove the protocol definition of the "magic colon" hierarchy in
> Sets, and make the Prefix and Set arguments anyString.

Removing the colon would change the protocol semantics. I think this is
out of scope at the moment.

Cheers,
Simeon

> The use of structured data in request arguments is confusing and
> unnecessary.
>
> When/if an official SOAP definition of OAI is released I would recommend
> replacing Prefix with the schema URL. There also needs to be a solution to
> the record moving out of set problem ...
>
> All the best,
> Tim.
>
> ----- Original Message -----
> From: "Hussein Suleman" <hussein at cs.uct.ac.za>
> To: <oai-implementers at oaisrv.nsdl.cornell.edu>
> Cc: "Simeon Warner" <simeon at cs.cornell.edu>
> Sent: Monday, September 20, 2004 5:13 PM
> Subject: Re: [OAI-implementers] Part II: Proposed corrections/fixes to
> OAI-PMH protocol document and schema
>
>
> > hi Simeon (et al)
> >
> > to follow on, i agree that we will always need to escape ":" because of
> > PMH semantics.
> >
> > the clean solution is to propose the use of a special OAI escape
> > character, say "!". then, we could use the forward mapping:
> >    : -> !:
> >    ! -> !!
> > then, specify that setSpecs and mdps are simply unrestricted Unicode,
> > with service providers having to apply URL-encoding when submitting
> > requests involving setSpecs and mdps, and data providers having to apply
> > XML encoding when returning such information (with reverse
> > transformation as needed). there are a few other issues here - like
> > Unicode use in URLs, but lets punt on that for now ...
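A minimal sketch of the "!" escape mapping described above, in Python. The function names are hypothetical, and treating a trailing lone "!" as a literal character is an assumption, since the message does not say how malformed input should be handled.

```python
ESC = "!"  # escape character suggested in the message
SEP = ":"  # OAI-PMH set hierarchy separator


def escape_element(native):
    """Apply the forward mapping: '!' -> '!!' and ':' -> '!:'."""
    # Escape the escape character first so the '!' added for ':' is not doubled.
    return native.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP)


def join_set_spec(elements):
    """Build a setSpec from native element names, escaping each one."""
    return SEP.join(escape_element(e) for e in elements)


def split_set_spec(set_spec):
    """Split on unescaped ':' and undo the mapping in each element."""
    elements, current, i = [], [], 0
    while i < len(set_spec):
        ch = set_spec[i]
        if ch == ESC and i + 1 < len(set_spec):
            # Escaped character: take the next character literally.
            current.append(set_spec[i + 1])
            i += 2
        elif ch == SEP:
            # Unescaped colon: hierarchy boundary.
            elements.append("".join(current))
            current = []
            i += 1
        else:
            # Ordinary character (or a trailing lone '!', kept as-is by assumption).
            current.append(ch)
            i += 1
    elements.append("".join(current))
    return elements
```

With this sketch, join_set_spec(["physics:hep", "wow!"]) gives "physics!:hep:wow!!", and split_set_spec reverses it. URL-encoding for requests and XML encoding for responses would still be applied on top, as the message says.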
> >
> > now, i know this proposes to change semantics - i believe we are already
> > on the slippery slope of trying to patch things up by introducing more
> > complexity and greater reliance on basic HTTP.
> >
> > practically, in the short term, i support option 3, to tackle only issue
> > A and not issue B. in the long term, maybe when we consider SOAP, we
> > really should clean up the data model.
> >
> > ttfn,
> > ----hussein
> >
> >
> > Simeon Warner wrote:
> >
> > > I'd like to solicit further comment regarding issues 1 and 2 of the set of
> > > proposed corrections and fixes to the OAI-PMH protocol document and schema
> > > that I sent back in June (copied below, alternatively see:
> > > http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
> > > These are really the same issue repeated for both setSpec and
> > > metadataPrefix. Both cases involve the same two parts which I describe
> > > below: part A I assume is not controversial; part B Hussein commented on.
> > > A lack of other comments presumably indicates lack of other objections but
> > > I'd like to confirm that since the proposal will involve minor changes in
> > > some implementations.
> > >
> > >
> > > A) The values of setSpec and metadataPrefix permitted by the protocol
> > > document and by the schema simply do not agree. This should be corrected.
> > >
> > > The meaning of the current wording "any characters that are safe in a
> > > query component of a URI" is unclear and cannot be construed to agree with
> > > the schema.  I suggest the simplest way to clarify and fix this is to
> > > rephrase as "a string consisting of any valid URI 'unreserved' characters"
> > > which would give the following changes in allowed values (both of which
> > > add ~ and disallow $ and + ):
> > >
> > > setSpec from:
> > > <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> > > to:
> > > <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"/>
> > >
> > > metadataPrefix from:
> > > <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> > > to:
> > > <pattern value="[A-Za-z0-9\-_\.!~\*'\(\)]+"/>
> > >
> > > The setSpec pattern is more complicated because elements are separated by
> > > colons [:].
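To make the effect of part A concrete, here is a small Python sketch that compares the current and proposed setSpec patterns on a few values. The patterns are copied from the message; the helper name and sample values are illustrative only. XML Schema patterns must match the whole value, so the sketch uses fullmatch.

```python
import re

# Current schema pattern for setSpec (allows $ and +, but not ~).
OLD_SET_SPEC = re.compile(
    r"([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"
)

# Proposed pattern from part A (URI "unreserved" characters only).
NEW_SET_SPEC = re.compile(
    r"([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"
)


def allowed(pattern, value):
    """Return True if the whole value matches, as an XML Schema pattern requires."""
    return pattern.fullmatch(value) is not None


# Print how each sample value fares under the old and new patterns.
for value in ["physics:hep", "user~home", "a+b", "cost$", "a::b"]:
    print(value, allowed(OLD_SET_SPEC, value), allowed(NEW_SET_SPEC, value))
```

Values such as "physics:hep" are accepted by both patterns, "user~home" only by the new one, and "a+b" and "cost$" only by the old one. An empty element, as in "a::b", is rejected by both.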
> > >
> > >
> > > B) There should be some standard way to permit straightforward use,
> > > perhaps via escaping, of setSpec and metadataPrefix values native to
> > > repositories.
> > >
> > > The suggestion is to permit URI "escaped" characters (%xx where xx are two
> > > hex digits). I note that a number of repositories have already adopted
> > > encoding using hex but that in most cases the escape character is simply
> > > omitted; in a few cases another escape character has been chosen (e.g. *)
> > > because % is not permitted. The fact that implementers are already doing
> > > this demonstrates a desire to encode values native to other systems.
> > > Permitting URI "escaped" characters is a simple way to standardize this
> > > using a well-known escaping mechanism without significantly increasing
> > > complexity.
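As a rough illustration of part B, a repository could derive a setSpec element from a native name with standard URI escaping. This Python sketch uses urllib.parse.quote and keeps the URI "mark" characters unescaped. The function names are hypothetical, and the use of UTF-8 for non-ASCII characters is quote's default rather than anything the proposal specifies.

```python
from urllib.parse import quote, unquote

# URI "mark" characters; together with letters and digits these are "unreserved".
MARKS = "-_.!~*'()"


def encode_native(value):
    """Percent-encode everything outside the URI 'unreserved' set."""
    return quote(value, safe=MARKS)


def decode_native(element):
    """Reverse the encoding to recover the repository's native value."""
    return unquote(element)


print(encode_native("Physics & Astronomy"))  # Physics%20%26%20Astronomy
print(encode_native("a:b"))                  # a%3Ab
```

Because a colon inside a native name becomes %3A, it can no longer be confused with the colon that separates levels of the set hierarchy.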
> > >
> > > Alternatives include:
> > >
> > > 1) Use another escaping mechanism. Another obvious choice would be to use
> > > XML numeric entities (e.g. '&#58;' (decimal) or '&#x3A;' (hex) for a
> > > colon).  These entities would themselves have to be escaped in
> > > XML responses (otherwise you have alternative 2) so responses might
> > > include XML of the form <setSpec>&amp;#x3A;</setSpec> to encode a setSpec
> > > which is internally a colon [:]. One might also want to restrict to
> > > just-decimal or just-hex to reduce complexity. It seems to me that one
> > > ends up with a complex set of restrictions on XML entity encoding which
> > > largely negate any benefit of adopting that standard. Perhaps there is
> > > another good option?
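The double escaping in alternative 1 can be seen in a short Python sketch using only the standard library. The variable names are illustrative; the point is that the entity text must be escaped in the response, otherwise an XML parser resolves it to a bare colon.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Under alternative 1, the setSpec *value* is the literal text "&#x3A;".
encoded_set_spec = "&#x3A;"

# In an XML response the ampersand itself must be escaped.
xml_fragment = "<setSpec>" + escape(encoded_set_spec) + "</setSpec>"
print(xml_fragment)  # <setSpec>&amp;#x3A;</setSpec>

# A harvester parsing the response gets the encoded value back intact.
parsed_escaped = ET.fromstring(xml_fragment).text

# Without the extra escaping the parser resolves the entity to ":",
# which is alternative 2 (a raw colon in the value).
parsed_unescaped = ET.fromstring("<setSpec>&#x3A;</setSpec>").text
```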
> > >
> > > 2) Permit a much larger character set in the first place (the limit being
> > > "anything" - the XML schema "string" type). I see three issues with this.
> > > First, when OAI-PMH was first designed we decided on a limited character
> > > set to make implementation easier; I think this still has some merit.
> > > Second, in the setSpec there will always be a potential need to escape a
> > > colon [:], since that has special meaning in OAI-PMH (which may not
> > > correspond to use in values native to a repository). Third, this would be
> > > a significant change requiring updates to most harvesting software.
> > > Significant extension of the character set is beyond the scope of the
> > > present proposal.
> > >
> > > 3) Do not include a standard way to permit the use of setSpec and
> > > metadataPrefix values native to repositories (simply make the protocol
> > > document and schema agree as described in A).
> > >
> > > Note that this issue is quite separate from URL-encoding of OAI requests
> > > made over HTTP. Characters used in any escaping mechanism for setSpec and
> > > metadataPrefix may need to be further escaped when used in URLs.
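The two layers can be kept apart in code. In this Python sketch the setSpec already contains a %3A escape as part of its value, and the HTTP layer then URL-encodes the whole argument again. The base URL and the sample setSpec are hypothetical.

```python
from urllib.parse import parse_qs, urlencode

BASE_URL = "http://example.org/oai"  # hypothetical repository base URL

# A setSpec as it appears in responses: the first element contains an escaped
# colon (%3A), the unescaped colon separates the two hierarchy levels.
set_spec = "physics%3Ahep:theory"

# HTTP layer: the "%" of the setSpec escape is itself encoded as %25.
query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec})
request_url = BASE_URL + "?" + query
print(request_url)

# The data provider undoes only the HTTP layer and sees the setSpec unchanged.
received = parse_qs(query)["set"][0]
```

The query string carries set=physics%253Ahep%3Atheory, and decoding it once yields the original setSpec with its own %3A escape still in place.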
> > >
> > > On Mon, 21 Jun 2004, Hussein Suleman wrote:
> > > ...
> > >
> > > >>1/2: i have some reservations about us requiring URL-encoding within
> > > >>XML. this mixes syntax with intended semantics of use and further
> > > >>entrenches the implicit support for URL-encoding, which is irrelevant
> > > >>if, for example, OAI-PMH makes the jump to using a SOAP request/response
> > > >>model. the model and abstractions must be clean and separable, they
> > > >>arent quite so already and i would prefer they didnt get more complicated.
> > >
> > >
> > > In response, I don't think the proposal was to _require_ URL-encoding. It
> > > was to allow it at a data-provider's choice; service providers should (in
> > > the absence of other information, e.g. oai_dc is special) treat both
> > > setSpec and metadataPrefix values as opaque tokens. OAI-PMH's special use
> > > of the colon means that this issue would not entirely go away even if
> > > OAI-PMH used an XML-clean transport such as SOAP, and we were no longer
> > > concerned about the burden on harvesters of permitting any string to be
> > > used.
> > >
> > >
> > > Ug, that got longer than I hoped...
> > >
> > > Cheers,
> > > Simeon
> > >
> > >
> > >
> > >>Simeon Warner wrote:
> > >>
> > >>>...
> > >>>PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
> > >>>--------------------------------------------------
> > >>>
> > >>>1) Correct protocol document and schema definition of setSpec to be
> > >>>consistent, and also to permit the use of URL encoding.
> > >>>
> > >>>1.1) Motivation
> > >>>
> > > >>>First, the protocol document and the schema simply do not agree. The use
> > > >>>of the wording "any characters that are safe in a query component of a
> > > >>>URI" is unclear and cannot be construed to agree with the schema. Second,
> > > >>>many repositories are using URL-like encoding to create setSpecs so it
> > > >>>seems better to permit the recognized URL encoding. The practical change
> > > >>>to meet both of these criteria is very small: the schema regular
> > > >>>expression should be changed to remove $ and +, and to add ~ and %xx (URL
> > > >>>encoding). This will bring the protocol document in line with the terms
> > > >>>"escaped" and "unreserved" as used in the URI RFC.
> > >>>
> > >>>1.2) Impact
> > >>>
> > > >>>The only conforming repository that we know of using setSpecs affected by
> > > >>>this change is Jeff Young's OpenURL repository
> > > >>>(http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
> > > >>>an encoding for space. Jeff agrees that a change would be sensible and
> > > >>>that he could replace '+' with '%20'. Repositories using URL-like
> > > >>>encodings will not be affected although they may choose to change to use
> > > >>>real URL encoding. All OAI software maintainers should, however, review
> > > >>>the change and update their parsing code accordingly.
> > >>>
> > >>>1.3) Changes
> > >>>
> > >>>1.3.1) Change wording in protocol document
> > >>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
> > >>>from:
> > >>>
> > > >>>a setSpec -- a colon [:] separated list indicating the path from the root
> > > >>>of the set hierarchy to the respective node.  Each element in the list is
> > > >>>a string consisting of any characters that are safe in a query component
> > > >>>of a URI, which must not contain any colons [:].  Since a setSpec forms
> > > >>>a unique identifier for the set within the repository, it must be unique
> > > >>>for each set.  Flat set organizations have only sets with setSpec that do
> > > >>>not contain any colons [:].
> > >>>
> > >>>to:
> > >>>
> > > >>>a setSpec -- a colon [:] separated list indicating the path from the root
> > > >>>of the set hierarchy to the respective node. Each element in the list is a
> > > >>>string consisting of any valid URI "unreserved" and "escaped" characters.
> > > >>>A setTag must not contain URI "reserved" characters, for example the colon
> > > >>>[:] which is used to delimit setTags. Since a setSpec forms a unique
> > > >>>identifier for the set within the repository, it must be unique for each
> > > >>>set. Flat set organizations have only sets with setSpec that do not
> > > >>>contain any colons [:].
> > >>>
> > > >>>The corresponding parts of the specification of allowed characters in URIs
> > > >>>are:
> > >>>
> > >>>unreserved    = alphanum | mark
> > >>>mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
> > >>>                "(" | ")"
> > >>>escaped       = "%" hex hex
> > >>>hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> > >>>                "a" | "b" | "c" | "d" | "e" | "f"
> > >>>
> > >>>
> > > >>>1.3.2) Change definition of setSpecType in the schema to match the
> > > >>>definition from:
> > > >>>
> > > >>>  <simpleType name="setSpecType">
> > > >>>    <restriction base="string">
> > > >>>      <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> > > >>>    </restriction>
> > > >>>  </simpleType>
> > >>>
> > >>>to:
> > >>>
> > >>>  <simpleType name="setSpecType">
> > >>>    <restriction base="string">
> > > >>>      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
> > >>>    </restriction>
> > >>>  </simpleType>
> > >>>
> > >>>
> > > >>>2) Correct protocol document and schema definition for metadataPrefix to
> > > >>>be consistent, and also to match the revised setSpec definition.
> > > >>>
> > > >>>2.1) Motivation
> > > >>>
> > > >>>The protocol document uses the same imprecise wording for metadataPrefix
> > > >>>as it does for setSpec ("any characters that are safe in a query
> > > >>>component of a URI") and the schema does not even follow a reasonable
> > > >>>interpretation of this wording. It seems sensible to use the same
> > > >>>character restrictions in a consistent fashion. This will bring the
> > > >>>protocol document in line with the terms "escaped" and "unreserved" as
> > > >>>used in the URI RFC.
> > >>>
> > >>>2.2) Impact
> > >>>
> > > >>>This change is not expected to impact any known repository.  All OAI
> > > >>>software maintainers should, however, review the change and update their
> > > >>>parsing code accordingly.
> > >>>
> > >>>2.3) Changes
> > >>>
> > > >>>2.3.1) Change wording in protocol document
> > > >>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
> > > >>>from:
> > >>>
> > >>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > >>>requests issued to the repository. metadataPrefix consists of any
> > >>>characters that are safe in a query component of a URI. metadataPrefix
> > >>>arguments are used in ListRecords, ListIdentifiers, and GetRecord
> > >>>requests to retrieve records, or the headers of records that include
> > >>>metadata in the format specified by the metadataPrefix;
> > >>>
> > >>>to:
> > >>>
> > > >>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > > >>>requests issued to the repository. metadataPrefix consists of any valid
> > > >>>URI "unreserved" and "escaped" characters. A metadataPrefix must not
> > > >>>contain URI "reserved" characters. metadataPrefix arguments are used in
> > > >>>ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
> > > >>>or the headers of records that include metadata in the format specified
> > > >>>by the metadataPrefix;
> > >>>
> > >>>2.3.2) Change definition of metadataPrefixType in schema to match the
> > >>>definition from:
> > >>>
> > >>>  <simpleType name="metadataPrefixType">
> > >>>    <restriction base="string">
> > >>>      <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> > >>>    </restriction>
> > >>>  </simpleType>
> > >>>
> > >>>to:
> > >>>
> > >>>  <simpleType name="metadataPrefixType">
> > >>>    <restriction base="string">
> > >>>      <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
> > >>>    </restriction>
> > >>>  </simpleType>
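For implementers updating parsing code, the following Python sketch validates values against the proposed setSpecType and metadataPrefixType patterns (including %xx escapes) and splits a setSpec into its decoded setTags. The regular expression is restructured with non-capturing groups but is intended to accept the same strings as the proposed schema patterns. The function name and sample values are hypothetical.

```python
import re
from urllib.parse import unquote

# One element: URI "unreserved" characters or "escaped" %xx sequences.
ELEMENT = r"(?:[A-Za-z0-9\-_\.!~\*'\(\)]|%[A-Fa-f0-9]{2})+"

# setSpec: one or more elements separated by colons.
SET_SPEC = re.compile(ELEMENT + r"(?::" + ELEMENT + r")*")

# metadataPrefix: a single element, no colons.
METADATA_PREFIX = re.compile(ELEMENT)


def native_set_path(set_spec):
    """Validate a setSpec, then return its setTags with %xx escapes decoded."""
    if SET_SPEC.fullmatch(set_spec) is None:
        raise ValueError("not a valid setSpec: " + repr(set_spec))
    # Split on the hierarchy separator first, then decode each setTag.
    return [unquote(tag) for tag in set_spec.split(":")]


print(native_set_path("physics%3Ahep:open%20access"))
```

Splitting before decoding matters: a %3A inside a setTag decodes to a colon that belongs to the native name and is not a hierarchy separator.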
> > >
> > >
> > >
> > > ----------------------------------------------------------
> > > Simeon Warner                 Email: simeon at cs.cornell.edu
> > > Cornell Information Science              Tel: 607-254-8605
> > > 301 College Ave                          Fax: 607-255-5196
> > > Ithaca, NY 14850-4623, USA
> > >
> > >
> > > _______________________________________________
> > > OAI-implementers mailing list
> > > List information, archives, preferences and to unsubscribe:
> > > http://openarchives.org/mailman/listinfo/oai-implementers
> > >
> >
> > --
> > =====================================================================
> > hussein suleman ~ hussein at cs.uct.ac.za ~ http://www.husseinsspace.com
> > =====================================================================
> >
> >
>
>


