[OAI-implementers] Perl regexp for validating 'identifier' (anyURI) needed

Simeon Warner simeon@cs.cornell.edu
Wed, 26 Feb 2003 10:49:03 -0500 (EST)

On Wed, 26 Feb 2003 marinb@gmx.net wrote:
> I am sure somebody has already written/found a reasonable good perl regexp
> for validating the identifier parameter. I only could find one for decoding
> m|^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|
> but it is not suitable for validating as no check is made for allowed
> characters within each 'fragment'. There must be a better solution
> instead of extracting the fragments and validating each of them
> separately?

I don't know whether you should take the following as an admission or as a
suggestion. The pattern you give above pretty closely matches that given
in http://www.ietf.org/rfc/rfc2396.txt (appendix B) as a match for generic
URI syntax. I don't see why you can't add further validation for allowed
characters, although it will make the match rather unwieldy.  However, if
you are creating a repository (as opposed to a service that automatically
harvests and re-exports records), then from a practical point of view it
isn't essential to validate all possible URIs (even the XML Schema docs
point out issues with this, see
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI). What is
essential is to do enough validation such that all identifiers you use are
permitted and the resulting "validated" identifiers are safe to use as
keys for lookup internally. This is acceptable because the OAI
specification requires only that you report at least one error for an
illegal request -- for an invalid identifier it might reasonably be
badArgument and/or idDoesNotExist (see:  

If you are creating a service that automatically harvests and re-exports
records then incoming records must be carefully validated to avoid
re-exporting bad data.
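
For the repository case, a rough sketch of what "enough validation" might
look like in Perl is below. It is an illustration and not a full anyURI
validator: it only checks for a scheme followed by characters from the
RFC 2396 reserved and unreserved sets plus well-formed %HH escapes. The
name is_acceptable_identifier and the sample identifiers are made up for
the example, and '#' is deliberately left out on the assumption that your
own identifiers never carry a fragment.

```perl
use strict;
use warnings;

# RFC 2396 reserved + unreserved characters, plus '%' to introduce escapes.
# Coarse on purpose: no per-component rules are enforced.
my $uri_char = qr{[A-Za-z0-9;/?:\@&=+\$,_.!~*'()%-]};

# Hypothetical helper: true if $id looks like "scheme:rest" and uses only
# permitted characters, so it is safe to use as an internal lookup key.
sub is_acceptable_identifier {
    my ($id) = @_;
    return 0 unless defined $id && length $id;
    # scheme = alpha followed by alpha / digit / "+" / "-" / "."
    return 0 unless $id =~ /\A[A-Za-z][A-Za-z0-9+.-]*:${uri_char}+\z/;
    # every '%' must start a two-digit hex escape
    return 0 if $id =~ /%(?![0-9A-Fa-f]{2})/;
    return 1;
}

for my $id ('oai:arXiv.org:quant-ph/9901001', 'oai:example.org:has space') {
    printf "%-35s %s\n", $id,
        is_acceptable_identifier($id) ? 'accepted' : 'rejected';
}
```

Anything rejected here can be answered with badArgument, and anything
accepted but not found in your store with idDoesNotExist.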

> Can anybody also tell me where is the problem with following request?
> Response to this request did not give error code 'badArgument':
> verb=ListRecords&metadataPrefix=oai_dc&resumptionToken=junk&until=1990-01-10

As Donna points out, this request is certainly bad and should give at
least one error code (the most obvious being badArgument). However, since
the specification allows servers to respond with any appropriate error
element it could reasonably give badResumptionToken if it doesn't recognize
the resumptionToken 'junk'.
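
To make that concrete, here is a minimal Perl sketch of the argument check
for ListRecords. It assumes, as in OAI-PMH 2.0, that resumptionToken is an
exclusive argument (only verb may accompany it). The function name
list_records_errors and the %known_tokens table are hypothetical stand-ins
for however your server parses requests and tracks the tokens it has issued.

```perl
use strict;
use warnings;

# Hypothetical table of resumption tokens this server has issued.
my %known_tokens = ();

# Returns the list of OAI error codes for a ListRecords request
# (an empty list if the arguments look fine).
sub list_records_errors {
    my (%arg) = @_;
    my @errors;
    if (exists $arg{resumptionToken}) {
        # Exclusive argument: anything besides verb alongside it is illegal.
        my @extra = grep { $_ ne 'verb' && $_ ne 'resumptionToken' } keys %arg;
        push @errors, 'badArgument' if @extra;
        push @errors, 'badResumptionToken'
            unless exists $known_tokens{ $arg{resumptionToken} };
    }
    elsif (!exists $arg{metadataPrefix}) {
        # Without a token, metadataPrefix is required.
        push @errors, 'badArgument';
    }
    return @errors;
}

my @errors = list_records_errors(
    'verb'            => 'ListRecords',
    'metadataPrefix'  => 'oai_dc',
    'resumptionToken' => 'junk',
    'until'           => '1990-01-10',
);
print join(', ', @errors), "\n";   # prints "badArgument, badResumptionToken"
```

Reporting either code alone would still satisfy the "at least one error"
requirement; the sketch simply reports both.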

> Would appreciate very much any help,
> Cheers,
> Marin