[OAI-implementers] XML Schemas and Xerces again

Thomas G. Habing thabing@uiuc.edu
Wed, 25 Apr 2001 13:21:28 -0500


Jeff-

Regarding the 1.3.1 error: like Herbert, we think Xerces is probably wrong
here.  At the same time, we wonder whether namespace='##any' is actually
necessary in the XSD, since ##any is the default value of the namespace
attribute.  Does the error go away if the XSD leaves out
namespace='##any'?

Regarding the 1.3.0 error, our guess is that Xerces is trying to do too much
when validating a uriReference string.  [Note: 'uriReference' has been
renamed to 'anyURI' in the latest version of the schema spec.]

Quoting from the latest XML Schema spec (Part 2: Datatypes):

"NOTE: Each URI scheme imposes specialized syntax rules for URIs in that
scheme, including restrictions on the syntax of allowed fragment
identifiers. Because it is impractical for processors to check that a value
is a context-appropriate URI reference, this specification follows the lead
of [RFC 2396] (as amended by [RFC 2732]) in this matter: such rules and
restrictions are not part of type validity and are not checked by minimally
conforming processors. Thus in practice the above definition imposes only
very modest obligations on minimally conforming processors."

From looking at RFC 2396 and RFC 2732, we're fairly certain that OAI
identifiers are valid URIs.  The second colon might be considered an issue
by some, but RFC 2396 says:

   ...An absolute URI contains the name of the scheme being used (<scheme>)
   followed by a colon (":") and then a string (the <scheme-specific-
   part>) whose interpretation depends on the scheme. The URI syntax does 
   not require that the scheme-specific-part have any general structure or
   set of semantics which is common among all URI...

Colons are commonly considered reserved (i.e., needing to be escaped), but
the RFC also says about reserved characters:

   The "reserved" syntax class above refers to those characters that are
   allowed within a URI, but which may not be allowed within a
   particular component of the generic URI syntax; they are used as
   delimiters of the components described in Section 3.
   Characters in the "reserved" set are not reserved in all contexts.
   The set of characters actually reserved within any given URI
   component is defined by that component.  In general, a character is
   reserved if the semantics of the URI changes if the character is
   replaced with its escaped US-ASCII encoding.

On this basis, and looking at the "Collected BNF for URI" (Appendix A), OAI
should be allowed to specify that colons used as delimiters in the
scheme-specific part may appear unescaped within that component (i.e., the
"opaque" OAI scheme-specific-part), just as some URIs use an unescaped
colon within the hostport component of their scheme-specific-part.  This
means that the only thing a generic URI parser should be able to discern
from an OAI URI is scheme:opaque_part.  (It might be interesting to see
what would happen with Xerces if the second colon were escaped, as in
oai:etdcat%3Aocm02999966.  If Xerces no longer objected, we would know
that it was "over" validating the identifier element.)
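
The scheme:opaque_part reading can be checked against any generic RFC 2396
parser.  What follows is only a minimal sketch, assuming java.net.URI as a
stand-in for such a parser (it is a later JDK class, not part of Xerces, and
says nothing about what Xerces does internally).  The class name OaiUriCheck
and the describe() helper are hypothetical names chosen for illustration.

```java
import java.net.URI;

class OaiUriCheck {

    // Summarize what a generic RFC 2396 parser can discern from an identifier.
    // URI.create throws IllegalArgumentException if the generic syntax is violated.
    static String describe(String identifier) {
        URI uri = URI.create(identifier);
        return "scheme=" + uri.getScheme()
             + " opaque=" + uri.isOpaque()
             + " rawPart=" + uri.getRawSchemeSpecificPart();
    }

    public static void main(String[] args) {
        // Second colon left unescaped, as in the identifier from the error report.
        System.out.println(describe("oai:etdcat:ocm02999966"));
        // Second colon escaped, as in the experiment proposed above.
        System.out.println(describe("oai:etdcat%3Aocm02999966"));
    }
}
```

Under this assumption both forms parse as opaque URIs with scheme "oai";
the parser sees nothing inside the scheme-specific part beyond the raw
string itself.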

Of course OAI might still want to consider eventually changing the OAI
identifier scheme (scheme-specific-part of the URI) to something more
similar to other net URIs, such as oai://etdcat/ocm02999966
(scheme://registration_name/opaque_part), but we would hesitate to suggest
that at this stage simply to accommodate what appears to be an overzealous
parser.
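
For comparison, here is a rough sketch of how a generic parser would see the
hierarchical form mentioned above.  It again assumes java.net.URI as the
generic parser; OaiHierarchicalCheck and components() are hypothetical
names, and the oai:// form itself is only the possibility floated in this
message, not an existing OAI convention.

```java
import java.net.URI;

class OaiHierarchicalCheck {

    // Split a hierarchical identifier into its generic components:
    // scheme, authority (the registration name), and path (the opaque part).
    static String[] components(String identifier) {
        URI uri = URI.create(identifier);
        return new String[] { uri.getScheme(), uri.getAuthority(), uri.getPath() };
    }

    public static void main(String[] args) {
        String[] parts = components("oai://etdcat/ocm02999966");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = " + parts[2]);
    }
}
```

In this form the registration name and the local identifier become visible
to a generic parser as separate components, which is the structural
difference from the current opaque scheme.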

Tim Cole
Tom Habing
University of Illinois at UC

"Young,Jeff" wrote:
> 
> I'm happy to say that the status=deleted problem appears to be resolved.
> Unfortunately, I now seem to have a different (unrelated) problem. Someone
> reported to me that Xerces 1.3.1 is reporting an XML schema error where
> 1.3.0 didn't. It seems that I had failed to call setErrorHandler() which is
> key to reporting any validation errors. Xerces 1.3.0 let this slide where
> 1.3.1 complains about it. Now that I've corrected this oversight, I'm
> seeing some parser errors related to the XML schema. I've attached another
> small demo application that shows the effects. To add to the confusion,
> 1.3.0 reports a different error than does 1.3.1.
> 
> Using Xerces 1.3.0, the demo application produces:
> 
> error
> org.xml.sax.SAXParseException: Datatype error: In element 'identifier' : Value 'oai:etdcat:ocm02999966' is a Malformed URI .
>         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1068)
>         at org.apache.xerces.validators.common.XMLValidator.checkContent(XMLValidator.java:3609)
>         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1133)
>         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1201)
>         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
>         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:123)
>         at Test.main(Test.java:34)
> 
> Using Xerces 1.3.1, the demo produces:
> 
> error
> org.xml.sax.SAXParseException: The content of element type "metadata" must match "##any:uri=http://www.openarchives.org/OAI/1.0/OAI_ListRecords".
>         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
>         at org.apache.xerces.validators.common.XMLValidator.reportRecoverableXMLError(XMLValidator.java:1689)
>         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1353)
>         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1205)
>         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
>         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
>         at Test.main(Test.java:34)
> 
> As far as I can tell, the schema looks fine. My assumption, at this point, is
> that Xerces is at fault and my only recourse is to turn off validation. I must
> also admit the possibility that my program is flawed in some way. On the
> slim chance that I've found the 2nd and 3rd XML schema errors within the
> span of a week, though, I thought I'd pass along my findings.
> 
>  <<Test.java>>
> Cheers,
> 
> Jeff
> 
> ---
> Jeffrey A. Young
> Senior Consulting Systems Analyst
> Office of Research, Mail Code 710
> OCLC Online Computer Library Center, Inc.
> 6565 Frantz Road
> Dublin, OH   43017-3395
> www.oclc.org
> 
> Voice:  614-764-4342
> Voice:  800-848-5878, ext. 4342
> Fax:    614-718-7477
> Email:  jyoung@oclc.org
> 

-- 
Thomas G. Habing
Research Programmer, Digital Library Initiative
University of Illinois at Urbana-Champaign
052 Grainger Engineering Library, MC-274
thabing@uiuc.edu, (217) 244-7809