[OAI-implementers] XML Schemas and Xerces again

herbert van de sompel herbertv@cs.cornell.edu
Thu, 26 Apr 2001 12:26:19 -0400


hi all,

"Thomas G. Habing" wrote:
> Regarding the 1.3.1 error, like Herbert we think Xerces is probably wrong
> here, but at the same time, we also wonder if the namespace='##any' is
> actually necessary in the XSD, since ##any is the default for the namespace
> attribute.  Does Xerces not give an error if the XSD leaves out the
> namespace='##any'?
> 

the use of "##any" is again related to inconsistencies between parsers.  what
we really wanted there was "##other" rather than "##any", but XML Spy
would not let us use "##other": it generated errors.  XSV did allow
"##other".  as a result, we chose the less stringent "##any".
eventually, I hope we can go back to "##other", which is
really what we want.
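
to make the difference between the two wildcard values concrete, here is a
minimal sketch.  it is an illustration under assumptions, not the OAI schema:
the namespace URIs, the class name WildcardDemo and its helpers are all
hypothetical, and it uses the javax.xml.validation API of a current JDK
rather than the Xerces 1.3.x, XML Spy or XSV releases discussed in this
thread, so it says nothing about how those tools behave.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class WildcardDemo {
    // Hypothetical target namespace, used only for this illustration.
    static final String TNS = "urn:example:container";

    // A <metadata> element whose single child lives in a foreign namespace.
    static final String FOREIGN_CHILD =
        "<metadata xmlns='" + TNS + "'><dc xmlns='urn:example:foreign'/></metadata>";

    // A <metadata> element whose single child lives in the target namespace.
    static final String SAME_NS_CHILD =
        "<metadata xmlns='" + TNS + "'><local/></metadata>";

    // Builds a tiny schema in which <metadata> holds exactly one wildcard child.
    static String schemaFor(String wildcardNamespace) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
             + " targetNamespace='" + TNS + "' elementFormDefault='qualified'>"
             + "<xs:element name='metadata'><xs:complexType><xs:sequence>"
             + "<xs:any namespace='" + wildcardNamespace + "' processContents='lax'/>"
             + "</xs:sequence></xs:complexType></xs:element>"
             + "</xs:schema>";
    }

    // Returns true if the instance validates against the schema built with
    // the given wildcard value, false if the validator reports an error.
    static boolean isValid(String wildcardNamespace, String instance) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(schemaFor(wildcardNamespace))));
        } catch (SAXException e) {
            // A failure here means the demo schema is broken, not the instance.
            throw new IllegalStateException("demo schema did not compile", e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(instance)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String ns : new String[] {"##any", "##other"}) {
            System.out.println(ns + " foreign child: " + isValid(ns, FOREIGN_CHILD));
            System.out.println(ns + " same-namespace child: " + isValid(ns, SAME_NS_CHILD));
        }
    }
}
```

the only case where the two values should differ is the child in the
schema's own target namespace: "##any" lets it through, "##other" does not.
that is the stricter behaviour we were after.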

> Regarding the 1.3.0 error, our guess is that Xerces is trying to do too much
> when validating uriReference string.  [Note, 'uriReference' has been changed
> to 'anyURI' in the latest version of the schema spec.]
> 

I would like to remind you that XML Spy did not validate an OAI
requestURL (which is an HTTP URL) as a uriReference, which is why we
moved to the type "string".  so, again, it means that we have to be
careful when interpreting error messages from current validators.  as
Thomas points out in detail, it seems that OAI identifiers are valid
URIs according to RFC2396 and RFC2732.
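
as a rough cross-check of that reading, the sketch below runs the identifier
from Jeff's error message through java.net.URI, whose documentation says it
follows RFC 2396 as amended by RFC 2732.  this is an assumption-laden
illustration only: java.net.URI is not part of Xerces and is not what any of
the validators in this thread use, and the class name OaiIdentifierCheck and
the describe() helper are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class OaiIdentifierCheck {
    // Returns "scheme|raw-scheme-specific-part" when the value parses as an
    // absolute, opaque URI; returns null when it does not parse at all or
    // parses as a hierarchical or relative URI.
    static String describe(String value) {
        try {
            URI uri = new URI(value);
            if (!uri.isAbsolute() || !uri.isOpaque()) {
                return null;
            }
            return uri.getScheme() + "|" + uri.getRawSchemeSpecificPart();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The identifier Xerces 1.3.0 reported as malformed.
        System.out.println(describe("oai:etdcat:ocm02999966"));
        // The variant with the second colon escaped, as Thomas and Tim suggest trying.
        System.out.println(describe("oai:etdcat%3Aocm02999966"));
    }
}
```

both forms parse as scheme "oai" followed by an opaque part, which matches
the scheme:opaque_part reading in the quoted message below.  it does not tell
us what Xerces itself does with the escaped form.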

herbert


> Quoting from the latest XML schema spec (datatype):
> 
> "NOTE: Each URI scheme imposes specialized syntax rules for URIs in that
> scheme, including restrictions on the syntax of allowed fragment
> identifiers. Because it is impractical for processors to check that a value
> is a context-appropriate URI reference, this specification follows the lead
> of [RFC 2396] (as amended by [RFC 2732]) in this matter: such rules and
> restrictions are not part of type validity and are not checked by minimally
> conforming processors. Thus in practice the above definition imposes only
> very modest obligations on minimally conforming processors."
> 
> From looking at RFC2396 and RFC2732, we're fairly certain that OAI
> Identifiers are valid URIs.  The second colon might be considered an issue
> by some, but looking at the RFC2396:
> 
>    ...An absolute URI contains the name of the scheme being used (<scheme>)
>    followed by a colon (":") and then a string (the <scheme-specific-
>    part>) whose interpretation depends on the scheme. The URI syntax does
>    not require that the scheme-specific-part have any general structure or
>    set of semantics which is common among all URI...
> 
> Colons are commonly considered reserved (i.e., needing to be escaped), but
> the RFC also says about reserved characters:
> 
>    The "reserved" syntax class above refers to those characters that are
>    allowed within a URI, but which may not be allowed within a
>    particular component of the generic URI syntax; they are used as
>    delimiters of the components described in Section 3.
>    Characters in the "reserved" set are not reserved in all contexts.
>    The set of characters actually reserved within any given URI
>    component is defined by that component.  In general, a character is
>    reserved if the semantics of the URI changes if the character is
>    replaced with its escaped US-ASCII encoding.
> 
> On this basis and looking at the "Collected BNF for URI" (Appendix A), OAI
> should be allowed to specify that colons used as delimiters in the
> scheme-specific part are allowed unescaped within that component (i.e., the
> "opaque" OAI scheme-specific-part) -- just as some URIs use an unescaped
> colon within the hostport component of their scheme-specific-part.  This
> means that the only thing a generic URI parser should be able to
> discern from an OAI URI is scheme:opaque_part.  (It might be interesting
> to see what would happen with Xerces if the second colon was escaped, such
> as oai:etdcat%3Aocm02999966 -- if Xerces no longer objected, then we would
> know that it was "over" validating the identifier element.)
> 
> Of course OAI might still want to consider eventually changing the OAI
> identifier scheme (scheme-specific-part of the URI) to something more
> similar to other net URIs, such as oai://etdcat/ocm02999966
> (scheme://registration_name/opaque_part), but we would hesitate to suggest
> that at this stage simply to accommodate what appears to be an overzealous
> parser.
> 
> Tim Cole
> Tom Habing
> University of Illinois at UC
> 
> "Young,Jeff" wrote:
> >
> > I'm happy to say that the status=deleted problem appears to be resolved.
> > Unfortunately, I now seem to have a different (unrelated) problem. Someone
> > reported to me that Xerces 1.3.1 is reporting an XML schema error where
> > 1.3.0 didn't. It seems that I had failed to call setErrorHandler() which is
> > key to reporting any validation errors. Xerces 1.3.0 let this slide where
> > 1.3.1 complains about it. Now that I've corrected this oversight, I'm now
> > seeing some parser errors related to the XML schema. I've attached another
> > small demo application that shows the effects. To add to the confusion,
> > 1.3.0 reports a different error than does 1.3.1.
> >
> > Using Xerces 1.3.0, the demo application produces:
> >
> > error
> > org.xml.sax.SAXParseException: Datatype error: In element 'identifier' : Value 'oai:etdcat:ocm02999966' is a Malformed URI .
> >         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1068)
> >         at org.apache.xerces.validators.common.XMLValidator.checkContent(XMLValidator.java:3609)
> >         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1133)
> >         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1201)
> >         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
> >         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:123)
> >         at Test.main(Test.java:34)
> >
> > Using Xerces 1.3.1, the demo produces:
> >
> > error
> > org.xml.sax.SAXParseException: The content of element type "metadata" must match "##any:uri=http://www.openarchives.org/OAI/1.0/OAI_ListRecords".
> >         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
> >         at org.apache.xerces.validators.common.XMLValidator.reportRecoverableXMLError(XMLValidator.java:1689)
> >         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1353)
> >         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1205)
> >         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
> >         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
> >         at Test.main(Test.java:34)
> >
> > As far as I can tell, the schema looks fine. My assumption, at this point, is
> > that Xerces is at fault and my only recourse is to turn off validation. I must
> > also admit the possibility that my program is flawed in some way. On the
> > slim chance that I've found the 2nd and 3rd XML schema errors within the
> > span of a week, though, I thought I'd pass along my findings.
> >
> >  <<Test.java>>
> > Cheers,
> >
> > Jeff
> >
> > ---
> > Jeffrey A. Young
> > Senior Consulting Systems Analyst
> > Office of Research, Mail Code 710
> > OCLC Online Computer Library Center, Inc.
> > 6565 Frantz Road
> > Dublin, OH   43017-3395
> > www.oclc.org
> >
> > Voice:  614-764-4342
> > Voice:  800-848-5878, ext. 4342
> > Fax:    614-718-7477
> > Email:  jyoung@oclc.org
> >
> 
> --
> Thomas G. Habing
> Research Programmer, Digital Library Initiative
> University of Illinois at Urbana-Champaign
> 052 Grainger Engineering Library, MC-274
> thabing@uiuc.edu, (217) 244-7809