[OAI-implementers] XML Schemas and Xerces again

Young,Jeff jyoung@oclc.org
Wed, 25 Apr 2001 16:15:36 -0400


Thomas,

Removing namespace='##any' has no effect against 1.3.1. The complaint
remains the same. 

I am seeing some odd behavior, though, that may or may not be revealing to
someone. The label '##any' appears to be used ambiguously. Here's the error
message:

'org.xml.sax.SAXParseException: The content of element type "metadata" must
match "##any:uri=http://www.openarchives.org/OAI/1.0/OAI_ListRecords"'

It appears that the '##any' in the error message comes from the <any>
element itself, not from the namespace="##any" attribute: the message
contains '##any' regardless of the presence, absence, or value of the
namespace attribute. The namespace attribute instead appears to map to the
uri. In other words, namespace='##any' or the absence of the namespace
attribute produces
'uri=http://www.openarchives.org/OAI/1.0/OAI_ListRecords' in the error
message. If I change the namespace to something else, the uri changes
accordingly. I suspect this is another indication of Xerces' failings as a
schema validator.
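
For comparison, here is a minimal sketch of the wildcard behavior the spec
describes. It is written against the javax.xml.validation API found in later
JDKs, not the Xerces 1.3.x parser discussed here, and it uses the final
2001 schema namespace. The class name AnyWildcardDemo and the example.org
namespaces are assumptions made up for illustration; the OAI schemas are not
reproduced. It checks whether a validator treats namespace='##any' and an
omitted namespace attribute the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class AnyWildcardDemo {
    // Hypothetical namespace, used only for this illustration.
    static final String NS = "http://example.org/demo";

    // A "metadata" element whose single child lives in a different namespace.
    static final String INSTANCE =
        "<metadata xmlns='" + NS + "'>"
        + "<dc xmlns='http://example.org/other'>x</dc>"
        + "</metadata>";

    // Builds a tiny schema in which "metadata" holds exactly one wildcard
    // child. wildcardAttrs is spliced into the <xs:any> start tag.
    static String schema(String wildcardAttrs) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
            + " targetNamespace='" + NS + "' elementFormDefault='qualified'>"
            + "<xs:element name='metadata'><xs:complexType><xs:sequence>"
            + "<xs:any " + wildcardAttrs + " processContents='lax'/>"
            + "</xs:sequence></xs:complexType></xs:element>"
            + "</xs:schema>";
    }

    // Returns true if instanceText validates against schemaText. With no
    // ErrorHandler installed, the Validator throws on the first error.
    static boolean isValid(String schemaText, String instanceText) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(
                new StreamSource(new StringReader(schemaText)));
            schema.newValidator().validate(
                new StreamSource(new StringReader(instanceText)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("explicit ##any:    "
            + isValid(schema("namespace='##any'"), INSTANCE));
        System.out.println("attribute omitted: "
            + isValid(schema(""), INSTANCE));
        System.out.println("##targetNamespace: "
            + isValid(schema("namespace='##targetNamespace'"), INSTANCE));
    }
}
```

If the first two results agree and only the third differs, the namespace
attribute is being honored and '##any' is acting as the default, which is
what one would hope to see from Xerces as well.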

Regarding 1.3.0, escaping the second colon (or both, for that matter) has no
effect. I checked, though, and 1.3.0 is completely satisfied by your
suggestion of oai://etdcat/ocm02999966. On top of that, 1.3.0 doesn't
complain about the ##any problem and validates correctly with the
hypothetical identifier.
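
As a cross-check on the three identifier forms in this thread, the sketch
below runs them through java.net.URI, a generic RFC 2396 parser that ships
with later JDKs. It is not the check Xerces performs, and the class name
OaiUriDemo and the describe() helper are made up for illustration. A generic
parser of this kind should see only scheme plus opaque part for the
colon-delimited forms, and a hierarchical URI for the oai:// form.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class OaiUriDemo {
    // Summarizes how a generic URI parser splits the given text:
    // scheme, opaque vs. hierarchical, and the raw scheme-specific part.
    static String describe(String text) {
        try {
            URI uri = new URI(text);
            String kind = uri.isOpaque() ? "opaque" : "hierarchical";
            return uri.getScheme() + " | " + kind + " | "
                + uri.getRawSchemeSpecificPart();
        } catch (URISyntaxException e) {
            // Reached only if the generic syntax itself is violated.
            return "rejected: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "oai:etdcat:ocm02999966",    // current OAI identifier form
            "oai:etdcat%3Aocm02999966",  // second colon escaped
            "oai://etdcat/ocm02999966"   // the suggested alternative
        };
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + describe(candidate));
        }
    }
}
```

None of the three should be rejected by a parser that stops at the generic
syntax, which is consistent with the view that 1.3.0 is checking more than
the spec requires.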

Jeff


> -----Original Message-----
> From: Thomas G. Habing [mailto:thabing@uiuc.edu]
> Sent: Wednesday, April 25, 2001 2:21 PM
> To: OAI-implementers (E-mail)
> Subject: Re: [OAI-implementers] XML Schemas and Xerces again
> 
> 
> Jeff-
> 
> Regarding the 1.3.1 error, like Herbert we think Xerces is probably wrong
> here, but at the same time, we also wonder if the namespace='##any' is
> actually necessary in the XSD, since ##any is the default for the
> namespace attribute.  Does Xerces not give an error if the XSD leaves out
> the namespace='##any'?
> 
> Regarding the 1.3.0 error, our guess is that Xerces is trying to do too
> much when validating the uriReference string.  [Note, 'uriReference' has
> been changed to 'anyURI' in the latest version of the schema spec.]
> 
> Quoting from the latest XML schema spec (datatype):
> 
> "NOTE: Each URI scheme imposes specialized syntax rules for URIs in that
> scheme, including restrictions on the syntax of allowed fragment
> identifiers. Because it is impractical for processors to check that a
> value is a context-appropriate URI reference, this specification follows
> the lead of [RFC 2396] (as amended by [RFC 2732]) in this matter: such
> rules and restrictions are not part of type validity and are not checked
> by minimally conforming processors. Thus in practice the above definition
> imposes only very modest obligations on minimally conforming processors."
> 
> From looking at RFC2396 and RFC2732, we're fairly certain that OAI
> Identifiers are valid URIs.  The second colon might be considered an issue
> by some, but looking at RFC2396:
> 
>    ...An absolute URI contains the name of the scheme being used
>    (<scheme>) followed by a colon (":") and then a string (the
>    <scheme-specific-part>) whose interpretation depends on the scheme.
>    The URI syntax does not require that the scheme-specific-part have
>    any general structure or set of semantics which is common among all
>    URI...
> 
> Colons are commonly considered reserved (i.e., needing to be escaped), but
> the RFC also says about reserved characters:
> 
>    The "reserved" syntax class above refers to those characters that are
>    allowed within a URI, but which may not be allowed within a
>    particular component of the generic URI syntax; they are used as
>    delimiters of the components described in Section 3.
>    Characters in the "reserved" set are not reserved in all contexts.
>    The set of characters actually reserved within any given URI
>    component is defined by that component.  In general, a character is
>    reserved if the semantics of the URI changes if the character is
>    replaced with its escaped US-ASCII encoding.
> 
> On this basis and looking at the "Collected BNF for URI" (Appendix A), OAI
> should be allowed to specify that colons used as delimiters in the
> scheme-specific part are allowed unescaped within that component (i.e.,
> the "opaque" OAI scheme-specific-part) -- just as some URIs use an
> unescaped colon within the hostport component of their
> scheme-specific-part.  This means that the only thing a generic URI
> parser should be able to discern from an OAI URI is scheme:opaque_part.
> (It might be interesting to see what would happen with Xerces if the
> second colon were escaped, such as oai:etdcat%3Aocm02999966 -- if Xerces
> no longer objected, then we would know that it was "over" validating the
> identifier element.)
> 
> Of course OAI might still want to consider eventually changing the OAI
> identifier scheme (scheme-specific-part of the URI) to something more
> similar to other net URIs, such as oai://etdcat/ocm02999966
> (scheme://registration_name/opaque_part), but we would hesitate to suggest
> that at this stage simply to accommodate what appears to be an overzealous
> parser.
> 
> Tim Cole
> Tom Habing
> University of Illinois at UC
> 
> "Young,Jeff" wrote:
> > 
> > I'm happy to say that the status=deleted problem appears to be resolved.
> > Unfortunately, I now seem to have a different (unrelated) problem.
> > Someone reported to me that Xerces 1.3.1 is reporting an XML schema
> > error where 1.3.0 didn't. It seems that I had failed to call
> > setErrorHandler(), which is key to reporting any validation errors.
> > Xerces 1.3.0 let this slide where 1.3.1 complains about it. Now that
> > I've corrected this oversight, I'm seeing some parser errors related to
> > the XML schema. I've attached another small demo application that shows
> > the effects. To add to the confusion, 1.3.0 reports a different error
> > than does 1.3.1.
> > 
> > Using Xerces 1.3.0, the demo application produces:
> > 
> > error
> > org.xml.sax.SAXParseException: Datatype error: In element 'identifier' :
> > Value 'oai:etdcat:ocm02999966' is a Malformed URI .
> >         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1068)
> >         at org.apache.xerces.validators.common.XMLValidator.checkContent(XMLValidator.java:3609)
> >         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1133)
> >         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1201)
> >         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
> >         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:123)
> >         at Test.main(Test.java:34)
> > 
> > Using Xerces 1.3.1, the demo produces:
> > 
> > error
> > org.xml.sax.SAXParseException: The content of element type "metadata" must
> > match "##any:uri=http://www.openarchives.org/OAI/1.0/OAI_ListRecords".
> >         at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1067)
> >         at org.apache.xerces.validators.common.XMLValidator.reportRecoverableXMLError(XMLValidator.java:1689)
> >         at org.apache.xerces.validators.common.XMLValidator.callEndElement(XMLValidator.java:1353)
> >         at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1205)
> >         at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
> >         at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:952)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:172)
> >         at Test.main(Test.java:34)
> > 
> > As far as I can tell, the schema looks fine. My assumption, at this
> > point, is that Xerces is at fault and my only recourse is to turn off
> > validation. I must also admit the possibility that my program is flawed
> > in some way. On the slim chance that I've found the 2nd and 3rd XML
> > schema errors within the span of a week, though, I thought I'd pass
> > along my findings.
> > 
> >  <<Test.java>>
> > Cheers,
> > 
> > Jeff
> > 
> > ---
> > Jeffrey A. Young
> > Senior Consulting Systems Analyst
> > Office of Research, Mail Code 710
> > OCLC Online Computer Library Center, Inc.
> > 6565 Frantz Road
> > Dublin, OH   43017-3395
> > www.oclc.org
> > 
> > Voice:  614-764-4342
> > Voice:  800-848-5878, ext. 4342
> > Fax:    614-718-7477
> > Email:  jyoung@oclc.org
> > 
> > --------------------------------------------------------------------------
> >                 Name: Test.java
> >    Test.java    Type: java/*
> >             Encoding: quoted-printable
> 
> -- 
> Thomas G. Habing
> Research Programmer, Digital Library Initiative
> University of Illinois at Urbana-Champaign
> 052 Grainger Engineering Library, MC-274
> thabing@uiuc.edu, (217) 244-7809
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>