[OAI-implementers] XML encoding problems with DSpace at MIT

Heinrich Stamerjohanns stamer@uni-oldenburg.de
Mon, 17 Feb 2003 14:04:14 +0100 (CET)

On Sat, 15 Feb 2003, Hussein Suleman wrote:

>  > The question is, the more harvesters implement fixes the less pressure
>  > there is on repositories to fix their output, so should harvesters
>  > accept bad-XML?

> hi
> i think Tim poses a very relevant question: do we deal with the
> so-called "real-world" encoding problems or do we try to encourage
> people to fix their implementations? (of course, for research purposes,
> we may end up doing both :))


If you want a working protocol, you must insist that the data-providers
deliver valid XML.
If they don't deliver valid XML, they are not OAI-compliant: some
harvesters will choke on the response, while others that try to repair
the XML might cope.
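
To illustrate the "choke" case, here is a minimal sketch in Python, assuming
a harvester that uses the standard-library parser without any repair step.
The record contents and the helper name `parses` are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if a strict XML parser accepts xml_text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two invented records; the second contains a vertical tab (0x0B) in the title.
good = "<record><title>Thesis</title></record>"
bad = "<record><title>The\x0bsis</title></record>"

print(parses(good))
print(parses(bad))
```

A strict parser rejects the whole response because of a single stray byte,
which is why one bad record can stop a harvest.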

The most common problem, it seems to me (I cannot get to arc.cs.odu.edu to
see the parsing errors), is that people create Unicode from their databases
but forget to remove the ISO control characters, which are not valid in XML
(the comment in the XML 1.0 spec was confusing and has been changed in the
XML 1.1 spec). Maybe this should be explicitly pointed out in the
documentation of the protocol.
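
For reference, the set of allowed characters is given by the Char production
of XML 1.0: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF,
U+E000 to U+FFFD and U+10000 to U+10FFFF. A rough Python sketch of a check a
data-provider could run over its records before serving them; the function
names and the sample string are assumptions for illustration only.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text: str) -> list:
    """Return (offset, 'U+XXXX') for every character XML 1.0 forbids."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


# Invented sample: a form feed after the first word and a trailing NUL.
sample = "Gr\u00fc\u00dfe\x0c aus Oldenburg\x00"
print(find_invalid_chars(sample))
```

Reporting the offsets, rather than silently dropping the characters, makes it
easier to find out which database fields the bad data comes from.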

So to produce valid XML, something like this should be applied before you
send out the data (this is in PHP, but the pattern is a Perl regular
expression, see perlre):

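        // Remove the control characters that XML 1.0 does not allow:
        // everything below 0x20 except tab (0x09), LF (0x0a) and CR (0x0d).
        $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
        $string = preg_replace($pattern, '', $string);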
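
For data-providers not written in PHP, the same character class carries over
unchanged. A minimal Python sketch, assuming the text is already a decoded
string; the function name and the sample value are hypothetical.

```python
import re

# Same character class as the pattern above: all control characters below
# 0x20 except tab (0x09), line feed (0x0a) and carriage return (0x0d).
_INVALID_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_controls(text: str) -> str:
    """Remove control characters that are not allowed in XML 1.0."""
    return _INVALID_XML_CONTROLS.sub("", text)


# Invented sample: CR, LF and tab must survive, the other controls must go.
dirty = "Line one\r\n\tindented\x0b\x1f text\x00"
clean = strip_invalid_controls(dirty)
print(repr(clean))
```

This only handles the control-character case discussed here. It does not
escape markup characters such as `&` and `<`, which still has to happen
separately when the record is serialized.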

Greetings, Heinrich

  Dr. Heinrich Stamerjohanns        Tel. +49-441-798-4276
  Institute for Science Networking  stamer@uni-oldenburg.de
  University of Oldenburg           http://isn.uni-oldenburg.de/~stamer