regular expressions for cleanup was: Re: [OAI-implementers] XML encoding problems with DSpace at MIT

Hussein Suleman hussein@cs.uct.ac.za
Thu, 20 Feb 2003 17:18:24 +0200


hi

Brian Tingle wrote:
 > The most common problems I've had as a provider so far have had to do
 > with the ampersands in non-XML data that I want to expose.
... (see rest below)

this will work some of the time, but there will be problems if you have 
XML/HTML/SGML entities other than the standard ones (e.g. i 
believe &copy; will cause problems) ... maybe you are already addressing 
this, but if not, read on ...

XML has only 5 predefined entities (quot, lt, gt, amp, apos) - anything 
else requires an external entity definition, and OAI requires using 
numeric character references instead of those (see start of section 3.2 
of protocol). the clean solution is to convert any suspected entities 
(Latin-1 seems to pop up in many places because of HTML) into numeric 
Unicode references (e.g. &#169; for &copy;), and then double-escape 
anything you don't recognise ... best effort is probably not good 
enough - if in doubt, it's better to produce slightly over-escaped 
valid XML than originally encoded but possibly invalid XML :)
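
a rough sketch of that clean-up, in Python purely for illustration (this 
is not code from any of the toolkits, and clean_entities is a made-up 
name). it assumes the input may carry HTML-style named entities, uses the 
HTML 4 table in Python's html.entities module to turn the known ones into 
numeric references, and double-escapes whatever it does not recognise:

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML; these may stay as they are.
XML_PREDEFINED = {"quot", "lt", "gt", "amp", "apos"}

# An ampersand, optionally followed by something shaped like a named
# entity or a numeric character reference and a closing semicolon.
_AMP = re.compile(r"&(?:(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)?")


def clean_entities(text):
    """Return text with only XML-safe entity references left in it."""

    def fix(match):
        name = match.group(1)
        if name is None:
            return "&amp;"  # bare ampersand
        if name.startswith("#") or name in XML_PREDEFINED:
            return match.group(0)  # already acceptable to an XML parser
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]  # e.g. &copy; -> &#169;
        return "&amp;" + match.group(0)[1:]  # unknown name: double-escape

    return _AMP.sub(fix, text)
```

with this sketch, clean_entities("AT&T &copy; 2003") gives 
"AT&amp;T &#169; 2003", and an unknown name such as &foo; comes out 
over-escaped as &amp;foo; rather than being passed through.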

but, hey, don't reinvent the wheel ... look at the code templates 
available on the OAI website. most of the toolkits do some degree of 
data cleaning. if you use Perl, the VTOAI template i wrote has a 
"Utility.pm" module for data cleaning which does all of the above/below 
plus much more.

ttfn,
----hussein

-- 
=====================================================================
hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================


> This regular expression is what I use to take non-XML data that has lots 
> of ampersands and turn them to &amp; but it will not "double" escape
> &quot; &c. that might already be in there.
> 
> $content =
> (Dad& &Mac &amp; &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&T; B.&T. &T.; Co.  :127a) 
> 
> turns to $content=
> (Dad&amp; &amp;Mac &amp; &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&amp;T; B.&amp;T. &amp;T.; Co.  :127a)
> 
>         # escape every & that does not already start an &name; entity
>         my $ident = '[:_A-Za-z][:A-Za-z0-9\-\_]+';
>         $content =~ s,\&(?!$ident;),&amp;,sg;
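
for comparison, a minimal Python sketch of the same negative-lookahead 
idea (the names IDENT, BARE_AMP and escape_bare_ampersands are only for 
illustration). it keeps the same name pattern as the Perl $ident, so it 
shares its limits: anything shaped like &name; is left alone whether or 
not XML knows that entity, and a one-letter name such as &T; is not 
treated as an entity at all because the pattern wants at least two 
characters.

```python
import re

# Same shape as the Perl $ident above: a name start character followed by
# one or more name characters.
IDENT = r"[:_A-Za-z][:A-Za-z0-9\-_]+"

# An ampersand that is NOT the start of something like "&name;".
BARE_AMP = re.compile(r"&(?!" + IDENT + r";)")


def escape_bare_ampersands(content):
    """Escape stray ampersands, leaving existing &name; references alone."""
    return BARE_AMP.sub("&amp;", content)
```

so "Dad& &Mac &amp;" becomes "Dad&amp; &amp;Mac &amp;", while &copy; 
passes through untouched and would still need converting afterwards.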
> 
> 
> Heinrich Stamerjohanns <stamer@uni-oldenburg.de> wrote:
> 
>>The most common problem seems to me (I cannot get to arc.cs.odu.edu, to
>>see the parsing errors) that people create Unicode from their databases
>>but forget to remove ISO-control characters, which are not valid in XML
>>(the comment in XML 1.0 spec was irritating and has been changed in XML
>>1.1 spec). Maybe this should be explicitly pointed out in the
>>documentation of the protocol.
>>
>>So to produce valid xml, something like this should be applied before you
>>send out the data (this is in php, but is a perlre pattern):
>>
>>  
>>        // just remove invalid characters
>>        $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>>        $string = preg_replace($pattern,'',$string);
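
the same filter as a small Python sketch, with the ranges written out in 
full (strip_control_chars is just an illustrative name). it assumes XML 
1.0 rules, where tab, line feed and carriage return are the only 
characters below 0x20 that are allowed:

```python
import re

# C0 control characters that XML 1.0 does not allow; tab (0x09),
# line feed (0x0A) and carriage return (0x0D) are allowed and kept.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Remove control characters that are not allowed in XML 1.0."""
    return INVALID_XML_CHARS.sub("", text)
```

this only removes the offending characters; it does not try to guess 
what they were meant to be.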
> 
> 
> 
> 
> 
> On Mon, 17 Feb 2003, Heinrich Stamerjohanns wrote:
> 
> 
>>On Sat, 15 Feb 2003, Hussein Suleman wrote:
>>
>>
>>> > The question is, the more harvesters implement fixes the less pressure
>>> > there is on repositories to fix their output, so should harvesters
>>> > accept bad-XML?
>>
>>>hi
>>>
>>>i think Tim poses a very relevant question: do we deal with the
>>>so-called "real-world" encoding problems or do we try to encourage
>>>people to fix their implementations? (of course, for research purposes,
>>>we may end up doing both :))
>>>
>>
>>Hi,
>>
>>If you want a working protocol, you must insist that the data-providers
>>deliver valid XML.
>>If they don't deliver valid XML, they are not OAI-compliant, thus some
>>harvesters will choke, some who try to fix the XML, might not.
>>
>>The most common problem seems to me (I cannot get to arc.cs.odu.edu, to
>>see the parsing errors) that people create Unicode from their databases
>>but forget to remove ISO-control characters, which are not valid in XML
>>(the comment in XML 1.0 spec was irritating and has been changed in XML
>>1.1 spec). Maybe this should be explicitly pointed out in the
>>documentation of the protocol.
>>
>>So to produce valid xml, something like this should be applied before you
>>send out the data (this is in php, but is a perlre pattern):
>>
>>
>>        // just remove invalid characters
>>        $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>>        $string = preg_replace($pattern,'',$string);
>>
>>
>>Greetings, Heinrich
>>
>>
>>--
>>  Dr. Heinrich Stamerjohanns        Tel. +49-441-798-4276
>>  Institute for Science Networking  stamer@uni-oldenburg.de
>>  University of Oldenburg           http://isn.uni-oldenburg.de/~stamer
>>
>>
>>
>>_______________________________________________
>>OAI-implementers mailing list
>>OAI-implementers@oaisrv.nsdl.cornell.edu
>>http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>>