[OAI-implementers] XML encoding problems with DSpace at MIT

Simeon Warner simeon@cs.cornell.edu
Tue, 18 Feb 2003 11:57:04 -0500 (EST)

I agree that encoding problems should be corrected by the data provider. I
have spent a considerable amount of time sending bug reports to data
providers in the past (I haven't been doing this recently because I
haven't been doing broad harvests).

My point on the oai-general list was really that we shouldn't get hung up
on the notion of "compliance". The notion isn't well defined or policed
and many server implementations have occasional problems (even mine...
shock, horror!). The key is that initiatives such as eprints.org and DSpace
have built-in support for metadata sharing using OAI (notwithstanding the
occasional bug) and that is very good for the open access movement and OAI
service providers.


On Mon, 17 Feb 2003, Kat Hagedorn wrote:
> I'm also with Hussein and Caroline. As a service provider, we do our  
> best to notify data providers when we run into errors that seem to not  
> be problems with our harvester. We can always clean the data at our end  
> if we can (and as close to the source as possible), but it's infinitely  
> easier in the long run for us to notify the data provider. As a result,  
> we don't waste time on our end and other service providers will harvest  
> cleaner data. The communication ends up benefitting everyone.
> - Kat
> On Saturday, Feb 15, 2003, at 11:25 America/Detroit, Caroline Arms wrote:
> > As a data provider, LC would like to know if it is generating invalid
> > characters.  The gradual migration to UNICODE is going to give us all
> > problems, in part BECAUSE some systems work so hard to recognize
> > different character encodings and adjust.  I'm with Hussein.  Notify
> > data providers of problems (even if you do adjust) so that the problem
> > can be fixed as close to home as possible.
> >
> > As a related aside, if anyone has a suggestion for an efficient way
> > (preferably unix-based) to check that the metadata in a PDF file is
> > stored in UTF-8 encoding (or consistently in any other UNICODE
> > encoding), I'd be interested.
> >
> > Caroline Arms
> > Office of Strategic Initiatives
> > Library of Congress
> >
> > On Sat, 15 Feb 2003, Hussein Suleman wrote:
> >> hi
> >>
> >> i think Tim poses a very relevant question: do we deal with the
> >> so-called "real-world" encoding problems or do we try to encourage
> >> people to fix their implementations? (of course, for research
> >> purposes, we may end up doing both :))
> >>
> >> personally, the code i distribute to others does quite a lot of XML
> >> cleaning in the data provider, but none at all in the harvester. i
> >> think the basic philosophy i'm following is: clean data as close to
> >> the source as possible. also, i believe one of the reasons the
> >> adminEmail field in Identify responses is required is so that a
> >> service provider can contact the administrator if there are problems
> >> with the data.
> >>
> >> and now that the hype about OAI2 is dying down, i wonder how much (if
> >> any) more testing we need. i have some ideas to enhance, complement
> >> and possibly even replace the repository explorer in the next year
> >> ... it all depends on finding time and/or students/colleagues with
> >> time :)
> >>
> >> ttfn,
> >> ----hussein
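
For anyone wanting to try the "clean data as close to the source as
possible" approach discussed in this thread, here is a minimal sketch in
Python. It is not the code Hussein distributes, nor anything taken from
DSpace or eprints.org; the function names check_utf8_xml and
clean_xml_text are made up for illustration. It assumes you already have
the metadata as a byte string (extracting it from a PDF file, as Caroline
asks about, is a separate step not covered here) and checks two things:
that the bytes decode as UTF-8, and that the decoded text contains only
characters permitted by the XML 1.0 Char production.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_utf8_xml(data: bytes):
    """Return (ok, problem) for a byte string (hypothetical helper).

    ok is True when the bytes are valid UTF-8 and contain only
    characters allowed in XML 1.0; otherwise problem describes the
    first fault found.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return False, f"invalid UTF-8 at byte {exc.start}"
    match = _INVALID_XML_CHARS.search(text)
    if match:
        code_point = ord(match.group())
        return False, (
            f"character U+{code_point:04X} not allowed in XML 1.0 "
            f"at offset {match.start()}"
        )
    return True, ""


def clean_xml_text(text: str, replacement: str = "") -> str:
    """Replace characters that XML 1.0 forbids (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)
```

A data provider could run check_utf8_xml over each record before serving
it and log any failures. clean_xml_text only removes forbidden characters;
it does not repair text stored in the wrong encoding, which still has to
be fixed at the source.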