[OAI-implementers] Newbie available sites question
Fri, 8 Feb 2002 09:24:36 +1100
I am new to this list, so I am not sure yet whether this is a 'general'
or an 'implementers' question. But I am trying to implement an OAI Service
Provider from the spec, so 'implementers' seemed reasonable.
I have built a first cut of an OAI harvester that loads the data into
a local database I have built. I believe I have followed the spec
correctly, but I have been trying to access many of the OAI sites
listed on the openarchives.org site and am getting lots of problems.
Is this because the protocol is still new? Or are many of the sites
out of date now? Is there a more up to date list?
I am using POST methods, and quite a few of the errors indicate to
me that the sites only accept GET methods with the verb etc. tacked onto
the URL. Is there any up-to-date "state of the nation" statement around?
Is OAI in serious production, or still very experimental? The spec
recommends POST rather than GET, but is the practice that GET is more
interoperable than POST?
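
To make the GET/POST difference concrete, here is a rough sketch of how
the same OAI request can be built either way. The base URL is made up,
and build_oai_request is only an illustrative helper, not anything from
the spec. It builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_oai_request(base_url, params, method="GET"):
    """Build (but do not send) an OAI request using GET or POST.

    base_url and this helper are illustrative assumptions only.
    """
    encoded = urlencode(params)
    if method == "GET":
        # GET: the verb and arguments are tacked onto the URL.
        return Request(f"{base_url}?{encoded}", method="GET")
    # POST: the same arguments travel form-encoded in the request body.
    return Request(
        base_url,
        data=encoded.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


args = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
get_req = build_oai_request("http://example.org/oai", args, "GET")
post_req = build_oai_request("http://example.org/oai", args, "POST")
```

A harvester could try POST first and fall back to GET when a site
rejects it, which would sidestep the question of which one a given
site supports.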
Another unrelated question is: are there any systems around that will
crawl a web site, pull DC metadata out of the Meta tags, and then dish
that up via OAI? I had an enquiry from someone who thought OAI would
be interesting to use, but they were not sure if they could convince
their data suppliers to implement OAI (too much cost/hard work). If there
were a simple product that could be installed at the data providers' sites,
it could do a local crawl and then use OAI to distribute only the
changes. If there were existing cheap/free software for this, the data
providers might be willing to do something.
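
The extraction half of such a tool could be quite small. Below is a
minimal sketch that pulls DC metadata out of the Meta tags of one page.
It assumes the pages use the common name="DC.xxx" convention, and the
class and function names are my own invention. The crawling and the OAI
serving side are not shown.

```python
from html.parser import HTMLParser


class DCMetaParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from one page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Assumption: DC elements are named "DC.title", "DC.creator", etc.
        if name.lower().startswith("dc.") and content is not None:
            element = name[3:].lower()
            # DC elements are repeatable, so keep a list per element.
            self.fields.setdefault(element, []).append(content)


def extract_dc(html_text):
    """Return a dict mapping DC element name to a list of values."""
    parser = DCMetaParser()
    parser.feed(html_text)
    return parser.fields


page = (
    '<html><head><meta name="DC.Title" content="Sample">'
    '<meta name="DC.Creator" content="A. Author"/>'
    '<meta name="keywords" content="x"></head></html>'
)
fields = extract_dc(page)
```

To distribute only the changes, the tool would also need to remember
when each page's metadata last changed, so it has a datestamp to serve
for each record.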
A little background on myself: we have a Z39.50 database system with
lots of SGML and XML support. However we are not really in the library
marketplace at present - more document management. I came across OAI
again recently, read the 1.1 spec, and had a go at writing a crawler.
It only took one day to write, which is probably a good report for OAI.
I have been loading the DC XML into our Z39.50 server. If there was
interest, I might be able to put it up for public access. But at this
stage I don't have a good feel of where OAI is up to in real life.
Another question I had is that ListRecords does not seem to guarantee
any order in the returned data. I am doing a crawl from 1900 as an initial
pass to get all the data, using resumptionTokens to keep going. If something
fails halfway through, I currently have to get the lot again. If the
records were guaranteed to come back in date-sorted order, I could
resume from a more recent date/time. I could use from/until 1 year at a
time, but the nice thing about resumptionToken is that the server gets
to choose a reasonable lump size (e.g. 100 records). The client does not
know if 1 year contains 1, 100, or 1000000 records. There has probably
been previous discussion on this. Does ListRecords guarantee that
records come back in date order?
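
One compromise would be to combine the two: use from/until one year at a
time, but still follow resumptionTokens inside each year, so that a
failure only costs the year in progress. A rough sketch of that idea is
below. fetch is a hypothetical callable that sends the arguments and
returns the response XML as text, and I am assuming a follow-up request
carries only the verb and the resumptionToken.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def yearly_windows(first_year, last_year):
    """Yield (from, until) date strings, one pair per calendar year."""
    for year in range(first_year, last_year + 1):
        yield f"{year}-01-01", f"{year}-12-31"


def harvest_window(fetch, from_date, until_date):
    """Yield every record in one window, following resumptionTokens.

    fetch(params) is a hypothetical callable returning response XML text.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "from": from_date,
        "until": until_date,
    }
    while True:
        root = ET.fromstring(fetch(params))
        token = None
        for elem in root.iter():
            name = local_name(elem.tag)
            if name == "record":
                yield elem
            elif name == "resumptionToken":
                token = (elem.text or "").strip()
        if not token:
            break  # no (or empty) token: this window is complete
        # Assumption: the token replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": token}


def harvest_all(fetch, first_year, last_year, done=None):
    """Harvest year by year, skipping windows already marked done."""
    done = set() if done is None else done
    for window in yearly_windows(first_year, last_year):
        if window in done:
            continue
        records = list(harvest_window(fetch, *window))
        done.add(window)  # only marked once the whole window succeeded
        yield window, records
```

If the set of completed windows is saved somewhere, a restart after a
failure repeats at most one year, and the server still picks the lump
size within that year.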
Another issue I had was that it seemed strange that some date/time values
were accurate to the day, whereas others were accurate to the second.
When doing the next crawl, my crawler subtracts 1 day from the last crawl
date to ensure no records are lost, and to take into account time zone
differences etc. Is there any standard practice?
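
For reference, the overlap calculation my crawler does amounts to
something like the sketch below. The function name and the choice of
keeping the last crawl time in UTC are illustrative assumptions. The
result is truncated to the day, since not every site seems to go finer
than that.

```python
from datetime import datetime, timedelta, timezone


def next_from_date(last_crawl_utc, overlap=timedelta(days=1)):
    """Return the 'from' argument for the next crawl as YYYY-MM-DD.

    Stepping back by the overlap is meant to absorb time zone
    differences and day-granularity datestamps.
    """
    return (last_crawl_utc - overlap).date().isoformat()


last_crawl = datetime(2002, 2, 8, 9, 24, 36, tzinfo=timezone.utc)
from_arg = next_from_date(last_crawl)
```

The overlap means some records come back twice, so the loader has to
replace records by identifier rather than blindly insert them.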
Thanks for any help people can provide, and sorry if these are old
topics. But I find crawling through the archives does not always tell
you what the current feelings on a topic are.
Alan Kent (mailto:email@example.com, http://www.mds.rmit.edu.au)
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC Australia.
Phone: +61 3 9925 4114 Reception: +61 3 9925 4099 Fax: +61 3 9925 4098