[OAI-implementers] Newbie available sites question

zubair@cs.odu.edu zubair@cs.odu.edu
Thu, 7 Feb 2002 18:00:37 -0500


Regarding your question about a system that crawls a web site, extracts and
maps metadata to DC, and makes it available through OAI: we have a
preliminary version of such a system. For web sites where DC metadata can
be obtained from META tags, the implementation is straightforward. We also
wanted to handle Web sites that are less structured and do not have META
tags. In addition, we wanted a data-centered architecture where we do not
have to change the code when we want to work with another Web site. For
this reason, the system we have built uses a WIDL-like language (WIDL is
the XML-based Web Interface Definition Language; Web Methods has a NOTE on
it at the W3C) to describe the Web site and the mapping of information on
the Web pages to DC metadata. The system uses this description to mine,
extract, and map the metadata, and stores all the information in MySQL,
which has a servlet-based OAI wrapper. If you want more information on this
system, you can send me an email (zubair@cs.odu.edu).
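For the straightforward META-tag case mentioned above, the extraction step
can be sketched roughly as follows (a minimal illustration only - the sample
page and the DC.* naming convention shown are invented for the example, not
taken from the actual ODU system):

```python
from html.parser import HTMLParser

class DCMetaExtractor(HTMLParser):
    """Collect Dublin Core fields from <meta name="DC.xxx" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)            # HTMLParser lowercases attribute names
        name = a.get("name", "")
        if name.lower().startswith("dc."):
            field = name[3:].lower()            # "DC.Title" -> "title"
            self.dc.setdefault(field, []).append(a.get("content", ""))

# Hand-made sample page for illustration:
page = """
<html><head>
  <meta name="DC.Title" content="A Sample Report">
  <meta name="DC.Creator" content="J. Smith">
  <meta name="DC.Date" content="2002-02-07">
</head><body>...</body></html>
"""

parser = DCMetaExtractor()
parser.feed(page)
# parser.dc now maps each DC field to its list of values
```

A real crawler would of course have to deal with pages that lack such tags,
which is where a site-description language like the WIDL-like one described
above comes in.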


Alan Kent <ajk@mds.rmit.edu.au>
Sent by: oai-implementers-admin@oaisrv.nsdl.cornell.edu
Date: 02/07/2002 05:24 PM
To: OAI Implementors <oai-implementers@oaisrv.nsdl.cornell.edu>
Subject: [OAI-implementers] Newbie available sites question


I am new to this list, so I am not sure yet whether this is a 'general'
or 'implementers' question. But I am trying to implement an OAI Service
Provider from the spec, so 'implementers' seemed reasonable.

I have built a first cut of an OAI Harvester, loading the data up into
a local database I have built. I believe I have correctly followed the
spec, but I have been trying to access many of the OAI sites listed
on the openarchives.org site and am getting lots of problems.
Is this because the protocol is still new? Or are many of the sites
out of date now? Is there a more up-to-date list?

I am using POST methods, and quite a few of the errors indicate to
me that the sites only accept GET methods, with the verb etc. tacked onto
the URL. Is there any up-to-date "state of the nation" statement around?
Is OAI in serious production, or still very experimental? The spec
recommends POST rather than GET, but is the practice that GET is more
interoperable than POST?
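For what it's worth, the GET form that those sites seem to expect just tacks
the verb and its arguments onto the base URL as a query string. A sketch of
building such a request (the repository URL here is made up):

```python
from urllib.parse import urlencode

def oai_get_url(base_url, verb, args=None):
    """Build an OAI request URL with the verb and arguments in the
    query string, as a GET-only repository would expect.  `args` is a
    dict because 'from' is a Python keyword and can't be a kwarg."""
    query = urlencode({"verb": verb, **(args or {})})
    return f"{base_url}?{query}"

# Hypothetical repository URL, for illustration only:
url = oai_get_url("http://example.org/oai", "ListRecords",
                  {"metadataPrefix": "oai_dc", "from": "2002-01-01"})
```

A harvester that tries POST first and falls back to GET on failure would
presumably interoperate with both kinds of site.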

Another unrelated question: are there any systems around that will
crawl a web site, pull DC metadata out of the META tags, and then dish
that up via OAI? I had an enquiry from someone who thought OAI would
be interesting to use, but they were not sure if they could convince
their data suppliers to implement OAI (too much cost/hard work). If there
were a simple product that could be installed at the data providers' sites,
it could do a local crawl and then use OAI to distribute only the
changes. If there were existing cheap/free software for this, the data
providers might be willing to do something.

A little background on myself: we have a Z39.50 database system with
lots of SGML and XML support. However, we are not really in the library
marketplace at present - more document management. I came across OAI
again recently, read the 1.1 spec, and so had a go at writing a harvester.
It only took one day to write, which is probably a good report for OAI.
I have been loading the DC XML into our Z39.50 server. If there was
interest, I might be able to put it up for public access. But at this
stage I don't have a good feel for where OAI is up to in real life.

Another question I had is that ListRecords does not seem to guarantee
any order in the returned data. I am doing a crawl from 1900 as an initial
pass to get all the data, using resumptionTokens to keep going. If it
fails halfway through, I currently have to get the lot again. If the
records were guaranteed to come back in date-sorted order, I could
resume from a more recent date/time. I could use from/until one year at a
time, but the nice thing about resumptionToken is that the server gets
to choose a reasonable lump size (eg: 100 records). The client does not
know if one year contains 1, 100, or 1000000 records. There has probably
been previous discussion on this. Does ListRecords guarantee to return
the records in date order?
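The resumption logic in question amounts to re-issuing ListRecords with just
the token until the server stops returning one. A rough sketch, driven here
by two canned responses standing in for a server (the XML fragments are
hand-made and unnamespaced; real responses are namespaced, and the
namespaces differ between protocol versions):

```python
import xml.etree.ElementTree as ET

def parse_list_records(xml_text):
    """Pull record identifiers and the resumptionToken (if any) out of
    a ListRecords response.  Namespace handling is glossed over."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iter("identifier")]
    tok = root.find(".//resumptionToken")
    token = tok.text if tok is not None and tok.text else None
    return ids, token

def harvest(fetch):
    """Drive the resumption loop: fetch(token) returns one response body."""
    all_ids, token = [], None
    while True:
        ids, token = parse_list_records(fetch(token))
        all_ids.extend(ids)
        if token is None:
            return all_ids

# Canned responses playing the role of a repository (entirely made up):
pages = {
    None: """<ListRecords>
               <record><header><identifier>oai:example.org:1</identifier></header></record>
               <resumptionToken>page-2</resumptionToken>
             </ListRecords>""",
    "page-2": """<ListRecords>
                   <record><header><identifier>oai:example.org:2</identifier></header></record>
                 </ListRecords>""",
}
ids = harvest(lambda tok: pages[tok])
```

As the message notes, if the loop dies halfway through, nothing in this
scheme tells you which datestamp you can safely resume from.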

Another issue I had was that it seemed strange that some date/time values
were accurate to the day, whereas others were accurate to the second.
When doing the next crawl, my crawler subtracts 1 day from the last crawl
date to ensure no records are lost, and to take into account time zone
differences etc. Is there any standard practice here?
Thanks for any help people can provide, and sorry if these are old
topics. But I find that crawling through the archives does not always
tell you what the current feeling on a topic is.

Alan Kent (mailto:ajk@mds.rmit.edu.au, http://www.mds.rmit.edu.au)
Postal: Multimedia Database Systems, RMIT, GPO Box 2476V, Melbourne 3001.
Where: RMIT MDS, Bld 91, Level 3, 110 Victoria St, Carlton 3053, VIC
Phone: +61 3 9925 4114  Reception: +61 3 9925 4099  Fax: +61 3 9925 4098
OAI-implementers mailing list