[OAI-implementers] DP9 and HTML metadata

Walter Underwood wunder@inktomi.com
Thu, 24 Jan 2002 12:01:03 -0800

--On Thursday, January 24, 2002 02:18:25 PM -0500 "Michael L. Nelson" <mln@ils.unc.edu> wrote:
> But since you're on the line, I have some questions for you ;-)
> 1.  Do you have an official or personal opinion that you can share about
> OAI & spidering?  

Open archives are great, and a common protocol is a good idea.
I haven't looked at the OAI protocol yet.

Metadata-only indexes have some serious weaknesses. They almost
require a separate thesaurus with lots of entry terms, and even
then, they won't handle searches for secondary characteristics,
like names of characters in fiction.

Metadata+fulltext is much more powerful.
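
To make the difference concrete, here is a minimal sketch of a toy
inverted index built two ways: over metadata fields only, and over
metadata plus full text. The record, the field names, and the helper
functions are all made up for illustration; a real engine does far more
(ranking, stemming, phrase matching).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records, fields):
    """Map each token to the set of record ids whose chosen fields contain it."""
    index = defaultdict(set)
    for doc_id, record in records.items():
        for field in fields:
            for token in tokenize(record.get(field, "")):
                index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of records that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

# Hypothetical record: minimal metadata plus a snippet standing in for the text.
RECORDS = {
    "doc1": {
        "title": "Pride and Prejudice",
        "creator": "Jane Austen",
        "fulltext": "Elizabeth Bennet meets Mr. Darcy at the ball.",
    },
}

metadata_index = build_index(RECORDS, ["title", "creator"])
combined_index = build_index(RECORDS, ["title", "creator", "fulltext"])
```

With these indexes, search(metadata_index, "Darcy") finds nothing, while
search(combined_index, "Darcy") finds doc1. A query on the author or
title matches in both.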

For example, the "Celebration of Women Writers" collection is good,
but the metadata is minimal. The Caltech Thesis collection has nice, big
descriptions, which gives a bigger "target" for matching user queries.

For the "Celebration" collection, I'd recommend indexing the source
documents instead.

Good metadata is very expensive, so we only see that on high-investment
documents, like public techreports. Even then, I've seen some surprisingly
sloppy index keywords ("IR" in one report, "Information Retrieval" in
the next).
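
As a rough illustration of what a thesaurus with entry terms buys you,
the sketch below maps variant keywords onto one preferred term before
indexing. The table contents and the function name are assumptions for
the example, not taken from any real collection.

```python
# Hypothetical entry-term table: variant spelling -> preferred term.
ENTRY_TERMS = {
    "ir": "information retrieval",
    "info retrieval": "information retrieval",
}

def normalize_keyword(keyword, entry_terms=ENTRY_TERMS):
    """Lowercase, collapse whitespace, then map a variant to its preferred term.

    Keywords that are not in the table are returned in cleaned-up form.
    """
    key = " ".join(keyword.lower().split())
    return entry_terms.get(key, key)

def normalize_keywords(keywords, entry_terms=ENTRY_TERMS):
    """Normalize a list of keywords and drop duplicates, keeping first-seen order."""
    seen = []
    for keyword in keywords:
        term = normalize_keyword(keyword, entry_terms)
        if term not in seen:
            seen.append(term)
    return seen
```

With this table, "IR" and "Information Retrieval" both index under
"information retrieval", so a search on either form finds both reports.
The expensive part is building and maintaining the table, not the lookup.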

> 2.  DP9 is great for spiders that don't know any better, but what are the
> chances of "OAI-aware" spiders?  Or is that such a special case that it's
> not worth accounting for...

Part of the answer is "how much will people pay for that?"
Another part is "how hard is it to implement?"
And then there is "how much fun is it?"

I don't have the answer to any of those right now.

> ... Of course, this is a good substitute:
> http://arc.cs.odu.edu:8080/dp9/listidentifiers/NACA

For spiders, that should be broken into smaller pages, with around
1000 links each. Big pages with tons of links can cause problems.
Perhaps a page for each year.
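
A minimal sketch of the splitting, assuming the identifiers are already
available as a flat list of URLs. The function names, the page title,
and the example.org URLs are hypothetical; this is not how DP9 itself
generates its pages.

```python
from html import escape

def chunk(items, size=1000):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

def render_pages(links, size=1000):
    """Render one small HTML page of links per group of `size` URLs."""
    pages = []
    for number, group in enumerate(chunk(links, size), start=1):
        rows = "\n".join(
            '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
            for url in group
        )
        pages.append(
            "<html><head><title>Identifiers, page {0}</title></head>"
            "<body><ul>\n{1}\n</ul></body></html>".format(number, rows)
        )
    return pages

# Hypothetical input: 2500 record URLs become three pages (1000, 1000, 500).
links = ["http://example.org/record/{0}".format(i) for i in range(2500)]
pages = render_pages(links)
```

Grouping by year instead would mean bucketing the links on a date field
first and then applying the same chunking within any year that is still
too large.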

And you need the HTML pages anyway, for humans to view.

Walter Underwood
Senior Staff Engineer, Inktomi