[OAI-implementers] DP9 and HTML metadata

Xiaoming Liu liu_x@cs.odu.edu
Fri, 25 Jan 2002 01:03:38 -0500 (EST)


Walter,

These suggestions are great! They have been added to the test site:
	http://egbert.cs.odu.edu:8901/dp9

I hope I have addressed most of the issues you mentioned, including:
-- Add an HTML meta tag for each DC field.
-- Use dc:title in <title>title</title> (this was also suggested by Tim
Brody, thanks).
-- dc.language:en  --> <meta http-equiv="content-language"
content="en">
-- Add <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> to the browsable index
page.
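
As a rough illustration of the mapping listed above, here is a minimal
Python sketch. It is not DP9's actual code: the function name dc_to_head,
the dict-of-lists record layout, and the name="DC.<field>" naming of the
meta tags are all assumptions made for the example.

```python
from html import escape


def dc_to_head(record, is_index_page=False):
    """Render Dublin Core fields as HTML <head> lines.

    `record` is a hypothetical layout: DC element name -> list of values,
    e.g. {"title": ["Hamlet"], "language": ["en"]}.
    """
    lines = []

    # dc:title also goes into the native HTML <title> element.
    for title in record.get("title", [])[:1]:
        lines.append(f"<title>{escape(title)}</title>")

    # dc:language also goes into the content-language http-equiv tag.
    for lang in record.get("language", [])[:1]:
        lines.append(
            f'<meta http-equiv="content-language" content="{escape(lang)}">'
        )

    # One meta tag per DC value (the "DC." prefix is an assumed convention).
    for field, values in record.items():
        for value in values:
            lines.append(
                f'<meta name="DC.{escape(field)}" content="{escape(value)}">'
            )

    # Browsable index pages: follow the links, but do not index the page.
    if is_index_page:
        lines.append('<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">')

    return "\n".join(lines)
```

Values are escaped so that quotes or angle brackets in a metadata field
cannot break the generated HTML.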


I also added another HTML/HTTP tag:
<meta content="2000-07-27" http-equiv="Last-Modified">

The value of this field is the "datestamp" from the OAI protocol. Roughly,
it is the last modification date of the metadata record; for the precise
definition, see
http://www.openarchives.org/OAI_protocol/openarchivesprotocol.html#Datestamp
 
All pages in DP9 are dynamically created, so the Last-Modified header
in DP9's HTTP response may be misleading and cause additional
indexing overhead for web crawlers. I hope the "Last-Modified" HTML/HTTP
header helps a bit.
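
For illustration, a small Python sketch of deriving both the meta tag and
a real HTTP Last-Modified header value from a datestamp. It assumes a
day-granularity datestamp (YYYY-MM-DD) treated as midnight UTC; the
function name last_modified_values is made up for the example and is not
part of DP9.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_values(datestamp):
    """Return (meta_tag, http_header_value) for an OAI datestamp.

    Assumes `datestamp` looks like "2000-07-27" and is taken as 00:00 UTC.
    """
    dt = datetime.strptime(datestamp, "%Y-%m-%d").replace(tzinfo=timezone.utc)

    # The tag as it appears in the generated page.
    meta_tag = f'<meta content="{datestamp}" http-equiv="Last-Modified">'

    # The same date in the HTTP-date format used by response headers.
    http_header_value = format_datetime(dt, usegmt=True)

    return meta_tag, http_header_value
```

Sending the second value as the actual Last-Modified response header
would let a crawler skip unchanged records without parsing the page.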

Several issues need more work:
-- Use DC.identifier as the URL which is presented with the results.

In OAI, the DC.identifier may not be a URL (although in many cases it
is). I am not sure whether a validator should be added.
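
If a validator were added, it could be as simple as the following Python
sketch. The accepted schemes and the helper names looks_like_url and
first_url are assumptions for the example; they only check the syntax of
the identifier, not whether the URL actually resolves.

```python
from urllib.parse import urlsplit

# Schemes treated as "real" URLs here; this list is an assumption.
ALLOWED_SCHEMES = ("http", "https", "ftp")


def looks_like_url(identifier):
    """Return True if the identifier parses as a URL with a host part."""
    try:
        parts = urlsplit(identifier.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. bad IPv6 brackets.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def first_url(identifiers):
    """Return the first DC.identifier value that looks like a URL, or None."""
    for value in identifiers:
        if looks_like_url(value):
            return value.strip()
    return None
```

A record whose identifiers are all non-URLs (for example an "oai:"
identifier) would then simply be presented without a result URL.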

-- Break large pages (like NACA) into smaller pieces (mentioned in
re: Michael's mail)

Due to current limitations of DP9, it is difficult to add this
feature right now. DP9 forwards requests to the data provider and tries to
present whatever it gets back. We are thinking about a caching mechanism,
which may help solve this problem.
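
One way such a cache could help with paging, sketched in Python. This is
purely hypothetical and not a description of DP9: the class PagedCache,
the fetch callable, and the page size are invented for the example.

```python
class PagedCache:
    """Cache a data provider's full listing once, then serve small pages."""

    def __init__(self, fetch, page_size=50):
        # `fetch` is a hypothetical callable: set name -> iterable of records.
        self._fetch = fetch
        self._page_size = page_size
        self._cache = {}

    def _records(self, set_name):
        # Contact the data provider only on the first request for a set.
        if set_name not in self._cache:
            self._cache[set_name] = list(self._fetch(set_name))
        return self._cache[set_name]

    def page_count(self, set_name):
        """Number of pages needed for the set (ceiling division)."""
        total = len(self._records(set_name))
        return (total + self._page_size - 1) // self._page_size

    def page(self, set_name, number):
        """Return page `number` (0-based); empty list if out of range."""
        start = number * self._page_size
        return self._records(set_name)[start:start + self._page_size]
```

The sketch ignores cache expiry; a real version would need to refresh
entries when the data provider's records change.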

There are over 1M metadata records available in current OAI-compliant
repositories (see http://arc.cs.odu.edu); it would be great if they all
got indexed by search engines :-)

thanks,

Liu




On Thu, 24 Jan 2002, Walter Underwood wrote:

> As a spider engineer, I'd like to suggest an improvement to DP9.
> I'm sending this to the whole OAI list partly to introduce myself,
> and partly because it is an interesting omission in DP9.
> 
> DP9 should use HTML metadata standards to present the Dublin Core
> metadata. Right now, it prettyprints the info, but that is not
> useful for a spider. 
> 
> In addition to the pretty representation, the generated HTML should
> include meta tags for each DC element. I'd recommend also using
> native HTML/HTTP standards for a couple of the elements:
> 
>    dc.title:Hamlet --> <title>Hamlet</title>
>    dc.language:en  --> <meta http-equiv="content-language" content="en">
> 
> Our engine (Inktomi Enterprise Search) will use that metadata for
> the information presented in the results page. In addition, the
> engine can be configured to use DC.identifier as the URL which is
> presented with the results.
> 
> Finally, if there are browsable index pages with links to the 
> generated GetRecord pages, those should probably include a
> noindex robots meta tag. Lists of URLs are usually not very
> useful search results. They are excellent roots (start pages)
> for spidering, though.
> 
> wunder
> --
> Walter Underwood
> wunder@inktomi.com
> Senior Staff Engineer, Inktomi
> http://www.inktomi.com/
> 
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>