[OAI-implementers] deep nesting problem in DP9

Young,Jeff jyoung@oclc.org
Tue, 12 Nov 2002 13:18:59 -0500

Let me add a bit of context to this. If DP9 knew what all the
resumptionTokens looked like up front, it could generate an HTML page
containing a complete list of resumptionToken URLs as the front page for a
repository that it presents to a crawler. This way the crawler only sees a
total of 4 levels: 1) the DP9 gateway, 2) the complete list of harvest URLs
for a particular repository, 3) an HTML rendering of a ListIdentifiers
response, and 4) an HTML rendering of a GetRecord response.
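
As a rough illustration of level 2, here is a minimal Python sketch of how
such a front page could be generated once the tokens are known. Everything
in it is an assumption for illustration, not DP9's actual code: the function
name build_front_page, the gateway URL layout, the default metadataPrefix,
and the idea that the gateway accepts the same query arguments as an
OAI-PMH request.

```python
from html import escape
from urllib.parse import urlencode


def build_front_page(gateway_url, tokens, metadata_prefix="oai_dc"):
    """Return an HTML page that links to every ListIdentifiers chunk.

    gateway_url and the query layout are hypothetical; tokens is the
    complete list of resumptionTokens, assumed to be known up front.
    """
    # The first chunk is requested without a resumptionToken.
    first = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    urls = [gateway_url + "?" + urlencode(first)]

    # Every later chunk is addressed by its resumptionToken alone.
    for token in tokens:
        query = {"verb": "ListIdentifiers", "resumptionToken": token}
        urls.append(gateway_url + "?" + urlencode(query))

    # One flat list of links, so every chunk is one click from this page.
    items = "\n".join(
        '<li><a href="{0}">chunk {1}</a></li>'.format(escape(url, quote=True), n)
        for n, url in enumerate(urls, start=1)
    )
    return "<html><body><ul>\n" + items + "\n</ul></body></html>"
```

Because all chunk links sit on one page, the crawl depth no longer grows
with the size of the repository.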


> -----Original Message-----
> From: Xiaoming Liu [mailto:liu_x@cs.odu.edu]
> Sent: Tuesday, November 12, 2002 1:05 PM
> To: oai-implementers@oaisrv.nsdl.cornell.edu
> Subject: [OAI-implementers] deep nesting problem in DP9
> Hi,
>
> I have a question regarding the DP9 service and hope to solicit some
> consensus in the OAI community (after discussions with Jeff Young). DP9
> is a gateway service which allows general search engines (e.g. Google,
> Inktomi) to index OAI-PMH-compliant archives. For more information
> please see
> http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf
> http://dlib.cs.odu.edu/dp9
>
> DP9 uses resumptionTokens to handle large archives: a resumptionToken
> link is put at the bottom of each ListIdentifiers screen. Each time a web
> crawler follows a resumptionToken link, it goes down another level in
> the page hierarchy. In the case of XTCat (4M+ records), the hierarchy
> will be 8000+ pages deep. But most crawlers only follow links 4~5 levels
> deep, so the whole of XTCat will never have a chance of being harvested.
>
> Jeff and I discussed this question and came up with several possible
> solutions. All of them require some level of action on the data provider
> side. We hope the final solution is "general", and that's why I am
> posting it to the list.
>
> 1) Create many small bins based on timestamp and sets. DP9, of course,
> must be intelligent enough to do the split correctly; this can probably
> be done by a partial pre-harvest. This solution should be applicable to
> most data providers, but it fails if a data provider has a large number
> of records with the same datestamp.
>   pro) no change in the OAI spec.
>   con) the data provider must not have a large number of records with
>        the same datestamp.
> (snippet from Jeff's email)
> 2) Create a new verb named ListResumptionTokens that
> returns a complete set of stateless resumptionTokens.
>   pro) easy to implement on the DP9 side.
>   con) requires modification of the OAI spec.
> 3) Define a new <description> element to be returned in the Identify
> response that provides the information DP9 needs to automatically
> generate stateless resumptionTokens.
>   pro) doesn't require any modification of the OAI spec.
>   con) requires individual repositories to voluntarily provide this
>        information.
> Probably somebody can come up with a better idea, and we may reach a
> consensus on which way to go ;-) Please send to the list if you
> have any input.
> best regards,
> liu
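
Option 1 in the quoted message (splitting a repository into small bins by
datestamp) could look roughly like the Python sketch below. It is only an
illustration under stated assumptions: split_into_bins and count_records
are hypothetical names, count_records stands in for whatever the partial
pre-harvest can report about a from/until range, and day granularity is
assumed. It also shows the failure case named in the message: a single day
that holds too many records cannot be split any further.

```python
from datetime import timedelta


def split_into_bins(start, end, count_records, max_per_bin):
    """Split the inclusive day range [start, end] into smaller ranges.

    count_records(start, end) is a hypothetical callback returning the
    number of records whose datestamp falls in that range. Returns a list
    of (start, end, count) tuples in chronological order.
    """
    n = count_records(start, end)
    if n == 0:
        return []  # nothing to harvest in this range
    if n <= max_per_bin or start == end:
        # Small enough, or a single day that cannot be split further.
        # In the second case the bin may still exceed max_per_bin.
        return [(start, end, n)]

    # Halve the range by days and recurse on both halves.
    mid = start + timedelta(days=(end - start).days // 2)
    left = split_into_bins(start, mid, count_records, max_per_bin)
    right = split_into_bins(mid + timedelta(days=1), end,
                            count_records, max_per_bin)
    return left + right
```

Each resulting (start, end) pair would become one from/until harvest URL on
the repository's front page; oversized single-day bins would still need
resumptionTokens, which is the limitation noted under option 1.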