[OAI-implementers] deep nesting problem in DP9
Tue, 12 Nov 2002 13:18:59 -0500
Let me add a bit of context to this. If DP9 knew what all the
resumptionTokens looked like up front, it could generate an HTML page
containing a complete list of resumptionToken URLs as the front page for a
repository that it presents to a crawler. This way the crawler only sees a
total of 4 levels: 1) the dp9 gateway, 2) the complete list of harvest URLs
for a particular repository, 3) an HTML rendering of a ListIdentifiers
response, and 4) an HTML rendering of a GetRecord response.
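
As a rough illustration of that front page, here is a minimal Python sketch. It assumes the full list of resumptionTokens is already known, which is exactly the open problem. The function name `build_front_page`, the gateway URL, and the query-string layout are all hypothetical and are not taken from DP9 itself.

```python
from html import escape
from urllib.parse import urlencode


def build_front_page(base_url, repository, tokens):
    """Render one HTML page that links to every ListIdentifiers page.

    base_url and the query layout are hypothetical; DP9's real URLs may differ.
    """
    items = []
    for number, token in enumerate(tokens, start=1):
        # One link per token, so every harvest page sits one click below this page.
        query = urlencode({"verb": "ListIdentifiers", "resumptionToken": token})
        href = escape(f"{base_url}?{query}", quote=True)
        items.append(f'<li><a href="{href}">Page {number}</a></li>')
    body = "\n".join(items)
    return (
        f"<html><head><title>{escape(repository)}</title></head>\n"
        f"<body><ul>\n{body}\n</ul></body></html>"
    )
```

With a page like this, the depth a crawler sees no longer grows with the number of tokens; only the width of the front page does.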
> -----Original Message-----
> From: Xiaoming Liu [mailto:email@example.com]
> Sent: Tuesday, November 12, 2002 1:05 PM
> To: firstname.lastname@example.org
> Subject: [OAI-implementers] deep nesting problem in DP9
> I have a question regarding the DP9 service and, after discussions with
> Jeff Young, hope to solicit some consensus in the OAI community. DP9 is
> a gateway service that allows general search engines (e.g. Google,
> Inktomi) to index OAI-PMH-compliant archives. For more information
> please see
> DP9 uses resumptionTokens to handle large archives: a resumptionToken
> link is put at the bottom of each ListIdentifiers screen. Each time a web
> crawler follows a resumptionToken link, it is taken down another level
> in the page hierarchy. In the case of XTCat (4M+ records), the hierarchy
> will be 8000+ pages deep. But most crawlers only follow 4 to 5 levels,
> so the whole of XTCat will never have a chance of being harvested.
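
The arithmetic behind those numbers can be sketched as follows. The page size of 500 identifiers is an assumption inferred from 4M records over 8000 pages; it is not stated in the message. The function names are hypothetical, and `max_pages` simplifies the crawler limit to "number of chained pages followed".

```python
def chain_depth(total_records, page_size):
    """Number of chained ListIdentifiers pages needed to cover all records."""
    return -(-total_records // page_size)  # integer ceiling division


def reachable_records(total_records, page_size, max_pages):
    """Records a crawler sees if it follows at most max_pages chained pages."""
    pages = min(chain_depth(total_records, page_size), max_pages)
    return min(total_records, pages * page_size)


# Assumed figures: 4,000,000 records, 500 identifiers per page.
print(chain_depth(4_000_000, 500))           # 8000
print(reachable_records(4_000_000, 500, 5))  # 2500
```

Under these assumptions a crawler that stops after 5 levels reaches a tiny fraction of the archive.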
> Jeff and I discussed this question and came up with several possible
> solutions. All of them require some level of action on the data provider
> side. We hope the final solution is "general", which is why I am posting
> it to the list.
> 1) Create many small bins based on timestamp and sets. DP9, of course,
> must be intelligent enough to do the correct split. This can probably be
> done by a partial pre-harvest. This solution should work for most
> data providers, but it fails if a data provider has a large number of
> records with the same datestamp.
> pro) no change in OAI spec.
> con) data provider should not have a large number of records with the
> same datestamp.
> (snippet from Jeff's email)
> 2) Create a new verb named ListResumptionTokens that
> returns a complete set of stateless resumptionTokens.
> pro) easy to implement on the DP9 side.
> con) requires modification of the OAI spec.
> 3) Define a new <description> element to be returned in the Identify
> response that provides the information DP9 needs to automatically
> generate stateless resumptionTokens.
> pro) doesn't require any modification of OAI spec.
> con) requires individual repositories to voluntarily provide this
> information.
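
The binning in option 1 could be sketched as a recursive bisection of the datestamp range. This is only one possible way to "do the correct split", not something the message prescribes. The `count_records` callable is hypothetical; it stands in for whatever a partial pre-harvest would report for a from/until range. A bin is flagged as oversized when a single day still holds too many records, which is the failure case noted above.

```python
from datetime import date, timedelta


def split_into_bins(start, end, count_records, max_per_bin):
    """Split the inclusive day range [start, end] into small harvest bins.

    Returns a list of (from_date, until_date, oversized) tuples.
    count_records(from_date, until_date) is a hypothetical callable.
    """
    n = count_records(start, end)
    if n == 0:
        return []  # nothing to harvest in this range
    if n <= max_per_bin or start == end:
        # Either small enough, or a single day that cannot be split further.
        return [(start, end, n > max_per_bin)]
    middle = start + timedelta(days=(end - start).days // 2)
    left = split_into_bins(start, middle, count_records, max_per_bin)
    right = split_into_bins(middle + timedelta(days=1), end,
                            count_records, max_per_bin)
    return left + right
```

Each non-oversized bin maps to one from/until request that fits on a single page, so the crawler never has to follow a resumptionToken chain for it.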
> Probably somebody can come up with a better idea, and we may reach a
> consensus on which way to go ;-) Please send to the list if you
> have any input.
> best regards,