[OAI-implementers] Re: OAI sets as new instances (Sets Proposal (from DLF))

Fri Apr 22 16:45:18 EDT 2005

[Future posts will be going to the oai-implementers list.]

Young,Jeff (OR) wrote:

> The choices hinge on the possibility of changing the OAI specification.
> If we assume the spec won't change anytime soon, then Rob's solution is
> the only feasible solution available within those confines. Since the
> problem is of concern to some people, Rob's solution seems worthy of
> mention as a best practice under the current spec.
> 
> If we allow ourselves to speculate about potential changes to the
> specification, then perhaps we could simply change the definition of
> status="deleted" to apply at the record level rather than the item
> level. There may have been a reason we defined it at the item level, but
> I'm not sure what it was, other than to simplify the implementation.
> 
> Jeff

I know it was rather presumptuous to suggest changes to the protocol. :-)

Regarding status="deleted", I am pretty sure that the spec currently 
puts this at the record level and not the item level.  A suggested 
change to the protocol might be to define sets as containing records 
instead of containing items.  However, I don't think this would fix the 
problem of how to signal when a record has been moved out of a set.

I'll also admit to playing devil's advocate somewhat, because I kind of 
like Rob's solution, but I can't shake some misgivings that I am having 
a hard time articulating.  Perhaps the problem is that there are several 
different issues with sets, and I'm not sure which of these we are 
really trying to address.

1) The tendency of people to misunderstand sets as a sort of poor man's 
search.

2) Technical issues relating to how to signal that a record has been 
moved out of a set, but has not been deleted from the repository.

3) How best to describe a set: there is a technical description, such as 
how many items are in the set and what the update frequency is.  There 
is also the conceptual description, such as that the records in this set 
are all described by this subject heading, or they all belong to this 
"collection," or they all have this publishing status.

4) Issues such as whether it's a good idea to have overlapping sets, flat 
sets, hierarchical sets, and in which circumstances.

5) Variations in how different implementers have interpreted the OAI 
"data model".
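
Issue 2 can be made concrete with a rough simulation.  This is only a
sketch: the Item class and list_records function are hypothetical
stand-ins for a repository and a selective ListRecords harvest, times
are plain integers, and it assumes the repository bumps the datestamp
when set membership changes.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    identifier: str
    datestamp: int                      # logical time of the last change
    sets: set = field(default_factory=set)
    deleted: bool = False


def list_records(repo, set_spec, from_time):
    """Hypothetical selective harvest: items currently in set_spec
    whose datestamp is at or after from_time."""
    return [
        (item.identifier, "deleted" if item.deleted else "active")
        for item in repo
        if set_spec in item.sets and item.datestamp >= from_time
    ]


repo = [Item("oai:foo.oclc.org:123", datestamp=1, sets={"foo", "bar"})]

# Time 1: service provider A harvests "foo" from scratch.
harvested = dict(list_records(repo, "foo", from_time=0))

# Time 2: the record is removed from "foo" but stays in "bar".
# Assumption: the repository updates the datestamp on this change.
repo[0].sets.discard("foo")
repo[0].datestamp = 2

# Time 3: incremental harvest of "foo" since Time 2.
update = list_records(repo, "foo", from_time=2)
harvested.update(update)

print(update)      # nothing is returned for the removed record
print(harvested)   # the harvester's copy is still marked active
```

The incremental response is empty, so the harvester keeps a stale copy
unless it reharvests the whole set.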

Briefly some of my misgivings:

Does Rob's model place an excessive burden on data providers, or service 
providers?

Does it fundamentally alter the underlying data model of OAI, for better 
or worse?  Previously, I think that items belonged to one or more sets, 
and records were disseminations of these items in a specific format.  I 
think Rob's model alters this to something like records being 
disseminations of items within the context of those items being 
contained in a particular set.  In other words, the oai_dc record in set 
A could be different than the oai_dc record in set B, for the same item; 
they could have different datestamps, and different delete statuses.  I 
might try to describe this better with a little relation diagram later.
But the issue is: do we want to encourage this sort of model?
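
Short of a relation diagram, the difference between the two readings can
be sketched as the key that identifies a record.  All class names here
are hypothetical, the 2005-04-22 datestamp is purely illustrative, and
the first model is only one reading of the current data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordKey:
    """Reading of the current model: a record is an item disseminated
    in one metadata format; set membership belongs to the item."""
    identifier: str
    metadata_prefix: str


@dataclass(frozen=True)
class ContextualRecordKey:
    """Reading of Rob's model: the set (here, the baseURL it lives
    under) becomes part of what identifies the record."""
    identifier: str
    metadata_prefix: str
    set_spec: str


@dataclass
class RecordState:
    datestamp: str
    deleted: bool = False


ITEM = "oai:foo.oclc.org:123"

# One oai_dc record per item, shared by every set the item is in.
current_model = {
    RecordKey(ITEM, "oai_dc"): RecordState("2005-01-01"),
}

# One oai_dc record per (item, set) pair, each with its own state.
decoupled_model = {
    ContextualRecordKey(ITEM, "oai_dc", "foo"): RecordState("2005-04-22", deleted=True),
    ContextualRecordKey(ITEM, "oai_dc", "bar"): RecordState("2005-01-01"),
}


def states_for(model, identifier, metadata_prefix):
    """Collect every record state held for one item in one format."""
    return [
        state
        for key, state in model.items()
        if key.identifier == identifier and key.metadata_prefix == metadata_prefix
    ]


print(len(states_for(current_model, ITEM, "oai_dc")))
print(len(states_for(decoupled_model, ITEM, "oai_dc")))
```

Under the second keying the same item in the same format can carry two
datestamps and two delete statuses at once, which is the change in the
data model in question.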

Been rambling too long.  Curious as to what others think.

Tom
> 
> 
>>-----Original Message-----
>>From: Thomas G. Habing [mailto:thabing at uiuc.edu]
>>Sent: Friday, April 22, 2005 12:22 PM
>>To: Young,Jeff (OR)
>>Cc: Dr Robert Sanderson; LeVan,Ralph; Hickey,Thom; sshreeve at uiuc.edu;
>>khage at umich.edu; jewelw at usc.edu
>>Subject: Re: OAI sets as new instances
>>
>>Young,Jeff (OR) wrote:
>>
>>
>>>Hi Tom,
>>>
>>>
>>>
>>>>As much as I like the RESTiness of this idea, I'm not sure if it really
>>>>solves the problems associated with sets, except possibly the problem
>>>>with the inclination to use sets as a way to sneak searching into the
>>>>protocol :-)
>>>
>>>
>>>I believe it does solve the problem where service providers are
>>>currently required to reharvest from scratch periodically. Here's a
>>>detailed scenario:
>>>
>>>Time = 1:
>>>- oai:foo.oclc.org:123 exists in setSpec "foo" and "bar"
>>>- Service provider A harvests "foo"
>>>
>>>Time = 2:
>>>Record is removed from "foo" but not "bar"
>>>
>>>Time = 3:
>>>Service provider A harvests "foo"
>>>
>>>Conventional monolithic model:
>>>Since status="deleted" is an item level flag, there is nothing in the
>>>incremental harvest to indicate that the record is no longer relevant to
>>>the "foo" client.
>>>
>>>Robert's decoupled model:
>>>Since the sets are spread across different baseURLs, the data provider
>>>is free to flag the "foo" record as deleted without compromising the
>>>item's continued existence in "bar" since it exists under a different
>>>baseURL.
>>>
>>>In applications where accuracy and currency are important, this is much
>>>better than periodic reharvesting from scratch.
>>>
>>>Jeff
>>>
>>
>>I agree that it does address the problem of records moving between sets,
>>but it seems like there may be easier solutions to the problem, such as
>>adding a new optional status, for example:
>>
>>Time 1: Service provider A harvests "foo"
>>
>>    <record>
>>     <header>
>>       <identifier>oai:foo.oclc.org:123</identifier>
>>       <datestamp>2005-01-01</datestamp>
>>       <setSpec>foo</setSpec>
>>       <setSpec>bar</setSpec>
>>     </header>
>>     <metadata>
>>       ...
>>     </metadata>
>>    </record>
>>
>>Time 2: Record is removed from "foo" but not "bar"
>>
>>Time 3: Service provider A harvests "foo"
>>
>>    <record>
>>     <header status="moved">
>>       <identifier>oai:foo.oclc.org:123</identifier>
>>       <datestamp>[Time 2]</datestamp>
>>       <setSpec>bar</setSpec>
>>     </header>
>>    </record>
>>
>>Repositories could signal their support of the "moved" status the same
>>way they do for the "deleted" status in the Identify response:
>>
>><movedRecord>no|transient|persistent</movedRecord>
>>
>>If a repository supported transient or persistent "moved" statuses, they
>>would need to keep track of when a record is moved out of a set.  A
>>harvest of that set would return record headers with a status of "moved"
>>for those records which have been moved within the date range of the
>>harvest.  The datestamp shown in these headers would reflect the date on
>>which the record was moved, not the date on which it was last modified.
>>The real datestamp of the actual record would remain unchanged.  This
>>also has the advantage of showing where the record moved to because of
>>the setSpec elements in the header.  It also has the advantage of
>>differentiating between a record that has just been moved around and one
>>which has really been deleted.
>>
>>For data providers the implementation cost for this would probably be
>>about the same as for the different baseURLs.  Not sure how it might
>>complicate harvester implementations, probably not too much, and I
>>suspect less than the different baseURL approach.
>>
>>Kind regards,
>>
>>Tom
> 
>
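
The "moved" status proposal quoted above could be sketched on the data
provider side roughly as follows.  Nothing here is part of the OAI
protocol: MoveEvent, Item, and list_headers are hypothetical names,
times are plain integers, and the only bookkeeping assumed is a log of
when each record left a set.

```python
from dataclasses import dataclass


@dataclass
class Item:
    identifier: str
    datestamp: int          # last modification of the record itself
    sets: frozenset         # sets the item currently belongs to


@dataclass
class MoveEvent:
    identifier: str
    from_set: str           # the set the record was moved out of
    moved_at: int           # when the move happened


def list_headers(items, moves, set_spec, from_time, until_time):
    """Hypothetical selective harvest of headers for one set."""
    headers = []
    # Records currently in the set and modified within the date range.
    for item in items:
        if set_spec in item.sets and from_time <= item.datestamp <= until_time:
            headers.append({"identifier": item.identifier,
                            "datestamp": item.datestamp,
                            "setSpec": sorted(item.sets),
                            "status": None})
    # Records moved out of the set within the date range.
    current = {item.identifier: item for item in items}
    for move in moves:
        item = current.get(move.identifier)
        if item is None or move.from_set != set_spec:
            continue
        if set_spec in item.sets:
            continue        # the record has since been moved back
        if from_time <= move.moved_at <= until_time:
            headers.append({"identifier": item.identifier,
                            "datestamp": move.moved_at,   # date of the move
                            "setSpec": sorted(item.sets), # where it lives now
                            "status": "moved"})
    return headers


items = [Item("oai:foo.oclc.org:123", datestamp=1, sets=frozenset({"bar"}))]
moves = [MoveEvent("oai:foo.oclc.org:123", from_set="foo", moved_at=2)]

foo_update = list_headers(items, moves, "foo", from_time=2, until_time=3)
bar_update = list_headers(items, moves, "bar", from_time=2, until_time=3)
print(foo_update)
print(bar_update)
```

The "foo" harvester receives a "moved" header stamped with the time of
the move, while the "bar" harvester sees nothing new because the
record's own datestamp is unchanged.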