[OAI-implementers] Some OAI-PMH protocol issues

Fri Dec 7 11:08:12 EST 2007

Frederic,

Les, Muriel, and Jesus are right.  These are not issues with the protocol, but with how it is used and the metadata practices behind the records.  And none of them are new.  They are HARD problems. Much as one might hope for a consensus, attempts to develop stricter rules for metadata across all communities have failed repeatedly.  There is a reason that the "simple" Dublin Core elements are all optional and repeatable.  Within communities, progress has been made.

Muriel mentioned the DLF/NSDL Best Practices for Shareable Metadata.  Here are links.
  http://www.diglib.org/architectures/oai/imls2004/training/MetadataFinal.pdf
  http://webservices.itcs.umich.edu/mediawiki/oaibp/?MetadataContent

In the interest of full disclosure, I should say that both Muriel and I were involved in this activity.  I believe it distills wisdom gleaned from a great deal of experience. Given your topical interest, the NSDL Metadata Primer from the National Science Digital Library (a project funded by the US National Science Foundation) may be relevant.
http://metamanagement.comm.nsdlib.org/outline.html

It is worth continuing to highlight these issues in order to promote better practices.

On the issue of TYPE: 
All guidelines I have seen recommend at least one "type" value from a controlled list, but this is an area where communities and service providers may have very different ideas of what is needed beyond that.  For example, the DRIVER guidelines assume the objects described are textual and provide a short list of more specific terms considered adequate for the objects in scope.  The NSDL guideline is to include one term from the DCMI type vocabulary plus more specific terms as appropriate.  What the Library of Congress tries to do with its digitized historical material in American Memory follows that practice: assign both a high-level type (e.g. text, still image, as in the DCMI type vocabulary) and something more specific.  Various vocabularies are used for the more specific terms, from the Thesaurus for Graphic Materials (http://lcweb2.loc.gov/pp/tgmhtml/tgmabt.html), with 650 terms for types of visual materials, to Basic Genre Terms for Cultural Heritage Materials (http://memory.loc.gov/ammem/techdocs/genre.html), which is a small, somewhat ad hoc set of terms based on the content in American Memory (much of which comes from collections of personal or organizational papers and ethnographic folklife collections).

The ePrints Type vocabulary [http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Type_Vocabulary_Encoding_Scheme] is another vocabulary with a focussed (but very widely applicable) scope.
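
To make the "high-level type plus something more specific" practice concrete from the harvester side, here is a minimal sketch in Python of mapping free-text dc:type values onto DCMI Type Vocabulary terms by keyword.  The keyword table, the function name and the "Unknown" bucket are my own assumptions for illustration, not part of any guideline mentioned above; a real service would tune the table to the repositories it harvests.

```python
import re

# Keyword -> DCMI Type Vocabulary term.  The keywords on the left are an
# illustrative assumption, not a published mapping.
KEYWORD_TO_DCMI = {
    "article": "Text",
    "artjournal": "Text",
    "publication": "Text",
    "text": "Text",
    "thesis": "Text",
    "photograph": "StillImage",
    "image": "Image",
    "video": "MovingImage",
    "dataset": "Dataset",
}

UNKNOWN = "Unknown"  # assumed catch-all bucket; not a DCMI term


def normalize_type(values):
    """Return the set of DCMI types recognised in a record's dc:type values.

    Records with no recognisable type land in the UNKNOWN bucket instead of
    being dropped, so a type filter can still offer them to the user.
    """
    found = set()
    for value in values:
        # Split on anything that is not a letter, so "Journal Article" and
        # "info:eu-repo/semantics/article" both yield the token "article".
        for token in re.findall(r"[a-z]+", value.lower()):
            term = KEYWORD_TO_DCMI.get(token)
            if term:
                found.add(term)
    return found or {UNKNOWN}
```

With this table, normalize_type(["Journal Article"]) gives {"Text"} and a record with no dc:type at all gives {"Unknown"}, which a service could expose as an "unspecified type" facet rather than silently excluding the record.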

Service providers will have to make decisions about things like: (a) how much work to do with the records they harvest to improve the situation; (b) whether to try and filter out duplicates; (c) when to throw records out; and (d) whether to work with the organizations they most want to harvest from, either one-on-one or to make some community guidelines that improve the situation.  

You might want to see if you can find out what strategies other service providers in similar domains have used for dealing with Type and Date. 

Caroline Arms
Library of Congress, Office of Strategic Initiatives
caar at loc.gov

== Opinions expressed here are my own, not those of the Library of Congress ==

PS  FWIW, a book chapter my husband and I wrote for Metadata in Practice, edited by Diane Hillmann and Elaine Westbrooks, and published by  ALA Editions in 2004, attempts to explain why heterogeneous metadata is a fact of life.
  Mixed Content and Mixed Metadata: Information Discovery in a Messy World
  http://www.cs.cornell.edu/wya/papers/ALA-2003.php

>>> Leslie Carr <lac at ecs.soton.ac.uk> 12/07/07 8:46 AM >>>
I don't think that these are OAI-PMH data transport issues, but  
application issues that arise from the interpretation and usage of  
metadata from heterogeneous data providers.

Managing duplicates: while it makes life inconvenient for your service  
users, it is not something that can in general be controlled by  
independent data providers. I'm afraid that your service must have the  
ability to reconcile duplicate (or near duplicate) items.
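
As one possible starting point for that reconciliation, here is a minimal sketch in Python that groups harvested records by a normalised title plus year.  The record layout (dicts with 'id', 'title' and 'year' keys) and the function names are assumptions made for the example; matching on title and year alone is a crude heuristic that will miss near duplicates and may merge distinct items, so treat it as a first pass only.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title):
    """Lower-case a title, strip accents and punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", title)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"[a-z0-9]+", unaccented.lower()))


def group_duplicates(records):
    """Return lists of record ids that share a normalised title and year.

    `records` is an iterable of dicts with the assumed keys
    'id', 'title' and 'year'.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(title_key(rec["title"]), rec.get("year"))].append(rec["id"])
    # Only groups with more than one member are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

A service would then decide, per group, which copy to display and whether to show the others as alternative locations.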

Controlled vocabularies for types: you will need to make a  
recommendation and try to gain support among the OAI data providers. I  
am not sure that the selection that you propose will work in practice.  
Is a PDF a text if it has embedded images? Is an image a text if it is  
an OCR scan?

Enforced publication dates: what do you enforce if the item has never  
been officially published?
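
If dates cannot be enforced, the service can at least avoid hiding undated records.  Here is a minimal sketch in Python, assuming each record is a dict with a 'dates' list holding its raw dc:date strings (the layout, the names and the accepted year range are all hypothetical): it pulls a four-digit year out of whatever string is present and lets the period filter keep undated records instead of excluding them.

```python
import re

# A standalone four-digit year between 1500 and 2099 (range is an assumption).
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def extract_year(values):
    """Return the first plausible year found in dc:date values, else None."""
    for value in values:
        match = YEAR.search(value)
        if match:
            return int(match.group(1))
    return None


def filter_by_period(records, start, end, include_undated=True):
    """Keep records dated within [start, end]; optionally keep undated ones."""
    kept = []
    for rec in records:
        year = extract_year(rec.get("dates", []))
        if year is None:
            if include_undated:
                kept.append(rec)
        elif start <= year <= end:
            kept.append(rec)
    return kept
```

Whether undated records should be included by default is a policy choice for each service; the flag only makes that choice explicit.
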
--
Les Carr

On 5 Dec 2007, at 09:40, Frederic MERCEUR wrote:

> Hello,
>
> Further to the previous email I sent about the document we drafted
> to assess the main difficulties met during the first year of
> management of our Avano harvester, I would like to focus, in this
> email, on just 3 problems linked to the OAI-PMH protocol, Dublin
> Core or to repository implementations. I would like to focus
> particularly on these 3 problems because I guess they should not be
> so difficult to fix.
>
>
> Managing duplicates
>
> Too many duplicates in a harvester's result list can affect
> the user’s comfort. This is not the main problem harvesters are
> facing today, but it is likely to grow in the coming years. Today,
> at least two phenomena can generate duplicates in harvesters’
> databases:
> - Several research organisations or universities can record the same
>   electronic resource in their own institutional repositories. If
>   Avano harvests those repositories, it will get several descriptive
>   records for the same resource, stored in several places. This can
>   happen if, for example, a publication is written in collaboration
>   between several institutions. If so, this publication may be
>   archived on the server of each institution. Considering the current
>   low self-archiving rate, especially in life sciences, this
>   phenomenon is not the main cause of duplicates.
> - Projects for national or thematic aggregators can pose a problem.
>   In some countries, merged institutional repository projects
>   aggregate records from a selection of repositories in a centralised
>   database and then expose them again via OAI-PMH from their own
>   server. As a consequence, records referenced on those servers are
>   exposed twice via OAI-PMH: via the institutional repository and via
>   the centralised database. If the manager of a harvester does not
>   know about the architecture of those national or thematic projects,
>   he may register both servers and generate duplicates in
>   his harvester’s result lists.
> To help harvester administrators avoid registering repositories
> that generate duplicates, could we imagine adding to the description
> of the repository information about the involvement of that
> repository in a national or thematic aggregation system that would
> re-expose the records via OAI-PMH from a different server?
>
>
> Managing Type and Date field
>
> As far as I understand, in order to comply with the OAI-PMH
> protocol, repositories have to expose their data in the
> non-qualified Dublin Core DTD. In this DTD all fields are optional.
> Those fields are also non-qualified, meaning, for example, that they
> do not have to correspond to a closed value list. The optional
> and non-formalised nature of this information raises several issues,
> especially for the Type field.
>
> Indeed, even if the Dublin Core DTD recommends storing the Type
> information by using standardised text strings, few repositories
> take this into consideration and most still present the information
> as free text (e.g. publication, artjournal, text, article are used to
> describe an article). Some harvesters, including Avano, let their
> users limit their search to one or several types of resources. To
> set up this filter, harvesters try to standardise the Type field
> using a system based on keyword recognition in this character
> string. This standardisation is therefore imperfect, and the filter
> may wrongly exclude resources from the result list when a user
> narrows his search to one or several specific types. Some of the
> information contained in this Type field cannot be standardised.
>
> Even more problematic is the fact that some repositories do not fill
> in this field. As an example, in September 2007, out of the 107,000
> records available in Avano, more than 26,000 did not have a Type
> field. All of those records are automatically excluded from the search
> space if a user limits his search to one or several selected types.
>
> Would it be possible to imagine a new normalised and mandatory
> piece of information about the type of the digital object (text,
> image, video, ...) so that harvesters could offer a reliable option
> to filter on one or several types of objects in the end-user search?
>
> The publication date is also problematic for harvesters. For example,
> in September 2007, out of the 107,000 records available in Avano,
> about 15,000 did not have a publication date. When a record does not
> have a publication date or when it cannot be standardised, it is
> automatically placed at the end of the list if the user wants the
> results to be sorted by date. In the same way, when a user limits
> his search to a specific period of time (see fig. 9), those records
> are excluded from the search even if they fall within the specified
> period.
>
> But I guess this problem with the publication date will be more  
> difficult to fix because it is difficult to define it as mandatory.
>
>
> Records without free access to the digital object
>
> As far as I understand, the OAI-PMH protocol defines only the
> process for sharing the bibliographic records contained in a group of
> repositories. As a consequence, some repositories mix records
> without links to the digital object together with records providing
> free access to the resource. Others provide records with paying
> access (e.g. BePress) or records with restricted access, for
> example, for university staff only.
>
> In my opinion, this is the major problem harvesters have to face  
> today. There is no indication in the Dublin Core DTD showing the  
> harvesters the degree of accessibility of the objects described in  
> the records. As a consequence, harvesters cannot pass on this  
> information to their users or provide them with the ability to  
> filter empty records or records offering paying access to the  
> resource.
>
> It is my opinion that hiding records with free full text among  
> records with inaccessible full text is not helpful. For lack of time  
> and/or interest, scientists are reluctant to join the Open Access  
> movement and the archiving rate of free access publications stays  
> very low, especially in life sciences. Free and immediate access to  
> documentation is, without doubt, the best way to convince the  
> scientists of the interest of the Open Access movement. And drowning  
> a minority of records providing free access publications in an ocean  
> of records without link to the full text and/or records offering  
> paying access to the documents may not be the best way to promote  
> the Open Access movement.
>
> Again, those records without free access to the full text would not
> be a problem for the harvesters if the Dublin Core DTD made it
> possible to indicate to harvesters the degree of accessibility of
> the objects described in the records. Harvesters could then provide
> their users with the possibility of filtering out the records without
> free access to the digital object. But this is still not the case.
>
> Could we then imagine that, in a possible future version of
> OAI-PMH, each record will have to provide normalised and mandatory
> information about the degree of accessibility of the digital object
> (free, paying, impossible, restricted,...)? This would greatly help
> harvesters to provide a better service to their end-users.
>
>
> What do you think?
>
> Kind regards,
> Fred
>
> -- 
> Fred Merceur
> Ifremer / Bibliothèque La Pérouse
> frederic.merceur at ifremer.fr
> Tél : 02-98-49-88-69
> Fax : 02-98-49-88-84
> Bibliothèque La Pérouse
> Archimer, Ifremer's Institutional Repository
> Avano, a marine and aquatic OAI harvester