[OAI-implementers] Some OAI-PMH protocol issues

Fri Dec 7 07:37:08 EST 2007

Hi Frederic,

See some answers below

Frederic MERCEUR wrote:
> Hello,
>
> Further to the previous email I sent about the document 
> <http://www.ifremer.fr/docelec/doc/2007/acte-3238.pdf> we wrote to
> assess the main difficulties met during the first year of management 
> of our Avano <http://www.ifremer.fr/avano/> harvester, I would like to 
> focus, in this email, on just 3 problems linked to the OAI-PMH 
> protocol, Dublin Core or repository implementations. I would like
> to focus particularly on these 3 problems because I guess they should 
> not be so difficult to fix.
>
>
> *Managing duplicates*
>
> Too many duplicates in a harvester’s result list can affect the
> user’s comfort. This is not the main problem harvesters are facing
> today, but it is likely to grow in the coming years. Today, at least
> two phenomena can generate duplicates in the harvesters’ databases:
>
>     * Several research organisations or universities can record the
>       same electronic resource in their own institutional repository.
>       If Avano harvests those repositories, it will get descriptive
>       index files of the same topic stored in several places. This can
>       happen if, for example, a publication is written in
>       collaboration with several institutions. If so, this publication
>       may be archived on the server of each institution. Considering
>       the current low auto-archiving rate, especially in life
>       sciences, this phenomenon is not the main cause of the
>       production of duplicates.
>     * Projects for national or thematic aggregators can pose a problem.
>       In some countries, projects that merge institutional repositories
>       aggregate records from a selection of repositories in a
>       centralised database before exposing them again in OAI-PMH on
>       their own server. As a consequence, records referenced on those
>       servers are exposed twice in OAI-PMH: via the institutional
>       repository and via the centralised database. If the manager of
>       a harvester does not know about the architecture of those
>       national or thematic projects, he may register the two different
>       servers and generate duplicates in his harvester’s result lists.
>
> /To help harvester administrators avoid registering repositories
> that generate duplicates, could we imagine adding to the description of
> the repository information about the involvement of the said
> repository in a national or thematic aggregation system that would
> re-expose the records in OAI-PMH from a different server?
> /
 >>>>> There is a way to indicate potential overlaps in the repository
description, as free text for instance. HAL is in this situation
(documents located in 2 different repositories); that is the solution being
considered, besides contacting service providers (SPs) and telling them :-)
Any better solution is welcome.
If records are harvested and then re-exposed, the About section
of each individual record should indicate the provenance of the original
record. This should allow deduplication. In practice, though, I'm not
certain SPs do that very often.
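
To make the provenance idea concrete, here is a minimal sketch of how a
harvester might use it, assuming the aggregator fills in the optional OAI
provenance container in the About section. The sample records, the
example.org identifiers and the helper names origin_identifier and
deduplicate are invented for illustration; this is not how any particular
harvester does it.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
PROV = "http://www.openarchives.org/OAI/2.0/provenance"

# Invented record, as an aggregator might re-expose it with provenance.
SAMPLE = f"""
<record xmlns="{OAI}">
  <header>
    <identifier>oai:aggregator.example.org:42</identifier>
    <datestamp>2007-11-02</datestamp>
  </header>
  <about>
    <provenance xmlns="{PROV}">
      <originDescription harvestDate="2007-11-01T10:00:00Z" altered="false">
        <baseURL>http://repo.example.org/oai</baseURL>
        <identifier>oai:repo.example.org:7</identifier>
        <datestamp>2007-10-15</datestamp>
        <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
      </originDescription>
    </provenance>
  </about>
</record>
"""


def origin_identifier(record):
    """Return the identifier of the original record (hypothetical helper)."""
    origins = record.findall(f".//{{{PROV}}}originDescription")
    if origins:
        # Nested originDescription elements describe earlier hops, so the
        # last one in document order points at the original source.
        return origins[-1].findtext(f"{{{PROV}}}identifier")
    # No provenance: the record's own identifier is the best key we have.
    return record.findtext(f"{{{OAI}}}header/{{{OAI}}}identifier")


def deduplicate(records):
    """Keep the first record seen for each origin identifier."""
    seen, unique = set(), []
    for rec in records:
        key = origin_identifier(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


aggregated = ET.fromstring(SAMPLE)
original = ET.fromstring(
    f'<record xmlns="{OAI}"><header>'
    "<identifier>oai:repo.example.org:7</identifier>"
    "</header></record>"
)
print(len(deduplicate([original, aggregated])))
```

This only works when the re-exposing server actually provides the
provenance container, which, as said, is not guaranteed in practice.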

>
> *Managing Type and Date fields*
>
> As far as I understand, in order to comply with the OAI-PMH protocol, 
> repositories have to expose their data in the non-qualified Dublin 
> Core DTD. In this DTD all fields are optional. Those fields are also
> non-qualified, meaning, for example, that their values do not have to
> come from a closed value list. Because the information is optional and
> non-formalised, several issues arise, especially for the Type field.
>
> Indeed, even though the Dublin Core DTD recommends storing the Type
> information using standardised text strings, few repositories take
> this into consideration; most still present the information as free text
> (e.g. publication, artjournal, text and article are all used to describe
> an article). Some harvesters, including Avano, let their users limit
> their search to one or several types of resources. To set up this
> filter, harvesters try to standardise the Type field using a system
> based on keyword recognition in this character string. This
> standardisation is therefore imperfect, and the filter may exclude
> relevant resources from the result list when a user narrows his search
> to one or several specific types. Some of the information contained in
> this Type field cannot be standardised.
>
> Even more problematic is the fact that some repositories do not fill 
> in this field. As an example, in September 2007, out of the 107,000
> records available in Avano, more than 26,000 did not have a Type
> field. All of those records are automatically excluded from the search
> space if a user limits his search to one or several selected types.
>
> /Could we imagine introducing a new normalised and
> mandatory piece of information about the type of the digital object (text,
> image, video...) so that harvesters could offer a reliable option to filter
> one or several types of objects in the end-user search?
> /
 >>>>>>>> Agreed with Jesús L. Domínguez. That is a job for guidelines. The
DRIVER guidelines are a first step for scholarly information. See also the DLF
best practices for shareable metadata. Some harvesters also try to derive a
type from other information in the records or documents themselves.
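
As an illustration of the keyword approach and of deriving a type from other
fields, here is a rough sketch. It assumes the target terms come from the
DCMI Type Vocabulary and that dc:format, when present, holds a MIME type;
the keyword table, the MIME fallback and the function name guess_type are
invented for the example, and any real table would need tuning against
actual harvested data.

```python
import re

# Hypothetical keyword table: free-text fragments mapped to DCMI Type terms.
# Order matters: "MovingImage" must be tested before the broader "image".
KEYWORDS = [
    (r"article|publication|journal|thesis|report|text|preprint", "Text"),
    (r"video|movie|film|movingimage", "MovingImage"),
    (r"image|photo|picture", "Image"),
    (r"data", "Dataset"),
    (r"audio|sound", "Sound"),
]

# Hypothetical fallback from the leading part of a MIME type in dc:format.
MIME_PREFIXES = {
    "image/": "Image",
    "video/": "MovingImage",
    "audio/": "Sound",
    "text/": "Text",
}


def guess_type(dc_types, dc_formats=()):
    """Return a normalised type, or None when nothing can be derived."""
    # First try keyword recognition on the dc:type values themselves.
    for value in dc_types:
        for pattern, term in KEYWORDS:
            if re.search(pattern, value, re.IGNORECASE):
                return term
    # Otherwise fall back on dc:format, treating PDF as text.
    for fmt in dc_formats:
        fmt = fmt.strip().lower()
        if fmt == "application/pdf":
            return "Text"
        for prefix, term in MIME_PREFIXES.items():
            if fmt.startswith(prefix):
                return term
    return None


for raw in (["publication"], ["artjournal"], ["text"], ["article"]):
    print(raw, "->", guess_type(raw))
print(guess_type([], ["application/pdf"]))
```

Records for which guess_type returns None could be kept in an "unknown type"
bucket rather than silently dropped from filtered searches.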
> The publication date is also problematic for harvesters. For example,
> in September 2007, out of the 107,000 records available in Avano,
> about 15,000 did not have a publication date. When a record does not
> have a publication date, or when the date cannot be standardised, it is
> automatically placed at the end of the list if the user wants the
> results to be sorted by date. In the same way, when a user limits his
> search to a specific period of time (see fig. 9), those records are
> excluded from the search even if they correspond to the specified search.
>
> But I guess this problem with the publication date will be more 
> difficult to fix because it is difficult to define it as mandatory.
 >>>>> Same as above
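
On the harvester side, the behaviour described for dates can be sketched as
follows. This is only an illustration under stated assumptions: records are
represented as plain dicts with a "dates" list (an invented structure), a
date counts as standardisable if it contains a four-digit year between 1500
and 2099, and the keep_undated option is a hypothetical way to let users
decide whether undated records are excluded from a period search.

```python
import re

# A four-digit year (1500-2099) that is not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def extract_year(dc_dates):
    """Return the first recognisable year in the dc:date values, else None."""
    for value in dc_dates:
        match = YEAR.search(value)
        if match:
            return int(match.group(1))
    return None


def sort_by_date(records, newest_first=True):
    """Sort dated records by year and append undated ones at the end."""
    dated = [r for r in records if extract_year(r["dates"]) is not None]
    undated = [r for r in records if extract_year(r["dates"]) is None]
    dated.sort(key=lambda r: extract_year(r["dates"]), reverse=newest_first)
    return dated + undated


def in_period(record, start, end, keep_undated=False):
    """Period filter; undated records are kept only if explicitly requested."""
    year = extract_year(record["dates"])
    if year is None:
        return keep_undated
    return start <= year <= end


records = [
    {"id": "a", "dates": ["n.d."]},
    {"id": "b", "dates": ["2005-03-01"]},
    {"id": "c", "dates": ["Sept. 2007"]},
]
print([r["id"] for r in sort_by_date(records)])
print([r["id"] for r in records if in_period(r, 2006, 2007)])
```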
>
>
> *Records without free access to the digital object*
>
> As far as I understand, the OAI-PMH protocol defines only the sharing 
> process of bibliographical records contained in a group of 
> repositories. As a consequence, some repositories mix records without 
> links to the digital object together with records providing free 
> access to the resource. Others provide records with paid access (e.g.
> BePress) or records with restricted access, for example, for
> university staff.
>
> In my opinion, this is the major problem harvesters have to face 
> today. There is no indication in the Dublin Core DTD showing the 
> harvesters the degree of accessibility of the objects described in the 
> records. As a consequence, harvesters cannot pass on this information
> to their users or provide them with the ability to filter out empty
> records or records offering only paid access to the resource.
>
> It is my opinion that hiding records with free full text among records 
> with inaccessible full text is not helpful. For lack of time and/or 
> interest, scientists are reluctant to join the Open Access movement 
> and the archiving rate of free access publications stays very low, 
> especially in life sciences. Free and immediate access to 
> documentation is, without doubt, the best way to convince the 
> scientists of the value of the Open Access movement. And drowning a
> minority of records providing free access publications in an ocean of
> records without a link to the full text and/or records offering paid
> access to the documents may not be the best way to promote the Open
> Access movement.
>
> Again, those records without free access to the full text would not be
> a problem for harvesters if the Dublin Core DTD made it possible to tell
> harvesters the degree of accessibility of the objects described in
> the records. Harvesters could then give their users the
> possibility of filtering out the records without free access to the
> digital object. But this is still not the case.
>
> /Could we then imagine that, in a possible future version of
> OAI-PMH, each record will have to provide normalised and mandatory
> information about the degree of accessibility of the digital object
> (free, paid, impossible, restricted,...)? This would greatly help
> harvesters provide a better service to their end-users.
> /
 >>>>>> Guidelines again; that can be recorded in the DC:Rights field.
The DRIVER guidelines proposed the creation of specific sets for the case you
are mentioning. NEREUS people are enforcing the encoding of
ContextObjects in a QDC record to help with this. Not sure any of this is
perfect. Finally, there is always the possibility of asking scholarly
communication repositories to share richer metadata formats and/or to
indicate accessRights in links to different versions of digital objects.
That could be a job for guidelines, for ORE, or for other formats.
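
To show how a harvester could combine the set and DC:Rights options, here is
a small sketch. Everything specific in it is an assumption made for the
example: the setSpec value "open_access", the marker strings looked for in
dc:rights, the three labels returned, and the helper names make_record and
access_level. Real repositories would need to agree on such conventions
through guidelines before a filter like this became reliable.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical conventions; none of these values is mandated by OAI-PMH.
OPEN_SETS = {"open_access"}
CLOSED_MARKERS = ("restricted", "subscription", "closed")
OPEN_MARKERS = ("open access", "free access")


def make_record(set_spec="", rights=""):
    """Build a tiny invented oai_dc record for the demonstration."""
    return ET.fromstring(
        f'<record xmlns="{OAI}">'
        f"<header><setSpec>{set_spec}</setSpec></header>"
        f'<metadata><oai_dc:dc xmlns:oai_dc="{OAI_DC}" xmlns:dc="{DC}">'
        f"<dc:rights>{rights}</dc:rights>"
        "</oai_dc:dc></metadata></record>"
    )


def access_level(record):
    """Classify a record as 'free', 'restricted' or 'unknown'."""
    # Set membership is the most explicit signal, so it is checked first.
    set_specs = {e.text for e in record.iter(f"{{{OAI}}}setSpec")}
    if set_specs & OPEN_SETS:
        return "free"
    # Otherwise look for marker strings in the free-text dc:rights values.
    rights = " ".join(e.text or "" for e in record.iter(f"{{{DC}}}rights"))
    rights = rights.lower()
    if any(marker in rights for marker in CLOSED_MARKERS):
        return "restricted"
    if any(marker in rights for marker in OPEN_MARKERS):
        return "free"
    return "unknown"


print(access_level(make_record(set_spec="open_access")))
print(access_level(make_record(rights="Restricted to university staff")))
print(access_level(make_record()))
```

The "unknown" label matters here: with free-text rights statements, most
records would probably end up in that bucket, which is the weakness of
relying on DC:Rights alone.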
>
> What do you think?
>
> Kind regards,
> Fred
>
> -- 
> Fred Merceur
> Ifremer / Bibliothèque La Pérouse
> frederic.merceur at ifremer.fr
> Tél : 02-98-49-88-69
> Fax : 02-98-49-88-84
> Bibliothèque La Pérouse <http://www.ifremer.fr/blp/>
> Archimer, Ifremer's Institutional Repository 
> <http://www.ifremer.fr/docelec/>
> Avano, a marine and aquatic OAI harvester <http://www.ifremer.fr/avano/>
> ------------------------------------------------------------------------
>
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://www.openarchives.org/mailman/listinfo/oai-implementers
>
>   

-- 
Muriel Foulonneau
Centre pour la communication scientifique directe
Centre National de la recherche scientifique
IN2P3
12-14 bd Niels Boehr
69100 Villeurbanne
Tel: +33 (0)4 72 69 52 85
muriel.foulonneau at ccsd.cnrs.fr