<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=windows-1252"

 http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<small><font face="Helvetica, Arial, sans-serif">Dear Les,<br>

<br>

Thank you for your answer. It is clear that these are not really OAI-PMH
issues but rather application issues. <br>

<br>

However, regarding vocabularies for types, an extra field holding
one or several values selected from the <a
 href="http://dublincore.org/documents/dcmi-type-vocabulary/">DCMI Type
Vocabulary</a>, for example, would be very useful, even if it is true
that no recommendation will be perfect and there will always be
exceptions that do not fit a controlled list. <br>

<br>

For managing duplicates, I was really thinking about national
aggregators. The French archive HAL is a good example. You can get all
the records from the HAL OAI server, or you can get a part of them
through institutional repository servers such as the HAL-IN2P3 or
HAL-INSERM OAI servers. So if the manager of a harvester does not know
about the architecture of those national or thematic projects, he may
register both servers and generate duplicates in his
harvester’s result lists. The same applies to the BePress server. You can
load all institutional repositories managed by BePress separately or
all at once through the general BePress OAI server. I know there is
an ARROW project in Australia that harvests institutional repositories
and re-exposes records on its own server. Once again, adding to the
description of the repository some information about the involvement of
the said repository in a national or thematic aggregation system that
re-exposes the records in OAI-PMH from a different server could be useful
in some cases.<br>

<br>

And I agree there is no solution for dates…<br>

<br>

What do you think about the issue of records without free access to the
digital object? Could the addition of normalised, mandatory
information about the degree of accessibility of the digital object be
an OAI-PMH issue for a possible future version? <br>

<br>

Anyway, thanks again for your attention and for your answer.<br>

<br>

Kind regards,<br>

Fred</font></small><br>

<br>

<br>

Leslie Carr wrote:

<blockquote

 cite="midA96A46EF-DD76-4589-9207-D0FBCDECF5A6@ecs.soton.ac.uk"

 type="cite">I don't think that these are OAI-PMH data transport

issues, but application issues that arise from the interpretation and

usage of metadata from heterogeneous data providers.

  <div><br class="webkit-block-placeholder">

  </div>

  <div>Managing duplicates: while it makes life inconvenient for your

service users, it is not something that can in general be controlled by

independent data providers. I'm afraid that your service must have the

ability to reconcile duplicate (or near duplicate) items.</div>

  <div><br>

  </div>

  <div>Controlled vocabularies for types: you will need to make a

recommendation and try to gain support among the OAI data providers. I

am not sure that the selection that you propose will work in practice.

Is a PDF a text if it has embedded images? Is an image a text if it is

an OCR scan?</div>

  <div><br class="webkit-block-placeholder">

  </div>

  <div>Enforced publication dates: what do you enforce if the item has

never been officially published?</div>

  <div>--</div>

  <div>Les Carr</div>

  <div><br class="webkit-block-placeholder">

  </div>

  <div><br class="webkit-block-placeholder">

  </div>

  <div>

  <div>

  <div>

  <div>On 5 Dec 2007, at 09:40, Frederic MERCEUR wrote:</div>

  <br class="Apple-interchange-newline">

  <blockquote type="cite">

    <div bgcolor="#ffffff" text="#000000"> <small><font

 face="Helvetica, Arial, sans-serif">Hello, <br>

    <br>

Further to the previous email I sent about the <a
 href="http://www.ifremer.fr/docelec/doc/2007/acte-3238.pdf">document</a>
we wrote to assess the main difficulties encountered during the first year
of managing our <a href="http://www.ifremer.fr/avano/">Avano</a>
harvester, I would like to focus, in this email, on just 3 problems
linked to the OAI-PMH protocol, to Dublin Core or to the implementation
of repositories. I single out these 3 problems
because I guess they should not be so difficult to fix.   <br>

    <br>

    <br>

    <big><b>Managing duplicates </b></big><br>

    <br>

Too many duplicates in a harvester's result list can affect the
user’s comfort. This is not the main problem harvesters are facing
today, but it is likely to grow in the coming years. Today, at least
two phenomena can generate duplicates in the harvesters’ databases:  <br>

    </font></small>

    <ul>

      <li><small><font face="Helvetica, Arial, sans-serif">Several
research organisations or universities can record the same electronic
resource in their own institutional repositories. If Avano harvests those
repositories, it will get several descriptive records for the same
resource, stored in different places. This can happen if, for example, a
publication is written in collaboration between several institutions. If
so, this publication may be archived on the server of each institution.
Considering the current low self-archiving rate, especially in life
sciences, this phenomenon is not the main cause of
duplicates.</font></small></li>

      <li><small><font face="Helvetica, Arial, sans-serif">Projects for
national or thematic aggregators can pose a problem. In some countries,
projects merging institutional repositories aggregate records from
a selection of repositories in a centralised database before re-exposing
them in OAI-PMH on their own server. As a consequence, records
referenced on those servers are exposed twice in OAI-PMH: via the
institutional repository and via the centralised database. If the
manager of a harvester does not know about the architecture of those
national or thematic projects, he may register both servers
and generate duplicates in his harvester’s result lists.  </font></small></li>

    </ul>
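    <small><font face="Helvetica, Arial, sans-serif">On the harvester
side, both cases come down to recognising that two records describe the
same item. The following is only a minimal sketch in Python, not how Avano
works: it assumes each harvested record is a dict with hypothetical
<code>title</code> and <code>date</code> keys, and it groups records on a
normalised title plus year.<br>
    </font></small>
<pre>
```python
import re
import unicodedata
from collections import defaultdict


def dedup_key(record):
    """Build a comparison key (normalised title, year) from a record dict.

    The "title" and "date" keys are assumptions made for this sketch.
    """
    title = unicodedata.normalize("NFKD", record.get("title", ""))
    # Drop accents, then case and punctuation, so trivial variants collide.
    title = "".join(ch for ch in title if not unicodedata.combining(ch))
    title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    year = (record.get("date") or "")[:4]
    return (title, year)


def group_duplicates(records):
    """Group records sharing the same key; each group is one logical item."""
    groups = defaultdict(list)
    for record in records:
        groups[dedup_key(record)].append(record)
    return list(groups.values())
```
</pre>
    <small><font face="Helvetica, Arial, sans-serif">Records with an empty
title would all fall into the same group, so a real service would have to
set them aside before grouping.<br>
    </font></small>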

    <small><font face="Helvetica, Arial, sans-serif"><i>To help
harvester administrators avoid registering repositories that generate
duplicates, could we imagine adding to the description of the
repository some information about the involvement of the said repository in
a national or thematic aggregation system that re-exposes the
records in OAI-PMH from a different server? <br>

    </i><br>

    <br>

    <big><b>Managing the Type and Date fields</b></big><br>

    <br>

As far as I understand, in order to comply with the OAI-PMH protocol,
repositories have to expose their data in the non-qualified Dublin Core
DTD. In this DTD all fields are optional. Those fields are also
non-qualified, meaning, for example, that their values do not have to
come from a closed value list. The optional, non-formalised
nature of this information raises several issues, especially for the Type field.

    <br>

    <br>

Indeed, even if the Dublin Core DTD recommends storing the Type
information by using standardised text strings, few repositories take
this into consideration; most still present the information as free text
(e.g. publication, artjournal, text, article are all used to describe an
article). Some harvesters, including Avano, let their users limit
their search to one or several types of resources. To set up this
filter, harvesters try to standardise the Type field using a system
based on keyword recognition in this character string. This
standardisation is therefore imperfect and the filter may exclude
relevant resources from the result list when a user narrows his search to
one or several specific types. Some of the information contained in this
Type field cannot be standardised.<br>
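    <br>
To make the idea concrete, here is a minimal sketch in Python of such
keyword recognition. It is not Avano's actual code: the keyword table is
hypothetical and deliberately small, and only the target terms (Text,
Image, StillImage, MovingImage, Dataset) come from the DCMI Type
Vocabulary.<br>
<pre>
```python
# Hypothetical keyword table: free-text fragments mapped to DCMI Type terms.
KEYWORDS = {
    "article": "Text",
    "artjournal": "Text",
    "publication": "Text",
    "text": "Text",
    "thesis": "Text",
    "image": "Image",
    "photo": "StillImage",
    "video": "MovingImage",
    "dataset": "Dataset",
}


def normalise_type(raw_types):
    """Return the DCMI Type terms recognised in a record's Type values.

    An empty set means the record cannot be matched by any type filter.
    """
    found = set()
    for raw in raw_types:
        lowered = raw.lower()
        for keyword, term in KEYWORDS.items():
            if keyword in lowered:
                found.add(term)
    return found
```
</pre>
A record whose Type values match no keyword, or that has no Type value at
all, gets an empty set and so disappears as soon as a type filter is
applied.<br>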

    <br>

Even more problematic is the fact that some repositories do not fill in
this field. As an example, in September 2007, out of the 107,000
records available in Avano, more than 26,000 did not have a Type field.
All of those records are automatically excluded from the search space if
a user limits his search to one or several selected types. <br>

  <br>

    <i>Would it be possible to imagine a new normalised and
mandatory piece of information about the type of the digital object (text,
image, video...) so that harvesters could offer a reliable option to filter
the end-user search by one or several types of objects?<br>

    </i><br>

The publication date is also problematic for harvesters. For example, in
September 2007, out of the 107,000 records available in Avano, about
15,000 did not have a publication date. When a record does not have a
publication date or when it cannot be standardised, it is automatically
placed at the end of the list if the user wants the results to be
sorted by date. In the same way, when a user limits his search to a
specific period of time (see fig. 9), those records are excluded from the
search even if they correspond to the specified search.  <br>
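    <br>
The behaviour described above can be sketched as follows in Python. This is
only an illustration under stated assumptions, not Avano's implementation:
records are dicts with a hypothetical <code>date</code> key, only a
four-digit year is extracted, and the <code>keep_undated</code> option is
invented for the example.<br>
<pre>
```python
import re


def extract_year(raw_date):
    """Return the first plausible four-digit year in a date string, or None."""
    match = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", raw_date or "")
    return int(match.group(1)) if match else None


def sort_newest_first(records):
    """Sort by year, newest first; undated records end up at the end."""
    def key(record):
        year = extract_year(record.get("date"))
        return (year is None, -(year or 0))
    return sorted(records, key=key)


def filter_by_period(records, first_year, last_year, keep_undated=False):
    """Keep records dated within the period; undated ones only if requested."""
    wanted = range(first_year, last_year + 1)
    kept = []
    for record in records:
        year = extract_year(record.get("date"))
        if year is None:
            if keep_undated:
                kept.append(record)
        elif year in wanted:
            kept.append(record)
    return kept
```
</pre>
With <code>keep_undated</code> left at its default, undated records are
silently dropped from any period search, which is exactly the loss
described above.<br>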

    <br>

But I guess this problem with the publication date will be more

difficult to fix because it is difficult to define it as mandatory. <br>

    <br>

    <br>

    <b><big>Records without free access to the digital object</big></b><br>

    <br>

As far as I understand, the OAI-PMH protocol defines only the sharing
process of bibliographical records contained in a group of
repositories. As a consequence, some repositories mix records without
links to the digital object together with records providing free access
to the resource. Others provide records with paid access (e.g.
BePress) or records with restricted access, for example, for university
staff only.  <br>

    <br>

In my opinion, this is the major problem harvesters have to face today.
There is nothing in the Dublin Core DTD that tells harvesters
the degree of accessibility of the objects described in the records. As
a consequence, harvesters cannot pass on this information to their
users or provide them with the ability to filter out empty records or
records offering only paid access to the resource. <br>

    <br>

It is my opinion that hiding records with free full text among records
with inaccessible full text is not helpful. For lack of time and/or
interest, scientists are reluctant to join the Open Access movement and
the archiving rate of free access publications stays very low,
especially in life sciences. Free and immediate access to documentation
is, without doubt, the best way to convince scientists of the
value of the Open Access movement. And drowning a minority of
records providing free access publications in an ocean of records
without a link to the full text and/or records offering paid access to
the documents may not be the best way to promote the Open Access
movement. <br>

    <br>

Again, those records without free access to the full text would not be
a problem for harvesters if the Dublin Core DTD made it possible to tell
harvesters the degree of accessibility of the objects described in
the records. Harvesters could then provide their users with the
possibility of filtering out the records without free access to the digital
object. But this is still not the case.  <br>

    <br>

    <i>Could we then imagine that, in a possible future version of
OAI-PMH, each record will have to provide normalised, mandatory
information about the degree of accessibility of the digital object
(free, paying, impossible, restricted,...)? This would greatly help
harvesters provide a better service to their end-users. <br>

    </i><br>

    <br>

What do you think?<br>

    <br>

Kind regards,<br>

Fred<br>

    </font></small><br>

    <div class="moz-signature">-- <br>

    <font face="Arial" size="2">Fred Merceur<br>

Ifremer / Bibliothèque La Pérouse<br>

    <a class="moz-txt-link-abbreviated"

 href="mailto:frederic.merceur@ifremer.fr">frederic.merceur@ifremer.fr</a><br>

Tél : 02-98-49-88-69<br>

Fax : 02-98-49-88-84<br>

    <a href="http://www.ifremer.fr/blp/">Bibliothèque La Pérouse</a><br>

    <a href="http://www.ifremer.fr/docelec/">Archimer, Ifremer's

Institutional Repository</a><br>

    <a href="http://www.ifremer.fr/avano/">Avano, a marine and aquatic

OAI harvester</a><br>

    </font></div>

    </div>

_______________________________________________<br>

OAI-implementers mailing list<br>

List information, archives, preferences and to unsubscribe:<br>

    <a

 href="http://www.openarchives.org/mailman/listinfo/oai-implementers">http://www.openarchives.org/mailman/listinfo/oai-implementers</a><br>

    <br>

  </blockquote>

  </div>

  <br>

  </div>

  </div>


</blockquote>

<br>

<div class="moz-signature">-- <br>

<font face="Arial" size="2">Fred Merceur<br>

Ifremer / Bibliothèque La Pérouse<br>

<a class="moz-txt-link-abbreviated" href="mailto:frederic.merceur@ifremer.fr">frederic.merceur@ifremer.fr</a><br>

Tél : 02-98-49-88-69<br>

Fax : 02-98-49-88-84<br>

<a href="http://www.ifremer.fr/blp/">Bibliothèque La Pérouse</a><br>

<a href="http://www.ifremer.fr/docelec/">Archimer, Ifremer's

Institutional Repository</a><br>

<a href="http://www.ifremer.fr/avano/">Avano, a marine and aquatic OAI

harvester</a><br>

</font></div>

</body>

</html>