[OAI-implementers] Selective Harvesting OAI-PMH Global Harvesters
Frederic.Merceur at ifremer.fr
Mon Aug 11 07:37:30 EDT 2008
Avano <http://www.ifremer.fr/avano/> is indeed a thematic OAI harvester
for aquatic and marine science.
First, Avano harvests a few repositories from various aquatic science
research institutes. All resources stored in those specialized
repositories are systematically and automatically referenced in Avano.
But only 20% of the records available via Avano come from harvesting
these aquatic repositories.
Avano also queries a group of open archives that are not specialized in
aquatic sciences but contain relevant resources. This is the case for
the PubMed Central server, which specializes in biomedical and life
sciences and provides more than 18,000 records relevant to Avano’s
fields of interest.
In theory, the thematic harvesting of a repository should be made
possible by using the Set option of the OAI-PMH protocol. Nevertheless,
in reality, we have never found any “Marine and Aquatic Sciences” Set in
any of the harvested repositories. In order to filter those
repositories, we have developed a search system based on keywords and
key expressions related to aquatic sciences.
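
For readers unfamiliar with Set-based selective harvesting, here is a
minimal Python sketch of what it would look like if a repository did
expose a suitable Set. It is not Avano's code: the function names, the
example base URL and the sample response are assumptions made for
illustration. Only the verbs and arguments (ListSets, ListRecords,
metadataPrefix, set) and the XML namespace come from OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request, optionally restricted to one Set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def matching_sets(list_sets_xml, terms):
    """Return (setSpec, setName) pairs whose name contains one of the terms."""
    root = ET.fromstring(list_sets_xml)
    hits = []
    for node in root.iterfind(".//oai:set", OAI_NS):
        spec = node.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = node.findtext("oai:setName", default="", namespaces=OAI_NS)
        if any(term.lower() in name.lower() for term in terms):
            hits.append((spec, name))
    return hits


# Hypothetical ListSets response from a repository with no aquatic Set.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>phys</setSpec><setName>Physics</setName></set>
    <set><setSpec>bio</setSpec><setName>Life Sciences</setName></set>
  </ListSets>
</OAI-PMH>"""

if __name__ == "__main__":
    # No Set matches, which is the situation described above.
    print(matching_sets(SAMPLE_LIST_SETS, ["marine", "aquatic"]))
    # What the request would look like if a usable Set existed.
    print(list_records_url("http://example.org/oai", set_spec="bio"))
```

When matching_sets returns nothing, the only option left is to harvest
the whole repository and filter the records afterwards.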
To process repositories that are not perfectly categorized within our
fields of interest, Avano loads all of their records into a temporary
database.
These data are indexed, and a daily automated job then searches the
records for about 100,000 scientific names of aquatic species. For
example, if a record contains the character string Crassostrea gigas
(scientific name of an oyster species), we consider that there is hardly
any chance that this name is used in a context other than our field
of interest, so the record automatically becomes visible in Avano.
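
The species-name check can be sketched in a few lines of Python. This
is an illustration only, not the indexing system Avano actually uses:
the function names and the two-name list are assumptions, and a plain
regular expression is used here in place of a real index.

```python
import re


def build_species_pattern(names):
    """Compile one case-insensitive pattern matching any listed name."""
    # Longest names first, so a longer name is preferred over a shorter
    # one that happens to be its prefix.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def species_found(text, pattern):
    """Return the distinct species names found in a record's text."""
    # Collapse line breaks and repeated spaces so that a name split
    # across two lines is still matched.
    normalized = " ".join(text.split())
    return sorted({m.group(0).lower() for m in pattern.finditer(normalized)})


if __name__ == "__main__":
    # Hypothetical two-entry list; the real list holds about 100,000 names.
    pattern = build_species_pattern(["Crassostrea gigas", "Salmo salar"])
    record = "Growth of the Pacific oyster Crassostrea\ngigas in Brittany"
    if species_found(record, pattern):
        print("accept automatically")
```

With a list of about 100,000 names, a single large alternation like
this one would probably be too slow, which is one reason to rely on an
index instead.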
Avano also searches for a few hundred more general terms and
expressions related to the aquatic environment. For example, Avano
searches for the words fish, marine, fishing, water treatment... Records
spotted by this keyword system are then manually validated by
librarians before they can be viewed via Avano. To validate those
records, librarians use a dedicated website in which the keywords found
in each record are highlighted. This system allows librarians to reject
records when the keywords are not related to their fields of interest
(for example when FISH is used for fluorescence in situ hybridization).
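
As an illustration of the highlighting step, here is a small Python
sketch. It is not the code of the validation website: the function
name, the four-term list and the "**" marker are assumptions. It also
shows why a manual check is needed, since the FISH example is flagged
like any other record.

```python
import re

# Hypothetical short list; the real one holds a few hundred terms.
GENERAL_TERMS = ["fish", "marine", "fishing", "water treatment"]


def highlight(text, terms=GENERAL_TERMS, mark="**"):
    """Wrap each matched term in `mark`; return (new_text, match_count)."""
    alternatives = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)
    return pattern.subn(lambda m: mark + m.group(1) + mark, text)


if __name__ == "__main__":
    for title in ["Marine fishing policy", "FISH analysis of chromosomes"]:
        marked, count = highlight(title)
        # Any record with at least one match goes to the librarians.
        if count:
            print("to review:", marked)
```

Both titles are sent for review; only the librarian can see that the
second one is about fluorescence in situ hybridization and reject it.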
Of course, this method is far from being ideal:
- This method partially relies on manual sorting of the records, which
requires some time (a few minutes per day to filter the new records from
the 150 repositories already registered, plus extra time to process the
backlog when new repositories are added).
- Since we spend no more than 2 or 3 seconds deciding whether to
validate a record, we may accept a small percentage of records that are
not related to Avano’s fields of interest…
Atanu Garai wrote:
> *Apologies for cross-posting*
> Dear Colleagues
> Globethics.net intends to harvest all ethics related metadata from
> open repositories around the world and interpolate the same as part of
> the digital library. We feel that this would be a great service towards
> fulfilling the information and knowledge needs and exchange for the
> global ethics community. In so doing, we have studied a few
> alternatives and solutions, as given below:
> 1. OAI-PMH 2.0 specification and implementation guidelines:
> The original OAI-PMH 2.0 specification and implementation guideline for
> 'service providers' like harvesters/aggregators provides steps towards
> implementing a harvesting engine. The only way to provide subject (or
> keyword) related metadata retrieval, according to this guideline, is to
> specify the subject in the Set. A closer examination of the set-specs,
> as available in the ROAR
> (http://roar.eprints.org/) tells us that 'ethics'
> as subject does not appear in the data providers that I have surveyed
> so far. The conclusion is that using OAI-PMH 2.0 implementation
> guidelines we will not be able to harvest metadata in this domain in an
> optimal fashion.
> 2. The second strategy is the strategy followed by AVANO -
> http://www.ifremer.fr/avano/ - a harvester in the domain of aquatic and
> marine sciences. Essentially, they aggregate all the metadata in a
> temporary (internal) database, run a search query and then interpolate
> the relevant records onto their AVANO public interface. This is an
> advantageous proposition for a subject-specialist harvester, but we are
> constrained by resources to implement this strategy.
> 3. The third way, for which I have not found any implementation example
> so far, is to take the relevant metadata from already existing global
> harvesters like OAI and interpolate it into the Globethics.net server.
> The current global harvesters that we are examining are OAISTER and
> Scientific Commons. However, I would like to know the possible
> standardized mechanisms by which we can take relevant (searching with
> the word 'ethics' in Scientific Commons gets 75000+ records) metadata
> from these harvesters and ingest it into our database.
> Thank you for your time to reflect on these issues.
> Atanu Garai
> International Secretariat
> 150, route de Ferney
> CH-1211 Geneva 2
> Tel.: +41 22 791 62 49
> Fax: +41 22 710 23 86
> Web: www.globethics.net
Ifremer / Bibliothèque La Pérouse
frederic.merceur at ifremer.fr
Tel.: 02-98-49-88-69
Fax : 02-98-49-88-84
Bibliothèque La Pérouse <http://www.ifremer.fr/blp/>
Archimer, Ifremer's Institutional Repository
Avano, a marine and aquatic OAI harvester <http://www.ifremer.fr/avano/>