What is the OAI Protocol for Metadata Harvesting?

From CETISwiki


By Neil Fegen, Domain Co-coordinator, JISC CETIS

March 2007 (minor revision Oct. 2007)





In brief

What is the OAI Protocol for Metadata Harvesting?

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) offers catalogue and repository services a simple technical option, based on the HTTP and XML standards, for making their metadata available to other services, which in turn facilitates the discovery of distributed resources. The metadata to be harvested may be in any agreed format, although the specification requires unqualified Dublin Core in order to provide a basic level of interoperability.

The OAI framework distinguishes between data providers and service providers. Data providers have metadata that they wish to make available; typically the metadata will come from a catalogue or repository. Service providers create value-added services, such as search interfaces, based on the metadata from one or more data providers. The OAI-PMH allows data providers to make their metadata available for 'harvesting' by service providers: a process of collection in which metadata is gathered from a number of distributed repositories into a combined data store.

OAI-PMH was developed by the Open Archives Initiative, with the latest version (2.0) released in June 2002. This release is not backwards-compatible with earlier versions, although a migration document details the changes between versions 1.1 and 2.0.


What is the OAI Protocol for Metadata Harvesting for?

The primary role of the OAI-PMH is to facilitate resource discovery when resources are stored in a number of distributed, independent repositories, by exporting metadata about the items in those repositories. The OAI-PMH is widely used for ePrints archives and has its roots in the ePrints community; however, the protocol can be applied to a wide range of digital materials, such as images, learning materials, assessment materials, technical reports or catalogue records. To enable the broadest level of interoperability, OAI-PMH mandates that metadata should be exposed as unqualified Dublin Core. However, it is important to appreciate that the protocol enables multiple forms of metadata to be exposed, and that any information associated with a resource that can be encoded in XML may be exchanged using OAI-PMH. While the OAI-PMH is most widely associated with the open sharing of public metadata, it can also be used for private exchange in closed systems.

The facilitation of resource discovery is achieved through services based on the harvested metadata; some example services are described below:


Technical details

How OAI-PMH works

Figure 1: Basic functioning of OAI-PMH

OAI-PMH is based on a client-server architecture, in which 'harvesters' request information on updated records from repositories. A harvester is a client application that issues OAI-PMH requests, and is operated as a means of collecting ('harvesting') metadata from repositories. A repository is a network-accessible server that can process the six OAI-PMH requests described below, and is managed by a data provider to expose metadata to harvesters. Data providers handle the deposit and publishing of resources in a repository, making the associated metadata available for harvesting by service providers. To allow various repository configurations, the OAI-PMH distinguishes between three entities related to the metadata it makes accessible: a resource (what the metadata is about); an item (a constituent of a repository from which metadata about a resource can be disseminated); and a record (metadata expressed in a single format). OAI-PMH requires that each item has a unique identifier.
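The distinction between an item and its records can be sketched as a small data model. This is only an illustration of the relationship described above, not a structure defined by the specification: the class names, field names and the example identifier are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Record:
    """Metadata about a resource, expressed in a single format."""
    item_identifier: str   # identifier of the item the record was disseminated from
    metadata_prefix: str   # e.g. "oai_dc" for unqualified Dublin Core
    datestamp: str         # date the record was created or last changed
    metadata_xml: str      # the XML-encoded metadata itself


@dataclass
class Item:
    """A constituent of a repository; one item can disseminate several records."""
    identifier: str
    records: Dict[str, Record] = field(default_factory=dict)

    def add_record(self, metadata_prefix: str, datestamp: str, metadata_xml: str) -> Record:
        # One record per metadata format, keyed by its prefix.
        record = Record(self.identifier, metadata_prefix, datestamp, metadata_xml)
        self.records[metadata_prefix] = record
        return record

    def get_record(self, metadata_prefix: str) -> Record:
        return self.records[metadata_prefix]


# Hypothetical item exposing the same resource description in two formats.
item = Item("oai:example.org:item-1")
item.add_record("oai_dc", "2007-02-14", "<dc>...</dc>")
item.add_record("other_format", "2007-02-14", "<other>...</other>")
```

The resource itself (the report, image or learning object) does not appear in the model: only metadata about it is exchanged.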

Searching a single collection of harvested metadata can be contrasted with distributed searching, which passes each query to multiple repositories and displays the results they return. With harvesting, interaction with a remote repository is separate from, and less frequent than, searches on the metadata. The aim is to give the end user greater responsiveness, reliability and possibly functionality, at the potential cost of stale metadata and of additional storage at the service provider level.
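The harvesting model can be sketched as follows: metadata is copied into a local store occasionally, and every search then runs against that store alone. The repositories here are plain in-memory dictionaries standing in for remote servers, and the function names, identifiers and titles are hypothetical.

```python
def harvest(repositories, store, since=None):
    """Copy records changed on or after `since` from each repository into the store.

    Datestamps are ISO 8601 date strings, so plain string comparison orders them.
    Returns the number of records copied.
    """
    copied = 0
    for repo_name, records in repositories.items():
        for identifier, (datestamp, title) in records.items():
            if since is None or datestamp >= since:
                store[(repo_name, identifier)] = (datestamp, title)
                copied += 1
    return copied


def search(store, term):
    """Search only the local store; no remote repository is contacted."""
    term = term.lower()
    return sorted(key for key, (_, title) in store.items() if term in title.lower())


# Two stand-in repositories (assumed data, for illustration only).
repositories = {
    "repo-a": {
        "oai:a:1": ("2007-01-10", "Learning design report"),
        "oai:a:2": ("2007-03-02", "Image collection guide"),
    },
    "repo-b": {
        "oai:b:1": ("2007-02-20", "Assessment design notes"),
    },
}

store = {}
harvest(repositories, store)                      # initial, full harvest
matches = search(store, "design")                 # answered from the local store
harvest(repositories, store, since="2007-03-01")  # later, incremental harvest
```

Between harvests the store may lag behind the repositories, which is the staleness cost noted above.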


Overview and structure model

Figure 2: An overview and structure model of the OAI-PMH

OAI-PMH supports six types of requests (known as 'verbs'), which must be submitted using the HTTP GET or POST methods:
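As an illustration, the six verbs defined by OAI-PMH 2.0 are Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords and GetRecord, and a GET request carries the verb and its arguments as query parameters. The sketch below only builds request URLs and sends nothing; the base URL and the helper name `build_request_url` are assumptions.

```python
from urllib.parse import urlencode

# The six request types ('verbs') defined by OAI-PMH 2.0.
VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url, verb, **arguments):
    """Build the URL for an OAI-PMH request submitted with HTTP GET."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# Hypothetical repository base URL.
BASE_URL = "http://example.org/oai"

identify_url = build_request_url(BASE_URL, "Identify")
list_url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
get_url = build_request_url(
    BASE_URL,
    "GetRecord",
    identifier="oai:example.org:item-1",
    metadataPrefix="oai_dc",
)
```

With POST, the same parameters would be sent in the request body instead of the query string.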

Similarly, five types of responses are supported, which must be well-formed XML instance documents (OAI-PMH supports any XML-encoded metadata format):
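Because responses are XML documents, a harvester can read them with any XML parser. The following is a minimal sketch using Python's standard library; the sample response is hand-written for illustration (the identifier, dates and title are invented), and the helper name `parse_records` is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample of a GetRecord response (illustrative data only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-03-01T12:00:00Z</responseDate>
  <request verb="GetRecord" identifier="oai:example.org:item-1"
           metadataPrefix="oai_dc">http://example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2007-02-14</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example report</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract identifier, datestamp and Dublin Core titles from each record."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        header = record.find("oai:header", NS)
        results.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
        })
    return results


records = parse_records(SAMPLE_RESPONSE)
```

The same function would work on a ListRecords response, since it simply walks every record element it finds.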


Static Repository

The OAI have also defined a specification for Static Repositories, which provides a simple approach for exposing, via OAI-PMH, small collections of metadata records that are updated relatively infrequently. The approach is aimed at organisations that have collections of up to 5,000 records and can make static content available through a network-accessible Web server. A Static OAI Repository is simply an XML file, held at a persistent HTTP URL, that contains the metadata records in Dublin Core format; it is made available via a Static Repository Gateway. The Gateway's role is to respond to OAI requests on behalf of the file, which makes the Static Repository harvestable via OAI-PMH.
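The division of labour between the static file and the gateway can be sketched as below. This is a toy under stated assumptions: the XML layout is a simplified stand-in and not the real Static Repository schema, the function returns plain Python values where a real gateway would return OAI-PMH XML responses, and all names and data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a static repository file (NOT the real schema).
STATIC_FILE = """
<records>
  <record identifier="oai:example.org:1" datestamp="2007-01-10">
    <title>First report</title>
  </record>
  <record identifier="oai:example.org:2" datestamp="2007-02-05">
    <title>Second report</title>
  </record>
</records>
"""


def gateway_response(static_xml, verb, identifier=None):
    """Answer a request by reading the static file; the file itself never changes."""
    root = ET.fromstring(static_xml)
    records = {r.get("identifier"): r for r in root.findall("record")}
    if verb == "ListIdentifiers":
        return sorted(records)
    if verb == "GetRecord":
        record = records.get(identifier)
        if record is None:
            return None  # a real gateway would return an OAI-PMH error response
        return {
            "identifier": record.get("identifier"),
            "datestamp": record.get("datestamp"),
            "title": record.findtext("title"),
        }
    raise ValueError(f"verb not handled in this sketch: {verb}")


identifiers = gateway_response(STATIC_FILE, "ListIdentifiers")
first = gateway_response(STATIC_FILE, "GetRecord", identifier="oai:example.org:1")
```

The data provider only has to maintain the file; all protocol handling sits in the gateway.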


Requirements

Data Provider

The following should be in place in order to implement OAI-PMH as a Data Provider:

Additionally, the following may be in place:


Service Provider

There are three technical infrastructure prerequisites for implementing an OAI-PMH Service Provider that will harvest metadata from Data Providers via OAI-PMH:

Additionally, it is desirable for a service provider to have a 'de-duplication' feature: when a resource is catalogued by more than one target, a cross-search will otherwise return the same resource more than once. Service providers will often also normalise harvested metadata so that it remains consistent across sources.
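A minimal sketch of normalisation followed by de-duplication is shown below. The matching rule (same title and creator after tidying whitespace and ignoring case) is an assumption chosen for illustration; real services typically use richer matching. The field names and sample records are hypothetical.

```python
def normalise(record):
    """Tidy whitespace so that equivalent values compare equal."""
    return {
        "source": record["source"],
        "title": " ".join(record.get("title", "").split()),
        "creator": " ".join(record.get("creator", "").split()),
    }


def dedup_key(record):
    # Assumed rule: records match if title and creator agree, ignoring case.
    return (record["title"].casefold(), record["creator"].casefold())


def deduplicate(records):
    """Merge records describing the same resource, remembering every source."""
    merged = {}
    for record in map(normalise, records):
        key = dedup_key(record)
        if key in merged:
            merged[key]["sources"].append(record["source"])
        else:
            merged[key] = {
                "title": record["title"],
                "creator": record["creator"],
                "sources": [record["source"]],
            }
    return list(merged.values())


# The same report harvested from two data providers, with small differences.
harvested = [
    {"source": "repo-a", "title": "An  Example Report", "creator": "Example, A."},
    {"source": "repo-b", "title": "an example report ", "creator": "example, a."},
    {"source": "repo-b", "title": "A different report", "creator": "Other, B."},
]

unique = deduplicate(harvested)
```

Keeping the list of sources lets the search interface show one result while still linking to every repository that holds the resource.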


Related specifications

For more information on specifications such as these, see the 'Guides' section on the JISC CETIS Metadata and Digital Repository SIG wiki.


Implementations


Resources

Websites and e-mail lists


Resources on the Internet



