What is the OAI Protocol for Metadata Harvesting

By Neil Fegen, Domain Co-coordinator, JISC CETIS

March 2007 (minor revision Oct. 2007)


What is the OAI Protocol for Metadata Harvesting?
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) offers a simple technical option for catalogue and repository services to make their metadata available to other services. It is based on the HTTP and XML standards, and its aim is to facilitate the discovery of distributed resources. The metadata to be harvested may be in any agreed format, although unqualified Dublin Core is required by the specification in order to provide a basic level of interoperability.

The OAI framework distinguishes between data providers and service providers. Data providers have metadata that they wish to make available; typically the metadata will come from a catalogue or repository. Service providers create value-added services, such as search interfaces, based on the metadata from one or more data providers. The OAI-PMH allows data providers to make their metadata available for 'harvesting' by service providers: the metadata is collected from a number of distributed repositories into a combined data store.

OAI-PMH was developed by the Open Archives Initiative, with the latest version (2.0) released in June 2002. This release is not backwards-compatible with earlier versions, although a migration document details the changes between versions 1.1 and 2.0.

What is the OAI Protocol for Metadata Harvesting for?
The primary role of the OAI-PMH is to facilitate resource discovery when resources are stored in a number of distributed, independent repositories, by exporting metadata about items in those repositories. The OAI-PMH is widely used for ePrints archives and has its roots in the ePrints community; however, the protocol can be applied to a wide range of digital materials, such as images, learning materials, assessment materials, technical reports or catalogue records. To enable the broadest level of interoperability, OAI-PMH mandates that metadata should be exposed as Dublin Core. It is important to appreciate, however, that the protocol enables multiple forms of metadata to be exposed, and that any information associated with a resource that can be encoded in XML may be exchanged using OAI-PMH. While the OAI-PMH is most widely associated with the open sharing of public metadata, it can also be used for private exchange in closed systems.

The facilitation of resource discovery is achieved through services based on the harvested metadata; some example services are described below:
 * A cross-search user interface. A web site offering users the opportunity to find resources from a selection of repositories by searching the metadata harvested from those repositories. The value added by the service provider comes from the selection of repositories, which increases the range or scope of each query in an effort to meet the information needs of a specific user community.
 * A cross-search application interface. The service provider may provide access to the metadata for searching using protocols that may not be offered by the originating repository; for example, SRU/W and A9 may be used for searching, and the results of any query may be provided as RSS or Atom feeds. These interfaces may be used to provide a search service that can be embedded into teaching and learning environments.
 * If suitable resource identifiers are available in the harvested metadata, it may be possible for the service provider to harvest the full text of the resources in the repositories, and to provide resource discovery services based on indexing that full text.
 * It may be possible for the service provider to enhance the metadata harvested from a repository, for example by supplementing it with information from other sources such as full text (if that is accessible), metadata on the same resource from another repository, citation or linking analysis, or comments, annotations and peer review from the user community.
 * The service provider may transform the metadata in some way, for example by normalising non-standard usage or by providing the same information in a different format (e.g. transforming information from LOM records to formats used in reading lists or citations).

How OAI-PMH works
OAI-PMH is based on a client-server architecture, in which 'harvesters' request information on updated records from repositories. A harvester is a client application that issues OAI-PMH requests, and is operated as a means of collecting ('harvesting') metadata from repositories. A repository is a network-accessible server that can process the six OAI-PMH requests described below, and is managed by a data provider to expose metadata to harvesters. Data providers handle the deposit and publishing of resources in a repository, making the associated metadata available for harvesting by service providers. To allow various repository configurations, the OAI-PMH distinguishes between three distinct entities related to the metadata made accessible by the OAI-PMH: a resource (what the metadata is about); an item (a constituent of a repository from which metadata about a resource can be disseminated); and a record (metadata expressed in a single format). OAI-PMH requires that each item has a unique identifier.
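The distinction between resource, item and record can be illustrated with a small data model. The following is only a sketch: the class names, field names and the example identifier and URL are hypothetical and are not taken from the specification, which defines these entities conceptually rather than as a programming interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Metadata about a resource, expressed in one single format."""
    item_identifier: str   # unique identifier of the item it came from
    metadata_prefix: str   # e.g. "oai_dc" for unqualified Dublin Core
    datestamp: str         # date of creation or last modification
    xml: str               # the metadata itself, encoded as XML


@dataclass
class Item:
    """A constituent of a repository that can disseminate records."""
    identifier: str        # unique within the repository
    resource_url: str      # the resource that the metadata is about
    records: dict = field(default_factory=dict)

    def add_record(self, metadata_prefix, datestamp, xml):
        # One item may hold the same description in several formats.
        self.records[metadata_prefix] = Record(
            self.identifier, metadata_prefix, datestamp, xml)

    def disseminate(self, metadata_prefix):
        # Raises KeyError if the item has no record in that format.
        return self.records[metadata_prefix]


# Hypothetical item describing one resource in two metadata formats.
item = Item("oai:repository.example.org:1234",
            "http://repository.example.org/docs/1234.pdf")
item.add_record("oai_dc", "2007-02-15", "<oai_dc:dc>...</oai_dc:dc>")
item.add_record("lom", "2007-02-15", "<lom>...</lom>")
```

In this sketch a single item yields two records, one per metadata format, while both describe the same underlying resource.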

Searching on a single collection of harvested metadata can be contrasted with distributed searching, which passes each query to multiple repositories and displays the results returned by them. With harvesting, interaction with a remote repository is separate from, and less frequent than, searches on the metadata. The aim is to provide the end user with an increase in responsiveness, reliability and possibly functionality, but at the potential cost of the harvested metadata being out of date and of additional storage being required by the service provider.

Overview and structure model
OAI-PMH supports six types of requests (known as 'verbs'), which must be submitted using the HTTP GET or POST methods:
 * Identify (description of an archive)
 * ListMetadataFormats (list of metadata formats supported by the data provider)
 * ListSets (list of sets provided)
 * ListRecords (harvest records from a repository)
 * ListIdentifiers (list of resource identifiers, i.e. an abbreviated form of ListRecords)
 * GetRecord (individual record).
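As an illustration, the six request types (Identify, ListMetadataFormats, ListSets, ListRecords, ListIdentifiers and GetRecord) can be expressed as HTTP GET URLs formed from a repository's base URL and a query string. The sketch below assumes a hypothetical base URL and item identifier; the helper function name is also an invention for this example.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH 2.0.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListRecords", "ListIdentifiers", "GetRecord"}


def build_request_url(base_url, verb, arguments=None):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError("not an OAI-PMH verb: " + verb)
    query = {"verb": verb}
    query.update(arguments or {})
    # urlencode escapes reserved characters such as ':' in identifiers.
    return base_url + "?" + urlencode(query)


BASE_URL = "http://repository.example.org/oai"  # assumed, not a real service

identify_url = build_request_url(BASE_URL, "Identify")
get_record_url = build_request_url(BASE_URL, "GetRecord", {
    "identifier": "oai:repository.example.org:1234",
    "metadataPrefix": "oai_dc",
})
```

The same arguments could equally be sent in the body of an HTTP POST request; only the GET form is shown here.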

Similarly, five types of responses are supported, which must be well-formed XML instance documents (OAI-PMH supports any XML-encoded metadata format):
 * General information
 * Metadata formats (any metadata format encoded in XML is supported)
 * Set structure
 * Record identifier
 * Metadata.
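Because every response is an XML document, a harvester needs little more than an XML parser to read it. The sketch below parses a hand-written sample GetRecord response carrying unqualified Dublin Core; the sample content (identifier, dates, title, creator) is invented for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and by Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real one would come from a repository.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-03-01T12:00:00Z</responseDate>
  <request verb="GetRecord">http://repository.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2007-02-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example resource</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def parse_records(xml_text):
    """Extract identifier, datestamp and titles from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("{%s}record" % NS["oai"]):
        header = record.find("oai:header", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
        })
    return records


parsed = parse_records(SAMPLE_RESPONSE)
```

The same function would also work on a ListRecords response, since it simply iterates over every record element it finds.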

Static Repository
The OAI have also defined a specification for Static Repositories, which provides a simple approach for exposing, via OAI-PMH, small collections of metadata records that are updated relatively infrequently. The approach is aimed at organisations that have collections of up to 5,000 records and make static content available through a network-accessible Web server. A Static OAI Repository is simply an XML file, located at a persistent HTTP URL, that contains the metadata records in Dublin Core format; it is made available via a Static Repository Gateway. The Gateway's role is to respond to OAI requests on behalf of the Static Repository, thereby making its content harvestable through OAI-PMH.

Data Provider
The following should be in place in order to implement OAI-PMH as a Data Provider:
 * Archive identifier or a base URL
 * Unique identifier for each item
 * Metadata format (one or more, with unqualified Dublin Core as a minimum)
 * Date-stamps for metadata (created or last modified).

Additionally, the following may be in place:
 * Logical set hierarchy
 * Flow control through the implementation of resumption tokens (recommended for larger repositories).
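From the harvester's point of view, date-stamps and resumption tokens work together: a 'from' argument restricts a harvest to recently changed records, and a resumption token lets a long list be collected in several chunks. The sketch below shows such a loop under some assumptions: the fetch function is injected by the caller, and here it is a stand-in that returns two invented pages instead of contacting a repository.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch, metadata_prefix="oai_dc", from_date=None):
    """Collect all identifiers, following resumption tokens.

    `fetch` is any callable that takes a dict of request arguments and
    returns the response XML as text (a hypothetical interface).
    """
    arguments = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_date:
        arguments["from"] = from_date  # only records changed since then
    identifiers = []
    while True:
        root = ET.fromstring(fetch(arguments))
        identifiers.extend(e.text for e in root.iter(OAI + "identifier"))
        token = root.find(".//" + OAI + "resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # The token is sent on its own, without the original arguments.
        arguments = {"verb": "ListIdentifiers",
                     "resumptionToken": token.text.strip()}
    return identifiers


def page(ids, token):
    """Build one invented ListIdentifiers response page."""
    headers = "".join(
        "<header><identifier>%s</identifier>"
        "<datestamp>2007-02-15</datestamp></header>" % i for i in ids)
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            "<ListIdentifiers>%s<resumptionToken>%s</resumptionToken>"
            "</ListIdentifiers></OAI-PMH>" % (headers, token))


def fake_fetch(arguments):
    # Stand-in for an HTTP request: first page, then the final page.
    if "resumptionToken" not in arguments:
        return page(["oai:repository.example.org:1",
                     "oai:repository.example.org:2"], "page-2")
    return page(["oai:repository.example.org:3"], "")


harvested = harvest_identifiers(fake_fetch, from_date="2007-01-01")
```

In a real harvester, the fetch function would issue the HTTP request and would also need to handle errors and retry instructions from the repository, which this sketch leaves out.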

Service Provider
There are three technical infrastructure prerequisites for implementing an OAI-PMH Service Provider that will harvest metadata from Data Providers via OAI-PMH:
 * Internet-connected server
 * Database system (relational or XML)
 * Programming environment (must be able to issue HTTP requests to web servers and database requests, and include an XML parser).

Additionally, it is desirable for a service provider to have a 'de-duplication' feature. When a resource is catalogued by more than one target repository, a cross-search will otherwise return the same resource more than once. Service providers will often also perform metadata normalisation to keep the harvested metadata consistent.
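One simple way to approach de-duplication is to normalise a few descriptive fields and treat records that then match as descriptions of the same resource. The sketch below uses title and creator as the matching key; this choice, the field names and the sample records are assumptions made for illustration, and real services typically need more robust matching rules.

```python
def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Merge records whose normalised title and creator are the same."""
    merged = {}
    for record in records:
        key = (normalise(record["title"]), normalise(record["creator"]))
        if key not in merged:
            # Keep the first description seen and track where it came from.
            merged[key] = {"title": record["title"],
                           "creator": record["creator"],
                           "sources": []}
        merged[key]["sources"].append(record["source"])
    return list(merged.values())


# Invented records harvested from two hypothetical repositories.
harvested_records = [
    {"title": "Metadata Harvesting Explained", "creator": "Example, A.",
     "source": "repo-a"},
    {"title": "metadata  harvesting explained", "creator": "EXAMPLE, A.",
     "source": "repo-b"},
    {"title": "Another Report", "creator": "Sample, B.",
     "source": "repo-b"},
]

unique_records = deduplicate(harvested_records)
```

Here the first two records collapse into one entry that lists both source repositories, so a search interface could show the resource once while still linking to each copy.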

Related specifications

 * The Dublin Core metadata standard is used for describing a range of digital objects, and contains a set of fifteen metadata elements (e.g. Title, Creator, Subject, Description, etc).
 * The IEEE Learning Object Metadata (LOM) standard specifies the syntax and semantics of learning object metadata, and focuses on the minimal set of attributes needed to allow learning objects (LOs) to be managed, located and evaluated.
 * The RSS and Atom specifications define web feeds which can be used to aggregate resource descriptions.

For more information on specifications such as these, see the 'Guides' section on the JISC CETIS Metadata and Digital Repository SIG wiki.

Implementations
 * OAI-PMH has been mandated by the JISC Information Environment where metadata is harvested. Data provider users include repository services such as Jorum; service provider examples include aggregators such as the PerX pilot subject-based cross-repository search tool for resource discovery in engineering.  Intute, a free online service (formerly known as the Resource Discovery Network) providing access to evaluated web resources for education and research, provides access to its resource descriptions via an OAI-PMH repository.  Institutions may also use OAI-PMH for their repositories.
 * Collaborative work between the Resource Discovery Network Hubs and Learning and Teaching Support Network (LTSN) Centres resulted in using OAI-PMH to share metadata records within the partnerships.
 * The 'Tools' section of the official OAI website lists nearly thirty tools that support OAI-PMH v2.0, including prominent digital repository software such as DSpace, ePrints and Fedora. Lists of OAI-conforming repositories are also available, both on the OAI website and in the University of Illinois OAI-PMH Data Provider Registry.
 * The OAI Repository Explorer allows for the interactive testing of archives for compliance with OAI-PMH.
 * OAI-PMH can be used to provide a Google Sitemap (Google Sitemaps help its crawlers index web sites more efficiently) and is also used by Google to harvest information from sites such as the National Library of Australia Digital Object Repository.

Websites and e-mail lists
 * The OAI-implementers list focuses on the implementation of the Open Archives Initiative Protocol for Metadata Harvesting: http://www.openarchives.org/mailman/listinfo/OAI-implementers
 * The CETIS metadata mail list may also be of interest: http://www.jiscmail.ac.uk/lists/CETIS-METADATA.html.

Resources on the Internet
 * The Open Archives Initiative Protocol for Metadata Harvesting version 2.0: http://www.openarchives.org/OAI/openarchivesprotocol.html
 * The Open Archives Forum provides a tutorial on OAI-PMH: http://www.oaforum.org/tutorial/.