Harvesting documents using the OAI-PMH Harvester

Once an organization participating in a central Metadata Service (the provider) has successfully configured the OAI-PMH Connector, the harvesting organization can retrieve documents from the provider’s Metadata Service using the OAI-PMH Harvester.

First, you need to install the ArcIMS Metadata options. See ‘Installing ArcIMS Metadata Options’ in the ArcIMS Installation Guide for instructions.
To start the OAI-PMH Harvester, open the operating system’s command window and navigate to <ArcIMS install directory>/Metadata/OAI-PMH/Client. Type “run” and press Enter. The OAI-PMH Harvester dialog box appears.

The OAI-PMH Harvester dialog box is divided into three panels. At the top is the list of harvesting tasks; each task identifies an OAI-PMH Connector from which you want to harvest and the documents you want to retrieve. In the middle is information about your Metadata Service; the harvested documents will be published directly to the specified service. At the bottom you can see the status of your harvesting tasks; initially it will show that the default configuration file for the OAI-PMH Harvester has been successfully loaded.

Use the toolbar to add and remove harvesting tasks, save the OAI-PMH Harvester’s configuration, load a new configuration file, and start the harvesting process.

When the OAI-PMH Harvester opens, the default configuration file, OAIServers.xml, is loaded. It includes a default harvesting task to help you get started. Following the instructions below, edit the default harvesting task to identify the OAI-PMH Connector from which you want to harvest, specify the information needed to publish the harvested documents to your Metadata Service, and then begin harvesting.

In the task list, double-click the default value in the OAI Servers column, “http://arcims_oai_host/aimsharvester/oai2.0”. Change “arcims_oai_host” to the host name of the provider’s ArcIMS server. For example, if the ArcIMS server is “http://www.providerHost.org”, the address for the provider’s OAI-PMH Connector will be “http://www.providerHost.org/aimsharvester/oai2.0”.

The value in the Prefix column should remain “esri_rawxml”; this indicates that you want the entire document to be retrieved as-is from the provider. The first time you harvest from a connector you’ll want to retrieve all documents, so leave the From, Until, and Set columns empty.
Make sure this task has a check mark in the Active column. Only active tasks will be harvested.
Press Enter, or click elsewhere in the task list for the information specified in the OAI Servers column to take effect.
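
The values in a harvesting task correspond to the arguments of a standard OAI-PMH ListRecords request. The following Python sketch shows how a task’s columns could map onto such a request. It is only an illustration: the function names are hypothetical, and the exact request that the OAI-PMH Harvester sends is an assumption based on the OAI-PMH protocol rather than on the harvester’s source.

```python
from urllib.parse import urlencode

# Default address shipped in the task list; only the host part is replaced.
DEFAULT_CONNECTOR = "http://arcims_oai_host/aimsharvester/oai2.0"


def connector_url(provider_host):
    """Swap the provider's host name into the default connector address."""
    return DEFAULT_CONNECTOR.replace("arcims_oai_host", provider_host)


def list_records_url(base_url, prefix="esri_rawxml",
                     from_date=None, until=None, set_spec=None):
    """Build an OAI-PMH ListRecords request from a task's column values.

    Empty From, Until, and Set columns are simply left out, which asks
    the provider for all documents.
    """
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if from_date:
        params["from"] = from_date
    if until:
        params["until"] = until
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


base = connector_url("www.providerHost.org")
print(list_records_url(base))
```

Opening a URL like this in a Web browser can be a quick way to confirm that the provider’s connector address is reachable before you start harvesting.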

Now that a harvesting task has been defined, you need to provide information about the Metadata Service to which the harvested documents will be published.
In the second panel, double-click in the Value column to the right of URL and specify the URL of your ArcIMS server, for example, “http://www.harvesterHost.com”.
Double-click in the Value column to the right of Service Name and specify the name of the Metadata Service to which the harvested documents will be published.
Double-click in the Value column to the right of User Name and specify the user name to use to publish the harvested documents.

This user will own the harvested documents. Only a document’s owner can edit it or make it private. The owner or a user with metadata_administrator privileges can delete it. The only way to change the owner of a document once it has been harvested is by sending the ArcXML request CHANGE_OWNER to the Metadata Service.
Double-click in the Value column to the right of Password and specify the password to use to publish the harvested documents.

The user name and password specified in the Value column must match a user name and password combination in the access control list that has permission to publish to the Metadata Service.
Press Enter, or click elsewhere in the harvester panel for the password you’ve specified to take effect.
Click Save to save the changes you’ve made to the OAI-PMH Harvester’s configuration. In the Save dialog box, navigate to the <ArcIMS install directory>/Metadata/OAI-PMH/Client directory. Save your changes to the configuration file OAIServers.xml; your settings will be loaded each time the OAI-PMH Harvester is started.
Click Begin harvesting active tasks to start harvesting documents.
In the message box that appears, click Yes to start harvesting.

In the status panel, messages appear indicating the progress of the harvesting tasks; scroll down to see the latest messages. The status bar also indicates that harvesting is in progress. Click the red dot at the bottom right of the dialog box to stop harvesting at any time.

By default, if a problem occurs when the harvested documents are being published, an error message will appear and harvesting will be stopped. For example, this might happen if you did not click elsewhere in the second panel after specifying the password, or if the user doesn’t have permission to publish to the specified Metadata Service.

Harvesting documents by date

The first time you harvest a provider’s documents you’ll want to retrieve all of them. For subsequent harvests, you’ll typically want to retrieve only the documents that have been added or modified since the last time the provider was harvested. Do this by specifying the date when you last harvested the provider in the From column. For example, if the last successful harvest was on December 31, 2003, harvest all documents modified since then by specifying “2003-12-31” in the From column. Dates must be provided in the format YYYY-MM-DD.

Any combination of dates can be specified in the From and Until columns as appropriate. To harvest documents modified in a specific time period, specify a date in both the From and Until columns. To harvest documents modified before a given date, specify a date only in the Until column. Be sure to click elsewhere in the task list after specifying a date so that your changes to the task list take effect.
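
If you maintain many tasks, it can help to check the dates before typing them into the task list. The following Python sketch validates the YYYY-MM-DD format and checks that a From date does not fall after an Until date. The function names are hypothetical, and the range check is an extra precaution assumed here, not a rule stated for the harvester.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_harvest_date(text):
    """Return a date for a YYYY-MM-DD string, or None for an empty column."""
    if not text:
        return None
    if not DATE_PATTERN.fullmatch(text):
        raise ValueError("Dates must use the format YYYY-MM-DD: %r" % text)
    # strptime also rejects impossible dates such as 2003-02-30.
    return datetime.strptime(text, "%Y-%m-%d").date()


def check_date_range(from_text="", until_text=""):
    """Validate the From and Until values of one harvesting task."""
    start = parse_harvest_date(from_text)
    end = parse_harvest_date(until_text)
    if start and end and start > end:
        raise ValueError("The From date is later than the Until date")
    return start, end


# Harvest everything modified since the last successful harvest.
print(check_date_range(from_text="2003-12-31"))
```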

To retrieve the latest updates from a provider, you can either modify the existing task for that provider in the task list and make sure it is active, or uncheck the existing task in the Active column to deactivate it, click Add a harvesting task, and then define a new task. Only active tasks will be executed during the harvesting process.

Harvesting documents from a folder in a Metadata Service

Suppose you want to harvest only the documents that reside in a specific folder in the provider’s Metadata Service. To specify that folder as the set that you want to harvest, put its document identifier in the Set column in the task list. To get the document identifier for a folder, select the folder in ArcCatalog, click the Metadata tab, then click XML in the Stylesheets dropdown list on the Metadata toolbar. The value of the PublishedDocID element is the folder’s document identifier. For OAI-PMH services, the value used to specify a set is the document identifier excluding the braces; select the identifier without its braces with your mouse, right-click the selected text, then click Copy.

In the OAI-PMH Harvester, paste this value into the Set column in the task list. Press Enter or click elsewhere in the task list for your changes to take effect.
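
If you copied the identifier together with its braces, it must be trimmed before it is used as a set value. A minimal Python sketch of that step follows; the function name and the sample identifier are hypothetical.

```python
def set_value_from_doc_id(published_doc_id):
    """Return the PublishedDocID value without its surrounding braces."""
    doc_id = published_doc_id.strip()
    if doc_id.startswith("{") and doc_id.endswith("}"):
        doc_id = doc_id[1:-1]
    return doc_id


# Hypothetical identifier, shown only to illustrate the format.
print(set_value_from_doc_id("{12345678-ABCD-1234-ABCD-1234567890AB}"))
```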

Customizing the harvesting configuration

While the most commonly used properties for the harvester and the provider can be specified in the OAI-PMH Harvester dialog box, a few properties can only be specified by editing the configuration file in a text editor. The configuration file is named OAIServers.xml and is located at <ArcIMS install directory>/Metadata/OAI-PMH.

Your custom settings will be loaded each time the OAI-PMH Harvester is started. However, they won’t appear in the OAI-PMH Harvester dialog box. If you customize the settings for a harvesting task, remove that task, and save your changes, your custom settings will be lost. If you add a task, you’ll have to modify its settings in a text editor if you want it to use the same custom settings as the other tasks.
When you open the configuration file in a text editor, you can see there is one repository element for each provider. Additional properties can be set for a provider as follows.

Maximum number of documents—To set the maximum number of documents that you want to harvest from a provider, add the maxdocs element into the appropriate repository element and set its value to a desired number. For example, you would add “<maxdocs>50</maxdocs>”.

Log in to access a provider’s Metadata Service—If you must log in to access documents in the provider’s Metadata Service, add username and password elements into the appropriate repository element and set their values appropriately. For example, you would add “<username>harvesterUser</username>” and “<password>harvesterPass</password>”.
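
The same edits can be scripted instead of typed by hand. The following Python sketch adds the maxdocs, username, and password elements to a repository element. The sample document is an assumption: only the repository, maxdocs, username, and password element names come from the description above, while the root element name and the url child are hypothetical placeholders. Check your own OAIServers.xml for the actual layout, and keep a backup copy before changing it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for OAIServers.xml; the real file may be laid out
# differently. Only <repository> and the three added elements are documented.
SAMPLE = """<harvester>
  <repository>
    <url>http://www.providerHost.org/aimsharvester/oai2.0</url>
  </repository>
</harvester>"""


def set_child(parent, tag, value):
    """Create the child element if it is missing, then set its text."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    child.text = str(value)
    return child


root = ET.fromstring(SAMPLE)
repository = root.find("repository")

set_child(repository, "maxdocs", 50)                # harvest at most 50 documents
set_child(repository, "username", "harvesterUser")  # login for the provider's service
set_child(repository, "password", "harvesterPass")

print(ET.tostring(root, encoding="unicode"))
```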

In the configuration file there is one publisher element for the harvester. Additional properties can be set for the harvester as follows.

Copying the provider’s folder structure—The provider’s Metadata Service may organize documents into a hierarchy of folders so they can be easily browsed. By default, folders matching the provider’s folder structure will be created at the root level of the Metadata Service into which the documents are harvested. Instead, you can choose not to copy the provider’s folder structure; with this option, all documents will be placed at the root level of the harvester’s Metadata Service. To not replicate the provider’s folders, add the useSets attribute to the provider element and set it to “false”, for example, “<provider useSets=‘false’>”. If the useSets attribute isn’t present, the value “true” will be assumed.

Publishing documents to a folder on disk—Instead of publishing the harvested documents directly to the specified Metadata Service, they can be stored on disk. You can review or process the documents on disk before publishing them. This might be desirable if your Metadata Service doesn’t use an administrative table that keeps the harvested documents from becoming public immediately after harvesting.

For this option, remove the URL element from the provider element, set the agent attribute on the provider element to “com.esri.aims.mtier.mh.publisher.FolderPublisher”, and add a folder element with a value indicating the full path where the harvested documents will be stored. For example, “<provider agent=‘com.esri.aims.mtier.mh.publisher.FolderPublisher’>” and “<folder>D:\harvest</folder>”. When the configuration file is loaded, the harvester panel will appear empty; the folder information won’t be shown.

The folders used to organize documents in the provider’s Metadata Service won’t be replicated when documents are harvested to disk, even if the useSets attribute is set to “true”.
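
The three changes needed to publish to a folder on disk can be sketched in Python as follows. This is an illustration under stated assumptions: the provider element name, the agent value, and the folder element follow the examples above, but the exact spelling of the URL element and its position inside the provider element are assumptions that should be checked against your own configuration file.

```python
import xml.etree.ElementTree as ET

FOLDER_AGENT = "com.esri.aims.mtier.mh.publisher.FolderPublisher"


def switch_to_folder_publisher(provider, folder_path):
    """Apply the three edits: drop URL, set the agent, add the folder."""
    # Assumes the element is spelled "URL" and is a direct child.
    for url in provider.findall("URL"):
        provider.remove(url)
    provider.set("agent", FOLDER_AGENT)
    folder = provider.find("folder")
    if folder is None:
        folder = ET.SubElement(provider, "folder")
    folder.text = folder_path
    return provider


# Hypothetical fragment standing in for the element in OAIServers.xml.
provider = ET.fromstring(
    "<provider><URL>http://www.harvesterHost.com</URL></provider>"
)
switch_to_folder_publisher(provider, r"D:\harvest")
print(ET.tostring(provider, encoding="unicode"))
```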

Publishing harvested documents—By default, if any problems are encountered when publishing the harvested documents, the harvesting process will stop. Instead, you can choose to attempt to publish each harvested document; depending on the problem, some documents may be published successfully. To attempt to publish all documents, add the strict attribute to the provider element and set it to false, for example, “<provider strict=‘false’>”. If the strict attribute isn’t present, the value “true” will be assumed.
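
Both the useSets and strict attributes default to “true” when they are absent. The following Python sketch shows one way to read such attributes with that default. The function name is hypothetical, and treating only the value “false” (in any letter case) as false is an assumption about how the harvester interprets the attribute.

```python
import xml.etree.ElementTree as ET


def flag(provider, name):
    """Read a true/false attribute; a missing attribute counts as true."""
    # Assumption: anything other than "false" is treated as true.
    return provider.get(name, "true").strip().lower() != "false"


# Hypothetical fragment: strict is switched off, useSets is not specified.
provider = ET.fromstring('<provider strict="false"/>')

print("strict:", flag(provider, "strict"))    # explicitly set to false
print("useSets:", flag(provider, "useSets"))  # absent, so the default applies
```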

Setting log options

The log options for the OAI-PMH Harvester are similar to the options for the Z39.50 Connector. Logging options are set by modifying the log.properties file, which is located at <ArcIMS install directory>/Metadata/OAI-PMH.

Log information can be sent to several locations: standard output, the status panel in the OAI-PMH Harvester, a log file, and a separate debug log file. By default, information is logged only to standard output and the status panel. Usually it’s most convenient to check the status of the harvesting process in the status panel of the dialog box. Use the debug log file to troubleshoot any problems you may encounter. Because the standard output for the OAI-PMH Harvester is sent to the command window from which it was started, you may choose not to send log information to that location.

Four levels of logging can be used to record increasing amounts of information: ERROR, WARN, INFO, and DEBUG. The level of logging available to all logs is set on the first line of the log.properties file. By default, it’s set to the INFO level, which records the status of the OAI-PMH Harvester and the harvesting process, warnings that are generated, and any errors that occur. At the ERROR level, only error messages are recorded. At the WARN level, only warnings and error messages are recorded. At the DEBUG level, all messages are recorded.

Each log can record information at a different level. To change a log’s level, set its threshold property in the log.properties file. No log can record more information than the overall level, which is specified in the first line of the file. For example, if the overall level is INFO and the debug log’s threshold is set to DEBUG, the debug log file can’t record more than the INFO level of information. With the overall level set to DEBUG, everything is recorded in the debug log file; the other logs will continue at their specified thresholds.
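
The interaction between the overall level and a log’s threshold can be summarized as “the less detailed of the two wins.” The following Python sketch models that rule. It is only a model of the behavior described above, with hypothetical function names; it does not read or write the log.properties file.

```python
# Levels in order of increasing detail.
LEVELS = ["ERROR", "WARN", "INFO", "DEBUG"]


def effective_level(overall, threshold):
    """A log records at its own threshold, capped by the overall level."""
    return LEVELS[min(LEVELS.index(overall), LEVELS.index(threshold))]


def is_recorded(message_level, overall, threshold):
    """True if a message of the given level ends up in the log."""
    limit = LEVELS.index(effective_level(overall, threshold))
    return LEVELS.index(message_level) <= limit


# Overall level INFO with a debug log threshold of DEBUG:
# the debug log is still limited to INFO.
print(effective_level("INFO", "DEBUG"))
print(is_recorded("DEBUG", overall="INFO", threshold="DEBUG"))
```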

If you start a log file and a debug log file, files named “access.log” and “debug.log”, respectively, will be created by default in the <ArcIMS install directory>/Metadata/OAI-PMH directory. Settings in the log.properties file let you specify different file names and locations. They also let you include a date in the log file’s name.

The steps for starting and stopping a log, changing a log’s threshold, and changing a log file’s name and date are the same for the OAI-PMH Harvester as for the Z39.50 Connector. See ‘Setting log options’ for detailed instructions about how to complete these tasks.
