Details of Lucene Indexing in the Geoportal

Indexing is important because it determines which properties are queriable and how search results are returned when a user submits search criteria to the geoportal. When a metadata document is published, certain content from the document is submitted for indexing by the search engine. To support the more advanced features of Lucene (the technology used for the Geoportal extension's search indexing), this content is assigned a particular meaning. A 'meaning' is a concept or predicate that you would like to search or query specifically. The meaning determines how Lucene indexes the content and how the content can be used in searching.

Before a 'meaning' value can be used, it has to be defined in a file called property-meanings.xml, located in the \\geoportal\WEB-INF\classes\gpt\metadata folder. The Geoportal references property-meanings.xml to index the metadata value for search and retrieval.

Each geoportal metadata profile's definition.xml file can specify the set of properties that will be indexed. These properties are usually captured in that profile's indexables.xml file. The indexables.xml file connects an element's XML xpath to its associated meaning in the property-meanings.xml file. This in turn defines how that element will be indexed and searched.

Note:

The geoportal can be customized so that it automatically indexes all metadata content, regardless of which parameter it is associated with in the metadata. To enable this customization, see Index All Metadata Content.

Determine if a metadata element is already indexed by default

If a metadata element appears in one of the geoportal's default metadata editors, it is likely that this element is already indexed by default. However, if you have created a custom metadata profile with new metadata parameters, or added new metadata elements to the default editors, then you may need to define the indexing for the element.

To check whether the element is already indexed, identify the definition.xml file for the profile that references the metadata element. For example, to investigate whether the Lineage element from the INSPIRE (Data) profile is indexed, we start by opening the inspire-iso-19115-definition.xml file in a text editor. Here, we need to identify which indexables.xml file is referenced by this profile, and also find the xpath to the metadata element of interest. To find the indexables.xml file, look at the fileName attribute of the <indexables fileName=""> element in the definition.xml file. In our example, this points to the apiso-indexables.xml file in the \\geoportal\WEB-INF\classes\gpt\metadata\iso folder. Once you have identified which indexables.xml file is referenced, open that file in a text or XML editor.

To find the xpath for the metadata element of interest, find the <parameter> element for that metadata element in the definition.xml file. In that <parameter> element, there is a <content> sub-element. The <content> element has a select attribute. Copy the xpath from that select attribute.
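
As a rough illustration of the two lookups just described, the fragment below sketches how the relevant parts of a definition.xml file might look for the Lineage example. Only the <indexables fileName=""> attribute, the <parameter> and <content> elements, and the select attribute come from this topic. The enclosing element name, the key attribute, and the form of the fileName path are assumptions, so compare the sketch against your own file rather than copying it.

```xml
<!-- Sketch only: the "schema" wrapper and key="lineage" are hypothetical placeholders. -->
<schema>
  <!-- Names the indexables.xml file used by this profile.
       The path form shown here is an assumption. -->
  <indexables fileName="gpt/metadata/iso/apiso-indexables.xml"/>

  <!-- One parameter per metadata element; all other content is omitted. -->
  <parameter key="lineage">
    <!-- The select attribute holds the xpath to copy. -->
    <content select="/gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString"/>
  </parameter>
</schema>
```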

For the metadata element to be indexed, its xpath must be listed in the indexables.xml file referenced by its metadata profile. To check this, return to the indexables.xml file and search for the xpath you just copied. In our Lineage example, we copied the string /gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString from the inspire-iso-19115-definition.xml file and searched for it in the apiso-indexables.xml file. We do find it there, and see that it is indexed by the property meaning name "apiso:Lineage". When we look up apiso name="apiso:Lineage" in the property-meanings.xml file, we see that the queriable for this meaning is the text apiso.Lineage. So we could type apiso.Lineage:searchTerm in the Search field on the geoportal search page to search the Lineage elements for searchTerm.

If the xpath to the metadata element is not provided in the indexables.xml document, you can add its xpath to one of the property meanings listed in that file. After adding the xpath to the property meaning that matches your metadata element's meaning, save the indexables.xml file and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
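
A minimal sketch of such an addition follows, assuming that entries in indexables.xml pair a meaning with an xpath. The <property> element name and its meaning and xpath attribute names are assumptions not confirmed by this topic, and the xpath is invented purely for illustration. Follow the form of the entries already present in your own file.

```xml
<!-- Sketch only: element and attribute names are assumed, not taken from this topic. -->
<indexables>
  <!-- Hypothetical new entry: index a custom element under the existing "body"
       meaning, so that it is found by general text searches. -->
  <property meaning="body"
            xpath="/gmd:MD_Metadata/hypothetical:projectNotes/gco:CharacterString"/>
</indexables>
```

The body meaning is used here because it is the catch-all described in the property-meanings table; an element that needs its own queriable would instead be attached to a more specific meaning.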

Instructions are provided later in this topic, but first read the section on the property-meanings.xml file below.

The property-meanings.xml file

Before adding new meanings, check the property-meanings.xml file to see if an existing meaning will suit your need. Some of the meanings already defined in the file are listed in the table below, along with any functionality the Geoportal code associates with that meaning. Additional meanings defined for ISO-based standards are also found in the property-meanings.xml file, but are not listed in the table. By using existing meanings, the effort to upgrade to future versions of the Geoportal extension is minimized. The existing meanings should satisfy most of the search needs.

| Property-meaning name | Description | Geoportal function |
|---|---|---|
| uuid | Geoportal's primary key for identifying the document. | Typically you will see this value in URLs. For example: http://host:port/geoportal/rest/document?id=[uuid] |
| fileIdentifier | Represents an identifier from within the metadata document. Not all metadata standards support an internal identifier metadata element. If present, it is recommended that it be globally unique. | Used by the Geoportal to avoid duplication of resources and as an alternative identifier for most of the REST-based functions. For example: http://host:port/geoportal/rest/document?id=[fileIdentifier] |
| sys.siteuuid | Used internally by the Geoportal and associated with documents that are harvested from remote catalogs. It is the identifier of the remote catalog. Do not alter this value. | Available for query. |
| dateModified | Geoportal's modification datestamp, recording the last occurrence of the resource's XML being updated. | Used in the Additional Options dialog on the geoportal Search page, and for sorting by date. |
| geometry | Represents the bounding envelope associated with the resource. | Used for spatial queries. |
| keywords | Keywords associated with the resource. | Available for query. |
| body | Non-specific query; a catch-all for indexing and searching text in a metadata document. | If you want to index a certain element, but do not plan to query for that specific element, index it as body. |
| anytext | Anytext is not actually indexed. It represents a collection of properties that will be searched when the queriable anytext is specified. | General searches that are not directed to a specific property are anytext queries. |
| title | Title of the resource. | Used when the resource's title is displayed, for example in the list of search results on the Search page. |
| title.org | Captures the original title as provided from a resource's GetCapabilities response. | Enables the geoportal to search both a user-given title for a registered resource and its original title as per the GetCapabilities response. |
| abstract | Abstract associated with the resource. | Maps to the information displayed as text below the title of a record in the list of search results. |
| contentType | Esri concept for categorizing resources. | Used for generating the icon for the resource listed in Search page results, and also as a filter on the Additional Options dialog. |
| dataTheme | ISO Topic Category code associated with the resource. ISO has defined the Topic Category codelist in the 19115 standard. | Maps to the ISO Categories in the Additional Options dialog. |
| resource.url | Primary endpoint for accessing the resource through the internet. | Used for generation of links in search results. For example, it is the URL accessed when the Preview or Open link is clicked. It is also sometimes used to determine the Esri contentType for the resource. |
| thumbnail.url | URL to the thumbnail image for the resource. | Used for generation of the thumbnail image next to the resource in the list of search results. |
| website.url | URL to a website associated with the resource. | Used for generation of a website link for the resource in the list of search results. |

Each property-meaning in the property-meanings.xml file has attributes. These attributes for property-meanings are described below.

Attribute Name

Description

name

Unique name for the meaning in this file. It should match the meaning="" attribute in the definition.xml file. The name becomes a Lucene field that can be used for advanced searches, as per the Lucene documentation. For example, if the name is title, typing title:water on your Geoportal search page returns only items with water in the index that Lucene has associated with the property-meaning title.

meaningType

Used to flag metadata elements that are tied to functionality within the Geoportal. It is good practice to avoid altering the meaningType of a property-meaning.

valueType

Data type of the property value, e.g. Double, Geometry, Long, String, or Timestamp.

comparisonType

Indicates how Lucene will index the property values. There are three options defined in the property-meanings.xml file:

  • terms: phrases associated with this attribute are tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of terms, it is stored as two separate words, "San" and "Diego". Terms are also stored in lowercase form, e.g. "san" and "diego".
  • keyword: phrases associated with this attribute are not tokenized. For example, if "San Diego" is stored under a meaning that has a comparisonType of keyword, it is stored as one phrase. A search for "San" will not return the record; only a search for "San Diego" will. Keywords are also stored in lowercase form, e.g. "san diego".
  • value: items associated with this attribute are stored as values, not phrases or words. Items are case-sensitive. An example is the fileIdentifier meaning. Parameters with a meaning="fileIdentifier" likely hold unique identification strings, such as {F56408D6-4325-484C-B753-5E8FD4421E31}. Searching for part of the string, such as "E31", will not retrieve the record because the string is stored as a complete value and is not parsed. Searching for the string "{f56408d6-4325-484c-b753-5e8fd4421e31}" will also not return the record because the stored value is case-sensitive.
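
To make the three options concrete, the sketch below declares one property-meaning per comparisonType, using only attributes described in this table. The three names are hypothetical and exist only for illustration, and the wrapping element is a placeholder. The value terms is spelled as it appears in the anytext sample in this topic. The meaningType attribute is left out on the assumption that it is only needed for meanings tied to built-in Geoportal functionality.

```xml
<!-- Sketch only: the wrapper element and all three names are hypothetical. -->
<property-meanings>
  <!-- Tokenized and lowercased: "San Diego" is stored as "san" and "diego". -->
  <property-meaning name="example.place.terms" valueType="String" comparisonType="terms"/>

  <!-- Not tokenized, lowercased: "San Diego" is stored as "san diego". -->
  <property-meaning name="example.place.keyword" valueType="String" comparisonType="keyword"/>

  <!-- Stored whole and case-sensitive, for example an identifier string. -->
  <property-meaning name="example.id.value" valueType="String" comparisonType="value"/>
</property-meanings>
```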

Some property-meanings have one or two additional sub-elements, <dc> and <consider>.

  • The <dc> element stands for "Dublin Core". The <dc> element facilitates the connection of property-meanings to Dublin Core concepts. This is essential to supporting the CS-W OGCORE profile, defining what is queriable and returnable through CS-W. Within the <dc> element, there is an attribute for name and one for aliases. The name attribute defines the name of the Dublin Core element. The aliases attribute defines alternate words that will be recognized when supplied as a CS-W property name.
  • The <consider> element is used only for the anytext property. It defines other property-meanings that should be included when a search target is anytext. For example, the property-meaning for anytext is shown below. Because anytext has six other property-meanings listed in its <consider> element, a search for anytext results in the title, abstract, keywords, body, contentType, and dataTheme properties being searched.
    <property-meaning name="anytext" meaningType="anytext" valueType="String" comparisonType="terms" allowLeadingWildcard="true">
      <consider>title,abstract,keywords,body,contentType,dataTheme</consider>
      <dc name="AnyText" aliases="csw:AnyText,any,csw:Any"/>
    </property-meaning>

How to define a new property meaning

If you have created a custom metadata profile, or added new elements to an existing geoportal metadata profile, and none of the existing property meanings in the property-meanings.xml file suit your needs, then you may need to define a new property meaning. Follow the instructions below.

  1. Using the attributes described in the table above, add a new property meaning to the property-meanings.xml file.
  2. Add a reference to your property meaning to the indexables.xml file for your metadata profile. Make sure that the xpath for the property meaning in the indexables.xml file correctly references the xpath for the element as defined in its definition.xml file.
  3. Save the files, and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
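
As a hedged sketch of step 1, a new entry in property-meanings.xml might look like the following. The name custom.projectCode, the Dublin Core mapping, and the alias are hypothetical. Only the attribute names and the <dc> sub-element are taken from the descriptions earlier in this topic, and meaningType is omitted on the assumption that a custom meaning is not tied to built-in Geoportal functionality.

```xml
<!-- Sketch only: the name, the dc mapping, and the alias are hypothetical. -->
<property-meaning name="custom.projectCode" valueType="String" comparisonType="keyword">
  <!-- Optional: exposes the meaning under a Dublin Core name for CS-W queries. -->
  <dc name="dc:identifier" aliases="projectCode"/>
</property-meaning>
```

Once the meaning is referenced from indexables.xml and the resources are reindexed, a search of the form custom.projectCode:value should follow the same pattern as the title:water example described for the name attribute.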


8/6/2012