Details of Lucene Indexing in the Geoportal
Indexing is important because it determines which properties are queriable and how search results are returned when a user submits search criteria to the geoportal. When a metadata document is published, certain content from the document is submitted for indexing by the search engine. To support the more advanced features of Lucene, the technology used for the Geoportal extension's search indexing, this content is assigned a particular meaning. A 'meaning' is a concept or predicate that you would like to search or query specifically. It determines how Lucene indexes the content and how the content can be used in searching.
Before a 'meaning' value can be used, it has to be defined in a file called property-meanings.xml, located in the \\geoportal\WEB-INF\classes\gpt\metadata folder. The Geoportal references property-meanings.xml to index the metadata value for search and retrieval.
Each geoportal metadata profile's definition.xml file can specify the set of properties that will be indexed. These properties are usually captured in that profile's indexables.xml file. The indexables.xml file connects an element's XML xpath to its associated meaning in the property-meanings.xml file. This in turn defines how that element will be indexed and searched.
The geoportal can be customized so that it automatically indexes all metadata content, regardless of which parameter it is associated with in the metadata. To enable this customization, see Index All Metadata Content.
Determine if a metadata element is already indexed by default
If a metadata element appears in one of the geoportal's default metadata editors, it is likely that this element is already indexed by default. However, if you have created a custom metadata profile with new metadata parameters, or added new metadata elements to the default editors, then you may need to define the indexing for the element.
To check whether the element is already indexed, identify the definition.xml file for the profile that references the metadata element. For example, to investigate whether the Lineage element from the INSPIRE (Data) profile is indexed, we start by opening the inspire-iso-19115-definition.xml file in a text editor. Here, we need to identify which indexables.xml file is referenced by this profile, and also find the xpath to the metadata element of interest. To find the indexables.xml file, look at the fileName attribute of the <indexables fileName=""> element in the definition.xml file. In our example, this points to the apiso-indexables.xml file in the \\geoportal\WEB-INF\classes\gpt\metadata\iso folder. Once you have identified which indexables.xml file is referenced, open that file in a text or XML editor.
To find the xpath for the metadata element of interest, find the <parameter> element for that metadata element in the definition.xml file. In that <parameter> element, there is a <content> sub element. The <content> has a select attribute. Copy the xpath from that select attribute.
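The two lookups above can also be scripted. The following is a minimal sketch, assuming a heavily simplified definition.xml: only the <indexables fileName="">, <parameter>, and <content select=""> names come from the text above, while the root element name, the nesting, and the fileName value are hypothetical. A real definition.xml file may declare XML namespaces, in which case the element lookups would need to be adjusted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a definition.xml file.
# The <schema> root, the nesting, and the fileName value are assumptions.
SAMPLE_DEFINITION = """
<schema>
  <indexables fileName="gpt/metadata/iso/apiso-indexables.xml"/>
  <parameter>
    <content select="/gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString"/>
  </parameter>
</schema>
"""


def summarize_definition(xml_text):
    """Return (indexables file name, list of xpaths from <content select="">)."""
    root = ET.fromstring(xml_text)

    # The fileName attribute names the indexables.xml file used by the profile.
    indexables = root.find(".//indexables")
    file_name = indexables.get("fileName") if indexables is not None else None

    # Collect the select attribute of the <content> child of each <parameter>.
    xpaths = []
    for parameter in root.iter("parameter"):
        content = parameter.find("content")
        if content is not None and content.get("select"):
            xpaths.append(content.get("select"))
    return file_name, xpaths


if __name__ == "__main__":
    name, paths = summarize_definition(SAMPLE_DEFINITION)
    print("indexables file:", name)
    for path in paths:
        print("xpath:", path)
```

Each xpath printed by the script is the string you would then look for in the referenced indexables.xml file.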
In order for the metadata element to be indexed, its xpath must be listed in the indexables.xml file referenced by its metadata profile. To check this, return to the indexables.xml file and search for the xpath you just copied. In our Lineage example, we copied the string /gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString from the inspire-iso-19115-definition.xml file and searched for it in the apiso-indexables.xml file. We do find it there, and see that it is indexed by the property meaning named "apiso:Lineage". When we look up the property meaning with name="apiso:Lineage" in the property-meanings.xml file, we see that its queriable is the text apiso.Lineage. So we could type apiso.Lineage:searchTerm in the Search field on the geoportal search page to search the Lineage elements for searchTerm.
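If you need to run this check for many elements, the text search can be scripted. The sketch below treats the indexables.xml content as plain text and reports the lines that contain a given xpath. The sample content, including its <property meaning="" xpath=""> layout and the apiso:Title entry, is a hypothetical stand-in used only for illustration; check your own indexables.xml file for the actual layout.

```python
LINEAGE_XPATH = (
    "/gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality"
    "/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString"
)

# Hypothetical stand-in for the text of an indexables.xml file.
# The element and attribute names here are assumptions, not the real schema.
SAMPLE_INDEXABLES = """<indexables>
  <property meaning="apiso:Title" xpath="/gmd:MD_Metadata/gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"/>
  <property meaning="apiso:Lineage" xpath="/gmd:MD_Metadata/gmd:dataQualityInfo/gmd:DQ_DataQuality/gmd:lineage/gmd:LI_Lineage/gmd:statement/gco:CharacterString"/>
</indexables>"""


def find_xpath(indexables_text, xpath):
    """Return the 1-based numbers of the lines that contain the xpath string."""
    return [
        number
        for number, line in enumerate(indexables_text.splitlines(), start=1)
        if xpath in line
    ]


if __name__ == "__main__":
    hits = find_xpath(SAMPLE_INDEXABLES, LINEAGE_XPATH)
    if hits:
        print("xpath found on line(s):", hits)
    else:
        print("xpath not found; the element is probably not indexed")
```

Because this is a substring match, a short xpath also matches any longer xpath that contains it, so review the reported lines before drawing a conclusion.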
If the xpath to the metadata element is not provided in the indexables.xml document, you can add its xpath to one of the property meanings listed in that file. After adding the xpath to the property meaning that matches your metadata element's meaning, save the indexables.xml file and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
Instructions are provided later in this topic, but first read the section on the property-meanings.xml file below.
The property-meanings.xml file
Before adding new meanings, check the property-meanings.xml file to see if an existing meaning will suit your need. Some of the meanings already defined in the file are listed in the table below, along with any functionality the Geoportal code associates with that meaning. Additional meanings defined for ISO-based standards are also found in the property-meanings.xml file, but are not listed in the table. By using existing meanings, the effort to upgrade to future versions of the Geoportal extension is minimized. The existing meanings should satisfy most of the search needs.
property-meaning name | description | Geoportal function |
---|---|---|
uuid | Geoportal's primary key for identifying the document. | Typically you will see this value in URLs. For example: http://host:port/geoportal/rest/document?id=[uuid] |
fileIdentifier | Represents an identifier from within the metadata document. Not all metadata standards support an internal identifier metadata element. If present, it is recommended that it be globally unique. | Used by the Geoportal to avoid duplication of resources and as an alternative identifier for most of the REST-based functions. For example: http://host:port/geoportal/rest/document?id=[fileIdentifier] |
sys.siteuuid | Used internally by the Geoportal and associated with documents that are harvested from remote catalogs. It is the identifier of the remote catalog. Do not alter this. | Available for query. |
dateModified | Geoportal's modification datestamp, recording the last time the resource's XML was updated. | Used in the Additional Options dialog on the geoportal Search page, and for sorting by date. |
geometry | Represents the bounding envelope associated with the resource. | Used for spatial queries. |
keywords | Keywords associated with the resource. | Available for query. |
body | Non-specific query; a catch-all for indexing and searching text in a metadata document. | If you want to index a certain element, but do not plan to query for that specific element, index it as body. |
anytext | Anytext is not actually indexed. It represents a collection of properties that will be searched when the queriable anytext is specified. | General searches that are not directed to a specific property are anytext queries. |
title | Title of the resource. | Used when the resource's title is displayed, for example in the list of search results on the Search page. |
title.org | Captures the original title as provided from a resource's GetCapabilities response. | Enables geoportal to search both a user-given title for a registered resource, and its original title as per the GetCapabilities response. |
abstract | Abstract associated with the resource. | Maps to the information displayed as text below the title of a record in the list of search results. |
contentType | Esri concept for categorizing resources. | Used for generating the icon for the resource listed in Search page results, and also as a filter on the Additional Options dialog. |
dataTheme | ISO Topic Category code associated with the resource. ISO has defined the Topic Category codelist in the 19115 standard. | Maps to the ISO Categories in the Additional Options dialog. |
resource.url | Primary endpoint for accessing the resource through the internet. | Used for generation of links in search results. For example, it is the URL accessed when the Preview or Open link is clicked. It is also sometimes used to determine the Esri contentType for the resource. |
thumbnail.url | URL to the thumbnail image for the resource. | Used for generation of the thumbnail image next to the resource in the list of search results. |
website.url | URL to a website associated with the resource. | Used for generation of a website link for the resource in the list of search results. |
Each property-meaning in the property-meanings.xml file has attributes. These attributes for property-meanings are described below.
Attribute Name | Description |
---|---|
name | Unique name for the meaning in this file. It should match the meaning="" attribute in the definition.xml file. The designated name becomes a Lucene field that can be used for advanced searches, as described in the Lucene documentation. For example, designating a name of title and then typing title:water on your Geoportal search page returns only items with water in the index that Lucene has associated with the property-meaning title. |
meaningType | Used to flag metadata elements that are tied to functionality within the Geoportal. It is good practice to avoid altering the meaningType of a property-meaning. |
valueType | Data type of the property value, e.g. Double, Geometry, Long, String, or Timestamp. |
comparisonType | Indicates how Lucene will index the property values. Three options are defined in the property-meanings.xml file; the anytext property-meaning shown later in this topic, for example, uses terms. |
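Because each property-meaning name becomes a Lucene field, a field-specific search string is simply the field name, a colon, and the search term. The helper below is a small sketch of how such strings could be composed; the function name is hypothetical. It quotes multi-word terms as phrases, following Lucene query syntax, but does not escape other Lucene special characters.

```python
def field_query(field, term):
    """Build a Lucene-style 'field:term' search string.

    Terms that contain whitespace are wrapped in double quotes so that
    they are treated as a phrase. Other special characters are left as-is.
    """
    term = term.strip()
    if any(ch.isspace() for ch in term):
        term = '"' + term.replace('"', '\\"') + '"'
    return f"{field}:{term}"


if __name__ == "__main__":
    print(field_query("title", "water"))
    print(field_query("title", "surface water"))
    print(field_query("apiso.Lineage", "searchTerm"))
```

The resulting strings are what you would type into the Search field on the geoportal search page.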
Some property-meanings have one or two additional sub-elements, <dc> and <consider>.
- The <dc> element stands for "Dublin Core". It connects property-meanings to Dublin Core concepts. This is essential to supporting the CS-W OGCORE profile, because it defines what is queriable and returnable through CS-W. The <dc> element has two attributes, name and aliases. The name attribute defines the name of the Dublin Core element. The aliases attribute defines alternate words that will be recognized when supplied as a CS-W property name.
- The <consider> element is used only for the anytext property. It defines the other property-meanings that should be included when a search targets anytext. For example, the property-meaning for anytext is shown below. Because anytext lists six other property-meanings in its <consider> element, a search for anytext results in the title, abstract, keywords, body, contentType, and dataTheme properties being searched.
<property-meaning name="anytext" meaningType="anytext" valueType="String" comparisonType="terms" allowLeadingWildcard="true">
  <consider>title,abstract,keywords,body,contentType,dataTheme</consider>
  <dc name="AnyText" aliases="csw:AnyText,any,csw:Any"/>
</property-meaning>
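To see exactly which properties an anytext search covers, you can read the <consider> and <dc> elements programmatically. The following sketch parses the anytext property-meaning shown above as a standalone fragment; the function names are hypothetical, and reading the full property-meanings.xml file would require locating this element inside that file first.

```python
import xml.etree.ElementTree as ET

# The anytext property-meaning, copied from the example above.
ANYTEXT_XML = """<property-meaning name="anytext" meaningType="anytext" valueType="String" comparisonType="terms" allowLeadingWildcard="true">
  <consider>title,abstract,keywords,body,contentType,dataTheme</consider>
  <dc name="AnyText" aliases="csw:AnyText,any,csw:Any"/>
</property-meaning>"""


def _split_list(text):
    """Split a comma-separated string into trimmed, non-empty items."""
    return [item.strip() for item in (text or "").split(",") if item.strip()]


def considered_meanings(xml_text):
    """Return the property-meaning names listed in the <consider> element."""
    consider = ET.fromstring(xml_text).find("consider")
    return _split_list(consider.text) if consider is not None else []


def dc_aliases(xml_text):
    """Return the aliases declared on the <dc> element, if there is one."""
    dc = ET.fromstring(xml_text).find("dc")
    return _split_list(dc.get("aliases")) if dc is not None else []


if __name__ == "__main__":
    print("anytext searches:", considered_meanings(ANYTEXT_XML))
    print("CS-W aliases:", dc_aliases(ANYTEXT_XML))
```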
How to define a new property meaning
If you have created a custom metadata profile, or added new elements to an existing geoportal metadata profile, and none of the existing property meanings in the property-meanings.xml file suit your needs, then you may need to define a new property meaning. Follow the steps below.
- Using the parameters described in the table above, add a new property meaning to the property-meanings.xml file.
- Now, add a reference to your property meaning to the indexables.xml file for your metadata profile. Make sure that the xpath for the property meaning in the indexables.xml file matches the xpath for the element as defined in its definition.xml file.
- Save the files, and restart the geoportal web application. You will need to re-approve the resources through the geoportal Administration interface for them to be reindexed with your new property meaning.
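Before restarting the web application, it can help to confirm that a new entry carries the attributes described in the attribute table above. The sketch below checks a single property-meaning fragment for the name, meaningType, valueType, and comparisonType attributes. Treating all four as required is an assumption made for this check, and the custom.lineage name in the sample is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes described in the attribute table; treating all four as
# required is an assumption of this sketch.
EXPECTED_ATTRIBUTES = ("name", "meaningType", "valueType", "comparisonType")

# Hypothetical draft entry that is missing its meaningType attribute.
DRAFT_MEANING = (
    '<property-meaning name="custom.lineage" valueType="String" '
    'comparisonType="terms"/>'
)


def missing_attributes(xml_text):
    """Return the expected attributes that are absent or empty."""
    element = ET.fromstring(xml_text)
    return [attr for attr in EXPECTED_ATTRIBUTES if not element.get(attr)]


if __name__ == "__main__":
    missing = missing_attributes(DRAFT_MEANING)
    if missing:
        print("missing attributes:", ", ".join(missing))
    else:
        print("all expected attributes are present")
```

ET.fromstring also raises a ParseError if the fragment is not well-formed XML, which catches simple editing mistakes such as an unclosed tag.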