Language settings for ArcSDE XML columns in geodatabases in Oracle

If you plan to query a table using the contents of an XML column, check the default lexer that will be used for linguistic analysis with text indexes using SQL*Plus—this is particularly important when the column will contain XML documents with text written in a language other than English. If custom settings are not provided in the ArcSDE DBTUNE table, the default Oracle text parameters will determine how text indexes are created for ArcSDE XML columns.

The default objects used for text indexes are owned by the ctxsys user and are created for the database installation, not for individual databases. Their properties are based on the language settings used when Oracle is installed; see the sections about system-defined preferences and system parameters in the Oracle Text Reference for your version of the database to learn how these properties are set.

The Oracle parameters that affect ArcSDE XML columns are DEFAULT_LEXER, DEFAULT_STOPLIST, DEFAULT_STORAGE, DEFAULT_WORDLIST, and DEFAULT_INDEX_MEMORY.

By default, these parameters are typically set to the system-defined preferences with the same name that are owned by the ctxsys user or an appropriate value. For example, the DEFAULT_LEXER is typically set to use the preference CTXSYS.DEFAULT_LEXER. However, someone in your organization may have previously configured Oracle to use a custom preference by default instead.

For many Western European languages and for languages where a language-specific lexer is not provided, the CTXSYS.DEFAULT_LEXER preference will typically use the BASIC_LEXER lexer; the CTXSYS.DEFAULT_WORDLIST preference will often use the BASIC_WORDLIST wordlist.

To successfully index and search some Western European languages, specific wordlist settings are required. If your language is French, for example, make sure the stemmer and fuzzy_match attributes for the BASIC_WORDLIST preference are set to French.

After checking the current text settings in Oracle and consulting the Oracle Text Reference, you may want to customize the settings that will be used to index your XML documents. This requires creating the appropriate preferences in Oracle, then referencing those parameters with the ArcSDE DBTUNE parameter XML_IDX_INDEX_TEXT. Then, use the DBTUNE keyword containing the custom XML_IDX_INDEX_TEXT parameter when creating an XML column. For an ArcIMS Metadata Service, you can reference a DBTUNE parameter in its ArcXML configuration file.

Custom lexers

Depending on the language of your XML documents and the settings at the time Oracle was installed, the DEFAULT_LEXER parameter might not be configured to use the best lexer for your situation. A lexer preference must be set correctly for your language to successfully index XML documents.

For example, if the language you are working in is Thai, Oracle does not provide a Thai lexer, and the DEFAULT_LEXER preference might be set to use the BASIC_LEXER lexer as a result. The BASIC_LEXER indexes and searches text using whitespace only, which may not offer the best results with a language such as Thai. If your XML documents are encoded as UTF-8, you may get better results using the WORLD_LEXER in this case if you are using Oracle 10g or a newer version. All XML documents published to an ArcSDE XML column by ArcIMS Metadata Services are encoded as UTF-8.

It may be helpful to review the system-provided text objects before creating custom preferences, for example, to be sure that the lexer you want to use is available with your Oracle installation.

You use this preference when creating an XML column by setting the preference for it in the XML_IDX_INDEX_TEXT DBTUNE parameter. Always preface the preference name with the user who created the preference. Anyone can use a text preference created by another user.

If you can use the BASIC_LEXER to index your XML documents, you may want to create a custom lexer to merge text that includes nonalphanumeric characters into one word, for example. This can be accomplished using the printjoins attribute of the BASIC_LEXER, and it would be useful to standardize indexing and searching of place-names that use punctuation such as that in Coeur d'Alene.

Creating a custom lexer using the BASIC_LEXER's skipjoin attribute with the apostrophe character would index Coeur d'Alene as Coeur dAlene. Also, any incorrect spellings of this place using extra apostrophes, such as C'oeur d'Alene, would also be indexed as Coeur dAlene, which could produce better search results. For example, you would create a text preference like the following:

begin ctx_ddl.create_preference('BASIC_LEX_SKIP', 'BASIC_LEXER'); 
ctx_ddl.set_attribute('BASIC_LEX_SKIP', 'skipjoins', '''');
end;
/

Use care when changing how punctuation is indexed. The same preference will be used when indexing all text in your XML documents. Values in one XML element may use punctuation differently than in others.

A custom lexer preference may also be required in cases where the DEFAULT_LEXER is correctly set to use the BASIC_LEXER, but the attributes of the BASIC_LEXER are not correct for your language. The BASIC_LEXER attributes for index stems, composite word indexing, alternate spelling, and new German spelling must be set appropriately for your language to correctly index XML documents. This is particularly important for Western European languages. After checking the appropriate values for these attributes for your language in the Oracle Text Reference, create a custom lexer preference following the example above, then reference it in the XML_IDX_INDEX_TEXT DBTUNE parameter.

Custom wordlist or stoplist

Depending on the language of your XML documents and the settings at the time Oracle was installed, the DEFAULT_WORDLIST and DEFAULT_STOPLIST preferences may not have the correct configuration for your language.

The wordlist preference must be set correctly for your language to successfully search XML documents. This is particularly important for Western European languages, where the BASIC_WORDLIST preference must have the correct language set for stemming and fuzzy matching.

The stoplist preference controls which words in the XML documents will not be indexed and, therefore, not available for searching. Stoplists are available for several languages. You can also customize the stoplists if this is appropriate for your situation.

After checking the appropriate values for these settings for your language in the Oracle Text Reference, create a custom preference, then add it to the XML_IDX_INDEX_TEXT DBTUNE parameter's value.

Related Topics


11/18/2013