Internationalization in XML API



Internationalization

Internationalization is the process of designing an application so that it can be adapted to various languages and regions without engineering changes.

An internationalized program has the following characteristics:

Culturally-dependent data, such as dates and currencies, appear in formats that conform to the end user’s region and language.
With the addition of localized data, the same application can run worldwide.
Support for new languages does not require recompilation.
It supports multiple character encoding standards and can be localized quickly.

To support internationalization, characters in XML documents can be encoded in several formats, such as UNICODE (UTF-8 or UTF-16) and ISO-8859-1. XML uses the UNICODE character set by default, but other encodings can be used if they are declared in the XML declaration at the beginning of the document.

ArcSDE, through its support for internationalization, imposes no restrictions on the encoding of an XML document that is inserted into the database. To support XML documents in various encodings, ArcSDE converts the XML document from its specified encoding into a UNICODE encoding. The UNICODE document is then parsed and stored in the database in UNICODE format. The XML index tag names and XPath expressions are also converted from the user’s locale to UNICODE before being sent to the database.
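The conversion described above can be illustrated with a minimal Python sketch. It is not ArcSDE code: `to_unicode` is a hypothetical helper, and the sample document and the choice of UTF-8 as the Unicode storage form are assumptions made only for illustration.

```python
def to_unicode(doc: bytes, declared_encoding: str) -> str:
    """Decode raw document bytes using the encoding the document declares."""
    return doc.decode(declared_encoding)


# A document as a client might submit it in ISO-8859-1 ("caf\u00e9" is "cafe" with e-acute).
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'
).encode("iso-8859-1")

# Unicode text, independent of the encoding the client used.
text = to_unicode(latin1_doc, "iso-8859-1")

# One possible Unicode byte form (UTF-8 is an assumption here).
utf8_form = text.encode("utf-8")
```

The e-acute occupies one byte in the ISO-8859-1 input and two bytes in the UTF-8 form, so the two byte strings differ even though the decoded text is identical.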

ArcSDE also returns the XML document to the user in exactly the same format as it was received. To avoid DBMS character conversion, the document is usually compressed and stored in a BLOB column.
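Why a binary column preserves the original format can be sketched as follows. This is an illustration only: `zlib` stands in for whatever compression ArcSDE applies (the text above does not name one), and `store_as_blob` and `fetch_from_blob` are hypothetical helpers, not ArcSDE functions.

```python
import zlib


def store_as_blob(doc: bytes) -> bytes:
    """Compress the raw bytes; the result is opaque binary data for the DBMS."""
    return zlib.compress(doc)


def fetch_from_blob(blob: bytes) -> bytes:
    """Decompress the stored bytes back into the original document."""
    return zlib.decompress(blob)


# A UTF-16 document containing non-ASCII characters (Python adds a BOM for "utf-16").
original = '<?xml version="1.0" encoding="UTF-16"?><a>\u65e5\u672c\u8a9e</a>'.encode("utf-16")

blob = store_as_blob(original)
returned = fetch_from_blob(blob)
```

Because the DBMS only ever sees bytes, it has no character data to convert, and `returned` matches `original` byte for byte.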

Encoding issues in ArcSDE

If an XML document’s encoding is not known, the user must not compress the document but must insert it uncompressed using the BINARY format. The BINARY format is the default setting for storing documents for which encoding is unknown. For more details on types of formats for storing documents, see the XML Documents section.
When fetching an XML index, ArcSDE converts the tag names into the user’s codepage. Users may lose information if they are in a codepage that cannot represent the characters in the index definition.
When specifying a list of tag names for an XML index, ArcSDE assumes that the tag names are in the codepage that is specified in the user’s locale. For example, a user who wants to enter tag names with Japanese characters must be in a Japanese or UNICODE environment for ArcSDE to be able to process the tag names correctly. The user’s locale is taken from the DBMS environment variables that the user sets, as with all other NLS processing done by ArcSDE.
For databases that allow the specification of a language in the creation of the full-text index, it is recommended that the user maintain a language-neutral setting. If the user is absolutely sure that they are only using a single language, there is no harm, from ArcSDE’s perspective, in setting a language on the full-text index. Language settings for full-text indexing are specified through the DBTUNE table.
As with XML indexes, ArcSDE relies on the user’s locale to determine the codepage of the XPath expression given by the user. Users must be in the correct codepage to search with language-specific characters.
Because documents in some UNICODE encodings (such as UCS2/UTF16) contain \0 bytes and can end with two \0 characters, CHAR * is not used to represent a pointer to the document. Instead, ArcSDE uses a void * and tells the user how many bytes to memcpy for the document. Similarly, when setting an XML document, the user passes a void * and a byte count.
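The last point in the list above, that a UTF-16 document cannot be handled as a NUL-terminated string, can be demonstrated with a short Python sketch. `c_string_view` is a hypothetical helper that mimics what a NUL-terminated reader (such as C's `strlen`) would see; it is not part of any ArcSDE API.

```python
def c_string_view(buf: bytes) -> bytes:
    """Return the bytes a NUL-terminated reader would see: everything before the first 0 byte."""
    end = buf.find(b"\x00")
    return buf if end < 0 else buf[:end]


# "<a/>" in UTF-16 little endian: every ASCII character is followed by a 0 byte.
doc = "<a/>".encode("utf-16-le")

truncated = c_string_view(doc)   # stops at the first 0 byte
byte_count = len(doc)            # the explicit length a caller must carry alongside the pointer
complete = doc[:byte_count]      # pointer plus byte count recovers the whole document
```

Here `truncated` holds only the first byte of the document, while the explicit byte count recovers all eight bytes.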

DBMS-specific encoding issues

Every DBMS handles UNICODE support differently:

Oracle: The codepage for UNICODE columns can be set by the user (for example UTF8 or AL16UTF16) at database creation time. Note that only Oracle 9.x strictly stores UNICODE strings in NCHAR/NVARCHAR columns. In Oracle 8.x these column types can be set to any character set. In both releases, the character set is specified by the 'national character set' clause when creating the database.

SQL Server: UNICODE columns are stored in UCS2.

Informix: UNICODE columns are stored in UTF8.

IBM DB2 and PostgreSQL: DB2 and PostgreSQL have no UNICODE columns. It is therefore the user’s responsibility to set the database’s codepage to UNICODE if they plan to store XML documents in various codepages.

ArcSDE will not be converting all data into UNICODE. ArcSDE will simply store the data in varchar/clob columns, letting the DBMS convert the data to the database’s codepage. The database administrator has the responsibility of making sure the XML documents and tag names are in a consistent codepage.
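The risk of letting data be converted to a non-UNICODE codepage can be seen in a small Python sketch. The tag name and the codepages below are examples chosen for illustration, and `fits_codepage` is a hypothetical helper; the sketch does not reproduce any DBMS's actual conversion rules.

```python
def fits_codepage(text: str, codepage: str) -> bool:
    """Report whether every character in text can be represented in codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical Japanese tag name (two kanji, U+4F4F U+6240).
tag = "\u4f4f\u6240"

in_western = fits_codepage(tag, "cp1252")      # Western European codepage
in_japanese = fits_codepage(tag, "shift_jis")  # Japanese codepage
in_unicode = fits_codepage(tag, "utf-8")       # UNICODE encoding

# A lossy conversion replaces each unrepresentable character with "?".
lossy = tag.encode("cp1252", errors="replace")
```

The tag name survives in the Japanese and UNICODE codepages but not in the Western one, which is why a consistent, preferably UNICODE, database codepage matters when documents arrive in several codepages.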

Determining a document’s encoding

ArcSDE determines the XML document's encoding from the document itself and not from the client’s environment as it does for regular data. In ArcSDE, XML documents are encoded in a character set that is independent of the client’s locale.

There are three ways to determine a document’s codepage.

In the XML declaration at the start of the document, the encoding can be specified by name.

For example: <?xml version="1.0" encoding="ISO-8859-1"?>

If an XML document has the encoding attribute, that is the codepage ArcSDE assumes it is in. For a more thorough discussion of the encoding attribute, please see the W3C's specification of XML.
If there is no encoding attribute, ArcSDE looks at the first 2 bytes of the document, searching for a BOM (Byte Order Mark). In the XML world, there are two BOMs used:
- 0xFEFF—The document is encoded in UNICODE UTF-16/UCS-2, big endian
- 0xFFFE—The document is encoded in UNICODE UTF-16/UCS-2, little endian
If there is no encoding attribute and no BOM, ArcSDE assumes that the document is in UNICODE UTF-8.
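The three checks above can be sketched in Python as follows. This is a simplified illustration, not ArcSDE's implementation: `guess_xml_encoding` is a hypothetical function, and the regular expression only recognizes a declaration written in an ASCII-compatible encoding at the very start of the document.

```python
import re

# Captures the encoding name from an XML declaration at the start of the document.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(doc: bytes) -> str:
    """Apply the three checks in order: declared name, then BOM, then UTF-8."""
    match = _DECLARATION.match(doc)
    if match:
        return match.group(1).decode("ascii")  # 1. encoding named in the declaration
    if doc[:2] == b"\xfe\xff":
        return "UTF-16BE"                      # 2a. BOM 0xFEFF: big endian
    if doc[:2] == b"\xff\xfe":
        return "UTF-16LE"                      # 2b. BOM 0xFFFE: little endian
    return "UTF-8"                             # 3. no declaration and no BOM
```

For example, `guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` returns `"ISO-8859-1"`, and a document with neither a declaration nor a BOM falls through to `"UTF-8"`.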
