Internationalization is the process of designing an application so that it can be adapted to various languages and regions without engineering changes.
An internationalized program has the following characteristics:
- Culturally dependent data, such as dates and currencies, appear in formats that conform to the end user's region and language.
- With the addition of localized data, the same application can run worldwide.
- Support for new languages does not require recompilation.
- It supports multiple character encoding standards and can be localized quickly.
To support internationalization, characters in XML documents can be encoded in several formats, such as UNICODE (UTF-8 or UTF-16) and ISO-8859-1.
XML uses the UNICODE character set by default, but other encodings can be used if they are declared in the XML declaration at the beginning of the document.
ArcSDE, through its support for internationalization, imposes no restrictions on the encoding of an XML document that is inserted into the database. To support XML documents in various encodings, ArcSDE converts the XML document from its specified encoding into a UNICODE encoding. The UNICODE document is then parsed and stored in the database in UNICODE format. The XML index tag names and XPath expressions are also converted from the user's locale to UNICODE before being sent to the database.
ArcSDE also returns the XML document to the user in exactly the same format as it was received.
To avoid DBMS character conversion, the document is usually compressed and stored in a BLOB column.
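The convert, compress, and return cycle described above can be sketched in a few lines. This is only an illustration, not ArcSDE's implementation: the helper names `to_stored_form` and `from_stored_form` are hypothetical, and the choice of UTF-8 as the internal UNICODE encoding and zlib as the compressor are assumptions, since the text does not say which ones ArcSDE uses.

```python
import zlib


def to_stored_form(raw: bytes, declared_encoding: str) -> bytes:
    """Decode from the declared encoding, re-encode as UTF-8, then compress.

    UTF-8 and zlib are assumptions made for this sketch.
    """
    text = raw.decode(declared_encoding)
    return zlib.compress(text.encode("utf-8"))


def from_stored_form(blob: bytes, target_encoding: str) -> bytes:
    """Decompress the stored bytes and re-encode them in the original encoding."""
    text = zlib.decompress(blob).decode("utf-8")
    return text.encode(target_encoding)


original = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Zürich</name>'
).encode("iso-8859-1")

blob = to_stored_form(original, "iso-8859-1")
restored = from_stored_form(blob, "iso-8859-1")
```

Because the stored value is an opaque compressed byte string, a DBMS has no character data to convert, which is the reason given above for using a BLOB column. In this sketch `restored` is byte-for-byte identical to `original`.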
Encoding issues in ArcSDE
- If an XML document's encoding is not known, the user must not compress the document but must insert it uncompressed using the BINARY format. The BINARY format is the default setting for storing documents for which the encoding is unknown. For more details on the types of formats for storing documents, see the XML Documents section.
- When fetching an XML index, ArcSDE converts the tag names into the user's codepage. Users may lose information if their codepage cannot represent the characters in the index definition.
- When specifying a list of tag names for an XML index, ArcSDE assumes that the tag names are in the codepage that is specified in the user's locale. For example, a user who wants to enter tag names with Japanese characters must be in a Japanese or UNICODE environment for ArcSDE to be able to process the tag names correctly. The user's locale is taken from the DBMS environment variables that the user sets, as with all other NLS processing done by ArcSDE.
- For databases that allow a language to be specified when the full-text index is created, it is recommended that the user maintain a language-neutral setting. If the user is absolutely sure that only a single language is in use, there is no harm, from ArcSDE's perspective, in setting a language on the full-text index. Language settings for full-text indexing are specified through the DBTUNE table.
- As with XML indexes, ArcSDE relies on the user's locale to determine the codepage of the XPath expression given by the user. Users must be in the correct codepage to search with language-specific characters.
- Because some UNICODE documents (such as UCS2/UTF16) can end with two \0 characters, CHAR * is not used to represent a pointer to the document. Instead, void * is used, and ArcSDE tells the user how many bytes to memcpy for the document. Similarly, when setting an XML document, the user passes a void * and a byte count.
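The last point, passing a raw pointer plus a byte count rather than a C string, can be demonstrated with Python's standard `ctypes` module. This is a minimal sketch and does not call any ArcSDE function; the sample document and variable names are illustrative only. It shows that a UTF-16 document contains \0 bytes, so reading it as a NUL-terminated string truncates it, while copying an explicit number of bytes preserves it.

```python
import ctypes

# A tiny UTF-16 (little endian) document: every ASCII character is
# followed by a \0 byte.
document = "<a/>".encode("utf-16-le")

# Treat the bytes as a C buffer of exactly len(document) bytes.
source = ctypes.create_string_buffer(document, len(document))

# Reading the buffer as a NUL-terminated C string stops at the first \0.
as_c_string = source.value

# Copying with an explicit byte count (the memcpy-style approach the text
# describes) keeps the whole document.
destination = ctypes.create_string_buffer(len(document))
ctypes.memmove(destination, source, len(document))
copied = destination.raw
```

Here `as_c_string` holds only the first byte of the document, whereas `copied` holds all of it, which is why the byte count has to travel with the pointer.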
DBMS-specific encoding issues
Every DBMS handles UNICODE support differently:
In Oracle, the codepage for UNICODE columns can be set by the user (for example, UTF8 or AL16UTF16) at database creation time. Note that only Oracle 9.x strictly stores UNICODE strings in NCHAR/NVARCHAR columns. In Oracle 8.x, these column types can be set to any character set. In both releases, the character set is specified by the 'national character set' statement when creating the database.
UNICODE columns are stored in UCS2
UNICODE columns are stored in UTF8
DB2 and PostgreSQL have no UNICODE columns. It is therefore the user's responsibility to set the database's codepage to UNICODE if they plan to store XML documents in various codepages.
ArcSDE does not convert all data into UNICODE. It simply stores the data in varchar/clob columns, letting the DBMS convert the data to the database's codepage. The database administrator is responsible for making sure the XML documents and tag names are in a consistent codepage.
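The risk of letting text be converted to a codepage that cannot represent it can be checked ahead of time. The following is a rough sketch using Python's built-in codecs rather than any DBMS; the helper `survives_codepage` and the sample tag name are hypothetical, and Python's codec names stand in for whatever codepage names a given database uses.

```python
def survives_codepage(text: str, codepage: str) -> bool:
    """Return True if every character of text can be encoded in codepage."""
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


tag_name = "住所"  # hypothetical Japanese tag name

fits_unicode = survives_codepage(tag_name, "utf-8")
fits_japanese = survives_codepage(tag_name, "shift_jis")
fits_latin1 = survives_codepage(tag_name, "iso-8859-1")

# A lossy conversion replaces each unrepresentable character with "?".
lossy = tag_name.encode("iso-8859-1", errors="replace")
```

In this sketch the tag name fits a UNICODE or Japanese codepage but not ISO-8859-1, where the lossy conversion leaves only question marks. A check of this kind is one way an administrator might confirm that documents and tag names fit the database's codepage before loading them.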
Determining a document's encoding
ArcSDE determines the XML document's encoding from the document itself and not from the client's environment, as it does for regular data. In ArcSDE, XML documents are encoded in a character set that is independent of the client's locale.
There are three ways to determine a document's codepage.
- In the XML declaration that opens the document, the encoding can be specified by name.
For example: <?xml version="1.0" encoding="ISO-8859-1"?>
If an XML document has the encoding attribute, ArcSDE assumes the document is in that codepage. For a more thorough discussion of the encoding attribute, see the W3C's XML specification.
- If there is no encoding attribute, ArcSDE looks at the first 2 bytes of the document, searching for a BOM (Byte Order Mark). In the XML world, there are two
BOMs used:
- 0xFEFF: The document is encoded in UNICODE UTF-16/UCS-2, big endian
- 0xFFFE: The document is encoded in UNICODE UTF-16/UCS-2, little endian
- If there is no encoding attribute and no BOM, ArcSDE assumes that the document is in
UNICODE UTF-8.
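The three rules above can be expressed as a short function. This is a sketch of the decision order as described in the text, not ArcSDE's actual code: the function name `detect_encoding`, the regular expression, and the returned labels `UTF-16BE` and `UTF-16LE` are assumptions made for illustration.

```python
import re

# Matches an XML declaration written in an ASCII-compatible encoding and
# captures the value of its encoding attribute.
_DECLARATION = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_encoding(document: bytes) -> str:
    """Apply the three rules in order: declaration, BOM, then UTF-8."""
    # Rule 1: an encoding attribute in the XML declaration wins.
    match = _DECLARATION.match(document)
    if match:
        return match.group(1).decode("ascii")

    # Rule 2: otherwise look at the first 2 bytes for a byte order mark.
    bom = document[:2]
    if bom == b"\xfe\xff":
        return "UTF-16BE"  # big endian
    if bom == b"\xff\xfe":
        return "UTF-16LE"  # little endian

    # Rule 3: no attribute and no BOM, so assume UTF-8.
    return "UTF-8"


latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
big_endian_doc = b"\xfe\xff" + "<a/>".encode("utf-16-be")
plain_doc = b"<a/>"
```

With these samples, `detect_encoding(latin1_doc)` returns the declared name, `detect_encoding(big_endian_doc)` falls through to the BOM check, and `detect_encoding(plain_doc)` falls through to the UTF-8 default.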