Dr.Bob Examines... #107: Delphi 2009 Unicode

Delphi 2009 Unicode
In this article, I'll examine Delphi 2009 Unicode.

Unicode
Unicode is a standard for the definition and identification of characters and symbols in all written languages, by assigning a unique value to each character or symbol. The Unicode Consortium defines which number (code point) represents which character or symbol.
In 1991 we got Unicode version 1.0.0, which was extended to 1.0.1 and 1.1 according to the ISO-10646 standard. Originally, Unicode 1.1 was limited to 64K characters and symbols, which meant 2 bytes were enough to encode all characters. Unfortunately, 64K turned out to be insufficient to support all written languages in the world.
As a result, Unicode 2.0, from July 1996, extended the code space to the range 0 to $10FFFF (1,114,112 code points to be exact). At the time of writing, Unicode 5.0 is current, for which the Unicode Consortium has already defined 101,203 characters in the standard (so we've got space for roughly another 900,000 characters or symbols).

Unicode Transformation Formats
Unicode data (the code point values) can be presented in different formats, such as UTF-8, UTF-16 or UTF-32. We can also optionally compress Unicode data. UTF stands for Unicode Transformation Format, and each UTF defines the mapping from a code point to a unique series of bytes that represents this code point. So where Unicode itself defines the character or symbol that belongs to a code point, the UTF defines the physical representation (in a file on disk or in memory, for example).
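To make the difference between a code point and its physical representation a bit more tangible, here is a minimal sketch. It is written in Python rather than Delphi, purely because Python's standard codecs make the resulting bytes easy to inspect. The helper name `encoded_forms` is hypothetical, and the big-endian byte order is an assumption chosen to keep the output readable.

```python
def encoded_forms(code_point):
    """Return the bytes of one code point in each UTF, as hex strings."""
    ch = chr(code_point)

    def as_hex(data):
        # Format every byte as two lowercase hex digits, separated by spaces.
        return " ".join(f"{b:02x}" for b in data)

    return {
        "UTF-8": as_hex(ch.encode("utf-8")),
        # Big-endian variants are an assumption made for readability;
        # they also avoid the byte order mark the plain codecs would add.
        "UTF-16": as_hex(ch.encode("utf-16-be")),
        "UTF-32": as_hex(ch.encode("utf-32-be")),
    }


# One and the same code point ($20AC, the euro sign) in three formats.
forms = encoded_forms(0x20AC)
for name, hex_bytes in forms.items():
    print(f"{name:7} {hex_bytes}")
```

The code point stays the same in all three cases; only the series of bytes that represents it differs.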

UTF-8
Using UTF-8 we get between 1 and 4 bytes for each Unicode character. With this encoding we never know in advance how many (storage) bytes are needed to contain a string, although we can predict the minimum number of bytes: one per character, which is exactly what a 7-bit ASCII data stream needs.
The standard 7-bit ASCII characters are the same in UTF-8, which means there is a great level of compatibility between 'normal' characters. Apart from these standard ASCII characters, UTF-8 supports all of the roughly 1 million Unicode code points using a UTF-8 specific coding. UTF-8 is mainly used on the internet, for web pages for example (since it produces smaller files compared to the UTF-16 and UTF-32 formats).
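The variable length of UTF-8 is easy to observe. The sketch below (again Python, used only as an illustration) encodes four sample characters, picked here as assumptions to cover each possible length, and also checks that a plain ASCII string produces identical bytes in ASCII and UTF-8. The helper name `utf8_length` is hypothetical.

```python
# Sample characters chosen for illustration, one per UTF-8 length:
# 'A' (U+0041), e-acute (U+00E9), euro sign (U+20AC), musical G clef (U+1D11E).
samples = ["A", "\u00E9", "\u20AC", "\U0001D11E"]


def utf8_length(ch):
    """Number of bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


lengths = [utf8_length(ch) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} needs {n} byte(s) in UTF-8")

# A string with only 7-bit ASCII characters is byte-for-byte the same in UTF-8.
ascii_text = "Delphi 2009"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```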

UTF-16
Using UTF-16, we get 2 or 4 bytes for each Unicode character. This encoding is easier and faster to process than UTF-8, and compatible with UCS-2 (the original 2-byte character encoding of Unicode 1.x), which uses 2 bytes for each character, but is not enough for the full Unicode 2.x character set. Using UTF-16, the value ranges $D800-$DBFF and $DC00-$DFFF are used to specify so-called surrogate pairs. Using these surrogate pairs, we can map Unicode code points of $10000 and higher (in the range $10000 to $10FFFF). This is done by subtracting $10000 from the value, leaving a value in the range 0 to $FFFFF, which can be represented in 20 bits. These 20 bits are split in two halves of 10 bits each: the upper 10 bits are added to $D800, and the lower 10 bits to $DC00. So for the Unicode code point $1D11E the UTF-16 surrogate pair is calculated as follows: first subtract $10000, which leaves $D11E, which is 00001101000100011110 in 20 bits, split in $34 (0000110100) and $11E (0100011110). $34 is added to $D800, and $11E is added to $DC00, resulting in $D834 for the most significant (high) surrogate, and $DD1E for the least significant (low) surrogate. We'll get back to this special character example in a minute.
Note that the Unicode code points $D800 to $DFFF will not be assigned a valid character by the Unicode standard (to avoid problems with UTF-16), so the individual surrogate characters are never mapped to actual characters themselves (but should always be used as a pair).
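The surrogate pair calculation for $1D11E can be sketched in a few lines. This is an illustration in Python rather than Delphi code, and the function name `to_surrogate_pair` is hypothetical. The result is cross-checked against Python's own UTF-16 codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point of $10000 or higher into a UTF-16 surrogate pair."""
    # 0x10FFFF is the upper limit of the Unicode code space.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above $FFFF need a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print(f"${high:04X} ${low:04X}")

# Cross-check: the standard codec should produce the same two 16-bit units.
codec_bytes = chr(0x1D11E).encode("utf-16-be")
codec_units = (
    int.from_bytes(codec_bytes[0:2], "big"),
    int.from_bytes(codec_bytes[2:4], "big"),
)
print("matches codec:", codec_units == (high, low))
```

The shift by 10 and the mask with $3FF are simply the "split the 20 bits in two" step written as bit operations.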
The main disadvantage of UTF-16 is the fact that the 'normal' characters are also represented by 2 bytes instead of 1 byte (as with UTF-8), so any ANSI string will be at least twice as big (in storage space) using UTF-16, which is a bit of a waste for source code and data in databases that don't use many special characters. The UTF-16 format is used by Windows and Java, and is the default format used by Delphi 2009.

UTF-32
Using UTF-32 we always get exactly 4 bytes for each Unicode character. This is the easiest encoding, but also the one resulting in the largest storage space - four times as big as before for the standard 7-bit ASCII characters. The biggest disadvantage is therefore the fact that storage space is increased by a factor of four compared to ANSI data. UTF-32 is mainly used in the UNIX world.
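The difference in storage space between the formats can be measured directly. The following sketch (Python, for illustration only) encodes a short ASCII-only sample string, chosen here as an assumption, and counts the resulting bytes. The little-endian codec variants are used so that no byte order mark is added to the count.

```python
text = "Delphi 2009"  # 11 characters, all 7-bit ASCII

# Encode the same text in each format and record the size in bytes.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}

for encoding, size in sizes.items():
    print(f"{encoding:10} {size:3} bytes")
```

For text like this, UTF-8 costs nothing extra, UTF-16 doubles the size, and UTF-32 quadruples it.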

Windows API
Since Windows NT and 2000, the Windows API supports two sets of APIs: one for ANSI (A) and one for Unicode characters (with W for Wide, using UTF-16). It's important to realise that Windows 95, 98 or ME do not support Unicode, and as a result Delphi 2009 is not able to produce applications for Windows 95, 98 or ME (so it's really time to move to at least Windows 2000 or XP).
Delphi up to version 2007 is still using the ANSI version of the Windows API for the Win32 personality. The .NET side is different, since the .NET Framework itself supports Unicode (and so does Delphi for .NET). This has in fact been a big help, since the VCL for .NET was already Unicode enabled, which made sure the entire VCL was prepared for Unicode (before the work on Delphi 2009 was started).

Delphi 2009 and Unicode
Where previous versions of Delphi used a String type based on ANSI Character types of only 1 byte long, Delphi 2009 defines a new string type based on Unicode data, with WideChar elements of 2 bytes each. Delphi 2009 is fully Unicode based, and defines a new type called UnicodeString which is the new equivalent for the String type. Previously, String was synonymous with AnsiString (a type which is also still available, just like AnsiChar and PAnsiChar).
Delphi 2009 Character types are Char, AnsiChar and WideChar, where Char defaults to WideChar. In previous versions of Delphi, a Char would be equivalent to an AnsiChar. In order to ensure existing code compiles without changes in behaviour, change Char to AnsiChar (as well as PChar to PAnsiChar).
The most important Delphi 2009 String types are: UnicodeString, WideString, AnsiString, UTF8String (an AnsiString with UTF-8 encoding) and ShortString. The default String type is equivalent to UnicodeString, which consists of WideChar characters (like WideString), but is reference counted and memory managed by Delphi (instead of by Windows itself), and is therefore a lot faster than a WideString.

Delphi 2009 and Unicode Tips
On my weblog, I've (re)published a number of Delphi 2009 Unicode tips, originally taken from my Delphi 2009 Development Essentials.

More tips will follow, and will be collected here as well.

Delphi 2009 Development Essentials
This article is an excerpt from my Delphi 2009 Development Essentials courseware manual which has been sent in PDF format to all my clients, and which is sold at Lulu.com.