Unicode in 3B2

Summary

This chapter details the use of the Unicode Standard in 3B2: compliance, advantages and 3B2 specifics.


1 Introduction

To further enhance compatibility and connectivity, 3B2 supports Unicode. The Unicode standard specifies the representation of text in modern software products by providing a unique encoding for every character. The incorporation of the Unicode standard has enhanced 3B2's already comprehensive language support. As a result of Unicode compliance, 3B2 is able to offer language support for over 70 languages, including Chinese, Japanese and Korean.

2 The Unicode Standard

2.1	Definition

Unicode is a worldwide character encoding standard that represents all the characters in modern computer use, including technical symbols and special characters used in publishing. In this context, a character refers to the display of a glyph (letter, number or symbol such as the percentage sign) on a computer monitor or a command to the computer, for example a 'backspace.'

The following is a definition of the Unicode standard taken from the official Unicode web site, www.unicode.org:

'The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages.'

The Unicode standard is defined in the book 'The Unicode Standard'. For more information see the Further Information and References at the end of this document.

2.2	How many unique characters can you represent using Unicode?

The Unicode standard has approximately 91,000 characters defined. However, through the use of Surrogate Pairs it is possible to represent approximately 1 million unique characters using Unicode.
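The figure of approximately 1 million follows from the arithmetic of surrogate pairs. The sketch below is illustrative only and is not part of 3B2: it assumes the standard UTF-16 surrogate ranges, and the function name to_surrogate_pair is invented for this example.

```python
# Surrogate ranges defined for UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogates

high_count = HIGH_END - HIGH_START + 1   # 1024 high surrogates
low_count = LOW_END - LOW_START + 1      # 1024 low surrogates
supplementary = high_count * low_count   # code points reachable through pairs


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


print(supplementary)
# U+1D11E is MUSICAL SYMBOL G CLEF, a character outside the 16-bit range.
print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
```

Each pair combines one of 1024 high surrogates with one of 1024 low surrogates, which gives 1,048,576 additional code points.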

For a list of the primary scripts currently supported by the Unicode standard see www.unicode.org/onlinedat/languages-scripts.html

3 Encoding

3.1	Definition

Computers store characters by assigning a number for each one. In the past, there were different encoding systems for assigning these numbers. An encoding system specifies how character sets are mapped to numbers for storage and transmission. No single encoding contained enough characters for all the letters, punctuation, and technical symbols used throughout the world.

These encoding systems also conflicted with one another. That is, two encodings could use the same number for two different characters, or use different numbers for the same character. Computers need to support many different encodings, but whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
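The conflict can be seen with a single byte. The following sketch uses Python's built-in codecs purely as an illustration; the choice of encodings is arbitrary and has nothing to do with 3B2 itself.

```python
# One byte value, three different characters depending on the encoding assumed.
raw = bytes([0xE9])
as_latin1 = raw.decode("iso-8859-1")    # Western European
as_greek = raw.decode("iso-8859-7")     # Greek
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic

# One character, two different numbers depending on the encoding used.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("iso-8859-1")
in_cp437 = e_acute.encode("cp437")

for label, ch in [("iso-8859-1", as_latin1),
                  ("iso-8859-7", as_greek),
                  ("iso-8859-5", as_cyrillic)]:
    print(label, "U+%04X" % ord(ch))
print(in_latin1.hex(), in_cp437.hex())
```

If a file written in one of these encodings is read as another, every such byte silently becomes the wrong character, which is the corruption described above.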

As well as defining the identity of each character and its numeric value or code point, character encoding standards also define how this value is represented in bits.

3.2	Legacy encoding

There were many different encoding systems in the past including ASCII and ANSI. These two encodings are described below.

3.2.1 ASCII

ASCII stands for American Standard Code for Information Interchange. It is a character set that consists of 128 characters (0 to 127) which are standard on most computers.

The only characters used in ASCII for displaying text are those with the values 32 to 126. The other characters in the ASCII character set (0-31 and 127) are control characters: commands to the computer rather than displayable text.

The ASCII character set is now incorporated into the Unicode standard. ASCII can be found in the Hex range 0000…007F (Basic Latin).

3.2.2 ANSI

The fact that ASCII is a character set that is limited to 128 characters was fine for the USA and the UK. However, other languages require further characters including accented characters and punctuation. Due to this need, character sets were enlarged to 256 characters.

These additional characters are referred to in several different ways, including extended ASCII, high ASCII, ANSI (American National Standards Institute), extended characters or special characters.

The additional character slots not only provided the characters required for a wider range of languages, they also provided more characters for English text, such as — (em dash), ©, ®, and ™.

The ANSI character set is now incorporated into the Unicode standard. The additional characters provided by ANSI can be found in the range 0080…00FF (ISO8859-1/Latin 1).
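The way these legacy sets sit at the start of the Unicode range can be checked directly. This is a minimal sketch using Python's standard codecs, on the assumption that "ANSI" here means ISO8859-1/Latin 1 as stated above. Windows code page 1252, which is also often called ANSI, differs from Latin 1 in the range 80…9F.

```python
# Every ASCII byte decodes to the Unicode code point with the same value.
ascii_bytes = bytes(range(0x00, 0x80))
ascii_points = [ord(ch) for ch in ascii_bytes.decode("ascii")]

# The same holds for the upper half of ISO8859-1/Latin 1.
latin1_bytes = bytes(range(0x80, 0x100))
latin1_points = [ord(ch) for ch in latin1_bytes.decode("iso-8859-1")]

# The displayable ASCII characters are those with values 32 to 126.
printable = "".join(chr(value) for value in range(32, 127))

print(ascii_points == list(range(0x00, 0x80)))
print(latin1_points == list(range(0x80, 0x100)))
print(len(printable))
```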

3.3 ISO

ISO stands for the International Organization for Standardization. Among many other standards it defines character encoding standards.

Unicode is also sometimes referred to as ISO 10646. These are actually two different standards: ISO 10646 from ISO and Unicode from the Unicode consortium. However, they are very similar. The ISO Working Group responsible for ISO/IEC 10646 (JTC 1/SC 2/WG 2) and the Unicode Consortium decided to create one universal standard for coding multilingual text. The ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium have worked together to extend the standard and to keep their respective versions synchronised.

ASCII and ANSI are also ISO standards that form part of the Unicode standard.

3.3.1 The Advent ISO character set

The Advent ISO character set is based on the ISO8859-1/Latin 1 (ANSI) character set. The Advent ISO character set also has additional special characters defined for specific use within 3B2, such as tab, column break, page break, table cell and row break, and soft return.

3.4	Unicode Encoding

Unicode standardises character encoding so that when data is in Unicode, it can be sorted, searched, and manipulated without data corruption. This is because Unicode provides a unique number for every character, regardless of platform, program, or language.

The Unicode Standard defines several encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (in 7, 8, 16 or 32 bits per code unit). Three encodings are widely used, known as UTF-8, UTF-16 and UTF-32.

The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard. All languages can use any of the encodings. The encodings simply provide methods of file encoding so that information can be passed from one application to another.

UTF stands for UCS Transformation Format. UCS stands for Universal Character Set.
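As an illustration of how the three encoding forms represent the same characters, the sketch below encodes a few sample characters with Python's standard codecs and reports the number of bytes each form needs. The helper name encoded_sizes is invented for this example.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (len(ch.encode("utf-8")),
            len(ch.encode("utf-16-be")),   # explicit byte order, so no BOM is added
            len(ch.encode("utf-32-be")))


samples = ["A", "\u00e9", "\u672c", "\U0001D11E"]
for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-8 and UTF-16 are variable length, while UTF-32 always uses four bytes per character. The last sample lies above U+FFFF and therefore needs a surrogate pair (four bytes) in UTF-16.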

For more information see the Further Information and References at the end of this document.

4 Unicode in XML

It is possible to encode your XML documents in Unicode. All XML processors are required to understand two encodings: UTF-8 and UTF-16 (an alternative way of encoding Unicode characters using 16-bit code units).
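The following sketch shows the same small document stored in both encodings and read back by an XML processor. It uses Python's standard xml.etree.ElementTree module purely for illustration. The document content is invented for the example.

```python
import xml.etree.ElementTree as ET

# The same document, declared and stored in two different encodings.
document = '<?xml version="1.0" encoding="{enc}"?><title>\u672c</title>'

utf8_bytes = document.format(enc="UTF-8").encode("utf-8")
utf16_bytes = document.format(enc="UTF-16").encode("utf-16")  # starts with a byte order mark

title_from_utf8 = ET.fromstring(utf8_bytes).text
title_from_utf16 = ET.fromstring(utf16_bytes).text

print(len(utf8_bytes), len(utf16_bytes))
print(title_from_utf8 == title_from_utf16)
```

The stored bytes differ, but the processor recovers the same character data from both files.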

5 Unicode in Perl

From version 5.6 Perl has full UTF-8 support. Because Perl uses UTF-8 encoding and 3B2 uses UTF-16 encoding it is necessary for 3B2 to internally convert characters to UTF-8 to be manipulated in Perl. Once the manipulation has been performed, characters are converted back to UTF-16.
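The round trip described above can be sketched as follows. This is an illustration in Python rather than 3B2's or Perl's actual code. The function names and the choice of little-endian UTF-16 are assumptions made for the example.

```python
def utf16_to_utf8(data):
    """Convert UTF-16 (little endian, no BOM) bytes to UTF-8 bytes."""
    return data.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16(data):
    """Convert UTF-8 bytes back to UTF-16 (little endian, no BOM) bytes."""
    return data.decode("utf-8").encode("utf-16-le")


internal = "na\u00efve \u672c".encode("utf-16-le")  # text as held internally
for_script = utf16_to_utf8(internal)                # handed to the script
returned = utf8_to_utf16(for_script)                # converted back afterwards

print(len(internal), len(for_script))
print(returned == internal)
```

Both conversions are lossless, but each one is an extra pass over the data.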

This conversion means that Perl scripts take longer to complete in the Unicode version than in the 3B2 standard version.

For further references on Perl and Unicode see the Further Information and References at the end of this document.

6 Unicode in 3B2

6.1 Internal

Regardless of what encoding you use to supply your data, 3B2 always uses UTF-16 internally. It is possible to load a document encoded in UTF-32, but you will lose all of the characters that cannot be represented in a single 16-bit UTF-16 code unit (those above FFFF).

3B2 supports the current version of the Unicode standard. The difference between different versions is that more code points are defined in order to add more characters to the standard. The 3B2 system file unidata.3ad is based on Unicode version 3.2.

Surrogate pairs are currently unsupported in 3B2 Unicode. For more information about surrogate pairs see the Further Information and References at the end of this document.
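Before loading a UTF-32 file it can be useful to know whether it contains any characters that would need surrogate pairs. The sketch below is an external pre-check written in Python, not a 3B2 feature. The function name and sample data are invented for the example.

```python
def supplementary_characters(text):
    """Return the characters in text that lie above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Sample UTF-32 data: a Latin letter, a musical symbol above U+FFFF, and a CJK character.
utf32_data = "A\U0001D11E\u672c".encode("utf-32")
text = utf32_data.decode("utf-32")

at_risk = supplementary_characters(text)
print(["U+%04X" % ord(ch) for ch in at_risk])
```

Any character reported by such a check is one that requires a surrogate pair in UTF-16.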

For more information on 3B2 system files see 3B2 System Files_en.

6.2	The ICU Library

From version 7.98a, the ICU (International Components for Unicode) library has been added to 3B2 to handle the import and export of different file encodings, and to provide localized sorting. This library is produced by IBM. The ICU library supports approximately 200 different encodings / variations on encodings. For more information see http://oss.software.ibm.com/icu

6.3	Text Input and Output (tft / tsavetxt)

The functionality described in this section is only present in the Unicode version. In normal operation, the behaviour of tft and tsavetxt is unchanged, and 3B2 still handles everything in UTF-16 internally.

However, it is now possible to specify an optional encoding to use when reading and writing files. The dialogues for both macros now have an extra box to specify which encoding to use. Instead of adding a new parameter to the macros, the syntax for the filename has been extended: "{:encoding}filename" where encoding is the encoding to read from or write to. Areas can still be placed in the filename as normal, but the {:encoding} must appear at the start of the filename, otherwise it will not be recognised as an encoding and will be treated as part of the filename. If no encoding is specified, 3B2 will expect UTF-16 as it always has done.

For example, to save the text in strm0 in UTF-8, you can use:

tsavetxt "strm0","{:UTF8}{.}strm0.txt"

Then, to read it back in, you can use:

tft *,"strm1",0,"{:UTF8}{.}strm0.txt"
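The filename rule can be summarised in a few lines of code. The following Python sketch is a hypothetical model of the rule as described above, not 3B2's implementation. The function name split_encoding and the regular expression are assumptions made for illustration.

```python
import re

# "{:encoding}" is only recognised at the very start of the filename.
_PREFIX = re.compile(r"^\{:([^}]+)\}(.*)$", re.DOTALL)


def split_encoding(spec, default="UTF-16"):
    """Split a "{:encoding}filename" string into (encoding, filename)."""
    match = _PREFIX.match(spec)
    if match is None:
        # No leading prefix: the whole string is the filename.
        return default, spec
    return match.group(1), match.group(2)


print(split_encoding("{:UTF8}{.}strm0.txt"))
print(split_encoding("{.}strm0.txt"))
print(split_encoding("{.}{:UTF8}strm0.txt"))
```

In the third call the prefix is not at the start, so it stays part of the filename and the default encoding applies.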

If UTF-16 is specified, 3B2 will automatically detect which variant it is (big endian or little endian). If the encoding is not known, you will either be prompted for a correct encoding, or 3B2 will default to UTF-16 (if dialogues are not allowed).
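This document does not say how the byte order is detected. A common approach, assumed here, is to inspect the byte order mark (BOM) at the start of the file. The sketch below shows that approach in Python with an invented function name. It should not be read as a description of what 3B2 or ICU does internally.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return the UTF-16 variant indicated by a leading BOM, or None."""
    # Note: a UTF-32 little-endian BOM also begins with FF FE, so this check
    # is only meaningful when the data is already known to be UTF-16.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


big = codecs.BOM_UTF16_BE + "\u672c".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "\u672c".encode("utf-16-le")

print(detect_utf16_byte_order(big))
print(detect_utf16_byte_order(little))
print(detect_utf16_byte_order(b"no mark"))
```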

The Frame Contents (tftext) and Save Text Tag (tsavetxt) dialogue boxes only list the five encodings that are most likely to be required, but it is possible to type any encoding into the edit field if it is supported by the ICU library. ICU will not auto-detect encodings from the file, beyond working out whether the data is UTF-16LE or UTF-16BE when given UTF-16.

Note that there will be no warning if you pick an encoding in which most of your characters will be lost (e.g. saving Arabic UTF-16 text to iso-8859-1).
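Because no warning is given, it can be worth checking text against the target encoding before saving. The following Python sketch is an external check and not a 3B2 feature. The function name unencodable is invented for the example.

```python
def unencodable(text, encoding):
    """Return the characters in text that the given encoding cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


french = "caf\u00e9"
arabic = "\u0633\u0644\u0627\u0645"

print(len(unencodable(french, "iso-8859-1")))
print(len(unencodable(arabic, "iso-8859-1")))
```

The French sample fits entirely in iso-8859-1, whereas every character of the Arabic sample would be lost.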

6.3.1 XML Automatic mode

In addition to normal encodings, an automatic XML mode has been added. Instead of naming an encoding, use "{:xml-auto}filename".

When loading with tft, 3B2 checks whether the file is XML and looks for an encoding declaration. If one is present, that encoding is used; otherwise 3B2 assumes Unicode and tries that instead.
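The idea behind the automatic mode can be sketched as follows. This is a simplified Python model, not 3B2's algorithm: the function name, the regular expression and the UTF-16 fallback are assumptions, and the sketch only recognises declarations stored in an ASCII-compatible encoding.

```python
import codecs
import re

# Matches the encoding name inside a leading XML declaration.
_DECLARATION = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data, fallback="utf-16"):
    """Return the encoding declared in an XML byte string, or the fallback."""
    # Skip a UTF-8 byte order mark, if any, before looking for the declaration.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECLARATION.match(data)
    if match is None:
        return fallback   # assumed meaning of "assume Unicode"
    return match.group(1).decode("ascii")


print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding(b"<a/>"))
```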

6.4	Unicode character access

There are several different ways to use Unicode characters:

Using a character entity, for example:

Decimal: &#26412;

Hexadecimal: &#0x672c;
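The decimal and hexadecimal forms are simply two ways of writing the same code point. The sketch below converts between a character and both forms. It is written in Python for illustration, the function names are invented, and the hexadecimal form follows the "&#0x...;" pattern shown above.

```python
def decimal_entity(ch):
    """Return the decimal character entity for a single character."""
    return "&#%d;" % ord(ch)


def hexadecimal_entity(ch):
    """Return the hexadecimal character entity in the form used in this section."""
    return "&#0x%x;" % ord(ch)


def entity_to_char(entity):
    """Convert a decimal or hexadecimal character entity back to a character."""
    body = entity[2:-1]   # strip the leading "&#" and the trailing ";"
    if body.lower().startswith("0x"):
        return chr(int(body, 16))
    return chr(int(body))


ch = "\u672c"
print(decimal_entity(ch), hexadecimal_entity(ch))
print(entity_to_char("&#26412;") == entity_to_char("&#0x672c;"))
```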

It is possible to access the special characters contained in the Advent ISO character set in the Unicode standard. However, in 3B2 Unicode these characters are only available in the code range 0 - 32 and not in the range 255 - 272 as in the standard version of 3B2.

For more information about accessing special characters in 3B2 see 3B2 Character Sets.

6.5 Sorting

6.5.1 Index Sorting


From version 7.98l, the Unicode version has a new set of options to allow sorting of indexes based on locale.

It is still possible to sort using the traditional 3B2 method (using the heading and ordering maps), using locales to determine the sort order, or using both. Below are explanations of these methods:

The Index Control Stream dialogue box.

6.5.1.1 Sort Using {Index control stream keyword "usesort"}

If you select "3B2" {value 0 - the default}, then the index is sorted using the traditional 3B2 method. Any options in the "Locale Sorting" area are ignored.

If you select "Locale" {value 1}, then the index is sorted using the locale specified in the "Sorting Locale" text box. Any options in the 3B2 Sorting area are ignored.

If you select "Both" {value 2}, then the text will go through the head and order maps before being locale sorted.

Note: the default heading map (_head_norm) removes accents and converts characters to upper case before sorting, therefore using this map with locale sorting will probably not achieve the effect you desire. If one of the maps is specifically set to nothing, then that mapping stage will not be applied, regardless of whether 3B2 sorting is chosen or not (for example, if you want to use a letter order map but no heading map).
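To see why a map that removes accents and upper-cases text interferes with locale sorting, consider the sketch below. It is written in Python and only imitates the described effect of such a map. It is not the _head_norm map itself, and the function name is invented.

```python
import unicodedata


def strip_accents_and_upper(text):
    """Remove combining accents and convert to upper case."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.upper()


words = ["c\u00f4te", "cot\u00e9", "cote"]
mapped = [strip_accents_and_upper(word) for word in words]
print(mapped)
```

All three words become identical once mapped, so a locale-aware sort that runs afterwards no longer has the accent and case information it would use to order them.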

6.5.1.2 Sorting Locale {Index control stream keyword "locale"}

A list of the locales and the languages they relate to is available on the pop-up menu, or in string 1452. If you want to know what ICU considers the sort order for the different locales to be, see http://oss.software.ibm.com/icu/charts/collation

The control stream and dialogue must contain the abbreviated form shown in string 1452, i.e. use "kl" NOT "Kalaallisut". The default is "en" [English].

6.5.1.3 Locale case order {Index control stream keyword "loccase"}

None {value 0 - the default}	No specific case ordering is done above and beyond the standard language dependent order.
Upper {value 1}	Upper case letters are sorted first
Lower {value 2}	Lower case letters are sorted first

If the three new keywords (locale, usesort, loccase) are used in 3B2 standard version, they will be ignored and the traditional method of 3B2 sorting will be used. Any settings not applicable to the chosen sort (i.e. heading and ordering maps for locale sorting, and locale and case order for 3B2 sorting) will be ignored.

6.5.2 XSLT Sorting

Using <xsl:sort> in an XSLT transformation with libxml/libxslt will now sort correctly depending on the "lang" attribute provided.

For more information see the XSLT in 3B2 chapter.

6.6	Recommended Fonts

TrueType fonts, OpenType fonts or OpenType fonts with TrueType outlines are preferable for use in 3B2 Unicode. With OpenType fonts there are some options that can cause problems.

6.7	Display of missing characters

If a glyph is not in the font, the Unicode version will display either the missing glyph box (□), for new or TrueType fonts, or a question mark. 3B2 standard version fonts may not have a missing glyph symbol.

No other warning will appear if characters cannot be displayed. However, if a character cannot be found in a font, it is possible to use the tfdef attribute to specify a default font so that the correct characters will be displayed. For more information see tfdef.

6.8	Naming Conventions

All types of 3B2 objects, such as files, can be named using Unicode characters, providing the normal naming conventions are followed.

6.9	Differences between 3B2 Standard and 3B2 Unicode

It is important that you are aware of the differences between the standard and Unicode versions of 3B2. This is particularly important if you are switching to 3B2 Unicode.

6.9.1 Default Font

The default font is different in the Unicode version. The standard version uses Times (Monotype). The Unicode version uses Times (TrueType).

6.9.2 Edit Bar Font

The Edit Bar font is also different in the Unicode version of 3B2. The standard version uses a custom font, but the Unicode version uses a Microsoft system dialogue box font. The Edit Bar uses this font so that the font can be automatically updated if you install other fonts.

6.9.3 Embedded Fonts

Embedded fonts will not work in 3B2 Unicode. In the standard version of 3B2 embedded fonts enable you to incorporate the fonts that are used within 3B2 documents into the document itself.

6.9.4 Smart Fonts

At the time of writing it is not possible to use Smart Fonts in the Unicode version because they are not compliant with the Unicode encoding. For more information about Smart Fonts see www.originalab.se

6.9.5 The tfx command

The tfx command does not work in the 3B2 Unicode version.

In the standard version of 3B2, the tfx command sets the extension font to be used by the special extension font shift characters.

For more information see tfx

6.9.6 Ligatures and Kerning

Ligatures will work in 3B2 Unicode. However, ligatures that were set up in the standard version of 3B2 will need migrating to 3B2 Unicode.

Similarly, kerning will work in 3B2 Unicode. However, kerning that was set up in the standard version of 3B2 will need migrating to 3B2 Unicode.

6.9.7 Spell Checker

The Spell Checker is not available in the Unicode version.

6.9.8 Database Connectivity

There is no database connectivity in the Unicode version of 3B2.

6.9.9 Scripts and Showstrings

Showstrings that use the # macro may have unpredictable results.

6.9.10 Using Unicode characters in 3B2 Regular Expressions

It is not possible to use Unicode characters in regular expressions in the Unicode version. However, Perl regular expressions do support Unicode.

For more information on Perl and Unicode see Perl and Unicode

For more information on regular expressions see Regular Expressions

6.9.11 File Size

A document file created in the 3B2 Unicode version will be twice as large as the same file created in the 3B2 standard version. This is a result of every character taking up twice as much disk space and memory. It is therefore possible that the performance of 3B2 will be slower. To avoid this you may have to upgrade your machine.
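The doubling follows directly from storing each character in two bytes instead of one. The sketch below illustrates this for a short piece of text using Python's standard codecs. It models character storage only and makes no assumption about the rest of the 3B2 file format.

```python
text = "Unicode in 3B2"

one_byte_per_char = len(text.encode("iso-8859-1"))   # 8-bit storage
two_bytes_per_char = len(text.encode("utf-16-le"))   # 16-bit storage, no BOM

print(one_byte_per_char, two_bytes_per_char)
print(two_bytes_per_char / one_byte_per_char)
```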

6.10	Compatibility

6.10.1 Operating System Compatibility

The following Operating Systems support Unicode:

Windows XP

6.10.2 Platform Compatibility

The following versions of 3B2 are Unicode compliant:

3B2 Version	Unicode Compliant?
Windows	Yes
OCX	Yes
Linux/Solaris/HPUX (Unix)	No

7 Further Information and References

7.1	Internet Resources

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard. The Unicode Consortium website provides extensive information and resources for Unicode:

www.unicode.org/

The World Wide Web Consortium (W3C) Unicode information:

Describes Unicode in XML and other markup languages:

www.w3.org/TR/unicode-xml/

For further references on Perl and Unicode see:

Perldoc.com provides a complete and up-to-date repository of online Perl and CPAN module documentation:

www.perldoc.com/perl5.8.0/pod/perlunicode.html

7.2	Books/Literature

The Unicode Standard, Version 3.0. The Unicode Consortium. Addison Wesley.

Unicode Demystified. Richard Gillam. Addison Wesley.

Document created on 17-Sep-2002, last reviewed on 24-Apr-2003 (revision 0.2)
