|Unicode Support in Helix 7.0|
Helix 7.0 is the first version to feature support for Unicode text. Wikipedia defines Unicode as “a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems.”
For millennials (and those who wish they were) this means you can now add emojis to your text in Helix. To the rest of the world, it means native support for non-Roman character sets like Japanese, Russian, etc. There are hundreds of useful Unicode character sets, such as a multitude of arrows, pictographs, musical symbols, and much more.
Helix uses the macOS native Unicode support — it does not provide any additional features, nor does it offer any overrides for the standard behavior. The Unicode standard is constantly changing — new characters are being added, and sorting rules are being refined. It should not be a surprise to find that some Unicode characters do not display consistently from version to version of macOS, as older versions will have a more limited set of characters and use different sorting rules. It should also be expected that text may sort differently based on the version of macOS where the sorting is done.
In prior versions of Helix, text was stored using a Mac-specific encoding method, such as MacRoman. These methods were either limited to a small (less than 250) set of unique characters, or to a specific character set, such as Cyrillic. A major advantage of Unicode is that all character sets are supported at all times. You can freely mix characters from Chinese, English, etc. anywhere you enter text.
Classic versions of the Mac operating system used ‘international’ (itlx) resources to determine the text encoding used by the system. These resources — which are limited in scope — are being phased out by Apple. As of macOS 10.9, these resources no longer provide complete functionality, having been replaced by more sophisticated methods of ‘localizing’ the computer to the local language.*
* The use of the word ‘language’ here encompasses all aspects of a particular region, including date and time formats, currency symbols, etc.
One of the effects of Apple’s decision is that their old method of encoding non-Roman character sets (e.g. MacJapanese, MacHebrew) is no longer necessary. The Unicode standard includes all of the character sets in a single encoding method, eliminating the need to choose which language’s characters to use. With Unicode, Japanese, Hebrew, and every other major character set in the world are always available.
The Unicode character set is constantly expanding. Aside from the addition of new emoji — which is what grabs the headlines these days — version 9.0 (released June 21, 2016) added support for lesser-used languages such as Osage (Native American) and Fulani (African).
There is no doubt that Unicode will continue to be revised, but those revisions are typically only available in newer systems. Using the latest version of macOS provides the widest range of characters.
|Where Helix Uses Unicode||
Helix 7.0 uses Unicode in every place text is entered. This includes text and styled text fields, of course, but it also includes user passwords, icon names, comments, text entered into sockets in abacus tiles (such as for validation messages), control elements (checkboxes, radio buttons, popups), menu names, data input/output (I/O), AppleScript… anywhere text is found.
If you do not intend to read this entire page, be sure to read the Before You Update technote, which covers places where the switch to Unicode may require changes to your collections.
For the typical user, normal text entry is unaffected by this change. It is irrelevant whether text is stored as Unicode, MacRoman, or any other text encoding system: the user types characters on a keyboard* without having to consider the underlying technology involved. That is as it should be.
* Unicode characters can also be pasted, or entered using alternate input methods.
|Accessing Unicode Characters||
There are many ways to access Unicode characters, and a thorough discussion of these methods is beyond the scope of this technote. For those who want to experiment, the simplest way to access Unicode characters is to open the Keyboard panel of System Preferences, check the ‘Show Input menu in menu bar’ option (as shown at right) and close that window.
When that is done, the ‘Input menu’ appears in the menu bar. It appears as a country flag or as a small window containing a ⌘ symbol. From this menu, choose “Show Character Viewer” to open the floating “Character Viewer” palette. The first time it is opened, the Character Viewer appears in a compact form, as a scrolling list of thousands of characters. A search field can narrow the list, if you know the right term to enter to find what you are looking for. (Try finding the ⌘ symbol and you’ll probably come up empty.)
The upper right corner of the window contains a small button that, when clicked, expands the viewer to a larger, much more useful format. Even then, the full function of this window may not be evident: resizing the viewer hides and shows various functions depending on whether they fit within the current size. Experiment briefly, and you’ll quickly understand.
Either way, once the desired character has been found, double clicking it will insert it wherever the cursor is currently located in the active application.
The four images on the right show the steps in accessing and using the Character Viewer. Click the first one to expand it, then use the right arrow key to step through them one by one. (Use the left arrow key to step through in reverse.)
The balance of this technote is targeted to the collection designer, IT staff, or technical user who needs or wants to understand the details this change brings. There are some changes that will be visible to the astute observer; this technote provides insight into these changes.
|Unicode Support When Updating Existing Collections||
For all structural elements (that is, everything except field data) the conversion to Unicode is done automatically when the collection is first updated to Helix 7.0. For this reason, it is critical that collections that contain non-Roman characters (e.g. Japanese) follow the instructions in the Before You Update technote when updating a collection to Helix 7.
Passwords that contain ‘High-ASCII’ characters — essentially, those that are typed while holding the option key — cannot be converted from the old encoding to Unicode, because it is impossible for us to know what those characters are. For this reason it is critical to reset passwords that use these characters before updating a collection to Helix 7. The Before You Update technote contains more information about this, along with tools to help assess a collection’s use of non-convertible passwords before you update.
Views that use High-ASCII delimiters for data Input/Output are reset to the default I/O characters. Helix 7.0 does not support Unicode characters as I/O delimiters. See below for details.
|Updating Existing Data to Unicode||
In collections that use MacRoman text encoding, existing data (text and styled text fields) is not converted when the collection is updated, for reasons explained below. When field data is modified it may or may not be re-encoded as Unicode text. For the end user, this is totally transparent and should be of no concern.
When a field is updated and the record replaced, the data is re-encoded as Unicode if and only if any character in the text is not part of the original encoding method. For example, characters such as • and Ä, which are part of the MacRoman character set, continue to be stored using MacRoman encoding as long as no non-MacRoman characters are part of the text. When a character that is not part of the MacRoman character set — such as ⦿ — is found in a field, the entire field is re-encoded and stored using the Unicode equivalent encoding.
Conversely, when all characters that are not part of the original encoding method are deleted from a field, the updated field is re-encoded and stored as MacRoman.
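The encoding decision described in the last two paragraphs can be sketched in a few lines of Python, using the standard library’s ‘mac_roman’ codec as a stand-in for Helix’s internal check. The function name and the use of this codec are illustrative assumptions, not Helix’s actual implementation:

```python
def storage_encoding(text: str) -> str:
    """Pick the storage encoding the way the technote describes:
    keep MacRoman while every character fits, otherwise use Unicode."""
    try:
        text.encode("mac_roman")   # every character is in MacRoman
        return "MacRoman"
    except UnicodeEncodeError:     # at least one character is not
        return "Unicode"

print(storage_encoding("Bullet • and Ä fit MacRoman"))   # MacRoman
print(storage_encoding("Circled dot ⦿ does not"))        # Unicode
```

Deleting the ⦿ from the second string and saving again would flip the result back to MacRoman, matching the behavior described above.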
There is no visible indicator as to whether a particular field has been updated to use Unicode encoding, other than examining the characters in the text to see if any characters that are not part of the original encoding are present.
This method provides transparent operation for the end user, more efficient storage for fields that use only MacRoman characters, and avoids the potentially significant delay that would be involved in updating all of the text and styled text field data in a collection.
Collections that previously used alternative encodings, such as MacJapanese, may encounter another behavior, where most data in fields is converted to Unicode when modified. This should be transparent to the end user, but please contact our technical support department if unexpected results occur.
|Indexing Unicode Values||
Extended Roman characters (such as Ä and é) that are common in European languages continue to sort as they have in Helix 6.2.4, so the following discussion should not be a significant issue for most users.
Indexes in Helix do not yet fully support Unicode: although characters that are part of the MacRoman character set continue to sort as they do in Helix 6.2.4, all other Unicode characters are unsorted, and appear together near the beginning of an index, after most symbols (such as * and ≥ shown here) but before currency symbols, numbers and letters.
The image on the right shows a sample of characters in an unsorted list (entry order) and as a list (right side) with an ascending index on the displayed field. (The list uses the ‘Down, then across’ option of the ‘Repeat direction’ property to show more data in a compact format.)
By comparing the two lists, you can see that all of the Unicode characters — such as ⑤ and the non-Roman versions of ‘hello’ — appear together near the beginning of the list in an apparently random order. The order is actually based first on the internal length of the Unicode characters, then on the order in which they were entered into the database.
In the example on the right, ☏☏☏ comes before 👤👤 because the telephone is the single code unit U+260F internally, while the silhouette is the surrogate pair U+D83D U+DC64. The total internal length of three telephones is 6 bytes, and the total internal length of two silhouettes is 8 bytes.
Helix is currently using the ‘lozenge’ character (◊) as a placeholder for Unicode characters and, as seen in this chart from technote R7708, that is the sort position of the lozenge. Therefore, the two examples above would be indexed as ◊◊◊◊◊◊ and ◊◊◊◊◊◊◊◊ respectively (one lozenge per byte), causing them to sort in the order shown.
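As an illustration of the index math above, this Python sketch builds the placeholder string for a piece of text: one lozenge per UTF-16 byte of each non-MacRoman character. The function and its use of the ‘mac_roman’ and UTF-16 codecs are assumptions for demonstration; Helix’s actual index internals are not published:

```python
def lozenge_key(text: str) -> str:
    """Sketch of the placeholder indexing described above: each
    character outside MacRoman is replaced by one lozenge (◊) per
    byte of its UTF-16 representation."""
    out = []
    for ch in text:
        try:
            ch.encode("mac_roman")
            out.append(ch)                  # MacRoman characters sort normally
        except UnicodeEncodeError:
            size = len(ch.encode("utf-16-be"))  # 2 bytes, or 4 for a surrogate pair
            out.append("◊" * size)
    return "".join(out)

print(lozenge_key("☏☏☏"))   # ◊◊◊◊◊◊    (3 characters × 2 bytes)
print(lozenge_key("👤👤"))   # ◊◊◊◊◊◊◊◊  (2 characters × 4 bytes)
```

Sorting these keys as ordinary strings puts the shorter telephone key ahead of the longer silhouette key, reproducing the order seen in the sample list.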
|Comparing Unicode Values||
Fortunately, none of the above discussion has any bearing on string comparisons. The equivalency for alphabetic characters such as a and å continues to come from this chart from technote R7708. Other Unicode characters are compared using macOS internal comparisons, so characters such as ⑤ and ⑥ are not equal in any equivalency test (=, ≤, starts with, word equals, etc.).
For the technically minded, Helix uses the Core Foundation (CF) string comparison API with these options: kCFCompareCaseInsensitive, kCFCompareNonliteral, kCFCompareLocalized, kCFCompareDiacriticInsensitive, kCFCompareWidthInsensitive.
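The effect of those CF options can be approximated (not reproduced exactly) with Unicode normalization and case folding. Note that NFKC folds more than CF’s width-insensitive option does (it also folds circled numbers, ligatures, etc.), so this Python sketch is only a rough analog of the comparison, not the CF call itself:

```python
import unicodedata

def roughly_equal(a: str, b: str) -> bool:
    """Rough analog of the CF options listed above: case-insensitive
    (casefold), diacritic-insensitive (strip combining marks after NFD),
    width-insensitive (NFKC folds fullwidth forms)."""
    def fold(s: str) -> str:
        s = unicodedata.normalize("NFKC", s).casefold()
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    return fold(a) == fold(b)

print(roughly_equal("Åström", "astrom"))  # True: case and diacritics ignored
print(roughly_equal("Ａ", "a"))           # True: fullwidth form folded
print(roughly_equal("⑤", "⑥"))           # False: distinct characters
```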
|Unicode and Word Searches (Keywords)||
The HKWT table is obsolete in Helix 7.0. Existing applications that use a customized HKWT table will need to be tested to make sure they behave as expected. For the majority of our users, this discussion is irrelevant.
The technote on the Helix ‘Keyword’ property describes how keywords work in conjunction with the HKWT (Helix KeyWord Table) resource to allow collection-level customization of which characters constitute word delimiters. While a customizable separator table based on MacRoman encoding (with its 256 character limit) was workable, Unicode Standard v9.0 has 128,172 characters. Creating a customizable word delimiter table for such a large (and growing) number of characters would be unmanageable for everybody concerned.
macOS provides an API for separating (‘tokenizing’) the words in a field. Using this API eliminates the ability to customize the word delimiters, but the advantages — particularly for non-US users — are too great to ignore.
Therefore, Helix 7.0 uses the tokenizer provided by macOS to separate words in a field, as specified by the default language chosen in the “Language and Region” System Preference panel.
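The shift from a fixed delimiter table to property-based word boundaries can be illustrated with a rough Python stand-in; this is not the macOS tokenizer Helix uses, just a demonstration that Unicode-aware word matching needs no per-collection table because word characters are defined by the standard itself:

```python
import re

def keywords(text: str) -> list[str]:
    """Rough stand-in for a Unicode-aware tokenizer: \\w matches word
    characters from any script, so no 256-entry delimiter table is
    needed. (A real locale-aware tokenizer also splits CJK text into
    words, which this simple pattern does not attempt.)"""
    return re.findall(r"\w+", text)

print(keywords("Hello, world! こんにちは 123"))
# ['Hello', 'world', 'こんにちは', '123']
```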
|Unicode and Data Import/Export||
When importing and exporting text, Helix has historically allowed users to choose any ASCII character from 0–255 as start characters, field delimiters, and record delimiters. The switch to Unicode means that the ‘High-ASCII’ characters (those with an ASCII value of 128–255) can no longer be used as delimiters.
When a collection is updated to Helix 7.0, any view delimiters that are in this range are reset to the default value. Those defaults are:
A utility script available on our public scripts page can be used to check a collection for views that will be affected by this change. The script must be run prior to updating a collection to Helix 7.0 in order to identify such views before they are updated.
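The core of such a check is simple. This hypothetical Python version (not the actual utility script) just tests whether a delimiter’s character code falls in the 7-bit ASCII range:

```python
def delimiter_ok(ch: str) -> bool:
    """A delimiter survives the update only if it is a 7-bit ASCII
    character (code 0-127); 'High-ASCII' codes 128-255 are reset to
    the defaults when the collection is updated to Helix 7.0."""
    return len(ch) == 1 and ord(ch) < 128

print(delimiter_ok("\t"))   # True: tab is plain ASCII
print(delimiter_ok("¶"))    # False: High-ASCII, will be reset
```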
|ASCII Character 0 (NUL)||
Indirectly related to adding Unicode support is the handling of ASCII character 0, sometimes referred to as the NUL character. Modern programming languages (C and its derivatives) use ASCII 0 as a signal that the end of a text has been reached, so it is not allowed within a string. Prior to Helix 7, it was possible to embed ASCII 0 characters in strings, and although it is impossible to type this character on a keyboard, some collection designers have created structure that takes advantage of this character.
Helix 7 no longer supports the use of ASCII 0 (NUL) in strings. As such, there may be collections that need to be revised with this in mind. The typical visual indicator that an ASCII 0 character is present is text that is cut off before its end.
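The symptom is easy to demonstrate. In this Python sketch (the function is illustrative, not part of Helix), a NUL-terminated API displays only the text before the embedded ASCII 0, which is exactly the “cut off” appearance described above:

```python
def visible_part(text: str) -> str:
    """What a NUL-terminated (C-style) API would display: everything
    before the first embedded ASCII 0 character."""
    return text.split("\x00", 1)[0]

data = "first half\x00second half"
print(visible_part(data))   # first half  (text appears cut off early)
print("\x00" in data)       # True  (this field needs to be cleaned up)
```

A simple containment test like the last line is one way to locate affected fields before revising a collection.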
Helix 7 does still support the use of ASCII 0 in certain places, such as being a choice for I/O delimiters, but collection designers should not rely on it remaining a valid option in the future.
|Private Use Area and Character Presentation||
The Unicode specification provides the definition for characters, not the presentation. In other words, each provider of application and system software is free to provide their own interpretation for any given character. However, it is in the common interest of all parties that they adhere to the Design Guidelines provided.
The Unicode specification also provides for a “Private Use Area” which is where application and font specific characters are found. The symbols found in most dingbat fonts (Hoefler Text Ornaments, Wingdings, etc.) are characters defined in the Private Use Area, and occupy shared space. When using Unicode characters from the Private Use Area, care must be taken to specify the correct font for the data rectangle, or to use styled text in order to display the character in the desired font.
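When diagnosing such a character, it can help to test whether its code point is in a Private Use Area. Python’s unicodedata module makes this a one-line check (the function name here is my own):

```python
import unicodedata

def in_private_use_area(ch: str) -> bool:
    """True when ch is a Private Use Area character: U+E000-U+F8FF in
    the BMP, or the supplementary private use planes 15 and 16. The
    general category "Co" covers exactly these code points."""
    return unicodedata.category(ch) == "Co"

print(in_private_use_area("\uE001"))  # True: BMP private use character
print(in_private_use_area("◊"))       # False: an ordinary symbol
```

A True result means the glyph you see depends entirely on the font, so the same character can look completely different (or blank) elsewhere.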
If a Unicode character does not appear as expected after adding it to Helix, it is most likely a Private Use Area character from a different font. It may also have been copied from a source that provides application specific presentation for some characters.
Helix does not provide any Private Use Area characters, nor does it alter the presentation of any characters from that provided by macOS.
In Helix Client/Server, most operations discussed here — indexing, searching, etc. — are done by the Server, not the Client. To ensure that all users experience consistent behavior with Unicode characters, our recommendation is to make sure the version of macOS running on the Server is as new as — or newer than — that of the Clients connecting to it.
Unicode and NSStrings in macOS: a great overview and background history, mixed with technical info for programmers.
Unicode Hex Input explains how to enter Unicode characters from the keyboard via the option (alt) key.