Sorting and Collation Explained

Collation is one of most tricky questions I've met in solving different customer's problems. I want to summarize some points and provide a clear explanation.

Collation is the way to place different symbols and/or symbol groups in some unique order allowing you to apply the definite comparative rule 'less than or equal to' between words (words in general, not individual chars) in order to set or predict the position for an old or new word in that ordered set. Basically, the order of sorting words.

Historically, the comparing of binary ASCII values was the original way to set the comparative rule for a symbols. Under DOS, it was enough to compare char values in order to decide where it have been placed in datasheet. Word comparing is performed symbol by symbol after that.

It is used widely in Clipper and in early versions of Fox. And this old practice forced many Windows programmers to save non-linguistic context in Character fields. Later Clipper International releases allowed the ability to use different European languages and set the way to sort symbols linguistically (== alphabetically) rather than based on their ASCII values. So, the comparative rule 'less than or equal to' became grounded on the char's index in the so called _collation_table_.

As for now, symbols are compared not by their value, but by their indexes in the _collation_table_. Usually the reality brings more complicated problems. You can see the CollationExample with Spanish words sorted rather than individual letters. The same is true for a number of other languages. An additional problem with collation is that indexing on individual letters is not enough -- entire words must be taken into context in order to be sorted. There are even different approaches to sort symbol groups. Duden and ETEC supported by Apollo are examples of different group sorting rules. It expands additionally the _collation_table_ term.

Another problem is Characters from the upper half of the ASCII table can have different values in DOS and Win32/Win64. So the collation_tables_ is different for DOS and Windows.

All of these listed aspects of collation have generated a new problem: the problem of compatibility between applications. Or it's more correct to say, the problem of sharing indexes between applications. In xBase, the comparative rule is applied not to data directly, but to the key set, determined after an xBase index expression is calculated for records. The key is saved in the index tree. The index tree structure is organized according to the same 'less than or equal to' rule. Because the index tree data structure depends on the comparative rule, it means this rule must be:

  • a) the same for all applications which are using the index simultaneously
  • b) the same for index creation time moment and for a latest moments of seek /delete/append operations in index tree

This is the main point for collation. We need to use same collation in all applications in order to share indexes. By the way, most DBMS' working in client/server mode never have to worry about collation. This collation issue is purely a "file-server" mode specific problem. With Apollo version 6.x and later, you can create your own collation table in order to build 'less than or equal to' rules used by SDE for both DOS and Windows charsets. As mentioned above, you can use ETEC and Duden in order to compare word groups. It leads to a problem of identification of collation rules used in indexes. The main one, what to do if an index was created in a one application (or even apllication version) and then used by another application.

Actually the issue is caused by a lack of DBF specifications, which suggests nothing about collation in indexes. In according to MSDN the official FoxPro table specification sets a byte 'Code page mark' in DBF header, but unfortunately it says nothing about the collation rules used. Additionally, it seems there are another two bytes reserved in order to mark collation but this is my guess and cannot be confirmed. An official specification of the Fox format referred to these bytes as "reserved" and asked Visual FoxPro users to place respective operators into a config file.

Both Clipper and the latest version of Visual FoxPro use their own "suggestions" about collation in indexes. Seems Clipper did nothing at all in order to write external reference for information about used collation. It was assumed the use of the International Edition of Clipper has different libraries for different languages. Apollo uses the same core engine DLLs for all languages, therefore having hundreds of language-oriented DLLs is clearly not the solution.

The Apollo database engine orients collation onto the current locale, which is set by Windows. Apollo users must think about collation synchronization manually, if their application shares indexes on the net or with a DOS application.