Is That Really Sorted?

Is That Really Sorted?

Sometimes it's the little things that can drive you to madness. I frankly admit it - many times I was looking for complicated errors and it was an insignificant detail I had ignored. This occasionally tends to happen when sorting data.

Es wurde kein Alt-Text für dieses Bild angegeben.







Here it seems to be quite obvious, which order would result, if you sort in ascending order.

Es wurde kein Alt-Text für dieses Bild angegeben.







However, this result is somewhat surprising. I would not have expected to find the "9" at the end, behind "1111" or "42", for example. 9 is certainly the smaller number, isn't it?

If there are already such strange results with numbers, what might it look like with letters? 

Es wurde kein Alt-Text für dieses Bild angegeben.







It was almost to be expected that the results would look like this ... first all uppercase letters, then the lowercase letters, then all with accent.

Es wurde kein Alt-Text für dieses Bild angegeben.







Why these strange results? Now, in the numbers-only example, there are two influencing factors (or ways to get the expected result).

One reason is quite simple - just because all values are numeric the datatype itself will not be numeric.

Es wurde kein Alt-Text für dieses Bild angegeben.

The data is loaded from a csv file using the Input Data Tool, and so all fields have a string data type at first. If we change the data type to "double", the sorting will also look completely different:

Es wurde kein Alt-Text für dieses Bild angegeben.







The second reason is related to a setting in the Sort tool, where you can choose between two fundamentally different approaches:

  1. Dictionary Sort Order
  2. Unicode Sort Order

As so often, you get exactly what you wanted. The Unicode sort order sorts according to the order of the characters in the Unicode table. First the digits, then the upper case letters, then the lower case letters, finally letters with accent. By the way, ">", "<" and colon would be sorted between numbers and capital letters, the comma before the numbers.

Es wurde kein Alt-Text für dieses Bild angegeben.

This is exactly the order we have chosen here - "Use Dictionary Order" is not selected, so Unicode sort order is used. In general, all values are sorted character by character (it is sorted first by the first character, then by the second, and so on, regardless of whether it is numeric or string data). 

The same setting was effective for the string data, and the result was an order according to the Unicode table.

If we change this (i.e. select the "Use Dictionary Order" option) it will be sorted by letter - first lowercase, then uppercase, then the same order for the letter with accent).

Es wurde kein Alt-Text für dieses Bild angegeben.

If you select the dictionary order, you can additionally set the exact language (i.e. take into account the rules in the respective country). In some cases, it even goes beyond that - in Germany, there are two different standards, which are especially about the classification of German letters with accent ("Ä" = "A" or "Ä" = "AE").

So you should keep in mind that if you choose a "local" version, there may be some special rules that might cause an unexpected result. Most problems can be avoided if you pay attention to the data type and set the sorting method (Unicode/Dictionary) correctly.

To view or add a comment, sign in

More articles by Roland Schubert

  • Time For Summary

    When I receive a new table or file, I always try to first get an overview. Usually by bringing an INPUT DATA tool onto…

  • A Little Bit More: Oversampling

    When selecting data records, it is often simply a matter of selecting only the first (or last) data from a table…

    1 Comment
  • Building Groups Based on Relations

    Grouping is not necessarily a very unusual task - customer groups always come to my mind spontaneously. Common…

  • Different Types of Correlation

    I have to admit it - I intuitively tend to look for relationships between different data. And indeed, I often recognize…

  • Grouping Data

    Grouping data in some way is an essential part of day-to-day business for data analysts. Many people immediately think…

  • Comparing To Lists

    Long years ago, a DIY chain in Germany launched a discount campaign entitled "20% discount on all items - except pet…

  • Famous (or Not-So-Famous) Last Words

    Sometimes you just have to have the last word - the last word from a text field, of course. When it comes to "breaking…

    1 Comment
  • Break on Error

    An error has occurred in a workflow and it continues to run anyway? Sometimes that's all right, but only sometimes…

  • Year-To-Date Calculations

    If you are working in Finance/FP&A/Controlling, calculating "Year to Date" (YTD) values is an essential part of your…

  • Compare Date and Time

    We often need data only for a specific period of time - a year, a month or a week, sometimes just a few hours, but the…

Others also viewed

Explore content categories