How to Perform Natural Sort in Elasticsearch


by Thomas Tran



This article serves as a guide on how to perform natural sort in Elasticsearch. So what is natural sort?

Natural sort order is an ordering of alphanumeric strings that is more human-friendly than the machine-oriented alphabetical sort order. For instance, in alphabetical sort, the string “q15” is sorted before “q2” because strings are compared character by character and “1” is smaller than “2”. In natural sorting, “q2” comes before “q15” because the number 2 is smaller than 15.
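To make the difference concrete, here is a minimal Python sketch of the idea. It is only an illustration and not how Elasticsearch implements it: the helper name natural_key is hypothetical, and it simply splits a string into digit and non-digit runs so that digit runs can be compared as integers.

```python
import re

def natural_key(text):
    """Split text into digit and non-digit runs; digit runs become integers."""
    # With a capturing group, re.split returns non-digit and digit runs
    # alternately, so the digit runs always sit at odd indices.
    parts = re.split(r"(\d+)", text)
    # Tag each part so integers and strings are never compared with each other.
    return [(0, int(p)) if i % 2 else (1, p) for i, p in enumerate(parts) if p]

values = ["q15", "q2", "q1"]
alphabetical = sorted(values)              # compares character by character
natural = sorted(values, key=natural_key)  # compares digit runs as numbers

print(alphabetical)  # ['q1', 'q15', 'q2']
print(natural)       # ['q1', 'q2', 'q15']
```

This toy key only handles the numeric part of natural sorting. It does nothing about case, diacritics, or language-specific rules.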

Natural sort order is subjective and depends on the language. For instance, many languages ignore capital letters and diacritics while others treat them separately.

Most modern implementations of natural sort order rely on the Unicode Collation Algorithm (UCA), which specifies how two Unicode strings compare to each other. This process of comparing strings and determining which string comes “before” or “after” the other is referred to as collation. A Collation Element Table specifies how characters relate to one another. For instance, the Default Unicode Collation Element Table (DUCET) specifies that the letter “a” comes before the letter “b”.

The UCA takes an input Unicode string and a Collation Element Table and produces a sort key, which is an array of unsigned 16-bit integers that can also be combined into a single binary value.
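The following Python sketch shows the general shape of that process under heavy simplification. The table TOY_TABLE and the helper names are assumptions made up for illustration. The real UCA uses multi-level weights and far more elaborate rules, so treat this as a picture of "string plus table gives sort key", not as the algorithm itself.

```python
import struct

# Toy collation element table (an assumption for illustration, not DUCET):
# each character maps to a single 16-bit weight, and case is ignored.
TOY_TABLE = {"a": 1, "A": 1, "b": 2, "B": 2, "c": 3, "C": 3}

def sort_key(text):
    """Return the sort key as a list of unsigned 16-bit weights."""
    return [TOY_TABLE[ch] for ch in text]

def binary_sort_key(text):
    """Pack the weights into big-endian bytes so keys compare bytewise."""
    weights = sort_key(text)
    return struct.pack(f">{len(weights)}H", *weights)

words = ["cab", "abc", "Bca"]
print(sorted(words))                       # ['Bca', 'abc', 'cab']
print(sorted(words, key=binary_sort_key))  # ['abc', 'Bca', 'cab']
```

Because the weights are packed big-endian, comparing the binary keys byte by byte gives the same order as comparing the lists of integers.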

In Elasticsearch, an implementation of the UCA is available through a plugin called “ICU Analysis plugin”. This plugin can be installed using the following command:

sudo bin/elasticsearch-plugin install analysis-icu

The official documentation states that this plugin must be installed on every node in the cluster, and each node must be restarted after installation.

The “ICU Analysis plugin” provides a field type called icu_collation_keyword, which encodes the input value directly into bytes that act as the sort key. You can then refer to this field directly when sorting.

The following request creates an index with a field called “StoreLocation” and a multi-field called “sort” that can be accessed via “StoreLocation.sort”. The multi-field “sort” stores the sort key, which lets Elasticsearch sort the values in natural order.

PUT /test-index/
{
  "mappings": {
    "properties": {
      "StoreLocation": {
        "type": "text",
        "fields": {
          "sort": {
            "type": "icu_collation_keyword",
            "index": false,
            "numeric": true,
            "strength": "tertiary",
            "language": "en",
            "country": "US"
          }
        }
      }
    }
  }
}

The field StoreLocation.sort is an icu_collation_keyword field that preserves the value as a single token and applies the “numeric” setting, which sorts digits according to their numeric value. Without the “numeric” setting, the value “bread-22” is sorted before “bread-9”. With the “numeric” setting, “bread-9” is sorted before “bread-22”. This numeric collation is what differentiates a “natural” sort from an “alphabetic” sort, at least in English.
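Once documents are indexed, a search can sort on the multi-field. The Python sketch below builds such a request body and, optionally, sends it using only the standard library. The helper names build_sorted_search and run_search are hypothetical, and the cluster URL in the final comment is an assumption: a secured cluster would also need HTTPS and credentials, which this sketch leaves out.

```python
import json
import urllib.request

def build_sorted_search(sort_field, order="asc"):
    """Build a search body that matches every document and sorts on one field."""
    return {
        "query": {"match_all": {}},
        "sort": [{sort_field: {"order": order}}],
    }

def run_search(base_url, index, body):
    """POST the body to the index's _search endpoint and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Sort on the icu_collation_keyword multi-field from the mapping above.
body = build_sorted_search("StoreLocation.sort")
print(json.dumps(body, indent=2))

# With a reachable, unsecured cluster (the URL is an assumption):
# hits = run_search("http://localhost:9200", "test-index", body)["hits"]["hits"]
```

The same body can be pasted into a console request against /test-index/_search. Only the “sort” clause matters for natural ordering, and the query part can be replaced with any other query.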

Other languages have other rules for collation. In those cases, make sure to set the “language” and “country” properties to fit your locale.