The collation feature allows specifying the sort order and character
classification behavior of data per-column, or even per-operation.
This alleviates the restriction that the
LC_COLLATE
and LC_CTYPE
settings
of a database cannot be changed after its creation.
Conceptually, every expression of a collatable data type has a
collation. (The built-in collatable data types are
text
, varchar
, and char
.
User-defined base types can also be marked collatable, and of course
a domain over a collatable data type is collatable.) If the
expression is a column reference, the collation of the expression is the
defined collation of the column. If the expression is a constant, the
collation is the default collation of the data type of the
constant. The collation of a more complex expression is derived
from the collations of its inputs, as described below.
The collation of an expression can be the “default” collation, which means the locale settings defined for the database. It is also possible for an expression's collation to be indeterminate. In such cases, ordering operations and other operations that need to know the collation will fail.
When the database system has to perform an ordering or a character
classification, it uses the collation of the input expression. This
happens, for example, with ORDER BY
clauses
and function or operator calls such as <
.
The collation to apply for an ORDER BY
clause
is simply the collation of the sort key. The collation to apply for a
function or operator call is derived from the arguments, as described
below. In addition to comparison operators, collations are taken into
account by functions that convert between lower and upper case
letters, such as lower
, upper
, and
initcap
; by pattern matching operators; and by
to_char
and related functions.
For a function or operator call, the collation that is derived by examining the argument collations is used at run time for performing the specified operation. If the result of the function or operator call is of a collatable data type, the collation is also used at parse time as the defined collation of the function or operator expression, in case there is a surrounding expression that requires knowledge of its collation.
The collation derivation of an expression can be
implicit or explicit. This distinction affects how collations are
combined when multiple different collations appear in an
expression. An explicit collation derivation occurs when a
COLLATE
clause is used; all other collation
derivations are implicit. When multiple collations need to be
combined, for example in a function call, the following rules are
used:
If any input expression has an explicit collation derivation, then all explicitly derived collations among the input expressions must be the same, otherwise an error is raised. If any explicitly derived collation is present, that is the result of the collation combination.
Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation is present, that is the result of the collation combination. Otherwise, the result is the default collation.
If there are conflicting non-default implicit collations among the input expressions, then the combination is deemed to have indeterminate collation. This is not an error condition unless the particular function being invoked requires knowledge of the collation it should apply. If it does, an error will be raised at run-time.
For example, consider this table definition:
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES", ... );
Then in
SELECT a < 'foo' FROM test1;
the <
comparison is performed according to
de_DE
rules, because the expression combines an
implicitly derived collation with the default collation. But in
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;
the comparison is performed using fr_FR
rules,
because the explicit collation derivation overrides the implicit one.
Furthermore, given
SELECT a < b FROM test1;
the parser cannot determine which collation to apply, since the
a
and b
columns have conflicting
implicit collations. Since the <
operator
does need to know which collation to use, this will result in an
error. The error can be resolved by attaching an explicit collation
specifier to either input expression, thus:
SELECT a < b COLLATE "de_DE" FROM test1;
or equivalently
SELECT a COLLATE "de_DE" < b FROM test1;
On the other hand, the structurally similar case
SELECT a || b FROM test1;
does not result in an error, because the ||
operator
does not care about collations: its result is the same regardless
of the collation.
The collation assigned to a function or operator's combined input expressions is also considered to apply to the function or operator's result, if the function or operator delivers a result of a collatable data type. So, in
SELECT * FROM test1 ORDER BY a || 'foo';
the ordering will be done according to de_DE
rules.
But this query:
SELECT * FROM test1 ORDER BY a || b;
results in an error, because even though the ||
operator
doesn't need to know a collation, the ORDER BY
clause does.
As before, the conflict can be resolved with an explicit collation
specifier:
SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
A collation is an SQL schema object that maps an SQL name to locales
provided by libraries installed in the operating system. A collation
definition has a provider that specifies which
library supplies the locale data. One standard provider name
is libc
, which uses the locales provided by the
operating system C library. These are the locales that most tools
provided by the operating system use. Another provider
is icu
, which uses the external
ICU library. Support for ICU has to be
configured when PostgreSQL is built.
A collation object provided by libc
maps to a
combination of LC_COLLATE
and LC_CTYPE
settings. (As
the name would suggest, the main purpose of a collation is to set
LC_COLLATE
, which controls the sort order. But
it is rarely necessary in practice to have an
LC_CTYPE
setting that is different from
LC_COLLATE
, so it is more convenient to collect
these under one concept than to create another infrastructure for
setting LC_CTYPE
per expression.) Also,
a libc
collation
is tied to a character set encoding (see Section 23.3).
The same collation name may exist for different encodings.
A collation provided by icu
maps to a named collator
provided by the ICU library. ICU does not support
separate “collate” and “ctype” settings, so they
are always the same. Also, ICU collations are independent of the
encoding, so there is always only one ICU collation for a given name in a
database.
On all platforms, the collations named default
,
C
, and POSIX
are available. Additional
collations may be available depending on operating system support.
The default
collation selects the LC_COLLATE
and LC_CTYPE
values specified at database creation time.
The C
and POSIX
collations both specify
“traditional C” behavior, in which only the ASCII letters
“A
” through “Z
”
are treated as letters, and sorting is done strictly by character
code byte values.
Additionally, the SQL standard collation name ucs_basic
is available for encoding UTF8
. It is equivalent
to C
and sorts by Unicode code point.
If the operating system provides support for using multiple locales
within a single program (newlocale
and related functions),
or support for ICU is configured,
then when a database cluster is initialized, initdb
populates the system catalog pg_collation
with
collations based on all the locales it finds on the operating
system at the time.
To inspect the currently available locales, use the query SELECT
* FROM pg_collation
, or the command \dOS+
in psql.
For example, the operating system might
provide a locale named de_DE.utf8
.
initdb
would then create a collation named
de_DE.utf8
for encoding UTF8
that has both LC_COLLATE
and
LC_CTYPE
set to de_DE.utf8
.
It will also create a collation with the .utf8
tag stripped off the name. So you could also use the collation
under the name de_DE
, which is less cumbersome
to write and makes the name less encoding-dependent. Note that,
nevertheless, the initial set of collation names is
platform-dependent.
The default set of collations provided by libc
map
directly to the locales installed in the operating system, which can be
listed using the command locale -a
. In case
a libc
collation is needed that has different values
for LC_COLLATE
and LC_CTYPE
, or new
locales are installed in the operating system after the database system
was initialized, then a new collation may be created using
the CREATE COLLATION command.
Within any particular database, only collations that use that
database's encoding are of interest. Other entries in
pg_collation
are ignored. Thus, a stripped collation
name such as de_DE
can be considered unique
within a given database even though it would not be unique globally.
Use of the stripped collation names is recommended, since it will
make one less thing you need to change if you decide to change to
another database encoding. Note however that the default
,
C
, and POSIX
collations, as well as all collations
provided by ICU can be used regardless of the database encoding.
PostgreSQL considers distinct collation objects to be incompatible even when they have identical properties. Thus for example,
SELECT a COLLATE "C" < b COLLATE "POSIX" FROM test1;
will draw an error even though the C
and POSIX
collations have identical behaviors. Mixing stripped and non-stripped
collation names is therefore not recommended.
Collations provided by ICU are created with names in BCP 47 language tag
format, with a “private use”
extension -x-icu
appended, to distinguish them from
libc locales. So de-x-icu
would be an example.
With ICU, it is not sensible to enumerate all possible locale names. ICU
uses a particular naming system for locales, but there are many more ways
to name a locale than there are actually distinct locales. (In fact, any
string will be accepted as a locale name.)
See http://userguide.icu-project.org/locale for
information on ICU locale naming. initdb
uses the ICU
APIs to extract a set of locales with distinct collation rules to populate
the initial set of collations. Here are some examples collations that
might be created:
de-x-icu
German collation, default variant
de-u-co-phonebk-x-icu
German collation, phone book variant
de-AT-x-icu
German collation for Austria, default variant
(Note that as of this writing, there is no,
say, de-DE-x-icu
or de-CH-x-icu
,
because those are equivalent to de-x-icu
.)
de-AT-u-co-phonebk-x-icu
German collation for Austria, phone book variant
und-x-icu
(for “undefined”)ICU “root” collation. Use this to get a reasonable language-agnostic sort order.
Some (less frequently used) encodings are not supported by ICU. If the database cluster was initialized with such an encoding, no ICU collations will be predefined.
The command CREATE COLLATION can also be used to create a new collation from an existing collation, which can be useful to be able to use operating-system-independent collation names in applications, create compatibility names, or use an ICU-provided collation under a more readable name. For example:
CREATE COLLATION german FROM "de_DE"; CREATE COLLATION french FROM "fr-x-icu"; CREATE COLLATION "de-DE-x-icu" FROM "de-x-icu";
The standard and predefined collations are in the
schema pg_catalog
, like all predefined objects.
User-defined collations should be created in user schemas. This also
ensures that they are saved by pg_dump
.