Things to remember when working with CrateDB are:
- CrateDB is a distributed database written in Java, where individual nodes form a database cluster, using a shared-nothing architecture.
- CrateDB brings together fundamental components to manage big data after the Hadoop and Spark batch-processing era, in the same spirit as Teradata, BigQuery, and Snowflake.
- Clients can connect to CrateDB using HTTP or the PostgreSQL wire protocol.
- The default TCP ports of CrateDB are 4200 for the HTTP interface and 5432 for the PostgreSQL interface.
- The language of choice after connecting to CrateDB is SQL, largely compatible with PostgreSQL's SQL dialect.
- The data storage layer is based on Lucene; the data distribution layer was inspired by Elasticsearch.
- Storage concepts of CrateDB include partitioning and sharding to manage data sets larger than a single machine can hold.
- CrateDB Cloud offers a managed option for running CrateDB and provides additional features like automated backups, data ingest / ETL utilities, or scheduling recurrent jobs.
- Get started with CrateDB Cloud at `https://console.cratedb.cloud`.
- CrateDB also provides an option to run it on your own premises, ideally by using its Docker/OCI image `docker.io/crate`. Nightly images are available via `docker.io/crate/crate:nightly`.

.. image:: docs/_static/crate-logo.svg
:alt: CrateDB
:target: https://cratedb.com
----
.. image:: https://github.com/crate/crate/actions/workflows/main.yml/badge.svg
:target: https://github.com/crate/crate/actions?query=workflow%3A%22CrateDB+SQL%22
.. image:: https://img.shields.io/badge/docs-latest-brightgreen.svg
:target: https://cratedb.com/docs/crate/reference/en/latest/
.. image:: https://img.shields.io/badge/container-docker-green.svg
:target: https://hub.docker.com/_/crate/
|
About
=====
CrateDB is a distributed SQL database that makes it simple to store and analyze
massive amounts of data in real-time.
CrateDB offers the `benefits`_ of an SQL database *and* the scalability and
flexibility typically associated with NoSQL databases. Modest CrateDB clusters
can ingest tens of thousands of records per second without breaking a
sweat. You can run ad-hoc queries using `standard SQL`_. CrateDB's blazing-fast
distributed query execution engine parallelizes query workloads across the
whole cluster.
CrateDB is well suited to `containerization`_ and can be `scaled horizontally`_
with `no shared state`_ using ephemeral virtual machines (e.g., on `Kubernetes`_,
`AWS`_, or `Azure`_). You can deploy and run CrateDB on any sort of network,
from personal computers to `multi-region hybrid clouds and the edge`_.
Features
========
- Use `standard SQL`_ via the `PostgreSQL wire protocol`_ or an `HTTP API`_.
- Dynamic table schemas and queryable objects provide
document-oriented features in addition to the relational features of SQL.
- Support for time-series data, real-time full-text search, geospatial data
types and search capabilities.
- Horizontally scalable, highly available and fault-tolerant clusters that run
very well in virtualized and containerized environments.
- Extremely fast distributed query execution.
- Auto-partitioning, auto-sharding, and auto-replication.
- Self-healing and auto-rebalancing.
- `User-defined functions`_ (UDFs) can be used to extend the functionality of CrateDB.
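
For example, a single table can combine relational columns, dynamic objects,
and a full-text index; a minimal sketch (table and column names are
illustrative, not from an official example):

.. code-block:: sql

    CREATE TABLE sensor_readings (
        ts TIMESTAMP WITH TIME ZONE,
        sensor_id TEXT,
        payload OBJECT(DYNAMIC),              -- schema can evolve with the data
        notes TEXT,
        INDEX notes_ft USING FULLTEXT (notes) -- real-time full-text search
    );

    -- query a nested object field and the full-text index together
    SELECT ts, payload['temperature']
    FROM sensor_readings
    WHERE MATCH(notes_ft, 'overheating');
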
Screenshots
===========
CrateDB provides an `Admin UI`_:
.. image:: crate-admin.gif
:alt: Screenshots of the CrateDB Admin UI
Try CrateDB
===========
Run CrateDB via the official `Docker Image`_:
.. code-block:: console
sh$ docker run --publish 4200:4200 --publish 5432:5432 --env CRATE_HEAP_SIZE=1g crate '-Cdiscovery.type=single-node'
Or visit the `installation documentation`_ to see all the available download and
install options.
Once you're up and running, head over to the `introductory docs`_. To interact
with CrateDB, you can use the Admin UI `sql console`_ or the `CrateDB shell`_
CLI tool. Alternatively, review the list of recommended `clients and tools`_
that work with CrateDB.
For container-specific documentation, check out the `CrateDB on Docker how-to
guide`_ or the `CrateDB on Kubernetes how-to guide`_.
Contributing
============
This project is primarily maintained by `Crate.io`_, but we welcome community
contributions!
See the `developer docs`_ and the `contribution docs`_ for more information.
Security
========
The CrateDB team and community take security bugs seriously. We appreciate your
efforts to `responsibly disclose`_ your findings, and will make every effort to
acknowledge your contributions.
If you think you discovered a security flaw, please follow the guidelines at
`SECURITY.md`_.
Help
====
Looking for more help?
- Try one of our `beginner tutorials`_, `how-to guides`_, or consult the
`reference manual`_.
- Check out our `support channels`_.
- `Crate.io`_ also offers `CrateDB Cloud`_, a fully-managed *CrateDB Database
as a Service* (DBaaS). The `CrateDB Cloud Tutorials`_ will get you started.
.. _Admin UI: https://cratedb.com/docs/crate/admin-ui/
.. _AWS: https://cratedb.com/docs/crate/tutorials/en/latest/cloud/aws/index.html
.. _Azure: https://cratedb.com/docs/crate/tutorials/en/latest/cloud/azure/index.html
.. _beginner tutorials: https://cratedb.com/docs/crate/tutorials/
.. _benefits: https://cratedb.com/product#compare
.. _clients and tools: https://cratedb.com/docs/crate/clients-tools/
.. _containerization: https://cratedb.com/docs/crate/tutorials/en/latest/containers/docker.html
.. _contribution docs: CONTRIBUTING.rst
.. _Crate.io: https://cratedb.com/company/team
.. _CrateDB clients and tools: https://cratedb.com/docs/crate/clients-tools/
.. _CrateDB Cloud Tutorials: https://cratedb.com/docs/cloud/
.. _CrateDB Cloud: https://cratedb.com/product/pricing
.. _CrateDB on Docker how-to guide: https://cratedb.com/docs/crate/tutorials/en/latest/containers/docker.html
.. _CrateDB on Kubernetes how-to guide: https://cratedb.com/docs/crate/tutorials/en/latest/containers/kubernetes/index.html
.. _CrateDB shell: https://cratedb.com/docs/crate/crash/
.. _developer docs: devs/docs/index.rst
.. _Docker image: https://hub.docker.com/_/crate/
.. _document-oriented: https://en.wikipedia.org/wiki/Document-oriented_database
.. _Dynamic table schemas: https://cratedb.com/docs/crate/reference/en/master/general/ddl/column-policy.html
.. _fulltext search: https://cratedb.com/docs/crate/reference/en/latest/general/dql/fulltext.html
.. _geospatial features: https://cratedb.com/docs/crate/reference/en/master/general/dql/geo.html
.. _how-to guides: https://cratedb.com/docs/crate/howtos/
.. _HTTP API: https://cratedb.com/docs/crate/reference/en/latest/interfaces/http.html
.. _installation documentation: https://cratedb.com/docs/crate/tutorials/en/latest/basic/index.html
.. _introductory docs: https://cratedb.com/docs/crate/tutorials/
.. _Kubernetes: https://cratedb.com/docs/crate/tutorials/en/latest/containers/kubernetes/index.html
.. _multi-region hybrid clouds and the edge: https://cratedb.com/docs/cloud/en/latest/tutorials/edge/index.html
.. _no shared state: https://en.wikipedia.org/wiki/Shared-nothing_architecture
.. _PostgreSQL wire protocol: https://cratedb.com/docs/crate/reference/en/latest/interfaces/postgres.html
.. _queryable objects: https://cratedb.com/docs/crate/reference/en/master/general/dql/selects.html#container-data-types
.. _reference manual: https://cratedb.com/docs/crate/reference/
.. _relational: https://en.wikipedia.org/wiki/Relational_model
.. _responsibly disclose: https://en.wikipedia.org/wiki/Coordinated_vulnerability_disclosure
.. _scaled horizontally: https://stackoverflow.com/questions/11707879/difference-between-scaling-horizontally-and-vertically-for-databases
.. _SECURITY.md: https://github.com/crate/crate/blob/master/SECURITY.md
.. _sql console: https://cratedb.com/docs/crate/admin-ui/en/latest/console.html#sql-console
.. _standard SQL: https://cratedb.com/docs/crate/reference/en/latest/sql/index.html
.. _support channels: https://cratedb.com/support
.. _time-series data: https://cratedb.com/docs/crate/howtos/en/latest/getting-started/normalize-intervals.html
.. _user-defined functions: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html
```{toctree}
:hidden:
overview/index
start/index
```
```{toctree}
:hidden:
:caption: Build
ingest/index
connect/index
integrate/index
feature/index
```
```{toctree}
:hidden:
:caption: Operations
install/index
admin/index
performance/index
```
(index)=
# Welcome to CrateDB
CrateDB is a fully open-source **distributed SQL database** designed for
**real-time analytics, search and AI** at scale. Whether you are working with
time series data, full-text search, or large volumes of structured and
semi-structured data, CrateDB gives you the **power of SQL**, the **scalability
of NoSQL**, and the **flexibility of a modern data platform**.
## Is CrateDB right for me?
Learn about CrateDB's features, use cases, and capabilities.
:::::{grid} 1 1 3 3
:gutter: 2
:padding: 0
::::{grid-item-card} {material-outlined}`info;1.5em` Product Overview
:link: https://cratedb.com/database
:link-type: url
:link-alt: CrateDB Product Overview
Learn what CrateDB is and what it can do for you.
::::
::::{grid-item-card} {material-outlined}`stars;1.5em` Feature Overview
:link: all-features
:link-type: ref
:link-alt: All CrateDB Features
Explore CrateDB's complete feature set at a glance.
::::
::::{grid-item-card} {material-outlined}`rocket_launch;1.5em` Use Cases
:link: solutions
:link-type: ref
:link-alt: CrateDB Use Cases
Discover how CrateDB solves real-world problems.
::::
:::::
## New to CrateDB?
::::{grid}
:::{grid-item-card} {material-outlined}`arrow_circle_right;1.5em` Get Started
:link: getting-started
:link-type: ref
:link-alt: Get started
:class-title: sd-fs-5
Start your free Cloud or self-hosted cluster and learn through simple tutorials
or comprehensive courses.
```{button-ref} getting-started
:color: primary
:expand:
**Get Started →**
```
:::
::::
## Quick links
:::::{grid} 2 2 3 3
:gutter: 2
:padding: 0
::::{grid-item-card} {material-outlined}`link;1.5em` Connect
:link: connect
:link-type: ref
:link-alt: Connect to CrateDB
Database drivers, libraries, and client adapters.
::::
::::{grid-item-card} {material-outlined}`upload;1.5em` Ingest
:link: ingest
:link-type: ref
:link-alt: Data Ingestion
Methods for importing and loading data into CrateDB.
::::
::::{grid-item-card} {material-outlined}`hub;1.5em` Integrate
:link: integrate
:link-type: ref
:link-alt: CrateDB Integrations
Third-party tools, data pipelines, and frameworks.
::::
::::{grid-item-card} {material-outlined}`settings;1.5em` Admin
:link: administration
:link-type: ref
:link-alt: Database Administration
Deploy, monitor, maintain, and optimize clusters.
::::
::::{grid-item-card} {material-outlined}`menu_book;1.5em` Reference
:link: crate-reference:index
:link-type: ref
:link-alt: CrateDB Reference Manual
Complete SQL syntax, functions, and API reference.
::::
:::::
:::::{admonition} Need help?
:class: tip
::::{grid} 1 2 2 2
:gutter: 3
:::{grid-item-card}
:link: https://community.cratedb.com/
:class-header: sd-text-center
{material-outlined}`groups;2em` **Community**
^^^
Join our Community Forum to ask questions and connect with other CrateDB
users.
:::
:::{grid-item-card}
:link: https://cratedb.com/contact/
:class-header: sd-text-center
{material-outlined}`support;2em` **Support**
^^^
Contact our support team for assistance with your CrateDB deployment.
:::
::::
:::::
:::{admonition} CrateDB is open-source
:class: tip
**Join our community of contributors!** CrateDB is open-source software
licensed under the Apache License 2.0. We appreciate contributions from
everyone.
**Improve the documentation:**
- Use the feedback widget (top right) for quick feedback, or open a PR for any page
**Contribute to CrateDB:**
- Report bugs or request features for CrateDB on
[GitHub](https://github.com/crate/crate/issues)
- Explore our other [open-source projects](https://github.com/crate)
:::

.. _index:
=================
CrateDB Reference
=================
CrateDB is a distributed SQL database that makes it simple to store and analyze
massive amounts of data in real-time.
.. NOTE::
This resource assumes you know the basics. If not, check out the
`Tutorials`_ section for beginner material.
.. SEEALSO::
CrateDB is an open source project and is `hosted on GitHub`_.
.. rubric:: Table of contents
.. toctree::
:maxdepth: 2
concepts/index
cli-tools
config/index
general/index
admin/index
sql/index
interfaces/index
appendices/index
.. _Tutorials: https://cratedb.com/docs/crate/tutorials/en/latest/
.. _hosted on GitHub: https://github.com/crate/crate

.. _concept-clustering:
==========
Clustering
==========
The aim of this document is to describe, on a high level, how the distributed
SQL database CrateDB uses a shared nothing architecture to form highly
available, resilient database clusters with minimal configuration effort.
It lays out the core concepts of the shared nothing architecture at the
heart of CrateDB. The main difference to a `primary-secondary architecture`_ is
that every node in the CrateDB cluster can perform every operation; hence all
nodes are equal in terms of functionality (see
:ref:`concept-node-components`) and are configured the same.
.. _concept-node-components:
Components of a CrateDB Node
============================
To understand how a CrateDB cluster works it makes sense to first take a look
at the components of an individual node of the cluster.
.. _figure_1:
.. figure:: interconnected-crate-nodes.png
:align: center
Figure 1
Multiple interconnected instances of CrateDB form a single database cluster.
The components of each node are equal.
:ref:`figure_1` shows that in CrateDB each node of a cluster contains the same
components that (a) interface with each other, (b) with the same component from
a different node and/or (c) with the outside world. These four major components
are: SQL Handler, Job Execution Service, Cluster State Service, and Data
Storage.
SQL Handler
-----------
The SQL Handler part of a node is responsible for three aspects:
(a) handling incoming client requests,
(b) parsing and analyzing the SQL statement from the request and
(c) creating an execution plan based on the analyzed statement
(`abstract syntax tree`_)
The SQL Handler is the only one of the four components that interfaces with
the "outside world". CrateDB supports three protocols to handle client requests:
(a) HTTP
(b) a Binary Transport Protocol
(c) the PostgreSQL Wire Protocol
A typical request contains a SQL statement and its corresponding arguments.
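
For illustration, a request typically carries a parameterized statement such
as the following, with the argument values travelling alongside it in the
same request (a sketch; the ``articles`` table is hypothetical):

.. code-block:: SQL

    -- the placeholder is bound to an argument value
    -- that is sent together with the statement
    SELECT id, title FROM articles WHERE id = ?
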
Job Execution Service
---------------------
The Job Execution Service is responsible for the execution of a plan ("job").
The phases of the job and the resulting operations are already defined in the
execution plan. A job usually consists of multiple operations that are
distributed via the Transport Protocol to the involved nodes, be it the local
node and/or one or multiple remote nodes. Jobs maintain IDs of their individual
operations. This allows CrateDB to "track" (or for example "kill") distributed
queries.
Cluster State Service
---------------------
The three main functions of the Cluster State Service are:
(a) cluster state management,
(b) election of the master node and
(c) node discovery, thus being the main component for cluster building (as
described in section :ref:`concept-clusters`).
It communicates using the Binary Transport Protocol.
Data storage
------------
The data storage component handles operations to store and retrieve data from
disk based on the execution plan.
In CrateDB, the data stored in the tables is sharded, meaning that tables are
divided and (usually) stored across multiple nodes. Each shard is a separate
Lucene index that is stored physically on the filesystem. Reads and writes are
operating on a shard level.
.. _concept-clusters:
Multi-node setup: Clusters
==========================
A CrateDB cluster is a set of two or more CrateDB instances (referred to as
*nodes*) running on different hosts which form a single, distributed database.
For inter-node communication, CrateDB uses a software-specific transport
protocol that utilizes byte-serialized Plain Old Java Objects (`POJOs`_) and
operates on a separate port. That so-called "transport port" must be open and
reachable from all nodes in the cluster.
Cluster state management
------------------------
The cluster state is versioned and all nodes in a cluster keep a copy of the
latest cluster state. However, only a single node in the cluster -- the
*master node* -- is allowed to change the state at runtime.
Settings, metadata, and routing
................................
The cluster state contains all necessary meta information to maintain the
cluster and coordinate operations:
* Global cluster settings
* Discovered nodes and their status
* Schemas of tables
* The status and location of primary and replica shards
When the master node updates the cluster state it will publish the new state to all
nodes in the cluster and wait for all nodes to respond before processing
the next update.
.. _concept-master-election:
Master Node Election
--------------------
In a CrateDB cluster there can only be one master node at any single time.
The cluster only becomes available to serve requests once a master has been
elected, and a new election takes place if the current master node becomes
unavailable.
By default, all nodes are master-eligible, but
:ref:`a node setting `
is available to indicate, if desired, that a node must not take on the role
of master.
To elect a master among the eligible nodes, a majority
(``floor(n / 2) + 1``), also known as a *quorum*, is required among a subset of
all master-eligible nodes; this subset of nodes is known as the
*voting configuration*.
The *voting configuration* is a list which is persisted as part of the cluster
state. It is maintained automatically in a way that ensures `split-brain`_
scenarios cannot occur.
Every time a node joins the cluster, or leaves the cluster, even if it is
for a few seconds, CrateDB re-evaluates the voting configuration.
If the new number of master-eligible nodes in the cluster is odd, CrateDB will
put them all in the voting configuration.
If the number is even, CrateDB will exclude one of the master-eligible nodes
from the voting configuration.
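
Putting these two rules together, the resulting quorum sizes work out as
follows (a worked sketch of the arithmetic):

.. code-block:: mathematica

    quorum = floor(n / 2) + 1

    n = 5 (odd)   =>  5 nodes in the voting configuration, quorum = 3
    n = 4 (even)  =>  3 nodes in the voting configuration, quorum = 2
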
The voting configuration is not shrunk below 3 nodes, meaning that if there
were 3 nodes in the voting configuration and one of them becomes unavailable,
they all stay in the voting configuration and a quorum of 2 nodes is still
required.
A master node rescinds its role if it cannot contact a quorum of nodes from
the latest voting configuration.
.. WARNING::
If you do infrastructure maintenance, note that as nodes are shut down or
rebooted, they temporarily leave the voting configuration, and for the
cluster to elect a master a quorum is required among the nodes that were
last in the voting configuration.
For instance, if you have a 5-node cluster with all nodes master-eligible,
node 1 is currently the master, and you shut down node 5, then node 4, then
node 3, the cluster will stay available because the voting configuration
will have adapted to contain only nodes 1, 2, and 3.
If you then shut down one more node, the cluster becomes unavailable, as a
quorum of 2 nodes is now required but no longer available.
To bring the cluster back online at this point, you need two nodes among
1, 2, and 3. Bringing back only nodes 4 and 5 will not be sufficient.
.. _concept-discovery:
Discovery
---------
The process of finding, adding and removing nodes is done in the discovery
module.
.. _figure_2:
.. figure:: discovery-process.png
:align: center
Figure 2
Phases of the node discovery process. n1 and n2 already form a cluster where
n1 is the elected master node, n3 joins the cluster. The cluster state
update happens in parallel!
Node discovery happens in multiple steps:
* CrateDB requires a list of potential host addresses for other CrateDB nodes
when it is starting up. That list can either be provided by a static
configuration or can be dynamically generated, for example by fetching DNS
SRV records, querying the Amazon EC2 API, and so on.
* All potential host addresses are pinged. Nodes which receive the request
respond with information about the cluster they belong to, the current
master node, and their own node name.
* Now that the node knows the master node, it sends a join request. The
master verifies the incoming request and adds the new node to the cluster
state, which then contains the complete list of all nodes in the cluster.
* The cluster state is then published across the cluster. This guarantees the
common knowledge of the node addition.
.. CAUTION::
If a node is started without any :ref:`initial_master_nodes
` or a :ref:`discovery_type `
set to ``single-node`` (e.g., the default configuration), it will never join
a cluster even if the configuration is subsequently changed.
It is possible to force the node to forget its current cluster state by
using the :ref:`cli-crate-node` CLI tool. However, be aware that this may
result in data loss.
Networking
----------
In a CrateDB cluster all nodes have a direct link to all other nodes; this is
known as a `full mesh`_ topology. For reasons of simplicity, every node
maintains a one-way connection to every other node in the network. The network
topology of a 5-node cluster looks like this:
.. _figure_3:
.. figure:: mesh-network-topology.png
:align: center
:width: 50%
Figure 3
Network topology of a 5 node CrateDB cluster. Each line represents a one-way
connection.
The advantages of a fully connected network are that it provides a high degree
of reliability and the paths between nodes are the shortest possible. However,
there are limitations in the size of such networked applications because the
number of connections (c) grows quadratically with the number of nodes (n):
.. code-block:: mathematica
c = n * (n - 1)
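
For the 5-node cluster shown in :ref:`figure_3`, this yields:

.. code-block:: mathematica

    c = 5 * (5 - 1) = 20
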
Cluster behavior
================
The fact that each CrateDB node in a cluster is equal allows applications and
users to connect to any node and get the same response for the same operations.
As already described in section :ref:`concept-node-components`, the SQL
handler is responsible for handling incoming client SQL requests, either
using HTTP or the PostgreSQL wire protocol.
The "handler node" that
accepts the client request also returns the response to the client. It
neither redirects nor delegates the request to a different node. The handler
node parses the incoming request into a syntax tree, analyzes it and creates
an execution plan locally. Then the operations of the plan are executed in a
distributed manner. The upstream of the final phase of the execution is always
the handler node, which then returns the response to the client.
Application use case
====================
In a conventional setup of an application using a primary-secondary database the
deployed stack looks similar to this:
.. _figure_4:
.. figure:: conventional-deployment.png
:align: center
Figure 4
Conventional deployment of an application-database stack.
However, this setup does not scale because all application servers use
the same, single entry point to the database for writes (the application can
still read from secondaries), and if that entry point is unavailable the
complete stack is broken.
Choosing a shared nothing architecture allows DevOps teams to deploy their
applications in an "elastic" manner without a single point of failure. The
idea is to extend the shared nothing architecture from the database to the
application, which in most cases is stateless already.
.. _figure_5:
.. figure:: shared-nothing-deployment.png
:align: center
Figure 5
Elastic deployment making use of the shared nothing architecture.
If you deploy an instance of CrateDB together with every application server,
you can dynamically scale your database backend up and down depending on your
needs. The application only needs to communicate with its "bound" CrateDB
instance on localhost. The load balancer tracks the health of the hosts, and
if either the application or the database on a single host fails, the complete
host is taken out of the load balancing.
.. _primary-secondary architecture: https://en.wikipedia.org/wiki/Master/slave_(technology)
.. _abstract syntax tree: https://en.wikipedia.org/wiki/Abstract_syntax_tree
.. _POJOs: https://en.wikipedia.org/wiki/Plain_Old_Java_Object
.. _full mesh: https://en.wikipedia.org/wiki/Network_topology#Mesh
.. _split-brain: https://en.wikipedia.org/wiki/Split-brain_(computing)

.. _concept-joins:
=====
Joins
=====
:ref:`Joins ` are essential operations in relational databases. They
create a link between rows based on common values and allow the meaningful
combination of these rows. CrateDB supports joins and, due to its distributed
nature, allows you to work with large amounts of data.
This document first gives an overview of the available join types and join
algorithms, and then describes how CrateDB implements them, along with the
optimizations that make it possible to work with huge datasets.
.. _join-types:
Join types
==========
A join is a relational operation that merges two data sets based on certain
properties. :ref:`joins_figure_1` shows which elements appear in which join.
.. _joins_figure_1:
.. figure:: joins.png
:align: center
Join Types
From left to right, top to bottom: left join, right join, inner join, outer
join, and cross join of a set L and R.
.. _join-types-cross:
Cross join
----------
A :ref:`cross join ` returns the Cartesian product of two or more
relations. The result of the Cartesian product of the relations *L* and *R*
consists of all possible combinations of each tuple of the relation *L* with
every tuple of the relation *R*.
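
As a sketch with two hypothetical relations ``l`` and ``r``:

.. code-block:: SQL

    -- every row of l is combined with every row of r;
    -- 3 rows in l and 4 rows in r yield 12 result rows
    SELECT l.a, r.b
    FROM l
    CROSS JOIN r
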
.. _join-types-inner:
Inner join
----------
An :ref:`inner join ` is a join of two or more relations that
returns only tuples that satisfy the join condition.
.. _join-types-equi:
Equi Join
.........
An *equi join* is a special case of an inner join: a comparison-based join
that uses only equality comparisons in the join condition. The equi join of
the relations *L* and *R* combines a tuple *l* of relation *L* with a tuple
*r* of the relation *R* if the join attributes of both tuples are identical.
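
For example, the following inner join uses only an equality comparison in its
join condition and is therefore an equi join (hypothetical tables):

.. code-block:: SQL

    SELECT e.name, d.name
    FROM employees e
    INNER JOIN departments d ON e.department_id = d.id
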
.. _join-types-outer:
Outer join
----------
An :ref:`outer join ` returns a relation consisting of tuples that
satisfy the join condition, plus dangling tuples from one or both of the
relations, depending on the outer join type.
An outer join can be one of the following types:
- **Left** outer join returns tuples of the relation *L* matching tuples of
the relation *R*, and dangling tuples of the relation *L* padded with null
values.
- **Right** outer join returns tuples of the relation *R* matching tuples of
the relation *L*, and dangling tuples of the relation *R* padded with null
values.
- **Full** outer join returns matching tuples of both relations and dangling
tuples produced by left and right outer joins.
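
For instance, a left outer join keeps the dangling tuples of the left
relation (hypothetical tables):

.. code-block:: SQL

    -- employees without a department are returned as well,
    -- with d.name padded with NULL
    SELECT e.name, d.name
    FROM employees e
    LEFT OUTER JOIN departments d ON e.department_id = d.id
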
.. _join-algos:
Join algorithms
===============
CrateDB supports (a) CROSS JOIN, (b) INNER JOIN, (c) EQUI JOIN, (d) LEFT JOIN,
(e) RIGHT JOIN and (f) FULL JOIN. All of these join types are executed using
the :ref:`nested loop join algorithm <join-algos-nested-loop>`, except for
:ref:`equi joins <join-types-equi>`, which are executed using the :ref:`hash
join algorithm <join-algos-hash>`. Special optimizations, according to the
specific use cases, are applied to improve execution performance.
.. _join-algos-nested-loop:
Nested loop join
----------------
The **nested loop** join is the simplest join algorithm. One of the relations
is nominated as the inner relation and the other as the outer relation. Each
tuple of the outer relation is compared with each tuple of the inner relation
and if the join condition is satisfied, the tuples of the relation *L* and *R*
are concatenated and added into the returned virtual relation::
for each tuple l ∈ L do
for each tuple r ∈ R do
if l.a Θ r.b
put tuple(l, r) in Q
*Listing 1. Nested loop join algorithm.*
.. _join-algos-nested-loop-prim:
Primitive nested loop
.....................
For joins on some relations, the nested loop operation can be executed
directly on the handler node, specifically for queries involving a CROSS JOIN
or joins on tables from :ref:`system-information` or :ref:`information_schema`.
Each shard sends its data to the handler node. Afterwards, this node runs the
nested loop, applies limits, etc., and ultimately returns the results. Since
joins can be nested, the input rows may also be the result of a previous join
or a :ref:`table function ` instead of being collected from shards.
.. _join-algos-nested-loop-dist:
Distributed nested loop
.......................
Relations are usually distributed across different nodes, which requires the
nested loop to acquire the data before being able to join. After finding the
locations of the required shards (which is done in the planning stage), the
smaller data set (based on the row count) is broadcast amongst all the nodes
holding the shards they are joined with.
After that, each of the receiving nodes can start
running a nested loop on the subset it has just received. Finally, these
intermediate results are pushed to the original (handler) node to merge and
return the results to the requesting client (see :ref:`joins_figure_2`).
.. _joins_figure_2:
.. figure:: nested-loop.png
:align: center
Nodes that are holding the smaller shards broadcast the data to the
processing nodes which then return the results to the requesting node.
Queries can be optimized if they contain (a) ORDER BY, (b) LIMIT, or (c) an
INNER/EQUI JOIN. In any of these cases, the nested loop can be terminated
earlier:
- Ordering allows determining whether any matching records are left
- A limit states the maximum number of rows to return
Consequently, the number of rows is significantly reduced, allowing the
operation to complete much faster.
.. _join-algos-hash:
Hash join
---------
The Hash Join algorithm is used to execute certain types of joins in a more
efficient way than the :ref:`nested loop <join-algos-nested-loop>`.
.. _join-algos-hash-basic:
Basic algorithm
...............
The operation takes place in one node (the handler node to which the client is
connected). The rows of the left relation of the join are read and a hashing
algorithm is applied on the fields of the relation which participate in the
join condition. The hashing algorithm generates a hash value which is used to
store every row of the left relation in the proper position in a `hash table`_.
Then the rows of the right relation are read one-by-one and the same hashing
algorithm is applied on the fields that participate in the join condition. The
generated hash value is used to make a lookup in the `hash table`_. If no entry
is found, the row is skipped and the processing continues with the next row
from the right relation. If an entry is found, the join condition is validated
(handling hash collisions) and on successful validation the combined tuple of
left and right relation is returned.
.. _joins_figure_3:
.. figure:: hash-join.png
:align: center
Basic hash join algorithm
.. _join-algos-hash-block:
Block hash join
...............
The Hash Join algorithm requires a `hash table`_ containing all the rows of the
left relation to be stored in memory. Therefore, depending on the size of the
relation (number of rows) and the size of each row, the size of this hash table
might exceed the available memory of the node executing the hash join. To
resolve this limitation the rows of the left relation are loaded into the hash
table in blocks.
On every iteration the maximum available size of the `hash table`_ is
calculated, based on the number of rows and size of each row of the table but
also taking into account the available memory for query execution on the node.
Once this block-size is calculated the rows of the left relation are processed
and inserted into the `hash table`_ until the block-size is reached.
The operation then starts reading the rows of the right relation, processes
them one by one, and performs the lookup and the join condition validation.
Once all rows from the right relation are processed, the `hash table`_ is
re-initialized based on a new calculation of the block size, and a new
iteration starts, until all rows of the left relation are processed.
With this algorithm the memory limitation is handled at the expense of
iterating over the rows of the right table multiple times. It is the default
algorithm used for Hash Join execution by CrateDB.
.. _join-algos-hash-block-switch:
Switch tables optimization
''''''''''''''''''''''''''
Since the right table can be processed multiple times (number of rows of the
left relation / block-size), the right table should be the smaller (in number
of rows) of the two relations participating in the join. Therefore, if the
right relation is originally larger than the left, the query planner switches
the two relations to take advantage of this detail and execute the hash join
with better performance.
.. _join-algos-hash-dist:
Distributed block hash join
...........................
Since CrateDB is a distributed database and a standard deployment consists of
at least three nodes, and in most cases many more, the Hash Join algorithm
execution can be further optimized (performance-wise) by executing it in a
distributed manner across the CrateDB cluster.
The idea is to execute the hash join operation on multiple nodes of the
cluster in parallel and then merge the intermediate results before returning
them to the client.
A hashing algorithm is applied to every row of both the left and the right
relation. A modulo by the number of nodes in the cluster is applied to the
integer value generated by this hash, and the resulting number defines the
node to which the row is sent. As a result, each node of the cluster receives
a subset of the whole data set which is ensured (by the hashing and modulo) to
contain all candidate matching rows.
Each node in turn performs a :ref:`block hash join <join-algos-hash-block>`
on this subset and sends its result tuples
to the handler node (where the client issued the query). Finally, the handler
node receives those intermediate results, merges them and applies any pending
``ORDER BY``, ``LIMIT`` and ``OFFSET`` and sends the final result to the
client.
This algorithm is used by CrateDB for most cases of hash join execution except
for joins on complex subqueries that contain ``LIMIT`` and/or ``OFFSET``.
.. _joins_figure_4:
.. figure:: distributed-hash-join.png
:align: center
Distributed hash join algorithm
.. _join-optim:
Join optimizations
==================
.. _join-optim-optim-query-fetch:
Query then fetch
----------------
Join operations on large relations can be extremely slow, especially if the
join is executed with a :ref:`nested loop <join-algos-nested-loop>`, whose
runtime grows with the product of the relation sizes (O(n*m)). Specifically
for :ref:`cross joins <join-types-cross>` this results in large amounts of
data sent over the network and loaded into memory at the handler node.
CrateDB reduces the volume of data transferred by employing "Query Then
Fetch": First, filtering and ordering are applied (if possible where the data
is located) to obtain the required document IDs. Next, as soon as the final
data set is ready, CrateDB fetches the selected fields and returns the data to
the client.
.. _join-optim-optim-push-down:
Push-down query optimization
----------------------------
Complex queries such as Listing 2 require the planner to decide when to filter,
sort, and merge in order to efficiently execute the plan. In this case, the
query would be split internally into subqueries before running the join. As
shown in :ref:`joins_figure_5`, first filtering (and ordering) is applied to
relations *L* and *R* on their shards, then the result is directly broadcast to
the nodes running the join. Not only will this behavior reduce the number of
rows to work with, it also distributes the workload among the nodes so that the
(expensive) join operation can run faster.
.. code-block:: SQL
SELECT L.a, R.x
FROM L, R
WHERE L.id = R.id
AND L.b > 100
AND R.y < 10
ORDER BY L.a
*Listing 2. An INNER JOIN on ids (effectively an EQUI JOIN) which can be
optimized.*
.. _joins_figure_5:
.. figure:: push-down.png
:align: center
Figure 5
Complex queries are broken down into subqueries that are run on their shards
before joining.
.. _join-optim-cross-join-elimination:
Cross join elimination
----------------------
The optimizer will try to eliminate cross joins in the query plan by changing
the join-order. Cross join elimination replaces a CROSS JOIN with an INNER JOIN
if query conditions used in the WHERE clause or other join conditions allow
for it. An example:
.. code-block:: SQL
SELECT *
FROM t1 CROSS JOIN t2
INNER JOIN t3
ON t3.z = t1.x AND t3.z = t2.y
The cross join elimination will change the order of the query from t1, t2, t3
to t2, t1, t3 so that each join has a join condition and the CROSS JOIN can be
replaced by an INNER JOIN. When reordering, it will try to preserve the
original join order as much as possible. If a CROSS JOIN cannot be eliminated,
the original join order will be maintained. This optimizer rule can be disabled
with the :ref:`optimizer eliminate cross join session setting
`::
SET optimizer_eliminate_cross_join = false
Note that this setting is experimental, and may change in the future.
.. _hash table: https://en.wikipedia.org/wiki/Hash_table
.. _here: http://www.dcs.ed.ac.uk/home/tz/phd/thesis.pdf

.. _concept-storage-consistency:
=======================
Storage and consistency
=======================
This document provides an overview on how CrateDB stores and distributes state
across the cluster and what consistency and durability guarantees are provided.
.. NOTE::
Since CrateDB heavily relies on Elasticsearch_ and Lucene_ for storage and
cluster consensus, concepts shown here might look familiar to Elasticsearch_
users, since the implementation is actually reused from the Elasticsearch_
code.
.. _concept-data-storage:
Data storage
============
Every table in CrateDB is sharded, which means that tables are divided and
distributed across the nodes of a cluster. Each shard in CrateDB is a Lucene_
index broken down into segments which are stored on the filesystem. Physically,
the files reside under one of the configured data directories of a node.
Lucene only appends data to segment files, which means that data written to
disk is never mutated. This makes replication and
:ref:`recovery ` easy, since syncing a shard is simply a
matter of fetching data from a specific marker.
An arbitrary number of replica shards can be configured per table. Every
operational replica holds a full synchronized copy of the primary shard.
With read operations, there is no difference between executing the
operation on the primary shard or on any of the replicas. CrateDB
randomly assigns a shard when routing an operation. It is possible to
configure this behavior if required, see our best practice guide on
:ref:`multi zone setups `
for more details.
Write operations are handled differently than reads. Such operations are
synchronous over all active replicas with the following flow:
1. The primary shard and the active replicas are looked up in the cluster state
for the given operation. The primary shard and a quorum of the configured
replicas need to be available for this step to succeed.
2. The operation is routed to the according primary shard for execution.
3. The operation gets executed on the primary shard.
4. If the operation succeeds on the primary, the operation gets executed on all
replicas in parallel.
5. After all replica operations finish the operation result gets returned to
the caller.
Should any replica shard fail to write the data or time out in step 5, it is
immediately considered unavailable.
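
The number of replicas participating in this flow is a per-table setting; a
sketch with a hypothetical table:

.. code-block:: SQL

    -- writes will then be propagated synchronously to two replicas
    ALTER TABLE articles SET (number_of_replicas = 2);
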
.. _concept-atomicity:
Atomicity at document level
===========================
Each row of a table in CrateDB is a semi-structured document which can be
nested arbitrarily deep through the use of object and array types.
Operations on documents are atomic, meaning that a write operation on a
document either succeeds as a whole or has no effect at all. This is always the
case, regardless of the nesting depth or size of the document.
CrateDB does not provide transactions. Since every document in CrateDB has a
version number assigned, which gets increased every time a change occurs,
patterns like :ref:`sql_occ` can help to work around that
limitation.
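
As a sketch of such a pattern, the sequencing metadata of a row can be read
and then used to guard a subsequent update (table and values are
hypothetical):

.. code-block:: SQL

    -- read the current sequencing metadata together with the row
    SELECT _seq_no, _primary_term, title FROM articles WHERE id = 1;

    -- the update only takes effect if the row has not been
    -- changed in the meantime
    UPDATE articles SET title = 'Mostly Harmless'
    WHERE id = 1 AND _seq_no = 4 AND _primary_term = 1;
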
.. _concept-durability:
Durability
==========
Each shard has a WAL_, also known as the translog. It guarantees that
operations on documents are persisted to disk without having to issue a
Lucene commit for every write operation. When the translog gets flushed, all
data is written to the persistent index storage of Lucene and the translog is
cleared.
In case of an unclean shutdown of a shard, the transactions in the translog
are replayed upon startup to ensure that all executed operations are
permanent.
The translog is also directly transferred when a newly allocated replica
initializes itself from the primary shard. There is no need to flush segments
to disc just for replica :ref:`recovery ` purposes.
.. _concept-addressing-documents:
Addressing documents
====================
Every document has an :ref:`internal identifier
`. By default this identifier is derived
from the primary key. Documents living in tables without a primary key are
assigned a unique auto-generated ID automatically when created.
Each document is :ref:`routed ` to one specific shard
according to the :ref:`routing column `. All rows that
have the same routing column row value are stored in the same shard. The
routing column can be specified with the :ref:`CLUSTERED
` clause when creating the table. If a
:ref:`primary key ` has been defined, it will be used
as the default routing column, otherwise the :ref:`internal document ID
` is used.
While transparent to the user, internally there are two ways how CrateDB
accesses documents:
:get:
Direct access by identifier. Only applicable if the routing key and the
identifier can be computed from the given query specification (e.g., the full
primary key is defined in the ``WHERE`` clause).
This is the most efficient way to access a document, since only a single shard
gets accessed and only a simple index lookup on the ``_id`` field has to be
done.
:search:
Query by matching against fields of documents across all candidate shards of
the table.
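
A sketch with a hypothetical table whose primary key is ``id``:

.. code-block:: SQL

    -- full primary key given: resolved as a "get", touching only
    -- the single shard that holds the document
    SELECT * FROM articles WHERE id = 1;

    -- filter on a non-key column: resolved as a "search" across
    -- all candidate shards of the table
    SELECT * FROM articles WHERE title = 'Mostly Harmless';
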
.. _concept-consistency:
Consistency
===========
CrateDB is eventually consistent for search operations. Search operations are
performed on shared ``IndexReaders`` which, among other functionality, provide
caching and reverse lookup capabilities for shards. An ``IndexReader`` is
always bound to the Lucene_ segment it was started from, which means it has to
be refreshed in order to see new changes; this is done periodically, but can
also be triggered manually (see :ref:`sql-refresh`). Therefore a search only
sees a change if the according ``IndexReader`` was refreshed after that change
occurred.
If a query specification results in a ``get`` operation, changes are visible
immediately. This is achieved by looking up the document in the translog first,
which will always have the most recent version of the document. The common
update-and-fetch use case is therefore possible. If a client updates a row
and then looks that row up by its primary key, the changes will always be
visible, since the information is retrieved directly from the translog. There
is an exception to this when the ``WHERE`` clause contains complex filtering
and/or many primary key values. You can find more details
:ref:`here `.
.. NOTE::
``Dirty reads`` can occur if the primary shard becomes isolated. The primary
will only realize it is isolated once it tries to communicate with its
replicas or the master. At that point, a write operation is already committed
into the primary and can be read by a concurrent read operation. To minimize
the window of opportunity for this phenomenon, the CrateDB nodes communicate
with the master every second (by default) and, once they realize that no
master is known, they start rejecting write operations.
Every replica shard is updated synchronously with its primary and always
carries the same information. Therefore it does not matter if the primary or
a replica shard is accessed in terms of consistency. Only the refresh of the
``IndexReader`` affects consistency.
.. NOTE::
Due to internal constraints, when the ``WHERE`` clause filters on multiple
columns of a ``PRIMARY KEY``, but one or more of those columns is tested
against lots of values, the query might be executed using a ``Collect``
operator instead of a ``Get``, thus records might be unavailable until a
``REFRESH`` is run. The same situation could occur when the ``WHERE`` clause
contains long complex expressions, e.g.::
SELECT * FROM t
WHERE pk1 IN (...) AND pk2 = 3 AND pk3 = 'foo'
SELECT * FROM t
WHERE pk1 = ?
AND pk2 = ?
AND pk3 = ?
OR pk1 = ?
AND pk2 = ?
AND pk3 = ?
OR pk1 = ?
...
.. CAUTION::
Some outage conditions can affect these consistency claims. See the
:ref:`resiliency documentation ` for details.
.. _concept-cluster-metadata:
Cluster meta data
=================
Cluster meta data is held in the so-called "cluster state", which contains the
following information:
- Table schemas.
- Primary and replica shard locations. Basically just a mapping from shard
number to the storage node.
- Status of each shard, which tells if a shard is currently ready for use or
has any other state like "initializing", "recovering" or cannot be assigned
at all.
- Information about discovered nodes and their status.
- Configuration information.
Every node has its own copy of the cluster state. However, only a single node
at a time is allowed to change the cluster state at runtime. This node is
called the "master" node and is auto-elected. The "master" node requires no
special configuration; all nodes are master-eligible by default, and any
master-eligible node can be elected as the master. There is also an automatic
re-election if the current master node goes down for some reason.
.. NOTE::
To avoid a scenario where two masters could be elected due to network
partitioning, CrateDB automatically defines a quorum of nodes with
which it is possible to elect a master. For details on how this works
and further information see :ref:`concept-master-election`.
To explain the flow of events for any cluster state change, here is an example
flow for an ``ALTER TABLE`` statement which changes the schema of a table:
#. A node in the cluster receives the ``ALTER TABLE`` request.
#. The node sends out a request to the current master node to change the table
definition.
#. The master node applies the changes locally to the cluster state and sends
out a notification to all affected nodes about the change.
#. The nodes apply the change, so that they are now in sync with the master.
#. Every node might take some local action depending on the type of cluster
state change.
.. _Elasticsearch: https://www.elastic.co
.. _Lucene: https://lucene.apache.org/core/
.. _WAL: https://en.wikipedia.org/wiki/Write-ahead_logging

.. _concept-resiliency:
==========
Resiliency
==========
Distributed systems are tricky. All sorts of things can go wrong that are
beyond your control. The network can go away, disks can fail, hosts can be
terminated unexpectedly. CrateDB tries very hard to cope with these sorts of
issues while maintaining :ref:`availability `,
:ref:`consistency `, and :ref:`durability
`.
However, as with any distributed system, sometimes, *rarely*, things can go
wrong.
Thankfully, for most use-cases, if you follow best practices, you are extremely
unlikely to experience resiliency issues with CrateDB.
.. SEEALSO::
:ref:`Appendix: Resiliency Issues `
.. _concept-resiliency-monitoring:
Monitoring cluster status
=========================
.. figure:: resilience-status.png
:alt:
The Admin UI in CrateDB has a status indicator which can be used to determine
the stability and health of a cluster.
A green status indicates that all shards have been replicated, are available,
and are not being relocated. This is the lowest risk status for a cluster. The
status will turn yellow when there is an elevated risk of encountering issues,
due to a network failure or the failure of a node in the cluster.
The status is updated every few seconds.
.. _concept-resiliency-consistency:
Storage and consistency
=======================
Code that expects the behavior of an `ACID
<https://en.wikipedia.org/wiki/ACID>`_ compliant database like MySQL may not
always work as expected with CrateDB.
CrateDB does not support ACID transactions, but instead has :ref:`atomic
operations ` and :ref:`eventual consistency
` at the row level. See also :ref:`concept-clustering`.
Eventual consistency is the trade-off that CrateDB makes in exchange for
high availability that can tolerate most hardware and network failures. So you
may observe data from different cluster nodes briefly falling out of sync with
each other, although over time they will become consistent.
For example, you know a row has been written as soon as you get the ``INSERT
OK`` message. But that row might not be read back by a subsequent ``SELECT`` on
a different node until after a :ref:`table refresh ` (which
typically occurs within one second).
Your applications should be designed to work with this storage and consistency model.
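
For example, a client that must read its own writes immediately can force a
refresh instead of waiting for the periodic one (a sketch with a hypothetical
table):

.. code-block:: SQL

    INSERT INTO articles (id, title) VALUES (1, 'Mostly Harmless');
    -- make the new row visible to searches on all nodes now
    REFRESH TABLE articles;
    SELECT * FROM articles WHERE title = 'Mostly Harmless';
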
.. _concept-resiliency-deployment:
Deployment strategies
=====================
When deploying CrateDB you should carefully weigh your need for
high-availability and disaster recovery against operational complexity and
expense.
Which strategy you pick is going to depend on the specifics of your situation.
Here are some considerations:
- CrateDB is designed to scale horizontally. Make sure that your machines are
fit for purpose, i.e. use SSDs, increase RAM up to 64 GB, and use multiple
CPU cores when you can. But if you want to dynamically increase (or
decrease) the capacity of your cluster, :ref:`add (or remove) nodes
`.
- If availability is a concern, you can add :ref:`nodes across multiple zones
`
(e.g. different data centers or geographical regions). The more available
your CrateDB cluster is, the more likely it is to withstand external
failures like a zone going down.
- If data durability or read performance is a concern, you can increase the
number of :ref:`table replicas `.
More table replicas means a smaller chance of permanent data loss due to
hardware failures, in exchange for the use of more disk space and more
intra-cluster network traffic.
- If disaster recovery is important, you can :ref:`take regular snapshots
` and store those snapshots in cold storage. This
safeguards data that has already been successfully written and replicated
across the cluster.
- CrateDB works well as part of a data pipeline, especially if you're working
with high-volume data. If you have a message queue in front of CrateDB, you
can configure it with backups and replay the data flow for a specific
timeframe. This can be used to recover from issues that affect your data
before it has been successfully written and replicated across the cluster.
Indeed, this is the generally recommended way to recover from any of the
rare consistency or data-loss issues you might encounter when CrateDB
experiences network or hardware failures (see next section).

.. highlight:: psql
.. _partitioned-tables:
==================
Partitioned tables
==================
.. _partitioned-intro:
Introduction
============
A partitioned table is a virtual table consisting of zero or more partitions. A
partition is similar to a regular single table and consists of one or more
shards.
::
partitioned_table
|
+-- partition 1
| |
| +- shard 0
| |
| +- shard 1
|
+-- partition 2
|
+- shard 0
|
+- shard 1
A table becomes a partitioned table by defining :ref:`partition columns
`. When a record with a new distinct combination of
values for the configured :ref:`partition columns ` is
inserted, a new partition is created and the document will be inserted into
this partition.
A partitioned table can be queried like a regular table.
Partitioned tables have the following advantages:
- The number of shards can be changed on the partitioned table, which will then
change how many shards will be used for the next partition creation. This
enables one to start out with few shards per partition initially, and scale
up the number of shards for later partitions once traffic and ingest rates
increase with the lifetime of an application.
- Partitions can be backed up and restored individually.
- Queries which contain filters in the ``WHERE`` clause that identify a single
partition or a subset of partitions are less expensive than queries across all
partitions, because the shards of the excluded partitions don't have to be
accessed.
- Deleting data from a partitioned table is cheap if full partitions are
dropped. Full partitions are dropped with ``DELETE`` statements where the
optimizer can infer from the ``WHERE`` clause and the partition columns that
all records of a partition match, without having to :ref:`evaluate
` against the records (see the example after this list).
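
For example, with a table partitioned by ``day`` (like the one created later
in this document), a ``DELETE`` that pins the partition column to a single
value can drop the whole partition outright (a sketch):

.. code-block:: SQL

    -- matches all records of one partition, so the whole partition
    -- is dropped without evaluating individual records
    DELETE FROM parted_table WHERE day = '2014-04-08';
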
Partitioned tables have the following disadvantages:
- If the partition columns are badly chosen you can end up with too many shards
in the cluster, affecting the overall stability and performance negatively.
- You may end up with empty, stale partitions if delete operations couldn't be
optimized to drop full partitions. You may have to watch out for this and
invoke ``DELETE`` statements to target single partitions to clean them up.
- Some optimizations don't apply to partitioned tables. An example for this is
a GROUP BY query where the grouping keys match the ``CLUSTERED BY`` columns
of a table. This kind of query can be optimized on regular tables, but cannot
be optimized on a partitioned table.
.. NOTE::
Keep in mind that the values of the partition columns are internally base32
encoded into the partition name (which is a separate table).
So, for every partition, the partition table name includes:
- The table schema (optional)
- The table name
- The base32 encoded partition column value(s)
- An internal overhead of 14 bytes
Altogether, the table name length must not exceed the :ref:`255 bytes
length limitation `.
.. CAUTION::
Every table partition is clustered into as many shards as you configure for
the table. Because of this, a good partition configuration depends on good
:ref:`shard allocation `.
Well tuned shard allocation is vital. Read the `sharding guide`_ to make
sure you're getting the best performance out of CrateDB.
.. _partitioned-creation:
Creation
========
A partitioned table can be created using the :ref:`sql-create-table`
statement with the :ref:`sql-create-table-partitioned-by` clause::
cr> CREATE TABLE parted_table (
... id bigint,
... title text,
... content text,
... width double precision,
... day timestamp with time zone
... ) CLUSTERED BY (title) INTO 4 SHARDS PARTITIONED BY (day);
CREATE OK, 1 row affected (... sec)
This creates an empty partitioned table which is not yet backed by real
partitions. Nonetheless it does behave like a *normal* table.
When the value to partition by references one or more
:ref:`sql-create-table-base-columns`, their values must be supplied upon
:ref:`sql-insert` or :ref:`sql-copy-from`. Often these values are computed on
the client side. If this is not possible, a :ref:`generated column
` can be used to create a suitable
partition value from the given values on the database side::
cr> CREATE TABLE computed_parted_table (
... id bigint,
... data double precision,
... created_at timestamp with time zone,
... month timestamp with time zone GENERATED ALWAYS AS date_trunc('month', created_at)
... ) PARTITIONED BY (month);
CREATE OK, 1 row affected (... sec)
.. _partitioned-info-schema:
Information schema
==================
This table shows up in the ``information_schema.tables`` table, recognizable
as a partitioned table by a non-null ``partitioned_by`` column (aliased as
``p_b`` here)::
cr> SELECT table_schema as schema,
... table_name,
... number_of_shards as num_shards,
... number_of_replicas as num_reps,
... clustered_by as c_b,
... partitioned_by as p_b,
... blobs_path
... FROM information_schema.tables
... WHERE table_name='parted_table';
+--------+--------------+------------+----------+-------+---------+------------+
| schema | table_name | num_shards | num_reps | c_b | p_b | blobs_path |
+--------+--------------+------------+----------+-------+---------+------------+
| doc | parted_table | 4 | 0-1 | title | ["day"] | NULL |
+--------+--------------+------------+----------+-------+---------+------------+
SELECT 1 row in set (... sec)
::
cr> SELECT table_schema as schema, table_name, column_name, data_type
... FROM information_schema.columns
... WHERE table_schema = 'doc' AND table_name = 'parted_table'
... ORDER BY table_schema, table_name, column_name;
+--------+--------------+-------------+--------------------------+
| schema | table_name | column_name | data_type |
+--------+--------------+-------------+--------------------------+
| doc | parted_table | content | text |
| doc | parted_table | day | timestamp with time zone |
| doc | parted_table | id | bigint |
| doc | parted_table | title | text |
| doc | parted_table | width | double precision |
+--------+--------------+-------------+--------------------------+
SELECT 5 rows in set (... sec)
You can get information about the partitions of a partitioned table by querying
the ``information_schema.table_partitions`` table::
cr> SELECT count(*) as partition_count
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table';
+-----------------+
| partition_count |
+-----------------+
| 0 |
+-----------------+
SELECT 1 row in set (... sec)
As this table is still empty, no partitions have been created.
.. _partitioned-insert:
Insert
======
::
cr> INSERT INTO parted_table (id, title, width, day)
... VALUES (1, 'Don''t Panic', 19.5, '2014-04-08');
INSERT OK, 1 row affected (... sec)
::
cr> SELECT partition_ident, "values", number_of_shards
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table'
... ORDER BY partition_ident;
+--------------------------+------------------------+------------------+
| partition_ident | values | number_of_shards |
+--------------------------+------------------------+------------------+
| 04732cpp6osj2d9i60o30c1g | {"day": 1396915200000} | 4 |
+--------------------------+------------------------+------------------+
SELECT 1 row in set (... sec)
On subsequent inserts with the same :ref:`partition column
` values, no additional partition is created::
cr> INSERT INTO parted_table (id, title, width, day)
... VALUES (2, 'Time is an illusion, lunchtime doubly so', 0.7, '2014-04-08');
INSERT OK, 1 row affected (... sec)
::
cr> REFRESH TABLE parted_table;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT partition_ident, "values", number_of_shards
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table'
... ORDER BY partition_ident;
+--------------------------+------------------------+------------------+
| partition_ident | values | number_of_shards |
+--------------------------+------------------------+------------------+
| 04732cpp6osj2d9i60o30c1g | {"day": 1396915200000} | 4 |
+--------------------------+------------------------+------------------+
SELECT 1 row in set (... sec)
.. _partitioned-update:
Update
======
:ref:`Partition columns ` cannot be changed, because
doing so would require moving all affected documents. Such an operation would
not be atomic and could lead to an inconsistent state::
cr> UPDATE parted_table set content = 'now panic!', day = '2014-04-07'
... WHERE id = 1;
ColumnValidationException[Validation failed for day: Updating a partitioned-by column is not supported]
When using a :ref:`generated column ` as
partition column, all the columns referenced in its :ref:`generation expression
` cannot be updated either::
cr> UPDATE computed_parted_table set created_at='1970-01-01'
... WHERE id = 1;
ColumnValidationException[Validation failed for created_at: Updating a column which is referenced in a partitioned by generated column expression is not supported]
::
cr> UPDATE parted_table set content = 'now panic!'
... WHERE id = 2;
UPDATE OK, 1 row affected (... sec)
::
cr> REFRESH TABLE parted_table;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT * from parted_table WHERE id = 2;
+----+------------------------------------------+------------+-------+---------------+
| id | title | content | width | day |
+----+------------------------------------------+------------+-------+---------------+
| 2 | Time is an illusion, lunchtime doubly so | now panic! | 0.7 | 1396915200000 |
+----+------------------------------------------+------------+-------+---------------+
SELECT 1 row in set (... sec)
.. _partitioned-delete:
Delete
======
Deleting with a ``WHERE`` clause matching all rows of a partition will drop the
whole partition instead of deleting every matching document, which is a lot
faster::
cr> delete from parted_table where day = 1396915200000;
DELETE OK, -1 rows affected (... sec)
::
cr> SELECT count(*) as partition_count
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table';
+-----------------+
| partition_count |
+-----------------+
| 0 |
+-----------------+
SELECT 1 row in set (... sec)
.. _partitioned-querying:
Querying
========
``UPDATE``, ``DELETE`` and ``SELECT`` queries are all optimized to only affect
as few partitions as possible based on the partitions referenced in the
``WHERE`` clause.
The ``WHERE`` clause is analyzed for partition use by checking the ``WHERE``
conditions against the values of the :ref:`partition columns
`.
For example, the following query will only operate on the partition for
``day=1396915200000``:
.. Hidden: insert some rows::
cr> INSERT INTO parted_table (id, title, content, width, day) VALUES
... (1, 'The incredible foo', 'foo is incredible', 12.9, '2015-11-16'),
... (2, 'The dark bar rises', 'na, na, na, na, na, na, na, na, barman!', 0.5, '1970-01-01'),
... (3, 'Kill baz', '*splatter*, *oommph*, *zip*', 13.5, '1970-01-01'),
... (4, 'Spice Pork And haM', 'want some roses?', -0.0, '1999-12-12');
INSERT OK, 4 rows affected (... sec)
.. Hidden: refresh
cr> REFRESH TABLE parted_table;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT count(*) FROM parted_table
... WHERE day='1970-01-01'
... ORDER by 1;
+-------+
| count |
+-------+
| 2 |
+-------+
SELECT 1 row in set (... sec)
Any combination of conditions that can be :ref:`evaluated `
to a partition before actually executing the query is supported::
cr> SELECT id, title FROM parted_table
... WHERE date_trunc('year', day) > '1970-01-01'
... OR extract(day_of_week from day) = 1
... ORDER BY id DESC;
+----+--------------------+
| id | title |
+----+--------------------+
| 4 | Spice Pork And haM |
| 1 | The incredible foo |
+----+--------------------+
SELECT 2 rows in set (... sec)
Internally the ``WHERE`` clause is evaluated against the existing partitions
and their partition values. These partitions are then filtered to obtain the
list of partitions that need to be accessed.
.. Hidden: delete::
cr> DELETE FROM parted_table;
DELETE OK, -1 rows affected (... sec)
.. _partitioned-generated:
Partitioning by generated columns
---------------------------------
Querying on tables partitioned by generated columns is optimized to infer a
minimum list of partitions from the :ref:`partition columns
` referenced in the ``WHERE`` clause:
.. Hidden: insert some stuff::
cr> INSERT INTO computed_parted_table (id, data, created_at) VALUES
... (1, 42.0, '2015-11-16T14:27:00+01:00'),
... (2, 0.0, '2015-11-16T00:00:00Z'),
... (3, 23.0,'1970-01-01');
INSERT OK, 3 rows affected (... sec)
.. Hidden: refresh::
cr> REFRESH TABLE computed_parted_table;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT id, date_format('%Y-%m', month) as m FROM computed_parted_table
... WHERE created_at = '2015-11-16T13:27:00.000Z'
... ORDER BY id;
+----+---------+
| id | m |
+----+---------+
| 1 | 2015-11 |
+----+---------+
SELECT 1 row in set (... sec)
.. _partitioned-alter:
Alter
=====
Parameters of partitioned tables can be changed as usual (see
:ref:`sql_ddl_alter_table` for more information on how to alter regular tables)
with the :ref:`sql-alter-table` statement. Common ``ALTER TABLE`` parameters
affect both existing partitions and partitions that will be created in the
future.
::
cr> ALTER TABLE parted_table SET (number_of_replicas = '0-all')
ALTER OK, -1 rows affected (... sec)
Altering schema information (such as the column policy or adding columns) can
only be done on the table (not on single partitions) and will take effect on
both existing and new partitions of the table.
::
cr> ALTER TABLE parted_table ADD COLUMN new_col text
ALTER OK, -1 rows affected (... sec)
.. _partitioned-alter-shards:
Changing the number of shards
-----------------------------
It is possible at any time to change the number of shards of a partitioned
table.
::
cr> ALTER TABLE parted_table SET (number_of_shards = 10)
ALTER OK, -1 rows affected (... sec)
.. NOTE::
This will **not** change the number of shards of existing partitions,
but the new number of shards will be taken into account when **new**
partitions are created.
::
cr> INSERT INTO parted_table (id, title, width, day)
... VALUES (2, 'All Good', 3.1415, '2014-04-08');
INSERT OK, 1 row affected (... sec)
.. Hidden: refresh table::
cr> REFRESH TABLE parted_table;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT count(*) as num_shards, sum(num_docs) as num_docs
... FROM sys.shards
... WHERE schema_name = 'doc' AND table_name = 'parted_table';
+------------+----------+
| num_shards | num_docs |
+------------+----------+
| 10 | 1 |
+------------+----------+
SELECT 1 row in set (... sec)
::
cr> SELECT partition_ident, "values", number_of_shards
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table'
... ORDER BY partition_ident;
+--------------------------+------------------------+------------------+
| partition_ident | values | number_of_shards |
+--------------------------+------------------------+------------------+
| 04732cpp6osj2d9i60o30c1g | {"day": 1396915200000} | 10 |
+--------------------------+------------------------+------------------+
SELECT 1 row in set (... sec)
.. _partitioned-alter-single:
Altering a single partition
...........................
We also provide the option to change the number of shards that are already
:ref:`allocated ` for an existing partition. This
option operates on a per-partition basis, thus a specific partition must be
specified::
cr> ALTER TABLE parted_table PARTITION (day=1396915200000) SET ("blocks.write" = true)
ALTER OK, -1 rows affected (... sec)
cr> ALTER TABLE parted_table PARTITION (day=1396915200000) SET (number_of_shards = 5)
ALTER OK, 0 rows affected (... sec)
cr> ALTER TABLE parted_table PARTITION (day=1396915200000) SET ("blocks.write" = false)
ALTER OK, -1 rows affected (... sec)
::
cr> SELECT partition_ident, "values", number_of_shards
... FROM information_schema.table_partitions
... WHERE table_schema = 'doc' AND table_name = 'parted_table'
... ORDER BY partition_ident;
+--------------------------+------------------------+------------------+
| partition_ident | values | number_of_shards |
+--------------------------+------------------------+------------------+
| 04732cpp6osj2d9i60o30c1g | {"day": 1396915200000} | 5 |
+--------------------------+------------------------+------------------+
SELECT 1 row in set (... sec)
.. NOTE::
The same prerequisites and restrictions as with normal tables apply. See
:ref:`alter-shard-number`.
.. _partitioned-alter-parameters:
Alter table parameters
----------------------
It is also possible to alter parameters of single partitions of a partitioned
table. However, unlike with the table as a whole, it is not possible to alter
the schema information of single partitions.
To change table parameters such as ``number_of_replicas`` or other table
settings, use the :ref:`sql-alter-table-partition` clause.
::
cr> ALTER TABLE parted_table PARTITION (day=1396915200000) RESET (number_of_replicas)
ALTER OK, -1 rows affected (... sec)
.. _partitioned-alter-table:
Alter table ``ONLY``
--------------------
Sometimes one wants to alter a partitioned table, but the changes should only
affect new partitions and not existing ones. This can be done by using the
``ONLY`` keyword.
::
cr> ALTER TABLE ONLY parted_table SET (number_of_replicas = 1);
ALTER OK, -1 rows affected (... sec)
.. _partitioned-alter-close-open:
Closing and opening a partition
-------------------------------
A single partition within a partitioned table can be opened and closed in the
same way a normal table can.
::
cr> ALTER TABLE parted_table PARTITION (day=1396915200000) CLOSE;
ALTER OK, -1 rows affected (... sec)
This will cause all operations except ``ALTER TABLE ... OPEN`` to fail on this
partition. The partition will also not be included in any query on the
partitioned table.
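A closed partition can be reopened in the same way. As an illustrative sketch,
mirroring the statement above::

    cr> ALTER TABLE parted_table PARTITION (day=1396915200000) OPEN;
    ALTER OK, -1 rows affected (... sec)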
.. _partitioned-limitations:
Limitations
===========
* ``WHERE`` clauses cannot contain queries like ``partitioned_by_column='x' OR
normal_column=x``
.. _partitioned-consistency:
Consistency notes related to concurrent DML statements
=======================================================
If a partition is deleted during an active bulk insert or update operation,
this partition won't be re-created.
The number of affected rows will always reflect the real number of
inserted/updated documents.
.. Hidden: drop table::
cr> drop table parted_table;
DROP OK, 1 row affected (... sec)
.. Hidden: drop computed table::
cr> DROP TABLE computed_parted_table;
DROP OK, 1 row affected (... sec)
.. _sharding guide: https://cratedb.com/docs/crate/howtos/en/latest/performance/sharding.html

.. _ddl-storage:
=======
Storage
=======
Data storage options can be tuned for each column similar to how indexing is defined.
.. _ddl-storage-columnstore:
Column store
============
Besides storing the row data as-is (and indexing each value by default), each
value term is stored in a `Column Store`_ by default. Using a `Column Store`_
greatly improves global aggregations and groupings, and enables efficient
ordering, because the data for one column is packed in one place. Using the
`Column Store`_ limits the values of :ref:`type-text` columns to a maximum
length of 32766 bytes.
Turning off the `Column Store`_ in conjunction with :ref:`turning off indexing
` will remove the length limitation.
Example:
::
cr> CREATE TABLE t1 (
... id INTEGER,
... url TEXT INDEX OFF STORAGE WITH (columnstore = false)
... );
CREATE OK, 1 row affected (... sec)
Doing so will enable support for inserting strings longer than 32766 bytes into
the ``url`` column, but the performance for global aggregations, groupings and
sorting using this ``url`` column will decrease.
.. NOTE::
``INDEX OFF`` and therefore ``columnstore = false`` cannot be used with
:ref:`partition columns `, as those are not stored
as normal columns of a table.
.. hide:
cr> drop table t1;
DROP OK, 1 row affected (... sec)
Supported data types
--------------------
Controlling whether values are stored in a `Column Store`_ is only supported
on the following data types:
- :ref:`type-text`
- :ref:`data-types-numeric`
- :ref:`type-timestamp`
- :ref:`type-timestamp-with-tz`
For all other :ref:`data-types-primitive` and :ref:`data-types-geo-point` it is
enabled by default and cannot be disabled. :ref:`data-types-container` and
:ref:`data-types-geo-shape` do not support storing values into a
`Column Store`_ at all.
.. _Column Store: https://en.wikipedia.org/wiki/Column-oriented_DBMS

.. _ddl-replication:
===========
Replication
===========
You can configure CrateDB to *replicate* tables. When you configure
replication, CrateDB will try to ensure that every table :ref:`shard
` has one or more copies available at all times.
When there are multiple copies of the same shard, CrateDB will mark one as the
*primary shard* and treat the rest as *replica shards*. Write operations
always go to the primary shard, whereas read operations can go to any
shard. CrateDB continually synchronizes data from the primary shard to all
replica shards (through a process known as :ref:`shard recovery
`).
When a primary shard is lost (e.g., due to node failure), CrateDB will promote
a replica shard to a primary. Hence, more table replicas mean a smaller chance
of permanent data loss (through increased `data redundancy`_) in exchange for
more disk space utilization and intra-cluster network traffic.
Replication can also improve read performance because any increase in the
number of shards distributed across a cluster also increases the opportunities
for CrateDB to `parallelize`_ query execution across multiple nodes.
.. _ddl-replication-config:
Table configuration
===================
You can configure the number of per-shard replicas :ref:`WITH
` the :ref:`sql-create-table-number-of-replicas` table
setting.
For example::
cr> CREATE TABLE my_table (
... first_column integer,
... second_column text
... ) WITH (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
As well as being able to configure a fixed number of replicas, you can
configure a range of values by using a string to specify a minimum and a
maximum (dependent on the number of nodes in the cluster).
Here are some examples of replica ranges:
========= =====================================================================
Range Explanation
========= =====================================================================
``0-1`` If you only have one node, CrateDB will not create any replicas. If
you have more than one node, CrateDB will create one replica per
shard.
This range is the default value.
--------- ---------------------------------------------------------------------
``2-4`` Each table will require at least two replicas for CrateDB to consider
it fully replicated (i.e., a *green* replication :ref:`health status
`).
If the cluster has five nodes, CrateDB will create four replicas and
allocate each one to a node that does not hold the corresponding
primary.
Suppose a cluster has four nodes or fewer. In that case, CrateDB will
be unable to allocate every replica to a node that does not hold the
corresponding primary, putting the table into :ref:`underreplication
`. As a result, CrateDB will give
the table a *yellow* replication :ref:`health status
`.
--------- ---------------------------------------------------------------------
``0-all`` CrateDB will create one replica shard for every node that is
available in addition to the node that holds the primary shard.
========= =====================================================================
If you do not specify a ``number_of_replicas``, CrateDB will create one or zero
replicas, depending on the number of available nodes in the cluster (e.g., on a
single-node cluster, ``number_of_replicas`` will be set to zero to allow fast
write operations with the default setting of
:ref:`sql-create-table-write-wait-for-active-shards`).
You can change the :ref:`sql-create-table-number-of-replicas` setting at any
time.
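For example, switching the table created above to a bounded replica range is a
one-line statement (an illustrative sketch; the range chosen here is
arbitrary)::

    cr> ALTER TABLE my_table SET (number_of_replicas = '1-3');
    ALTER OK, -1 rows affected (... sec)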
.. SEEALSO::
:ref:`CREATE TABLE: WITH clause `
.. _ddl-replication-recovery:
Shard recovery
==============
CrateDB :ref:`allocates ` each primary and replica
shard to a specific node. You can control this behavior by configuring the
:ref:`allocation ` settings.
If one or more nodes become unavailable (e.g., due to hardware failure or
network issues), CrateDB will try to recover a replicated table by doing the
following:
.. rst-class:: open
- For every lost primary shard, locate a replica and promote it to a primary.
When CrateDB promotes a replica to primary, it can no longer function as a
replica, and so the total number of replicas decreases by one. Because each
primary requires a fixed :ref:`sql-create-table-number-of-replicas`, a new
replica has to be created (see next item).
- For every primary with too few replicas (due to node loss or replica
promotion), use the primary shard to :ref:`recover `
the required number of replicas.
Shard recovery is one of the features that allows CrateDB to provide continuous
`availability`_ and `partition tolerance`_ in exchange for some
:ref:`consistency trade-offs `.
.. SEEALSO::
`Wikipedia: CAP theorem`_
.. _ddl-replication-underreplication:
Underreplication
================
Having more replicas per primary and distributing shards as thinly as possible
(i.e., fewer shards per node) can both increase chances of a :ref:`successful
recovery ` in the event of node loss.
A single node can hold multiple shards belonging to the same table. For
example, suppose a table has more shards (primaries and replicas) than nodes
available in the cluster. In that case, CrateDB will determine the best way to
allocate shards to the nodes available.
However, there is never a benefit to allocating multiple copies of the same
shard to a single node (e.g., the primary and a replica of the same shard or
two replicas of the same shard).
For example:
.. rst-class:: open
- Suppose a single node held the primary and a replica of the same
shard. If that node were lost, CrateDB would be unable to use either copy of
the shard for :ref:`recovery ` (because both were
lost), effectively making the replica useless.
- Suppose a single node held two replicas of the same shard. If the primary
shard were lost (on a different node), CrateDB would only need one of the
replica shards on this node to promote a new primary, effectively making the
second replica useless.
In both cases, the second copy of the shard serves no purpose.
For this reason, CrateDB will never allocate multiple copies of the same shard
to a single node.
The above rule means that for *one* primary shard and *n* replicas, a cluster
must have at least *n + 1* available nodes for CrateDB to fully replicate all
shards. When CrateDB cannot fully replicate all shards, the table enters a
state known as *underreplication*.
CrateDB gives underreplicated tables a *yellow* :ref:`health status
`.
.. TIP::
The `CrateDB Admin UI`_ provides visual indicators of cluster health that
take replication status into account.
Alternatively, you can query health information directly from the
:ref:`sys.health ` table and replication information from the
:ref:`sys.shards ` and :ref:`sys.allocations `
tables.
.. _availability: https://en.wikipedia.org/wiki/Availability
.. _CrateDB Admin UI: https://cratedb.com/docs/crate/admin-ui/en/latest/
.. _data redundancy: https://en.wikipedia.org/wiki/Data_redundancy
.. _parallelize: https://en.wikipedia.org/wiki/Distributed_computing
.. _partition tolerance: https://en.wikipedia.org/wiki/Network_partitioning
.. _Wikipedia\: CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem

.. _ddl-views:
=====
Views
=====
.. _views-create:
Creating views
==============
Views are stored named queries which can be used in place of table names.
They're resolved at runtime and can be used to simplify common queries.
Views are created using the :ref:`CREATE VIEW statement `.
For example, a common use case is to create a view which queries a table with
a pre-defined filter::
cr> CREATE VIEW big_mountains AS
... SELECT * FROM sys.summits WHERE height > 2000;
CREATE OK, 1 row affected (... sec)
.. _views-query:
Querying views
==============
Once created, views can be used instead of a table in a statement::
cr> SELECT mountain, height FROM big_mountains ORDER BY 1 LIMIT 3;
+--------------+--------+
| mountain | height |
+--------------+--------+
| Acherkogel | 3008 |
| Ackerlspitze | 2329 |
| Adamello | 3539 |
+--------------+--------+
SELECT 3 rows in set (... sec)
.. _views-privileges:
Privileges
----------
In order to be able to query data from a view, a user needs to have ``DQL``
privileges on the view. ``DQL`` privileges can be granted at the cluster
level, on the schema in which the view is contained, or on the view itself.
Privileges on relations accessed by the view are not necessary.
However, it is required at all times that the *owner* (the user who created
the view) has ``DQL`` privileges on all relations occurring within the view's
query definition.
A common use case for this is to give users access to a subset of a table
without exposing the table itself. If the owner loses ``DQL`` permissions on
the underlying relations, a user who has access to the view will no longer be
able to query it.
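For example, granting read access on the view created above might look like
this (an illustrative sketch, assuming a user named ``alice`` already
exists)::

    cr> GRANT DQL ON VIEW doc.big_mountains TO alice;
    GRANT OK, 1 row affected (... sec)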
.. SEEALSO::
:ref:`Administration: Privileges `
.. _views-drop:
Dropping views
==============
Views can be dropped using the :ref:`DROP VIEW statement `::
cr> DROP VIEW big_mountains;
DROP OK, 1 row affected (... sec)

(model-primary-key)=
(autogenerated-sequences)=
# Primary key strategies and autogenerated sequences
:::{rubric} Introduction
:::
As you begin working with CrateDB, you might be puzzled why CrateDB does not
have a built-in, auto-incrementing "serial" data type, like PostgreSQL or MySQL.
This page explains why that is and walks you through **five common alternatives**
to generate unique primary key values in CrateDB, including a recipe to implement
your own auto-incrementing sequence mechanism when needed.
:::{rubric} Why auto-increment sequences don't exist in CrateDB
:::
In traditional RDBMS systems, auto-increment fields rely on a central counter.
In a distributed system like CrateDB, maintaining a global auto-increment value
would require that a node checks with other nodes before allocating a new value.
This would create a **global coordination bottleneck**, limit insert throughput,
and reduce scalability.
CrateDB is designed for horizontal scalability and [high ingestion throughput].
To achieve this, operations must complete independently on each node—without
central coordination. This design choice means CrateDB does **not** support
traditional auto-incrementing primary key types like `SERIAL` in PostgreSQL
or MySQL.
:::{rubric} Solutions
:::
CrateDB provides flexibility: You can choose a primary key strategy
tailored to your use case, whether for strict uniqueness, time ordering, or
external system integration. You can also implement true consistent/synchronized
sequences if you want to.
## Using a timestamp as a primary key
This option involves declaring a column using `DEFAULT now()`.
```psql
CREATE TABLE example (
id BIGINT DEFAULT now() PRIMARY KEY
);
```
:Pros:
- Auto-generated, always-increasing value
- Useful when records are timestamped anyway
:Cons:
- Can result in gaps
- Collisions possible if multiple records are created in the same millisecond
## Using elasticflake identifiers
This option involves declaring a column using `DEFAULT gen_random_text_uuid()`.
```psql
CREATE TABLE example2 (
id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY
);
```
:Pros:
- Universally unique
- No conflicts when merging from multiple environments or sources
:Cons:
- Not ordered
- Harder to read/debug
- No efficient range queries
## Using UUIDv7 identifiers
[UUIDv7] is a new format that preserves **temporal ordering**, making UUIDs
better suited for inserts and range queries in distributed databases.
You can use [UUIDv7 for CrateDB] via a {ref}`User-Defined Function (UDF) `
in JavaScript, or use a [UUIDv7 library] in your application layer.
:Pros:
- Globally unique and **almost sequential**
- Efficient range queries possible
:Cons:
- Not as human-friendly as integer numbers
- Slight overhead due to UDF use
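As a minimal sketch, assuming you have registered the linked [UUIDv7 for
CrateDB] implementation as a UDF named `uuidv7()` (a hypothetical name),
identifiers can be generated at insert time:

```psql
CREATE TABLE example3 (
    id TEXT PRIMARY KEY
);

-- uuidv7() is the assumed name of the UDF created beforehand.
INSERT INTO example3 (id) VALUES (uuidv7());
```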
## Using IDs from external systems
If you are importing data from a source system that **already generates unique
IDs**, you can reuse those by inserting primary key values as-is from the
source system.
In this case, CrateDB does not need to generate any identifier values,
and consistency is ensured across systems.
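A minimal sketch, with hypothetical table and column names:

```psql
-- The primary key values arrive pre-generated from the source system.
CREATE TABLE imported_orders (
    source_order_id BIGINT PRIMARY KEY,
    payload TEXT
);

INSERT INTO imported_orders (source_order_id, payload)
VALUES (1000042, 'as delivered by the source system');
```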
:::{seealso}
An example for that is [Replicating data from other databases to CrateDB with Debezium and Kafka].
:::
## Implementing a custom sequence table
If you **must** have an auto-incrementing numeric ID (e.g., for compatibility
or legacy reasons), you can implement a simple sequence generator using a
dedicated table and client-side logic.
This approach involves a table that keeps the latest values that have been
consumed, and client-side code that keeps it up to date in a way that
guarantees unique values even when many ingestion processes run in parallel.
:Pros:
- Fully customizable (you can add prefixes, adjust increment size, etc.)
- Sequential IDs possible
:Cons:
- Additional client logic about optimistic updates is required for writing
- The sequence table may become a bottleneck at very high ingestion rates
### Step 1: Create a sequence tracking table
Create a table to keep the latest values for the sequences.
```psql
CREATE TABLE sequences (
name TEXT PRIMARY KEY,
last_value BIGINT
) CLUSTERED INTO 1 SHARDS;
```
### Step 2: Initialize your sequence
Initialize the table with one new sequence at 0.
```psql
INSERT INTO sequences (name, last_value)
VALUES ('mysequence', 0);
```
### Step 3: Create a target table
Start an example with a newly defined table.
```psql
CREATE TABLE mytable (
id BIGINT PRIMARY KEY,
field1 TEXT
);
```
### Step 4: Generate and use sequence values in Python
Use optimistic concurrency control to generate unique, incrementing values
even in parallel ingestion scenarios.
The Python code below reads the last value used from the `sequences` table and
then attempts an [optimistic UPDATE] with a `RETURNING` clause. If a contending
process already consumed the identity, nothing is returned, so our process
retries until a value is returned. It then uses that value as the new ID for
the record we insert into the `mytable` table.
```python
# Requires: records, sqlalchemy-cratedb
#
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "records",
#     "sqlalchemy-cratedb",
# ]
# ///
import time

import records

db = records.Database("crate://")

sequence_name = "mysequence"
max_retries = 5
base_delay = 0.1  # 100 milliseconds

for attempt in range(max_retries):
    # Read the current sequence value together with the row's `_seq_no`
    # and `_primary_term` system columns.
    select_query = """
        SELECT last_value, _seq_no, _primary_term
        FROM sequences
        WHERE name = :sequence_name;
    """
    row = db.query(select_query, sequence_name=sequence_name).first()
    new_value = row.last_value + 1

    # Attempt an optimistic UPDATE: it only succeeds (and returns a row)
    # if no other process consumed the value in the meantime.
    update_query = """
        UPDATE sequences
        SET last_value = :new_value
        WHERE name = :sequence_name
          AND _seq_no = :seq_no
          AND _primary_term = :primary_term
        RETURNING last_value;
    """
    if (
        str(
            db.query(
                update_query,
                new_value=new_value,
                sequence_name=sequence_name,
                seq_no=row._seq_no,
                primary_term=row._primary_term,
            ).all()
        )
        != "[]"
    ):
        break

    # Another process won the race: back off exponentially and retry.
    delay = base_delay * (2**attempt)
    print(f"Attempt {attempt + 1} failed. Retrying in {delay:.1f} seconds...")
    time.sleep(delay)
else:
    raise Exception(f"Failed after {max_retries} retries with exponential backoff")

# Use the reserved value as the new record's primary key.
insert_query = "INSERT INTO mytable (id, field1) VALUES (:id, :field1)"
db.query(insert_query, id=new_value, field1="abc")

db.close()
```
## Summary
| Strategy | Ordered | Unique | Scalable | Human-friendly | Range queries | Notes |
|---------------------|----------| ------ | -------- |----------------|---------------| -------------------- |
| Timestamp | ✅ | ⚠️ | ✅ | ✅ | ✅ | Potential collisions |
| Elasticflake | ❌ | ✅ | ✅ | ❌ | ❌ | Default UUIDs |
| UUIDv7 | ✅ | ✅ | ✅ | ❌ | ✅ | Requires UDF |
| External system IDs | ✅/❌ | ✅ | ✅ | ✅ | ✅ | Depends on source |
| Sequence table | ✅ | ✅ | ⚠️ | ✅ | ✅ | Manual retry logic |
[high ingestion throughput]: https://cratedb.com/blog/how-we-scaled-ingestion-to-one-million-rows-per-second
[optimistic update]: https://cratedb.com/docs/crate/reference/en/latest/general/occ.html#optimistic-update
[replicating data from other databases to cratedb with debezium and kafka]: https://cratedb.com/blog/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka
[udf]: https://cratedb.com/docs/crate/reference/en/latest/general/user-defined-functions.html
[UUIDv7]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-version-7
[UUIDv7 for CrateDB]: https://github.com/nalgeon/uuidv7/blob/main/src/uuidv7.cratedb
[UUIDv7 library]: https://github.com/nalgeon/uuidv7

.. highlight:: psql
.. _sql_occ:
==============================
Optimistic Concurrency Control
==============================
Introduction
============
Even though CrateDB does not support transactions, `Optimistic Concurrency
Control`_ can be achieved by using the internal system columns
:ref:`_seq_no ` and
:ref:`_primary_term `.
Every new primary shard row has an initial sequence number of ``0``. This
value is increased by ``1`` on every insert, delete or update operation the
primary shard executes. The primary term is incremented when a shard is
promoted to primary, so users can tell whether they are executing an update
against the most up-to-date cluster configuration.
.. Hidden: create a test table and insert some documents::
cr> CREATE TABLE sensors (
... id text primary key,
... type text,
... last_verification timestamp
... );
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO sensors (id, type, last_verification) VALUES ('ID1', 'DHT11', null);
INSERT OK, 1 row affected (... sec)
cr> INSERT INTO sensors (id, type, last_verification) VALUES ('ID2', 'DHT21', null);
INSERT OK, 1 row affected (... sec)
cr> refresh table sensors;
REFRESH OK, 1 row affected (... sec)
It's possible to fetch the ``_seq_no`` and ``_primary_term`` by selecting
them::
cr> SELECT id, type, _seq_no, _primary_term FROM sensors ORDER BY 1;
+-----+-------+---------+---------------+
| id | type | _seq_no | _primary_term |
+-----+-------+---------+---------------+
| ID1 | DHT11 | 0 | 1 |
| ID2 | DHT21 | 0 | 1 |
+-----+-------+---------+---------------+
SELECT 2 rows in set (... sec)
These ``_seq_no`` and ``_primary_term`` values can now be used on updates
and deletes.
.. NOTE::
Optimistic concurrency control only works using the ``=`` :ref:`operator
`, checking for the exact ``_seq_no`` and ``_primary_term``
your update or delete is based on.
Optimistic update
=================
Querying for the correct ``_seq_no`` and ``_primary_term`` ensures that no
concurrent update or cluster configuration change has taken place::
cr> UPDATE sensors SET last_verification = '2020-01-10 09:40'
... WHERE
... id = 'ID1'
... AND "_seq_no" = 0
... AND "_primary_term" = 1;
UPDATE OK, 1 row affected (... sec)
Updating a row with a wrong or outdated sequence number or primary term will
not execute the update and results in 0 affected rows::
cr> UPDATE sensors SET last_verification = '2020-01-10 09:40'
... WHERE
... id = 'ID1'
... AND "_seq_no" = 42
... AND "_primary_term" = 5;
UPDATE OK, 0 rows affected (... sec)
Optimistic delete
=================
The same can be done when deleting a row::
cr> DELETE FROM sensors WHERE id = 'ID2'
... AND "_seq_no" = 0
... AND "_primary_term" = 1;
DELETE OK, 1 row affected (... sec)
Known limitations
=================
- The ``_seq_no`` and ``_primary_term`` columns can only be used when
specifying the whole primary key in a query. For example, the query below is
not possible with the database schema used for testing, because ``type`` is
not declared as a primary key::
cr> DELETE FROM sensors WHERE type = 'DHT11'
... AND "_seq_no" = 3
... AND "_primary_term" = 1;
UnsupportedFeatureException["_seq_no" and "_primary_term" columns can only be used
together in the WHERE clause with equals comparisons and if there are also equals
comparisons on primary key columns]
- In order to use the optimistic concurrency control mechanism, both the
``_seq_no`` and ``_primary_term`` columns need to be specified. It is not
possible to only specify one of them. For example, the query below will
result in an error::
cr> DELETE FROM sensors WHERE id = 'ID1' AND "_seq_no" = 3;
VersioningValidationException["_seq_no" and "_primary_term" columns can only be used
together in the WHERE clause with equals comparisons and if there are also equals
comparisons on primary key columns]
- There is an exception to this behaviour when the ``WHERE`` clause contains
complex filtering and/or lots of primary key values. You can find more details
:ref:`here `.
.. NOTE::
Both ``DELETE`` and ``UPDATE`` commands will return a row count of ``0``, if
the given required version does not match the actual version of the relevant
row.
.. _Optimistic Concurrency Control: https://en.wikipedia.org/wiki/Optimistic_concurrency_control

(sharding-guide)=
(sharding-performance)=
# Sharding recommendations
:::{div} sd-text-muted
Applying sharding can drastically improve the performance on large datasets.
:::
This document is a sharding best practice guide for CrateDB.
A brief recap: CrateDB tables are split into a configured number of shards.
These shards are distributed across the cluster to optimize concurrent and
parallel data processing.
Whenever possible, CrateDB will parallelize query workloads and distribute them
across the whole cluster. The more CPUs this query workload can be distributed
across, the faster the query will run.
:::{seealso}
This guide assumes you know the basics.
If you are looking for an intro to sharding, see also the
{ref}`sharding-partitioning` and the
{ref}`sharding reference ` documentation.
:::
## General recommendations
To avoid running your clusters with too many shards or too large shards,
implement the following guidelines as a rule of thumb:
- Use shard sizes between 5 GB and 50 GB.
- Keep the number of records on each shard below 200 million.
Finding the right balance when it comes to sharding depends on many factors.
While it is generally advisable to slightly over-allocate, we recommend
benchmarking your particular setup to find the sweet spot for an appropriate
sharding strategy.
Figuring out how many shards to use for your tables requires you to think about
the type of data you are processing, the types of queries you are running, and
the type of hardware you are using.
- Too many shards can degrade search performance and make the cluster unstable.
This is referred to as _oversharding_.
- Very large shards can slow down cluster operations and prolong recovery times
after failures.
## Sizing considerations
Applying these general principles requires careful consideration of cluster
sizing and architecture.
Keep the following things in mind when building your sharding strategy.
Each shard incurs overhead in terms of open files, RAM allocation, and CPU cycles
for maintenance operations.
### Shard size vs. number of shards
The optimal approach balances shard count with shard size. Individual shards
should typically contain 5-50 GB of data, which is the sweet spot for most
workloads. In large clusters, this often means fewer shards than total CPU
cores, as larger shards can still be processed efficiently by multiple CPU
cores during query execution.
### Shard-per-CPU ratio
If most nodes have more shards per table than they have CPUs, the cluster can
experience performance degradations.
For example, on clusters with substantial CPU resources (e.g., 8 nodes × 32 CPUs
= 256 total CPUs), creating 256+ shards per table often proves counterproductive.
If you don't manually set the number of shards per table, CrateDB will make a
best guess, based on the assumption that your nodes have two CPUs each.
The general advice is to calculate with 1 shard per CPU as a starting point.
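For example, on a hypothetical cluster of 8 nodes with 4 CPUs each (32 CPUs in
total), that starting point would look like this:

```sql
-- 8 nodes x 4 CPUs = 32 CPUs, so start with 32 shards.
-- Table and columns are illustrative.
CREATE TABLE sensor_readings (
    ts TIMESTAMP WITH TIME ZONE,
    sensor_id TEXT,
    reading DOUBLE PRECISION
) CLUSTERED INTO 32 SHARDS;
```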
### 1000 shards per node limit
To avoid _oversharding_, CrateDB by default limits the number of shards per
node to 1,000. Any operation that would exceed that limit leads to an
exception.
For an 8-node cluster, this allows up to 8,000 total shards across all tables.
Approaching this limit typically indicates a suboptimal sharding strategy
rather than optimal performance tuning. See also the relevant documentation
about {ref}`table reconfiguration ` with regard to sharding options.
### Partitions
If you are using {ref}`partitioned tables `,
note that each partition is clustered into as many shards as you configure
for the table.
For example, a table with four shards and two partitions will have eight
shards in total that a query spanning both partitions can use. But a query
that only touches one partition will only query across four shards.
How this factors into balancing your shard allocation will depend on the
types of queries you intend to run.
### Replicas
CrateDB uses replicas for both data durability and query performance. When a
node goes down, replicas ensure no data is lost. For read operations, CrateDB
randomly distributes queries across both primary and replica shards, improving
concurrent read throughput.
Each replica adds to the total shard count in the cluster. By default, CrateDB
uses the replica setting `0-1` on newly created tables, resulting in up to
twice the number of configured shards. The more replicas you add, the higher
the multiplier (x3, x4, etc.) for capacity planning.
See the {ref}`replication reference `
documentation for more details.
### Segments
The number of segments within a shard affects query performance because more
segments have to be visited.
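For tables or partitions that are no longer written to, segments can be merged
to reduce that overhead; a sketch using the `OPTIMIZE TABLE` statement (table
name is hypothetical):

```sql
-- Merge the segments of each shard down to a single segment.
OPTIMIZE TABLE sensor_readings WITH (max_num_segments = 1);
```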
## Notes
:::{caution}
:class: hero
Balancing the number and size of your shards is important for the performance
and stability of your CrateDB clusters.
:::
(sharding-under-allocation)=
### Avoid under-allocation
:::{CAUTION}
If you have fewer shards than CPUs in the cluster, this is called
*under-allocation*, and it means you're not getting the best performance out
of CrateDB.
:::
To increase the chances that a query can be parallelized and distributed
maximally, there should be at least as many shards for a table as there are
CPUs in the cluster. This is because CrateDB will automatically balance shards
across the cluster so that each node contains as few shards as possible.
In summary: the smaller your shards are, the more of them you will have, and so
the more likely it is that they will be distributed across the whole cluster,
and hence across all of your CPUs, and hence the faster your queries will run.
(sharding-over-allocation)=
### Avoid extensive over-allocation
:::{CAUTION}
If you have more shards per table than CPUs, this is called *over-allocation*. A
little over-allocation is desirable. But if you significantly over-allocate
your shards per table, you will see performance degradation.
:::
When you have slightly more shards per table than CPUs, you ensure that query
workloads can be parallelized and distributed maximally, which in turn ensures
maximal query performance.
(sharding-ingestion)=
### Optimize for ingestion
When doing heavy ingestion, it is good to cluster a table across as many nodes
as possible. However, [we have found] that ingestion throughput can often
increase as the table's shard-per-CPU ratio on each node *decreases*.
Ingestion throughput typically varies with: data volume, individual payload
sizes, batch insert size, and the hardware. In particular: using solid-state
drives (SSDs) instead of hard-disk drives (HDDs) can massively increase
ingestion throughput.
We recommend benchmarking your particular ingest workload to find the sweet
spot.
[we have found]: https://cratedb.com/blog/big-cluster-insights-ingesting

(performance-optimization)=
# Query Optimization 101
:::{div} sd-text-muted
Essential principles for optimizing queries in CrateDB...
:::
...while avoiding the most common pitfalls. The patterns are relevant to both the
troubleshooting of slow queries and the proactive tuning of CrateDB deployments,
and they show how small adjustments to filters, data transformations, and
schemas can yield dramatic improvements in execution speed and resource
utilization.
(group-early-filtering)=
## Early Filtering and Data Reduction
:::{div} sd-text-muted
Minimize processed data early in queries to reduce overhead.
:::
(filtering-early)=
### Do all filtering as soon as possible
Sometimes it may be tempting to define some VIEWs and CTEs, do some JOINs, and
only filter results at the end, but in that case the optimizer may lose track
of how the fields we are filtering on relate to the indexes on the actual
tables.
Whenever there is an opportunity to filter data immediately next to the `FROM`
clause, try to narrow down results as early as possible.
See [using common table expressions to speed up queries] for an example.
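As a minimal sketch with hypothetical tables, push the filter into the
innermost subquery instead of applying it only to the joined result:

```sql
-- Narrow down device_data right next to its FROM clause ...
WITH recent_readings AS (
    SELECT factory_id, reading_value
    FROM device_data
    WHERE reading_time > now() - INTERVAL '1 day'
)
-- ... and only then JOIN the reduced result.
SELECT m.factory_name, r.reading_value
FROM recent_readings r
INNER JOIN factory_metadata m ON r.factory_id = m.factory_id;
```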
(select-star)=
### Avoid `SELECT *`
CrateDB is a columnar database. The fewer columns you specify in a `SELECT`
clause, the less data CrateDB needs to read from disk.
```sql
-- Avoid selecting all columns
SELECT *
FROM customers;
-- Instead, select explicitly the subset of columns you need
SELECT customerid, country
FROM customers;
```
(minimise-result-sets)=
### Avoid large result sets
Be aware of the number of rows you are returning in a `SELECT` query.
Analytical databases, such as CrateDB, excel at processing large data sets and
returning small to medium-sized result sets. Serializing large result sets,
transporting them over the network, and deserializing them is expensive. When
dealing with large result sets in the range of several hundred thousand
records, consider whether your application needs the whole result set at once.
Use [cursors] or `LIMIT`/`OFFSET` to fetch data in batches.
See also [Fetching large result sets from CrateDB] for examples.
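A sketch of batched fetching with a cursor, using a hypothetical table; see
the [cursors] documentation for the full syntax:

```sql
-- Declare a cursor over the full result set.
DECLARE customer_cursor CURSOR WITH HOLD FOR
    SELECT customerid, country FROM customers;

-- Fetch in batches; repeat until no more rows are returned.
FETCH 1000 FROM customer_cursor;

-- Release the cursor when done.
CLOSE customer_cursor;
```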
(propagate-limit)=
### Propagate LIMIT clauses when applicable
Similarly to the above, we may have a `LIMIT 10` at the end of the query,
while it would have been sufficient to pull only 10 records (or some other
number of records) from a given table at an earlier stage. If that is the
case, duplicate or move (depending on the specific query) the `LIMIT` clause
to the relevant place.
In some cases, we may not know how many rows we need in the intermediate
working sets, but we do know that, for instance, the last day of data alone
will contain the 10 records we need. Filtering early helps the optimizer and
can protect the database from accidentally processing years of data. Without
early filtering, the load on your cluster will increase tremendously.
So for instance instead of:
```sql
SELECT
factory_metadata.factory_name,
device_data.device_name,
device_data.reading_value
FROM device_data
INNER JOIN factory_metadata ON device_data.factory_id = factory_metadata.factory_id
WHERE reading_time BETWEEN '2024-01-01' AND '2025-01-01'
LIMIT 10;
```
do:
```sql
WITH filtered_device_data AS (
SELECT
device_data.factory_id,
device_data.device_name,
device_data.reading_value
FROM device_data
WHERE
/*
We are sure one month of data is sufficient to find
10 results and it may help with partition pruning
*/
reading_time BETWEEN '2024-12-01' AND '2025-01-01'
LIMIT 10
)
SELECT
factory_metadata.factory_name,
filtered_device_data.device_name,
filtered_device_data.reading_value
FROM filtered_device_data
INNER JOIN factory_metadata ON filtered_device_data.factory_id = factory_metadata.factory_id;
```
(filter-with-array-expressions)=
### Use filters with array expressions when filtering on the output of UNNEST
On denormalized data sets, you may observe records including columns
storing arrays of objects.
You may want to unnest the array in a subquery or CTE and later filter on a
property of the OBJECTs.
The next statement (in versions of CrateDB < 6.0.0) will result in every row
in the table (not filtered with other conditions) being read and unnested,
to check if it meets the criteria on `field1`.
```sql
SELECT *
FROM (
SELECT UNNEST(my_array_of_objects) obj
FROM my_table
)
WHERE obj['field1'] = 1;
```
However, CrateDB can do a lot better than this if we add an additional condition
like this:
```sql
SELECT *
FROM (
SELECT UNNEST(my_array_of_objects) obj
FROM my_table
WHERE 1 = ANY(my_array_of_objects['field1'])
) AS subquery
WHERE obj['field1'] = 1;
```
CrateDB leverages indexes to only unnest the relevant records from `my_table`,
which can make a huge difference.
(group-efficient-query-structure)=
## Efficient Query Structure and Constructs
:::{div} sd-text-muted
Optimize SQL logic by prioritizing efficient syntax
and avoid redundant operations.
:::
(only-sort-when-needed)=
### Only sort data when needed
Indexing in CrateDB is optimized to support filtering and aggregations without
requiring expensive defragmentation operations, but it is not optimized for
sorting.
Maintaining a sorted index would slow down ingestion, which is why other
analytical database systems like Cassandra and Redshift make similar
trade-offs.
This means that when an `ORDER BY` operation is requested, the whole dataset
needs to be loaded into main memory on the relevant cluster node to be sorted.
For this reason, it is important not to request `ORDER BY` operations when
they are not actually needed, and most importantly, not on tables of large
cardinality without aggregating records beforehand. It is of course no problem
to sort a few thousand rows in the final stage of a `SELECT` operation, but we
need to avoid requesting sort operations over millions of rows.
Consider leveraging filters and aggregations like `max_by` and `min_by` to
limit the scope of `ORDER BY` operations, or avoid them altogether when
possible.
So for instance instead of:
```sql
SELECT reading_time, reading_value
FROM device_data
WHERE reading_time BETWEEN '2024-01-01' AND '2025-01-01'
ORDER BY reading_time DESC
LIMIT 10;
```
use:
```sql
SELECT reading_time, reading_value
FROM device_data
WHERE reading_time BETWEEN '2024-12-20' AND '2025-01-01'
ORDER BY reading_time DESC
LIMIT 10;
```
(format-as-last-step)=
### Format output as a last step
In many cases, data may be stored in an efficient format, but we want to
transform it to make it more human-readable in the output of the query. To
accommodate such situations, we may use [scalar functions] such as
`date_format` or `timezone`.
Sometimes queries apply these transformations in an intermediate step and later
do further operations like filtering on the transformed values.
CrateDB's query optimizer attempts to determine the most efficient way to
execute a given query by considering the possible query plans. Depending on
the query scenario, it always aims to use existing indexes on the original
data for maximum efficiency.
However, there is always a chance that some particular clause in the query
expression prevents the optimizer from selecting an optimal plan, which ends
up applying the transformation to thousands or millions of records that would
later be discarded anyway. So, whenever it makes sense, we want to ensure
these transformations are only applied after the database has already worked
out the final result set to be sent back to the client.
So instead of:
```sql
WITH mydata AS (
SELECT
DATE_FORMAT(device_data.reading_time) AS formatted_reading_time,
device_data.reading_value
FROM device_data
)
SELECT *
FROM mydata
WHERE formatted_reading_time LIKE '2025%';
```
use:
```sql
SELECT
DATE_FORMAT(device_data.reading_time) AS formatted_reading_time,
device_data.reading_value
FROM device_data
WHERE device_data.reading_time BETWEEN '2025-01-01' AND '2026-01-01'
```
(replace-case)=
### Replace CASE in expressions used for filtering, JOINs, grouping, etc.
It is not always obvious to the optimizer what we may be trying to do with a
`CASE` expression (see for instance [Shortcut CASE evaluation Issue 16022]).
If you are using a CASE expression for “formatting”, see the previous point
about formatting output as late as possible; but if you are using a CASE
expression as part of a filter or other operation, consider replacing it with
an equivalent expression, for instance:
```sql
SELECT SUM(a) AS count_greater_than_10, ...
FROM (
    SELECT CASE WHEN field1 > 10 THEN 1 ELSE 0 END AS a
         , ...
    FROM mytable
    ...
) subquery
...;
```
can be rewritten as
```sql
SELECT COUNT(field1) FILTER (WHERE field1 > 10) AS count_greater_than_10
FROM mytable;
```
And
```postgresql
SELECT *
FROM mytable
WHERE
CASE
WHEN $1 = 'ALL COUNTRIES' THEN true
WHEN $1 = mytable.country AND $2 = 'ALL CITIES' THEN true
ELSE $1 = mytable.country AND $2 = mytable.city
END;
```
can be rewritten as
```postgresql
SELECT *
FROM mytable
WHERE ($1 = 'ALL COUNTRIES')
OR ($1 = mytable.country AND $2 = 'ALL CITIES')
OR ($1 = mytable.country AND $2 = mytable.city);
```
(the exact replacement expressions of course depend on the semantics of each
case)
(groups-instead-distinct)=
### Use groupings instead of DISTINCT
(Reference: [Issue 13818])
Instead of
```sql
SELECT DISTINCT country FROM customers;
```
use
```sql
SELECT country FROM customers GROUP BY country;
```
and instead of
```sql
SELECT COUNT(DISTINCT a) FROM t;
```
use
```sql
SELECT COUNT(a)
FROM (
SELECT a
FROM t
GROUP BY a
) tmp;
```
(subqueries-instead-groups)=
### Use subqueries instead of GROUP BY if the groups are already known
Consider the following query:
```sql
SELECT customerid, SUM(order_amount) AS total
FROM customer_orders
GROUP BY customerid;
```
This looks simple, but to execute it CrateDB needs to keep the full result set
in memory for all groups.
If we already know what the groups will be we can use correlated subqueries
instead:
```sql
SELECT customerid,
(SELECT SUM(order_amount)
FROM customer_orders
WHERE customer_orders.customerid = customers.customerid
) AS total
FROM customers;
```
(group-large-and-complex-queries)=
## Handling Large and Complex Queries
:::{div} sd-text-muted
Strategies for breaking down complex operations on large
datasets into manageable steps.
:::
(batch-operations)=
### Batch operations
If you need to perform lots of UPDATEs or expensive INSERTs from SELECT,
consider exploring different settings for the [overload protection] or
[thread pool sizing], which can be used to fine-tune the performance of these
operations.
Otherwise, if you only need to run it once and performance is not critical,
consider using small batches instead, where the operations are done on groups of
records each time.
So for instance instead of doing:
```sql
UPDATE mytable SET field1 = field1 + 1;
```
consider a different approach such as:
```shell
for id in {1..100}; do
crash -c "UPDATE mytable SET field1 = field1 + 1 WHERE customer_id = $id;"
done
```
(pagination-filters)=
### Paginate on filters instead of results
For instance instead of
```sql
SELECT deviceid, AVG(field1)
FROM device_data
GROUP BY deviceid
LIMIT 1000 OFFSET 5000;
```
We can do something like
```sql
WITH devices AS (
SELECT deviceid
FROM devices
LIMIT 5 OFFSET 25
)
SELECT deviceid, AVG(field1)
FROM device_data
WHERE device_data.deviceid IN (SELECT devices.deviceid FROM devices)
GROUP BY deviceid;
```
(staging-tables)=
### Use staging tables for intermediate results if you are doing a lot of JOINs
If you have many CTEs or VIEWs that need to be JOINed, it can be beneficial to
query them individually, store intermediate results into dedicated tables, and
then use these tables for the JOINs.
While there is a cost in writing to disk and reading data back, the whole
operation can benefit from indexing and from giving the optimizer more
straightforward execution plans, enabling it to optimize for better parallel
execution across multiple cluster nodes.
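A minimal sketch with hypothetical relations, materializing one intermediate
result with `CREATE TABLE ... AS`:

```sql
-- Materialize the intermediate result once ...
CREATE TABLE staging_recent_orders AS (
    SELECT customerid, order_amount
    FROM customer_orders
    WHERE order_date >= '2025-01-01'
);

-- ... then JOIN against the staging table instead of the original CTE.
SELECT c.country, SUM(s.order_amount) AS total
FROM staging_recent_orders s
INNER JOIN customers c ON c.customerid = s.customerid
GROUP BY c.country;
```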
(group-schema-and-function-optimization)=
## Schema and Function Optimization
:::{div} sd-text-muted
Schema design and function usage to streamline performance.
:::
(consider-generated-columns)=
### Consider generated columns
If you frequently find yourself extracting information from fields and then
using this extracted data in filters or aggregations, consider doing this
operation at ingestion time with a [generated column]. In this way, the value
we need for filtering and aggregations can be indexed. This involves a
trade-off between storage space and query performance: weigh the frequency and
execution times of these queries against the additional storage required for
the generated value.
See [Using regex comparisons and other features for inspection of logs] for an
example.
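A minimal sketch with a hypothetical log table, assuming the HTTP status code
is the ninth space-separated field of the raw log line:

```sql
CREATE TABLE weblogs (
    raw TEXT,
    -- Computed once on ingestion, so filters on status_code can use the index.
    status_code INTEGER GENERATED ALWAYS AS split_part(raw, ' ', 9)::INTEGER
);
```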
(udf-right-context)=
### Be mindful of UDFs, leverage them in the right contexts, but only in the right contexts
When using user-defined functions (UDFs), two important details relevant for
performance aspects need to be considered.
1. Once values are processed by a UDF, the database engine loads the results
into memory and can no longer leverage indexes on the underlying fields. In
this spirit, please apply the relevant general considerations about delaying
formatting as much as possible.
2. UDFs run on a JavaScript virtual machine on a single thread, so they can
impact performance because the relevant operations cannot be parallelized.
However, some operations may be more straightforward to do in JavaScript than
in SQL.
(group-filter-expression-optimizations)=
## Filter and Expression Optimization
:::{div} sd-text-muted
Expressions that improve filter efficiency
and processing of specific data structures.
:::
(positive-filters)=
### Avoid expression negation in filters
Positive filter expressions can directly leverage indexing. With negative
expressions, the optimizer may still be able to use indexes, but this does not
always happen, and the optimizer might not rewrite the query optimally.
Explicitly using positive conditions removes ambiguity and ensures the most
efficient path is chosen.
So instead of:
```sql
SELECT
customerid,
status
FROM customers_table
WHERE NOT (customerid <= 2) AND NOT (status = 'inactive');
```
We can rewrite this as (assuming `status` only takes the values `'active'`
and `'inactive'`):
```sql
SELECT
customerid,
status
FROM customers_table
WHERE customerid > 2 AND status = 'active';
```
(use-null-or-empty)=
### Use the special null_or_empty function with OBJECTs and ARRAYs when relevant
CrateDB has a special scalar function called [null_or_empty]. Using it in
filter conditions against OBJECTs and ARRAYs is much faster than using an
`IS NULL` clause, if treating empty objects and arrays like NULL is
acceptable.
So instead of:
```sql
SELECT ...
FROM mytable
WHERE array_column IS NULL OR array_column = [];
```
We can rewrite this as:
```sql
SELECT ...
FROM mytable
WHERE null_or_empty(array_column);
```
(group-performance-analysis)=
## Performance Analysis and Execution Plans
(execution-plans)=
### Review execution plans
If a query is slow but still completes within a reasonable amount of time, we
can use [EXPLAIN ANALYZE] to get a detailed execution plan. The main things to
watch for in these plans are `MatchAllDocsQuery` and `GenericFunctionQuery`.
These operations are full table scans, so review whether that is expected in
your query (you may intentionally be pulling all records from a table with a
list of factory sites, for instance) or whether a filter is not being pushed
down properly.
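For example, reusing the hypothetical `device_data` table from earlier (the `deviceid = 42` literal is illustrative):
```sql
EXPLAIN ANALYZE
SELECT deviceid, AVG(field1)
FROM device_data
WHERE deviceid = 42
GROUP BY deviceid;
```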
[cursors]: https://cratedb.com/docs/crate/reference/en/latest/sql/statements/declare.html
[explain analyze]: https://cratedb.com/docs/crate/reference/en/latest/sql/statements/explain.html
[fetching large result sets from cratedb]: https://community.cratedb.com/t/fetching-large-result-sets-from-cratedb/1270
[generated column]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/generated-columns.html
[issue 13818]: https://github.com/crate/crate/issues/13818
[null_or_empty]: https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#null-or-empty-object
[overload protection]: https://cratedb.com/docs/crate/reference/en/latest/config/cluster.html#overload-protection
[scalar functions]: https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html
[shortcut case evaluation issue 16022]: https://github.com/crate/crate/issues/16022
[thread pool sizing]: https://cratedb.com/docs/crate/reference/en/latest/config/cluster.html#thread-pools
[using common table expressions to speed up queries]: https://community.cratedb.com/t/using-common-table-expressions-to-speed-up-queries/1719
[using regex comparisons and other features for inspection of logs]: https://community.cratedb.com/t/using-regex-comparisons-and-other-advanced-database-features-for-real-time-inspection-of-web-server-logs/1564

(performance-scaling)=
# Design for scale
This article explores critical design considerations to successfully scale
CrateDB in large production environments to ensure performance and reliability
as workloads grow.
(mindful-of-memory)=
## Be mindful of memory capacity
In CrateDB, operations requiring a working set like groupings, aggregations, and
sorting are performed fully in memory without spilling over to disk.
Sometimes a query leads to a sub-optimal execution plan that requires a lot of
memory. If you are coming to CrateDB from other database systems, your
experience may be that such queries simply run longer than they should while
impacting other workloads in the meanwhile. Sometimes this effect is obvious,
when a query takes a lot of resources and runs for a long time; other times it
goes unnoticed, when a query that could complete in, say, 100 milliseconds
takes one hundred times longer, 10 seconds, and users put up with it without
reporting it.
If a query requires more heap memory than the involved nodes have available,
it fails with a particular type of error message that we call a
`CircuitBreakerException`. This is a fail-fast approach: we quickly see that
there is an issue and can optimize the query to get the best performance,
without impacting other workloads.
Please take a look at {ref}`Query Optimization 101 `
for strategies to optimize your queries when you encounter this situation.
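For example, one way to keep an eye on heap headroom is to query the `sys.nodes` system table:
```sql
-- Fraction of the configured heap currently in use, per node.
SELECT name,
       heap['used'] / heap['max']::DOUBLE PRECISION AS heap_used_ratio
FROM sys.nodes
ORDER BY heap_used_ratio DESC;
```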
(reading-lots-of-records)=
## Reading lots of records
When the HTTP endpoint is used, CrateDB prepares the entire response in memory
before sending it to the client.
When the PostgreSQL protocol is used, CrateDB attempts to stream the results,
but in many cases it still needs to bring all rows to the query handler node
first.
We should therefore always limit how many rows we request at a time, see
[Fetching large result sets from CrateDB][fetching large result sets from cratedb].
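As a minimal sketch with a hypothetical table name, cursors let a client consume a large result set in batches:
```sql
DECLARE device_cur CURSOR WITH HOLD FOR
    SELECT * FROM device_data;
FETCH 1000 FROM device_cur;  -- repeat until no more rows are returned
CLOSE device_cur;
```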
(number-of-shards)=
## Number of shards
In CrateDB data in tables and partitions is distributed in storage units
called "shards".
If we do not specify how many shards we want for a table or partition, CrateDB
derives a default from the number of nodes.
Having too many or too few shards has performance implications, so it is very
important to get familiar with the {ref}`sharding-guide`.
In particular, there is a soft limit of 1000 shards per node. Table schemas,
partitioning strategy, and the number of nodes need to be planned so that the
cluster stays well below this limit. One strategy is to aim for a configuration
where, even if one node in the cluster is lost, the remaining nodes still hold
fewer than 1000 shards each.
If this was not taken into account when the tables were initially defined, the
following considerations apply:
- changing the partitioning strategy requires creating a new table and copying
over the data
- the easiest way to change the number of shards of a partitioned table is to
apply the change to new partitions only, using the `ALTER TABLE ONLY` command
(see the sketch after this list)
- see also [Changing the number of shards]
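As a sketch of the `ALTER TABLE ONLY` approach mentioned above, with a hypothetical partitioned table name:
```sql
-- Takes effect for partitions created from now on;
-- existing partitions keep their current number of shards.
ALTER TABLE ONLY readings SET (number_of_shards = 2);
```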
(amount-of-indexed-columns)=
## Number of indexed fields in OBJECTs
`OBJECT` columns are `DYNAMIC` by default and CrateDB indexes all their
fields, providing excellent query performance without requiring manual indexing.
However, excessive indexing can impact storage, write speed, and resource
utilization.
- All fields in OBJECTs are automatically indexed when inserted.
- CrateDB optimizes indexing using Lucene-based columnar storage.
- A soft limit of 1,000 total indexed columns and OBJECT fields per table
exists.
- Going beyond this limit may impact performance.
In cases with many fields and columns, it is advised to determine whether some
OBJECTs, or nested parts of them, actually need to be indexed, and to use the
[ignored column policy][ignored column policy] where they do not.
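For illustration, a minimal sketch with hypothetical names, where a free-form payload is stored but not indexed:
```sql
CREATE TABLE events (
    id BIGINT,
    ts TIMESTAMP WITH TIME ZONE,
    -- Fields inside 'attributes' are stored, but neither indexed
    -- nor added to the table schema.
    attributes OBJECT(IGNORED)
);
```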
(section-joins)=
## JOINs
CrateDB is a lot better at JOINs than many of its competitors and keeps
improving with every release, but JOINs in distributed databases are tricky to
optimize, so in many cases queries involving JOINs may need a bit of tweaking.
See [Using common table expressions to speed up queries].
[changing the number of shards]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/alter-table.html#alter-shard-number
[fetching large result sets from cratedb]: https://community.cratedb.com/t/fetching-large-result-sets-from-cratedb/1270
[ignored column policy]: https://cratedb.com/docs/crate/reference/en/latest/general/ddl/data-types.html#ignored
[using common table expressions to speed up queries]: https://community.cratedb.com/t/using-common-table-expressions-to-speed-up-queries/1719

(integrate)=
(integrations)=
# Integrations
You have a variety of options to connect and integrate 3rd-party
applications, mostly using [CrateDB's PostgreSQL interface].
This documentation section lists applications, frameworks, and libraries,
which can be used together with CrateDB, and outlines how to use them
optimally.
Explore integrations by category.
:::{toctree}
:maxdepth: 1
category/overview
:::
Explore integrations sorted alphanumerically.
:::{toctree}
:maxdepth: 1
:glob:
*/index
:::
[CrateDB's PostgreSQL interface]: inv:crate-reference#interface-postgresql

We've created many integration-focused tutorials to help you use CrateDB with other awesome tools and libraries. 👇
All tutorials require a working installation of CrateDB.
| **Tool** | **Articles/Tutorials** |
| --- | --- |
| [Apache Airflow](https://airflow.apache.org/) / [Astronomer](https://www.astronomer.io/) | https://community.cratedb.com/t/cratedb-and-apache-airflow-automating-data-export-to-s3/901 · https://community.cratedb.com/t/cratedb-and-apache-airflow-implementation-of-data-retention-policy/913 · https://community.cratedb.com/t/cratedb-and-apache-airflow-building-a-data-ingestion-pipeline/926 · https://community.cratedb.com/t/cratedb-and-apache-airflow-building-a-hot-cold-storage-data-retention-policy/934 |
| [Apache Arrow](https://arrow.apache.org) | https://community.cratedb.com/t/import-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161 |
| [Apache Kafka](https://kafka.apache.org/) | https://crate.io/docs/crate/howtos/en/latest/integrations/kafka-connect.html |
| [Apache NiFi](https://nifi.apache.org/) | https://community.cratedb.com/t/connecting-to-cratedb-from-apache-nifi/647 |
| [Apache Spark](https://spark.apache.org/) | https://community.cratedb.com/t/getting-started-with-apache-spark-and-cratedb-a-step-by-step-tutorial/1595 · https://community.cratedb.com/t/introduction-to-azure-databricks-with-cratedb/764 · https://github.com/crate/cratedb-examples/tree/main/by-dataframe/spark/scala-http |
| [Apache Superset](https://github.com/apache/superset) / [Preset](https://preset.io/) | https://community.cratedb.com/t/set-up-apache-superset-with-cratedb/1716 · https://crate.io/blog/use-cratedb-and-apache-superset-for-open-source-data-warehousing-and-visualization · [Introduction to Time-Series Visualization in CrateDB and Superset](https://crate.io/blog/introduction-to-time-series-visualization-in-cratedb-and-superset) |
| [Balena](https://www.balena.io/) | https://community.cratedb.com/t/deploying-cratedb-on-balena-io/1067 |
| [Cluvio](https://www.cluvio.com/) | https://community.cratedb.com/t/data-analysis-with-cluvio-and-cratedb/1571 |
| [Dapr](https://dapr.io/) | https://community.cratedb.com/t/connecting-to-cratedb-from-dapr/660 |
| [DataGrip](https://www.jetbrains.com/datagrip/) | https://cratedb.com/docs/guide/integrate/datagrip/ |
| [Datashader](https://datashader.org/) | [CrateDB Time Series Exploration and Visualization](https://github.com/crate/cratedb-examples/tree/amo/cloud-datashader/topic/timeseries/explore) |
| [Dask](https://www.dask.org/) | https://community.cratedb.com/t/guide-to-efficient-data-ingestion-to-cratedb-with-pandas-and-dask/1482 |
| [DBeaver](https://dbeaver.io/about/) | https://crate.io/blog/cratedb-dbeaver |
| [dbt](https://github.com/dbt-labs/dbt-core) | https://community.cratedb.com/t/using-dbt-with-cratedb/1566 |
| [Debezium](https://debezium.io/) | https://community.cratedb.com/t/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka/1388 |
| [Explo](https://www.explo.co/) | https://crate.io/blog/introduction-to-time-series-visualization-in-cratedb-and-explo |
| [JMeter](https://jmeter.apache.org) | https://community.cratedb.com/t/jmeter-jdbc-connection-to-cratedb/1051/2?u=jayeff |
| [Grafana](https://grafana.com/) | https://crate.io/blog/visualizing-time-series-data-with-grafana-and-cratedb · https://community.cratedb.com/t/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana/1236 |
| [Kestra.io](https://kestra.io/) | https://community.cratedb.com/t/guide-to-cratedb-data-pipelines-with-kestra-io/1400 |
| [LangChain](https://www.langchain.com/) | https://community.cratedb.com/t/how-to-set-up-langchain-with-cratedb/1576 |
| [Locust](https://locust.io) | https://community.cratedb.com/t/loadtesting-cratedb-using-locust/1686 |
| [Meltano](https://meltano.com/) | [Meltano Examples](https://github.com/crate/cratedb-examples/tree/amo/meltano/framework/singer-meltano) |
| [Metabase](https://www.metabase.com/) | https://community.cratedb.com/t/visualizing-data-with-metabase/1401 · https://community.cratedb.com/t/demo-of-metabase-and-cratedb-getting-started/1436 |
| [Node-RED](https://nodered.org/) | https://community.cratedb.com/t/ingesting-mqtt-messages-into-cratedb-using-node-red/803 |
| [pandas](https://pandas.pydata.org/) | https://community.cratedb.com/t/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas-2/1440 · https://community.cratedb.com/t/automating-financial-data-collection-and-storage-in-cratedb-with-python-and-pandas/916 · https://community.cratedb.com/t/importing-parquet-files-into-cratedb-using-apache-arrow-and-sqlalchemy/1161 · https://community.cratedb.com/t/guide-to-efficient-data-ingestion-from-pandas-to-cratedb/1541 |
| [PowerBI](https://powerbi.microsoft.com/en-us/) | https://crate.io/docs/crate/howtos/en/latest/integrations/powerbi-desktop.html · https://crate.io/docs/crate/howtos/en/latest/integrations/powerbi-gateway.html |
| [Prefect](https://www.prefect.io/) | https://community.cratedb.com/t/building-seamless-data-pipelines-made-easy-combining-prefect-and-cratedb/1555 |
| [Prometheus](https://prometheus.io/) | https://community.cratedb.com/t/cratedb-and-prometheus-for-long-term-metrics-storage/1012 · https://community.cratedb.com/t/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana/1236 |
| [PyCaret](https://pycaret.org/) | [AutoML with PyCaret and CrateDB](https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/automl) |
| [R](https://www.r-project.org/) | https://crate.io/docs/crate/howtos/en/latest/integrations/r.html |
| [Rill](https://www.rilldata.com/) | https://community.cratedb.com/t/introducing-rill-and-bi-as-code-with-cratedb-cloud/1718 |
| [Rsyslog](https://www.rsyslog.com/) | https://community.cratedb.com/t/storing-server-logs-on-cratedb-for-fast-search-and-aggregations/1562 |
| [SQLPad](https://crate.io/blog/use-cratedb-with-sqlpad-as-a-self-hosted-query-tool-and-visualizer) | https://crate.io/blog/use-cratedb-with-sqlpad-as-a-self-hosted-query-tool-and-visualizer |
| [StreamSets](https://crate.io/docs/crate/howtos/en/latest/integrations/streamsets.html) | https://crate.io/docs/crate/howtos/en/latest/integrations/streamsets.html |
| [Tableau](https://www.tableau.com/) | https://community.cratedb.com/t/using-cratedb-with-tableau/1192 |
| [Telegraf](https://www.influxdata.com/time-series-platform/telegraf/) | https://crate.io/blog/use-cratedb-with-telegraf-an-agent-for-collecting-reporting-metrics |
| [TensorFlow](https://www.tensorflow.org/) | https://crate.io/docs/crate/howtos/en/latest/integrations/ml-dist.html |
| [Terraform](https://www.terraform.io/) | https://community.cratedb.com/t/deploying-cratedb-to-the-cloud-via-terraform/849 |
| [Trino](https://trino.io/) | https://community.cratedb.com/t/connecting-to-cratedb-using-trino/993 |

.. highlight:: sh
.. _interface-http:
=============
HTTP endpoint
=============
CrateDB provides an HTTP endpoint that can be used to submit SQL queries. The
endpoint is accessible under ``/_sql``.
SQL statements are sent to the ``_sql`` endpoint in ``json`` format, with the
statement sent as the value associated with the key ``stmt``.
.. SEEALSO::
:ref:`dml`
A simple ``SELECT`` statement can be submitted like this::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' \
... -d '{"stmt":"select name, position from locations order by id limit 2"}'
{
"cols": [
"name",
"position"
],
"rows": [
[
"North West Ripple",
1
],
[
"Outer Eastern Rim",
2
]
],
"rowcount": 2,
"duration": ...
}
.. NOTE::
We're using a simple command line invocation of ``curl`` here so you can
see how to run this by hand in the terminal. For the rest of the examples
in this document, we use `here documents`_ (i.e. ``EOF``) for multi line
readability.
.. _http-param-substitution:
Parameter substitution
======================
In addition to the ``stmt`` key the request body may also contain an ``args``
key which can be used for SQL parameter substitution.
The SQL statement has to be changed to use placeholders where the values should
be inserted. Placeholders can either be numbered (in the form of ``$1``,
``$2``, etc.) or unnumbered using a question mark ``?``.
The placeholders will then be substituted with values from an array that is
expected under the ``args`` key::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' -d@- <<- EOF
... {
... "stmt":
... "select date,position from locations
... where date <= \$1 and position < \$2 order by position",
... "args": ["1979-10-12", 3]
... }
... EOF
{
"cols": [
"date",
"position"
],
"rows": [
[
308534400000,
1
],
[
308534400000,
2
]
],
"rowcount": 2,
"duration": ...
}
.. NOTE::
In this example the placeholders start with a backslash due to shell
escaping.
.. WARNING::
Parameter substitution must not be used within subscript notation.
For example, ``column[?]`` is not allowed.
The same query using question marks as placeholders looks like this::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' -d@- <<- EOF
... {
... "stmt":
... "select date,position from locations
... where date <= ? and position < ? order by position",
... "args": ["1979-10-12", 3]
... }
... EOF
{
"cols": [
"date",
"position"
],
"rows": [
[
308534400000,
1
],
[
308534400000,
2
]
],
"rowcount": 2,
"duration": ...
}
.. NOTE::
With some queries the row count is not ascertainable. In these cases
``rowcount`` is ``-1``.
.. _http-default-schema:
Default schema
==============
It is possible to set a default schema while querying the CrateDB cluster via
the ``_sql`` endpoint. In this case the HTTP request should contain the
``Default-Schema`` header with the specified schema name::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' \
... -H 'Default-Schema: doc' -d@- <<- EOF
... {
... "stmt":"select name, position from locations order by id limit 2"
... }
... EOF
{
"cols": [
"name",
"position"
],
"rows": [
[
"North West Ripple",
1
],
[
"Outer Eastern Rim",
2
]
],
"rowcount": 2,
"duration": ...
}
If the schema name is not specified in the header, the default ``doc`` schema
will be used instead.
.. _http-column-types:
Column types
============
CrateDB can include a list ``col_types`` with the data type ID of every
returned column. This way one can know exactly which data type each column
holds.
In order to get the list of column data types, a ``types`` query parameter must
be passed to the request::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql?types' -d@- <<- EOF
... {
... "stmt":
... "select date, position from locations
... where date <= \$1 and position < \$2 order by position",
... "args": ["1979-10-12", 3]
... }
... EOF
{
"cols": [
"date",
"position"
],
"col_types": [
11,
9
],
"rows": [
[
308534400000,
1
],
[
308534400000,
2
]
],
"rowcount": 2,
"duration": ...
}
The ``Array`` collection data type is displayed as a list where the first value
is the collection type and the second is the inner type. The inner type could
also be a collection.
Example of JSON representation of a column list of (String, Integer[])::
"column_types": [ 4, [ 100, 9 ] ]
.. _http-data-types-table:
Available data types
--------------------
IDs of all currently available data types:
.. list-table::
:widths: 8 30
:header-rows: 1
* - ID
- Data type
* - 0
- :ref:`NULL `
* - 1
- Not supported
* - 2
- :ref:`CHAR `
* - 3
- :ref:`BOOLEAN `
* - 4
- :ref:`TEXT `
* - 5
- :ref:`IP `
* - 6
- :ref:`DOUBLE PRECISION `
* - 7
- :ref:`REAL `
* - 8
- :ref:`SMALLINT `
* - 9
- :ref:`INTEGER `
* - 10
- :ref:`BIGINT `
* - 11
- :ref:`TIMESTAMP WITH TIME ZONE `
* - 12
- :ref:`OBJECT `
* - 13
- :ref:`GEO_POINT `
* - 14
- :ref:`GEO_SHAPE `
* - 15
- :ref:`TIMESTAMP WITHOUT TIME ZONE `
* - 16
- Unchecked object
* - 17
- :ref:`INTERVAL `
* - 18
- :ref:`ROW `
* - 19
- :ref:`REGPROC `
* - 20
- :ref:`TIME `
* - 21
- :ref:`OIDVECTOR `
* - 22
- :ref:`NUMERIC `
* - 23
- :ref:`REGCLASS `
* - 24
- :ref:`DATE `
* - 25
- :ref:`BIT `
* - 26
- :ref:`JSON `
* - 27
- :ref:`CHARACTER `
* - 28
- :ref:`FLOAT VECTOR `
* - 100
- :ref:`ARRAY `
.. _http-error-handling:
Error handling
==============
Queries that are invalid or cannot be satisfied will result in an error
response. The response will contain an error code, an error message and in some
cases additional arguments that are specific to the error code.
Client libraries should use the error code to translate the error into an
appropriate exception::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' -d@- <<- EOF
... {
... "stmt":"select name, position from foo.locations"
... }
... EOF
{
"error": {
"message": "SchemaUnknownException[Schema 'foo' unknown]",
"code": 4045
}
}
To get more insight into what exactly went wrong an additional ``error_trace``
``GET`` parameter can be specified to return the stack trace::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql?error_trace=true' -d@- <<- EOF
... {
... "stmt":"select name, position from foo.locations"
... }
... EOF
{
"error": {
"message": "SchemaUnknownException[Schema 'foo' unknown]",
"code": 4045
},
"error_trace": "..."
}
.. NOTE::
This parameter is intended for CrateDB developers or for users requesting
support for CrateDB. Client libraries shouldn't make use of this option, and
shouldn't include the stack trace.
.. _http-error-codes:
Error codes
-----------
====== =======================================================================
Code Error
====== =======================================================================
40000 Generic bad request error.
------ -----------------------------------------------------------------------
4000 The statement contains an invalid syntax or unsupported SQL statement
------ -----------------------------------------------------------------------
4001 The statement contains an invalid analyzer definition.
------ -----------------------------------------------------------------------
4002 The name of the relation is invalid.
------ -----------------------------------------------------------------------
4003 Field type validation failed
------ -----------------------------------------------------------------------
4004 Possible feature not supported (yet)
------ -----------------------------------------------------------------------
4005 Alter table using a table alias is not supported.
------ -----------------------------------------------------------------------
4006 The used column alias is ambiguous.
------ -----------------------------------------------------------------------
4007 The operation is not supported on this relation, as it is not
accessible.
------ -----------------------------------------------------------------------
4008 The name of the column is invalid.
------ -----------------------------------------------------------------------
40010 The document storage source is missing.
------ -----------------------------------------------------------------------
40011 Invalid snapshot name.
------ -----------------------------------------------------------------------
40012 A running snapshot is using the relation a user wants to drop or close.
------ -----------------------------------------------------------------------
40013 Raised if trying to remove an object while others still depend on it.
------ -----------------------------------------------------------------------
4010 User is not authorized to perform the SQL statement.
------ -----------------------------------------------------------------------
4011 Missing privilege for user.
------ -----------------------------------------------------------------------
4031 Only read operations are allowed on this node.
------ -----------------------------------------------------------------------
4032 The relation is closed, any read or write is forbidden.
------ -----------------------------------------------------------------------
4033 Only DQL operations are allowed on this relation.
------ -----------------------------------------------------------------------
4034 Only DML operations are allowed on this relation.
------ -----------------------------------------------------------------------
4035 Only DQL and DDL operations are allowed on this relation.
------ -----------------------------------------------------------------------
4036 Only DDL operations are allowed on this relation.
------ -----------------------------------------------------------------------
4037 Only DQL or DELETE operations are allowed on this relation.
------ -----------------------------------------------------------------------
4040 Resource not found (generic error, no further details).
------ -----------------------------------------------------------------------
4041 Unknown relation.
------ -----------------------------------------------------------------------
4042 Unknown analyzer.
------ -----------------------------------------------------------------------
4043 Unknown column.
------ -----------------------------------------------------------------------
4044 Unknown type.
------ -----------------------------------------------------------------------
4045 Unknown schema.
------ -----------------------------------------------------------------------
4046 Unknown Partition.
------ -----------------------------------------------------------------------
4047 Unknown Repository.
------ -----------------------------------------------------------------------
4048 Unknown Snapshot.
------ -----------------------------------------------------------------------
4049 Unknown :ref:`user-defined function `.
------ -----------------------------------------------------------------------
40410 Unknown user.
------ -----------------------------------------------------------------------
40411 Document not found.
------ -----------------------------------------------------------------------
4091 A document with the same primary key exists already.
------ -----------------------------------------------------------------------
4092 A VersionConflict. Might be thrown if an attempt was made to update
the same document concurrently.
------ -----------------------------------------------------------------------
4093 A relation with the same name exists already.
------ -----------------------------------------------------------------------
4094 The used table alias contains tables with different schema.
------ -----------------------------------------------------------------------
4095 A repository with the same name exists already.
------ -----------------------------------------------------------------------
4096 A snapshot with the same name already exists in the repository.
------ -----------------------------------------------------------------------
4097 A partition for the same values already exists in this table.
------ -----------------------------------------------------------------------
4098 A user-defined function with the same signature already exists.
------ -----------------------------------------------------------------------
4099 A user with the same name already exists.
------ -----------------------------------------------------------------------
40910 An object with the same name already exists.
------ -----------------------------------------------------------------------
5000 Unhandled server error.
------ -----------------------------------------------------------------------
5001 The execution of one or more tasks failed.
------ -----------------------------------------------------------------------
5002 One or more shards are not available.
------ -----------------------------------------------------------------------
5003 The query failed on one or more shards
------ -----------------------------------------------------------------------
5004 Creating a snapshot failed
------ -----------------------------------------------------------------------
5005 The query was killed by a ``kill`` statement
------ -----------------------------------------------------------------------
5006 Verification of the related repository failed
------ -----------------------------------------------------------------------
5030 Cluster unavailable, no further details
------ -----------------------------------------------------------------------
5031 No master node discovered (yet)
------ -----------------------------------------------------------------------
5032 No node is available
------ -----------------------------------------------------------------------
5033 No shard of the related relation is available
------ -----------------------------------------------------------------------
5034 Processing global cluster change events timed out
------ -----------------------------------------------------------------------
5035 The global cluster state is not recovered (yet)
====== =======================================================================
.. _http-bulk-ops:
Bulk operations
===============
The HTTP endpoint supports executing a single SQL statement many times with
different parameters.
Instead of the ``args`` (:ref:`http-param-substitution`) key, use the key
``bulk_args``. This allows you to specify a list of lists containing all the
parameters to be processed. The inner lists need to match the specified
columns.
The bulk response contains a ``results`` array, with a row count for each bulk
operation. The results appear in the same order as the issued operations.
Here is an example that inserts three records at once::
sh$ curl -sS -H 'Content-Type: application/json' \
... -X POST '127.0.0.1:4200/_sql' -d@- <<- EOF
... {
... "stmt": "INSERT INTO locations (id, name, kind, description)
... VALUES (?, ?, ?, ?)",
... "bulk_args": [
... [1337, "Earth", "Planet", "An awesome place to spend some time on."],
... [1338, "Sun", "Star", "An extraordinarily hot place."],
... [1339, "Titan", "Moon", "Titan, where it rains fossil fuels."]
... ]
... }
... EOF
{
"cols": [],
"duration": ...,
"results": [
{
"rowcount": 1
},
{
"rowcount": 1
},
{
"rowcount": 1
}
]
}
Statements with a result set cannot be executed in bulk. The supported bulk SQL
statements are:
- Insert
- Update
- Delete
.. _http-bulk-errors:
Bulk errors
-----------
There are two kinds of error behaviors for bulk requests:
1. **Analysis error:** Occurs if the statement is invalid, either due to syntax
errors or semantic errors identified during the analysis phase before the
execution starts. In this case the **whole** operation fails and you'll get
a single error::
{
"error": {
"code": 4043,
"message": "ColumnUnknownException[Column y unknown]"
}
}
2. **Runtime error:** For errors happening after the analysis phase succeeded
during execution. For example on duplicate primary key errors or check
constraint failures. In this case CrateDB continues processing the other
bulk arguments and reports the results via a ``rowcount`` where ``-2``
indicates an error. Additionally, each failing bulk operation result element
contains an ``error`` object with the related :ref:`code `
and ``message``::
{
"cols": [],
"duration": 2.195417,
"results": [
{
"rowcount": 1
},
{
"rowcount": -2,
"error": {
"code": 4091,
"message": "DuplicateKeyException[A document with the same primary key exists already]"
}
}
]
}
.. note::
To avoid too much memory pressure caused by errors, only the first ``10``
errors happening on each involved shard will contain an ``error`` payload.
Any following error is exposed only by the ``-2`` row count without any
details.
.. note::
The ``error_trace`` option does not work with bulk operations.
.. _here documents: https://en.wikipedia.org/wiki/Here_document
.. _prepared statement: https://en.wikipedia.org/wiki/Prepared_statement

.. _interface-postgresql:
========================
PostgreSQL wire protocol
========================
CrateDB supports the `PostgreSQL wire protocol v3`_.
If a node is started with PostgreSQL wire protocol support enabled it will bind
to port ``5432`` by default. To use a custom port, set the corresponding
:ref:`conf_ports` in the :ref:`Configuration `.
However, even though connecting PostgreSQL tools and client libraries is
supported, the actual SQL statements have to be supported by CrateDB's SQL
dialect. A notable difference is that CrateDB doesn't support transactions,
which is why clients should generally enable ``autocommit``.
.. NOTE::
In order to use ``setFetchSize`` in JDBC it is possible to set auto commit
to false.
The client will utilize the fetchSize on SELECT statements and only load up
to fetchSize rows into memory.
See the `PostgreSQL JDBC Query docs`_ for more information.
Write operations will still behave as if auto commit was enabled and commit
or rollback calls are ignored.
.. _postgres-server-compat:
Server compatibility
====================
CrateDB emulates PostgreSQL server version ``14``.
.. _postgres-start-up:
Start-up
--------
.. _postgres-ssl:
SSL Support
'''''''''''
SSL can be configured using :ref:`admin_ssl`.
.. _postgres-auth:
Authentication
''''''''''''''
Authentication methods can be configured using :ref:`admin_hba`.
.. _postgres-parameterstatus:
ParameterStatus
'''''''''''''''
After authentication succeeds, the server may send multiple
``ParameterStatus`` messages to the client. These are used to communicate
information like ``server_version`` or ``server_encoding``.
CrateDB also sends a message containing the ``crate_version`` parameter, which
contains the current CrateDB version number.
This information is useful for clients to detect that they're connecting to
CrateDB instead of a PostgreSQL instance.
.. _postgres-db-selection:
Database selection
''''''''''''''''''
Since CrateDB uses schemas instead of databases, the ``database`` parameter
sets the default schema name for future queries. If no schema is specified, the
schema ``doc`` will be used as default. Additionally, the only supported
charset is ``UTF8``.
.. _postgres-query-modes:
Query modes
-----------
.. _postgres-query-modes-simple:
Simple query
''''''''''''
The `PostgreSQL simple query`_ protocol mode is fully implemented.
.. _postgres-query-modes-extended:
Extended query
''''''''''''''
The `PostgreSQL extended query`_ protocol mode is implemented with the
following limitations:
- The ``ParameterDescription`` message works for the most common use cases
except for DDL statements.
- To optimize the execution of bulk operations, the execution of statements is
delayed until the ``Sync`` message is received.
.. _postgres-copy-na:
Copy operations
---------------
CrateDB does not support the ``COPY`` sub-protocol, see also
:ref:`postgres-copy`.
.. _postgres-fn-call:
Function call
-------------
The :ref:`function call ` sub-protocol is not supported
since it's a legacy feature.
.. _postgres-cancel-reqs:
Canceling requests
------------------
`PostgreSQL cancelling requests`_ is fully implemented.
.. _postgres-pg_catalog:
``pg_catalog``
--------------
For improved compatibility, the ``pg_catalog`` schema is implemented,
containing the following tables:
- `pg_am`_
- `pg_attrdef <pgsql_pg_attrdef_>`__
- `pg_attribute <pgsql_pg_attribute_>`__
- `pg_auth_members`_
- `pg_class <pgsql_pg_class_>`__
- `pg_constraint <pgsql_pg_constraint_>`__
- `pg_cursors <pgsql_pg_cursors_>`__
- `pg_database <pgsql_pg_database_>`__
- `pg_depend`_
- `pg_description`_
- `pg_enum`_
- `pg_event_trigger`_
- `pg_index <pgsql_pg_index_>`__
- `pg_indexes <pgsql_pg_indexes_>`__
- `pg_locks <pgsql_pg_locks_>`__
- `pg_matviews <pgsql_pg_matviews_>`__
- `pg_namespace <pgsql_pg_namespace_>`__
- `pg_proc <pgsql_pg_proc_>`__
- `pg_publication <pgsql_pg_publication_>`__
- `pg_publication_tables <pgsql_pg_publication_tables_>`__
- `pg_range`_
- `pg_roles`_
- `pg_settings <pgsql_pg_settings_>`__
- `pg_shdescription`_
- `pg_stats`_
- `pg_subscription <pgsql_pg_subscription_>`__
- `pg_subscription_rel <pgsql_pg_subscription_rel_>`__
- `pg_tables`_
- `pg_tablespace`_
- `pg_type`_
- `pg_views`_
.. _postgres-pg_type:
``pg_type``
'''''''''''
Some clients require the ``pg_catalog.pg_type`` in order to be able to stream
arrays or other non-primitive types.
For compatibility reasons, there is a trimmed down `pg_type <pgsql_pg_type_>`__
table available in CrateDB::
cr> SELECT oid, typname, typarray, typelem, typlen, typtype, typcategory
... FROM pg_catalog.pg_type
... ORDER BY oid;
+------+--------------+----------+---------+--------+---------+-------------+
| oid | typname | typarray | typelem | typlen | typtype | typcategory |
+------+--------------+----------+---------+--------+---------+-------------+
| 16 | bool | 1000 | 0 | 1 | b | N |
| 18 | char | 1002 | 0 | 1 | b | S |
| 19 | name | -1 | 0 | 64 | b | S |
| 20 | int8 | 1016 | 0 | 8 | b | N |
| 21 | int2 | 1005 | 0 | 2 | b | N |
| 23 | int4 | 1007 | 0 | 4 | b | N |
| 24 | regproc | 1008 | 0 | 4 | b | N |
| 25 | text | 1009 | 0 | -1 | b | S |
| 26 | oid | 1028 | 0 | 4 | b | N |
| 30 | oidvector | 1013 | 26 | -1 | b | A |
| 114 | json | 199 | 0 | -1 | b | U |
| 199 | _json | 0 | 114 | -1 | b | A |
| 600 | point | 1017 | 0 | 16 | b | G |
| 700 | float4 | 1021 | 0 | 4 | b | N |
| 701 | float8 | 1022 | 0 | 8 | b | N |
| 705 | unknown | 0 | 0 | -2 | p | X |
| 1000 | _bool | 0 | 16 | -1 | b | A |
| 1002 | _char | 0 | 18 | -1 | b | A |
| 1005 | _int2 | 0 | 21 | -1 | b | A |
| 1007 | _int4 | 0 | 23 | -1 | b | A |
| 1008 | _regproc | 0 | 24 | -1 | b | A |
| 1009 | _text | 0 | 25 | -1 | b | A |
| 1014 | _bpchar | 0 | 1042 | -1 | b | A |
| 1015 | _varchar | 0 | 1043 | -1 | b | A |
| 1016 | _int8 | 0 | 20 | -1 | b | A |
| 1017 | _point | 0 | 600 | -1 | b | A |
| 1021 | _float4 | 0 | 700 | -1 | b | A |
| 1022 | _float8 | 0 | 701 | -1 | b | A |
| 1042 | bpchar | 1014 | 0 | -1 | b | S |
| 1043 | varchar | 1015 | 0 | -1 | b | S |
| 1082 | date | 1182 | 0 | 8 | b | D |
| 1114 | timestamp | 1115 | 0 | 8 | b | D |
| 1115 | _timestamp | 0 | 1114 | -1 | b | A |
| 1182 | _date | 0 | 1082 | -1 | b | A |
| 1184 | timestamptz | 1185 | 0 | 8 | b | D |
| 1185 | _timestamptz | 0 | 1184 | -1 | b | A |
| 1186 | interval | 1187 | 0 | 16 | b | T |
| 1187 | _interval | 0 | 1186 | -1 | b | A |
| 1231 | _numeric | 0 | 1700 | -1 | b | A |
| 1266 | timetz | 1270 | 0 | 12 | b | D |
| 1270 | _timetz | 0 | 1266 | -1 | b | A |
| 1560 | bit | 1561 | 0 | -1 | b | V |
| 1561 | _bit | 0 | 1560 | -1 | b | A |
| 1700 | numeric | 1231 | 0 | -1 | b | N |
| 2205 | regclass | 2210 | 0 | 4 | b | N |
| 2210 | _regclass | 0 | 2205 | -1 | b | A |
| 2249 | record | 2287 | 0 | -1 | p | P |
| 2276 | any | 0 | 0 | 4 | p | P |
| 2277 | anyarray | 0 | 2276 | -1 | p | P |
| 2287 | _record | 0 | 2249 | -1 | p | A |
| 2950 | uuid | 2951 | 0 | 16 | b | U |
| 2951 | _uuid | 0 | 2950 | -1 | b | A |
+------+--------------+----------+---------+--------+---------+-------------+
SELECT 52 rows in set (... sec)
.. NOTE::
This is just a snapshot of the table.
Check table :ref:`information_schema.columns `
to get information for all supported columns.
.. _postgres-pg_type-oid:
OID types
.........
*Object Identifiers* (OIDs) are used internally by PostgreSQL as primary keys
for various system tables.
CrateDB supports the :ref:`oid ` type and the following aliases:
+-------------------+----------------------+-------------+-------------+
| Name | Reference | Description | Example |
+===================+======================+=============+=============+
| :ref:`regproc | `pg_proc | A function | ``sum`` |
| ` | `__ | name | |
+-------------------+----------------------+-------------+-------------+
| :ref:`regclass | `pg_class | A relation | ``pg_type`` |
| ` | `__ | name | |
+-------------------+----------------------+-------------+-------------+
CrateDB also supports the :ref:`oidvector ` type.
.. NOTE::
Casting a :ref:`string ` or an :ref:`integer
` to the ``regproc`` type does not result in a function
lookup (as it does with PostgreSQL).
Instead:
.. rst-class:: open
- Casting a string to the ``regproc`` type results in an object of the
``regproc`` type with a name equal to the string value and an ``oid``
equal to an integer hash of the string.
- Casting an integer to the ``regproc`` type results in an object of the
``regproc`` type with a name equal to the string representation of the
integer and an ``oid`` equal to the integer value.
Consult the :ref:`CrateDB data types reference
` for more information about each OID type
(including additional type casting behaviour).
.. _postgres-show-trans-isolation:
Show transaction isolation
--------------------------
For compatibility with JDBC the ``SHOW TRANSACTION ISOLATION LEVEL`` statement
is implemented::
cr> show transaction isolation level;
+-----------------------+
| transaction_isolation |
+-----------------------+
| read uncommitted |
+-----------------------+
SHOW 1 row in set (... sec)
.. _postgres-begin-start-comit:
``BEGIN``, ``START``, and ``COMMIT`` statements
-----------------------------------------------
For compatibility with clients that use the PostgreSQL wire protocol (e.g.,
the Golang lib/pq and pgx drivers), CrateDB will accept the :ref:`BEGIN
`, :ref:`COMMIT `, and :ref:`START TRANSACTION
` statements. For example::
cr> BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED,
... READ ONLY,
... NOT DEFERRABLE;
BEGIN OK, 0 rows affected (... sec)
cr> COMMIT
COMMIT OK, 0 rows affected (... sec)
CrateDB will silently ignore the ``COMMIT``, ``BEGIN``, and ``START
TRANSACTION`` statements and all respective parameters.
.. _postgres-client-compat:
Client compatibility
====================
.. _postgres-client-jdbc:
JDBC
----
`pgjdbc`_ JDBC drivers version ``9.4.1209`` and above are compatible.
.. _postgres-client-jdbc-limit:
Limitations
'''''''''''
- Versions ``42.7.5``, ``42.7.6`` and ``42.7.7`` do not support some metadata
methods when used with CrateDB version ``5.x``, e.g.::
conn.getMetaData().getTables(...)
These metadata calls only work with CrateDB ``6.0.0`` and later. If you rely
on such metadata methods and you use CrateDB ``5.x``, you should avoid those
JDBC versions and use ``42.7.4`` instead.
- ``OBJECT`` and ``GEO_SHAPE`` columns can be streamed as ``JSON`` but require
`pgjdbc`_ version ``9.4.1210`` or newer.
- Multidimensional arrays will be streamed as ``JSON`` encoded string to avoid
a protocol limitation where all sub-arrays are required to have the same
length.
- The behavior of ``PreparedStatement.executeBatch`` in error cases depends on
in which stage an error occurs: A ``BatchUpdateException`` is thrown if no
processing has been done yet, whereas single operations failing after the
processing started are indicated by an ``EXECUTE_FAILED`` (-3) return value.
- Transaction limitations as described above.
- Having ``escape processing`` enabled could prevent the usage of :ref:`Object
Literals ` in case an object key's starting
character clashes with a JDBC escape keyword (see also `JDBC escape syntax
`_).
Disabling ``escape processing`` will remedy this appropriately for `pgjdbc`_
version >= ``9.4.1212``.
.. _postgres-client-jdbc-conn:
Connection failover and load balancing
''''''''''''''''''''''''''''''''''''''
Connection failover and load balancing is supported as described here:
`PostgreSQL JDBC connection failover`_.
.. NOTE::
It is not recommended to use the **targetServerType** parameter since
CrateDB has no concept of master-replica nodes.
.. _postgres-implementation:
Implementation differences
==========================
The PostgreSQL Wire Protocol makes it easy to use many PostgreSQL compatible
tools and libraries directly with CrateDB. However, many of these tools assume
that they are talking to PostgreSQL specifically, and thus rely on SQL
extensions and idioms that are unique to PostgreSQL. Because of this, some
tools or libraries may not work with other SQL databases such as CrateDB.
CrateDB's SQL query engine enables real-time search & aggregations for online
analytic processing (OLAP) and business intelligence (BI), with the added
benefit of horizontal scalability. The use cases of CrateDB differ from those
of PostgreSQL, as CrateDB's specialized storage schema and query execution
engine address different needs (see :ref:`Clustering
`).
The features listed below cover the main differences in implementation and
dialect between CrateDB and PostgreSQL. A detailed comparison between CrateDB's
SQL dialect and standard SQL is outlined in :ref:`appendix-compatibility`.
.. _postgres-copy:
Copy operations
---------------
CrateDB does not support the distinct sub-protocol that is used to serve
``COPY`` operations and provides another implementation for transferring bulk
data using the :ref:`sql-copy-from` and :ref:`sql-copy-to` statements.
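For illustration, a minimal sketch using the ``quotes`` table with hypothetical
file locations::

COPY quotes FROM 'file:///tmp/import_data/quotes.json';
COPY quotes TO DIRECTORY '/tmp/export_data/';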
.. _postgres-types:
Data types
----------
.. _postgres-date-times:
Dates and times
'''''''''''''''
At the moment, CrateDB does not support ``TIME`` without a time zone.
Additionally, CrateDB does not support the ``INTERVAL`` input units
``MILLENNIUM``, ``CENTURY``, ``DECADE``, or ``MICROSECOND``.
.. _postgres-objects:
Objects
'''''''
The definition of structured values using ``JSON`` types, *composite types*,
or ``HSTORE`` is not supported. CrateDB alternatively allows the definition of
nested documents (of type :ref:`type-object`) that store fields containing any
supported CrateDB data type, including nested object types.
.. _postgres-arrays:
Arrays
''''''
.. _postgres-arrays-declare:
Declaration of arrays
.....................
While multidimensional arrays in PostgreSQL must have matching extents for
each dimension, CrateDB allows nested arrays of different lengths, as this
example shows::
cr> select [[1,2,3],[1,2]] from sys.cluster;
+---------------------+
| [[1, 2, 3], [1, 2]] |
+---------------------+
| [[1, 2, 3], [1, 2]] |
+---------------------+
SELECT 1 row in set (... sec)
.. _postgres-type-casts:
Type casts
''''''''''
CrateDB accepts the :ref:`data-types-casting` syntax for conversion of one data
type to another.
.. SEEALSO::
`PostgreSQL value expressions`_
:ref:`CrateDB value expressions `
.. _postgres-search:
Text search functions and operators
-----------------------------------
The :ref:`functions ` and :ref:`operators `
provided by PostgreSQL for :ref:`full-text search `
(see `PostgreSQL fulltext Search`_) are not compatible with those provided by
CrateDB.
If you are missing features, functions or dialect improvements and have a great
use case for it, let us know on `GitHub`_. We're always improving and extending
CrateDB and we love to hear feedback.
.. _GitHub: https://github.com/crate/crate
.. _pg_am: https://www.postgresql.org/docs/14/catalog-pg-am.html
.. _pg_description: https://www.postgresql.org/docs/14/catalog-pg-description.html
.. _pg_enum: https://www.postgresql.org/docs/14/catalog-pg-enum.html
.. _pg_range: https://www.postgresql.org/docs/14/catalog-pg-range.html
.. _pg_roles: https://www.postgresql.org/docs/14/view-pg-roles.html
.. _pg_auth_members: https://www.postgresql.org/docs/17/catalog-pg-auth-members.html
.. _pg_tables: https://www.postgresql.org/docs/14/view-pg-tables.html
.. _pg_tablespace: https://www.postgresql.org/docs/14/catalog-pg-tablespace.html
.. _pg_views: https://www.postgresql.org/docs/14/view-pg-views.html
.. _pg_shdescription: https://www.postgresql.org/docs/14/catalog-pg-shdescription.html
.. _pg_stats: https://www.postgresql.org/docs/14/view-pg-stats.html
.. _pg_event_trigger: https://www.postgresql.org/docs/current/catalog-pg-event-trigger.html
.. _pg_depend: https://www.postgresql.org/docs/current/catalog-pg-depend.html
.. _pgjdbc: https://github.com/pgjdbc/pgjdbc
.. _pgsql_pg_attrdef: https://www.postgresql.org/docs/14/catalog-pg-attrdef.html
.. _pgsql_pg_attribute: https://www.postgresql.org/docs/14/catalog-pg-attribute.html
.. _pgsql_pg_class: https://www.postgresql.org/docs/14/catalog-pg-class.html
.. _pgsql_pg_constraint: https://www.postgresql.org/docs/14/catalog-pg-constraint.html
.. _pgsql_pg_cursors: https://www.postgresql.org/docs/15/view-pg-cursors.html
.. _pgsql_pg_database: https://www.postgresql.org/docs/14/catalog-pg-database.html
.. _pgsql_pg_index: https://www.postgresql.org/docs/14/catalog-pg-index.html
.. _pgsql_pg_indexes: https://www.postgresql.org/docs/14/view-pg-indexes.html
.. _pgsql_pg_locks: https://www.postgresql.org/docs/14/view-pg-locks.html
.. _pgsql_pg_matviews: https://www.postgresql.org/docs/current/view-pg-matviews.html
.. _pgsql_pg_namespace: https://www.postgresql.org/docs/14/catalog-pg-namespace.html
.. _pgsql_pg_proc: https://www.postgresql.org/docs/14/catalog-pg-proc.html
.. _pgsql_pg_publication: https://www.postgresql.org/docs/14/catalog-pg-publication.html
.. _pgsql_pg_publication_tables: https://www.postgresql.org/docs/14/view-pg-publication-tables.html
.. _pgsql_pg_subscription: https://www.postgresql.org/docs/14/catalog-pg-subscription.html
.. _pgsql_pg_subscription_rel: https://www.postgresql.org/docs/14/catalog-pg-subscription-rel.html
.. _pgsql_pg_settings: https://www.postgresql.org/docs/14/view-pg-settings.html
.. _pgsql_pg_type: https://www.postgresql.org/docs/14/catalog-pg-type.html
.. _PostgreSQL Arrays: https://www.postgresql.org/docs/14/static/arrays.html
.. _PostgreSQL extended query: https://www.postgresql.org/docs/14/protocol-flow.html
.. _PostgreSQL Fulltext Search: https://www.postgresql.org/docs/14/functions-textsearch.html
.. _PostgreSQL JDBC connection failover: https://jdbc.postgresql.org/documentation/use/#connection-fail-over
.. _PostgreSQL JDBC Query docs: https://jdbc.postgresql.org/documentation/query
.. _PostgreSQL simple query: https://www.postgresql.org/docs/14/protocol-flow.html
.. _PostgreSQL value expressions: https://www.postgresql.org/docs/14/sql-expressions.html
.. _PostgreSQL wire protocol v3: https://www.postgresql.org/docs/14/protocol.html
.. _PostgreSQL cancelling requests: https://www.postgresql.org/docs/14/protocol-flow.html#id-1.10.5.7.10

.. highlight:: psql
.. _information_schema:
==================
Information schema
==================
``information_schema`` is a special schema that contains virtual tables which
are read-only and can be queried to get information about the state of the
cluster.
Access
======
When user management is enabled, accessing ``information_schema`` is open to
all users and does not require any privileges.
However, being able to query ``information_schema`` tables does not allow a
user to retrieve all rows in a table, as it can contain information related to
tables over which the connected user does not have any privileges. Only the
rows the user is allowed to access are returned.
For example, if the user ``john`` has any privilege on the ``doc.books`` table
but no privilege at all on ``doc.locations``, when ``john`` issues a ``SELECT *
FROM information_schema.tables`` statement, the tables information related to
the ``doc.locations`` table will not be returned.
.. NOTE::
During a rolling upgrade of the cluster to a newer version, while the
cluster is in a mixed state with nodes on the older and on the new version,
avoid querying the ``sys`` tables using ``SELECT *``, as new columns could
have been added, removed or modified between versions. Instead, use a
defined list of the columns that you need to return from the query.
Virtual tables
==============
.. _information_schema_tables:
``tables``
----------
The ``information_schema.tables`` virtual table can be queried to get a list of
all available tables and views and their settings, such as number of shards or
number of replicas.
.. hide: CREATE VIEW::
cr> CREATE VIEW galaxies AS
... SELECT id, name, description FROM locations WHERE kind = 'Galaxy';
CREATE OK, 1 row affected (... sec)
.. hide: CREATE TABLE::
cr> create table partitioned_table (
... id bigint,
... title text,
... date timestamp with time zone
... ) partitioned by (date);
CREATE OK, 1 row affected (... sec)
::
cr> SELECT table_schema, table_name, table_type, number_of_shards, number_of_replicas
... FROM information_schema.tables
... ORDER BY table_schema ASC, table_name ASC;
+--------------------+-----------------------------------+------------+------------------+--------------------+
| table_schema | table_name | table_type | number_of_shards | number_of_replicas |
+--------------------+-----------------------------------+------------+------------------+--------------------+
| doc | galaxies | VIEW | NULL | NULL |
| doc | locations | BASE TABLE | 2 | 0 |
| doc | partitioned_table | BASE TABLE | 4 | 0-1 |
| doc | quotes | BASE TABLE | 2 | 0 |
| information_schema | administrable_role_authorizations | BASE TABLE | NULL | NULL |
| information_schema | applicable_roles | BASE TABLE | NULL | NULL |
| information_schema | character_sets | BASE TABLE | NULL | NULL |
| information_schema | columns | BASE TABLE | NULL | NULL |
| information_schema | enabled_roles | BASE TABLE | NULL | NULL |
| information_schema | foreign_server_options | BASE TABLE | NULL | NULL |
| information_schema | foreign_servers | BASE TABLE | NULL | NULL |
| information_schema | foreign_table_options | BASE TABLE | NULL | NULL |
| information_schema | foreign_tables | BASE TABLE | NULL | NULL |
| information_schema | key_column_usage | BASE TABLE | NULL | NULL |
| information_schema | referential_constraints | BASE TABLE | NULL | NULL |
| information_schema | role_table_grants | BASE TABLE | NULL | NULL |
| information_schema | routines | BASE TABLE | NULL | NULL |
| information_schema | schemata | BASE TABLE | NULL | NULL |
| information_schema | sql_features | BASE TABLE | NULL | NULL |
| information_schema | table_constraints | BASE TABLE | NULL | NULL |
| information_schema | table_partitions | BASE TABLE | NULL | NULL |
| information_schema | tables | BASE TABLE | NULL | NULL |
| information_schema | user_mapping_options | BASE TABLE | NULL | NULL |
| information_schema | user_mappings | BASE TABLE | NULL | NULL |
| information_schema | views | BASE TABLE | NULL | NULL |
| pg_catalog | pg_am | BASE TABLE | NULL | NULL |
| pg_catalog | pg_attrdef | BASE TABLE | NULL | NULL |
| pg_catalog | pg_attribute | BASE TABLE | NULL | NULL |
| pg_catalog | pg_auth_members | BASE TABLE | NULL | NULL |
| pg_catalog | pg_class | BASE TABLE | NULL | NULL |
| pg_catalog | pg_constraint | BASE TABLE | NULL | NULL |
| pg_catalog | pg_cursors | BASE TABLE | NULL | NULL |
| pg_catalog | pg_database | BASE TABLE | NULL | NULL |
| pg_catalog | pg_depend | BASE TABLE | NULL | NULL |
| pg_catalog | pg_description | BASE TABLE | NULL | NULL |
| pg_catalog | pg_enum | BASE TABLE | NULL | NULL |
| pg_catalog | pg_event_trigger | BASE TABLE | NULL | NULL |
| pg_catalog | pg_index | BASE TABLE | NULL | NULL |
| pg_catalog | pg_indexes | BASE TABLE | NULL | NULL |
| pg_catalog | pg_locks | BASE TABLE | NULL | NULL |
| pg_catalog | pg_matviews | BASE TABLE | NULL | NULL |
| pg_catalog | pg_namespace | BASE TABLE | NULL | NULL |
| pg_catalog | pg_proc | BASE TABLE | NULL | NULL |
| pg_catalog | pg_publication | BASE TABLE | NULL | NULL |
| pg_catalog | pg_publication_tables | BASE TABLE | NULL | NULL |
| pg_catalog | pg_range | BASE TABLE | NULL | NULL |
| pg_catalog | pg_roles | BASE TABLE | NULL | NULL |
| pg_catalog | pg_settings | BASE TABLE | NULL | NULL |
| pg_catalog | pg_shdescription | BASE TABLE | NULL | NULL |
| pg_catalog | pg_stats | BASE TABLE | NULL | NULL |
| pg_catalog | pg_subscription | BASE TABLE | NULL | NULL |
| pg_catalog | pg_subscription_rel | BASE TABLE | NULL | NULL |
| pg_catalog | pg_tables | BASE TABLE | NULL | NULL |
| pg_catalog | pg_tablespace | BASE TABLE | NULL | NULL |
| pg_catalog | pg_type | BASE TABLE | NULL | NULL |
| pg_catalog | pg_views | BASE TABLE | NULL | NULL |
| sys | allocations | BASE TABLE | NULL | NULL |
| sys | checks | BASE TABLE | NULL | NULL |
| sys | cluster | BASE TABLE | NULL | NULL |
| sys | cluster_health | BASE TABLE | NULL | NULL |
| sys | health | BASE TABLE | NULL | NULL |
| sys | jobs | BASE TABLE | NULL | NULL |
| sys | jobs_log | BASE TABLE | NULL | NULL |
| sys | jobs_metrics | BASE TABLE | NULL | NULL |
| sys | node_checks | BASE TABLE | NULL | NULL |
| sys | nodes | BASE TABLE | NULL | NULL |
| sys | operations | BASE TABLE | NULL | NULL |
| sys | operations_log | BASE TABLE | NULL | NULL |
| sys | privileges | BASE TABLE | NULL | NULL |
| sys | repositories | BASE TABLE | NULL | NULL |
| sys | roles | BASE TABLE | NULL | NULL |
| sys | segments | BASE TABLE | NULL | NULL |
| sys | sessions | BASE TABLE | NULL | NULL |
| sys | shards | BASE TABLE | NULL | NULL |
| sys | snapshot_restore | BASE TABLE | NULL | NULL |
| sys | snapshots | BASE TABLE | NULL | NULL |
| sys | summits | BASE TABLE | NULL | NULL |
| sys | users | BASE TABLE | NULL | NULL |
+--------------------+-----------------------------------+------------+------------------+--------------------+
SELECT 78 rows in set (... sec)
The table also contains additional information such as the specified
:ref:`routing column ` and :ref:`partition columns
`::
cr> SELECT table_name, clustered_by, partitioned_by
... FROM information_schema.tables
... WHERE table_schema = 'doc'
... ORDER BY table_schema ASC, table_name ASC;
+-------------------+--------------+----------------+
| table_name | clustered_by | partitioned_by |
+-------------------+--------------+----------------+
| galaxies | NULL | NULL |
| locations | id | NULL |
| partitioned_table | _id | ["date"] |
| quotes | id | NULL |
+-------------------+--------------+----------------+
SELECT 4 rows in set (... sec)
.. rubric:: Schema
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| Name | Description | Data Type |
+==================================+====================================================================================+=============+
| ``blobs_path`` | The data path of the blob table | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``closed`` | The state of the table | ``BOOLEAN`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``clustered_by`` | The :ref:`routing column ` used to cluster the table | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``column_policy`` | Defines whether the table uses a ``STRICT`` or a ``DYNAMIC`` :ref:`column_policy` | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``number_of_replicas`` | The number of replicas the table currently has | ``INTEGER`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``number_of_shards`` | The number of shards the table is currently distributed across | ``INTEGER`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``partitioned_by`` | The :ref:`partition columns ` (used to partition the | ``TEXT`` |
| | table) | |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``reference_generation`` | Specifies how values in the self-referencing column are generated | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``routing_hash_function`` | The name of the hash function used for internal :ref:`routing ` | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``self_referencing_column_name`` | The name of the column that uniquely identifies each row (always ``_id``) | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``settings``                    | Table settings; see :ref:`sql-create-table-with`                                   | ``OBJECT``  |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``table_catalog`` | Refers to the ``table_schema`` | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``table_name`` | The name of the table | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``table_schema`` | The name of the schema the table belongs to | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``table_type`` | The type of the table (``BASE TABLE`` for tables, ``VIEW`` for views) | ``TEXT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
| ``version`` | A collection of version numbers relevant to the table | ``OBJECT`` |
+----------------------------------+------------------------------------------------------------------------------------+-------------+
``settings``
............
Table settings specify configuration parameters for tables. Some settings can
be changed at cluster runtime while others are only applied on cluster restart.
The list of table settings in :ref:`sql-create-table-with` provides detailed
information about each parameter.
Table parameters can be set with ``CREATE TABLE`` when a table is created.
With ``ALTER TABLE`` they can be changed on already existing tables.
The following statement creates a new table, sets the refresh interval of
shards to 500 ms, and restricts the :ref:`shard allocation `
to primary shards only::
cr> create table parameterized_table (id integer, content text)
... with ("refresh_interval"=500, "routing.allocation.enable"='primaries');
CREATE OK, 1 row affected (... sec)
The settings can be verified by querying ``information_schema.tables``::
cr> select settings['routing']['allocation']['enable'] as alloc_enable,
... settings['refresh_interval'] as refresh_interval
... from information_schema.tables
... where table_name='parameterized_table';
+--------------+------------------+
| alloc_enable | refresh_interval |
+--------------+------------------+
| primaries | 500 |
+--------------+------------------+
SELECT 1 row in set (... sec)
On existing tables this needs to be done with the ``ALTER TABLE`` statement::
cr> alter table parameterized_table
... set ("routing.allocation.enable"='none');
ALTER OK, -1 rows affected (... sec)
.. hide:
cr> drop table parameterized_table;
DROP OK, 1 row affected (... sec)
``views``
---------
The table ``information_schema.views`` contains the name, definition and
options of all available views.
::
cr> SELECT table_schema, table_name, view_definition
... FROM information_schema.views
... ORDER BY table_schema ASC, table_name ASC;
+--------------+------------+-------------------------+
| table_schema | table_name | view_definition |
+--------------+------------+-------------------------+
| doc | galaxies | SELECT |
| | | "id" |
| | | , "name" |
| | | , "description" |
| | | FROM "locations" |
| | | WHERE "kind" = 'Galaxy' |
+--------------+------------+-------------------------+
SELECT 1 row in set (... sec)
.. rubric:: Schema
+---------------------+-------------------------------------------------------------------------------------+-------------+
| Name | Description | Data Type |
+=====================+=====================================================================================+=============+
| ``table_catalog`` | The catalog of the table of the view (refers to ``table_schema``) | ``TEXT`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``table_schema`` | The schema of the table of the view | ``TEXT`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``table_name`` | The name of the table of the view | ``TEXT`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``view_definition`` | The SELECT statement that defines the view | ``TEXT`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``check_option``   | Not applicable for CrateDB, always returns ``NONE``                                 | ``TEXT``    |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``is_updatable`` | Whether the view is updatable. Not applicable for CrateDB, always returns ``FALSE`` | ``BOOLEAN`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
| ``owner`` | The user that created the view | ``TEXT`` |
+---------------------+-------------------------------------------------------------------------------------+-------------+
.. note::
If you drop the table of a view, the view will still exist and show up in
the ``information_schema.tables`` and ``information_schema.views`` tables.
.. hide:
cr> DROP view galaxies;
DROP OK, 1 row affected (... sec)
.. _information_schema_columns:
``columns``
-----------
This table can be queried to get a list of all available columns of all tables
and views and their definition like data type and ordinal position inside the
table::
cr> select table_name, column_name, ordinal_position as pos, data_type
... from information_schema.columns
... where table_schema = 'doc' and table_name not like 'my_table%'
... order by table_name asc, column_name asc;
+-------------------+--------------------------------+-----+--------------------------+
| table_name | column_name | pos | data_type |
+-------------------+--------------------------------+-----+--------------------------+
| locations | date | 3 | timestamp with time zone |
| locations | description | 6 | text |
| locations | id | 1 | integer |
| locations | information | 11 | object_array |
| locations | information['evolution_level'] | 13 | smallint |
| locations | information['population'] | 12 | bigint |
| locations | inhabitants | 7 | object |
| locations | inhabitants['description'] | 9 | text |
| locations | inhabitants['interests'] | 8 | text_array |
| locations | inhabitants['name'] | 10 | text |
| locations | kind | 4 | text |
| locations | landmarks | 14 | text_array |
| locations | name | 2 | text |
| locations | position | 5 | integer |
| partitioned_table | date | 3 | timestamp with time zone |
| partitioned_table | id | 1 | bigint |
| partitioned_table | title | 2 | text |
| quotes | id | 1 | integer |
| quotes | quote | 2 | text |
+-------------------+--------------------------------+-----+--------------------------+
SELECT 19 rows in set (... sec)
You can even query this table's own columns (attention: this might lead to
infinite recursion of your mind, beware!)::
cr> select column_name, data_type, ordinal_position
... from information_schema.columns
... where table_schema = 'information_schema'
... and table_name = 'columns' order by column_name asc;
+--------------------------+------------+------------------+
| column_name | data_type | ordinal_position |
+--------------------------+------------+------------------+
| character_maximum_length | integer | 1 |
| character_octet_length | integer | 2 |
| character_set_catalog | text | 3 |
| character_set_name | text | 4 |
| character_set_schema | text | 5 |
| check_action | integer | 6 |
| check_references | text | 7 |
| collation_catalog | text | 8 |
| collation_name | text | 9 |
| collation_schema | text | 10 |
| column_default | text | 11 |
| column_details | object | 12 |
| column_details['name'] | text | 13 |
| column_details['oid'] | bigint | 14 |
| column_details['path'] | text_array | 15 |
| column_details['policy'] | text | 16 |
| column_name | text | 17 |
| data_type | text | 18 |
| datetime_precision | integer | 19 |
| domain_catalog | text | 20 |
| domain_name | text | 21 |
| domain_schema | text | 22 |
| generation_expression | text | 23 |
| identity_cycle | boolean | 24 |
| identity_generation | text | 25 |
| identity_increment | text | 26 |
| identity_maximum | text | 27 |
| identity_minimum | text | 28 |
| identity_start | text | 29 |
| interval_precision | integer | 30 |
| interval_type | text | 31 |
| is_generated | text | 32 |
| is_identity | boolean | 33 |
| is_nullable | text | 34 |
| numeric_precision | integer | 35 |
| numeric_precision_radix | integer | 36 |
| numeric_scale | integer | 37 |
| ordinal_position | integer | 38 |
| table_catalog | text | 39 |
| table_name | text | 40 |
| table_schema | text | 41 |
| udt_catalog | text | 42 |
| udt_name | text | 43 |
| udt_schema | text | 44 |
+--------------------------+------------+------------------+
SELECT 44 rows in set (... sec)
.. rubric:: Schema
+-------------------------------+-----------------------------------------------+---------------+
| Name | Description | Data Type |
+===============================+===============================================+===============+
| ``table_catalog`` | Refers to the ``table_schema`` | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``table_schema`` | Schema name containing the table | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``table_name`` | Table Name | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_name`` | Column Name | ``TEXT`` |
| | For fields in object columns this is not an | |
| | identifier but a path and therefore must not | |
| | be double quoted when programmatically | |
| | obtained. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``ordinal_position`` | The position of the column within the | ``INTEGER`` |
| | table | |
+-------------------------------+-----------------------------------------------+---------------+
| ``is_nullable`` | 'YES' if the column is nullable, 'NO' | ``TEXT`` |
| | if it's not nullable | |
+-------------------------------+-----------------------------------------------+---------------+
| ``data_type`` | The data type of the column | ``TEXT`` |
| | | |
| | For further information see :ref:`data-types` | |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_default`` | The default :ref:`expression | ``TEXT`` |
| | ` of the column | |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_details`` | Contains CrateDB specific column information | ``OBJECT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_details['name']`` | The top-level name of any nested column. | ``TEXT`` |
| | If the column is not a child of an ``OBJECT`` | |
| | it's equal to ``column_name``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_details['oid']`` | The internal identifier of a column, unique | ``LONG`` |
|                               | within the same table. Used mostly internally |               |
|                               | and may change in future versions.            |               |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_details['path']`` | The child path of a nested column. | ``TEXT_ARRAY``|
| | Empty list for any non-nested column. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``column_details['policy']`` | The column policy, mostly interesting on | ``TEXT`` |
| | columns of type ``OBJECT``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``character_maximum_length`` | If the data type is a :ref:`character type | ``INTEGER`` |
| | ` then return the | |
| | declared length limit; otherwise ``NULL``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``character_octet_length`` | Not implemented (always returns ``NULL``) | ``INTEGER`` |
| | | |
| | Please refer to :ref:`type-text` type | |
+-------------------------------+-----------------------------------------------+---------------+
| ``numeric_precision`` | Indicates the number of significant digits | ``INTEGER`` |
| | for a numeric ``data_type``. For all other | |
| | data types this column is ``NULL``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``numeric_precision_radix`` | Indicates in which base the value in the | ``INTEGER`` |
| | column ``numeric_precision`` for a numeric | |
| | ``data_type`` is exposed. This can either be | |
| | 2 (binary) or 10 (decimal). For all other | |
| | data types this column is ``NULL``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``numeric_scale`` | Not implemented (always returns ``NULL``) | ``INTEGER`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``datetime_precision`` | Contains the fractional seconds precision for | ``INTEGER`` |
| | a ``timestamp`` ``data_type``. For all other | |
| | data types this column is ``null``. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``interval_type`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``interval_precision`` | Not implemented (always returns ``NULL``) | ``INTEGER`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``character_set_catalog`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``character_set_schema`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``character_set_name`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``collation_catalog`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``collation_schema`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``collation_name`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``domain_catalog`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``domain_schema`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``domain_name`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``udt_catalog`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``udt_schema`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``udt_name`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``check_references`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``check_action`` | Not implemented (always returns ``NULL``) | ``INTEGER`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``generation_expression``    | The expression used to generate a column.     | ``TEXT``      |
| | If the column is not generated ``NULL`` is | |
| | returned. | |
+-------------------------------+-----------------------------------------------+---------------+
| ``is_generated``             | Returns ``ALWAYS`` if the column is           | ``TEXT``      |
|                              | generated, ``NEVER`` otherwise.               |               |
+-------------------------------+-----------------------------------------------+---------------+
| ``is_identity`` | Not implemented (always returns ``false``) | ``BOOLEAN`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_cycle`` | Not implemented (always returns ``NULL``) | ``BOOLEAN`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_generation`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_increment`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_maximum`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_minimum`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
| ``identity_start`` | Not implemented (always returns ``NULL``) | ``TEXT`` |
+-------------------------------+-----------------------------------------------+---------------+
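The ``column_details`` object is handy when working with nested columns. A
query along these lines (a sketch reusing the ``locations`` table from the
examples above; output omitted) returns the top-level name and child path of
each column:

.. code-block:: psql

    SELECT column_name,
           column_details['name'] AS top_level_name,
           column_details['path'] AS child_path
    FROM information_schema.columns
    WHERE table_schema = 'doc'
      AND table_name = 'locations'
    ORDER BY ordinal_position;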
.. _information_schema_table_constraints:
``table_constraints``
---------------------
This table can be queried to get a list of all defined table constraints, their
type, name and which table they are defined in.
.. NOTE::
Currently ``PRIMARY KEY`` and ``CHECK`` constraints are supported. ``NOT NULL``
column constraints are listed as ``CHECK`` constraints.
.. hide:
cr> create table tbl (col TEXT NOT NULL);
CREATE OK, 1 row affected (... sec)
::
cr> select table_schema, table_name, constraint_name, constraint_type as type
... from information_schema.table_constraints
... where table_name = 'tables'
... or table_name = 'quotes'
... or table_name = 'documents'
... or table_name = 'tbl'
... order by table_schema desc, table_name asc limit 10;
+--------------------+------------+------------------------+-------------+
| table_schema | table_name | constraint_name | type |
+--------------------+------------+------------------------+-------------+
| information_schema | tables | tables_pkey | PRIMARY KEY |
| doc | quotes | quotes_pkey | PRIMARY KEY |
| doc | quotes | doc_quotes_id_not_null | CHECK |
| doc | tbl | doc_tbl_col_not_null | CHECK |
+--------------------+------------+------------------------+-------------+
SELECT 4 rows in set (... sec)
.. _information_schema_key_column_usage:
``key_column_usage``
--------------------
This table may be queried to retrieve primary key information from all user
tables:
.. hide:
cr> create table students (id bigint, department integer, name text, primary key(id, department));
CREATE OK, 1 row affected (... sec)
::
cr> select constraint_name, table_name, column_name, ordinal_position
... from information_schema.key_column_usage
... where table_name = 'students';
+-----------------+------------+-------------+------------------+
| constraint_name | table_name | column_name | ordinal_position |
+-----------------+------------+-------------+------------------+
| students_pkey | students | id | 1 |
| students_pkey | students | department | 2 |
+-----------------+------------+-------------+------------------+
SELECT 2 rows in set (... sec)
.. rubric:: Schema
+-------------------------+-------------------------------------------------------------------------+-------------+
| Name | Description | Data Type |
+=========================+=========================================================================+=============+
| ``constraint_catalog`` | Refers to ``table_catalog`` | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``constraint_schema`` | Refers to ``table_schema`` | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``constraint_name`` | Name of the constraint | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``table_catalog`` | Refers to ``table_schema`` | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``table_schema`` | Name of the schema that contains the table that contains the constraint | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``table_name`` | Name of the table that contains the constraint | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``column_name`` | Name of the column that contains the constraint | ``TEXT`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
| ``ordinal_position`` | Position of the column within the constraint (starts with 1) | ``INTEGER`` |
+-------------------------+-------------------------------------------------------------------------+-------------+
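Since ``table_constraints`` and ``key_column_usage`` share the constraint name
and schema, they can be joined to list all primary key columns per table. A
minimal sketch (output omitted):

.. code-block:: psql

    SELECT tc.table_name,
           kcu.column_name,
           kcu.ordinal_position
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
      ON tc.constraint_name = kcu.constraint_name
     AND tc.table_schema = kcu.table_schema
    WHERE tc.constraint_type = 'PRIMARY KEY'
    ORDER BY tc.table_name, kcu.ordinal_position;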
.. _is_table_partitions:
``table_partitions``
--------------------
This table can be queried to get information about all :ref:`partitioned tables
`. Each partition of a table is represented as one row
containing the table name, schema name, partition ident, and the values of the
partition. ``values`` is a key-value object with the
:ref:`partition column ` (or columns) as key(s) and the
corresponding value as value(s).
.. hide:
cr> create table a_partitioned_table (id integer, content text)
... partitioned by (content);
CREATE OK, 1 row affected (... sec)
::
cr> insert into a_partitioned_table (id, content) values (1, 'content_a');
INSERT OK, 1 row affected (... sec)
::
cr> alter table a_partitioned_table set (number_of_shards=5);
ALTER OK, -1 rows affected (... sec)
::
cr> insert into a_partitioned_table (id, content) values (2, 'content_b');
INSERT OK, 1 row affected (... sec)
The following example shows a table where the column ``content`` of table
``a_partitioned_table`` has been used to partition the table. The table has two
partitions, which were created when rows with ``content`` values ``content_a``
and ``content_b`` were inserted::
cr> select table_name, table_schema as schema, partition_ident, "values"
... from information_schema.table_partitions
... order by table_name, partition_ident;
+---------------------+--------+--------------------+--------------------------+
| table_name | schema | partition_ident | values |
+---------------------+--------+--------------------+--------------------------+
| a_partitioned_table | doc | 04566rreehimst2vc4 | {"content": "content_a"} |
| a_partitioned_table | doc | 04566rreehimst2vc8 | {"content": "content_b"} |
+---------------------+--------+--------------------+--------------------------+
SELECT 2 rows in set (... sec)
The second partition was created after the number of shards for future
partitions had been changed on the partitioned table, so it shows ``5``
instead of ``4``::
cr> select table_name, partition_ident,
... number_of_shards, number_of_replicas
... from information_schema.table_partitions
... order by table_name, partition_ident;
+---------------------+--------------------+------------------+--------------------+
| table_name | partition_ident | number_of_shards | number_of_replicas |
+---------------------+--------------------+------------------+--------------------+
| a_partitioned_table | 04566rreehimst2vc4 | 4 | 0-1 |
| a_partitioned_table | 04566rreehimst2vc8 | 5 | 0-1 |
+---------------------+--------------------+------------------+--------------------+
SELECT 2 rows in set (... sec)
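Because ``values`` is a regular object column, it can also be used to filter
for specific partitions. For example (a sketch assuming the partition column
``content`` from the example above; output omitted):

.. code-block:: psql

    SELECT table_name, partition_ident
    FROM information_schema.table_partitions
    WHERE table_name = 'a_partitioned_table'
      AND "values"['content'] = 'content_a';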
``routines``
------------
The routines table contains tokenizers, token-filters, char-filters, custom
analyzers created by ``CREATE ANALYZER`` statements (see
:ref:`sql-ddl-custom-analyzer`), and :ref:`functions `
created by ``CREATE FUNCTION`` statements::
cr> select routine_name, routine_type
... from information_schema.routines
... group by routine_name, routine_type
... order by routine_name asc limit 5;
+----------------------+--------------+
| routine_name | routine_type |
+----------------------+--------------+
| PathHierarchy | TOKENIZER |
| apostrophe | TOKEN_FILTER |
| arabic | ANALYZER |
| arabic_normalization | TOKEN_FILTER |
| arabic_stem | TOKEN_FILTER |
+----------------------+--------------+
SELECT 5 rows in set (... sec)
For example you can use this table to list existing tokenizers like this::
cr> select routine_name
... from information_schema.routines
... where routine_type='TOKENIZER'
... order by routine_name asc limit 10;
+----------------+
| routine_name |
+----------------+
| PathHierarchy |
| char_group |
| classic |
| edge_ngram |
| keyword |
| letter |
| lowercase |
| ngram |
| path_hierarchy |
| pattern |
+----------------+
SELECT 10 rows in set (... sec)
Or get an overview of how many routines and routine types are available::
cr> select count(*), routine_type
... from information_schema.routines
... group by routine_type
... order by routine_type;
+-------+--------------+
| count | routine_type |
+-------+--------------+
| 45 | ANALYZER |
| 3 | CHAR_FILTER |
| 16 | TOKENIZER |
| 61 | TOKEN_FILTER |
+-------+--------------+
SELECT 4 rows in set (... sec)
.. rubric:: Schema
+--------------------+-------------+
| Name | Data Type |
+====================+=============+
| routine_name | ``TEXT`` |
+--------------------+-------------+
| routine_type | ``TEXT`` |
+--------------------+-------------+
| routine_body | ``TEXT`` |
+--------------------+-------------+
| routine_schema | ``TEXT`` |
+--------------------+-------------+
| data_type | ``TEXT`` |
+--------------------+-------------+
| is_deterministic | ``BOOLEAN`` |
+--------------------+-------------+
| routine_definition | ``TEXT`` |
+--------------------+-------------+
| specific_name | ``TEXT`` |
+--------------------+-------------+
:routine_name:
Name of the routine (might be duplicated in case of overloading)
:routine_type:
Type of the routine.
Can be ``FUNCTION``, ``ANALYZER``, ``CHAR_FILTER``, ``TOKENIZER``
or ``TOKEN_FILTER``.
:routine_schema:
The schema where the routine was defined.
If it doesn't apply, then ``NULL``.
:routine_body:
The language used for the routine implementation.
If it doesn't apply, then ``NULL``.
:data_type:
The return type of the function.
If it doesn't apply, then ``NULL``.
:is_deterministic:
If the routine is deterministic then ``True``, else ``False`` (``NULL`` if
it doesn't apply).
:routine_definition:
The function definition (``NULL`` if it doesn't apply).
:specific_name:
Used to uniquely identify the function in a schema, even if the function is
overloaded. Currently the specific name contains the types of the function
arguments. As the format might change in the future, it should be only used
to compare it to other instances of ``specific_name``.
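For example, user-defined functions created via ``CREATE FUNCTION`` can be
listed together with their definition by filtering on ``routine_type`` (a
sketch; the result depends on the functions defined in your cluster):

.. code-block:: psql

    SELECT routine_schema, routine_name, data_type, routine_definition
    FROM information_schema.routines
    WHERE routine_type = 'FUNCTION';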
.. _schemata:
``schemata``
------------
The schemata table lists all existing schemas. The ``blob``,
``information_schema``, and ``sys`` schemas are always available. The ``doc``
schema is available after the first user table is created.
::
cr> select schema_name from information_schema.schemata order by schema_name;
+--------------------+
| schema_name |
+--------------------+
| blob |
| doc |
| information_schema |
| pg_catalog |
| sys |
+--------------------+
SELECT 5 rows in set (... sec)
.. _sql_features:
``sql_features``
----------------
The ``sql_features`` table outlines supported and unsupported SQL features of
CrateDB based on the current SQL standard (see :ref:`sql_supported_features`)::
cr> select feature_name, is_supported, sub_feature_id, sub_feature_name
... from information_schema.sql_features
... where feature_id='F501';
+--------------------------------+--------------+----------------+--------------------+
| feature_name | is_supported | sub_feature_id | sub_feature_name |
+--------------------------------+--------------+----------------+--------------------+
| Features and conformance views | FALSE | | |
| Features and conformance views | TRUE | 1 | SQL_FEATURES view |
| Features and conformance views | FALSE | 2 | SQL_SIZING view |
| Features and conformance views | FALSE | 3 | SQL_LANGUAGES view |
+--------------------------------+--------------+----------------+--------------------+
SELECT 4 rows in set (... sec)
+------------------+-----------+----------+
| Name | Data Type | Nullable |
+==================+===========+==========+
| feature_id | ``TEXT`` | NO |
+------------------+-----------+----------+
| feature_name | ``TEXT`` | NO |
+------------------+-----------+----------+
| sub_feature_id | ``TEXT`` | NO |
+------------------+-----------+----------+
| sub_feature_name | ``TEXT`` | NO |
+------------------+-----------+----------+
| is_supported | ``TEXT`` | NO |
+------------------+-----------+----------+
| is_verified_by | ``TEXT`` | YES |
+------------------+-----------+----------+
| comments | ``TEXT`` | YES |
+------------------+-----------+----------+
:feature_id:
Identifier of the feature
:feature_name:
Descriptive name of the feature by the Standard
:sub_feature_id:
Identifier of the sub feature; if it has zero length, the row describes a
feature rather than a sub feature
:sub_feature_name:
Descriptive name of the sub feature by the Standard; if it has zero length,
the row describes a feature rather than a sub feature
:is_supported:
``YES`` if the feature is fully supported by the current version of
CrateDB, ``NO`` if not
:is_verified_by:
Identifies the conformance test used to verify the claim;
Always ``NULL`` since the CrateDB development group does not perform formal
testing of feature conformance
:comments:
Either ``NULL`` or shows a comment about the supported status of the
feature
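To get a quick overview of the conformance level, the features can be
aggregated by support status (a sketch; the counts depend on the CrateDB
version):

.. code-block:: psql

    SELECT is_supported, count(*) AS features
    FROM information_schema.sql_features
    GROUP BY is_supported;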
.. _character_sets:
``character_sets``
------------------
The ``character_sets`` table identifies the character sets available in the
current database.
In CrateDB there is always a single entry listing ``UTF8``::
cr> SELECT character_set_name, character_repertoire FROM information_schema.character_sets;
+--------------------+----------------------+
| character_set_name | character_repertoire |
+--------------------+----------------------+
| UTF8 | UCS |
+--------------------+----------------------+
SELECT 1 row in set (... sec)
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``character_set_catalog``
- ``TEXT``
- Not implemented, this column is always null.
* - ``character_set_schema``
- ``TEXT``
- Not implemented, this column is always null.
* - ``character_set_name``
- ``TEXT``
- Name of the character set.
* - ``character_repertoire``
- ``TEXT``
- Character repertoire.
* - ``form_of_use``
- ``TEXT``
- Character encoding form, same as ``character_set_name``.
* - ``default_collate_catalog``
- ``TEXT``
- Name of the database containing the default collation (Always ``crate``).
* - ``default_collate_schema``
- ``TEXT``
- Name of the schema containing the default collation (Always ``NULL``).
* - ``default_collate_name``
- ``TEXT``
- Name of the default collation (Always ``NULL``).
.. _foreign_servers:
``foreign_servers``
-------------------
Lists foreign servers created using :ref:`ref-create-server`.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``foreign_server_catalog``
- ``TEXT``
- Name of the database of the foreign server. Always ``crate``.
* - ``foreign_server_name``
- ``TEXT``
- Name of the foreign server.
* - ``foreign_data_wrapper_catalog``
- ``TEXT``
- Name of the database that contains the foreign-data wrapper. Always
``crate``.
* - ``foreign_data_wrapper_name``
- ``TEXT``
- Name of the foreign-data wrapper used by the foreign server.
* - ``foreign_server_type``
- ``TEXT``
- Foreign server type information. Always ``null``.
* - ``foreign_server_version``
- ``TEXT``
- Foreign server version information. Always ``null``.
* - ``authorization_identifier``
- ``TEXT``
- Name of the user who created the server.
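For example, the defined servers and their foreign-data wrappers can be listed
like this (a sketch; output omitted):

.. code-block:: psql

    SELECT foreign_server_name,
           foreign_data_wrapper_name,
           authorization_identifier
    FROM information_schema.foreign_servers;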
.. _foreign_server_options:
``foreign_server_options``
--------------------------
Lists options of foreign servers created using :ref:`ref-create-server`.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``foreign_server_catalog``
- ``TEXT``
- Name of the database that the foreign server is defined in. Always ``crate``.
* - ``foreign_server_name``
- ``TEXT``
- Name of the foreign server.
* - ``option_name``
- ``TEXT``
- Name of an option.
* - ``option_value``
- ``TEXT``
- Value of the option cast to string.
.. _foreign_tables:
``foreign_tables``
------------------
Lists foreign tables created using :ref:`ref-create-foreign-table`.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``foreign_table_catalog``
- ``TEXT``
- Name of the database where the foreign table is defined in. Always
``crate``.
* - ``foreign_table_schema``
- ``TEXT``
- Name of the schema that contains the foreign table.
* - ``foreign_table_name``
- ``TEXT``
- Name of the foreign table.
* - ``foreign_server_catalog``
- ``TEXT``
- Name of the database where the foreign server is defined in. Always
``crate``.
* - ``foreign_server_name``
- ``TEXT``
- Name of the foreign server.
.. _foreign_table_options:
``foreign_table_options``
-------------------------
Lists options for foreign tables created using :ref:`ref-create-foreign-table`.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``foreign_table_catalog``
- ``TEXT``
- Name of the database that contains the foreign table. Always ``crate``.
* - ``foreign_table_schema``
- ``TEXT``
- Name of the schema that contains the foreign table.
* - ``foreign_table_name``
- ``TEXT``
- Name of the foreign table.
* - ``option_name``
- ``TEXT``
- Name of an option.
* - ``option_value``
- ``TEXT``
- Value of the option cast to string.
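For example, to inspect the options of all foreign tables in a given schema (a
sketch; the schema name ``doc`` is illustrative):

.. code-block:: psql

    SELECT foreign_table_name, option_name, option_value
    FROM information_schema.foreign_table_options
    WHERE foreign_table_schema = 'doc';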
.. _user_mappings:
``user_mappings``
-----------------
Lists user mappings created for foreign servers.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``authorization_identifier``
- ``TEXT``
- Name of the user being mapped.
* - ``foreign_server_catalog``
- ``TEXT``
- Name of the database of the foreign server. Always ``crate``.
* - ``foreign_server_name``
- ``TEXT``
- Name of the foreign server for this user mapping.
.. _user_mapping_options:
``user_mapping_options``
------------------------
Lists the options for user mappings created for foreign servers.
See :ref:`administration-fdw`.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``authorization_identifier``
- ``TEXT``
- Name of the user being mapped.
* - ``foreign_server_catalog``
- ``TEXT``
- Name of the database of the foreign server. Always ``crate``.
* - ``foreign_server_name``
- ``TEXT``
- Name of the foreign server for this user mapping.
* - ``option_name``
- ``TEXT``
- Name of an option.
* - ``option_value``
- ``TEXT``
- Value of the option. The value is visible only to the user being mapped
and to superusers; otherwise it shows as ``NULL``.
.. _administrable_role_authorizations:
``administrable_role_authorizations``
-------------------------------------
Lists all the roles that the current user has ``AL`` privileges for.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``grantee``
- ``TEXT``
- Name of the role to which this role was granted. Can be either the
current user or a different role in case of nested memberships.
* - ``role_name``
- ``TEXT``
- Name of the role.
* - ``is_grantable``
- ``BOOLEAN``
- Always ``TRUE``.
.. _applicable_roles:
``applicable_roles``
--------------------
Lists all the roles that are applicable for the current user.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``grantee``
- ``TEXT``
- Name of the role to which this role was granted.
* - ``role_name``
- ``TEXT``
- Name of the role.
* - ``is_grantable``
- ``BOOLEAN``
- ``TRUE`` if the grantee has ``AL`` privilege, else ``FALSE``.
.. _enabled_roles:
``enabled_roles``
-----------------
Lists all the roles the current user has, directly or indirectly (inherited).
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``role_name``
- ``TEXT``
- Name of the role.
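For example, to check which roles are in effect for the current session (a
sketch; output omitted):

.. code-block:: psql

    SELECT role_name
    FROM information_schema.enabled_roles;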
.. _role_table_grants:
``role_table_grants``
---------------------
Lists all the privileges granted on tables or views where the grantor
or grantee is a currently enabled role.
.. list-table::
:header-rows: 1
* - Column Name
- Return Type
- Description
* - ``grantor``
- ``TEXT``
- Name of the role that granted this privilege.
* - ``grantee``
- ``TEXT``
- Name of the role that this privilege was granted to.
* - ``table_catalog``
- ``TEXT``
- Name of the database that contains the table. Always ``crate``.
* - ``table_schema``
- ``TEXT``
- Name of the schema that contains the table.
* - ``table_name``
- ``TEXT``
- Name of the table.
* - ``privilege_type``
- ``TEXT``
- Type of the privilege that was granted. See :ref:`privilege_types` for a
list of possible values.
* - ``is_grantable``
- ``BOOLEAN``
- Whether this privilege can be granted to another user or not. ``TRUE`` if
the current role has ``AL`` privilege.
* - ``with_hierarchy``
- ``BOOLEAN``
- Defines if the privilege contains a separate (sub-)privilege allowing
certain operations on table inheritance hierarchies. CrateDB does not
support this, thus it is always ``FALSE``.

.. _administration_user_management:
==========================
Users and roles management
==========================
User and role account information is stored in the cluster metadata of CrateDB.
The following statements create, alter, and drop users and roles:
* `CREATE USER`_
* `CREATE ROLE`_
* `ALTER USER`_ or `ALTER ROLE`_
* `DROP USER`_ or `DROP ROLE`_
These statements are database management statements that can be invoked by
superusers that already exist in the CrateDB cluster. The `CREATE USER`_,
`CREATE ROLE`_, `DROP USER`_ and `DROP ROLE`_ statements can also be invoked by
users with the ``AL`` privilege. `ALTER USER`_ or `ALTER ROLE`_ can be invoked
by users to change their own password, without requiring any privilege.
When CrateDB is started, the cluster contains one predefined superuser. This
user is called ``crate``. It is not possible to create any other superusers.
The definition of all users and roles, including hashes of their passwords,
together with their :ref:`privileges ` is backed up
together with the cluster's metadata when a snapshot is created, and it is
restored when using the ``ALL``, ``METADATA``, or ``USERMANAGEMENT`` keywords
with the :ref:`sql-restore-snapshot` command.
``ROLES``
---------
Roles are entities that are **not** allowed to log in, but they can be
assigned privileges and can be granted to other roles (creating a role
hierarchy) or directly to users. For example, a role ``myschema_dql_role`` can
be granted ``DQL`` privileges on schema ``myschema`` and afterwards the role
can be :ref:`granted ` to a user, which will automatically
:ref:`inherit ` those privileges from
``myschema_dql_role``. A role ``myschema_dml_role`` can be granted ``DML``
privileges on schema ``myschema`` and can also be granted the role
``myschema_dql_role``, thus gaining ``DQL`` privileges as well. When
``myschema_dml_role`` is granted to a user, this user will automatically have
both ``DQL`` and ``DML`` privileges on ``myschema``.
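The scenario described above could be set up like this (a sketch; the schema
``myschema`` and the user ``some_user`` are illustrative names):

.. code-block:: psql

    CREATE ROLE myschema_dql_role;
    GRANT DQL ON SCHEMA myschema TO myschema_dql_role;

    CREATE ROLE myschema_dml_role;
    GRANT DML ON SCHEMA myschema TO myschema_dml_role;

    -- myschema_dml_role inherits DQL via the granted role
    GRANT myschema_dql_role TO myschema_dml_role;

    -- some_user now has both DQL and DML on myschema
    CREATE USER some_user;
    GRANT myschema_dml_role TO some_user;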
``CREATE ROLE``
===============
To create a new role for the CrateDB database cluster use the
:ref:`ref-create-role` SQL statement::
cr> CREATE ROLE role_a;
CREATE OK, 1 row affected (... sec)
.. TIP::
Newly created roles do not have any privileges. After creating a role, you
should :ref:`configure user privileges `.
For example, to grant all privileges to the ``role_a`` role, run::
cr> GRANT ALL PRIVILEGES TO role_a;
GRANT OK, 4 rows affected (... sec)
.. hide:
cr> REVOKE ALL PRIVILEGES FROM role_a;
REVOKE OK, 4 rows affected (... sec)
The name parameter of the statement follows the principles of an identifier,
which means that it must be double-quoted if it contains special characters
(e.g. whitespace) or if the case needs to be maintained::
cr> CREATE ROLE "Custom Role";
CREATE OK, 1 row affected (... sec)
If a role or user with the name specified in the SQL statement already exists,
the statement returns an error::
cr> CREATE ROLE "Custom Role";
RoleAlreadyExistsException[Role 'Custom Role' already exists]
.. hide:
cr> DROP ROLE "Custom Role";
DROP OK, 1 row affected (... sec)
``ALTER ROLE``
==============
The :ref:`ref-alter-role` and :ref:`ref-alter-user` SQL statements are only
supported for users, not for roles.
``DROP ROLE``
=============
.. hide:
cr> CREATE ROLE role_c;
CREATE OK, 1 row affected (... sec)
.. hide:
cr> CREATE ROLE role_d;
CREATE OK, 1 row affected (... sec)
To remove an existing role from the CrateDB database cluster use the
:ref:`ref-drop-role` or :ref:`ref-drop-user` SQL statement::
cr> DROP ROLE role_c;
DROP OK, 1 row affected (... sec)
::
cr> DROP USER role_d;
DROP OK, 1 row affected (... sec)
If a role with the name specified in the SQL statement does not exist, the
statement returns an error::
cr> DROP ROLE role_d;
RoleUnknownException[Role 'role_d' does not exist]
List roles
==========
.. hide:
cr> CREATE ROLE role_b;
CREATE OK, 1 row affected (... sec)
cr> CREATE ROLE role_c;
CREATE OK, 1 row affected (... sec)
cr> GRANT role_c TO role_b;
GRANT OK, 1 row affected (... sec)
CrateDB exposes database roles via the read-only :ref:`sys-roles` system table.
The ``sys.roles`` table shows all roles in the cluster which can be used to
group privileges.
To list all existing roles query the table::
cr> SELECT name, granted_roles FROM sys.roles order by name;
+--------+------------------------------------------+
| name | granted_roles |
+--------+------------------------------------------+
| role_a | [] |
| role_b | [{"grantor": "crate", "role": "role_c"}] |
| role_c | [] |
+--------+------------------------------------------+
SELECT 3 rows in set (... sec)
``USERS``
---------
``CREATE USER``
===============
To create a new user for the CrateDB database cluster use the
:ref:`ref-create-user` SQL statement::
cr> CREATE USER user_a;
CREATE OK, 1 row affected (... sec)
.. TIP::
Newly created users do not have any privileges. After creating a user, you
should :ref:`configure user privileges `.
For example, to grant all privileges to the ``user_a`` user, run::
cr> GRANT ALL PRIVILEGES TO user_a;
GRANT OK, 4 rows affected (... sec)
.. hide:
cr> REVOKE ALL PRIVILEGES FROM user_a;
REVOKE OK, 4 rows affected (... sec)
The new user can connect to the database cluster using the available
authentication methods. You can specify the user's password in the ``WITH``
clause of the ``CREATE`` statement. This is required if you want to use the
:ref:`auth_password`::
cr> CREATE USER user_b WITH (password = 'a_secret_password');
CREATE OK, 1 row affected (... sec)
The username parameter of the statement follows the principles of an
identifier, which means that it must be double-quoted if it contains special
characters (e.g. whitespace) or if the case needs to be maintained::
cr> CREATE USER "Custom User";
CREATE OK, 1 row affected (... sec)
If a user with the username specified in the SQL statement already exists,
the statement returns an error::
cr> CREATE USER "Custom User";
RoleAlreadyExistsException[Role 'Custom User' already exists]
.. hide:
cr> DROP USER "Custom User";
DROP OK, 1 row affected (... sec)
.. _administration_user_management_alter_user:
``ALTER USER``
==============
To alter the password for an existing user from the CrateDB database cluster use
the :ref:`ref-alter-role` or :ref:`ref-alter-user` SQL statements::
cr> ALTER USER user_a SET (password = 'pass');
ALTER OK, 1 row affected (... sec)
The password can be reset (cleared) if specified as ``NULL``::
cr> ALTER USER user_a SET (password = NULL);
ALTER OK, 1 row affected (... sec)
.. NOTE::
The built-in superuser ``crate`` has no password and it is not possible to
set a new password for this user.
To add or alter :ref:`session settings ` use the following SQL
statement::
cr> ALTER USER user_b SET (search_path = 'myschema', statement_timeout = '10m');
ALTER OK, 1 row affected (... sec)
To reset a :ref:`session setting ` to its default value use the
following SQL statement::
cr> ALTER USER user_b RESET statement_timeout;
ALTER OK, 1 row affected (... sec)
.. hide:
cr> ALTER USER user_a SET (search_path = 'new_schema', statement_timeout = '1h');
ALTER OK, 1 row affected (... sec)
To reset all modified :ref:`session settings ` for a user to their
default values, use the following SQL statement::
cr> ALTER USER user_a RESET ALL;
ALTER OK, 1 row affected (... sec)
``DROP USER``
=============
.. hide:
cr> CREATE USER user_c;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER user_d;
CREATE OK, 1 row affected (... sec)
To remove an existing user from the CrateDB database cluster use the
:ref:`ref-drop-role` or :ref:`ref-drop-user` SQL statements::
cr> DROP USER user_c;
DROP OK, 1 row affected (... sec)
::
cr> DROP ROLE user_d;
DROP OK, 1 row affected (... sec)
If a user with the username specified in the SQL statement does not exist,
the statement returns an error::
cr> DROP USER user_d;
RoleUnknownException[Role 'user_d' does not exist]
.. NOTE::
It is not possible to drop the built-in superuser ``crate``.
List users
==========
.. hide:
cr> GRANT role_a, role_b TO user_a;
GRANT OK, 2 rows affected (... sec)
CrateDB exposes database users via the read-only :ref:`sys-users` system table.
The ``sys.users`` table shows all users in the cluster which can be used for
authentication. The initial superuser ``crate``, which is available for all
CrateDB clusters, is also part of that list.
To list all existing users query the table::
cr> SELECT name, granted_roles, password, session_settings, superuser FROM sys.users order by name;
+--------+----------------------------------------------------------------------------------+----------+-----------------------------+-----------+
| name | granted_roles | password | session_settings | superuser |
+--------+----------------------------------------------------------------------------------+----------+-----------------------------+-----------+
| crate | [] | NULL | {} | TRUE |
| user_a | [{"grantor": "crate", "role": "role_a"}, {"grantor": "crate", "role": "role_b"}] | NULL | {} | FALSE |
| user_b | [] | ******** | {"search_path": "myschema"} | FALSE |
+--------+----------------------------------------------------------------------------------+----------+-----------------------------+-----------+
SELECT 3 rows in set (... sec)
.. NOTE::
CrateDB also supports retrieving the current connected user using the
:ref:`system information functions `: :ref:`CURRENT_USER
`, :ref:`USER ` and :ref:`SESSION_USER
`.
.. vale off
.. Drop Users & Roles
.. hide:
cr> DROP USER user_a;
DROP OK, 1 row affected (... sec)
cr> DROP USER user_b;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_a;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_b;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_c;
DROP OK, 1 row affected (... sec)

.. highlight:: psql
.. _administration-privileges:
==========
Privileges
==========
To execute statements, a user needs to have the required privileges.
.. _privileges-intro:
Introduction
============
CrateDB has a superuser (``crate``) which has the privilege to do anything. The
privileges of other users and roles have to be managed using the ``GRANT``,
``DENY`` or ``REVOKE`` statements.
The privileges that can be granted, denied or revoked are:
- ``DQL``
- ``DML``
- ``DDL``
- ``AL``
Skip to :ref:`privilege_types` for details.
.. _privileges-classes:
Privilege Classes
=================
The privileges can be granted on different classes:
- ``CLUSTER``
- ``SCHEMA``
- ``TABLE`` and ``VIEW``
Skip to :ref:`hierarchical_privileges_inheritance` for details.
A user with ``AL`` on the ``CLUSTER`` level can also grant the privileges they
hold to other users or roles.
.. _privilege_types:
Privilege types
===============
``DQL``
.......
Granting the ``Data Query Language (DQL)`` privilege to a user or role
indicates that this user/role is allowed to execute ``SELECT``, ``SHOW``,
``REFRESH`` and ``COPY TO`` statements, as well as to use the available
:ref:`user-defined functions `, on the object for which
the privilege applies.
``DML``
.......
Granting the ``Data Manipulation Language (DML)`` privilege to a user or role
indicates that this user/role is allowed to execute ``INSERT``, ``COPY FROM``,
``UPDATE`` and ``DELETE`` statements on the object for which the privilege
applies.
``DDL``
.......
Granting the ``Data Definition Language (DDL)`` privilege to a user or role
indicates that this user/role is allowed to execute the following statements
on objects for which the privilege applies:
- ``CREATE TABLE``
- ``DROP TABLE``
- ``CREATE VIEW``
- ``DROP VIEW``
- ``CREATE FUNCTION``
- ``DROP FUNCTION``
- ``CREATE REPOSITORY``
- ``DROP REPOSITORY``
- ``CREATE SNAPSHOT``
- ``DROP SNAPSHOT``
- ``RESTORE SNAPSHOT``
- ``ALTER TABLE``
``AL``
......
Granting the ``Administration Language (AL)`` privilege to a user or role
enables the user/role to execute the following statements:
- ``CREATE USER/ROLE``
- ``DROP USER/ROLE``
- ``SET GLOBAL``
All statements enabled via the ``AL`` privilege operate on the cluster level,
so granting it on a schema or table level will have no effect.
.. _hierarchical_privileges_inheritance:
Hierarchical inheritance of privileges
======================================
.. vale off
.. hide:
cr> CREATE USER riley;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER kala;
CREATE OK, 1 row affected (... sec)
cr> CREATE TABLE IF NOT EXISTS doc.accounting (
... id integer primary key,
... name text,
... joined timestamp with time zone
... ) clustered by (id);
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO doc.accounting
... (id, name, joined)
... VALUES (1, 'Jon', 0);
INSERT OK, 1 row affected (... sec)
cr> REFRESH TABLE doc.accounting;
REFRESH OK, 1 row affected (... sec)
.. vale on
Privileges can be managed on three different levels, namely: ``CLUSTER``,
``SCHEMA``, and ``TABLE``/``VIEW``.
When a privilege is assigned on a certain level, the privilege will propagate
down the hierarchy. Privileges defined on a lower level will always override
those from a higher level:
.. code-block:: none
cluster
||
schema
/ \
table view
This statement will grant ``DQL`` privilege to user ``riley`` on all the tables
and :ref:`functions ` of the ``doc`` schema::
cr> GRANT DQL ON SCHEMA doc TO riley;
GRANT OK, 1 row affected (... sec)
This statement will deny ``DQL`` privilege to user ``riley`` on the ``doc``
schema table ``doc.accounting``. However, ``riley`` will still have ``DQL``
privilege on all the other tables of the ``doc`` schema::
cr> DENY DQL ON TABLE doc.accounting TO riley;
DENY OK, 1 row affected (... sec)
.. NOTE::
In CrateDB, schemas are just namespaces that are created and dropped
implicitly. Therefore, when ``GRANT``, ``DENY`` or ``REVOKE`` are invoked
on a schema level, CrateDB takes the schema name provided without further
validation.
Privileges can be managed on all schemas and tables of the cluster,
except the ``information_schema``.
Views are on the same hierarchy level as tables, i.e. a privilege on a view
is gained through a ``GRANT`` on either the view itself, the schema the view
belongs to, or a cluster-wide privilege. Privileges on relations which are
referenced in the view do not grant any privileges on the view itself.
Conversely, even if the user/role does not have any privileges on a view's
referenced relations but does have them on the view itself, the user/role can
still access the relations through the view. For example::
cr> CREATE VIEW first_customer AS SELECT * FROM doc.accounting ORDER BY id LIMIT 1;
CREATE OK, 1 row affected (... sec)
Previously we issued a ``DENY`` for user ``riley`` on ``doc.accounting``, but
the table can still be accessed through the view, because the ``DQL`` grant
on the ``doc`` schema also covers the view::
cr> SELECT id from first_customer;
+----+
| id |
+----+
| 1 |
+----+
SELECT 1 row in set (... sec)
.. SEEALSO::
:ref:`Views: Privileges `
Behavior of ``GRANT``, ``DENY`` and ``REVOKE``
==============================================
.. NOTE::
You can only grant, deny, or revoke privileges for an existing user or role.
You must first :ref:`create a user/role `
and then configure privileges.
``GRANT``
.........
.. hide:
cr> CREATE USER wolfgang;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER will;
CREATE OK, 1 row affected (... sec)
cr> CREATE TABLE IF NOT EXISTS doc.books (
... first_column integer primary key,
... second_column text);
CREATE OK, 1 row affected (... sec)
To grant a privilege to an existing user or role on the whole cluster,
we use the :ref:`ref-grant` SQL statement, for example::
cr> GRANT DML TO wolfgang;
GRANT OK, 1 row affected (... sec)
``DQL`` privilege can be granted on the ``sys`` schema to user ``wolfgang``,
like this::
cr> GRANT DQL ON SCHEMA sys TO wolfgang;
GRANT OK, 1 row affected (... sec)
The following statement will grant all privileges on table ``doc.books`` to
user ``wolfgang``::
cr> GRANT ALL PRIVILEGES ON TABLE doc.books TO wolfgang;
GRANT OK, 4 rows affected (... sec)
Using ``ALL PRIVILEGES`` is a shortcut to grant all the :ref:`currently grantable
privileges ` to a user or role.
.. NOTE::
If no schema is specified in the table ``ident``, the table will be
looked up in the current schema.
If a user/role with the name specified in the SQL statement does not exist the
statement returns an error::
cr> GRANT DQL TO layla;
RoleUnknownException[Role 'layla' does not exist]
To grant ``ALL PRIVILEGES`` to user will on the cluster, we can use the
following syntax::
cr> GRANT ALL PRIVILEGES TO will;
GRANT OK, 4 rows affected (... sec)
Using ``ALL PRIVILEGES`` is a shortcut to grant all the currently grantable
privileges to a user or role, namely ``DQL``, ``DML``, ``DDL`` and ``AL``.
Privileges can be granted to multiple users/roles in the same statement, like
so::
cr> GRANT DDL ON TABLE doc.books TO wolfgang, will;
GRANT OK, 1 row affected (... sec)
``DENY``
........
To deny a privilege to an existing user or role on the whole cluster, use the
:ref:`ref-deny` SQL statement, for example::
cr> DENY DDL TO will;
DENY OK, 1 row affected (... sec)
``DQL`` privilege can be denied on the ``sys`` schema to user ``wolfgang`` like
this::
cr> DENY DQL ON SCHEMA sys TO wolfgang;
DENY OK, 1 row affected (... sec)
The following statement will deny ``DQL`` privilege on table ``doc.books`` to user
``wolfgang``::
cr> DENY DQL ON TABLE doc.books TO wolfgang;
DENY OK, 1 row affected (... sec)
``DENY ALL`` or ``DENY ALL PRIVILEGES`` will deny all privileges to a user or
role. On the cluster level it can be used like this::
cr> DENY ALL TO will;
DENY OK, 3 rows affected (... sec)
``REVOKE``
..........
To revoke a privilege that was previously granted or denied to a user or role,
use the :ref:`ref-revoke` SQL statement. For example, the ``DQL`` privilege
that was previously denied to user ``wolfgang`` on the ``sys`` schema can be
revoked like this::
cr> REVOKE DQL ON SCHEMA sys FROM wolfgang;
REVOKE OK, 1 row affected (... sec)
The privileges that were granted and denied to user ``wolfgang`` on ``doc.books``
can be revoked like this::
cr> REVOKE ALL ON TABLE doc.books FROM wolfgang;
REVOKE OK, 4 rows affected (... sec)
The privileges that were granted to user ``will`` on the cluster can be revoked
like this::
cr> REVOKE ALL FROM will;
REVOKE OK, 4 rows affected (... sec)
.. NOTE::
The ``REVOKE`` statement can remove only privileges that have been granted
or denied through the ``GRANT`` or ``DENY`` statements. If the privilege
on a specific object was not explicitly granted, the ``REVOKE`` statement
has no effect. The effect of the ``REVOKE`` statement will be reflected
in the row count.
.. NOTE::
When a privilege is revoked from a user or role, it can still be active for
that user/role if the user/role :ref:`inherits ` it
from another role.
List privileges
===============
CrateDB exposes the privileges of users and roles of the database through the
:ref:`sys.privileges ` system table.
By querying the ``sys.privileges`` table you can get all information about
the existing privileges. For example::
cr> SELECT * FROM sys.privileges ORDER BY grantee, class, ident;
+---------+----------+---------+----------------+-------+------+
| class | grantee | grantor | ident | state | type |
+---------+----------+---------+----------------+-------+------+
| SCHEMA | riley | crate | doc | GRANT | DQL |
| TABLE | riley | crate | doc.accounting | DENY | DQL |
| TABLE | will | crate | doc.books | GRANT | DDL |
| CLUSTER | wolfgang | crate | NULL | GRANT | DML |
+---------+----------+---------+----------------+-------+------+
SELECT 4 rows in set (... sec)
.. hide:
cr> DROP USER riley;
DROP OK, 1 row affected (... sec)
cr> DROP USER kala;
DROP OK, 1 row affected (... sec)
cr> DROP TABLE IF EXISTS doc.accounting;
DROP OK, 1 row affected (... sec)
cr> DROP USER wolfgang;
DROP OK, 1 row affected (... sec)
cr> DROP USER will;
DROP OK, 1 row affected (... sec)
cr> DROP TABLE IF EXISTS doc.books;
DROP OK, 1 row affected (... sec)
cr> DROP VIEW first_customer;
DROP OK, 1 row affected (... sec)
.. _roles_inheritance:
Roles inheritance
=================
.. hide:
cr> CREATE USER john;
CREATE OK, 1 row affected (... sec)
cr> CREATE ROLE role_a;
CREATE OK, 1 row affected (... sec)
cr> CREATE ROLE role_b;
CREATE OK, 1 row affected (... sec)
cr> CREATE ROLE role_c;
CREATE OK, 1 row affected (... sec)
Introduction
............
You can grant or revoke roles for an existing user or role. This allows you
to group granted or denied privileges and have other users or roles inherit
them. You must first :ref:`create users and roles `
and then grant roles to other roles or users. You can configure the privileges
of each role before or after granting roles to other roles or users.
.. NOTE::
Roles can be granted to other roles or users, but users (roles which can
also login to the database) cannot be granted to other roles or users.
.. NOTE::
Superuser ``crate`` cannot be granted to other users or roles, and roles
cannot be granted to it.
Inheritance
...........
The inheritance can span multiple levels, so you can have ``role_a`` which is
granted to ``role_b``, which in turn is granted to ``role_c``, and so on. Each
role can be granted to multiple other roles and each role or user can be granted
multiple other roles. Cycles cannot be created, for example::
cr> GRANT role_a TO role_b;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT role_b TO role_c;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT role_c TO role_a;
SQLParseException[Cannot grant role role_c to role_a, role_a is a parent role of role_c and a cycle will be created]
.. hide:
cr> REVOKE role_b FROM role_c;
REVOKE OK, 1 row affected (... sec)
cr> REVOKE role_a FROM role_b;
REVOKE OK, 1 row affected (... sec)
Privileges resolution
.....................
When a user executes a statement, the privileges mechanism first checks
whether the user has been granted the required privileges. If not, it checks
whether the roles granted to the user have those privileges, and if not, it
continues with the roles granted to those parent roles, and so on.
For example::
cr> GRANT role_a TO role_b;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT role_b TO role_c;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT DQL ON TABLE sys.users TO role_a;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT role_c TO john;
GRANT OK, 1 row affected (... sec)
User ``john`` is able to query ``sys.users``: even though he lacks the ``DQL``
privilege on the table, he is granted ``role_c``, which in turn is granted
``role_b``, which is granted ``role_a``, and ``role_a`` has the ``DQL``
privilege on ``sys.users``.
.. hide:
cr> REVOKE role_c FROM john;
REVOKE OK, 1 row affected (... sec)
cr> REVOKE role_b FROM role_c;
REVOKE OK, 1 row affected (... sec)
cr> REVOKE role_a FROM role_b;
REVOKE OK, 1 row affected (... sec)
cr> REVOKE DQL ON TABLE sys.users FROM role_a;
REVOKE OK, 1 row affected (... sec)
Keep in mind that ``DENY`` has precedence over ``GRANT``. If a role has been
both granted and denied a privilege (directly or through role inheritance), then
``DENY`` will take effect. For example, ``GRANT`` is inherited from a role
and ``DENY`` directly set on the user::
cr> GRANT DQL ON TABLE sys.users TO role_a;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT role_a TO john;
GRANT OK, 1 row affected (... sec)
::
cr> DENY DQL ON TABLE sys.users TO john;
DENY OK, 1 row affected (... sec)
User ``john`` cannot query ``sys.users``.
.. hide:
cr> REVOKE role_a FROM john;
REVOKE OK, 1 row affected (... sec)
cr> REVOKE DQL ON TABLE sys.users FROM role_a;
REVOKE OK, 1 row affected (... sec)
Another example with ``DENY`` in effect, inherited from a role::
cr> GRANT DQL ON TABLE sys.users TO role_a;
GRANT OK, 1 row affected (... sec)
::
cr> DENY DQL ON TABLE sys.users TO role_b;
DENY OK, 1 row affected (... sec)
::
cr> GRANT role_a, role_b TO john;
GRANT OK, 2 rows affected (... sec)
User ``john`` cannot query ``sys.users``.
.. hide:
cr> DROP USER john;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_c;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_b;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_a;
DROP OK, 1 row affected (... sec)
.. _granting_roles:
``GRANT``
.........
.. hide:
cr> CREATE ROLE role_dql;
CREATE OK, 1 row affected (... sec)
cr> CREATE ROLE role_all_on_books;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER wolfgang;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER will;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER layla;
CREATE OK, 1 row affected (... sec)
cr> CREATE TABLE IF NOT EXISTS doc.books (
... first_column integer primary key,
... second_column text);
CREATE OK, 1 row affected (... sec)
To grant an existing role to an existing user or role on the whole cluster,
we use the :ref:`ref-grant` SQL statement, for example::
cr> GRANT role_dql TO wolfgang;
GRANT OK, 1 row affected (... sec)
``DQL`` privilege can be granted on the ``sys`` schema to role ``role_dql``
and so, by inheritance, to user ``wolfgang`` as well, like this::
cr> GRANT DQL ON SCHEMA sys TO role_dql;
GRANT OK, 1 row affected (... sec)
The following statements will grant all privileges on table ``doc.books`` to role
``role_all_on_books``, and by inheritance to user ``wolfgang`` as well::
cr> GRANT role_all_on_books TO wolfgang;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT ALL PRIVILEGES ON TABLE doc.books TO role_all_on_books;
GRANT OK, 4 rows affected (... sec)
If a role with the name specified in the SQL statement does not exist the
statement returns an error::
cr> GRANT DDL TO role_ddl;
RoleUnknownException[Role 'role_ddl' does not exist]
Multiple roles can be granted to multiple users/roles in the same statement,
like so::
cr> GRANT role_dql, role_all_on_books TO layla, will;
GRANT OK, 4 rows affected (... sec)
Notice that ``4 rows affected`` is returned: in total there are two users,
``will`` and ``layla``, and each of them is granted two roles, ``role_dql``
and ``role_all_on_books``.
``REVOKE``
..........
To revoke a role that was previously granted to a user or role, use the
:ref:`ref-revoke` SQL statement. For example, role ``role_dql``, which was
previously granted to users ``wolfgang``, ``layla`` and ``will``, can be
revoked like this::
cr> REVOKE role_dql FROM wolfgang, layla, will;
REVOKE OK, 3 rows affected (... sec)
If a privilege is revoked from a role which is granted to other roles or
users, the privilege is automatically revoked for those roles and users as
well. For example, if we revoke privileges on table ``doc.books`` from
``role_all_on_books``::
cr> REVOKE ALL PRIVILEGES ON TABLE doc.books FROM role_all_on_books;
REVOKE OK, 4 rows affected (... sec)
user ``wolfgang``, who is granted the role ``role_all_on_books``, also loses
those privileges.
.. hide:
cr> CREATE ROLE role_dml;
CREATE OK, 1 row affected (... sec)
cr> CREATE USER john;
CREATE OK, 1 row affected (... sec)
If a user is granted the same privilege by inheriting it from two different
roles, the user keeps the privilege when one of the roles is revoked. For
example, if user ``john`` gets granted ``role_dql`` and ``role_dml``::
cr> GRANT DQL TO role_dql;
GRANT OK, 1 row affected (... sec)
::
cr> GRANT DQL, DML TO role_dml;
GRANT OK, 2 rows affected (... sec)
::
cr> GRANT role_dql, role_dml TO john;
GRANT OK, 2 rows affected (... sec)
and then we revoke ``role_dql`` from ``john``::
cr> REVOKE role_dql FROM john;
REVOKE OK, 1 row affected (... sec)
``john`` still has ``DQL`` privilege since it inherits it from ``role_dml``
which is still granted to him.
.. hide:
cr> DROP USER wolfgang;
DROP OK, 1 row affected (... sec)
cr> DROP USER will;
DROP OK, 1 row affected (... sec)
cr> DROP USER layla;
DROP OK, 1 row affected (... sec)
cr> DROP USER john;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_dql;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_dml;
DROP OK, 1 row affected (... sec)
cr> DROP ROLE role_all_on_books;
DROP OK, 1 row affected (... sec)
cr> DROP TABLE doc.books;
DROP OK, 1 row affected (... sec)
Multi-tenancy is an architecture in which different tenants share a single software instance. CrateDB does not support the creation of multiple databases and catalogs as some other solutions do (e.g., PostgreSQL). However, there are several ways to implement multi-tenancy in CrateDB, and, as is often the case, which one works best depends on a variety of factors and trade-offs. In this article, we will illustrate two methods for sharing a single CrateDB instance between multiple tenants.
# Schema-based multi-tenancy
In schema-based multi-tenancy, every tenant has its own database schema. CrateDB supports the creation of tables in different schemas ([Schemas - CrateDB Reference](https://crate.io/docs/crate/reference/en/latest/general/ddl/create-table.html#schemas)). The following statements illustrate the creation of two tables with different schemas:
```sql
CREATE TABLE "tenantA"."table1" (
  id int,
  name text
);

CREATE TABLE "tenantB"."table2" (
  id int,
  address text
);
```
In this example, we created the first table inside schema `tenantA` and the second table inside schema `tenantB`. Furthermore, access privileges can be administered on the `SCHEMA` level to restrict tenant users to their own schema.
Schema-based multi-tenancy has a couple of benefits:
* Schema changes are independent of other tenants inside CrateDB
* Less risk of data leakage due to data isolation
* Application code does not have to be tenant-aware
However, there are some drawbacks:
* More complexity, as this approach requires the creation of different schemas for different tenants
* Performance considerations, such as sharding and partitioning, need to be done for every tenant individually (depending on the expected data volume)
* Higher risk of querying the wrong schema
* Risk of getting close to the maximum number of shards ([cluster.max_shards_per_node](https://crate.io/docs/crate/reference/en/5.3/config/cluster.html#shard-limits)) if there is a significant number of tenants
# Table-based multi-tenancy
In table-based multi-tenancy, all data resides in the same table, but it’s separated by a discriminator column. In this case, each query needs a `WHERE` clause to select data based on the tenant context. The following example illustrates table creation with a separate `tenant` column.
```sql
CREATE TABLE "doc"."name" (
  id int,
  name text,
  price int,
  tenant text
);
```
Record-based access control is not possible in this scenario. However, you can create a `VIEW` that is restricted to a single tenant. Without views, data isolation must be guaranteed on the application level.
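As a minimal sketch (assuming the table above, with hypothetical view naming), such a tenant-scoped view could look like this:

```sql
-- Hypothetical sketch: a view that only exposes rows of a single tenant.
CREATE VIEW doc.tenant_a_view AS
SELECT id, name, price
FROM doc."name"
WHERE tenant = 'tenantA';
```

A tenant user can then be granted `DQL` on the view only, so the underlying table stays inaccessible to them.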
Table-based multi-tenancy has some benefits:
* The application doesn't need to worry about which schema it is connecting to
* There is only one schema to maintain
* Performance considerations are easier to make, as you don't need to differentiate between tenants with high and low data volume in your sharding and partitioning strategy
* Data can be shared across all tenants
Drawbacks are:
* Application code needs to be tenant-aware
* Schema changes affect all tenants
* Possible data leaks, as record-based access control is not possible
Finally, if you need full data isolation between different tenants, then you must run a separate CrateDB cluster for each tenant.
# Configuring access privileges with CrateDB
The privileges of CrateDB users are managed using the `GRANT`, `DENY` and `REVOKE` statements ([Privileges](https://crate.io/docs/crate/reference/en/4.8/admin/privileges.html)). CrateDB supports four different privilege types:
* Data Query Language (DQL)
* Data Manipulation Language (DML)
* Data Definition Language (DDL)
* Administration Language (AL)
These privileges can be granted on `CLUSTER`, `SCHEMA`, `TABLE`, and `VIEW` levels. In schema-based multi-tenancy, you can grant a user full privileges on schema `tenantA` using the following statement:
```sql
GRANT ALL PRIVILEGES ON SCHEMA "tenantA" TO tenantA_user1;
```
Similarly, in table-based multi-tenancy you can grant `DQL` privilege for a specific tenant view:
```sql
GRANT DQL ON VIEW tenantA_view TO tenantA_user1;
```
# Summary
This short article covered the main approaches to multi-tenancy with CrateDB: schema-based and table-based multi-tenancy. We also outlined the benefits and drawbacks of each approach; which one works best for you depends on your use case and goals. If you found this article interesting and want to learn more about CrateDB, visit our [official documentation](https://crate.io/docs/crate/reference/en/4.8/) and check our tutorials on [CrateDB Community](https://community.cratedb.com/).

.. _sql:
==========
SQL syntax
==========
You can use :ref:`Structured Query Language ` (SQL) to
query your data.
This section of the documentation provides a complete SQL syntax
reference for CrateDB.
.. NOTE::
For introductions to CrateDB functionality, we recommend you consult the
appropriate top-level section of the documentation. The SQL syntax reference
assumes a basic familiarity with the relevant parts of CrateDB.
.. SEEALSO::
:ref:`General use: Data definition `
:ref:`General use: Data manipulation `
:ref:`General use: Querying `
:ref:`General use: Built-in functions and operators `
.. toctree::
:maxdepth: 2
general/index
statements/index

.. highlight:: psql
.. _sql-create-table:
================
``CREATE TABLE``
================
Create a new table.
.. _sql-create-table-synopsis:
Synopsis
========
::
CREATE TABLE [ IF NOT EXISTS ] table_ident ( [
{
base_column_definition
| generated_column_definition
| table_constraint
}
[, ... ] ]
)
[ PARTITIONED BY (column_name [, ...] ) ]
[ CLUSTERED [ BY (routing_column) ] INTO num_shards SHARDS ]
[ WITH ( table_parameter [= value] [, ... ] ) ]
where ``base_column_definition``::
column_name data_type
[ DEFAULT default_expr ]
[ column_constraint [ ... ] ] [ storage_options ]
where ``generated_column_definition`` is::
column_name [ data_type ] [ GENERATED ALWAYS ]
AS [ ( ] generation_expression [ ) ]
[ column_constraint [ ... ] ]
where ``column_constraint`` is::
{ [ CONSTRAINT constraint_name ] PRIMARY KEY |
  NULL |
  NOT NULL |
  INDEX { OFF | USING { PLAIN |
                        FULLTEXT [ WITH ( analyzer = analyzer_name ) ] } } |
  [ CONSTRAINT constraint_name ] CHECK (boolean_expression)
}
where ``storage_options`` is::
STORAGE WITH ( option = value_expression [, ... ] )
and ``table_constraint`` is::
{ [ CONSTRAINT constraint_name ] PRIMARY KEY ( column_name [, ... ] ) |
  INDEX index_name USING FULLTEXT ( column_name [, ... ] )
    [ WITH ( analyzer = analyzer_name ) ] |
  [ CONSTRAINT constraint_name ] CHECK (boolean_expression)
}
.. _sql-create-table-description:
Description
===========
``CREATE TABLE`` will create a new, initially empty table.
If the ``table_ident`` does not contain a schema, the table is created in the
``doc`` schema. Otherwise it is created in the given schema, which is
implicitly created, if it didn't exist yet.
A table consists of one or more *base columns* and any number of *generated
columns* and/or *table constraints*.
The optional constraint clauses specify constraints (tests) that new or updated
rows must satisfy for an ``INSERT``, ``UPDATE`` or ``COPY FROM`` operation to
succeed. A constraint is an SQL object that helps define the set of valid
values in the table in various ways.
There are two ways to define constraints: table constraints and column
constraints. A column constraint is defined as part of a column definition. A
table constraint definition is not tied to a particular column, and it can
encompass more than one column. Every column constraint can also be written as
a table constraint; a column constraint is only a notational convenience for
use when the constraint only affects one column.
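For example, a single-column ``PRIMARY KEY`` can be written either as a column
constraint or as an equivalent table constraint. A minimal sketch, using
hypothetical tables::

    -- column constraint
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, name TEXT);

    -- equivalent table constraint
    CREATE TABLE t2 (id INTEGER, name TEXT, PRIMARY KEY (id));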
.. SEEALSO::
:ref:`Data definition: Creating tables `
.. _sql-create-table-elements:
Table elements
--------------
.. _sql-create-table-base-columns:
Base Columns
~~~~~~~~~~~~
A base column is a persistent column in the table metadata. In relational terms
it is an attribute of the tuple of the table-relation. It has a name, a type,
an optional default clause and optional constraints.
Base columns are readable and writable (if the table itself is writable).
Values for base columns are given in DML statements explicitly or omitted, in
which case their value is null.
.. _sql-create-table-default-clause:
Default clause
^^^^^^^^^^^^^^
The optional default clause defines the default value of the column. The value
is inserted when the column is a target of an ``INSERT``, ``UPDATE``, or
``COPY FROM`` statement that doesn't contain an explicit value for it.
The default clause :ref:`expression ` is variable-free, meaning that
subqueries and cross-references to other columns are not allowed.
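A minimal sketch, using a hypothetical table where ``created`` is filled
automatically when no explicit value is given::

    CREATE TABLE metrics (
      id INTEGER,
      value DOUBLE PRECISION,
      -- applied on INSERT/UPDATE/COPY FROM when no value is provided
      created TIMESTAMP WITH TIME ZONE DEFAULT now()
    );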
.. NOTE::
Default values are not allowed for columns of type ``OBJECT``::
cr> CREATE TABLE tbl (obj OBJECT DEFAULT {key='foo'})
SQLParseException[Default values are not allowed for object columns: obj]
They are allowed for sub-columns of an object column. If an object column
has at least one child with a default expression, the full object is created
implicitly, unless it is within an array.
An example::
cr> CREATE TABLE object_defaults (id int, obj OBJECT AS (key TEXT DEFAULT ''))
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO object_defaults (id) VALUES (1)
INSERT OK, 1 row affected (... sec)
cr> REFRESH TABLE object_defaults
REFRESH OK, 1 row affected (... sec)
cr> SELECT obj FROM object_defaults
+-------------+
| obj |
+-------------+
| {"key": ""} |
+-------------+
SELECT 1 row in set (... sec)
.. _sql-create-table-generated-columns:
Generated columns
~~~~~~~~~~~~~~~~~
A generated column is a persistent column that is computed as needed from the
``generation_expression`` for every ``INSERT``, ``UPDATE`` and ``COPY FROM``
operation.
The ``GENERATED ALWAYS`` part of the syntax is optional.
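A minimal sketch, using a hypothetical table where ``day`` is computed from
``ts`` on every write::

    CREATE TABLE events (
      ts TIMESTAMP WITH TIME ZONE,
      -- recomputed whenever "ts" is inserted or updated
      day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts)
    );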
.. NOTE::
A generated column is not a virtual column. The computed value is stored in
the table like a base column is. The automatic computation of the value is
what makes it different.
.. SEEALSO::
:ref:`Data definition: Generated columns `
.. NOTE::
Default columns and generated columns that are sub-columns of object columns
are generated when the parent object is re-assigned and are assigned to
``null`` when any of their parent objects is assigned to ``null``.
.. _sql-create-table-table-constraints:
Table constraints
~~~~~~~~~~~~~~~~~
Table constraints are constraints that are applied to more than one column or
to the table as a whole.
.. SEEALSO::
- :ref:`General SQL: Table constraints `
- :ref:`CHECK constraint `
.. _sql-create-table-column-constraints:
Column constraints
~~~~~~~~~~~~~~~~~~
Column constraints are constraints that are applied on each column of the table
separately.
.. SEEALSO::
- :ref:`General SQL: Column constraints `
- :ref:`CHECK constraint `
.. _sql-create-table-storage-options:
Storage options
~~~~~~~~~~~~~~~
Storage options can be applied on each column of the table separately.
.. SEEALSO::
:ref:`Data definition: Storage `
.. _sql-create-table-parameters:
Parameters
==========
:table_ident:
The name (optionally schema-qualified) of the table to be created.
:column_name:
The name of a column to be created in the new table.
:data_type:
The :ref:`data type ` of the column. This can include array and
object specifiers.
:generation_expression:
An :ref:`expression ` (usually a
:ref:`function call `) that is applied in the context of
the current row. As such, it can reference other base columns of the table.
Referencing other generated columns (including itself) is not supported. The
generation expression is :ref:`evaluated ` each time a row
is inserted or the referenced base columns are updated.
.. _sql-create-table-if-not-exists:
``IF NOT EXISTS``
=================
If the optional ``IF NOT EXISTS`` clause is used, this statement won't do
anything if the table exists already, and ``0`` rows will be returned.
.. _sql-create-table-clustered:
``CLUSTERED``
=============
The optional ``CLUSTERED`` clause specifies how a table should be distributed
across a cluster.
::
[ CLUSTERED [ BY (routing_column) ] INTO num_shards SHARDS ]
:num_shards:
Specifies the number of :ref:`shards ` a table is stored
in. Must be greater than 0. If not provided, the number of shards is
calculated based on the number of currently active data nodes with the
following formula::
num_shards = max(4, num_data_nodes * 2)
.. NOTE::
The minimum value of ``num_shards`` is ``4``. If the calculated value does
not exceed this minimum, the minimum value is applied to each table or
partition by default.
:routing_column:
Specify a :ref:`routing column ` that :ref:`determines
` how rows are sharded.
All rows that have the same ``routing_column`` row value are stored in the
same shard. If a :ref:`primary key ` has been
defined, it will be used as the default routing column, otherwise the
:ref:`internal document ID ` is used.
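For example, a minimal sketch with a hypothetical table that routes rows by
``user_id`` into six shards::

    CREATE TABLE hits (
      user_id INTEGER,
      url TEXT
    ) CLUSTERED BY (user_id) INTO 6 SHARDS;

All rows with the same ``user_id`` value are stored in the same shard.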
.. SEEALSO::
:ref:`Data definition: Sharding `
.. _sql-create-table-partitioned-by:
``PARTITIONED BY``
==================
The ``PARTITIONED`` clause splits the created table into separate
:ref:`partitions ` for every distinct combination of row
values in the specified :ref:`partition columns `.
::
[ PARTITIONED BY ( column_name [ , ... ] ) ]
:column_name:
The name of a column to be used for partitioning. Multiple columns names can
be specified inside the parentheses and must be separated by commas.
The following restrictions apply:
- Partition columns may not be part of the :ref:`sql-create-table-clustered`
clause
- Partition columns must only contain :ref:`primitive types
`
- Partition columns may not be inside an object array
- Partition columns may not be indexed with a :ref:`fulltext index with
analyzer `
- If the table has a :ref:`primary_key_constraint` constraint, all of the
partition columns must be included in the primary key definition
.. CAUTION::
Partition columns :ref:`cannot be altered ` by an
``UPDATE`` statement.
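As a sketch of a typical time-based layout (using a hypothetical table), the
partition column is derived from a timestamp and used in the ``PARTITIONED
BY`` clause::

    CREATE TABLE measurements (
      ts TIMESTAMP WITH TIME ZONE,
      day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts),
      value DOUBLE PRECISION
    ) PARTITIONED BY (day);

Every distinct ``day`` value creates a separate partition.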
.. _sql-create-table-with:
``WITH``
========
The optional ``WITH`` clause can specify parameters for tables.
::
[ WITH ( table_parameter [= value] [, ... ] ) ]
:table_parameter:
Specifies an optional parameter for the table.
.. NOTE::
Some parameters are nested, and therefore need to be wrapped in double
quotes in order to be set. For example::
WITH ("allocation.max_retries" = 5)
Nested parameters are those that contain a ``.`` between parameter names
(e.g. ``write.wait_for_active_shards``).
Available parameters are:
.. _sql-create-table-number-of-replicas:
``number_of_replicas``
----------------------
Specifies the number or range of replicas each shard of a table should have
for normal operation. The default is ``0-1`` replicas.
The number of replicas is defined like this::
min_replicas [ - [ max_replicas ] ]
:min_replicas:
The minimum number of replicas required.
:max_replicas:
The maximum number of replicas.
The actual maximum number of replicas is ``min(max_replicas, N-1)``, where
N is the number of data nodes in the cluster. If ``max_replicas`` is the
string ``all``, then it will always be ``N-1``.
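For example, a minimal sketch with a hypothetical table::

    CREATE TABLE tbl (id INTEGER) WITH (number_of_replicas = '1-all');

With three data nodes, each shard would be expanded to two replicas; adding
data nodes increases the replica count automatically.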
.. NOTE::
If the value is provided as a range or the default value ``0-1`` is used,
:ref:`cluster.max_shards_per_node ` and
:ref:`cluster.routing.allocation.total_shards_per_node
` limits account only for
primary shards and not for possibly expanded replicas, and thus the actual
number of all shards can exceed those limits.
.. SEEALSO::
:ref:`ddl-replication`
.. _sql-create-table-number-of-routing-shards:
``number_of_routing_shards``
----------------------------
This number specifies the hashing space that is used internally to distribute
documents across shards.
This is an optional setting that enables users to increase the number of
shards later on, using :ref:`sql-alter-table`. If it's not set explicitly,
it's automatically set to a default value based on the number of shards
defined in :ref:`sql-create-table-clustered`, which allows increasing the
number of shards by a factor of ``2`` each time, up to a maximum of ``1024``
shards per table.
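For example, a minimal sketch with a hypothetical table that starts with 4
shards but reserves a routing space of 64::

    CREATE TABLE metrics (id INTEGER)
    CLUSTERED INTO 4 SHARDS
    WITH (number_of_routing_shards = 64);

The number of shards can later be increased with :ref:`sql-alter-table`, up
to ``64``.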
.. NOTE:: It's not possible to update this setting after table creation.
.. _sql-create-table-refresh-interval:
``refresh_interval``
--------------------
In CrateDB, newly written records are not immediately visible. A user has to
either invoke the :ref:`REFRESH ` statement or wait for an
automatic background refresh.
The interval of this background refresh is specified in milliseconds using this
``refresh_interval`` setting.
By default it's not specified, which causes tables to be refreshed once every
second but only if the table is not idle. A table can become idle if no
query accesses it for more than 30 seconds.
If a table is idle, the periodic refresh is temporarily disabled. A query
hitting an idle table will trigger a refresh and enable the periodic refresh
again.
When ``refresh_interval`` is set explicitly, the table is refreshed
regardless of its idle state. Use :ref:`ALTER TABLE RESET `
to switch back to the default 1 second refresh and freeze-on-idle behavior.
:value:
The refresh interval in milliseconds. A value smaller than or equal to ``0``
turns off the automatic refresh. A value greater than ``0`` schedules a
periodic refresh of the table.
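For example, a minimal sketch with a hypothetical table that is refreshed
roughly every 5 seconds, regardless of its idle state::

    CREATE TABLE readings (
      ts TIMESTAMP WITH TIME ZONE,
      value DOUBLE PRECISION
    ) WITH (refresh_interval = 5000);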
.. NOTE::
A ``refresh_interval`` of 0 does not guarantee that new writes are *NOT*
visible to subsequent reads. Only the periodic refresh is disabled. There
are other internal factors that might trigger a refresh.
.. NOTE::
On partitioned tables, the idle mechanism works per partition. This can be
useful for time-based partitions where older partitions are rarely queried.
The downside is that if many partitions are idle and a query activates them,
there will be a spike in refresh load. If you have such an access pattern,
you may want to set an explicit ``refresh_interval`` to have a permanent
background refresh.
.. SEEALSO::
:ref:`Querying: Refresh `
:ref:`SQL syntax: REFRESH `
.. _sql-create-table-write-wait:
.. _sql-create-table-write-wait-for-active-shards:
``write.wait_for_active_shards``
--------------------------------
Specifies the number of shard copies that need to be active for write
operations to proceed. If fewer shard copies are active, the operation will
wait and retry for up to 30 seconds before timing out.
:value:
``all`` or a positive integer up to the total number of configured shard
copies (``number_of_replicas + 1``).
A value of ``1`` means only the primary has to be active. A value of ``2``
means the primary plus one replica shard has to be active, and so on.
The default value is set to ``1``.
``all`` is a special value that means all shards (primary + replicas) must be
active for write operations to proceed.
Increasing the number of shard copies to wait for improves the resiliency of
the system. It reduces the chance of write operations not writing to the
desired number of shard copies, but it does not eliminate the possibility
completely, because the check occurs before the write operation starts.
Replica shard copies that missed some writes will eventually be brought up to
date by the system. But if a node holding the primary copy has a system
failure, the replica copy cannot be promoted automatically, as that would
lead to data loss: the system is aware that the replica shard didn't receive
all writes. In such a scenario, :ref:`ALTER TABLE .. REROUTE PROMOTE REPLICA
` can be used to force the
:ref:`allocation ` of a stale replica copy to at least
recover the data that is available in it.
Say you have a 3-node cluster and a table with 1 configured replica. With
``write.wait_for_active_shards=1`` and ``number_of_replicas=1``, a node in the
cluster can be restarted without affecting write operations, because the
primary copies are either active or the replicas can be quickly promoted.
If ``write.wait_for_active_shards`` were set to ``2`` instead and a node is
stopped, write operations would block until the replica is fully replicated
again, or time out in case the replication is not fast enough.
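For example, a minimal sketch with a hypothetical table that requires the
primary plus one replica to be active before writes proceed (note the double
quotes required for nested parameters)::

    CREATE TABLE important_data (id INTEGER)
    WITH (number_of_replicas = 1, "write.wait_for_active_shards" = 2);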
.. _sql-create-table-blocks:
.. _sql-create-table-blocks-read-only:
``blocks.read_only``
--------------------
Makes a table read-only.
:value:
The table is read-only if set to ``true``. Writes and table settings changes
are allowed if set to ``false``.
.. _sql-create-table-blocks-read-only-allow-delete:
``blocks.read_only_allow_delete``
---------------------------------
Makes a table read-only, while still allowing it to be deleted.
:value:
The table is read-only and can be deleted if set to ``true``. Writes and
table settings changes are allowed if set to ``false``. This flag should not
be set manually because it's used, in an automated way, by the mechanism that
protects CrateDB nodes from running out of available disk space.
When a disk on a node exceeds the
``cluster.routing.allocation.disk.watermark.flood_stage`` threshold, this
block is applied (set to ``true``) to all tables on that affected node. Once
you have freed disk space again and usage falls below the threshold, the
setting is automatically reset to ``false`` for the affected tables.
.. SEEALSO::
:ref:`Cluster-wide settings: Disk-based shard allocation
`
.. NOTE::
During maintenance operations, you might want to temporarily disable reads,
writes or table settings changes. To achieve this, please use the
corresponding settings :ref:`sql-create-table-blocks-read`,
:ref:`sql-create-table-blocks-write`,
:ref:`sql-create-table-blocks-metadata`, or
:ref:`sql-create-table-blocks-read-only`, which must be manually reset after
the maintenance operation has been completed.
.. _sql-create-table-blocks-read:
``blocks.read``
---------------
Disables/enables all read operations for a table.
:value:
Set to ``true`` to disable all read operations for a table, otherwise set to
``false``.
.. _sql-create-table-blocks-write:
``blocks.write``
----------------
Disables/enables all write operations for a table.
:value:
Set to ``true`` to disable all write operations and table settings
modifications, otherwise set to ``false``.
.. _sql-create-table-blocks-metadata:
``blocks.metadata``
-------------------
Disables/enables table settings modifications.
:values:
Disables table settings modifications if set to ``true``. If set to
``false``, table settings modifications are enabled.
.. _sql-create-table-soft-deletes:
Soft deletes
------------

Soft deletes allow CrateDB to preserve recent deletions within the Lucene
index. This information is used for :ref:`shard recovery`.
Before the introduction of soft deletes, CrateDB had to retain the information
in the :ref:`Translog `. Using soft deletes uses less
storage than the Translog equivalent and is faster.
.. _sql-create-table-soft-deletes-retention-lease-period:
``soft_deletes.retention_lease.period``
---------------------------------------
The maximum period for which a retention lease is retained before it is
considered expired.
:value:
``12h`` (default). Any positive time value is allowed.
CrateDB sometimes needs to replay operations that were executed on one shard
on other shards. For example, if a shard copy is temporarily unavailable but
write operations to the primary copy continue, the missed operations have to
be replayed once the shard copy becomes available again.
If soft deletes are enabled, CrateDB uses a Lucene feature to preserve recent
deletions in the Lucene index so that they can be replayed. Because of that,
deleted documents still occupy disk space, which is why CrateDB only preserves
certain recently-deleted documents. CrateDB eventually discards deleted
documents completely, to prevent the index from growing ever larger.
CrateDB keeps track of operations it expects to need to replay using a
mechanism called *shard history retention leases*. Retention leases are a
mechanism that allows CrateDB to determine which soft-deleted operations can be
safely discarded.
If a shard copy fails, it stops updating its shard history retention lease,
indicating that the soft-deleted operations should be preserved for later
recovery.
However, to prevent CrateDB from holding onto shard retention leases forever,
they expire after ``soft_deletes.retention_lease.period``, which defaults to
``12h``. Once a retention lease has expired CrateDB can again discard
soft-deleted operations. In case a shard copy recovers after a retention lease
has expired, CrateDB will fall back to copying the whole index since it can no
longer replay the missing history.
.. _sql-create-table-codec:
``codec``
---------
By default data is stored using ``LZ4`` compression. This can be changed to
``best_compression`` which uses ``DEFLATE`` for a higher compression ratio, at
the expense of slower column value lookups.
:values:
``default`` or ``best_compression``
.. _sql-create-table-store:
.. _sql-create-table-store-type:
``store.type``
--------------
The store type setting allows you to control how data is stored and accessed on
disk. It's not possible to update this setting after table creation. The
following storage types are supported:
:fs:
Default file system implementation. It will pick the best implementation
depending on the operating environment, which is currently ``hybridfs`` on
all supported systems but is subject to change.
:niofs:
The ``NIO FS`` type stores the shard index on the file system (Lucene
``NIOFSDirectory``) using NIO. It allows multiple threads to read from the
same file concurrently.
:mmapfs:
The ``MMap FS`` type stores the shard index on the file system (Lucene
``MMapDirectory``) by mapping a file into memory (mmap). Memory mapping uses
up a portion of the virtual memory address space in your process equal to the
size of the file being mapped. Before using this type, be sure you have
allowed plenty of virtual address space.
:hybridfs:
The ``hybridfs`` type is a hybrid of ``niofs`` and ``mmapfs``, which chooses
the best file system type for each type of file based on the read access
pattern. Similarly to ``mmapfs`` be sure you have allowed plenty of virtual
address space.
It is possible to restrict the use of the ``mmapfs`` and ``hybridfs`` store
type via the :ref:`node.store.allow_mmap ` node setting.
.. _sql-create-table-mapping:
.. _sql-create-table-mapping-total-fields-limit:
``mapping.total_fields.limit``
------------------------------
Sets the maximum number of columns that is allowed for a table. Default is
``1000``.
:value:
Maximum number of fields in the Lucene index mapping. This includes both the
user-facing mapping (columns) and internal fields.
.. _sql-create-table-mapping-depth-limit:
``mapping.depth.limit``
-----------------------
Sets the maximum allowed nesting depth for object columns when adding new
columns to a table. Default is ``100``.
.. CAUTION::
Increasing this limit may lead to significantly longer execution times or
stack overflow errors, as certain queries recurse through the deeply nested
object structures.
.. _sql-create-table-translog:
.. _sql-create-table-translog-flush-threshold-size:
``translog.flush_threshold_size``
---------------------------------
Sets the size of the transaction log prior to flushing.
:value:
Size (in bytes) of the translog.
.. _sql-create-table-translog-sync-interval:
``translog.sync_interval``
--------------------------
How often the translog is fsynced to disk. Defaults to 5s. When setting this
interval, please keep in mind that changes logged during this interval and not
synced to disk may get lost in case of a failure. This setting only takes
effect if :ref:`translog.durability ` is
set to ``ASYNC``.
:value:
Interval in milliseconds.
.. _sql-create-table-translog-durability:
``translog.durability``
-----------------------
If set to ``ASYNC`` the translog gets flushed to disk in the background every
:ref:`translog.sync_interval `. If set
to ``REQUEST`` the flush happens after every operation.
:value:
``REQUEST`` (default), ``ASYNC``
.. _sql-create-table-routing:
.. _sql-create-table-routing-allocation:
.. _sql-create-table-routing-allocation.total-shards-per-node:
``routing.allocation.total_shards_per_node``
--------------------------------------------
Controls the total number of shards (replicas and primaries) allowed to be
:ref:`allocated ` on a single node. Defaults to
unbounded (-1).
:value:
Number of shards per node.
.. _sql-create-table-routing-allocation-enable:
``routing.allocation.enable``
-----------------------------
Controls shard :ref:`allocation ` for a specific table.
Can be set to:
:all:
Allows shard allocation for all shards. (Default)
:primaries:
Allows shard allocation only for primary shards.
:new_primaries:
Allows shard allocation only for primary shards for new tables.
:none:
No shard allocation allowed.
.. _sql-create-table-allocation-max-retries:
``allocation.max_retries``
----------------------------------
Defines the number of attempts to :ref:`allocate ` a
shard before giving up and leaving the shard unallocated.
:value:
Number of retries to allocate a shard. Defaults to 5.
.. _sql-create-table-routing-allocation-include:
``routing.allocation.include.{attribute}``
------------------------------------------
Assign the table to a node whose ``{attribute}`` has at least one of the
comma-separated values. This setting overrides the related
:ref:`cluster setting ` for the given
table, which will then ignore the cluster setting completely.
.. SEEALSO::
:ref:`Data definition: Shard allocation filtering `
.. _sql-create-table-routing-allocation-require:
``routing.allocation.require.{attribute}``
------------------------------------------
Assign the table to a node whose ``{attribute}`` has all of the comma-separated
values. This setting overrides the related
:ref:`cluster setting ` for the given
table which will then ignore the cluster setting completely.
.. SEEALSO::
:ref:`Data definition: Shard allocation filtering `
.. _sql-create-table-routing-allocation-exclude:
``routing.allocation.exclude.{attribute}``
------------------------------------------
Assign the table to a node whose ``{attribute}`` has none of the comma-separated
values. This setting overrides the related
:ref:`cluster setting ` for the given
table which will then ignore the cluster setting completely.
.. SEEALSO::
:ref:`Data definition: Shard allocation filtering `
.. _sql-create-table-unassigned:
.. _sql-create-table-unassigned.node-left:
.. _sql-create-table-unassigned.node-left-delayed-timeout:
``unassigned.node_left.delayed_timeout``
----------------------------------------
Delay the :ref:`allocation ` of replica shards which
have become unassigned because a node has left. It defaults to ``1m`` to give a
node time to restart completely (which can take some time when the node has
lots of shards). Setting the timeout to ``0`` will start allocation
immediately. This setting can be changed at runtime in order to increase or
decrease the delayed allocation if needed.
.. _sql-create-table-column-policy:
``column_policy``
-----------------
Specifies the column policy of the table. The default column policy is
``strict``.
The column policy is defined like this::
WITH ( column_policy = {'dynamic' | 'strict'} )
:strict:
Rejects any column on ``INSERT``, ``UPDATE`` or ``COPY FROM`` which is not
defined in the schema.
:dynamic:
New columns can be added using ``INSERT``, ``UPDATE`` or ``COPY FROM``. Once
added, new columns on ``dynamic`` tables are usable like regular columns: you
can retrieve them, sort by them, and use them in ``WHERE`` clauses.
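For example, a minimal sketch with a hypothetical table that accepts
previously unknown columns on write::

    CREATE TABLE events (id INTEGER)
    WITH (column_policy = 'dynamic');

An ``INSERT`` that includes an unknown column would then add that column to
the table instead of being rejected.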
.. SEEALSO::
:ref:`Data definition: Column policy `
.. _sql-create-table-max-ngram-diff:
``max_ngram_diff``
------------------
Specifies the maximum difference between ``max_ngram`` and ``min_ngram`` when
using the ``NGramTokenizer`` or the ``NGramTokenFilter``. The default is 1.
.. _sql-create-table-max-shingle-diff:
``max_shingle_diff``
--------------------
Specifies the maximum difference between ``min_shingle_size`` and
``max_shingle_size`` when using the ``ShingleTokenFilter``. The default is 3.
.. _sql-create-table-merge:
.. _sql-create-table-merge-scheduler:
.. _sql-create-table-merge-scheduler-max-thread-count:
``merge.scheduler.max_thread_count``
------------------------------------
The maximum number of threads on a single shard that may be merging at once.
Defaults to ``Math.max(1, Math.min(4,
Runtime.getRuntime().availableProcessors() / 2))``, which works well for a
good solid-state drive (SSD). If your index is on spinning platter drives
instead, decrease this to ``1``.

.. highlight:: psql
.. _ref-create-table-as:
===================
``CREATE TABLE AS``
===================
Define a new table from the results of a query.
Synopsis
========
::
CREATE TABLE [ IF NOT EXISTS ] table_ident AS { ( query ) | query }
Description
===========
``CREATE TABLE AS`` will create a new table and insert rows based on the
specified ``query``.
Only the column names, types, and the output rows will be used from the
``query``. Default values will be assigned to the optional parameters used for
the table creation.
For further details on the default values of the optional parameters,
see :ref:`sql-create-table`.
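A minimal sketch, assuming an existing table ``doc.readings``::

    CREATE TABLE doc.readings_copy AS (SELECT * FROM doc.readings);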
``IF NOT EXISTS``
=================
If the optional ``IF NOT EXISTS`` clause is used, this statement won't do
anything if the table exists already, and ``0`` rows will be returned.
Parameters
==========
:table_ident:
The name (optionally schema-qualified) of the table to be created.
:query:
A query (``SELECT`` statement) that supplies the rows to be inserted.
Refer to the ``SELECT`` statement for a description of the syntax.

.. highlight:: psql
.. _ref-create-foreign-table:
========================
``CREATE FOREIGN TABLE``
========================
Create a foreign table.
Synopsis
========
.. code-block:: psql
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_ident ([
{ column_name data_type }
[, ... ]
])
SERVER server_name
[ OPTIONS ( option 'value' [, ... ] ) ]
Description
===========
``CREATE FOREIGN TABLE`` is a DDL statement that creates a new foreign table.
A foreign table is a view onto data in a foreign system.
To create a foreign table you must first create a foreign server using
:ref:`ref-create-server`.
The name of the table must be unique, and distinct from the name of other
relations like user tables or views.
Foreign tables are listed in the ``information_schema.tables`` view and
``information_schema.foreign_tables``. You can use :ref:`ref-show-create-table`
to view the definition of an existing foreign table.
Creating a foreign table requires ``AL`` permission on schema or cluster level.
A foreign table cannot be used in :ref:`sql-create-publication` for logical
replication.
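A minimal sketch, assuming a foreign server ``my_pg`` created via
:ref:`ref-create-server` and a ``users`` table in the remote ``public``
schema (the available options depend on the foreign data wrapper, see
:ref:`administration-fdw`):

.. code-block:: psql

    CREATE FOREIGN TABLE doc.remote_users (id INTEGER, name TEXT)
    SERVER my_pg
    OPTIONS (schema_name 'public', table_name 'users');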
Clauses
=======
``IF NOT EXISTS``
-----------------
Do not raise an error if the table already exists.
``OPTIONS``
-----------
:option value:
Key-value pairs defining foreign data wrapper specific options for the
server. See :ref:`administration-fdw` for the foreign data wrapper specific
options.
.. seealso::
- :ref:`administration-fdw`
- :ref:`ref-drop-foreign-table`
- :ref:`ref-create-server`

.. highlight:: psql
.. _sql-alter-table:
===============
``ALTER TABLE``
===============
Alter an existing table.
.. _sql-alter-table-synopsis:
Synopsis
========
::
ALTER [ BLOB ] TABLE { ONLY table_ident
| table_ident [ PARTITION (partition_column = value [ , ... ]) ] }
{ SET ( parameter = value [ , ... ] ) [ WITH ( property = value [ , ...] ) ]
| RESET ( parameter [ , ... ] ) [ WITH ( property = value [ , ...] ) ]
| { ADD [ COLUMN ] column_name data_type [ column_constraint [ ... ] ] } [, ... ]
| { DROP [ COLUMN ] [ IF EXISTS ] column_name } [, ... ]
| { RENAME [ COLUMN ] column_name TO new_name } [, ... ]
| OPEN
| CLOSE
| RENAME TO table_ident
| REROUTE reroute_option
| DROP CONSTRAINT constraint_name
}
where ``column_constraint`` is::
{ PRIMARY KEY |
  NULL |
  NOT NULL |
  INDEX { OFF | USING { PLAIN |
                        FULLTEXT [ WITH ( analyzer = analyzer_name ) ] } } |
  [ CONSTRAINT constraint_name ] CHECK (boolean_expression)
}
.. _sql-alter-table-description:
Description
===========
``ALTER TABLE`` can be used to modify an existing table definition. It
provides options to add columns, modify constraints, enable or disable table
parameters, and execute a shard :ref:`reroute allocation
`.
Use the ``BLOB`` keyword in order to alter a blob table (see
:ref:`blob_support`). Blob tables cannot have custom columns which means that
the ``ADD COLUMN`` keyword won't work.
When altering a partitioned table, using ``ONLY`` will apply changes to the
table *only* and not to any existing partitions; these changes will therefore
only be applied to new partitions. The ``ONLY`` keyword cannot be used
together with a `PARTITION`_ clause.
See ``CREATE TABLE`` :ref:`sql-create-table-with` for a list of available
parameters.
:table_ident:
The name (optionally schema-qualified) of the table to alter.
.. _sql-alter-table-clauses:
Clauses
=======
.. _sql-alter-table-partition:
``PARTITION``
-------------
.. EDITORIAL NOTE
##############
Multiple files (in this directory) use the same standard text for
documenting the ``PARTITION`` clause. (Minor verb changes are made to
accommodate the specifics of the parent statement.)
For consistency, if you make changes here, please be sure to make a
corresponding change to the other files.
If the table is :ref:`partitioned `, the optional
``PARTITION`` clause can be used to alter one partition exclusively.
::
[ PARTITION ( partition_column = value [ , ... ] ) ]
:partition_column:
One of the column names used for table partitioning.
:value:
The respective column value.
All :ref:`partition columns ` (specified by the
:ref:`sql-create-table-partitioned-by` clause) must be listed inside the
parentheses along with their respective values using the ``partition_column =
value`` syntax (separated by commas).
Because each partition corresponds to a unique set of :ref:`partition column
` row values, this clause uniquely identifies a single
partition to alter.
.. TIP::
The :ref:`ref-show-create-table` statement will show you the complete list
of partition columns specified by the
:ref:`sql-create-table-partitioned-by` clause.
.. NOTE::
BLOB tables cannot be partitioned and hence this clause cannot be used.
.. SEEALSO::
:ref:`Partitioned tables: Alter `
.. _sql-alter-table-arguments:
Arguments
=========
.. _sql-alter-table-set-reset:
``SET/RESET``
-------------
Can be used to change a table parameter to a different value. Using ``RESET``
will reset the parameter to its default value.
:parameter:
The name of the parameter that is set to a new value or its default.
The supported parameters are listed in the :ref:`CREATE TABLE WITH CLAUSE
` documentation. In addition to those, for dynamically
changing the number of :ref:`allocated shards `, the
parameter ``number_of_shards`` can be used. For more info on that, see
:ref:`alter-shard-number`.
The additional ``WITH`` clause can be used to set properties changing the
behavior of the statement. Supported properties are:
:timeout:
Sets the timeout for the statement. Defaults to 60 seconds for resize
operations and 30 seconds for other setting changes.
Note that if a global or per session :ref:`statement_timeout
` is shorter, it will take effect first.
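For example, a minimal sketch with a hypothetical table::

    ALTER TABLE tbl SET (refresh_interval = 2000);
    ALTER TABLE tbl RESET (refresh_interval);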
.. _sql-alter-table-add-column:
``ADD COLUMN``
--------------
Can be used to add an additional column to a table. While columns can be added
at any time, adding a new :ref:`generated column
` is only possible if the table is empty.
In addition, adding a base column with :ref:`sql-create-table-default-clause`
is not supported. It is possible to define a ``CHECK`` constraint with the
restriction that only the column being added may be used in the :ref:`boolean
expression `.
:data_type:
Data type of the column which should be added.
:column_name:
Name of the column which should be added.
This can be a sub-column on an existing `OBJECT`.
It's possible to add multiple columns at once.
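For example, a minimal sketch with a hypothetical table that adds a top-level
column and a sub-column of an existing ``OBJECT`` column ``obj`` in one
statement::

    ALTER TABLE tbl
      ADD COLUMN note TEXT,
      ADD COLUMN obj['severity'] INTEGER;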
.. _sql-alter-table-drop-column:
``DROP COLUMN``
---------------
Can be used to drop a column from a table.
:column_name:
Name of the column which should be dropped.
This can be a sub-column of an `OBJECT`.
It's possible to drop multiple columns at once.
.. NOTE::
It's not allowed to drop a column:
- which is a :ref:`system column `
- which is part of a :ref:`PRIMARY KEY `
- which is used in the :ref:`CLUSTERED BY column `
- which is used in :ref:`PARTITIONED BY `
- which is a :ref:`named index` column
- which is used in a :ref:`named index`
- which is referenced in a
:ref:`generated column `
- which is referenced in a
:ref:`table level constraint with other columns `
.. NOTE::
It's not allowed to drop all columns of a table.
.. NOTE::
Dropping columns of a table created before version 5.5 is not supported.
.. _sql-alter-table-rename-column:
``RENAME COLUMN``
-----------------
Renames a column of a table.
:column_name:
Name of the column to rename.
Supports subscript expressions to rename sub-columns of ``OBJECT`` columns.
:new_name:
The new name of the column.
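For example, a minimal sketch with hypothetical names::

    ALTER TABLE tbl RENAME COLUMN note TO remark;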
.. NOTE::
Renaming columns of a table created before version 5.5 is not supported.
.. _sql-alter-table-open-close:
``OPEN/CLOSE``
--------------
Can be used to open or close the table.
Closing a table means that all operations, except ``ALTER TABLE ...``, will
be rejected. Rejected operations will not return an error, but they will have
no effect. Operations on tables containing closed partitions won't fail, but
those operations will exclude all closed partitions.
.. _sql-alter-table-rename-to:
``RENAME TO``
-------------
Can be used to rename a table or view while maintaining its schema and data.
When renaming a table, its shards become temporarily unavailable.
.. _sql-alter-table-reroute:
``REROUTE``
-----------
The ``REROUTE`` command provides various options to manually control the
:ref:`allocation of shards `. It allows the enforcement
of explicit allocations, cancellations and the moving of shards between nodes
in a cluster. See :ref:`ddl_reroute_shards` for common use cases.
The row count indicates whether the reroute or allocation process of a shard
was acknowledged or rejected.
.. NOTE::
Partitioned tables require a :ref:`sql-alter-table-partition` clause in
order to specify a unique ``shard_id``.
::
[ REROUTE reroute_option ]
where ``reroute_option`` is::
{ MOVE SHARD shard_id FROM node TO node
| ALLOCATE REPLICA SHARD shard_id ON node
| PROMOTE REPLICA SHARD shard_id ON node [ WITH (accept_data_loss = { TRUE | FALSE }) ]
| CANCEL SHARD shard_id ON node [ WITH (allow_primary = {TRUE|FALSE}) ]
}
:shard_id:
The shard ID. Ranges from 0 up to the specified number of :ref:`sys-shards`
shards of a table.
:node:
The ID or name of a node within the cluster.
See :ref:`sys-nodes` for how to obtain the unique ID.
``REROUTE`` supports the following options to start/stop shard allocation:
**MOVE**
A started shard gets moved from one node to another. It requires a
``table_ident`` and a ``shard_id`` to identify the shard that receives the
new allocation. Specify ``FROM node`` for the node to move the shard from and
``TO node`` to move the shard to.
**ALLOCATE REPLICA**
Forces allocation of an unassigned replica shard on a specific node.
.. _alter-table-reroute-promote-replica:
**PROMOTE REPLICA** Force promote a stale replica shard to a primary. In case
a node holding a primary copy of a shard had a failure and the replica shards
are out of sync, the system won't promote the replica to primary
automatically, as it would result in silent data loss.
Ideally the node holding the primary copy of the shard would be brought back
into the cluster, but if that is not possible due to a permanent system
failure, it is possible to accept the potential data loss and force promote a
stale replica using this command.
The parameter ``accept_data_loss`` needs to be set to ``true`` in order for
this command to work. If it is not provided or set to ``false``, the command
will error out.
**CANCEL**
This cancels the allocation or :ref:`recovery ` of a
``shard_id`` of a ``table_ident`` on a given ``node``. The ``allow_primary``
flag indicates if it is allowed to cancel the allocation of a primary shard.
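For example, a minimal sketch with hypothetical table, shard, and node
names::

    ALTER TABLE tbl REROUTE MOVE SHARD 0 FROM 'node-1' TO 'node-2';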
.. _sql-alter-drop-constraint:
``DROP CONSTRAINT``
-------------------
Removes a :ref:`check_constraint` constraint from a table.
.. code-block:: sql
ALTER TABLE table_ident DROP CONSTRAINT check_name
:table_ident:
The name (optionally schema-qualified) of the table.
:check_name:
The name of the check constraint to be removed.
.. WARNING::
A removed ``CHECK`` constraint cannot be re-added to a table once dropped.

.. highlight:: psql
.. _sql-copy-from:
=============
``COPY FROM``
=============
You can use the ``COPY FROM`` :ref:`statement ` to copy data
from a file into a table.
.. SEEALSO::
:ref:`Data manipulation: Import and export `
:ref:`SQL syntax: COPY TO `
.. _sql-copy-from-synopsis:
Synopsis
========
::
COPY table_identifier
[ ( column_ident [, ...] ) ]
[ PARTITION (partition_column = value [ , ... ]) ]
FROM uri [ WITH ( option = value [, ...] ) ] [ RETURN SUMMARY ]
.. _sql-copy-from-desc:
Description
===========
``COPY FROM`` copies data from a URI to the specified table.
The nodes in the cluster will attempt to read the files available at the URI
and import the data.
Here's an example:
::
cr> COPY quotes FROM 'file:///tmp/import_data/quotes.json';
COPY OK, 3 rows affected (... sec)
.. NOTE::
The ``COPY`` statements use :ref:`Overload Protection ` to ensure other
queries can still be executed. Adjust these settings during large imports if
needed.
.. _sql-copy-from-formats:
File formats
------------
CrateDB accepts both JSON and CSV inputs. The format is inferred from the file
extension (``.json`` or ``.csv`` respectively) if possible. The :ref:`format
` can also be set as an option. If a format is not
specified and the format cannot be inferred, the file will be processed as
JSON.
JSON files must contain a single JSON object per line and all files must be
UTF-8 encoded. Also, any empty lines are skipped.
Example JSON data::
{"id": 1, "quote": "Don't panic"}
{"id": 2, "quote": "Ford, you're turning into a penguin. Stop it."}
A CSV file may or may not contain a header. See :ref:`CSV header option
` for further details.
Example CSV data::
id,quote
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."
Example CSV data with no header::
1,"Don't panic"
2,"Ford, you're turning into a penguin. Stop it."
See also: :ref:`dml-importing-data`.
.. _sql-copy-from-type-checks:
Data type checks
----------------
CrateDB checks if the columns' data types match the types from the import file.
It casts the types and will always import the data as given in the source
file.
Furthermore CrateDB will check for all :ref:`column_constraints`.
For example a `WKT`_ string cannot be imported into a column of ``geo_shape``
or ``geo_point`` type, since there is no implicit cast to the `GeoJSON`_ format.
.. NOTE::
In case the ``COPY FROM`` statement fails, the log output on the node will
provide an error message. Any data that has been imported until then has
been written to the table and should be deleted before restarting the
import.
.. _sql-copy-from-params:
Parameters
==========
.. _sql-copy-from-table_ident:
``table_ident``
The name (optionally schema-qualified) of an existing table where the data
should be put.
.. _sql-copy-from-column_ident:
``column_ident``
Used in an optional columns declaration, each ``column_ident`` is the name of a column in the ``table_ident`` table.
This currently only has an effect when using the CSV file format. See the
``header`` option for how it behaves.
.. _sql-copy-from-uri:
``uri``
An expression or array of expressions. Each :ref:`expression
` must :ref:`evaluate ` to a string
literal that is a `well-formed URI`_.
URIs must use one of the supported :ref:`URI schemes
`. CrateDB supports :ref:`globbing
` for the :ref:`file ` and
:ref:`s3 ` URI schemes.
.. NOTE::
If the URI scheme is missing, CrateDB assumes the value is a pathname and
will prepend the :ref:`file ` URI scheme (i.e.,
``file://``). So, for example, CrateDB will convert ``/tmp/file.json`` to
``file:///tmp/file.json``.
.. _sql-copy-from-globbing:
URI globbing
------------
With :ref:`file ` and :ref:`s3 ` URI
schemes, you can use pathname `globbing`_ (i.e., ``*`` wildcards) with the
``COPY FROM`` statement to construct URIs that can match multiple directories
and files.
Suppose you used ``file:///tmp/import_data/*/*.json`` as the URI. This URI
would match all JSON files located in subdirectories of the
``/tmp/import_data`` directory.
So, for example, these files would match:
- ``/tmp/import_data/foo/1.json``
- ``/tmp/import_data/bar/2.json``
- ``/tmp/import_data/1/boz.json``
.. CAUTION::
A file named ``/tmp/import_data/foo/.json`` would also match the
``file:///tmp/import_data/*/*.json`` URI. The ``*`` wildcard matches any
number of characters, including none.
However, these files would not match:
- ``/tmp/import_data/1.json`` (too few subdirectories)
- ``/tmp/import_data/foo/bar/2.json`` (too many subdirectories)
- ``/tmp/import_data/1/boz.js`` (file extension mismatch)
.. _sql-copy-from-schemes:
URI schemes
-----------
CrateDB supports the following URI schemes:
.. contents::
:local:
:depth: 1
.. _sql-copy-from-file:
``file``
''''''''
You can use the ``file://`` scheme to specify an absolute path to one or more
files accessible via the local filesystem of one or more CrateDB nodes.
For example:
.. code-block:: text
file:///path/to/dir
The files must be accessible on at least one node and the system user running
the ``crate`` process must have read access to every file specified.
Additionally, only the ``crate`` superuser is allowed to use the ``file://``
scheme.
By default, every node will attempt to import every file. If the file is
accessible on multiple nodes, you can set the
:ref:`shared ` option to true in order to avoid importing
duplicates.
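For example, a sketch assuming ``/data/import`` is a hypothetical shared mount
(e.g. NFS) visible to every node:

.. code-block:: sql

    -- All nodes see the same files, so import each file only once
    COPY quotes FROM 'file:///data/import/quotes.json' WITH (shared = true);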
Use :ref:`sql-copy-from-return-summary` to get information about what actions
were performed on each node.
.. TIP::
If you are running CrateDB inside a container, the file must be inside the
container. If you are using *Docker*, you may have to configure a `Docker
volume`_ to accomplish this.
.. TIP::
If you are using *Microsoft Windows*, you must include the drive letter in
the file URI.
For example:
.. code-block:: text
file://C:\/tmp/import_data/quotes.json
Consult the `Windows documentation`_ for more information.
.. _sql-copy-from-s3:
``s3``
''''''
You can use the ``s3://`` scheme to access buckets on the `Amazon Simple
Storage Service`_ (Amazon S3).
For example:
.. code-block:: text
s3://[<accesskey>:<secretkey>@][<host>:<port>/]<bucketname>/<path>
S3 compatible storage providers can be specified by the optional pair of host
and port, which defaults to Amazon S3 if not provided.
Here is a more concrete example:
.. code-block:: text
COPY t FROM 's3://accessKey:secretKey@s3.amazonaws.com:443/myBucket/key/a.json' with (protocol = 'https')
If no credentials are set the s3 client will operate in anonymous mode.
See `AWS Java Documentation`_.
Using the ``s3://`` scheme automatically sets the
:ref:`shared ` to true.
.. TIP::
A ``secretkey`` provided by Amazon Web Services can contain characters such
as '/', '+' or '='. These characters must be `URL encoded`_. For a detailed
explanation read the official `AWS documentation`_.
To escape a secret key, you can use a snippet like this:
.. code-block:: console
sh$ python -c "from getpass import getpass; from urllib.parse import quote_plus; print(quote_plus(getpass('secret_key: ')))"
This will prompt for the secret key and print the encoded variant.
Additionally, versions prior to 0.51.x used HTTP for connections to S3.
Since 0.51.x these connections use the HTTPS protocol. Please make sure you
update your firewall rules to allow outgoing connections on port ``443``.
.. _sql-copy-from-az:
``az``
''''''
You can use the ``az://`` scheme to access files on the `Azure Blob Storage`_.
The URI must look like ``az://<account>.<endpoint_suffix>/<container>/<blob_path>``.
For example:
.. code-block:: text
az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json
One of the authentication parameters (:ref:`sql-copy-from-key` or :ref:`sql-copy-from-sas-token`)
must be provided in the ``WITH`` clause.
Protocol can be provided in the ``WITH`` clause, otherwise ``https`` is used by default.
For example:
.. code-block:: text
COPY t
FROM 'az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json'
WITH (
key = 'key'
)
Using the ``az://`` scheme automatically sets the
:ref:`shared ` to ``true``.
.. _sql-copy-from-other-schemes:
Other schemes
'''''''''''''
In addition to the schemes above, CrateDB supports all protocols supported by
the `URL`_ implementation of its JVM (typically ``http``, ``https``, ``ftp``,
and ``jar``). Please refer to the documentation of the JVM vendor for an
accurate list of supported protocols.
.. NOTE::
These schemes *do not* support wildcard expansion.
.. _sql-copy-from-clauses:
Clauses
=======
The ``COPY FROM`` :ref:`statement ` supports the following
clauses:
.. contents::
:local:
:depth: 1
.. _sql-copy-from-partition:
``PARTITION``
-------------
.. EDITORIAL NOTE
##############
Multiple files (in this directory) use the same standard text for
documenting the ``PARTITION`` clause. (Minor verb changes are made to
accommodate the specifics of the parent statement.)
For consistency, if you make changes here, please be sure to make a
corresponding change to the other files.
If the table is :ref:`partitioned `, the optional
``PARTITION`` clause can be used to import data into one partition exclusively.
::
[ PARTITION ( partition_column = value [ , ... ] ) ]
:partition_column:
One of the column names used for table partitioning
:value:
The respective column value.
All :ref:`partition columns ` (specified by the
:ref:`sql-create-table-partitioned-by` clause) must be listed inside the
parentheses along with their respective values using the ``partition_column =
value`` syntax (separated by commas).
Because each partition corresponds to a unique set of :ref:`partition column
` row values, this clause uniquely identifies a single
partition for import.
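For example, a sketch assuming a table partitioned by a hypothetical ``day``
column:

.. code-block:: sql

    -- Import all rows from the file into the partition where day = '2023-01-01'
    COPY parted_table PARTITION (day = '2023-01-01')
    FROM 'file:///tmp/import_data/day1.json';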
.. TIP::
The :ref:`ref-show-create-table` statement will show you the complete list
of partition columns specified by the
:ref:`sql-create-table-partitioned-by` clause.
.. CAUTION::
Partitioned tables do not store the row values for the partition columns,
hence every row will be imported into the specified partition regardless of
partition column values.
.. _sql-copy-from-with:
``WITH``
--------
You can use the optional ``WITH`` clause to specify option values.
::
[ WITH ( option = value [, ...] ) ]
The ``WITH`` clause supports the following options:
.. contents::
:local:
:depth: 1
.. _sql-copy-from-bulk_size:
**bulk_size**
| *Type:* ``integer``
| *Default:* ``10000``
| *Optional*
CrateDB will process the lines it reads from the ``path`` in batches. This
option specifies the size of one batch. The provided value must be greater
than 0.
.. _sql-copy-from-fail_fast:
**fail_fast**
| *Type:* ``boolean``
| *Default:* ``false``
| *Optional*
A boolean value indicating if the ``COPY FROM`` operation should abort early
after an error. This is best effort and due to the distributed execution, it
may continue processing some records before it aborts.
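For example, a sketch combining a smaller batch size with early abort
(hypothetical path):

.. code-block:: sql

    COPY quotes FROM 'file:///tmp/import_data/quotes.json'
    WITH (bulk_size = 1000, fail_fast = true);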
.. _sql-copy-from-wait_for_completion:
**wait_for_completion**
| *Type:* ``boolean``
| *Default:* ``true``
| *Optional*
A boolean value indicating if the ``COPY FROM`` should wait for
the copy operation to complete. If set to ``false`` the request
returns at once and the copy operation runs in the background.
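For example, a sketch starting the import in the background (hypothetical
path):

.. code-block:: sql

    -- Returns immediately; the import continues in the background
    COPY quotes FROM 'file:///tmp/import_data/quotes.json'
    WITH (wait_for_completion = false);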
.. _sql-copy-from-shared:
**shared**
| *Type:* ``boolean``
| *Default:* Depends on the scheme of each URI.
| *Optional*
This option should be set to ``true`` if the URI's location is accessible to
more than one CrateDB node, to prevent them from importing the same file.
If an array of URIs is passed to ``COPY FROM``, this option overrides the
default for *all* URIs.
.. _sql-copy-from-node_filters:
**node_filters**
| *Type:* ``text``
| *Optional*
A filter :ref:`expression ` to select the nodes to run the
*read* operation.
It's an object in the form of::

    {
        name = '<node_name_regex>',
        id = '<node_id_regex>'
    }
Only one of the keys is required.
The ``name`` :ref:`regular expression ` is applied on
the ``name`` of all execution nodes, whereas the ``id`` regex is applied on the
``node id``.
If both keys are set, *both* regular expressions have to match for a node to be
included.
If the :ref:`shared ` option is false, a strict node
filter might exclude nodes with access to the data leading to a partial import.
To verify which nodes match the filter, run the statement with
:ref:`EXPLAIN `.
.. _sql-copy-from-num_readers:
**num_readers**
| *Type:* ``integer``
| *Default:* Number of nodes available in the cluster.
| *Optional*
The number of nodes that will read the resources specified in the URI. If the
option is set to a number greater than the number of available nodes, each
node is still used only once for the import. The value must be an integer
greater than 0.
If :ref:`shared ` is set to false this option has to be
used with caution. It might exclude the wrong nodes, causing COPY FROM to read
no files or only a subset of the files.
.. _sql-copy-from-compression:
**compression**
| *Type:* ``text``
| *Values:* ``gzip``
| *Default:* ``gzip`` if all specified URIs end in ``.gz``, otherwise the
| input is not decompressed.
| *Optional*
Defines if the files to import are ``gzip`` compressed.
.. _sql-copy-from-protocol:
**protocol**
| *Type:* ``text``
| *Values:* ``http``, ``https``
| *Default:* ``https``
| *Optional*
Protocol to use.
Used for :ref:`s3 ` and :ref:`az ` schemes only.
.. _sql-copy-from-overwrite_duplicates:
**overwrite_duplicates**
| *Type:* ``boolean``
| *Default:* ``false``
| *Optional*
``COPY FROM`` by default won't overwrite rows if a document with the same
primary key already exists. Set to true to overwrite duplicate rows.
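For example, a sketch re-importing a file and replacing rows with matching
primary keys (hypothetical path):

.. code-block:: sql

    COPY quotes FROM 'file:///tmp/import_data/quotes.json'
    WITH (overwrite_duplicates = true);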
.. _sql-copy-from-empty_string_as_null:
**empty_string_as_null**
| *Type:* ``boolean``
| *Default:* ``false``
| *Optional*
If set to ``true`` the ``empty_string_as_null`` option enables conversion of
empty strings into ``NULL``.
The option is only supported when using the ``CSV`` format, otherwise, it will
be ignored.
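For example, a sketch importing a CSV file in which empty fields should become
``NULL`` (hypothetical path):

.. code-block:: sql

    COPY quotes FROM 'file:///tmp/import_data/quotes.csv'
    WITH (format = 'csv', empty_string_as_null = true);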
.. _sql-copy-from-delimiter:
**delimiter**
| *Type:* ``text``
| *Default:* ``,``
| *Optional*
Specifies a single one-byte character that separates columns within each line
of the file.
The option is only supported when using the ``CSV`` format, otherwise, it will
be ignored.
.. _sql-copy-from-format:
**format**
| *Type:* ``text``
| *Values:* ``csv``, ``json``
| *Default:* ``json``
| *Optional*
This option specifies the format of the input file. Available formats are
``csv`` or ``json``. If a format is not specified and the format cannot be
guessed from the file extension, the file will be processed as JSON.
.. _sql-copy-from-header:
**header**
| *Type:* ``boolean``
| *Default:* ``true``
| *Optional*
Used to indicate if the first line of a CSV file contains a header with the
column names.
If set to ``false``, the CSV must not contain column names in the first line
and instead the columns declared in the statement are used. If no columns are
declared in the statement, it will default to all columns present in the table
in their ``CREATE TABLE`` declaration order.
If set to ``true`` the first line in the CSV file must contain the column
names. You can use the optional column declaration in addition to import only a
subset of the data.
If the statement contains no column declarations, all fields in the CSV are
read and if it contains fields where there is no matching column in the table,
the behavior depends on the ``column_policy`` table setting. If ``dynamic`` it
implicitly adds new columns, if ``strict`` the operation will fail.
An example of using an input file with no header::
cr> COPY quotes FROM 'file:///tmp/import_data/quotes.csv' with (format='csv', header=false);
COPY OK, 3 rows affected (... sec)
.. _sql-copy-from-skip:
**skip**
| *Type:* ``integer``
| *Default:* ``0``
| *Optional*
Setting this option to ``n`` skips the first ``n`` rows while copying.
.. NOTE::
CrateDB by default expects a header in CSV files. If you're using the SKIP
option to skip the header, you have to set ``header = false`` as well. See
:ref:`header `.
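For example, a sketch that skips the header line explicitly (hypothetical
path); the columns then follow the ``CREATE TABLE`` declaration order or an
explicit column list:

.. code-block:: sql

    COPY quotes FROM 'file:///tmp/import_data/quotes.csv'
    WITH (format = 'csv', header = false, skip = 1);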
.. _sql-copy-from-key:
**key**
| *Type:* ``text``
| *Optional*
Used for :ref:`az ` scheme only.
The Azure Storage `Account Key`_.
.. NOTE::
It must be provided if :ref:`sql-copy-from-sas-token` is not provided.
.. _sql-copy-from-sas-token:
**sas_token**
| *Type:* ``text``
| *Optional*
Used for :ref:`az ` scheme only.
The Shared Access Signatures (`SAS`_) token used for authentication for the
Azure Storage account. This can be used as an alternative to the Azure
Storage `Account Key`_.
The SAS token must have read, write, and list permissions for the
container base path and all its contents. These permissions need to be
granted for the blob service and apply to resource types service, container,
and object.
.. NOTE::
It must be provided if :ref:`sql-copy-from-key` is not provided.
.. _sql-copy-from-return-summary:
``RETURN SUMMARY``
------------------
By using the optional ``RETURN SUMMARY`` clause, a per-node result set will be
returned containing information about possible failures and successfully
inserted records.
::
[ RETURN SUMMARY ]
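For example, a sketch with a hypothetical path; the columns of the returned
per-node result set are described below:

.. code-block:: sql

    COPY quotes FROM 'file:///tmp/import_data/quotes.json' RETURN SUMMARY;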
+---------------------------------------+------------------------------------------------+---------------+
| Column Name                           | Description                                    | Return Type   |
+=======================================+================================================+===============+
| ``node``                              | Information about the node that has processed  | ``OBJECT``    |
|                                       | the URI resource.                              |               |
+---------------------------------------+------------------------------------------------+---------------+
| ``node['id']``                        | The id of the node.                            | ``TEXT``      |
+---------------------------------------+------------------------------------------------+---------------+
| ``node['name']``                      | The name of the node.                          | ``TEXT``      |
+---------------------------------------+------------------------------------------------+---------------+
| ``uri``                               | The URI the node has processed.                | ``TEXT``      |
+---------------------------------------+------------------------------------------------+---------------+
| ``error_count``                       | The total number of records which failed.      | ``BIGINT``    |
|                                       | A NULL value indicates a general URI reading   |               |
|                                       | error, the error will be listed inside the     |               |
|                                       | ``errors`` column.                             |               |
+---------------------------------------+------------------------------------------------+---------------+
| ``success_count``                     | The total number of records which were         | ``BIGINT``    |
|                                       | inserted.                                      |               |
|                                       | A NULL value indicates a general URI reading   |               |
|                                       | error, the error will be listed inside the     |               |
|                                       | ``errors`` column.                             |               |
+---------------------------------------+------------------------------------------------+---------------+
| ``errors``                            | Contains detailed information about all        | ``OBJECT``    |
|                                       | errors. Limited to at most 25 error messages.  |               |
+---------------------------------------+------------------------------------------------+---------------+
| ``errors[ERROR_MSG]``                 | Contains information about one type of error.  | ``OBJECT``    |
+---------------------------------------+------------------------------------------------+---------------+
| ``errors[ERROR_MSG]['count']``        | The number of records that failed with this    | ``BIGINT``    |
|                                       | error.                                         |               |
+---------------------------------------+------------------------------------------------+---------------+
| ``errors[ERROR_MSG]['line_numbers']`` | The line numbers of the source URI where the   | ``ARRAY``     |
|                                       | error occurred, limited to the first 50        |               |
|                                       | errors, to avoid buffer pressure on clients.   |               |
+---------------------------------------+------------------------------------------------+---------------+
.. _Amazon Simple Storage Service: https://aws.amazon.com/s3/
.. _AWS documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTAuthentication.html
.. _AWS Java Documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/AuthUsingAcctOrUserCredentials.html
.. _Azure Blob Storage: https://learn.microsoft.com/en-us/azure/storage/blobs/
.. _Account Key: https://learn.microsoft.com/en-us/purview/sit-defn-azure-storage-account-key-generic#format
.. _SAS: https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview
.. _Docker volume: https://docs.docker.com/engine/storage/volumes/
.. _GeoJSON: https://geojson.org/
.. _globbing: https://en.wikipedia.org/wiki/Glob_(programming)
.. _percent-encoding: https://en.wikipedia.org/wiki/Percent-encoding
.. _URI Scheme: https://en.wikipedia.org/wiki/URI_scheme
.. _URL encoded: https://en.wikipedia.org/wiki/Percent-encoding
.. _URL: https://docs.oracle.com/javase/8/docs/api/java/net/URL.html
.. _well-formed URI: https://www.rfc-editor.org/rfc/rfc2396
.. _Windows documentation: https://learn.microsoft.com/en-us/dotnet/standard/io/file-path-formats
.. _WKT: https://en.wikipedia.org/wiki/Well-known_text

.. highlight:: psql
.. _sql-copy-to:
===========
``COPY TO``
===========
You can use the ``COPY TO`` :ref:`statement ` to export table
data to a file.
.. SEEALSO::
:ref:`Data manipulation: Import and export `
:ref:`SQL syntax: COPY FROM `
.. _sql-copy-to-synopsis:
Synopsis
========
::
COPY table_ident [ PARTITION ( partition_column = value [ , ... ] ) ]
[ ( column [ , ...] ) ]
[ WHERE condition ]
TO DIRECTORY output_uri
[ WITH ( copy_parameter [= value] [, ... ] ) ]
.. _sql-copy-to-desc:
Description
===========
The ``COPY TO`` command exports the contents of a table to one or more files
into a given directory with unique filenames. Each node with at least one shard
of the table will export its contents onto their local disk.
The created files are JSON formatted and contain one table row per line. Due
to the distributed nature of CrateDB, the files *will remain on the same
nodes where the shards are*.
Here's an example:
::
cr> COPY quotes TO DIRECTORY '/tmp/' with (compression='gzip');
COPY OK, 3 rows affected ...
.. NOTE::
Currently only user tables can be exported. System tables like ``sys.nodes``
and blob tables don't work with the ``COPY TO`` statement.
The ``COPY`` statements use :ref:`Overload Protection ` to ensure other
queries can still be executed. Adjust these settings during large exports if
needed.
.. _sql-copy-to-params:
Parameters
==========
.. _sql-copy-to-table_ident:
``table_ident``
The name (optionally schema-qualified) of the table to be exported.
.. _sql-copy-to-column:
``column``
(optional) A list of column :ref:`expressions ` that should
be exported. E.g.
::
cr> COPY quotes (quote, author) TO DIRECTORY '/tmp/';
COPY OK, 3 rows affected ...
.. NOTE::
When declaring columns, this changes the output to JSON list format,
which is currently not supported by the ``COPY FROM`` statement.
.. _sql-copy-to-clauses:
Clauses
=======
.. _sql-copy-to-partition:
``PARTITION``
-------------
.. EDITORIAL NOTE
##############
Multiple files (in this directory) use the same standard text for
documenting the ``PARTITION`` clause. (Minor verb changes are made to
accommodate the specifics of the parent statement.)
For consistency, if you make changes here, please be sure to make a
corresponding change to the other files.
If the table is :ref:`partitioned `, the optional
``PARTITION`` clause can be used to export data from one partition
exclusively.
::
[ PARTITION ( partition_column = value [ , ... ] ) ]
:partition_column:
One of the column names used for table partitioning.
:value:
The respective column value.
All :ref:`partition columns ` (specified by the
:ref:`sql-create-table-partitioned-by` clause) must be listed inside the
parentheses along with their respective values using the ``partition_column =
value`` syntax (separated by commas).
Because each partition corresponds to a unique set of :ref:`partition column
` row values, this clause uniquely identifies a single
partition to export.
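For example, a sketch assuming a table partitioned by a hypothetical ``day``
column:

.. code-block:: sql

    -- Export only the partition where day = '2023-01-01'
    COPY parted_table PARTITION (day = '2023-01-01')
    TO DIRECTORY '/tmp/export_day1/';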
.. TIP::
The :ref:`ref-show-create-table` statement will show you the complete list
of partition columns specified by the
:ref:`sql-create-table-partitioned-by` clause.
.. _sql-copy-to-where:
``WHERE``
---------
The ``WHERE`` clause uses the same syntax as in ``SELECT`` statements,
allowing partial exports (see :ref:`sql_dql_where_clause` for more
information).
Example of using ``WHERE`` clause with
:ref:`comparison operators ` for partial export:
::
cr> COPY quotes WHERE category = 'philosophy' TO DIRECTORY '/tmp/';
COPY OK, 3 rows affected ...
.. _sql-copy-to-to:
``TO``
------
The ``TO`` clause allows you to specify an output location.
::
TO DIRECTORY output_uri
.. _sql-copy-to-to-params:
Parameters
''''''''''
``output_uri``
An :ref:`expression ` that must :ref:`evaluate
` to a string literal that is a `well-formed URI`_. URIs
must use one of the supported :ref:`URI schemes `.
.. NOTE::
If the URI scheme is missing, CrateDB assumes the value is a pathname and
will prepend the :ref:`file ` URI scheme (i.e.,
``file://``). So, for example, CrateDB will convert ``/tmp/file.json`` to
``file:///tmp/file.json``.
.. _sql-copy-to-schemes:
URI schemes
-----------
CrateDB supports the following URI schemes:
.. contents::
:local:
:depth: 1
.. _sql-copy-to-file:
``file``
''''''''
You can use the ``file://`` scheme to specify an absolute path to an output
location on the local file system.
For example:
.. code-block:: text
file:///path/to/dir
.. TIP::
If you are running CrateDB inside a container, the location must be inside
the container. If you are using *Docker*, you may have to configure a
`Docker volume`_ to accomplish this.
.. TIP::
If you are using *Microsoft Windows*, you must include the drive letter in
the file URI.
For example:
.. code-block:: text
file://C:\/tmp/import_data/quotes.json
Consult the `Windows documentation`_ for more information.
.. _sql-copy-to-s3:
``s3``
''''''
You can use the ``s3://`` scheme to access buckets on the `Amazon Simple
Storage Service`_ (Amazon S3).
For example:
.. code-block:: text
s3://[<accesskey>:<secretkey>@][<host>:<port>/]<bucketname>/<path>
S3 compatible storage providers can be specified by the optional pair of host
and port, which defaults to Amazon S3 if not provided.
Here is a more concrete example:
.. code-block:: text
COPY t TO DIRECTORY 's3://myAccessKey:mySecretKey@s3.amazonaws.com:80/myBucket/key1' with (protocol = 'http')
If no credentials are set the s3 client will operate in anonymous mode.
See `AWS Java Documentation`_.
.. TIP::
A ``secretkey`` provided by Amazon Web Services can contain characters such
as '/', '+' or '='. These characters must be `URL encoded`_. For a detailed
explanation read the official `AWS documentation`_.
To escape a secret key, you can use a snippet like this:
.. code-block:: console
sh$ python -c "from getpass import getpass; from urllib.parse import quote_plus; print(quote_plus(getpass('secret_key: ')))"
This will prompt for the secret key and print the encoded variant.
Additionally, versions prior to 0.51.x used HTTP for connections to S3.
Since 0.51.x these connections use the HTTPS protocol. Please make sure you
update your firewall rules to allow outgoing connections on port ``443``.
.. _sql-copy-to-az:
``az``
''''''
You can use the ``az://`` scheme to access files on the `Azure Blob Storage`_.
The URI must look like ``az://<account>.<endpoint_suffix>/<container>/<blob_path>``.
For example:
.. code-block:: text
az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json
One of the authentication parameters (:ref:`sql-copy-to-key` or :ref:`sql-copy-to-sas-token`)
must be provided in the ``WITH`` clause.
Protocol can be provided in the ``WITH`` clause, otherwise ``https`` is used by default.
For example:
.. code-block:: text
COPY source
TO DIRECTORY 'az://myaccount.blob.core.windows.net/my-container/dir1/dir2/file1.json'
WITH (
key = 'key'
)
.. _sql-copy-to-with:
``WITH``
--------
You can use the optional ``WITH`` clause to specify copy parameter values.
::
[ WITH ( copy_parameter [= value] [, ... ] ) ]
The ``WITH`` clause supports the following copy parameters:
.. contents::
:local:
:depth: 1
.. _sql-copy-to-compression:
**compression**
| *Type:* ``text``
| *Values:* ``gzip``
| *Default:* By default the output is not compressed.
| *Optional*
Define if and how the exported data should be compressed.
.. _sql-copy-to-protocol:
**protocol**
| *Type:* ``text``
| *Values:* ``http``, ``https``
| *Default:* ``https``
| *Optional*
Protocol to use.
Used only by the :ref:`s3 ` and :ref:`az ` schemes.
.. _sql-copy-to-format:
**format**
| *Type:* ``text``
| *Values:* ``json_object``, ``json_array``
| *Default:* Depends on defined columns. See description below.
| *Optional*
Possible values for the ``format`` setting are:
``json_object``
Each row in the result set is serialized as JSON object and written to an
output file where one line contains one object. This is the default behavior
if no columns are defined. Use this format to import with
:ref:`COPY FROM `.
``json_array``
Each row in the result set is serialized as JSON array, storing one array per
line in an output file. This is the default behavior if columns are defined.
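For example, a sketch forcing ``json_object`` output despite a column list, so
that the export remains importable via ``COPY FROM``:

.. code-block:: sql

    COPY quotes (quote, author) TO DIRECTORY '/tmp/'
    WITH (format = 'json_object');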
.. _sql-copy-to-wait_for_completion:
**wait_for_completion**
| *Type:* ``boolean``
| *Default:* ``true``
| *Optional*
A boolean value indicating if the ``COPY TO`` should wait for
the copy operation to complete. If set to ``false`` the request
returns at once and the copy operation runs in the background.
.. _sql-copy-to-key:
**key**
| *Type:* ``text``
| *Optional*
Used for :ref:`az ` scheme only.
The Azure Storage `Account Key`_.
.. NOTE::
It must be provided if :ref:`sql-copy-to-sas-token` is not provided.
.. _sql-copy-to-sas-token:
**sas_token**
| *Type:* ``text``
| *Optional*
Used for :ref:`az ` scheme only.
The Shared Access Signatures (`SAS`_) token used for authentication for the
Azure Storage account. This can be used as an alternative to the Azure
Storage `Account Key`_.
The SAS token must have read, write, and list permissions for the
container base path and all its contents. These permissions need to be
granted for the blob service and apply to resource types service, container,
and object.
.. NOTE::
It must be provided if :ref:`sql-copy-to-key` is not provided.
.. _Amazon S3: https://aws.amazon.com/s3/
.. _Amazon Simple Storage Service: https://aws.amazon.com/s3/
.. _AWS documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/RESTAuthentication.html
.. _AWS Java Documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/AuthUsingAcctOrUserCredJava.html
.. _Azure Blob Storage: https://learn.microsoft.com/en-us/azure/storage/blobs/
.. _SAS: https://learn.microsoft.com/en-us/azure/storage/common/storage-sas-overview
.. _Account Key: https://learn.microsoft.com/en-us/purview/sit-defn-azure-storage-account-key-generic#format
.. _Docker volume: https://docs.docker.com/storage/volumes/
.. _gzip: https://www.gzip.org/
.. _NFS: https://en.wikipedia.org/wiki/Network_File_System
.. _URL encoded: https://en.wikipedia.org/wiki/Percent-encoding
.. _well-formed URI: https://www.rfc-editor.org/rfc/rfc2396
.. _Windows documentation: https://docs.microsoft.com/en-us/dotnet/standard/io/file-path-formats

.. highlight:: psql
.. _scalar-functions:
.. _builtins-scalar:
================
Scalar functions
================
Scalar functions are :ref:`functions ` that return
:ref:`scalars `.
.. _scalar-string:
String functions
================
.. _scalar-concat:
``concat('first_arg', second_arg, [ parameter , ... ])``
--------------------------------------------------------
Concatenates a variable number of arguments into a single string. It ignores
``NULL`` values.
Returns: ``text``
::
cr> select concat('foo', null, 'bar') AS col;
+--------+
| col |
+--------+
| foobar |
+--------+
SELECT 1 row in set (... sec)
You can also use the ``||`` :ref:`operator `::
cr> select 'foo' || 'bar' AS col;
+--------+
| col |
+--------+
| foobar |
+--------+
SELECT 1 row in set (... sec)
.. NOTE::
The ``||`` operator differs from the ``concat`` function regarding the
handling of ``NULL`` arguments. It will return ``NULL`` if any of the
operands is ``NULL`` while the ``concat`` scalar will return an empty
string if both arguments are ``NULL`` and the non-null argument otherwise.
.. TIP::
The ``concat`` function can also be used for merging objects:
:ref:`concat(object, object) `.
.. _scalar-concat-ws:
``concat_ws('separator', second_arg, [ parameter , ... ])``
------------------------------------------------------------------------------
Concatenates a variable number of arguments into a single string using a
separator defined by the first argument. If the first argument is ``NULL``, the
return value is ``NULL``. Remaining ``NULL`` arguments are ignored.
Returns: ``text``
::
cr> select concat_ws(',','foo', null, 'bar') AS col;
+---------+
| col |
+---------+
| foo,bar |
+---------+
SELECT 1 row in set (... sec)
.. _scalar-format:
``format('format_string', parameter, [ parameter , ... ])``
-----------------------------------------------------------
Formats a string similar to the C function ``printf``. For details about the
format string syntax, see `formatter`_.
Returns: ``text``
::
cr> select format('%s.%s', schema_name, table_name) AS fqtable
... from sys.shards
... where table_name = 'locations'
... limit 1;
+---------------+
| fqtable |
+---------------+
| doc.locations |
+---------------+
SELECT 1 row in set (... sec)
::
cr> select format('%tY', date) AS year
... from locations
... group by format('%tY', date)
... order by 1;
+------+
| year |
+------+
| 1979 |
| 2013 |
+------+
SELECT 2 rows in set (... sec)
.. _scalar-substr:
``substr('string', from, [ count ])``
-------------------------------------
Extracts a part of a string. ``from`` specifies where to start and ``count``
the length of the part.
Returns: ``text``
::
cr> select substr('crate.io', 3, 2) AS substr;
+--------+
| substr |
+--------+
| at |
+--------+
SELECT 1 row in set (... sec)
``substr('string' FROM 'pattern')``
-----------------------------------
Extract a part from a string that matches a POSIX regular expression pattern.
Returns: ``text``.
If the pattern contains groups specified via parentheses it returns the first
matching group.
If the pattern doesn't match, the function returns ``NULL``.
::
cr> SELECT
... substring('2023-08-07', '[a-z]') as no_match,
... substring('2023-08-07', '\d{4}-\d{2}-\d{2}') as full_date,
... substring('2023-08-07', '\d{4}-(\d{2})-\d{2}') as month;
+----------+------------+-------+
| no_match | full_date | month |
+----------+------------+-------+
| NULL | 2023-08-07 | 08 |
+----------+------------+-------+
SELECT 1 row in set (... sec)
.. _scalar-substring:
``substring(...)``
------------------
Alias for :ref:`scalar-substr`.
.. _scalar-char_length:
``char_length('string')``
-------------------------
Counts the number of characters in a string.
Returns: ``integer``
::
cr> select char_length('crate.io') AS char_length;
+-------------+
| char_length |
+-------------+
| 8 |
+-------------+
SELECT 1 row in set (... sec)
Each character counts only once, regardless of its byte size.
::
cr> select char_length('©rate.io') AS char_length;
+-------------+
| char_length |
+-------------+
| 8 |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-length:
``length(text)``
----------------
Returns the number of characters in a string.
The same as :ref:`char_length `.
.. _scalar-bit_length:
``bit_length('string')``
------------------------
Counts the number of bits in a string.
Returns: ``integer``
.. NOTE::
CrateDB uses UTF-8 encoding internally, which uses between 1 and 4 bytes
per character.
::
cr> select bit_length('crate.io') AS bit_length;
+------------+
| bit_length |
+------------+
| 64 |
+------------+
SELECT 1 row in set (... sec)
::
cr> select bit_length('©rate.io') AS bit_length;
+------------+
| bit_length |
+------------+
| 72 |
+------------+
SELECT 1 row in set (... sec)
.. _scalar-octet_length:
``octet_length('string')``
--------------------------
Counts the number of bytes (octets) in a string.
Returns: ``integer``
::
cr> select octet_length('crate.io') AS octet_length;
+--------------+
| octet_length |
+--------------+
| 8 |
+--------------+
SELECT 1 row in set (... sec)
::
cr> select octet_length('©rate.io') AS octet_length;
+--------------+
| octet_length |
+--------------+
| 9 |
+--------------+
SELECT 1 row in set (... sec)
.. _scalar-ascii:
``ascii(string)``
-----------------
Returns the ASCII code of the first character. For UTF-8, returns the
Unicode code point of the character.
Returns: ``int``
::
cr> SELECT ascii('a') AS a, ascii('🎈') AS b;
+----+--------+
| a | b |
+----+--------+
| 97 | 127880 |
+----+--------+
SELECT 1 row in set (... sec)
.. _scalar-chr:
``chr(int)``
------------
Returns the character with the given code. For UTF-8 the argument is treated as
a Unicode code point.
Returns: ``string``
::
cr> SELECT chr(65) AS a;
+---+
| a |
+---+
| A |
+---+
SELECT 1 row in set (... sec)
.. _scalar-lower:
``lower('string')``
-------------------
Converts all characters to lowercase. ``lower`` does not perform
locale-sensitive or context-sensitive mappings.
Returns: ``text``
::
cr> select lower('TransformMe') AS lower;
+-------------+
| lower |
+-------------+
| transformme |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-upper:
``upper('string')``
-------------------
Converts all characters to uppercase. ``upper`` does not perform
locale-sensitive or context-sensitive mappings.
Returns: ``text``
::
cr> select upper('TransformMe') as upper;
+-------------+
| upper |
+-------------+
| TRANSFORMME |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-initcap:
``initcap('string')``
---------------------
Converts the first letter of each word to upper case and the rest to lower case
(*capitalize letters*).
Returns: ``text``
::
cr> select initcap('heLlo WORLD') AS initcap;
+-------------+
| initcap |
+-------------+
| Hello World |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-sha1:
``sha1('string')``
------------------
Returns: ``text``
Computes the SHA1 checksum of the given string.
::
cr> select sha1('foo') AS sha1;
+------------------------------------------+
| sha1 |
+------------------------------------------+
| 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 |
+------------------------------------------+
SELECT 1 row in set (... sec)
.. _scalar-md5:
``md5('string')``
-----------------
Returns: ``text``
Computes the MD5 checksum of the given string.
See :ref:`sha1 ` for an example.
.. _scalar-replace:
``replace(text, from, to)``
---------------------------
Replaces all occurrences of ``from`` in ``text`` with ``to``.
::
cr> select replace('Hello World', 'World', 'Stranger') AS hello;
+----------------+
| hello |
+----------------+
| Hello Stranger |
+----------------+
SELECT 1 row in set (... sec)
.. _scalar-translate:
``translate(string, from, to)``
-------------------------------
Performs several single-character, one-to-one translations in one operation. It
translates ``string`` by replacing the characters in the ``from`` set,
one-to-one positionally, with their counterparts in the ``to`` set. If ``from``
is longer than ``to``, the function removes the occurrences of the extra
characters in ``from``. If there are repeated characters in ``from``, only the
first mapping is considered.
Synopsis::
translate(string, from, to)
Examples::
cr> select translate('Crate', 'Ct', 'Dk') as translation;
+-------------+
| translation |
+-------------+
| Drake |
+-------------+
SELECT 1 row in set (... sec)
::
cr> select translate('Crate', 'rCe', 'c') as translation;
+-------------+
| translation |
+-------------+
| cat |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-trim:
``trim({LEADING | TRAILING | BOTH} 'str_arg_1' FROM 'str_arg_2')``
------------------------------------------------------------------
Removes the longest string containing characters from ``str_arg_1`` (``' '`` by
default) from the start, end, or both ends (``BOTH`` is the default) of
``str_arg_2``.
If any of the two strings is ``NULL``, the result is ``NULL``.
Synopsis::
trim([ [ {LEADING | TRAILING | BOTH} ] [ str_arg_1 ] FROM ] str_arg_2)
Examples::
cr> select trim(BOTH 'ab' from 'abcba') AS trim;
+------+
| trim |
+------+
| c |
+------+
SELECT 1 row in set (... sec)
::
cr> select trim('ab' from 'abcba') AS trim;
+------+
| trim |
+------+
| c |
+------+
SELECT 1 row in set (... sec)
::
cr> select trim(' abcba ') AS trim;
+-------+
| trim |
+-------+
| abcba |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-ltrim:
``ltrim(text, [ trimmingText ])``
---------------------------------
Removes the characters matching ``trimmingText`` (``' '`` by default) from
the start of ``text``.
If any of the arguments is ``NULL``, the result is ``NULL``.
::
cr> select ltrim('xxxzzzabcba', 'xz') AS ltrim;
+-------+
| ltrim |
+-------+
| abcba |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-rtrim:
``rtrim(text, [ trimmingText ])``
---------------------------------
Removes the characters matching ``trimmingText`` (``' '`` by default) from
the end of ``text``.
If any of the arguments is ``NULL``, the result is ``NULL``.
::
cr> select rtrim('abcbaxxxzzz', 'xz') AS rtrim;
+-------+
| rtrim |
+-------+
| abcba |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-btrim:
``btrim(text, [ trimmingText ])``
---------------------------------
A combination of :ref:`ltrim ` and :ref:`rtrim `,
removing the longest string matching ``trimmingText`` from both the start and
end of ``text``.
If any of the arguments is ``NULL``, the result is ``NULL``.
::
cr> select btrim('XXHelloXX', 'XX') AS btrim;
+-------+
| btrim |
+-------+
| Hello |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-quote_ident:
``pg_catalog.quote_ident(text)``
--------------------------------
Returns: ``text``
Quotes a provided string argument. Quotes are added only if necessary (e.g.,
if the string contains non-identifier characters, is a keyword, or would be
case-folded). Embedded quotes are properly doubled.
The quoted string can be used as an identifier in an SQL statement.
::
cr> select pg_catalog.quote_ident('Column name') AS quoted;
+---------------+
| quoted |
+---------------+
| "Column name" |
+---------------+
SELECT 1 row in set (... sec)
.. _scalar-left:
``left('string', len)``
-----------------------
Returns the first ``len`` characters of ``string`` when ``len`` > 0,
otherwise all but the last ``len`` characters.
Synopsis::
left(string, len)
Examples::
cr> select left('crate.io', 5) AS col;
+-------+
| col |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
::
cr> select left('crate.io', -3) AS col;
+-------+
| col |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-right:
``right('string', len)``
------------------------
Returns the last ``len`` characters in ``string`` when ``len`` > 0, otherwise
all but the first ``len`` characters.
Synopsis::
right(string, len)
Examples::
cr> select right('crate.io', 2) AS col;
+-----+
| col |
+-----+
| io |
+-----+
SELECT 1 row in set (... sec)
::
cr> select right('crate.io', -6) AS col;
+-----+
| col |
+-----+
| io |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-starts_with:
``starts_with(text, prefix)``
-----------------------------
Returns ``true`` if the string ``text`` starts with the string ``prefix``,
otherwise ``false``. Returns ``NULL`` if any argument is ``NULL``.
Returns: ``boolean``
Synopsis::
starts_with(text, prefix)
Examples::
cr> select starts_with('crate.io', 'crate') AS col;
+------+
| col |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)
cr> select starts_with('crate.io', 'io') AS col;
+-------+
| col |
+-------+
| FALSE |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-lpad:
``lpad('string1', len[, 'string2'])``
-------------------------------------
Fill up ``string1`` to length ``len`` by prepending the characters ``string2``
(a space by default). If ``string1`` is already longer than ``len`` then it is
truncated (on the right).
Synopsis::
lpad(string1, len[, string2])
Example::
cr> select lpad(' I like CrateDB!!', 41, 'yes! ') AS col;
+-------------------------------------------+
| col |
+-------------------------------------------+
| yes! yes! yes! yes! yes! I like CrateDB!! |
+-------------------------------------------+
SELECT 1 row in set (... sec)
.. _scalar-rpad:
``rpad('string1', len[, 'string2'])``
-------------------------------------
Fill up ``string1`` to length ``len`` by appending the characters ``string2``
(a space by default). If string1 is already longer than ``len`` then it is
truncated.
Synopsis::
rpad(string1, len[, string2])
Example::
cr> select rpad('Do you like Crate?', 38, ' yes!') AS col;
+----------------------------------------+
| col |
+----------------------------------------+
| Do you like Crate? yes! yes! yes! yes! |
+----------------------------------------+
SELECT 1 row in set (... sec)
.. NOTE::
In both cases, the scalar functions ``lpad`` and ``rpad`` do not accept a
length greater than 50000.
.. _scalar-encode:
``encode(bytea, format)``
-------------------------
Encode takes a binary string (``hex`` format) and returns a text encoding using
the specified format. Supported formats are: ``base64``, ``hex``, and
``escape``. The ``escape`` format replaces unprintable characters with octal
byte notation like ``\nnn``. For the reverse function, see :ref:`decode()
`.
Synopsis::
encode(string1, format)
Example::
cr> select encode(E'123\b\t56', 'base64') AS col;
+--------------+
| col |
+--------------+
| MTIzCAk1Ng== |
+--------------+
SELECT 1 row in set (... sec)
.. _scalar-decode:
``decode(text, format)``
-------------------------
Decodes a text encoded string using the specified format and returns a binary
string (``hex`` format). Supported formats are: ``base64``, ``hex``, and
``escape``. For the reverse function, see :ref:`encode() `.
Synopsis::
decode(text1, format)
Example::
cr> select decode('T\214', 'escape') AS col;
+--------+
| col |
+--------+
| \x548c |
+--------+
SELECT 1 row in set (... sec)
.. _scalar-repeat:
``repeat(text, integer)``
-------------------------
Repeats a string the specified number of times.
If the number of repetitions is equal to or less than zero, the function
returns an empty string.
Returns: ``text``
::
cr> select repeat('ab', 3) AS repeat;
+--------+
| repeat |
+--------+
| ababab |
+--------+
SELECT 1 row in set (... sec)
.. _scalar-strpos:
``strpos(string, substring)``
-----------------------------
Returns the first 1-based index of the specified substring within string.
Returns zero if the substring is not found and ``NULL`` if any of the arguments
is ``NULL``.
Returns: ``integer``
::
cr> SELECT strpos('crate' , 'ate');
+--------+
| strpos |
+--------+
| 3 |
+--------+
SELECT 1 row in set (... sec)
.. _scalar-position:
``position(substring in string)``
---------------------------------
The ``position()`` scalar function is an alias of the :ref:`scalar-strpos`
scalar function. Note that the order of the arguments is reversed.
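For example, mirroring the ``strpos`` example above (note the reversed
argument order)::

    cr> SELECT position('ate' in 'crate') AS pos;
    +-----+
    | pos |
    +-----+
    |   3 |
    +-----+
    SELECT 1 row in set (... sec)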
.. _scalar-reverse:
``reverse(text)``
------------------
Reverses the order of the string. Returns ``NULL`` if the argument is ``NULL``.
Returns: ``text``
::
cr> select reverse('abcde') as reverse;
+---------+
| reverse |
+---------+
| edcba |
+---------+
SELECT 1 row in set (... sec)
.. _scalar-split_part:
``split_part(text, text, integer)``
-----------------------------------
Splits a string into parts using a delimiter and returns the part at the given
index. The first part is addressed by index ``1``.
Special Cases:
* Returns the empty string if the index is greater than the number of parts.
* If any of the arguments is ``NULL``, the result is ``NULL``.
* If the delimiter is the empty string, the input string is considered as
consisting of exactly one part.
Returns: ``text``
Synopsis::
split_part(string, delimiter, index)
Example::
cr> select split_part('ab--cdef--gh', '--', 2) AS part;
+------+
| part |
+------+
| cdef |
+------+
SELECT 1 row in set (... sec)
.. _scalar-parse_uri:
``parse_uri(text)``
-----------------------------------
Returns: ``object``
Parses the given URI string and returns an object containing the various
components of the URI. The returned object has the following properties::
"uri" OBJECT AS (
"scheme" TEXT,
"userinfo" TEXT,
"hostname" TEXT,
"port" INT,
"path" TEXT,
"query" TEXT,
"fragment" TEXT
)
.. csv-table::
:header: "URI Component", "Description"
:widths: 25, 75
:align: left
``scheme`` , "The scheme of the URI (e.g. ``http``, ``crate``, etc.)"
``userinfo`` , "The decoded user-information component of this URI."
``hostname`` , "The hostname or IP address specified in the URI."
``port`` , "The port number specified in the URI."
``path`` , "The decoded path specified in the URI."
``query`` , "The decoded query string specified in the URI."
``fragment`` , "The decoded fragment specified in the URI."
.. NOTE::
For URI properties not specified in the input string, ``null`` is returned.
Synopsis::
parse_uri(text)
Example::
cr> SELECT parse_uri('crate://my_user@cluster.crate.io:5432/doc?sslmode=verify-full') as uri;
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uri |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"fragment": null, "hostname": "cluster.crate.io", "path": "/doc", "port": 5432, "query": "sslmode=verify-full", "scheme": "crate", "userinfo": "my_user"} |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT 1 row in set (... sec)
If you just want to select a specific URI component, you can use the bracket
notation on the returned object::
cr> SELECT parse_uri('crate://my_user@cluster.crate.io:5432')['hostname'] as uri_hostname;
+------------------+
| uri_hostname |
+------------------+
| cluster.crate.io |
+------------------+
SELECT 1 row in set (... sec)
.. _scalar-parse_url:
``parse_url(text)``
-----------------------------------
Returns: ``object``
Parses the given URL string and returns an object containing the various
components of the URL. The returned object has the following properties::
"url" OBJECT AS (
"scheme" TEXT,
"userinfo" TEXT,
"hostname" TEXT,
"port" INT,
"path" TEXT,
"query" TEXT,
"parameters" OBJECT AS (
"key1" ARRAY(TEXT),
"key2" ARRAY(TEXT)
),
"fragment" TEXT
)
.. csv-table::
:header: "URL Component", "Description"
:widths: 25, 75
:align: left
``scheme`` , "The scheme of the URL (e.g. ``https``, ``crate``, etc.)"
``userinfo`` , "The decoded user-information component of this URL."
``hostname`` , "The hostname or IP address specified in the URL."
``port`` , "The port number specified in the URL. If no port number is specified, the default port for the given scheme will be used."
``path`` , "The decoded path specified in the URL."
``query`` , "The decoded query string specified in the URL."
``parameters`` , "For each query parameter included in the URL, the ``parameters`` property holds an object property that stores an array of decoded text values for that specific query parameter."
``fragment`` , "The decoded fragment specified in the URL."
.. NOTE::
For URL properties not specified in the input string, ``null`` is returned.
Synopsis::
parse_url(text)
Example::
cr> SELECT parse_url('https://my_user@cluster.crate.io:8000/doc?sslmode=verify-full') as url;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| url |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"fragment": null, "hostname": "cluster.crate.io", "parameters": {"sslmode": ["verify-full"]}, "path": "/doc", "port": 8000, "query": "sslmode=verify-full", "scheme": "https", "userinfo": "my_user"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT 1 row in set (... sec)
If you just want to select a specific URL component, you can use the bracket
notation on the returned object::
cr> SELECT parse_url('https://my_user@cluster.crate.io:5432')['hostname'] as url_hostname;
+------------------+
| url_hostname |
+------------------+
| cluster.crate.io |
+------------------+
SELECT 1 row in set (... sec)
Parameter values are always treated as ``text``. There is no conversion of
comma-separated parameter values into arrays::
cr> SELECT parse_url('http://crate.io?p1=1,2,3&p1=a&p2[]=1,2,3')['parameters'] as params;
+-------------------------------------------+
| params |
+-------------------------------------------+
| {"p1": ["1,2,3", "a"], "p2[]": ["1,2,3"]} |
+-------------------------------------------+
SELECT 1 row in set (... sec)
.. _scalar-date-time:
Date and time functions
=======================
.. _scalar-date_trunc:
``date_trunc('interval', ['timezone',] timestamp)``
---------------------------------------------------
Returns: ``timestamp with time zone``
Limits a timestamp's precision to a given interval.
Valid intervals are:
* ``second``
* ``minute``
* ``hour``
* ``day``
* ``week``
* ``month``
* ``quarter``
* ``year``
Valid values for ``timezone`` are either the name of a time zone (for example
'Europe/Vienna') or the UTC offset of a time zone (for example '+01:00'). To
get a complete overview of all possible values take a look at the `available
time zones`_ supported by `Joda-Time`_.
The following example shows how to use the ``date_trunc`` function to generate
a day based histogram in the ``Europe/Moscow`` timezone::
cr> select
... date_trunc('day', 'Europe/Moscow', date) as day,
... count(*) as num_locations
... from locations
... group by 1
... order by 1;
+---------------+---------------+
| day | num_locations |
+---------------+---------------+
| 308523600000 | 4 |
| 1367352000000 | 1 |
| 1373918400000 | 8 |
+---------------+---------------+
SELECT 3 rows in set (... sec)
If you don't specify a time zone, ``date_trunc`` uses UTC time::
cr> select date_trunc('day', date) as day, count(*) as num_locations
... from locations
... group by 1
... order by 1;
+---------------+---------------+
| day | num_locations |
+---------------+---------------+
| 308534400000 | 4 |
| 1367366400000 | 1 |
| 1373932800000 | 8 |
+---------------+---------------+
SELECT 3 rows in set (... sec)
.. _date-bin:
``date_bin(interval, timestamp, origin)``
-----------------------------------------
``date_bin`` "bins" the input timestamp to the specified interval, aligned with
a specified origin.
``interval`` is an expression of type ``interval``.
``timestamp`` and ``origin`` are expressions of type
``timestamp with time zone`` or ``timestamp without time zone``.
The return type matches the timestamp and origin types and will be either
``timestamp with time zone`` or ``timestamp without time zone``.
The return value marks the beginning of the bin into which the input timestamp
is placed.
If you use an interval with a single unit like ``1 second`` or ``1 minute``,
this function returns the same result as :ref:`date_trunc `.
If the interval is ``1 week``, ``date_bin`` only returns the same result as
``date_trunc`` if the origin is a Monday.
If at least one argument is ``NULL``, the return value is ``NULL``. The
interval cannot be zero. Negative intervals are allowed and are treated the
same as positive intervals. Intervals having month or year units are not
supported due to varying length of those units.
A timestamp can be binned to an interval of arbitrary length
aligned with a custom origin.
Examples:
::
cr> SELECT date_bin('2 hours'::INTERVAL, ts,
... '2021-01-01T05:00:00Z'::TIMESTAMP) as bin,
... date_format('%y-%m-%d %H:%i',
... date_bin('2 hours'::INTERVAL, ts, '2021-01-01T05:00:00Z'::TIMESTAMP))
... formatted_bin
... FROM unnest(ARRAY[
... '2021-01-01T08:30:10Z',
... '2021-01-01T08:38:10Z',
... '2021-01-01T18:18:10Z',
... '2021-01-01T18:18:10Z'
... ]::TIMESTAMP[]) as tbl (ts);
+---------------+----------------+
| bin | formatted_bin |
+---------------+----------------+
| 1609484400000 | 21-01-01 07:00 |
| 1609484400000 | 21-01-01 07:00 |
| 1609520400000 | 21-01-01 17:00 |
| 1609520400000 | 21-01-01 17:00 |
+---------------+----------------+
SELECT 4 rows in set (... sec)
.. TIP::
0 can be used as a shortcut for Unix zero as the origin::
cr> select date_bin('2 hours' :: INTERVAL,
... '2021-01-01T08:30:10Z' :: timestamp without time ZONE, 0) as bin;
+---------------+
| bin |
+---------------+
| 1609488000000 |
+---------------+
SELECT 1 row in set (... sec)
Note that the implicit cast treats numbers as-is, i.e. as a timestamp in
UTC. If the timestamp is in a non-UTC time zone, you might want to shift the
numeric origin into the same zone::
cr> select date_bin('4 hours' :: INTERVAL,
... '2020-01-01T09:00:00+0200'::timestamp with time zone,
... TIMEZONE('+02:00', 0)) as bin;
+---------------+
| bin |
+---------------+
| 1577858400000 |
+---------------+
SELECT 1 row in set (... sec)
.. _scalar-extract:
``extract(field from source)``
------------------------------
``extract`` is a special :ref:`expression ` that translates
to a function which retrieves subcolumns such as day, hour or minute from a
timestamp or an interval.
The return type depends on the used ``field``.
Example with timestamp::
cr> select extract(day from '2014-08-23') AS day;
+-----+
| day |
+-----+
| 23 |
+-----+
SELECT 1 row in set (... sec)
Example with interval::
cr> select extract(hour from INTERVAL '5 days 12 hours 45 minutes') AS hour;
+------+
| hour |
+------+
| 12 |
+------+
SELECT 1 row in set (... sec)
Synopsis::
EXTRACT( field FROM source )
``field``
An identifier or string literal which identifies the part of the timestamp or
interval that should be extracted.
``source``
An expression that resolves to an interval, or a timestamp (with or without
timezone), or is castable to a timestamp.
.. NOTE::
When extracting from an :ref:`INTERVAL ` there is
normalization of units, up to days e.g.::
cr> SELECT extract(day from INTERVAL '14 years 1250 days 49 hours') AS days;
+------+
| days |
+------+
| 1252 |
+------+
SELECT 1 row in set (... sec)
The following fields are supported:
``CENTURY``
| *Return type:* ``integer``
| century of era
Returns the ISO representation, which is a straight split of the date:
year 2000 is century 20, and year 2001 is also century 20. This differs from
the Gregorian/Julian (GJ) calendar system, where 2001 would be century 21.
``YEAR``
| *Return type:* ``integer``
| the year field
``QUARTER``
| *Return type:* ``integer``
| the quarter of the year (1 - 4)
``MONTH``
| *Return type:* ``integer``
| the month of the year
``WEEK``
| *Return type:* ``integer``
| the week of the year
``DAY``
| *Return type:* ``integer``
| the day of the month for timestamps, days for intervals
``DAY_OF_MONTH``
| *Return type:* ``integer``
| same as ``day``
``DAY_OF_WEEK``
| *Return type:* ``integer``
| day of the week. Starting with Monday (1) to Sunday (7)
``DOW``
| *Return type:* ``integer``
| same as ``day_of_week``
``DAY_OF_YEAR``
| *Return type:* ``integer``
| the day of the year (1 - 365 / 366)
``DOY``
| *Return type:* ``integer``
| same as ``day_of_year``
``HOUR``
| *Return type:* ``integer``
| the hour field
``MINUTE``
| *Return type:* ``integer``
| the minute field
``SECOND``
| *Return type:* ``integer``
| the second field
``EPOCH``
| *Return type:* ``double precision``
| The number of seconds since Jan 1, 1970.
| Can be negative if earlier than Jan 1, 1970.
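For instance, extracting ``EPOCH`` from a timestamp one day after the Unix
epoch returns ``86400.0`` seconds (a sketch; the output is assumed from the
description above)::
cr> select extract(epoch from '1970-01-02T00:00:00'::timestamp) AS epoch;
+---------+
| epoch |
+---------+
| 86400.0 |
+---------+
SELECT 1 row in set (... sec)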
.. _scalar-current_time:
``CURRENT_TIME``
----------------
The ``CURRENT_TIME`` :ref:`expression ` returns the time in
microseconds since midnight UTC at the time the SQL statement was
handled. Clock time is looked up at most once within the scope of a single
query, to ensure that multiple occurrences of ``CURRENT_TIME`` :ref:`evaluate
` to the same value.
Synopsis::
CURRENT_TIME [ ( precision ) ]
``precision``
Must be an integer between 0 and 6. The default value is 6. It
determines the number of fractional seconds to output. A value of 0 means the
time will have second precision, no fractional seconds (microseconds) are
given.
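For illustration, a query like the following returns the time with second
precision; since the value depends on when the statement runs, no fixed
output is shown (a sketch)::
SELECT CURRENT_TIME(0) AS t;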
.. NOTE::
No guarantee is provided about the accuracy of the underlying clock,
results may be limited to millisecond precision, depending on the system.
.. _scalar-current_timestamp:
``CURRENT_TIMESTAMP``
---------------------
The ``CURRENT_TIMESTAMP`` expression returns the timestamp in milliseconds
since midnight UTC at the time the SQL statement was handled. Therefore, the
same timestamp value is returned for every invocation of a single statement.
Synopsis::
CURRENT_TIMESTAMP [ ( precision ) ]
``precision``
Must be an integer between ``0`` and ``3``. The default value is
``3``. This value determines the number of fractional seconds to output. A
value of ``0`` means the timestamp will have second precision, no fractional
seconds (milliseconds) are given.
.. TIP::
To get an offset value of ``CURRENT_TIMESTAMP`` (e.g., this same time one
day ago), you can add or subtract an :ref:`interval `,
like so::
CURRENT_TIMESTAMP - '1 day'::interval
.. NOTE::
If the ``CURRENT_TIMESTAMP`` function is used in
:ref:`ddl-generated-columns` it behaves slightly different in ``UPDATE``
operations. In such a case the actual timestamp of each row update is
returned.
.. _scalar-curdate:
``CURDATE()``
----------------
The ``CURDATE()`` scalar function is an alias of the :ref:`scalar-current_date`
expression.
Synopsis::
CURDATE()
.. _scalar-current_date:
``CURRENT_DATE``
----------------
The ``CURRENT_DATE`` expression returns the date in UTC timezone at the time
the SQL statement was handled.
Clock time is looked up at most once within the scope of a single query, to
ensure that multiple occurrences of ``CURRENT_DATE`` evaluate to the same
value.
Synopsis::
CURRENT_DATE
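Because ``CURDATE()`` is an alias and the clock is read at most once per
query, both expressions yield the same value within a single statement (a
sketch, with the output assumed)::
cr> select CURRENT_DATE = CURDATE() AS same;
+------+
| same |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)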
.. _scalar-now:
``now()``
---------
Returns the current date and time in UTC.
This is the same as ``current_timestamp``.
Returns: ``timestamp with time zone``
Synopsis::
now()
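Accordingly, ``now()`` and ``CURRENT_TIMESTAMP`` evaluate to the same value
within a single statement (a sketch, with the output assumed)::
cr> select now() = CURRENT_TIMESTAMP AS same;
+------+
| same |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)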
.. _scalar-date_format:
``date_format([format_string, [timezone,]] timestamp)``
-------------------------------------------------------
The ``date_format`` function formats a timestamp as string according to the
(optional) format string.
Returns: ``text``
Synopsis::
DATE_FORMAT( [ format_string, [ timezone, ] ] timestamp )
The only mandatory argument is the ``timestamp`` value to format. It can be any
:ref:`expression ` that is safely convertible to timestamp
data type with or without timezone.
The syntax for the ``format_string`` is 100% compatible with the syntax of the
`MySQL date_format`_ function. For reference, the format is listed in detail
below:
.. csv-table::
:header: "Format Specifier", "Description"
``%a``, "Abbreviated weekday name (Sun..Sat)"
``%b``, "Abbreviated month name (Jan..Dec)"
``%c``, "Month in year, numeric (0..12)"
``%D``, "Day of month as ordinal number (1st, 2nd, ... 24th)"
``%d``, "Day of month, padded to 2 digits (00..31)"
``%e``, "Day of month (0..31)"
``%f``, "Microseconds, padded to 6 digits (000000..999999)"
``%H``, "Hour in 24-hour clock, padded to 2 digits (00..23)"
``%h``, "Hour in 12-hour clock, padded to 2 digits (01..12)"
``%I``, "Hour in 12-hour clock, padded to 2 digits (01..12)"
``%i``, "Minutes, numeric (00..59)"
``%j``, "Day of year, padded to 3 digits (001..366)"
``%k``, "Hour in 24-hour clock (0..23)"
``%l``, "Hour in 12-hour clock (1..12)"
``%M``, "Month name (January..December)"
``%m``, "Month in year, numeric, padded to 2 digits (00..12)"
``%p``, "AM or PM"
``%r``, "Time, 12-hour (``hh:mm:ss`` followed by AM or PM)"
``%S``, "Seconds, padded to 2 digits (00..59)"
``%s``, "Seconds, padded to 2 digits (00..59)"
``%T``, "Time, 24-hour (``hh:mm:ss``)"
``%U``, "Week number, Sunday as first day of the week, first week of the
year (01) is the one starting in this year, week 00 starts in last year
(00..53)"
``%u``, "Week number, Monday as first day of the week, first week of the
year (01) is the one with at least 4 days in this year (00..53)"
``%V``, "Week number, Sunday as first day of the week, first week of the
year (01) is the one starting in this year, uses the week number of the last
year, if the week started in last year (01..53)"
``%v``, "Week number, Monday as first day of the week, first week of the
year (01) is the one with at least 4 days in this year, uses the week number
of the last year, if the week started in last year (01..53)"
``%W``, "Weekday name (Sunday..Saturday)"
``%w``, "Day of the week (0=Sunday..6=Saturday)"
``%X``, "Week year, Sunday as first day of the week, numeric, four digits;
used with %V"
``%x``, "Week year, Monday as first day of the week, numeric, four digits;
used with %v"
``%Y``, "Year, numeric, four digits"
``%y``, "Year, numeric, two digits"
``%%``, "A literal '%' character"
``%x``, "x, for any 'x' not listed above"
If no ``format_string`` is given the default format will be used::
%Y-%m-%dT%H:%i:%s.%fZ
::
cr> select date_format('1970-01-01') as epoque;
+-----------------------------+
| epoque |
+-----------------------------+
| 1970-01-01T00:00:00.000000Z |
+-----------------------------+
SELECT 1 row in set (... sec)
Valid values for ``timezone`` are either the name of a time zone (for example
'Europe/Vienna') or the UTC offset of a time zone (for example '+01:00'). To
get a complete overview of all possible values take a look at the `available
time zones`_ supported by `Joda-Time`_.
The ``timezone`` will be ``UTC`` if not provided::
cr> select date_format('%W the %D of %M %Y %H:%i %p', 0) as epoque;
+-------------------------------------------+
| epoque |
+-------------------------------------------+
| Thursday the 1st of January 1970 00:00 AM |
+-------------------------------------------+
SELECT 1 row in set (... sec)
::
cr> select date_format('%Y/%m/%d %H:%i', 'EST', 0) as est_epoque;
+------------------+
| est_epoque |
+------------------+
| 1969/12/31 19:00 |
+------------------+
SELECT 1 row in set (... sec)
.. _scalar-timezone:
``timezone(timezone, timestamp)``
---------------------------------
The timezone scalar function converts values of ``timestamp`` without time zone
to/from timestamp with time zone.
Synopsis::
TIMEZONE(timezone, timestamp)
It has two variants depending on the type of ``timestamp``:
.. csv-table::
:header: "Type of timestamp", "Return Type", "Description"
"timestamp without time zone OR bigint", "timestamp with time zone", "Treat
given timestamp without time zone as located in the specified timezone"
"timestamp with time zone", "timestamp without time zone", "Convert given
timestamp with time zone to the new timezone with no time zone designation"
::
cr> select
... 257504400000 as no_tz,
... date_format(
... '%Y-%m-%d %h:%i', 257504400000
... ) as no_tz_str,
... timezone(
... 'Europe/Madrid', 257504400000
... ) as in_madrid,
... date_format(
... '%Y-%m-%d %h:%i',
... timezone(
... 'Europe/Madrid', 257504400000
... )
... ) as in_madrid_str;
+--------------+------------------+--------------+------------------+
| no_tz | no_tz_str | in_madrid | in_madrid_str |
+--------------+------------------+--------------+------------------+
| 257504400000 | 1978-02-28 09:00 | 257500800000 | 1978-02-28 08:00 |
+--------------+------------------+--------------+------------------+
SELECT 1 row in set (... sec)
::
cr> select
... timezone(
... 'Europe/Madrid',
... '1978-02-28T10:00:00+01:00'::timestamp with time zone
... ) as epoque,
... date_format(
... '%Y-%m-%d %h:%i',
... timezone(
... 'Europe/Madrid',
... '1978-02-28T10:00:00+01:00'::timestamp with time zone
... )
... ) as epoque_str;
+--------------+------------------+
| epoque | epoque_str |
+--------------+------------------+
| 257508000000 | 1978-02-28 10:00 |
+--------------+------------------+
SELECT 1 row in set (... sec)
::
cr> select
... timezone(
... 'Europe/Madrid',
... '1978-02-28T10:00:00+01:00'::timestamp without time zone
... ) as epoque,
... date_format(
... '%Y-%m-%d %h:%i',
... timezone(
... 'Europe/Madrid',
... '1978-02-28T10:00:00+01:00'::timestamp without time zone
... )
... ) as epoque_str;
+--------------+------------------+
| epoque | epoque_str |
+--------------+------------------+
| 257504400000 | 1978-02-28 09:00 |
+--------------+------------------+
SELECT 1 row in set (... sec)
.. _scalar-to_char:
``to_char(expression, format_string)``
--------------------------------------
The ``to_char`` function converts a ``timestamp`` or ``interval`` value to a
string, based on a given format string.
Returns: ``text``
Synopsis::
TO_CHAR( expression, format_string )
Here, ``expression`` can be any value with the type of ``timestamp`` (with or
without a timezone) or ``interval``.
The syntax for the ``format_string`` differs based on the type of the
:ref:`expression `. For ``timestamp`` expressions, the
``format_string`` is a template string containing any of the following symbols:
.. list-table::
:header-rows: 1
* - Pattern
- Description
* - ``HH`` / ``HH12`` / ``hh`` / ``hh12``
- Hour of day (01-12)
* - ``HH24`` / ``hh24``
- Hour of day (00-23)
* - ``MI`` / ``mi``
- Minute (00-59)
* - ``SS`` / ``ss``
- Second (00-59)
* - ``MS`` / ``ms``
- Millisecond (000-999)
* - ``US`` / ``us``
- Microsecond (000000-999999)
* - ``FF1`` / ``ff1``
- Tenth of second (0-9)
* - ``FF2`` / ``ff2``
- Hundredth of second (00-99)
* - ``FF3`` / ``ff3``
- Millisecond (000-999)
* - ``FF4`` / ``ff4``
- Tenth of millisecond (0000-9999)
* - ``FF5`` / ``ff5``
- Hundredth of millisecond (00000-99999)
* - ``FF6`` / ``ff6``
- Microsecond (000000-999999)
* - ``SSSS`` / ``SSSSS`` / ``ssss`` / ``sssss``
- Seconds past midnight (0-86399)
* - ``AM`` / ``am`` / ``PM`` / ``pm``
- Meridiem indicator
* - ``A.M.`` / ``a.m.`` / ``P.M.`` / ``p.m.``
- Meridiem indicator (with periods)
* - ``Y,YYY`` / ``y,yyy``
- 4 digit year with comma
* - ``YYYY`` / ``yyyy``
- 4 digit year
* - ``YYY`` / ``yyy``
- Last 3 digits of year
* - ``YY`` / ``yy``
- Last 2 digits of year
* - ``Y`` / ``y``
- Last digit of year
* - ``IYYY`` / ``iyyy``
- 4 digit ISO-8601 week-numbering year
* - ``IYY`` / ``iyy``
- Last 3 digits of ISO-8601 week-numbering year
* - ``IY`` / ``iy``
- Last 2 digits of ISO-8601 week-numbering year
* - ``I`` / ``i``
- Last digit of ISO-8601 week-numbering year
* - ``BC`` / ``bc`` / ``AD`` / ``ad``
- Era indicator
* - ``B.C.`` / ``b.c.`` / ``A.D.`` / ``a.d.``
- Era indicator with periods
* - ``MONTH`` / ``Month`` / ``month``
- Full month name (uppercase, capitalized, lowercase) padded to 9
characters
* - ``MON`` / ``Mon`` / ``mon``
- Short month name (uppercase, capitalized, lowercase) padded to 9
characters
* - ``MM`` / ``mm``
- Month number (01-12)
* - ``DAY`` / ``Day`` / ``day``
- Full day name (uppercase, capitalized, lowercase) padded to 9 characters
* - ``DY`` / ``Dy`` / ``dy``
- Short, 3 character day name (uppercase, capitalized, lowercase)
* - ``DDD`` / ``ddd``
- Day of year (001-366)
* - ``IDDD`` / ``iddd``
- Day of ISO-8601 week-numbering year, where the first Monday of the first
ISO week is day 1 (001-371)
* - ``DD`` / ``dd``
- Day of month (01-31)
* - ``D`` / ``d``
- Day of the week, from Sunday (1) to Saturday (7)
* - ``ID`` / ``id``
- ISO-8601 day of the week, from Monday (1) to Sunday (7)
* - ``WW`` / ``ww``
- Week number of year (1-53)
* - ``W`` / ``w``
- Week of month (1-5)
* - ``IW`` / ``iw``
- Week number of ISO-8601 week-numbering year (01-53)
* - ``CC`` / ``cc``
- Century
* - ``J`` / ``j``
- Julian Day
* - ``Q`` / ``q``
- Quarter
* - ``RM`` / ``rm``
- Month in Roman numerals (uppercase, lowercase)
* - ``TZ`` / ``tz``
- Time-zone abbreviation (uppercase, lowercase)
* - ``TZH`` / ``tzh``
- Time-zone hours
* - ``TZM`` / ``tzm``
- Time-zone minutes
* - ``OF`` / ``of``
- Time-zone offset from UTC
Example::
cr> select
... to_char(
... timestamp '1970-01-01T17:31:12',
... 'Day, Month DD - HH12:MI AM YYYY AD'
... ) as ts;
+-----------------------------------------+
| ts |
+-----------------------------------------+
| Thursday, January 01 - 05:31 PM 1970 AD |
+-----------------------------------------+
SELECT 1 row in set (... sec)
For ``interval`` expressions, the format string accepts the same tokens as for
``timestamp`` expressions. The function then formats the result of adding the
specified interval to the reference timestamp ``0000/01/01 00:00:00``::
cr> select
... to_char(
... interval '1 year 3 weeks 200 minutes',
... 'YYYY MM DD HH12:MI:SS'
... ) as interval;
+---------------------+
| interval |
+---------------------+
| 0001 01 22 03:20:00 |
+---------------------+
SELECT 1 row in set (... sec)
.. _scalar-pg-age:
``pg_catalog.age([timestamp,] timestamp)``
------------------------------------------
Returns: the :ref:`interval ` between two timestamps. The second
argument is subtracted from the first one. If at least one argument is
``NULL``, the return value is ``NULL``. If only one timestamp is given, the
return value is the interval between ``current_date`` (at midnight) and the
given timestamp.
Example::
cr> select age('2021-10-21'::timestamp, '2021-10-20'::timestamp)
... as age;
+----------------+
| age |
+----------------+
| 1 day 00:00:00 |
+----------------+
SELECT 1 row in set (... sec)
cr> select pg_catalog.age(date_trunc('day', CURRENT_DATE)) as age;
+----------+
| age |
+----------+
| 00:00:00 |
+----------+
SELECT 1 row in set (... sec)
.. _scalar-geo:
Geo functions
=============
.. _scalar-distance:
``distance(geo_point1, geo_point2)``
------------------------------------
Returns: ``double precision``
The ``distance`` function can be used to calculate the distance between two
points on earth. It uses the `Haversine formula`_ which gives great-circle
distances between 2 points on a sphere based on their latitude and longitude.
The return value is the distance in meters.
Below is an example of the distance function where both points are specified
using WKT. See :ref:`data-types-geo` for more information on the implicit
type casting of geo points::
cr> select distance('POINT (10 20)', 'POINT (11 21)') AS col;
+-------------------+
| col |
+-------------------+
| 152354.3209044634 |
+-------------------+
SELECT 1 row in set (... sec)
This scalar function can always be used in both the ``WHERE`` and ``ORDER BY``
clauses, with the limitation that one of the arguments must be a literal and
the other argument must be a column reference.
.. NOTE::
The calculation used when the distance function appears in the result column
list has a different precision than the value stored inside the index, which
is utilized when the distance function is part of a ``WHERE`` clause.
For example, if ``select distance(...)`` returns 0.0, an equality check
with ``where distance(...) = 0`` might not yield anything at all due to the
precision difference.
.. _scalar-within:
``within(shape1, shape2)``
--------------------------
Returns: ``boolean``
The ``within`` function returns true if ``shape1`` is within ``shape2``. If
that is not the case false is returned.
``shape1`` can either be a ``geo_shape`` or a ``geo_point``. ``shape2`` must be
a ``geo_shape``.
Below is an example of the ``within`` function which makes use of the implicit
type casting from strings in WKT representation to geo point and geo shapes::
cr> select within(
... 'POINT (10 10)',
... 'POLYGON ((5 5, 10 5, 10 10, 5 10, 5 5))'
... ) AS is_within;
+-----------+
| is_within |
+-----------+
| TRUE |
+-----------+
SELECT 1 row in set (... sec)
This function can always be used within the ``WHERE`` clause.
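For example, filtering ``sys.summits`` with a bounding polygon around the
Mont Blanc massif (a sketch; the polygon and the expected output are
assumptions)::
cr> select mountain from sys.summits
... where within(coordinates, 'POLYGON ((6 45, 8 45, 8 47, 6 47, 6 45))')
... order by height desc limit 1;
+------------+
| mountain |
+------------+
| Mont Blanc |
+------------+
SELECT 1 row in set (... sec)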
.. _scalar-intersects:
``intersects(geo_shape, geo_shape)``
------------------------------------
Returns: ``boolean``
The ``intersects`` function returns true if both argument shapes share some
points or area, i.e. they *overlap*. This also includes two shapes where one lies
:ref:`within ` the other.
If ``false`` is returned, both shapes are considered *disjoint*.
Example::
cr> select
... intersects(
... {type='Polygon', coordinates=[
... [[13.4252, 52.7096],[13.9416, 52.0997],
... [12.7221, 52.1334],[13.4252, 52.7096]]]},
... 'LINESTRING(13.9636 52.6763, 13.2275 51.9578,
... 12.9199 52.5830, 11.9970 52.6830)'
... ) as intersects,
... intersects(
... {type='Polygon', coordinates=[
... [[13.4252, 52.7096],[13.9416, 52.0997],
... [12.7221, 52.1334],[13.4252, 52.7096]]]},
... 'LINESTRING (11.0742 49.4538, 11.5686 48.1367)'
... ) as disjoint;
+------------+----------+
| intersects | disjoint |
+------------+----------+
| TRUE | FALSE |
+------------+----------+
SELECT 1 row in set (... sec)
Due to a limitation on the :ref:`data-types-geo-shape` datatype this function
cannot be used in the :ref:`ORDER BY ` clause.
.. _scalar-latitude-longitude:
``latitude(geo_point)`` and ``longitude(geo_point)``
----------------------------------------------------
Returns: ``double precision``
The ``latitude`` and ``longitude`` functions return the coordinates of latitude
or longitude of a point, or ``NULL`` if not available. The input must be a
column of type ``geo_point``, a valid WKT string or a ``double precision``
array. See :ref:`data-types-geo` for more information on the implicit type
casting of geo points.
Example::
cr> select
... mountain,
... height,
... longitude(coordinates) as "lon",
... latitude(coordinates) as "lat"
... from sys.summits
... order by height desc limit 1;
+------------+--------+---------+---------+
| mountain | height | lon | lat |
+------------+--------+---------+---------+
| Mont Blanc | 4808 | 6.86444 | 45.8325 |
+------------+--------+---------+---------+
SELECT 1 row in set (... sec)
Below is an example of the latitude/longitude functions which make use of the
implicit type casting from strings to geo point::
cr> select
... latitude('POINT (10 20)') AS lat,
... longitude([10.0, 20.0]) AS long;
+------+------+
| lat | long |
+------+------+
| 20.0 | 10.0 |
+------+------+
SELECT 1 row in set (... sec)
.. _scalar-geohash:
``geohash(geo_point)``
----------------------
Returns: ``text``
Returns a `GeoHash `_ representation
based on full precision (12 characters) of the input point, or ``NULL`` if not
available. The input has to be a column of type ``geo_point``, a valid WKT
string or a ``double precision`` array. See :ref:`data-types-geo` for more
information on the implicit type casting of geo points.
Example::
cr> select
... mountain,
... height,
... geohash(coordinates) as "geohash"
... from sys.summits
... order by height desc limit 1;
+------------+--------+--------------+
| mountain | height | geohash |
+------------+--------+--------------+
| Mont Blanc | 4808 | u0huspw99j1r |
+------------+--------+--------------+
SELECT 1 row in set (... sec)
.. _scalar-area:
``area(geo_shape)``
----------------------
Returns: ``double precision``
The ``area`` function calculates the area of the input shape in
square-degrees. The calculation will use geospatial awareness (AKA `geodetic`_)
instead of `Euclidean geometry`_. The input has to be a column of type
:ref:`data-types-geo-shape`, a valid `WKT`_ string or `GeoJSON`_.
See :ref:`data-types-geo-shape` for more information.
Example::
cr> select
... round(area('POLYGON ((5 5, 10 5, 10 10, 5 10, 5 5))')) as "area";
+------+
| area |
+------+
| 25 |
+------+
SELECT 1 row in set (... sec)
.. _scalar-math:
Mathematical functions
======================
All mathematical functions can be used within ``WHERE`` and ``ORDER BY``
clauses.
.. _scalar-abs:
``abs(number)``
---------------
Returns the absolute value of the given number in the datatype of the given
number.
Example::
cr> select abs(214748.0998) AS a, abs(0) AS b, abs(-214748) AS c;
+-------------+---+--------+
| a | b | c |
+-------------+---+--------+
| 214748.0998 | 0 | 214748 |
+-------------+---+--------+
SELECT 1 row in set (... sec)
.. _scalar-sign:
``sign(number)``
----------------
Returns the sign of a number.
This function will return one of the following:
- If number > 0, it returns 1.0
- If number = 0, it returns 0.0
- If number < 0, it returns -1.0
- If number is NULL, it returns NULL
The data type of the return value is ``numeric`` if the argument is ``numeric``
and ``double precision`` for the rest of numeric types.
For example::
cr> select sign(12.34) as a, sign(0) as b, sign (-77) as c, sign(NULL) as d;
+-----+-----+------+------+
| a | b | c | d |
+-----+-----+------+------+
| 1.0 | 0.0 | -1.0 | NULL |
+-----+-----+------+------+
SELECT 1 row in set (... sec)
.. _scalar-ceil:
``ceil(number)``
----------------
Returns the smallest integral value that is not less than the argument.
Returns: ``numeric``, ``bigint`` or ``integer``
Return value will be of type ``numeric`` if the input value is of ``numeric``
type, with the same precision and scale as the input type. It will be of
``integer`` if the input value is an ``integer`` or ``float``. If the input
value is of type ``bigint`` or ``double precision`` the return value will be of
type ``bigint``.
Example::
cr> select ceil(29.9) AS col;
+-----+
| col |
+-----+
| 30 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-ceiling:
``ceiling(number)``
-------------------
This is an alias for :ref:`ceil `.
.. _scalar-degrees:
``degrees(double precision)``
-----------------------------
Converts the given ``radians`` value to ``degrees``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` if the
input is of ``double precision`` type.
::
cr> select degrees(0.5) AS degrees;
+-------------------+
| degrees |
+-------------------+
| 28.64788975654116 |
+-------------------+
SELECT 1 row in set (... sec)
.. _scalar-exp:
``exp(number)``
---------------
Returns Euler's number ``e`` raised to the power of the given numeric value.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
> select exp(1.0) AS exp;
+-------------------+
| exp |
+-------------------+
| 2.718281828459045 |
+-------------------+
SELECT 1 row in set (... sec)
.. test skipped because java.lang.Math.exp() can return with different
precision on different CPUs (e.g.: Apple M1)
.. _scalar-floor:
``floor(number)``
-----------------
Returns the largest integral value that is not greater than the argument.
Returns: ``numeric``, ``bigint`` or ``integer``
Return value will be of type ``numeric`` if the input value is of ``numeric``
type, with the same precision and scale as the input type. It will be of
``integer`` if the input value is an ``integer`` or ``float``. If the input
value is of type ``bigint`` or ``double precision`` the return value will be of
type ``bigint``.
Example::
cr> select floor(29.9) AS floor;
+-------+
| floor |
+-------+
| 29 |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-ln:
``ln(number)``
--------------
Returns the natural logarithm of the given ``number``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT ln(1) AS ln;
+-----+
| ln |
+-----+
| 0.0 |
+-----+
SELECT 1 row in set (... sec)
.. NOTE::
An error is returned for arguments which lead to undefined or illegal
results. E.g. ln(0) results in ``minus infinity``, and therefore, an error
is returned.
.. _scalar-log:
``log(x : number[, b : number])``
---------------------------------
Returns the logarithm of given ``x`` to base ``b``.
Returns: ``numeric`` or ``double precision``
When the second argument (``b``) is provided, it returns a value of type
``double precision``, even if ``x`` is of type ``numeric``, as ``x`` is
implicitly cast to ``double precision`` (thus possibly losing precision). When
it's not provided, the return value will be of type ``numeric`` with
unspecified precision and scale if the input value is of ``numeric`` type, and
``double precision`` for any other arithmetic type.
Examples::
cr> SELECT log(100, 10) AS log;
+-----+
| log |
+-----+
| 2.0 |
+-----+
SELECT 1 row in set (... sec)
The second argument (``b``) is optional. If not present, base 10 is used::
cr> SELECT log(100) AS log;
+-----+
| log |
+-----+
| 2.0 |
+-----+
SELECT 1 row in set (... sec)
.. NOTE::
An error is returned for arguments which lead to undefined or illegal
results. E.g. log(0) results in ``minus infinity``, and therefore, an error
is returned.
The same is true for arguments which lead to a ``division by zero``, as,
e.g., log(10, 1) does.
.. _scalar-modulus:
``modulus(y, x)``
-----------------
Returns the remainder of ``y/x``.
Returns: Same as argument types.
::
cr> select modulus(5, 4) AS mod;
+-----+
| mod |
+-----+
| 1 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-mod:
``mod(y, x)``
-----------------
This is an alias for :ref:`modulus `.
.. _scalar-power:
``power(a: number, b: number)``
-------------------------------
Returns the given argument ``a`` raised to the power of argument ``b``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if any of the input values is of ``numeric`` type, and ``double precision`` for
any other arithmetic type, even when both the inputs are integral types, in
order to be consistent across positive and negative exponents (which will yield
decimal types).
See below for an example::
cr> SELECT power(2,3) AS pow;
+-----+
| pow |
+-----+
| 8.0 |
+-----+
SELECT 1 row in set (... sec)
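The decimal return type matters for negative exponents, which yield
fractional results (a sketch, with the output assumed)::
cr> SELECT power(2, -2) AS pow;
+------+
| pow |
+------+
| 0.25 |
+------+
SELECT 1 row in set (... sec)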
.. _scalar-radians:
``radians(double precision)``
-----------------------------
Convert the given ``degrees`` value to ``radians``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` if the
input is of ``double precision`` type.
::
cr> select radians(45.0) AS radians;
+--------------------+
| radians |
+--------------------+
| 0.7853981633974483 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-random:
``random()``
------------
The ``random`` function returns a random value in the range 0.0 <= X < 1.0.
Returns: ``double precision``
.. NOTE::
Every call to ``random`` will yield a new random number.
.. _scalar-gen_random_text_uuid:
``gen_random_text_uuid()``
--------------------------
Returns a random time-based UUID as ``text``. The returned ID is similar to
flake IDs and well suited for use as a primary key value.
Note that the ID is opaque (i.e., not to be considered meaningful in any way)
and the implementation is free to change.
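A typical use is as the default value for a primary key column; the table
name below is hypothetical (a sketch)::
CREATE TABLE sensor_readings (
id TEXT DEFAULT gen_random_text_uuid() PRIMARY KEY,
payload OBJECT
);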
.. _scalar-round:
``round(number[, precision])``
------------------------------
Returns ``number`` rounded to the specified ``precision`` (decimal places).
When ``precision`` is not specified, the ``round`` function rounds the input
value to the closest integer for ``real`` and ``integer`` data types with ties
rounding up, and to the closest ``bigint`` value for ``double precision`` and
``bigint`` data types with ties rounding up. When the data type of the argument
is ``numeric``, then it returns the closest ``numeric`` value with the same
precision and scale as the input type, with all decimal digits zeroed out, and
with ties rounding up.
When it is specified, the result's type is ``numeric``. If ``number`` is of
``numeric`` datatype, then the ``numeric`` type of the result has the same
precision and scale with the input. If it's of any other arithmetic type, the
``numeric`` datatype of the result has unspecified precision and scale.
Notice that ``round(number)`` and ``round(number, 0)`` may return different
result types.
Examples::
cr> select round(42.2) AS round;
+-------+
| round |
+-------+
| 42 |
+-------+
SELECT 1 row in set (... sec)
cr> select round(42.21, 1) AS round;
+-------+
| round |
+-------+
| 42.2 |
+-------+
SELECT 1 row in set (... sec)
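To make the result-type difference between ``round(number)`` and
``round(number, 0)`` visible, ``pg_typeof`` can be used (a sketch, assuming a
``double precision`` literal; the output is assumed)::
cr> select pg_typeof(round(42.2)) AS a, pg_typeof(round(42.2, 0)) AS b;
+--------+---------+
| a | b |
+--------+---------+
| bigint | numeric |
+--------+---------+
SELECT 1 row in set (... sec)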
.. _scalar-trunc:
``trunc(number[, precision])``
------------------------------
Returns ``number`` truncated to the specified ``precision`` (decimal places).
When ``precision`` is not specified, the result's type is an ``integer``, or
``bigint``. When it is specified, the result's type is ``double precision``.
Notice that ``trunc(number)`` and ``trunc(number, 0)`` return different result
types.
See below for examples::
cr> select trunc(29.999999, 3) AS trunc;
+--------+
| trunc |
+--------+
| 29.999 |
+--------+
SELECT 1 row in set (... sec)
cr> select trunc(29.999999) AS trunc;
+-------+
| trunc |
+-------+
| 29 |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-sqrt:
``sqrt(number)``
----------------
Returns the square root of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> select sqrt(25.0) AS sqrt;
+------+
| sqrt |
+------+
| 5.0 |
+------+
SELECT 1 row in set (... sec)
.. _scalar-sin:
``sin(number)``
---------------
Returns the sine of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT sin(1) AS sin;
+--------------------+
| sin |
+--------------------+
| 0.8414709848078965 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-asin:
``asin(number)``
----------------
Returns the arcsine of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT asin(1) AS asin;
+--------------------+
| asin |
+--------------------+
| 1.5707963267948966 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-cos:
``cos(number)``
---------------
Returns the cosine of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT cos(1) AS cos;
+--------------------+
| cos |
+--------------------+
| 0.5403023058681398 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-acos:
``acos(number)``
----------------
Returns the arccosine of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT acos(-1) AS acos;
+-------------------+
| acos |
+-------------------+
| 3.141592653589793 |
+-------------------+
SELECT 1 row in set (... sec)
.. _scalar-tan:
``tan(number)``
---------------
Returns the tangent of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT tan(1) AS tan;
+--------------------+
| tan |
+--------------------+
| 1.5574077246549023 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-cot:
``cot(number)``
---------------
Returns the cotangent of the argument that represents the angle expressed in
radians. The range of the argument is all real numbers. The cotangent of zero
is undefined and returns ``Infinity``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> select cot(1) AS cot;
+--------------------+
| cot |
+--------------------+
| 0.6420926159343306 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-atan:
``atan(number)``
----------------
Returns the arctangent of the argument.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value is of ``numeric`` type, and ``double precision`` for any
other arithmetic type.
Example::
cr> SELECT atan(1) AS atan;
+--------------------+
| atan |
+--------------------+
| 0.7853981633974483 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-atan2:
``atan2(y: number, x: number)``
-------------------------------
Returns the arctangent of ``y/x``.
Returns: ``numeric`` or ``double precision``
Return value will be of type ``numeric`` with unspecified precision and scale
if the input value ``y`` or ``x`` is of ``numeric`` type, and
``double precision`` for any other arithmetic type.
Example::
cr> SELECT atan2(2, 1) AS atan2;
+--------------------+
| atan2 |
+--------------------+
| 1.1071487177940904 |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-pi:
``pi()``
--------
Returns the π constant.
Returns: ``double precision``
::
cr> SELECT pi() AS pi;
+-------------------+
| pi |
+-------------------+
| 3.141592653589793 |
+-------------------+
SELECT 1 row in set (... sec)
.. _scalar-regexp:
Regular expression functions
============================
The :ref:`regular expression ` functions in CrateDB
use `Java Regular Expressions`_.
See the API documentation for more details.
.. NOTE::
Be aware that, in contrast to the functions, the :ref:`regular expression
operator ` uses `Lucene Regular Expressions`_.
.. _scalar-regexp_replace:
``regexp_replace(source, pattern, replacement [, flags])``
----------------------------------------------------------
``regexp_replace`` can be used to replace every (or only the first) occurrence
of a subsequence matching ``pattern`` in the ``source`` string with the
``replacement`` string. If no subsequence in ``source`` matches the regular
expression ``pattern``, ``source`` is returned unchanged.
Returns: ``text``
``pattern`` is a Java regular expression. For details on the regexp syntax, see
`Java Regular Expressions`_.
The ``replacement`` string may contain expressions like ``$N`` where ``N`` is a
digit between 0 and 9. It references the nth matched group of ``pattern``
and the matching subsequence of that group will be inserted in the returned
string. The expression ``$0`` will insert the whole matching ``source``.
By default, only the first occurrence of a subsequence matching ``pattern``
will be replaced. To replace all occurrences, use the ``g`` flag.
.. _scalar-regexp_replace-flags:
Flags
.....
``regexp_replace`` supports a number of flags as optional parameters. These
flags are given as a string containing any of the characters listed below.
Order does not matter.
+-------+---------------------------------------------------------------------+
| Flag | Description |
+=======+=====================================================================+
| ``i`` | enable case insensitive matching |
+-------+---------------------------------------------------------------------+
| ``u`` | enable unicode case folding when used together with ``i`` |
+-------+---------------------------------------------------------------------+
| ``U`` | enable unicode support for character classes like ``\W`` |
+-------+---------------------------------------------------------------------+
| ``s`` | make ``.`` match line terminators, too |
+-------+---------------------------------------------------------------------+
| ``m`` | make ``^`` and ``$`` match on the beginning or end of a line |
| | too. |
+-------+---------------------------------------------------------------------+
| ``x`` | permit whitespace and line comments starting with ``#`` |
+-------+---------------------------------------------------------------------+
| ``d`` | only ``\n`` is considered a line-terminator when using ``^``, ``$`` |
| | and ``.`` |
+-------+---------------------------------------------------------------------+
| ``g`` | replace all occurrences of a subsequence matching ``pattern``, |
| | not only the first |
+-------+---------------------------------------------------------------------+
.. _scalar-regexp_replace-examples:
Examples
........
::
cr> select
... name,
... regexp_replace(
... name, '(\w+)\s(\w+)+', '$1 - $2'
... ) as replaced
... from locations
... order by name limit 5;
+---------------------+-----------------------+
| name | replaced |
+---------------------+-----------------------+
| | |
| Aldebaran | Aldebaran |
| Algol | Algol |
| Allosimanius Syneca | Allosimanius - Syneca |
| Alpha Centauri | Alpha - Centauri |
+---------------------+-----------------------+
SELECT 5 rows in set (... sec)
::
cr> select
... regexp_replace(
... 'alcatraz', '(foo)(bar)+', '$1baz'
... ) as replaced;
+----------+
| replaced |
+----------+
| alcatraz |
+----------+
SELECT 1 row in set (... sec)
::
cr> select
... name,
... regexp_replace(
... name, '([A-Z]\w+) .+', '$1', 'ig'
... ) as replaced
... from locations
... order by name limit 5;
+---------------------+--------------+
| name | replaced |
+---------------------+--------------+
| | |
| Aldebaran | Aldebaran |
| Algol | Algol |
| Allosimanius Syneca | Allosimanius |
| Alpha Centauri | Alpha |
+---------------------+--------------+
SELECT 5 rows in set (... sec)
.. _scalar-arrays:
Array functions
===============
.. _scalar-array_append:
``array_append(anyarray, value)``
----------------------------------------
The ``array_append`` function adds the value at the end of the array.
Returns: ``array``
::
cr> select
... array_append([1,2,3], 4) AS array_append;
+--------------+
| array_append |
+--------------+
| [1, 2, 3, 4] |
+--------------+
SELECT 1 row in set (... sec)
You can also use the concat :ref:`operator ` ``||`` to append
values to an array::
cr> select
... [1,2,3] || 4 AS array_append;
+--------------+
| array_append |
+--------------+
| [1, 2, 3, 4] |
+--------------+
SELECT 1 row in set (... sec)
.. NOTE::
The ``||`` operator differs from the ``array_append`` function regarding
the handling of ``NULL`` arguments. It will ignore a ``NULL`` value while
the ``array_append`` function will append a ``NULL`` value to the array.
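The difference shows when appending a ``NULL`` value (a sketch, with the
output assumed)::
cr> select array_append([1,2,3], NULL) AS func, [1,2,3] || NULL AS op;
+-----------------+-----------+
| func | op |
+-----------------+-----------+
| [1, 2, 3, null] | [1, 2, 3] |
+-----------------+-----------+
SELECT 1 row in set (... sec)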
.. _scalar-array_cat:
``array_cat(first_array, second_array)``
----------------------------------------
The ``array_cat`` function concatenates two arrays into one array.
Returns: ``array``
::
cr> select
... array_cat([1,2,3],[3,4,5,6]) AS array_cat;
+-----------------------+
| array_cat |
+-----------------------+
| [1, 2, 3, 3, 4, 5, 6] |
+-----------------------+
SELECT 1 row in set (... sec)
You can also use the concat :ref:`operator ` ``||`` with
arrays::
cr> select
... [1,2,3] || [4,5,6] || [7,8,9] AS arr;
+-----------------------------+
| arr |
+-----------------------------+
| [1, 2, 3, 4, 5, 6, 7, 8, 9] |
+-----------------------------+
SELECT 1 row in set (... sec)
.. _scalar-array_unique:
``array_unique(first_array, [ second_array])``
----------------------------------------------
The ``array_unique`` function merges two arrays into one array with unique
elements.
Returns: ``array``
::
cr> select
... array_unique(
... [1, 2, 3],
... [3, 4, 4]
... ) AS arr;
+--------------+
| arr |
+--------------+
| [1, 2, 3, 4] |
+--------------+
SELECT 1 row in set (... sec)
If the arrays have different types, all elements will be cast to a common
type based on the type precedence.
::
cr> select
... array_unique(
... [10, 20],
... [10.0, 20.3]
... ) AS arr;
+--------------------+
| arr |
+--------------------+
| [10.0, 20.0, 20.3] |
+--------------------+
SELECT 1 row in set (... sec)
.. _scalar-array_difference:
``array_difference(first_array, second_array)``
-----------------------------------------------
The ``array_difference`` function removes elements from the first array that
are contained in the second array.
Returns: ``array``
::
cr> select
... array_difference(
... [1,2,3,4,5,6,7,8,9,10],
... [2,3,6,9,15]
... ) AS arr;
+---------------------+
| arr |
+---------------------+
| [1, 4, 5, 7, 8, 10] |
+---------------------+
SELECT 1 row in set (... sec)
.. _scalar-array:
``array(subquery)``
-------------------
The ``array(subquery)`` :ref:`expression ` is an array
constructor function which operates on the result of the ``subquery``.
Returns: ``array``
.. SEEALSO::
:ref:`Array construction with subquery `
.. _scalar-array_upper:
``array_upper(anyarray, dimension)``
------------------------------------
The ``array_upper`` function returns the number of elements in the requested
array dimension (the upper bound of the dimension). CrateDB allows mixing
arrays with different sizes on the same dimension. Returns ``NULL`` if array
argument is ``NULL`` or if dimension <= 0 or if dimension is ``NULL``.
Returns: ``integer``
::
cr> select array_upper([[1, 4], [3]], 1) AS size;
+------+
| size |
+------+
| 2 |
+------+
SELECT 1 row in set (... sec)
An empty array has no dimension and returns ``NULL`` instead of ``0``.
::
cr> select array_upper(ARRAY[]::int[], 1) AS size;
+------+
| size |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-array_length:
``array_length(anyarray, dimension)``
-------------------------------------
An alias for :ref:`scalar-array_upper`.
::
cr> select array_length([[1, 4], [3]], 1) AS len;
+-----+
| len |
+-----+
| 2 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-array_lower:
``array_lower(anyarray, dimension)``
------------------------------------
The ``array_lower`` function returns the lower bound of the requested array
dimension (which is ``1`` if the dimension is valid and has at least one
element). Returns ``NULL`` if array argument is ``NULL`` or if dimension <= 0
or if dimension is ``NULL``.
Returns: ``integer``
::
cr> select array_lower([[1, 4], [3]], 1) AS size;
+------+
| size |
+------+
| 1 |
+------+
SELECT 1 row in set (... sec)
If there is at least one empty array or ``NULL`` on the requested dimension
return value is ``NULL``. Example:
::
cr> select array_lower([[1, 4], [3], []], 2) AS size;
+------+
| size |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-array_overlap:
``array_overlap(anyarray, anyarray)``
-------------------------------------
The ``array_overlap`` function returns ``true`` if the two arrays have at least
one element in common, otherwise it returns ``false``.
If one of the arguments is ``NULL``, the result is ``NULL``.
Returns: ``boolean``
::
cr> select array_overlap([1, 2], [3, 2]) AS overlap;
+---------+
| overlap |
+---------+
| TRUE |
+---------+
SELECT 1 row in set (... sec)
.. _scalar-array_set:
``array_set(array, index, value)``
----------------------------------
The ``array_set`` function returns the array with the element at ``index`` set
to ``value``.
Gaps are filled with ``null``.
Returns: ``array``
::
cr> select array_set(['_', 'b'], 1, 'a') AS arr;
+------------+
| arr |
+------------+
| ["a", "b"] |
+------------+
SELECT 1 row in set (... sec)
``array_set(source_array, indexes_array, values_array)``
--------------------------------------------------------
Second overload for ``array_set`` that updates many indices with many values at
once. Depending on the indexes provided, ``array_set`` updates or appends the
values and also fills any gaps with ``nulls``.
Returns: ``array``
::
cr> select array_set(['_', 'b'], [1, 4], ['a', 'd']) AS arr;
+-----------------------+
| arr |
+-----------------------+
| ["a", "b", null, "d"] |
+-----------------------+
SELECT 1 row in set (... sec)
.. NOTE::
Updating indexes less than or equal to 0 is not supported.
.. _scalar-array_slice:
``array_slice(anyarray, from, to)``
-----------------------------------
The ``array_slice`` function returns a slice of the given array using the given
lower and upper bound.
Returns: ``array``
.. SEEALSO::
:ref:`Accessing arrays`
::
cr> select array_slice(['a', 'b', 'c', 'd'], 2, 3) AS arr;
+------------+
| arr |
+------------+
| ["b", "c"] |
+------------+
SELECT 1 row in set (... sec)
.. NOTE::
The first index value is ``1``. The maximum array index is ``2147483647``.
Both the ``from`` and ``to`` index values are inclusive.
Using an index greater than the array size results in an empty array.
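For instance, with both bounds beyond the array size, the result is an empty
array (a sketch, with the output assumed)::
cr> select array_slice(['a', 'b', 'c', 'd'], 5, 9) AS arr;
+-----+
| arr |
+-----+
| [] |
+-----+
SELECT 1 row in set (... sec)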
.. _scalar-array_to_string:
``pg_catalog.array_to_string(anyarray, separator, [ null_string ])``
--------------------------------------------------------------------
The ``array_to_string`` function concatenates elements of the given array into
a single string using the ``separator``.
Returns: ``text``
::
cr> select
... array_to_string(
... ['Arthur', 'Ford', 'Trillian'], ','
... ) AS str;
+----------------------+
| str |
+----------------------+
| Arthur,Ford,Trillian |
+----------------------+
SELECT 1 row in set (... sec)
If the ``separator`` argument is ``NULL``, the result is ``NULL``::
cr> select
... array_to_string(
... ['Arthur', 'Ford', 'Trillian'], NULL
... ) AS str;
+------+
| str |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
If ``null_string`` is provided and is not ``NULL``, then ``NULL`` elements of
the array are replaced by that string, otherwise they are omitted::
cr> select
... array_to_string(
... ['Arthur', NULL, 'Trillian'], ',', 'Ford'
... ) AS str;
+----------------------+
| str |
+----------------------+
| Arthur,Ford,Trillian |
+----------------------+
SELECT 1 row in set (... sec)
::
cr> select
... array_to_string(
... ['Arthur', NULL, 'Trillian'], ','
... ) AS str;
+-----------------+
| str |
+-----------------+
| Arthur,Trillian |
+-----------------+
SELECT 1 row in set (... sec)
::
cr> select
... array_to_string(
... ['Arthur', NULL, 'Trillian'], ',', NULL
... ) AS str;
+-----------------+
| str |
+-----------------+
| Arthur,Trillian |
+-----------------+
SELECT 1 row in set (... sec)
.. _scalar-string_to_array:
``string_to_array(string, separator, [ null_string ])``
-------------------------------------------------------
The ``string_to_array`` function splits a string into an array of ``text``
elements, using a supplied separator and an optional null-string to set
matching substring elements to ``NULL``.
Returns: ``array(text)``
::
cr> select string_to_array('Arthur,Ford,Trillian', ',') AS arr;
+--------------------------------+
| arr |
+--------------------------------+
| ["Arthur", "Ford", "Trillian"] |
+--------------------------------+
SELECT 1 row in set (... sec)
::
cr> select string_to_array('Arthur,Ford,Trillian', ',', 'Ford') AS arr;
+------------------------------+
| arr |
+------------------------------+
| ["Arthur", null, "Trillian"] |
+------------------------------+
SELECT 1 row in set (... sec)
.. _scalar-string_to_array-separator:
``separator``
.............
If the ``separator`` argument is NULL, each character of the input string
becomes a separate element in the resulting array.
::
cr> select string_to_array('Ford', NULL) AS arr;
+----------------------+
| arr |
+----------------------+
| ["F", "o", "r", "d"] |
+----------------------+
SELECT 1 row in set (... sec)
If the separator is an empty string, then the entire input string is returned
as a one-element array.
::
cr> select string_to_array('Arthur,Ford', '') AS arr;
+-----------------+
| arr |
+-----------------+
| ["Arthur,Ford"] |
+-----------------+
SELECT 1 row in set (... sec)
.. _scalar-string_to_array-null_string:
``null_string``
...............
If the ``null_string`` argument is omitted or NULL, none of the substrings of
the input will be replaced by NULL.
.. _scalar-array_min:
``array_min(array)``
--------------------
The ``array_min`` function returns the smallest element in ``array``. If
``array`` is ``NULL`` or an empty array, the function returns ``NULL``. This
function supports arrays of any of the :ref:`primitive types
`.
::
cr> SELECT array_min([3, 2, 1]) AS min;
+-----+
| min |
+-----+
| 1 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-array_position:
``array_position(anycompatiblearray, anycompatible [, integer ] ) → integer``
-----------------------------------------------------------------------------
The ``array_position`` function returns the position of the first
occurrence of the second argument in the ``array``, or ``NULL`` if it's not
present. If the third argument is given, the search begins at that position.
The third argument is ignored if it's ``NULL``. If the position is not within
the ``array`` range, ``NULL`` is returned. It is also possible to search for
``NULL`` values.
::
cr> SELECT array_position([1,3,7,4], 7) as position;
+----------+
| position |
+----------+
| 3 |
+----------+
SELECT 1 row in set (... sec)
Begin the search from a given position (optional).
::
cr> SELECT array_position([1,3,7,4], 7, 2) as position;
+----------+
| position |
+----------+
| 3 |
+----------+
SELECT 1 row in set (... sec)
.. TIP::
When searching for the existence of an ``array`` element, using the
:ref:`ANY ` operator inside the ``WHERE``
clause is much more efficient, as it can utilize the index, whereas
``array_position`` cannot, even when used inside the ``WHERE`` clause.
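A hypothetical sketch of the index-friendly alternative (``tbl`` and
``arr_col`` are assumed names)::
SELECT * FROM tbl WHERE 7 = ANY(arr_col);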
.. _scalar-array_prepend:
``array_prepend(value, anyarray)``
----------------------------------
The ``array_prepend`` function prepends a value to the beginning of the array.
Returns: ``array``
::
cr> select
... array_prepend(1, [2,3,4]) AS array_prepend;
+---------------+
| array_prepend |
+---------------+
| [1, 2, 3, 4] |
+---------------+
SELECT 1 row in set (... sec)
You can also use the concat :ref:`operator ` ``||`` to prepend
values to an array::
cr> select
... 1 || [2,3,4] AS array_prepend;
+---------------+
| array_prepend |
+---------------+
| [1, 2, 3, 4] |
+---------------+
SELECT 1 row in set (... sec)
.. NOTE::
The ``||`` operator differs from the ``array_prepend`` function regarding the
handling of ``NULL`` arguments. It will ignore a ``NULL`` value while the
``array_prepend`` function will prepend a ``NULL`` value to the array.
.. _scalar-array_max:
``array_max(array)``
--------------------
The ``array_max`` function returns the largest element in ``array``. If
``array`` is ``NULL`` or an empty array, the function returns ``NULL``. This
function supports arrays of any of the :ref:`primitive types
`.
::
cr> SELECT array_max([1,2,3]) AS max;
+-----+
| max |
+-----+
| 3 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-array_sum:
``array_sum(array)``
--------------------
Returns the sum of array elements that are not ``NULL``. If ``array`` is
``NULL`` or an empty array, the function returns ``NULL``. This function
supports arrays of any :ref:`numeric types `.
For ``real`` and ``double precision`` arguments, the return type is equal to the
argument type. For ``char``, ``smallint``, ``integer``, and ``bigint``
arguments, the return type changes to ``bigint``.
If any ``bigint`` value exceeds range limits (-2^63 to 2^63-1), an
``ArithmeticException`` will be raised.
::
cr> SELECT array_sum([1,2,3]) AS sum;
+-----+
| sum |
+-----+
| 6 |
+-----+
SELECT 1 row in set (... sec)
The sum on the bigint array will result in an overflow in the following query:
::
cr> SELECT
... array_sum(
... [9223372036854775807, 9223372036854775807]
... ) as sum;
ArithmeticException[long overflow]
To address the overflow of the sum of the given array elements, we cast the
array to the numeric data type:
::
cr> SELECT
... array_sum(
... [9223372036854775807, 9223372036854775807]::numeric[]
... ) as sum;
+----------------------+
| sum |
+----------------------+
| 18446744073709551614 |
+----------------------+
SELECT 1 row in set (... sec)
.. _scalar-array_avg:
``array_avg(array)``
--------------------
Returns the average of all values in ``array`` that are not ``NULL``. If
``array`` is ``NULL`` or an empty array, the function returns ``NULL``. This
function supports arrays of any :ref:`numeric types `.
For ``real`` and ``double precision`` arguments, the return type is equal to the
argument type. For ``char``, ``smallint``, ``integer``, and ``bigint``
arguments, the return type is ``numeric``.
::
cr> SELECT array_avg([1,2,3]) AS avg;
+-----+
| avg |
+-----+
| 2 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-array_unnest:
``array_unnest(nested_array)``
------------------------------
Takes a nested array and returns a flattened array. Only flattens one level at a
time.
Returns ``NULL`` if the argument is ``NULL``. ``NULL`` array elements are
skipped and ``NULL`` leaf elements within arrays are preserved.
::
cr> SELECT array_unnest([[1, 2], [3, 4, 5]]) AS result;
+-----------------+
| result |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
SELECT 1 row in set (... sec)
cr> SELECT array_unnest([[1, null, 2], null, [3, 4, 5]]) AS result;
+-----------------------+
| result |
+-----------------------+
| [1, null, 2, 3, 4, 5] |
+-----------------------+
SELECT 1 row in set (... sec)
.. SEEALSO::
:ref:`UNNEST table function `
.. _scalar-null-or-empty-array:
``null_or_empty(array)``
-------------------------
The ``null_or_empty(array)`` function returns a Boolean indicating if an array
is ``NULL`` or empty (``[]``).
This can serve as a faster alternative to ``IS NULL`` if matching on empty
arrays is acceptable. It makes better use of indices.
::
cr> SELECT null_or_empty([]) w,
... null_or_empty([[]]) x,
... null_or_empty(NULL) y,
... null_or_empty([1]) z;
+------+-------+------+-------+
| w | x | y | z |
+------+-------+------+-------+
| TRUE | FALSE | TRUE | FALSE |
+------+-------+------+-------+
SELECT 1 row in set (... sec)
.. _scalar-objects:
Object functions
================
.. _scalar-object_keys:
``object_keys(object)``
-----------------------
The ``object_keys`` function returns the set of first level keys of an ``object``.
Returns: ``array(text)``
::
cr> SELECT
... object_keys({a = 1, b = {c = 2}}) AS object_keys;
+-------------+
| object_keys |
+-------------+
| ["a", "b"] |
+-------------+
SELECT 1 row in set (... sec)
.. _scalar-concat-object:
``concat(object, object)``
--------------------------
The ``concat(object, object)`` function combines two objects into a new object
containing the union of their first level properties, taking the second
object's values for duplicate properties. Additionally, the
:ref:`column policy ` of the second object is used
for the return type. If one of the objects is ``NULL``, the function returns
the non-``NULL`` object. If both objects are ``NULL``, the function returns
``NULL``.
Returns: ``object``
::
cr> SELECT
... concat({a = 1}, {a = 2, b = {c = 2}}) AS object_concat;
+-------------------------+
| object_concat |
+-------------------------+
| {"a": 2, "b": {"c": 2}} |
+-------------------------+
SELECT 1 row in set (... sec)
You can also use the concat :ref:`operator ` ``||`` with
objects::
cr> SELECT
... {a = 1} || {b = 2} || {c = 3} AS object_concat;
+--------------------------+
| object_concat |
+--------------------------+
| {"a": 1, "b": 2, "c": 3} |
+--------------------------+
SELECT 1 row in set (... sec)
.. NOTE::
``concat(object, object)`` does not operate recursively: only the
top-level object structure is merged::
cr> SELECT
... concat({a = {b = 4}}, {a = {c = 2}}) as object_concat;
+-----------------+
| object_concat |
+-----------------+
| {"a": {"c": 2}} |
+-----------------+
SELECT 1 row in set (... sec)
.. _scalar-null-or-empty-object:
``null_or_empty(object)``
-------------------------
The ``null_or_empty(object)`` function returns a Boolean indicating if an object
is ``NULL`` or empty (``{}``).
This can serve as a faster alternative to ``IS NULL`` if matching on empty
objects is acceptable. It makes better use of indices.
::
cr> SELECT null_or_empty({}) x, null_or_empty(NULL) y, null_or_empty({x=10}) z;
+------+------+-------+
| x | y | z |
+------+------+-------+
| TRUE | TRUE | FALSE |
+------+------+-------+
SELECT 1 row in set (... sec)
.. _scalar-conditional-fn-exp:
Conditional functions and expressions
=====================================
.. _scalar-case-when-then-end:
``CASE WHEN ... THEN ... END``
------------------------------
The ``case`` :ref:`expression ` is a generic conditional
expression similar to if/else statements in other programming languages and can
be used wherever an expression is valid.
::
CASE WHEN condition THEN result
[WHEN ...]
[ELSE result]
END
Each *condition* expression must result in a boolean value. If the condition's
result is true, the value of the *result* expression that follows the condition
will be the final result of the ``case`` expression and the subsequent ``when``
branches will not be processed. If the condition's result is not true, any
subsequent ``when`` clauses are examined in the same manner. If no ``when``
condition yields true, the value of the ``case`` expression is the result of
the ``else`` clause. If the ``else`` clause is omitted and no condition is
true, the result is null.
.. Hidden: create table case_example
cr> create table case_example (id bigint);
CREATE OK, 1 row affected (... sec)
cr> insert into case_example (id) values (0),(1),(2),(3);
INSERT OK, 4 rows affected (... sec)
cr> refresh table case_example
REFRESH OK, 1 row affected (... sec)
Example::
cr> select id,
... case when id = 0 then 'zero'
... when id % 2 = 0 then 'even'
... else 'odd'
... end as parity
... from case_example order by id;
+----+--------+
| id | parity |
+----+--------+
| 0 | zero |
| 1 | odd |
| 2 | even |
| 3 | odd |
+----+--------+
SELECT 4 rows in set (... sec)
As a variant, a ``case`` expression can be written using the *simple* form::
CASE expression
WHEN value THEN result
[WHEN ...]
[ELSE result]
END
Example::
cr> select id,
... case id when 0 then 'zero'
... when 1 then 'one'
... else 'other'
... end as description
... from case_example order by id;
+----+-------------+
| id | description |
+----+-------------+
| 0 | zero |
| 1 | one |
| 2 | other |
| 3 | other |
+----+-------------+
SELECT 4 rows in set (... sec)
.. NOTE::
All *result* expressions must be convertible to a single data type.
.. Hidden: drop table case_example
cr> drop table case_example;
DROP OK, 1 row affected (... sec)
.. _scalar-if:
``if(condition, result [, default])``
-------------------------------------
The ``if`` function is a conditional function comparable to the *if* statements
of most other programming languages. If the given *condition* :ref:`expression
` :ref:`evaluates ` to ``true``, the
*result* expression is evaluated and its value is returned. If the *condition*
evaluates to ``false``, the *result* expression is not evaluated and the
optional *default* expression is evaluated instead and its value is returned.
If the *default* argument is omitted, ``NULL`` will be returned
instead.
.. Hidden: create table if_example
cr> create table if_example (id bigint);
CREATE OK, 1 row affected (... sec)
cr> insert into if_example (id) values (0),(1),(2),(3);
INSERT OK, 4 rows affected (... sec)
cr> refresh table if_example
REFRESH OK, 1 row affected (... sec)
::
cr> select
... id,
... if(id = 0, 'zero', 'other') as description
... from if_example
... order by id;
+----+-------------+
| id | description |
+----+-------------+
| 0 | zero |
| 1 | other |
| 2 | other |
| 3 | other |
+----+-------------+
SELECT 4 rows in set (... sec)
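For example, omitting the *default* argument yields ``NULL`` when the
condition is false::
cr> select if(1 = 2, 'match') AS result;
+--------+
| result |
+--------+
| NULL   |
+--------+
SELECT 1 row in set (... sec)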
.. Hidden: drop table if_example
cr> drop table if_example;
DROP OK, 1 row affected (... sec)
.. _scalar-coalesce:
``coalesce('first_arg', second_arg [, ... ])``
----------------------------------------------
The ``coalesce`` function takes one or more arguments of the same type and
returns the first non-null value of these. The result will be NULL only if all
the arguments :ref:`evaluate ` to NULL.
Returns: same type as arguments
::
cr> select coalesce(clustered_by, 'nothing') AS clustered_by
... from information_schema.tables
... where table_name='nodes';
+--------------+
| clustered_by |
+--------------+
| nothing |
+--------------+
SELECT 1 row in set (... sec)
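``coalesce`` also works with plain literals; the first non-null argument is
returned::
cr> select coalesce(null, null, 'fallback') AS value;
+----------+
| value    |
+----------+
| fallback |
+----------+
SELECT 1 row in set (... sec)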
.. NOTE::
If the data types of the arguments are not of the same type, ``coalesce``
will try to cast them to a common type, and if it fails to do so, an error
is thrown.
.. _scalar-greatest:
``greatest('first_arg', second_arg[ , ... ])``
----------------------------------------------
The ``greatest`` function takes one or more arguments of the same type and will
return the largest value of these. NULL values in the arguments list are
ignored. The result will be NULL only if all the arguments :ref:`evaluate
` to NULL.
Returns: same type as arguments
::
cr> select greatest(1, 2) AS greatest;
+----------+
| greatest |
+----------+
| 2 |
+----------+
SELECT 1 row in set (... sec)
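``NULL`` arguments are ignored unless every argument is ``NULL``::
cr> select greatest(1, null, 3) AS greatest;
+----------+
| greatest |
+----------+
|        3 |
+----------+
SELECT 1 row in set (... sec)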
.. NOTE::
If the data types of the arguments are not of the same type, ``greatest``
will try to cast them to a common type, and if it fails to do so, an error
is thrown.
.. _scalar-least:
``least('first_arg', second_arg[ , ... ])``
-------------------------------------------
The ``least`` function takes one or more arguments of the same type and will
return the smallest value of these. NULL values in the arguments list are
ignored. The result will be NULL only if all the arguments :ref:`evaluate
` to NULL.
Returns: same type as arguments
::
cr> select least(1, 2) AS least;
+-------+
| least |
+-------+
| 1 |
+-------+
SELECT 1 row in set (... sec)
.. NOTE::
If the data types of the arguments are not of the same type, ``least`` will
try to cast them to a common type, and if it fails to do so, an error is
thrown.
.. _scalar-nullif:
``nullif('first_arg', second_arg)``
-----------------------------------
The ``nullif`` function compares two arguments of the same type and, if they
have the same value, returns NULL; otherwise returns the first argument.
Returns: same type as arguments
::
cr> select nullif(table_schema, 'sys') AS nullif
... from information_schema.tables
... where table_name='nodes';
+--------+
| nullif |
+--------+
| NULL |
+--------+
SELECT 1 row in set (... sec)
.. NOTE::
If the data types of the arguments are not of the same type, ``nullif`` will
try to cast them to a common type, and if it fails to do so, an error is
thrown.
.. _scalar-sysinfo:
System information functions
============================
.. _scalar-current_schema:
``CURRENT_SCHEMA``
------------------
The ``CURRENT_SCHEMA`` system information function returns the name of the
current schema of the session. If no current schema is set, this function will
return the default schema, which is ``doc``.
Returns: ``text``
The default schema can be set when using the `JDBC client
`_ and :ref:`HTTP clients
` such as `CrateDB PDO`_.
.. NOTE::
The ``CURRENT_SCHEMA`` function has a special SQL syntax, meaning that it
must be called without trailing parentheses (``()``). However, CrateDB also
supports the optional parentheses, in which case the function is registered
under the ``pg_catalog`` schema.
Synopsis::
CURRENT_SCHEMA [ ( ) ]
Example::
cr> SELECT CURRENT_SCHEMA;
+----------------+
| current_schema |
+----------------+
| doc |
+----------------+
SELECT 1 row in set (... sec)
cr> SELECT pg_catalog.current_schema();
+----------------+
| current_schema |
+----------------+
| doc |
+----------------+
SELECT 1 row in set (... sec)
.. _scalar-current_schemas:
``pg_catalog.CURRENT_SCHEMAS(boolean)``
---------------------------------------
The ``CURRENT_SCHEMAS()`` system information function returns the current
stored schemas inside the :ref:`search_path ` session
state, optionally including implicit schemas (e.g. ``pg_catalog``). If no
custom :ref:`search_path ` is set, this function will
return the default :ref:`search_path ` schemas.
Returns: ``array(text)``
Synopsis::
CURRENT_SCHEMAS ( boolean )
Example::
cr> SELECT CURRENT_SCHEMAS(true) AS schemas;
+-----------------------+
| schemas |
+-----------------------+
| ["pg_catalog", "doc"] |
+-----------------------+
SELECT 1 row in set (... sec)
.. _scalar-current_user:
``CURRENT_USER``
----------------
The ``CURRENT_USER`` system information function returns the name of the
currently connected user, or ``crate`` if the user management module is
disabled.
Returns: ``text``
Synopsis::
CURRENT_USER
Example::
cr> select current_user AS name;
+-------+
| name |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-current_role:
``CURRENT_ROLE``
----------------
Equivalent to `CURRENT_USER`_.
Returns: ``text``
Synopsis::
CURRENT_ROLE
Example::
cr> select current_role AS name;
+-------+
| name |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-user:
``USER``
--------
Equivalent to `CURRENT_USER`_.
Returns: ``text``
Synopsis::
USER
Example::
cr> select user AS name;
+-------+
| name |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-session_user:
``SESSION_USER``
----------------
The ``SESSION_USER`` system information function returns the name of the
currently connected user, or ``crate`` if the user management module is
disabled.
Returns: ``text``
Synopsis::
SESSION_USER
Example::
cr> select session_user AS name;
+-------+
| name |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. NOTE::
CrateDB doesn't currently support the switching of execution context. This
makes `SESSION_USER`_ functionally equivalent to `CURRENT_USER`_. We
provide it as it's part of the SQL standard.
Additionally, the `CURRENT_USER`_, `SESSION_USER`_ and `USER`_ functions
have a special SQL syntax, meaning that they must be called without
trailing parentheses (``()``).
.. _scalar-has-database-priv:
``pg_catalog.has_database_privilege([user,] database, privilege text)``
-----------------------------------------------------------------------
Returns ``boolean``, or ``NULL`` if at least one argument is ``NULL``.
The first argument is a ``TEXT`` user name or an ``INTEGER`` user OID. If the
user is not specified, the current user is used.
The second argument is a ``TEXT`` database name or an ``INTEGER`` database OID.
.. NOTE::
Only ``crate`` is valid as the database name and only ``0`` is valid as the
database OID.
The third argument is the privilege(s) to check. Multiple privileges can be
provided as a comma-separated list, in which case the result is ``true`` if
any of the listed privileges is held. Allowed privilege types are ``CONNECT``,
``CREATE`` and ``TEMP`` or ``TEMPORARY``. The privilege string is
case-insensitive, extra whitespace is allowed between privilege names, and
duplicate entries are allowed.
:CONNECT:
is ``true`` for all defined users in the database
:CREATE:
is ``true`` if the user has any ``DDL`` privilege on ``CLUSTER`` or on any
``SCHEMA``
:TEMP:
is ``false`` for all users
Example::
cr> select has_database_privilege('crate', ' Connect , CREATe ')
... as has_priv;
+----------+
| has_priv |
+----------+
| TRUE |
+----------+
SELECT 1 row in set (... sec)
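The user can also be passed explicitly as the first argument. A sketch,
assuming the default ``crate`` superuser::
cr> select has_database_privilege('crate', 'crate', 'CONNECT')
... as has_priv;
+----------+
| has_priv |
+----------+
| TRUE     |
+----------+
SELECT 1 row in set (... sec)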
.. _scalar-has-schema-priv:
``pg_catalog.has_schema_privilege([user,] schema, privilege text)``
-------------------------------------------------------------------
Returns ``boolean``, or ``NULL`` if at least one argument is ``NULL``.
The first argument is a ``TEXT`` user name or an ``INTEGER`` user OID. If the
user is not specified, the current user is used.
The second argument is a ``TEXT`` schema name or an ``INTEGER`` schema OID.
The third argument is the privilege(s) to check. Multiple privileges can be
provided as a comma-separated list, in which case the result is ``true`` if
any of the listed privileges is held. Allowed privilege types are ``CREATE``
and ``USAGE``, which correspond to CrateDB's ``DDL`` and ``DQL`` respectively.
The privilege string is case-insensitive, extra whitespace is allowed between
privilege names, and duplicate entries are allowed.
Example::
cr> select has_schema_privilege('pg_catalog', ' Create , UsaGe , CREATe ')
... as has_priv;
+----------+
| has_priv |
+----------+
| TRUE |
+----------+
SELECT 1 row in set (... sec)
.. NOTE::
For unknown schemas:
- Returns ``TRUE`` for superusers.
- For a user with ``DQL`` on cluster scope, returns ``TRUE`` if the
privilege type is ``USAGE``.
- For a user with ``DML`` on cluster scope, returns ``TRUE`` if the
privilege type is ``CREATE``.
- Returns ``FALSE`` otherwise.
.. _scalar-has-table-priv:
``pg_catalog.has_table_privilege([user,] table, privilege text)``
-----------------------------------------------------------------
Returns ``boolean``, or ``NULL`` if at least one argument is ``NULL``.
The first argument is a ``TEXT`` user name or an ``INTEGER`` user OID. If the
user is not specified, the current user is used.
The second argument is a ``TEXT`` table name or an ``INTEGER`` table OID.
The third argument is the privilege(s) to check. Multiple privileges can be
provided as a comma-separated list, in which case the result is ``true`` if
any of the listed privileges is held. Allowed privilege types are ``SELECT``,
which corresponds to CrateDB's ``DQL``, and ``INSERT``, ``UPDATE`` and
``DELETE``, which all correspond to CrateDB's ``DML``. The privilege string is
case-insensitive, extra whitespace is allowed between privilege names, and
duplicate entries are allowed.
Example::
cr> select has_table_privilege('sys.summits', ' Select ')
... as has_priv;
+----------+
| has_priv |
+----------+
| TRUE |
+----------+
SELECT 1 row in set (... sec)
.. NOTE::
For unknown tables:
- Returns ``TRUE`` for superusers.
- For a user with ``DQL`` on cluster scope, returns ``TRUE`` if the
privilege type is ``SELECT``.
- For a user with ``DML`` on cluster scope, returns ``TRUE`` if the
privilege type is ``INSERT``, ``UPDATE`` or ``DELETE``.
- For a user with ``DQL`` on the schema, returns ``TRUE`` if the privilege
type is ``SELECT``.
- For a user with ``DML`` on the schema, returns ``TRUE`` if the privilege
type is ``INSERT``, ``UPDATE`` or ``DELETE``.
- Returns ``FALSE`` otherwise.
.. _scalar-pg_backend_pid:
``pg_catalog.pg_backend_pid()``
-------------------------------
The ``pg_backend_pid()`` system information function is implemented for
enhanced compatibility with PostgreSQL. CrateDB will always return ``-1``, as
there isn't a single process attached to one query. This differs from
PostgreSQL, where this value represents the process ID of the server process
attached to the current session.
Returns: ``integer``
Synopsis::
pg_backend_pid()
Example::
cr> select pg_backend_pid() AS pid;
+-----+
| pid |
+-----+
| -1 |
+-----+
SELECT 1 row in set (... sec)
.. _scalar-pg_postmaster_start_time:
``pg_catalog.pg_postmaster_start_time()``
-----------------------------------------
Returns the server start time as ``timestamp with time zone``.
.. _scalar-current_catalog:
``CURRENT_CATALOG``
-------------------
The ``CURRENT_CATALOG`` function returns the name of the current catalog, which
in CrateDB will always be ``crate``::
cr> select CURRENT_CATALOG AS db;
+-------+
| db |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-current_database:
``pg_catalog.current_database()``
---------------------------------
The ``current_database`` function returns the name of the current database,
which in CrateDB will always be ``crate``::
cr> select current_database() AS db;
+-------+
| db |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-current_setting:
``pg_catalog.current_setting(text [,boolean])``
-----------------------------------------------
The ``current_setting`` function returns the current value of a :ref:`session
setting `.
Returns: ``text``
Synopsis::
current_setting(setting_name [, missing_ok])
If no setting exists for ``setting_name``, ``current_setting`` throws an
error, unless the ``missing_ok`` argument is provided and is ``true``.
Examples::
cr> select current_setting('search_path') AS search_path;
+-------------+
| search_path |
+-------------+
| doc |
+-------------+
SELECT 1 row in set (... sec)
::
cr> select current_setting('foo');
SQLParseException[Unrecognised Setting: foo]
::
cr> select current_setting('foo', true) AS foo;
+------+
| foo |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-pg_get_expr:
``pg_catalog.pg_get_expr()``
----------------------------
The function ``pg_get_expr`` is implemented to improve compatibility with
clients that use the PostgreSQL wire protocol. The function always returns
``null``.
Synopsis::
pg_get_expr(expr text, relation_oid int [, pretty boolean])
Example::
cr> select pg_get_expr('literal', 1) AS col;
+------+
| col |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-pg_get_partkeydef:
``pg_catalog.pg_get_partkeydef()``
----------------------------------
The function ``pg_get_partkeydef`` is implemented to improve compatibility with
clients that use the PostgreSQL wire protocol. Partitioning in CrateDB is
different from PostgreSQL, therefore this function always returns ``null``.
Synopsis::
pg_get_partkeydef(relation_oid int)
Example::
cr> select pg_get_partkeydef(1) AS col;
+------+
| col |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-pg_get_serial_sequence:
``pg_catalog.pg_get_serial_sequence()``
---------------------------------------
The function ``pg_get_serial_sequence`` is implemented to improve compatibility
with clients that use the PostgreSQL wire protocol. The function always returns
``null``. Existence of tables or columns is not validated.
Synopsis::
pg_get_serial_sequence(table_name text, column_name text)
Example::
cr> select pg_get_serial_sequence('t1', 'c1') AS col;
+------+
| col |
+------+
| NULL |
+------+
SELECT 1 row in set (... sec)
.. _scalar-pg_encoding_to_char:
``pg_catalog.pg_encoding_to_char()``
------------------------------------
The function ``pg_encoding_to_char`` converts a PostgreSQL encoding's internal
identifier to a human-readable name.
Returns: ``text``
Synopsis::
pg_encoding_to_char(encoding int)
Example::
cr> select pg_encoding_to_char(6) AS encoding;
+----------+
| encoding |
+----------+
| UTF8 |
+----------+
SELECT 1 row in set (... sec)
.. _scalar-pg_get_userbyid:
``pg_catalog.pg_get_userbyid()``
--------------------------------
The function ``pg_get_userbyid`` is implemented to improve compatibility with
clients that use the PostgreSQL wire protocol. The function always returns the
default CrateDB user for non-null arguments, otherwise, ``null`` is returned.
Returns: ``text``
Synopsis::
pg_get_userbyid(id integer)
Example::
cr> select pg_get_userbyid(-450373579) AS name;
+-------+
| name |
+-------+
| crate |
+-------+
SELECT 1 row in set (... sec)
.. _scalar-pg_typeof:
``pg_catalog.pg_typeof()``
--------------------------
The function ``pg_typeof`` returns the text representation of the data type of
the value passed to it.
Returns: ``text``
Synopsis::
pg_typeof(expression)
Example:
::
cr> select pg_typeof([1, 2, 3]) as typeof;
+---------------+
| typeof |
+---------------+
| integer_array |
+---------------+
SELECT 1 row in set (... sec)
.. _scalar-pg_function_is_visible:
``pg_function_is_visible()``
----------------------------
The function ``pg_function_is_visible`` returns ``true`` for OIDs that refer
to a system or a user-defined function.
Returns: ``boolean``
Synopsis::
pg_function_is_visible(OID)
Example:
::
cr> select pg_function_is_visible(-919555782) as pg_function_is_visible;
+------------------------+
| pg_function_is_visible |
+------------------------+
| TRUE |
+------------------------+
SELECT 1 row in set (... sec)
.. _scalar-pg_table_is_visible:
``pg_catalog.pg_table_is_visible()``
------------------------------------
The function ``pg_table_is_visible`` accepts an OID as an argument. It returns
``true`` if the current user holds at least one of the ``DQL``, ``DDL`` or
``DML`` privileges on the table or view referred to by the OID, and there are
no other tables or views with the same name and privileges but with different
schema names appearing earlier in the search path.
Returns: ``boolean``
Example:
::
cr> select pg_table_is_visible(912037690) as is_visible;
+------------+
| is_visible |
+------------+
| TRUE |
+------------+
SELECT 1 row in set (... sec)
.. _scalar-pg_get_function_result:
``pg_catalog.pg_get_function_result()``
---------------------------------------
The function ``pg_get_function_result`` returns the text representation of the
return type of the function referred to by the OID.
Returns: ``text``
Synopsis::
pg_get_function_result(OID)
Example:
::
cr> select pg_get_function_result(-919555782) as _pg_get_function_result;
+-------------------------+
| _pg_get_function_result |
+-------------------------+
| time with time zone |
+-------------------------+
SELECT 1 row in set (... sec)
.. _scalar-version:
``pg_catalog.version()``
------------------------
Returns the CrateDB version information.
Returns: ``text``
Synopsis::
version()
Example:
::
cr> select version() AS version;
+---------...-+
| version |
+---------...-+
| CrateDB ... |
+---------...-+
SELECT 1 row in set (... sec)
.. _scalar-col_description:
``pg_catalog.col_description(integer, integer)``
------------------------------------------------
This function exists mainly for compatibility with PostgreSQL. In PostgreSQL,
the function returns the comment for a table column. CrateDB doesn't support
user defined comments for table columns, so it always returns ``null``.
Returns: ``text``
Example:
::
cr> SELECT col_description(1, 1) AS comment;
+---------+
| comment |
+---------+
| NULL |
+---------+
SELECT 1 row in set (... sec)
.. _scalar-obj_description:
``pg_catalog.obj_description(integer, text)``
---------------------------------------------
This function exists mainly for compatibility with PostgreSQL. In PostgreSQL,
the function returns the comment for a database object. CrateDB doesn't support
user defined comments for database objects, so it always returns ``null``.
Returns: ``text``
Example:
::
cr> SELECT pg_catalog.obj_description(1, 'pg_type') AS comment;
+---------+
| comment |
+---------+
| NULL |
+---------+
SELECT 1 row in set (... sec)
.. _scalar-format_type:
``pg_catalog.format_type(integer, integer)``
--------------------------------------------
Returns the type name of a type. The first argument is the ``OID`` of the type.
The second argument is the type modifier. This function exists for PostgreSQL
compatibility and the type modifier is always ignored.
Returns: ``text``
Example:
::
cr> SELECT pg_catalog.format_type(25, null) AS name;
+------+
| name |
+------+
| text |
+------+
SELECT 1 row in set (... sec)
If the given ``OID`` is not known, ``???`` is returned::
cr> SELECT pg_catalog.format_type(3, null) AS name;
+------+
| name |
+------+
| ??? |
+------+
SELECT 1 row in set (... sec)
.. _scalar-special:
Special functions
=================
.. _scalar_knn_match:
``knn_match(float_vector, float_vector, int)``
----------------------------------------------
The ``knn_match`` function uses a k-nearest
neighbour (kNN) search algorithm to find vectors that are similar
to a query vector.
The first argument is the column to search.
The second argument is the query vector.
The third argument is the number of nearest neighbours to search in the index.
Searching a larger number of nearest neighbours is more expensive. There is one
index per shard, and on each shard the function will match at most `k` records.
To limit the total query result, add a :ref:`LIMIT clause ` to
the query.
Synopsis::
knn_match(search_vector, target, k)
The function must be used within a ``WHERE`` clause targeting a table, where
it acts as a predicate that searches the whole dataset of that table.
Using it *outside* of a ``WHERE`` clause, or in a ``WHERE`` clause targeting a
virtual table instead of a physical table, results in an error.
Similar to the :ref:`MATCH predicate `, this function affects
the :ref:`_score ` value.
An example::
cr> CREATE TABLE IF NOT EXISTS doc.vectors (
... xs float_vector(2)
... );
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO doc.vectors (xs)
... VALUES
... ([3.14, 8.17]),
... ([14.3, 19.4]);
INSERT OK, 2 rows affected (... sec)
.. HIDE:
cr> REFRESH TABLE doc.vectors;
REFRESH OK, 1 row affected (... sec)
::
cr> SELECT xs, _score FROM doc.vectors
... WHERE knn_match(xs, [3.14, 8], 2)
... ORDER BY _score DESC;
+--------------+--------------+
| xs | _score |
+--------------+--------------+
| [3.14, 8.17] | 0.9719117 |
| [14.3, 19.4] | 0.0039138086 |
+--------------+--------------+
SELECT 2 rows in set (... sec)
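To restrict the total number of matches, combine the predicate with a
``LIMIT`` clause, for example::
SELECT xs, _score FROM doc.vectors
WHERE knn_match(xs, [3.14, 8], 2)
ORDER BY _score DESC
LIMIT 1;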
.. _scalar-ignore3vl:
``ignore3vl(boolean)``
----------------------
The ``ignore3vl`` function operates on a boolean argument and eliminates the
`3-valued logic`_ on the whole tree of :ref:`operators `
beneath it. More specifically, ``FALSE`` is :ref:`evaluated `
to ``FALSE``, ``TRUE`` to ``TRUE`` and ``NULL`` to ``FALSE``.
Returns: ``boolean``
.. HIDE:
cr> CREATE TABLE IF NOT EXISTS doc.t(
... int_array_col array(integer)
... );
CREATE OK, 1 row affected (... sec)
cr> INSERT INTO doc.t(int_array_col)
... VALUES ([1,2,3, null]);
INSERT OK, 1 row affected (... sec)
cr> REFRESH table doc.t;
REFRESH OK, 1 row affected (... sec)
.. NOTE::
The main usage of the ``ignore3vl`` function is in the ``WHERE`` clause
when a ``NOT`` operator is involved. Such filtering, with `3-valued
logic`_, cannot be translated to an optimized query in the internal storage
engine, and therefore can degrade performance. E.g.::
SELECT * FROM t
WHERE NOT 5 = ANY(t.int_array_col);
If we can ignore the `3-valued logic`_, we can write the query as::
SELECT * FROM t
WHERE NOT IGNORE3VL(5 = ANY(t.int_array_col));
which will yield better performance (in execution time) than before.
.. CAUTION::
If there are ``NULL`` values in the ``int_array_col``, in the case that
``5 = ANY(t.int_array_col)`` evaluates to ``NULL``, without the
``ignore3vl``, it would be evaluated as ``NOT NULL`` => ``NULL``, resulting
in zero matched rows. With ``IGNORE3VL`` in place it is evaluated
as ``NOT FALSE`` => ``TRUE``, resulting in all rows matching the
filter. E.g.::
cr> SELECT * FROM t
... WHERE NOT 5 = ANY(t.int_array_col);
+---------------+
| int_array_col |
+---------------+
+---------------+
SELECT 0 rows in set (... sec)
::
cr> SELECT * FROM t
... WHERE NOT IGNORE3VL(5 = ANY(t.int_array_col));
+-----------------+
| int_array_col |
+-----------------+
| [1, 2, 3, null] |
+-----------------+
SELECT 1 row in set (... sec)
.. HIDE:
cr> DROP TABLE IF EXISTS doc.t;
DROP OK, 1 row affected (... sec)
Synopsis::
ignore3vl(boolean)
Example::
cr> SELECT
... ignore3vl(true) as v1,
... ignore3vl(false) as v2,
... ignore3vl(null) as v3;
+------+-------+-------+
| v1 | v2 | v3 |
+------+-------+-------+
| TRUE | FALSE | FALSE |
+------+-------+-------+
SELECT 1 row in set (... sec)
.. _scalar-vector:
Vector functions
================
.. _scalar_vector_similarity:
``vector_similarity(float_vector, float_vector)``
--------------------------------------------------------
Returns the similarity of two :ref:`FLOAT_VECTORS `
as a :ref:`FLOAT ` typed value.
The similarity is based on the Euclidean distance and lies in the range
``(0, 1]``. If the two vectors coincide, the function returns the maximum
possible similarity of ``1``. The greater the distance between the vectors,
the closer the similarity gets to ``0``.
If at least one argument is ``NULL``, the function returns ``NULL``.
An example::
cr> SELECT vector_similarity([1.2, 1.3], [10.2, 10.3]) AS vs;
+-------------+
| vs |
+-------------+
| 0.006134969 |
+-------------+
SELECT 1 row in set (... sec)
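For two coinciding vectors, the maximum similarity is returned; a sketch (the
exact formatting of the ``FLOAT`` value may vary)::
cr> SELECT vector_similarity([1.2, 1.3], [1.2, 1.3]) AS vs;
+-----+
| vs  |
+-----+
| 1.0 |
+-----+
SELECT 1 row in set (... sec)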
.. _3-valued logic: https://en.wikipedia.org/wiki/Null_(SQL)#Comparisons_with_NULL_and_the_three-valued_logic_(3VL)
.. _available time zones: https://www.joda.org/joda-time/timezones.html
.. _CrateDB PDO: https://cratedb.com/docs/pdo/en/latest/connect.html
.. _Euclidean geometry: https://en.wikipedia.org/wiki/Euclidean_geometry
.. _formatter: https://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html
.. _geodetic: https://en.wikipedia.org/wiki/Geodesy
.. _GeoJSON: https://geojson.org/
.. _Haversine formula: https://en.wikipedia.org/wiki/Haversine_formula
.. _Java DateTimeFormatter: https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html
.. _Java DecimalFormat: https://docs.oracle.com/javase/8/docs/api/java/text/DecimalFormat.html
.. _Java Regular Expressions: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
.. _Joda-Time: https://www.joda.org/joda-time/
.. _Lucene Regular Expressions: https://lucene.apache.org/core/4_9_0/core/org/apache/lucene/util/automaton/RegExp.html
.. _MySQL date_format: https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_date-format
.. _WKT: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
.. highlight:: psql
.. _aggregation:
===========
Aggregation
===========
When :ref:`selecting data ` from CrateDB, you can use an
`aggregate function`_ to calculate a single summary value for one or more
columns.
For example::
cr> SELECT count(*) FROM locations;
+-------+
| count |
+-------+
| 13 |
+-------+
SELECT 1 row in set (... sec)
Here, the :ref:`count(*) ` function computes the result
across all rows.
Aggregate :ref:`functions ` can be used with the
:ref:`sql_dql_group_by` clause. When used like this, an aggregate function
returns a single summary value for each grouped collection of column values.
For example::
cr> SELECT kind, count(*) FROM locations GROUP BY kind;
+-------------+-------+
| kind | count |
+-------------+-------+
| Galaxy | 4 |
| Star System | 4 |
| Planet | 5 |
+-------------+-------+
SELECT 3 rows in set (... sec)
.. TIP::
Aggregation works across all the rows that match a query or on all matching
rows in every distinct group of a ``GROUP BY`` statement. Aggregating
``SELECT`` statements without ``GROUP BY`` will always return one row.
.. _aggregation-expressions:
Aggregate expressions
=====================
An *aggregate expression* represents the application of an :ref:`aggregate
function ` across rows selected by a query. Besides the
function signature, :ref:`expressions ` might contain
supplementary clauses and keywords.
The synopsis of an aggregate expression is one of the following::
aggregate_function ( * ) [ FILTER ( WHERE condition ) ]
aggregate_function ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE condition ) ]
Here, ``aggregate_function`` is a name of an aggregate function and
``expression`` is a column reference, :ref:`scalar function `
or literal.
If ``FILTER`` is specified, then only the rows that meet the
:ref:`sql_dql_where_clause` condition are supplied to the aggregate function.
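For example, a ``FILTER`` clause can restrict ``count(*)`` to a subset of the
rows. A sketch against the ``locations`` table used throughout this section::
cr> select count(*) filter (where kind = 'Planet') AS planets
... from locations;
+---------+
| planets |
+---------+
|       5 |
+---------+
SELECT 1 row in set (... sec)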
The optional ``DISTINCT`` keyword is only supported by aggregate functions
that explicitly mention its support. Please refer to existing
:ref:`limitations ` for further information.
The aggregate expression form that uses a ``wildcard`` instead of an
``expression`` as a function argument is supported only by the ``count(*)``
aggregate function.
.. _aggregation-functions:
Aggregate functions
===================
.. _aggregation-arbitrary:
``arbitrary(column)``
---------------------
The ``arbitrary`` aggregate function returns a single value of a column.
Which value it returns is not defined.
Its return type is the type of its parameter column and can be ``NULL`` if the
column contains ``NULL`` values.
Example::
cr> select arbitrary(position) from locations;
+-----------+
| arbitrary |
+-----------+
| ... |
+-----------+
SELECT 1 row in set (... sec)
::
cr> select arbitrary(name), kind from locations
... where name != ''
... group by kind order by kind desc;
+-...-------+-------------+
| arbitrary | kind |
+-...-------+-------------+
| ... | Star System |
| ... | Planet |
| ... | Galaxy |
+-...-------+-------------+
SELECT 3 rows in set (... sec)
An example use case is grouping a table with many rows per user by ``user_id``
and selecting the ``username`` for every group, that is, for every user. This
works because rows with the same ``user_id`` have the same ``username``. This
method performs better than grouping on ``username``, as grouping on numeric
types is generally faster than grouping on strings. A further advantage is
that the ``arbitrary`` function requires little to no computation, in contrast
to an aggregate function such as ``max``.
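A minimal sketch of this pattern, assuming a hypothetical ``user_events``
table with ``user_id`` and ``username`` columns::
SELECT user_id,
       arbitrary(username) AS username,
       count(*) AS events
FROM user_events
GROUP BY user_id;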
.. _aggregation-any-value:
``any_value(column)``
---------------------
``any_value`` is an alias for :ref:`arbitrary `.
Example::
cr> select any_value(x) from unnest([1, 1]) t (x);
+-----------+
| any_value |
+-----------+
| 1 |
+-----------+
SELECT 1 row in set (... sec)
.. _aggregation-array-agg:
``array_agg(column)``
---------------------
The ``array_agg`` aggregate function concatenates all input values into an
array.
::
cr> SELECT array_agg(x) FROM (VALUES (42), (832), (null), (17)) as t (x);
+---------------------+
| array_agg |
+---------------------+
| [42, 832, null, 17] |
+---------------------+
SELECT 1 row in set (... sec)
.. SEEALSO::
:ref:`aggregation-string-agg`
.. _aggregation-avg:
``avg(column)``
---------------
The ``avg`` and ``mean`` aggregate functions return the arithmetic mean, the
*average*, of all values in a column that are not ``NULL``. They accept all
numeric, timestamp and interval types as a single argument. For a ``numeric``
argument the return type is ``numeric``, for an ``interval`` argument the
return type is ``interval``, and for all other argument types the return type
is ``double``.
Example::
cr> select avg(position), kind from locations
... group by kind order by kind;
+------+-------------+
| avg | kind |
+------+-------------+
| 3.25 | Galaxy |
| 3.0 | Planet |
| 2.5 | Star System |
+------+-------------+
SELECT 3 rows in set (... sec)
The ``avg`` aggregation on a ``bigint`` column might result in a precision
error if the sum of the elements exceeds ``2^53``::
cr> select avg(t.val) from
... (select unnest([9223372036854775807, 9223372036854775807]) as val) t;
+-----------------------+
| avg |
+-----------------------+
| 9.223372036854776e+18 |
+-----------------------+
SELECT 1 row in set (... sec)
To address the precision error of the avg aggregation, we cast the aggregation
column to the ``numeric`` data type::
cr> select avg(t.val :: numeric) from
... (select unnest([9223372036854775807, 9223372036854775807]) as val) t;
+---------------------+
| avg |
+---------------------+
| 9223372036854775807 |
+---------------------+
SELECT 1 row in set (... sec)
.. _aggregation-avg-distinct:
``avg(DISTINCT column)``
~~~~~~~~~~~~~~~~~~~~~~~~
The ``avg`` aggregate function also supports the ``distinct`` keyword. This
keyword changes the behaviour of the function so that it will only average the
number of distinct values in this column that are not ``NULL``::
cr> select
... avg(distinct position) AS avg_pos,
... count(*),
... date
... from locations group by date
... order by 1 desc, count(*) desc;
+---------+-------+---------------+
| avg_pos | count | date |
+---------+-------+---------------+
| 4.0 | 1 | 1367366400000 |
| 3.6 | 8 | 1373932800000 |
| 2.0 | 4 | 308534400000 |
+---------+-------+---------------+
SELECT 3 rows in set (... sec)
::
cr> select avg(distinct position) AS avg_pos from locations;
+---------+
| avg_pos |
+---------+
| 3.5 |
+---------+
SELECT 1 row in set (... sec)
.. _aggregation-count:
``count(column)``
-----------------
In contrast to the :ref:`aggregation-count-star` function the ``count``
function used with a column name as parameter will return the number of rows
with a non-``NULL`` value in that column.
Example::
cr> select count(name), count(*), date from locations group by date
... order by count(name) desc, count(*) desc;
+-------+-------+---------------+
| count | count | date |
+-------+-------+---------------+
| 7 | 8 | 1373932800000 |
| 4 | 4 | 308534400000 |
| 1 | 1 | 1367366400000 |
+-------+-------+---------------+
SELECT 3 rows in set (... sec)
.. _aggregation-count-distinct:
``count(DISTINCT column)``
~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``count`` aggregate function also supports the ``distinct`` keyword. This
keyword changes the behaviour of the function so that it will only count the
number of distinct values in this column that are not ``NULL``::
cr> select
... count(distinct kind) AS num_kind,
... count(*),
... date
... from locations group by date
... order by num_kind, count(*) desc;
+----------+-------+---------------+
| num_kind | count | date |
+----------+-------+---------------+
| 1 | 1 | 1367366400000 |
| 3 | 8 | 1373932800000 |
| 3 | 4 | 308534400000 |
+----------+-------+---------------+
SELECT 3 rows in set (... sec)
::
cr> select count(distinct kind) AS num_kind from locations;
+----------+
| num_kind |
+----------+
| 3 |
+----------+
SELECT 1 row in set (... sec)
.. SEEALSO::
:ref:`aggregation-hyperloglog-distinct` for an alternative that trades some
accuracy for improved performance.
.. _aggregation-count-star:
``count(*)``
~~~~~~~~~~~~
This aggregate function simply returns the number of rows that match the query.
``count(columnName)`` is also possible, but currently only works on a primary
key column. The semantics are the same.
The return value is always of type ``bigint``.
::
cr> select count(*) from locations;
+-------+
| count |
+-------+
| 13 |
+-------+
SELECT 1 row in set (... sec)
``count(*)`` can also be used on group by queries::
cr> select count(*), kind from locations group by kind order by kind asc;
+-------+-------------+
| count | kind |
+-------+-------------+
| 4 | Galaxy |
| 5 | Planet |
| 4 | Star System |
+-------+-------------+
SELECT 3 rows in set (... sec)
.. _aggregation-geometric-mean:
``geometric_mean(column)``
--------------------------
The ``geometric_mean`` aggregate function computes the geometric mean, a mean
for positive numbers. For details see: `Geometric Mean`_.
``geometric_mean`` is defined on all numeric types and on timestamp.
:ref:`NUMERIC values ` are automatically cast to
:ref:`DOUBLE PRECISION `. It always returns values of type
``double precision``. If any value is negative, if all values are ``NULL``, or
if there are no values at all, ``NULL`` is returned. If any of the aggregated
values is ``0``, the result will be ``0.0`` as well.
.. CAUTION::
Due to Java double precision arithmetic it is possible that any two
executions of the aggregate function on the same data produce slightly
differing results.
Example::
cr> select geometric_mean(position), kind from locations
... group by kind order by kind;
+--------------------+-------------+
| geometric_mean | kind |
+--------------------+-------------+
| 2.6321480259049848 | Galaxy |
| 2.6051710846973517 | Planet |
| 2.213363839400643 | Star System |
+--------------------+-------------+
SELECT 3 rows in set (... sec)
.. _aggregation-hyperloglog-distinct:
``hyperloglog_distinct(column, [precision])``
---------------------------------------------
The ``hyperloglog_distinct`` aggregate function calculates an approximate count
of distinct non-null values using the `HyperLogLog++`_ algorithm.
The return value data type is always a ``bigint``.
The first argument can be a reference to a column of all
:ref:`data-types-primitive`. :ref:`data-types-container` and
:ref:`data-types-geo` are not supported.
The optional second argument defines the ``precision`` used for the
`HyperLogLog++`_ algorithm. This allows trading memory for accuracy; valid
values are ``4`` to ``18``. A precision of ``4`` uses approximately ``16``
bytes of memory. Each increase in precision doubles the memory requirement, so
precision ``5`` uses approximately ``32`` bytes, up to ``262144`` bytes for
precision ``18``.
If the second argument is omitted, the default ``precision`` of ``14`` is
used.
Examples::
cr> select hyperloglog_distinct(position) from locations;
+----------------------+
| hyperloglog_distinct |
+----------------------+
| 6 |
+----------------------+
SELECT 1 row in set (... sec)
::
cr> select hyperloglog_distinct(position, 4) from locations;
+----------------------+
| hyperloglog_distinct |
+----------------------+
| 6 |
+----------------------+
SELECT 1 row in set (... sec)
.. _aggregation-mean:
``mean(column)``
----------------
An alias for :ref:`aggregation-avg`.
.. _aggregation-min:
``min(column)``
---------------
The ``min`` aggregate function returns the smallest value in a column that is
not ``NULL``. Its single argument is a column name and its return value is
always of the type of that column.
Example::
cr> select min(position), kind
... from locations
... where name not like 'North %'
... group by kind order by min(position) asc, kind asc;
+-----+-------------+
| min | kind |
+-----+-------------+
| 1 | Planet |
| 1 | Star System |
| 2 | Galaxy |
+-----+-------------+
SELECT 3 rows in set (... sec)
::
cr> select min(date) from locations;
+--------------+
| min |
+--------------+
| 308534400000 |
+--------------+
SELECT 1 row in set (... sec)
``min`` returns ``NULL`` if the column contains nothing but ``NULL`` values.
It is allowed on columns with primitive data types. On ``text`` columns it
returns the lexicographically smallest value.
::
cr> select min(name), kind from locations
... group by kind order by kind asc;
+------------------------------------+-------------+
| min | kind |
+------------------------------------+-------------+
| Galactic Sector QQ7 Active J Gamma | Galaxy |
| | Planet |
| Aldebaran | Star System |
+------------------------------------+-------------+
SELECT 3 rows in set (... sec)
.. _aggregation-max:
``max(column)``
---------------
It behaves exactly like ``min``, but returns the largest value in a column
that is not ``NULL``.
Some Examples::
cr> select max(position), kind from locations
... group by kind order by kind desc;
+-----+-------------+
| max | kind |
+-----+-------------+
| 4 | Star System |
| 5 | Planet |
| 6 | Galaxy |
+-----+-------------+
SELECT 3 rows in set (... sec)
::
cr> select max(position) from locations;
+-----+
| max |
+-----+
| 6 |
+-----+
SELECT 1 row in set (... sec)
::
cr> select max(name), kind from locations
... group by kind order by max(name) desc;
+-------------------+-------------+
| max | kind |
+-------------------+-------------+
| Outer Eastern Rim | Galaxy |
| Bartledan | Planet |
| Altair | Star System |
+-------------------+-------------+
SELECT 3 rows in set (... sec)
.. _aggregation-max_by:
``max_by(returnField, searchField)``
------------------------------------
Returns the value of ``returnField`` where ``searchField`` has the highest
value.
If there are ties for ``searchField`` the result is non-deterministic and can be
any of the ``returnField`` values of the ties.
``NULL`` values in the ``searchField`` don't count as max but are skipped.
An Example::
cr> SELECT max_by(mountain, height) FROM sys.summits;
+------------+
| max_by |
+------------+
| Mont Blanc |
+------------+
SELECT 1 row in set (... sec)
.. _aggregation-min_by:
``min_by(returnField, searchField)``
------------------------------------
Returns the value of ``returnField`` where ``searchField`` has the lowest
value.
If there are ties for ``searchField`` the result is non-deterministic and can be
any of the ``returnField`` values of the ties.
``NULL`` values in the ``searchField`` don't count as min but are skipped.
An Example::
cr> SELECT min_by(mountain, height) FROM sys.summits;
+-------------+
| min_by |
+-------------+
| Puy de Rent |
+-------------+
SELECT 1 row in set (... sec)
.. _aggregation-stddev:
``stddev(column)``
------------------
``stddev`` is an alias for :ref:`aggregation-stddev-samp`.
.. _aggregation-stddev-pop:
``stddev_pop(column)``
----------------------
The ``stddev_pop`` aggregate function computes the `Population Standard Deviation`_
of the set of non-null values in a column. It is a measure of the variation
of data values. A low standard deviation indicates that the values tend to be
near the mean.
``stddev_pop`` is defined on all :ref:`numeric types` and on
timestamp. The return value will be of type ``numeric`` with unspecified
precision and scale if the input value is of ``numeric`` type, and ``double
precision`` for any other type. If all values are ``NULL``, or if there are no
values at all, ``NULL`` is returned.
Example::
cr> select stddev_pop(position), kind from locations
... group by kind order by kind;
+--------------------+-------------+
| stddev_pop | kind |
+--------------------+-------------+
| 1.920286436967152 | Galaxy |
| 1.4142135623730951 | Planet |
| 1.118033988749895 | Star System |
+--------------------+-------------+
SELECT 3 rows in set (... sec)
.. CAUTION::
Due to Java double precision arithmetic it is possible that any two
executions of the aggregate function on the same data produce slightly
differing results.
.. _aggregation-stddev-samp:
``stddev_samp(column)``
-----------------------
The ``stddev_samp`` aggregate function computes the `Sample Standard Deviation`_
of the set of non-null values in a column. It is a measure of the variation
of data values. A low standard deviation indicates that the values tend to be
near the mean.
``stddev_samp`` is defined on all :ref:`numeric types` and on
timestamp. The return value will be of type ``numeric`` with unspecified
precision and scale if the input value is of ``numeric`` type, and ``double
precision`` for any other type. If all values are ``NULL``, or if there are no
values at all, ``NULL`` is returned.
Example::
cr> select stddev_samp(position), kind from locations
... group by kind order by kind;
+--------------------+-------------+
| stddev_samp | kind |
+--------------------+-------------+
| 2.217355782608345 | Galaxy |
| 1.5811388300841898 | Planet |
| 1.2909944487358056 | Star System |
+--------------------+-------------+
SELECT 3 rows in set (... sec)
.. CAUTION::
Due to Java double precision arithmetic it is possible that any two
executions of the aggregate function on the same data produce slightly
differing results.
.. _aggregation-string-agg:
``string_agg(column, delimiter)``
---------------------------------
The ``string_agg`` aggregate function concatenates the input values into a
string, where each value is separated by a delimiter.
If all input values are null, null is returned as a result.
::
cr> select string_agg(col1, ', ') from (values('a'), ('b'), ('c')) as t;
+------------+
| string_agg |
+------------+
| a, b, c |
+------------+
SELECT 1 row in set (... sec)
.. SEEALSO::
:ref:`aggregation-array-agg`
.. _aggregation-percentile:
``percentile(column, {fraction | fractions} [, compression])``
--------------------------------------------------------------
The ``percentile`` aggregate function computes a `Percentile`_ over numeric
non-null values in a column. Values of type :ref:`NUMERIC ` are
not supported.
Percentiles show the point at which a certain percentage of observed values
occur. For example, the 98th percentile is the value which is greater than 98%
of the observed values. The result is defined and computed as an interpolated
weighted average; accordingly, the median of the input data is conveniently
defined as the 50th percentile.
The :ref:`function ` expects a single fraction or an array of
fractions and a column name. Independent of the input column's data type,
``percentile`` always returns a ``double precision`` value. If the value at
the specified column is ``null``, the row is ignored. Fractions must be double
precision values between 0 and 1. When supplied a single fraction, the
function returns a single value corresponding to the percentile of the
specified fraction::
cr> select percentile(position, 0.95), kind from locations
... group by kind order by kind;
+------------+-------------+
| percentile | kind |
+------------+-------------+
| 6.0 | Galaxy |
| 5.0 | Planet |
| 4.0 | Star System |
+------------+-------------+
SELECT 3 rows in set (... sec)
When supplied an array of fractions, the function will return an array of
values corresponding to the percentile of each fraction specified::
cr> select percentile(position, [0.0013, 0.9987]) as perc from locations;
+------------+
| perc |
+------------+
| [1.0, 6.0] |
+------------+
SELECT 1 row in set (... sec)
If a query with the ``percentile`` function does not match any rows, a null
result is returned.
To calculate percentiles over huge amounts of data and to scale out, CrateDB
calculates approximate rather than exact percentiles. The algorithm used by
the percentile metric is called `TDigest`_. The accuracy/size trade-off of the
algorithm is defined by a single ``compression`` parameter which has a default
value of ``200.0``, but can be set by passing an optional third ``double``
argument as the ``compression`` (see the example after the guidelines below).
However, there are a few guidelines to keep in mind in this implementation:
- Extreme percentiles (e.g. 99%) are more accurate.
- For small sets, percentiles are highly accurate.
- It is difficult to generalize the exact level of accuracy, as it depends
on your data distribution and volume of data being aggregated.
- The ``compression`` parameter is a trade-off between accuracy and memory
usage. A higher value will result in more accurate percentiles but will
consume more memory.
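For example, to trade additional memory for more accurate percentiles, a
higher ``compression`` value can be passed as the third argument. A sketch
(the result is approximate and depends on the data)::
SELECT percentile(position, 0.95, 500.0) AS p95 FROM locations;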
.. _aggregation-sum:
``sum(column)``
---------------
Returns the sum of a set of numeric input values that are not ``NULL``.
Depending on the argument type a suitable return type is chosen. For ``real``,
``double precision``, ``numeric`` and ``interval`` argument types, the return
type is the same as the argument type. For ``byte``, ``smallint``, ``integer``
and ``bigint`` the return type is always ``bigint``. If the range of
``bigint`` values (-2^63 to 2^63-1) gets exceeded, an ``ArithmeticException``
will be raised.
::
cr> select sum(position), kind from locations
... group by kind order by sum(position) asc;
+-----+-------------+
| sum | kind |
+-----+-------------+
| 10 | Star System |
| 13 | Galaxy |
| 15 | Planet |
+-----+-------------+
SELECT 3 rows in set (... sec)
::
cr> select sum(position) as position_sum from locations;
+--------------+
| position_sum |
+--------------+
| 38 |
+--------------+
SELECT 1 row in set (... sec)
::
cr> select sum(name), kind from locations group by kind order by sum(name) desc;
SQLParseException[Cannot cast value `Aldebaran` to type `byte`]
If the ``sum`` aggregation on a fixed-length numeric data type can potentially
exceed its range, it is possible to handle the overflow by casting the
:ref:`function ` argument to the :ref:`numeric type
` with arbitrary precision.
.. Hidden: create user visits table
cr> CREATE TABLE uservisits (id integer, count bigint)
... CLUSTERED INTO 1 SHARDS
... WITH (number_of_replicas = 0);
CREATE OK, 1 row affected (... sec)
.. Hidden: insert into uservisits table
cr> INSERT INTO uservisits VALUES (1, 9223372036854775806), (2, 10);
INSERT OK, 2 rows affected (... sec)
.. Hidden: refresh uservisits table
cr> REFRESH TABLE uservisits;
REFRESH OK, 1 row affected (... sec)
The ``sum`` aggregation on the ``bigint`` column will result in an overflow
in the following aggregation query::
cr> SELECT sum(count)
... FROM uservisits;
ArithmeticException[long overflow]
To address the overflow of the sum aggregation on the given field, we cast
the aggregation column to the ``numeric`` data type::
cr> SELECT sum(count::numeric)
... FROM uservisits;
+---------------------+
| sum |
+---------------------+
| 9223372036854775816 |
+---------------------+
SELECT 1 row in set (... sec)
.. Hidden: drop uservisits table
cr> DROP TABLE uservisits;
DROP OK, 1 row affected (... sec)
.. _aggregation-variance:
``variance(column)``
--------------------
The ``variance`` aggregate function computes the `Variance`_ of the set of
non-null values in a column. It is a measure of how far a set of numbers is
spread out. A variance of ``0.0`` indicates that all values are the same.
``variance`` is defined on all numeric types, except for
:ref:`NUMERIC `, and on timestamp. It always returns a
``double precision`` value. If all values are ``NULL``, or if there are no
values at all, ``NULL`` is returned.
Example::
cr> select variance(position), kind from locations
... group by kind order by kind desc;
+----------+-------------+
| variance | kind |
+----------+-------------+
| 1.25 | Star System |
| 2.0 | Planet |
| 3.6875 | Galaxy |
+----------+-------------+
SELECT 3 rows in set (... sec)
.. CAUTION::
Due to Java double precision arithmetic it is possible that any two
executions of the aggregate function on the same data produce slightly
differing results.
.. _aggregation-topk:
``topk(column, [k], [max_capacity])``
-------------------------------------
The ``topk`` aggregate function computes the ``k`` most frequent values. The
result is an ``OBJECT`` in the following format::
{
    "frequencies": [
        {
            "estimate": <estimate>,
            "item": <item>,
            "lower_bound": <lower_bound>,
            "upper_bound": <upper_bound>
        },
        ...
    ],
    "maximum_error": <maximum_error>
}
The ``frequencies`` list is ordered by the estimated frequency, with the most
common items listed first.
``k`` defaults to ``8`` and can't exceed ``5000``. The ``max_capacity``
parameter is optional and describes the maximum number of tracked items; it
must be a power of 2 and defaults to ``8192``.
Example::
cr> select topk(country, 3) from sys.summits;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| topk |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"frequencies": [{"estimate": 436, "item": "IT", "lower_bound": 436, "upper_bound": 436}, {"estimate": 401, "item": "AT", "lower_bound": 401, "upper_bound": 401}, {"estimate": 320, "item": "CH", "lower_bound": 320, "upper_bound": 320}], "maximum_error": 0} |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT 1 row in set (... sec)
Internally a `Frequency Sketch`_ is used to track the frequencies of the most
common values. Higher values of ``max_capacity`` provide better accuracy at
the cost of increased memory usage. If fewer distinct items than 75% of the
``max_capacity`` are processed, the frequencies in the result are exact;
otherwise they are an approximation. The result contains all values whose
frequencies are above the error threshold and may also include false
positives. The error threshold indicates the minimum frequency which can be
detected reliably and is defined as follows::
M = max_capacity, always a power of 2
N = total count of items
e = epsilon = 3.5 / M (minimum detectable frequency)
error threshold = (N < 0.75 * M) ? 0 : e * N
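For example, processing ``N = 100,000`` items with ``M = 8192`` gives
``e = 3.5 / 8192 ≈ 0.000427``; since ``N >= 0.75 * M``, the error threshold is
``0.000427 * 100,000 ≈ 43``, matching the corresponding entry in the table
below.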
The following table is an
extract of the `Error Threshold Table`_ and shows the error threshold in relation
to the ``max_capacity`` and the number of processed items. A threshold of 0
indicates that the frequencies are exact.
.. list-table:: Error Threshold
:widths: 20 20 20 20 20 20 20 20
:header-rows: 1
:stub-columns: 1
* - max_capacity vs. items
- 8192
- 16384
- 32768
- 65536
- 131072
- 262144
- 524288
* - 10000
- 4
- 0
- 0
- 0
- 0
- 0
- 0
* - 100000
- 43
- 21
- 11
- 5
- 3
- 0
- 0
* - 1000000
- 427
- 214
- 107
- 53
- 27
- 13
- 7
* - 10000000
- 4272
- 2136
- 1068
- 534
- 267
- 134
- 67
* - 100000000
- 42725
- 21362
- 10681
- 5341
- 2670
- 1335
- 668
* - 1000000000
- 427246
- 213623
- 106812
- 53406
- 26703
- 13351
- 6676
The error threshold shows which ranges of frequencies can be tracked depending
on the number of items and the capacity. E.g. processing 10,000 items with a
``max_capacity`` of 8192 results in an error threshold of 4. Therefore all
items with frequencies greater than 4 will be included. Some items with
frequencies below the threshold of 4 may also appear in the result.
.. _aggregation-limitations:
Limitations
===========
- ``DISTINCT`` is not supported with aggregations on :ref:`sql_joins`.
- Aggregate functions can only be applied to columns with a :ref:`plain index
`, which is the default for all :ref:`primitive type
` columns.
.. _Aggregate function: https://en.wikipedia.org/wiki/Aggregate_function
.. _Geometric Mean: https://en.wikipedia.org/wiki/Geometric_mean
.. _HyperLogLog++: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
.. _Percentile: https://en.wikipedia.org/wiki/Percentile
.. _Population Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation
.. _Sample Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation
.. _TDigest: https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf
.. _Variance: https://en.wikipedia.org/wiki/Variance
.. _Frequency Sketch: https://datasketches.apache.org/docs/Frequency/FrequencySketches.html
.. _Error Threshold Table: https://datasketches.apache.org/docs/Frequency/FrequentItemsErrorTable.html

.. highlight:: psql
.. _table-functions:
===============
Table functions
===============
Table functions are :ref:`functions ` that produce a set of
rows. They can be used in place of a relation in the ``FROM`` clause.
If used within the select list, the table functions will be :ref:`evaluated
` per row of the relations in the ``FROM`` clause, generating
one or more rows which are appended to the result set. If multiple table
functions producing different numbers of rows are used, ``NULL`` values will be
returned for the functions that are exhausted.
For example::
cr> select unnest([1, 2, 3]), unnest([1, 2]);
+--------+--------+
| unnest | unnest |
+--------+--------+
| 1 | 1 |
| 2 | 2 |
| 3 | NULL |
+--------+--------+
SELECT 3 rows in set (... sec)
.. note::
Table functions in the select list are executed after aggregations. So
aggregations can be used as arguments to table functions, but the other way
around is not allowed, unless subqueries are utilized.
For example::
(SELECT aggregate_func(col) FROM (SELECT table_func(...) AS col) ...)
.. _table-functions-scalar:
Scalar functions
================
A :ref:`scalar function `, when used in the ``FROM`` clause
in place of a relation, will result in a table of one row and one column,
containing the :ref:`scalar value ` returned from the function.
::
cr> SELECT * FROM abs(-5), initcap('hello world');
+-----+-------------+
| abs | initcap |
+-----+-------------+
| 5 | Hello World |
+-----+-------------+
SELECT 1 row in set (... sec)
``empty_row( )``
================
``empty_row`` doesn't take any arguments and produces a table with an empty
row and no columns.
::
cr> select * from empty_row();
SELECT OK, 1 row affected (... sec)
.. _unnest:
``unnest( array [ , array ... ] )``
===================================
unnest takes any number of array parameters and produces a table where each
provided array argument results in a column.
The columns are named ``colN`` where ``N`` is a number starting at 1.
::
cr> select * from unnest([1, 2, 3], ['Arthur', 'Trillian', 'Marvin']);
+------+----------+
| col1 | col2 |
+------+----------+
| 1 | Arthur |
| 2 | Trillian |
| 3 | Marvin |
+------+----------+
SELECT 3 rows in set (... sec)
.. _table-functions-generate-series:
``pg_catalog.generate_series(start, stop, [step])``
===================================================
Generate a series of values from inclusive start to inclusive stop with
``step`` increments.
The arguments can be of type ``integer`` or ``bigint``, in which case ``step`` is
optional and defaults to ``1``.
``start`` and ``stop`` can also be of type ``timestamp with time zone`` or
``timestamp without time zone`` in which case ``step`` is required and must be
of type ``interval``.
The return value always matches the ``start`` / ``stop`` types.
::
cr> SELECT * FROM generate_series(1, 4);
+-----------------+
| generate_series |
+-----------------+
| 1 |
| 2 |
| 3 |
| 4 |
+-----------------+
SELECT 4 rows in set (... sec)
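An explicit ``step`` advances the series by that increment; the series ends at
the last value not exceeding ``stop``. A sketch, with output derived from the
definition above::
    cr> SELECT * FROM generate_series(1, 10, 3);
    +-----------------+
    | generate_series |
    +-----------------+
    |               1 |
    |               4 |
    |               7 |
    |              10 |
    +-----------------+
    SELECT 4 rows in set (... sec)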
::
cr> SELECT
... x,
... date_format('%Y-%m-%d, %H:%i', x)
... FROM generate_series('2019-01-01 00:00'::timestamp, '2019-01-04 00:00'::timestamp, '30 hours'::interval) AS t(x);
+---------------+-------------------+
| x | date_format |
+---------------+-------------------+
| 1546300800000 | 2019-01-01, 00:00 |
| 1546408800000 | 2019-01-02, 06:00 |
| 1546516800000 | 2019-01-03, 12:00 |
+---------------+-------------------+
SELECT 3 rows in set (... sec)
.. _table-functions-generate-subscripts:
``pg_catalog.generate_subscripts(array, dim, [reverse])``
=========================================================
Generate the subscripts for the specified dimension ``dim`` of the given
``array``. Zero rows are returned for arrays that do not have the requested
dimension, or for ``NULL`` arrays (but valid subscripts are returned for
``NULL`` array elements).
If ``reverse`` is ``true`` the subscripts will be returned in reverse order.
This example takes a one dimensional array of four elements, where elements
at positions 1 and 3 are ``NULL``:
::
cr> SELECT generate_subscripts([NULL, 1, NULL, 2], 1) AS s;
+---+
| s |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
+---+
SELECT 4 rows in set (... sec)
This example returns the reversed list of subscripts for the same array:
::
cr> SELECT generate_subscripts([NULL, 1, NULL, 2], 1, true) AS s;
+---+
| s |
+---+
| 4 |
| 3 |
| 2 |
| 1 |
+---+
SELECT 4 rows in set (... sec)
This example works on an array of three dimensions. Each of the elements within
a given level must be either ``NULL``, or an array of the same size as the
other arrays within the same level.
::
cr> select generate_subscripts([[[1],[2]], [[3],[4]], [[4],[5]]], 2) as s;
+---+
| s |
+---+
| 1 |
| 2 |
+---+
SELECT 2 rows in set (... sec)
.. _table-functions-regexp-matches:
``regexp_matches(source, pattern [, flags])``
=============================================
Uses the :ref:`regular expression ` ``pattern`` to
match against the ``source`` string.
The result rows have one column:
.. list-table::
:header-rows: 1
* - Column name
- Description
* - groups
- ``array(text)``
If ``pattern`` matches ``source``, an array of the matched regular expression
groups is returned.
If no regular expression group was used, the whole pattern is used as a group.
A regular expression group is formed by a subexpression that is surrounded by
parentheses. The position of a group is determined by the position of its
opening parenthesis.
For example, when matching the pattern ``\b([A-Z])``, a match for the
subexpression ``([A-Z])`` would create group No. 1. If you want to group items
with parentheses without creating a capture group, use ``(?:...)``.
For example, matching the regular expression ``([Aa](.+)z)`` against
``alcatraz`` results in these groups:
- group 1: ``alcatraz`` (from first to last parenthesis or whole pattern)
- group 2: ``lcatra`` (beginning at second parenthesis)
The ``regexp_matches`` :ref:`function ` will return all groups
as a ``text`` array::
cr> select regexp_matches('alcatraz', '(a(.+)z)') as matched;
+------------------------+
| matched |
+------------------------+
| ["alcatraz", "lcatra"] |
+------------------------+
SELECT 1 row in set (... sec)
::
cr> select regexp_matches('alcatraz', 'traz') as matched;
+----------+
| matched |
+----------+
| ["traz"] |
+----------+
SELECT 1 row in set (... sec)
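A non-capturing group ``(?:...)``, as described above, takes part in the match
but does not produce a group of its own; a sketch::
    cr> select regexp_matches('alcatraz', '(?:al)(catraz)') as matched;
    +------------+
    | matched    |
    +------------+
    | ["catraz"] |
    +------------+
    SELECT 1 row in set (... sec)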
Through array element access functionality, a group can be selected directly.
See :ref:`sql_dql_object_arrays` for details.
::
cr> select regexp_matches('alcatraz', '(a(.+)z)')[2] as second_group;
+--------------+
| second_group |
+--------------+
| lcatra |
+--------------+
SELECT 1 row in set (... sec)
.. _table-functions-regexp-matches-flags:
Flags
.....
This function takes a number of flags as an optional third parameter. These flags
are given as a string containing any of the characters listed below. Order does
not matter.
+-------+---------------------------------------------------------------------+
| Flag | Description |
+=======+=====================================================================+
| ``i`` | enable case insensitive matching |
+-------+---------------------------------------------------------------------+
| ``u`` | enable unicode case folding when used together with ``i`` |
+-------+---------------------------------------------------------------------+
| ``U`` | enable unicode support for character classes like ``\W`` |
+-------+---------------------------------------------------------------------+
| ``s`` | make ``.`` match line terminators, too |
+-------+---------------------------------------------------------------------+
| ``m`` | make ``^`` and ``$`` match on the beginning or end of a line |
| | too. |
+-------+---------------------------------------------------------------------+
| ``x`` | permit whitespace and line comments starting with ``#`` |
+-------+---------------------------------------------------------------------+
| ``d`` | only ``\n`` is considered a line-terminator when using ``^``, ``$`` |
| | and ``.`` |
+-------+---------------------------------------------------------------------+
| ``g`` | keep matching until the end of ``source``, instead of stopping at |
| | the first match. |
+-------+---------------------------------------------------------------------+
Examples
........
In this example the ``pattern`` does not match anything in the ``source`` and
the result is an empty table:
::
cr> select regexp_matches('foobar', '^(a(.+)z)$') as matched;
+---------+
| matched |
+---------+
+---------+
SELECT 0 rows in set (... sec)
In this example we find the term that follows two digits:
::
cr> select regexp_matches('99 bottles of beer on the wall', '\d{2}\s(\w+).*', 'ixU')
... as matched;
+-------------+
| matched |
+-------------+
| ["bottles"] |
+-------------+
SELECT 1 row in set (... sec)
This example shows the use of flag ``g``, splitting ``source`` into a set of
arrays, each containing two entries:
::
cr> select regexp_matches('#abc #def #ghi #jkl', '(#[^\s]*) (#[^\s]*)', 'g') as matched;
+------------------+
| matched |
+------------------+
| ["#abc", "#def"] |
| ["#ghi", "#jkl"] |
+------------------+
SELECT 2 rows in set (... sec)
.. _pg_catalog.pg_get_keywords:
``pg_catalog.pg_get_keywords()``
================================
Returns a list of SQL keywords and their categories.
The result rows have three columns:
.. list-table::
:header-rows: 1
* - Column name
- Description
* - ``word``
- The SQL keyword
* - ``catcode``
- Code for the category (``R`` for reserved keywords, ``U`` for unreserved
keywords)
* - ``catdesc``
- The description of the category
::
cr> SELECT * FROM pg_catalog.pg_get_keywords() ORDER BY 1 LIMIT 4;
+----------+---------+------------+
| word | catcode | catdesc |
+----------+---------+------------+
| absolute | U | unreserved |
| add | R | reserved |
| alias | U | unreserved |
| all | R | reserved |
+----------+---------+------------+
SELECT 4 rows in set (... sec)
.. _information_schema._pg_expandarray:
``information_schema._pg_expandarray(array)``
=============================================
Takes an array and returns a set of rows containing each value of the array
together with its index.
.. list-table::
:header-rows: 1
* - Column name
- Description
* - x
- Value within the array
* - n
- Index of the value within the array
::
cr> SELECT information_schema._pg_expandarray(ARRAY['a', 'b']) AS result;
+----------+
| result |
+----------+
| ["a", 1] |
| ["b", 2] |
+----------+
SELECT 2 rows in set (... sec)
::
cr> SELECT * from information_schema._pg_expandarray(ARRAY['a', 'b']);
+---+---+
| x | n |
+---+---+
| a | 1 |
| b | 2 |
+---+---+
SELECT 2 rows in set (... sec)

.. highlight:: psql
.. _window-functions:
================
Window functions
================
Window functions are :ref:`functions ` which perform a
computation across a set of rows which are related to the current row. This is
comparable to :ref:`aggregation functions `, but window
functions do not cause multiple rows to be grouped into a single row.
.. _window-function-call:
Window function call
====================
.. _window-call-synopsis:
Synopsis
--------
The synopsis of a window function call is one of the following
::
function_name ( { * | [ expression [, expression ... ] ] } )
[ FILTER ( WHERE condition ) ]
[ { RESPECT | IGNORE } NULLS ]
over_clause
where ``function_name`` is a name of a :ref:`general-purpose window
` or :ref:`aggregate function
` and ``expression`` is a column reference, :ref:`scalar
function ` or literal.
If ``FILTER`` is specified, then only the rows that meet the :ref:`WHERE
` condition are supplied to the window function. Only window
functions that are :ref:`aggregates ` accept the ``FILTER``
clause.
If the ``IGNORE NULLS`` option is specified, then null values are excluded from
the window function executions. The window functions that support this option
are: :ref:`window-functions-lead`, :ref:`window-functions-lag`,
:ref:`window-functions-first-value`, :ref:`window-functions-last-value`,
and :ref:`window-functions-nth-value`. If a function supports this option and
it is not specified, then ``RESPECT NULLS`` is set by default.
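For instance, ``IGNORE NULLS`` lets ``first_value`` skip over null values; a
sketch, with output derived from the semantics described above::
    cr> SELECT
    ...   x,
    ...   FIRST_VALUE(x) IGNORE NULLS OVER (ORDER BY x NULLS FIRST) AS fv
    ... FROM (VALUES (NULL::int), (1), (2)) AS t(x);
    +------+------+
    | x    | fv   |
    +------+------+
    | NULL | NULL |
    |    1 |    1 |
    |    2 |    1 |
    +------+------+
    SELECT 3 rows in set (... sec)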
The :ref:`window-definition-over` clause is what declares a function to be a
window function.
The window function call that uses a ``wildcard`` instead of an ``expression``
as a function argument is supported only by the ``count(*)`` aggregate
function.
.. _window-definition:
Window definition
=================
.. _window-definition-over:
OVER
----
.. _window-definition-over-synopsis:
Synopsis
........
::
OVER { window_name | ( [ window_definition ] ) }
where ``window_definition`` has the syntax
::
window_definition:
[ window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ { RANGE | ROWS } BETWEEN frame_start AND frame_end ]
The ``window_name`` refers to ``window_definition`` defined in the
:ref:`WINDOW ` clause.
The ``frame_start`` and ``frame_end`` can be one of
::
UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING
The default frame definition is ``RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
ROW``. If ``frame_end`` is omitted it defaults to ``CURRENT ROW``.
``frame_start`` cannot be ``FOLLOWING`` or ``UNBOUNDED FOLLOWING`` and
``frame_end`` cannot be ``PRECEDING`` or ``UNBOUNDED PRECEDING``.
In ``RANGE`` mode if the ``frame_start`` is ``CURRENT ROW`` the frame starts
with the current row's first peer (a row that the window's ``ORDER BY``
:ref:`expression ` sorts as equal to the current row), while
a ``frame_end`` of ``CURRENT ROW`` means the frame will end with the current
row's last peer.
In ``ROWS`` mode, ``CURRENT ROW`` means the current row.
The ``offset PRECEDING`` and ``offset FOLLOWING`` options vary in meaning
depending on the frame mode. In ``ROWS`` mode, the ``offset`` is an integer
indicating that the frame starts or ends that many rows before or
after the current row. In ``RANGE`` mode, the use of a custom ``offset`` option
requires that there is exactly one ``ORDER BY`` column in the window
definition. The frame contains those rows whose ordering column value is no
more than ``offset`` minus (for ``PRECEDING``) or plus (for ``FOLLOWING``) the
current row's ordering column value. Because the value of ``offset`` is
subtracted/added to the values of the ordering column, only type combinations
that support addition/subtraction operations are allowed. For instance, when
the ordering column is of type :ref:`timestamp `, the
``offset`` expression can be an :ref:`interval `.
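To make the ``RANGE`` offset semantics concrete, a sketch: with ``RANGE
BETWEEN 1 PRECEDING AND CURRENT ROW``, each frame covers the rows whose
ordering value lies within ``1`` below the current row's value::
    cr> SELECT
    ...   x,
    ...   SUM(x) OVER (ORDER BY x RANGE BETWEEN 1 PRECEDING AND CURRENT ROW) AS s
    ... FROM (VALUES (1), (2), (4)) AS t(x);
    +---+---+
    | x | s |
    +---+---+
    | 1 | 1 |
    | 2 | 3 |
    | 4 | 4 |
    +---+---+
    SELECT 3 rows in set (... sec)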
The :ref:`window-definition-over` clause defines the ``window`` containing the
appropriate rows which will take part in the ``window function`` computation.
An empty :ref:`window-definition-over` clause defines a ``window`` containing
all the rows in the result set.
Example::
cr> SELECT dept_id, COUNT(*) OVER() AS cnt FROM employees ORDER BY 1, 2;
+---------+-----+
| dept_id | cnt |
+---------+-----+
| 4001 | 18 |
| 4001 | 18 |
| 4001 | 18 |
| 4002 | 18 |
| 4002 | 18 |
| 4002 | 18 |
| 4002 | 18 |
| 4003 | 18 |
| 4003 | 18 |
| 4003 | 18 |
| 4003 | 18 |
| 4003 | 18 |
| 4004 | 18 |
| 4004 | 18 |
| 4004 | 18 |
| 4006 | 18 |
| 4006 | 18 |
| 4006 | 18 |
+---------+-----+
SELECT 18 rows in set (... sec)
The ``PARTITION BY`` clause groups the rows within a window into
partitions which are processed separately by the window function, each
partition in turn becoming a window. If ``PARTITION BY`` is not specified, all
the rows are considered a single partition.
Example::
cr> SELECT dept_id, ROW_NUMBER() OVER(PARTITION BY dept_id) AS row_num
... FROM employees ORDER BY 1, 2;
+---------+---------+
| dept_id | row_num |
+---------+---------+
| 4001 | 1 |
| 4001 | 2 |
| 4001 | 3 |
| 4002 | 1 |
| 4002 | 2 |
| 4002 | 3 |
| 4002 | 4 |
| 4003 | 1 |
| 4003 | 2 |
| 4003 | 3 |
| 4003 | 4 |
| 4003 | 5 |
| 4004 | 1 |
| 4004 | 2 |
| 4004 | 3 |
| 4006 | 1 |
| 4006 | 2 |
| 4006 | 3 |
+---------+---------+
SELECT 18 rows in set (... sec)
If ``ORDER BY`` is supplied, the ``window`` definition consists of a range of
rows starting with the first row in the ``partition`` and ending with the
current row, plus any subsequent rows that are equal to the current row, which
are the current row's ``peers``.
Example::
cr> SELECT
... dept_id,
... sex,
... COUNT(*) OVER(PARTITION BY dept_id ORDER BY sex) AS cnt
... FROM employees
... ORDER BY 1, 2, 3;
+---------+-----+-----+
| dept_id | sex | cnt |
+---------+-----+-----+
| 4001 | M | 3 |
| 4001 | M | 3 |
| 4001 | M | 3 |
| 4002 | F | 1 |
| 4002 | M | 4 |
| 4002 | M | 4 |
| 4002 | M | 4 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4004 | F | 1 |
| 4004 | M | 3 |
| 4004 | M | 3 |
| 4006 | F | 1 |
| 4006 | M | 3 |
| 4006 | M | 3 |
+---------+-----+-----+
SELECT 18 rows in set (... sec)
.. NOTE::
Taking into account the ``peers`` concept mentioned above, for an empty
:ref:`window-definition-over` clause all the rows in the result set are
``peers``.
.. NOTE::
:ref:`Aggregation functions ` will be treated as ``window
functions`` when used in conjunction with the :ref:`window-definition-over`
clause.
.. NOTE::
Window definitions ordered or partitioned by an array column type are
currently not supported.
In the ``UNBOUNDED FOLLOWING`` case the ``window`` for each row starts with
that row and ends with the last row in the current ``partition``. If the
``current row`` has ``peers`` the ``window`` will include (or start with) all
the ``current row`` peers and end at the upper bound of the ``partition``.
Example::
cr> SELECT
... dept_id,
... sex,
... COUNT(*) OVER(
... PARTITION BY dept_id
... ORDER BY
... sex RANGE BETWEEN CURRENT ROW
... AND UNBOUNDED FOLLOWING
... ) partitionByDeptOrderBySex
... FROM employees
... ORDER BY 1, 2, 3;
+---------+-----+---------------------------+
| dept_id | sex | partitionbydeptorderbysex |
+---------+-----+---------------------------+
| 4001 | M | 3 |
| 4001 | M | 3 |
| 4001 | M | 3 |
| 4002 | F | 4 |
| 4002 | M | 3 |
| 4002 | M | 3 |
| 4002 | M | 3 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4003 | M | 5 |
| 4004 | F | 3 |
| 4004 | M | 2 |
| 4004 | M | 2 |
| 4006 | F | 3 |
| 4006 | M | 2 |
| 4006 | M | 2 |
+---------+-----+---------------------------+
SELECT 18 rows in set (... sec)
.. _window-definition-named-windows:
Named windows
-------------
It is possible to define a list of named window definitions that can be
referenced in :ref:`window-definition-over` clauses. To do this, use the
:ref:`sql-select-window` clause in the :ref:`sql-select` clause.
Named windows are particularly useful when the same window definition
could be used in multiple :ref:`window-definition-over` clauses. For instance
::
cr> SELECT
... x,
... FIRST_VALUE(x) OVER (w) AS "first",
... LAST_VALUE(x) OVER (w) AS "last"
... FROM (VALUES (1), (2), (3), (4)) AS t(x)
... WINDOW w AS (ORDER BY x);
+---+-------+------+
| x | first | last |
+---+-------+------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
+---+-------+------+
SELECT 4 rows in set (... sec)
If a ``window_name`` is specified in the window definition of the
:ref:`window-definition-over` clause, then there must be a named window entry
that matches the ``window_name`` in the window definition list of the
:ref:`sql-select-window` clause.
If the :ref:`window-definition-over` clause has its own non-empty window
definition and references a window definition from the :ref:`sql-select-window`
clause, then it can only add clauses from the referenced window, but not
overwrite them.
::
cr> SELECT
... x,
... LAST_VALUE(x) OVER (w ORDER BY x) AS y
... FROM (VALUES
... (1, 1),
... (2, 1),
... (3, 2),
... (4, 2) ) AS t(x, y)
... WINDOW w AS (PARTITION BY y);
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+---+---+
SELECT 4 rows in set (... sec)
Otherwise, an attempt to override the clauses of the referenced window by the
window definition of the :ref:`window-definition-over` clause will result in
failure.
::
cr> SELECT
... FIRST_VALUE(x) OVER (w ORDER BY x)
... FROM (VALUES(1), (2), (3), (4)) as t(x)
... WINDOW w AS (ORDER BY x);
SQLParseException[Cannot override ORDER BY clause of window w]
It is not possible to define the ``PARTITION BY`` clause in the window
definition of the :ref:`window-definition-over` clause if it references a
window definition from the :ref:`sql-select-window` clause.
The window definitions in the :ref:`sql-select-window` clause cannot define
their own window frames if they are referenced by non-empty window definitions
of the :ref:`window-definition-over` clauses.
The definition of the named window can itself begin with a ``window_name``. In
this case all the elements of interconnected named windows will be copied to
the window definition of the :ref:`window-definition-over` clause if it
references the named window definition that has subsequent window
references. The window definitions in the ``WINDOW`` clause permit only
backward references.
::
cr> SELECT
... x,
... ROW_NUMBER() OVER (w) AS y
... FROM (VALUES
... (1, 1),
... (3, 2),
... (2, 1)) AS t (x, y)
... WINDOW p AS (PARTITION BY y),
... w AS (p ORDER BY x);
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
+---+---+
SELECT 3 rows in set (... sec)
.. _window-functions-general-purpose:
General-purpose window functions
================================
``row_number()``
----------------
Returns the number of the current row within its window.
Example::
cr> SELECT
... col1,
... ROW_NUMBER() OVER(ORDER BY col1) as row_num
... FROM (VALUES('x'), ('y'), ('z')) AS t;
+------+---------+
| col1 | row_num |
+------+---------+
| x | 1 |
| y | 2 |
| z | 3 |
+------+---------+
SELECT 3 rows in set (... sec)
.. _window-functions-first-value:
``first_value(arg)``
--------------------
Returns the argument value :ref:`evaluated ` at the first row
within the window.
Its return type is the type of its argument.
Example::
cr> SELECT
... col1,
... FIRST_VALUE(col1) OVER (ORDER BY col1) AS value
... FROM (VALUES('x'), ('y'), ('y'), ('z')) AS t;
+------+-------+
| col1 | value |
+------+-------+
| x | x |
| y | x |
| y | x |
| z | x |
+------+-------+
SELECT 4 rows in set (... sec)
.. _window-functions-last-value:
``last_value(arg)``
-------------------
Returns the argument value :ref:`evaluated ` at the last row
within the window.
Its return type is the type of its argument.
Example::
cr> SELECT
... col1,
... LAST_VALUE(col1) OVER(ORDER BY col1) AS value
... FROM (VALUES('x'), ('y'), ('y'), ('z')) AS t;
+------+-------+
| col1 | value |
+------+-------+
| x | x |
| y | y |
| y | y |
| z | z |
+------+-------+
SELECT 4 rows in set (... sec)
.. _window-functions-nth-value:
``nth_value(arg, number)``
--------------------------
Returns the argument value :ref:`evaluated ` at the row that is
the nth row within the window. ``NULL`` is returned if the nth row doesn't
exist in the window.
Its return type is the type of its first argument.
Example::
cr> SELECT
... col1,
... NTH_VALUE(col1, 3) OVER(ORDER BY col1) AS val
... FROM (VALUES ('x'), ('y'), ('y'), ('z')) AS t;
+------+------+
| col1 | val |
+------+------+
| x | NULL |
| y | y |
| y | y |
| z | y |
+------+------+
SELECT 4 rows in set (... sec)
.. _window-functions-lag:
``lag(arg [, offset [, default] ])``
------------------------------------
.. _window-functions-lag-synopsis:
Synopsis
........
::
lag(argument any [, offset integer [, default any]])
Returns the argument value :ref:`evaluated ` at the row that
precedes the current row by the offset within the partition. If there is no
such row, the return value is ``default``. If ``offset`` or ``default``
arguments are missing, they default to ``1`` and ``null``, respectively.
Both ``offset`` and ``default`` are evaluated with respect to the current row.
If ``offset`` is ``0``, then the argument value is evaluated for the current row.
The ``default`` and ``argument`` data types must match.
Example::
cr> SELECT
... dept_id,
... year,
... budget,
... LAG(budget) OVER(
... PARTITION BY dept_id) prev_budget
... FROM (VALUES
... (1, 2017, 45000),
... (1, 2018, 35000),
... (2, 2017, 15000),
... (2, 2018, 65000),
... (2, 2019, 12000))
... as t (dept_id, year, budget);
+---------+------+--------+-------------+
| dept_id | year | budget | prev_budget |
+---------+------+--------+-------------+
| 1 | 2017 | 45000 | NULL |
| 1 | 2018 | 35000 | 45000 |
| 2 | 2017 | 15000 | NULL |
| 2 | 2018 | 65000 | 15000 |
| 2 | 2019 | 12000 | 65000 |
+---------+------+--------+-------------+
SELECT 5 rows in set (... sec)
.. _window-functions-lead:
``lead(arg [, offset [, default] ])``
-------------------------------------
.. _window-functions-lead-synopsis:
Synopsis
........
::
lead(argument any [, offset integer [, default any]])
The ``lead`` function is the counterpart of the :ref:`lag window function
` as it allows the :ref:`evaluation `
of the argument at rows that follow the current row. ``lead`` returns the
argument value evaluated at the row that follows the current row by the offset
within the partition. If there is no such row, the return value is ``default``.
If ``offset`` or ``default`` arguments are missing, they default to ``1`` and
``null``, respectively.
Both ``offset`` and ``default`` are evaluated with respect to the current row.
If ``offset`` is ``0``, then the argument value is evaluated for the current row.
The ``default`` and ``argument`` data types must match.
Example::
cr> SELECT
... dept_id,
... year,
... budget,
... LEAD(budget) OVER(
... PARTITION BY dept_id) next_budget
... FROM (VALUES
... (1, 2017, 45000),
... (1, 2018, 35000),
... (2, 2017, 15000),
... (2, 2018, 65000),
... (2, 2019, 12000))
... as t (dept_id, year, budget);
+---------+------+--------+-------------+
| dept_id | year | budget | next_budget |
+---------+------+--------+-------------+
| 1 | 2017 | 45000 | 35000 |
| 1 | 2018 | 35000 | NULL |
| 2 | 2017 | 15000 | 65000 |
| 2 | 2018 | 65000 | 12000 |
| 2 | 2019 | 12000 | NULL |
+---------+------+--------+-------------+
SELECT 5 rows in set (... sec)
.. _window-functions-rank:
``rank()``
----------
.. _window-functions-rank-synopsis:
Synopsis
........
::
rank()
Returns the rank of every row within a partition of a result set.
Within each partition, the rank of the first row is ``1``. Subsequent tied
rows are given the same rank, and the potential rank of the next row
is incremented. Because of this, ranks may not be sequential.
Example::
cr> SELECT
... name,
... department_id,
... salary,
... RANK() OVER (ORDER BY salary desc) as salary_rank
... FROM (VALUES
... ('Bobson Dugnutt', 1, 2000),
... ('Todd Bonzalez', 2, 2500),
... ('Jess Brewer', 1, 2500),
... ('Safwan Buchanan', 1, 1900),
... ('Hal Dodd', 1, 2500),
... ('Gillian Hawes', 2, 2000))
... as t (name, department_id, salary);
+-----------------+---------------+--------+-------------+
| name | department_id | salary | salary_rank |
+-----------------+---------------+--------+-------------+
| Todd Bonzalez | 2 | 2500 | 1 |
| Jess Brewer | 1 | 2500 | 1 |
| Hal Dodd | 1 | 2500 | 1 |
| Bobson Dugnutt | 1 | 2000 | 4 |
| Gillian Hawes | 2 | 2000 | 4 |
| Safwan Buchanan | 1 | 1900 | 6 |
+-----------------+---------------+--------+-------------+
SELECT 6 rows in set (... sec)
.. _window-functions-dense-rank:
``dense_rank()``
----------------
.. _window-functions-dense-rank-synopsis:
Synopsis
........
::
dense_rank()
Returns the rank of every row within a partition of a result set, similar to
``rank``. However, unlike ``rank``, ``dense_rank`` always returns sequential
rank values.
Within each partition, the rank of the first row is ``1``. Subsequent tied
rows are given the same rank.
Example::
cr> SELECT
... name,
... department_id,
... salary,
... DENSE_RANK() OVER (ORDER BY salary desc) as salary_rank
... FROM (VALUES
... ('Bobson Dugnutt', 1, 2000),
... ('Todd Bonzalez', 2, 2500),
... ('Jess Brewer', 1, 2500),
... ('Safwan Buchanan', 1, 1900),
... ('Hal Dodd', 1, 2500),
... ('Gillian Hawes', 2, 2000))
... as t (name, department_id, salary);
+-----------------+---------------+--------+-------------+
| name | department_id | salary | salary_rank |
+-----------------+---------------+--------+-------------+
| Todd Bonzalez | 2 | 2500 | 1 |
| Jess Brewer | 1 | 2500 | 1 |
| Hal Dodd | 1 | 2500 | 1 |
| Bobson Dugnutt | 1 | 2000 | 2 |
| Gillian Hawes | 2 | 2000 | 2 |
| Safwan Buchanan | 1 | 1900 | 3 |
+-----------------+---------------+--------+-------------+
SELECT 6 rows in set (... sec)
.. _window-aggregate-functions:
Aggregate window functions
==========================
The standard :ref:`aggregation functions` can also be
used with the ``window`` functionality.

.. _user-defined-functions:
======================
User-defined functions
======================
.. _udf-create-replace:
``CREATE OR REPLACE``
=====================
CrateDB supports user-defined :ref:`functions `. See
:ref:`ref-create-function` for a full syntax description.
``CREATE FUNCTION`` defines a new function::
cr> CREATE FUNCTION my_subtract_function(integer, integer)
... RETURNS integer
... LANGUAGE JAVASCRIPT
... AS 'function my_subtract_function(a, b) { return a - b; }';
CREATE OK, 1 row affected (... sec)
.. hide:
cr> _wait_for_function('my_subtract_function(1::integer, 1::integer)')
::
cr> SELECT doc.my_subtract_function(3, 1) AS col;
+-----+
| col |
+-----+
| 2 |
+-----+
SELECT 1 row in set (... sec)
``CREATE OR REPLACE FUNCTION`` will either create a new function or replace
an existing function definition::
cr> CREATE OR REPLACE FUNCTION log10(bigint)
... RETURNS double precision
... LANGUAGE JAVASCRIPT
... AS 'function log10(a) {return Math.log(a)/Math.log(10); }';
CREATE OK, 1 row affected (... sec)
.. hide:
cr> _wait_for_function('log10(1::bigint)')
::
cr> SELECT doc.log10(10) AS col;
+-----+
| col |
+-----+
| 1.0 |
+-----+
SELECT 1 row in set (... sec)
It is possible to use named function arguments in the function signature. For
example, the ``calculate_distance`` function signature has two ``geo_point``
arguments named ``start`` and ``end``::
cr> CREATE OR REPLACE FUNCTION calculate_distance("start" geo_point, "end" geo_point)
... RETURNS real
... LANGUAGE JAVASCRIPT
... AS 'function calculate_distance(start, end) {
... return Math.sqrt(
...        Math.pow(end[0] - start[0], 2) +
... Math.pow(end[1] - start[1], 2));
... }';
CREATE OK, 1 row affected (... sec)
.. NOTE::
Argument names are used for query documentation purposes only. You cannot
reference arguments by name in the function body.
Optionally, a schema-qualified function name can be defined. If you omit the
schema, the current session schema is used::
cr> CREATE OR REPLACE FUNCTION my_schema.log10(bigint)
... RETURNS double precision
... LANGUAGE JAVASCRIPT
... AS 'function log10(a) { return Math.log(a)/Math.log(10); }';
CREATE OK, 1 row affected (... sec)
.. NOTE::
To improve PostgreSQL server compatibility, CrateDB allows the creation of
user-defined functions against the :ref:`postgres-pg_catalog` schema.
However, the creation of user-defined functions against the read-only
:ref:`system-information` and :ref:`information_schema` schemas is
prohibited.
.. _udf-supported-types:
Supported types
===============
Function arguments and return values can be any of the supported :ref:`data
types `. The values passed into a function must strictly
correspond to the specified argument data types.
.. NOTE::
The value returned by the function will be cast to the return type
provided in the definition if required. An exception will be thrown if the
cast is not successful.
.. _udf-overloading:
Overloading
===========
Within a specific schema, you can overload functions by defining functions
with the same name but a different set of arguments::
cr> CREATE FUNCTION my_schema.my_multiply(integer, integer)
... RETURNS integer
... LANGUAGE JAVASCRIPT
... AS 'function my_multiply(a, b) { return a * b; }';
CREATE OK, 1 row affected (... sec)
This would overload the ``my_multiply`` function with different argument
types::
cr> CREATE FUNCTION my_schema.my_multiply(bigint, bigint)
... RETURNS bigint
... LANGUAGE JAVASCRIPT
... AS 'function my_multiply(a, b) { return a * b; }';
CREATE OK, 1 row affected (... sec)
This would overload the ``my_multiply`` function with more arguments::
cr> CREATE FUNCTION my_schema.my_multiply(bigint, bigint, bigint)
... RETURNS bigint
... LANGUAGE JAVASCRIPT
... AS 'function my_multiply(a, b, c) { return a * b * c; }';
CREATE OK, 1 row affected (... sec)
.. CAUTION::
It is considered bad practice to create functions that have the same name
as the CrateDB built-in functions.
.. NOTE::
If you call a function without a schema name, CrateDB will look it up in
the built-in functions first and only then in the user-defined functions
available in the :ref:`search_path `.
**Therefore a built-in function with the same name as a user-defined
function will hide the latter, even if it contains a different set of
arguments.** However, such functions can still be called if the schema name
is explicitly provided.
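For example, the overloaded function created above can always be reached via
its schema-qualified name; a sketch with illustrative output::
    cr> SELECT my_schema.my_multiply(2, 3) AS result;
    +--------+
    | result |
    +--------+
    |      6 |
    +--------+
    SELECT 1 row in set (... sec)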
.. _udf-determinism:
Determinism
===========
.. CAUTION::
User-defined functions need to be deterministic, meaning that they must
always return the same result value when called with the same argument
values, because CrateDB might cache the returned values and reuse the value
if the function is called multiple times with the same arguments.
.. _udf-privileges:
Privileges
==========
.. NOTE::
A user-defined function can be executed by a user only if the user has the
``DQL`` :ref:`privileges ` for the schema under
which the function is defined.
.. _udf-drop-function:
``DROP FUNCTION``
=================
Functions can be dropped like this::
cr> DROP FUNCTION doc.log10(bigint);
DROP OK, 1 row affected (... sec)
Adding ``IF EXISTS`` prevents an error from being raised if the function
doesn't exist::
cr> DROP FUNCTION IF EXISTS doc.log10(integer);
DROP OK, 1 row affected (... sec)
Optionally, argument names can be specified within the drop statement::
cr> DROP FUNCTION IF EXISTS doc.calculate_distance(start_point geo_point, end_point geo_point);
DROP OK, 1 row affected (... sec)
Optionally, you can provide a schema::
cr> DROP FUNCTION my_schema.log10(bigint);
DROP OK, 1 row affected (... sec)
.. _udf-supported-languages:
Supported languages
===================
Currently, CrateDB only supports JavaScript for user-defined functions.
.. _udf-js:
JavaScript
----------
User-defined functions written in JavaScript are compatible with the
`ECMAScript 2019`_ specification.
CrateDB uses the `GraalVM JavaScript`_ engine as a JavaScript (ECMAScript)
language execution runtime. The `GraalVM JavaScript`_ engine is a Java
application that works on the stock Java Virtual Machines (VMs). The
interoperability between Java code (host language) and JavaScript user-defined
functions (guest language) is guaranteed by the `GraalVM Polyglot API`_.
Please note: CrateDB does not use the GraalVM JIT compiler as an optimizing
compiler. However, the `stock host Java VM JIT compilers`_ can JIT-compile,
optimize, and execute the GraalVM JavaScript codebase to a certain extent.
The execution context for guest JavaScript is created with restricted
privileges to allow for the safe execution of less trusted guest language
code. The guest language application context for each user-defined function
is created with default access modifiers, so any access to managed resources
is denied. The only exception is the host language interoperability
configuration which explicitly allows access to Java lists and arrays. Please
refer to `GraalVM Security Guide`_ for more detailed information.
Also, even though user-defined functions are implemented with ECMA-compliant
JavaScript, objects that are normally accessible in a web browser
(e.g. ``window``, ``console``, and so on) are not available.
.. NOTE::
GraalVM treats objects provided to JavaScript user-defined functions as
closely as possible to their respective counterparts, and therefore by default
only a subset of prototype functions are available in user-defined
functions. For CrateDB 4.6 and earlier the object prototype was disabled.
Please refer to the `GraalVM JavaScript Compatibility FAQ`_ to learn more
about the compatibility.
.. _udf-js-supported-types:
JavaScript supported types
..........................
JavaScript functions can handle all CrateDB data types. However, for some
return types the function output must correspond to a certain format.
If a function requires ``geo_point`` as a return type, then the JavaScript
function must return a ``double precision`` array of size 2, a ``WKT`` string,
or a ``GeoJson`` object.
Here is an example of a JavaScript function returning a ``double array``::
cr> CREATE FUNCTION rotate_point(point geo_point, angle real)
... RETURNS geo_point
... LANGUAGE JAVASCRIPT
... AS 'function rotate_point(point, angle) {
... var cos = Math.cos(angle);
... var sin = Math.sin(angle);
... var x = cos * point[0] - sin * point[1];
... var y = sin * point[0] + cos * point[1];
... return [x, y];
... }';
CREATE OK, 1 row affected (... sec)
Below is an example of a JavaScript function returning a ``WKT`` string, which
will be cast to ``geo_point``::
cr> CREATE FUNCTION symmetric_point(point geo_point)
... RETURNS geo_point
... LANGUAGE JAVASCRIPT
... AS 'function symmetric_point(point) {
...     var x = - point[0],
...         y = - point[1];
...     return "POINT (" + x + " " + y + ")";
... }';
CREATE OK, 1 row affected (... sec)
Similarly, if the function specifies the ``geo_shape`` return data type, then
the JavaScript function should return a ``GeoJson`` object or ``WKT`` string::
cr> CREATE FUNCTION line("start" array(double precision), "end" array(double precision))
... RETURNS object
... LANGUAGE JAVASCRIPT
... AS 'function line(start, end) {
... return { "type": "LineString", "coordinates" : [start_point, end_point] };
... }';
CREATE OK, 1 row affected (... sec)
.. NOTE::
If the return value of the JavaScript function is ``undefined``, it is
converted to ``NULL``.
.. _udf-js-numbers:
Working with ``NUMBERS``
........................
The JavaScript engine interprets numbers as ``java.lang.Double``,
``java.lang.Long``, or ``java.lang.Integer``, depending on the computation
performed. In most cases, this is not an issue, since the return type of the
JavaScript function will be cast to the return type specified in the ``CREATE
FUNCTION`` statement, although the cast might result in a loss of precision.
However, when you try to cast ``DOUBLE PRECISION`` to
``TIMESTAMP WITH TIME ZONE``, it will be interpreted as UTC seconds and will
result in a wrong value::
cr> CREATE FUNCTION utc(bigint, bigint, bigint)
... RETURNS TIMESTAMP WITH TIME ZONE
... LANGUAGE JAVASCRIPT
... AS 'function utc(year, month, day) {
... return Date.UTC(year, month, day, 0, 0, 0);
... }';
CREATE OK, 1 row affected (... sec)
.. hide:
cr> _wait_for_function('utc(1::bigint, 1::bigint, 1::bigint)')
::
cr> SELECT date_format(utc(2016,04,6)) as epoque;
+------------------------------+
| epoque |
+------------------------------+
| 48314-07-22T00:00:00.000000Z |
+------------------------------+
SELECT 1 row in set (... sec)
.. hide:
cr> DROP FUNCTION utc(bigint, bigint, bigint);
DROP OK, 1 row affected (... sec)
To avoid this behavior, the numeric value should be divided by 1000 before it
is returned::
cr> CREATE FUNCTION utc(bigint, bigint, bigint)
... RETURNS TIMESTAMP WITH TIME ZONE
... LANGUAGE JAVASCRIPT
... AS 'function utc(year, month, day) {
... return Date.UTC(year, month, day, 0, 0, 0)/1000;
... }';
CREATE OK, 1 row affected (... sec)
.. hide:
cr> _wait_for_function('utc(1::bigint, 1::bigint, 1::bigint)')
::
cr> SELECT date_format(utc(2016,04,6)) as epoque;
+-----------------------------+
| epoque |
+-----------------------------+
| 2016-05-06T00:00:00.000000Z |
+-----------------------------+
SELECT 1 row in set (... sec)
.. hide:
cr> DROP FUNCTION my_subtract_function(integer, integer);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION my_schema.my_multiply(integer, integer);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION my_schema.my_multiply(bigint, bigint, bigint);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION my_schema.my_multiply(bigint, bigint);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION rotate_point(point geo_point, angle real);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION symmetric_point(point geo_point);
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION line(start_point array(double precision), end_point array(double precision));
DROP OK, 1 row affected (... sec)
cr> DROP FUNCTION utc(bigint, bigint, bigint);
DROP OK, 1 row affected (... sec)
.. _ECMAScript 2019: https://262.ecma-international.org/10.0/index.html
.. _GraalVM JavaScript: https://www.graalvm.org/reference-manual/js/
.. _GraalVM JavaScript Compatibility FAQ: https://www.graalvm.org/latest/reference-manual/js/JavaScriptCompatibility/
.. _GraalVM Polyglot API: https://www.graalvm.org/reference-manual/embed-languages/
.. _GraalVM Security Guide: https://www.graalvm.org/security-guide/
.. _stock host Java VM JIT compilers: https://www.graalvm.org/reference-manual/js/RunOnJDK/

.. highlight:: psql
.. _arithmetic:
====================
Arithmetic operators
====================
Arithmetic :ref:`operators ` perform mathematical operations on
numeric values (including timestamps):
======== =========================================================
Operator Description
======== =========================================================
``+`` Add one number to another
``-`` Subtract the second number from the first
``*`` Multiply the first number with the second
``/`` Divide the first number by the second
``%`` Finds the remainder of division of one number by another
``^`` Finds the exponentiation of one number raised to another
======== =========================================================
.. NOTE::
Operators are evaluated from left to right. An operation with a higher
precedence is performed before an operation with a lower precedence.
Operators have the following precedence (from higher to lower):
1. Parentheses
2. Exponentiation
3. Multiplication and Division
4. Addition and Subtraction
Use parentheses if you want to ensure a specific order of evaluation.
Here's an example that uses all of the available arithmetic operators::
cr> select ((2 * 4.0 - 2 ^ 3 + 1) / 2) % 3 AS n;
+-----+
| n |
+-----+
| 0.5 |
+-----+
SELECT 1 row in set (... sec)
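Following the precedence rules above, the expression evaluates step by step
as::
    2 ^ 3        ->  8      (exponentiation first)
    2 * 4.0      ->  8.0    (multiplication)
    8.0 - 8 + 1  ->  1.0    (addition/subtraction, left to right)
    1.0 / 2      ->  0.5    (parentheses force this before %)
    0.5 % 3      ->  0.5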
Arithmetic operators always return the data type of the argument with the
higher precision.
In the case of division, if both arguments are integers, the result will also
be an integer with the fractional part truncated::
cr> select 5 / 2 AS a, 5 / 2.0 AS b;
+---+-----+
| a | b |
+---+-----+
| 2 | 2.5 |
+---+-----+
SELECT 1 row in set (... sec)
.. NOTE::
The same restrictions that apply to :ref:`scalar functions
` also apply to arithmetic operators.

.. highlight:: psql
.. _bit-operators:
=============
Bit operators
=============
Bit :ref:`operators ` perform bitwise operations on numeric
integral values and :ref:`bit ` strings:
======== ========================
Operator Description
======== ========================
``&`` Bitwise AND of operands.
``|`` Bitwise OR of operands.
``#`` Bitwise XOR of operands.
======== ========================
Here's an example that uses all of the available bit operators::
cr> select 1 & 2 | 3 # 4 AS n;
+---+
| n |
+---+
| 7 |
+---+
SELECT 1 row in set (... sec)
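Since the operators share the same precedence, the expression is evaluated
strictly left to right::
    1 & 2  ->  0
    0 | 3  ->  3
    3 # 4  ->  7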
And an example with bit strings::
cr> select B'101' # B'011' AS n;
+--------+
| n |
+--------+
| B'110' |
+--------+
SELECT 1 row in set (... sec)
When applied to numeric operands, bit operators always return the data type
of the argument with the higher precision.
If at least one operand is ``NULL``, bit operators return ``NULL``.
When applied to ``BIT`` strings, operands must have equal length.
.. NOTE::
Bit operators have the same precedence and are evaluated from left to right.
Use parentheses if you want to ensure a specific order of evaluation.

.. highlight:: psql
.. _comparison-operators:
====================
Comparison operators
====================
A comparison :ref:`operator ` tests the relationship between
two values and returns a corresponding value of ``true``, ``false``, or
``NULL``.
.. _comparison-operators-basic:
Basic operators
===============
For simple :ref:`data types `, the following basic operators can be
used:
======== ==========================
Operator Description
======== ==========================
``<`` Less than
-------- --------------------------
``>`` Greater than
-------- --------------------------
``<=`` Less than or equal to
-------- --------------------------
``>=`` Greater than or equal to
-------- --------------------------
``=`` Equal
-------- --------------------------
``<>`` Not equal
-------- --------------------------
``!=`` Not equal (same as ``<>``)
======== ==========================
When comparing strings, a `lexicographical comparison`_ is performed::
cr> select name from locations where name > 'Argabuthon' order by name;
+------------------------------------+
| name |
+------------------------------------+
| Arkintoofle Minor |
| Bartledan |
| Galactic Sector QQ7 Active J Gamma |
| North West Ripple |
| Outer Eastern Rim |
+------------------------------------+
SELECT 5 rows in set (... sec)
When comparing dates, `ISO date formats`_ can be used::
cr> select date, position from locations where date <= '1979-10-12' and
... position < 3 order by position;
+--------------+----------+
| date | position |
+--------------+----------+
| 308534400000 | 1 |
| 308534400000 | 2 |
+--------------+----------+
SELECT 2 rows in set (... sec)
When comparing Geo Shapes, `topological comparison`_ is used.
Topological equality means that the geometries have the same dimension, and their point-sets occupy the same space.
This means that the order of vertices may be different in topologically equal geometries::
cr> SELECT 'POLYGON (( 0 0, 1 0, 1 1, 0 1, 0 0))'::GEO_SHAPE = 'POLYGON (( 1 0, 1 1, 0 1, 0 0, 1 0))'::GEO_SHAPE as res;
+------+
| res |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)
Geometry collections containing only linestrings, points, or polygons are
normalized to MultiLineString, MultiPoint, and MultiPolygon, respectively.
Hence, a geometry collection of points is equal to a MultiPoint with the same
point set::
cr> select 'MULTIPOINT ((10 40), (40 30), (20 20))'::GEO_SHAPE = 'GEOMETRYCOLLECTION (POINT (10 40), POINT(40 30), POINT(20 20))'::GEO_SHAPE as res;
+------+
| res |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)
.. TIP::
Comparison operators are commonly used to filter rows (e.g., in the
:ref:`WHERE ` and :ref:`HAVING `
clauses of a :ref:`SELECT ` statement). However, basic
comparison operators can be used as :ref:`value expressions
` in any context. For example::
cr> SELECT 1 < 10 as my_column;
+-----------+
| my_column |
+-----------+
| TRUE |
+-----------+
SELECT 1 row in set (... sec)
.. _comparison-operators-where:
``WHERE`` clause operators
==========================
Within a :ref:`sql_dql_where_clause`, the following operators can also be used:
================================= ===================================================
Operator Description
================================= ===================================================
``~`` , ``~*`` , ``!~`` , ``!~*`` See :ref:`sql_dql_regexp`
--------------------------------- ---------------------------------------------------
:ref:`sql_dql_like` Matches a part of the given value
--------------------------------- ---------------------------------------------------
:ref:`sql_dql_not` Negates a condition
--------------------------------- ---------------------------------------------------
:ref:`sql_dql_is_null` Matches a null value
--------------------------------- ---------------------------------------------------
:ref:`sql_dql_is_not_null` Matches a non-null value
--------------------------------- ---------------------------------------------------
``ip << range`` True if IP is within the given IP range (using
`CIDR notation`_)
--------------------------------- ---------------------------------------------------
``x BETWEEN y AND z`` Shortcut for ``x >= y AND x <= z``
================================= ===================================================
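A sketch combining two of these operators (``t``, ``ip_addr``, and ``ts`` are
hypothetical names):
.. code-block:: sql
    SELECT *
    FROM t
    WHERE ip_addr << '192.168.0.0/24'                 -- IP within CIDR range
      AND ts BETWEEN '2023-01-01' AND '2023-12-31';   -- x >= y AND x <= z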
.. SEEALSO::
- :ref:`sql_array_comparisons`
- :ref:`sql_subquery_expressions`
.. _CIDR notation: https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_blocks
.. _ISO date formats: https://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser--
.. _lexicographical comparison: https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/TermRangeQuery.html
.. _topological comparison: https://postgis.net/docs/ST_Equals.html

.. highlight:: psql
.. _sql_array_comparisons:
Array comparisons
=================
An array comparison :ref:`operator ` tests the relationship
between a value and an array and returns ``true``, ``false``, or ``NULL``.
.. SEEALSO::
:ref:`sql_subquery_expressions`
.. _sql_in_array_comparison:
``IN (value [, ...])``
----------------------
Syntax:
.. code-block:: sql
expression IN (value [, ...])
The ``IN`` :ref:`operator ` returns ``true`` if the left-hand
expression matches at least one value contained within the right-hand side.
The operator returns ``NULL`` if:
- The left-hand :ref:`expression ` :ref:`evaluates
` to ``NULL``
- There are no matching right-hand values and at least one right-hand value is
``NULL``
Here's an example::
cr> SELECT
... 1 in (1, 2, 3) AS a,
... 4 in (1, 2, 3) AS b,
... 5 in (1, 2, null) as c;
+------+-------+------+
| a | b | c |
+------+-------+------+
| TRUE | FALSE | NULL |
+------+-------+------+
SELECT 1 row in set (... sec)
.. _sql_any_array_comparison:
``ANY/SOME (array expression)``
-------------------------------
Syntax:
.. code-block:: sql
expression comparison ANY | SOME (array_expression)
Here, ``comparison`` can be any :ref:`basic comparison operator
`.
An example::
cr> SELECT
... 1 = ANY ([1,2,3]) AS a,
... 4 = ANY ([1,2,3]) AS b;
+------+-------+
| a | b |
+------+-------+
| TRUE | FALSE |
+------+-------+
SELECT 1 row in set (... sec)
The ``ANY`` :ref:`operator ` returns ``true`` if the defined
comparison is ``true`` for any of the values in the right-hand array
:ref:`expression `.
If the right side is a multi-dimensional array, it is automatically unnested
to the required dimension.
An example::
cr> SELECT
... 4 = ANY ([[1, 2], [3, 4]]) as a,
... 5 = ANY ([[1, 2], [3, 4]]) as b,
... [1, 2] = ANY ([[1,2], [3, 4]]) as c,
... [1, 3] = ANY ([[1,2], [3, 4]]) as d;
+------+-------+------+-------+
| a | b | c | d |
+------+-------+------+-------+
| TRUE | FALSE | TRUE | FALSE |
+------+-------+------+-------+
SELECT 1 row in set (... sec)
The operator returns ``false`` if the comparison returns ``false`` for all
right-hand values or if there are no right-hand values.
The operator returns ``NULL`` if:
- The left-hand expression :ref:`evaluates ` to ``NULL``
- There are no matching right-hand values and at least one right-hand value is
``NULL``
.. TIP::
When doing ``NOT = ANY()``, query performance may be
degraded because special handling is required to implement the `3-valued
logic`_. To achieve better performance, consider using the :ref:`ignore3vl
function `.
.. _all_array_comparison:
``ALL (array_expression)``
--------------------------
Syntax:
.. code-block:: sql
value comparison ALL (array_expression)
Here, ``comparison`` can be any :ref:`basic comparison operator
`. Objects and arrays of objects are not supported
for either :ref:`operand `.
Here's an example::
cr> SELECT 1 <> ALL(ARRAY[2, 3, 4]) AS x;
+------+
| x |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)
The ``ALL`` :ref:`operator ` returns ``true`` if the defined
comparison is ``true`` for all values in the right-hand :ref:`array expression
`.
The operator returns ``false`` if the comparison returns ``false`` for all
right-hand values.
The operator returns ``NULL`` if:
- The left-hand expression :ref:`evaluates ` to ``NULL``
- No comparison returns ``false`` and at least one right-hand value is ``NULL``
.. _array_overlap_operator:
``array expression && array expression``
----------------------------------------
Syntax:
.. code-block:: sql
array_expression && array_expression
The ``&&`` :ref:`operator ` returns ``true`` if the two arrays
have at least one element in common.
If one of the arguments is ``NULL``, the result is ``NULL``.
This operator is an alias to the :ref:`scalar-array_overlap` function.
.. _3-valued logic: https://en.wikipedia.org/wiki/Null_(SQL)#Comparisons_with_NULL_and_the_three-valued_logic_(3VL)

.. highlight:: psql
.. _sql_subquery_expressions:
Subquery expressions
====================
Some :ref:`operators ` can be used with an :ref:`uncorrelated
subquery ` to form a *subquery expression* that
returns a boolean value (i.e., ``true`` or ``false``) or ``NULL``.
.. SEEALSO::
:ref:`SQL: Value expressions `
.. _sql_in_subquery_expression:
``IN (subquery)``
-----------------
Syntax:
.. code-block:: sql
expression IN (subquery)
The ``subquery`` must produce result rows with a single column only.
Here's an example::
cr> select name, surname, sex from employees
... where dept_id in (select id from departments where name = 'Marketing')
... order by name, surname;
+--------+----------+-----+
| name | surname | sex |
+--------+----------+-----+
| David | Bowe | M |
| David | Limb | M |
| Sarrah | Mcmillan | F |
| Smith | Clark | M |
+--------+----------+-----+
SELECT 4 rows in set (... sec)
The ``IN`` :ref:`operator ` returns ``true`` if any
:ref:`subquery ` row equals the left-hand :ref:`operand
`. Otherwise, it returns ``false`` (including the case where the
subquery returns no rows).
The operator returns ``NULL`` if:
- The left-hand expression :ref:`evaluates ` to ``NULL``
- There are no matching right-hand values and at least one right-hand value is
``NULL``
.. NOTE::
``IN (subquery)`` is an alias for ``= ANY (subquery)``
.. _sql_any_subquery_expression:
``ANY/SOME (subquery)``
-----------------------
Syntax:
.. code-block:: sql
expression comparison ANY | SOME (subquery)
Here, ``comparison`` can be any :ref:`basic comparison operator
`. The ``subquery`` must produce result rows with a
single column only.
Here's an example::
cr> select name, population from countries
... where population > any (select * from unnest([8000000, 22000000, NULL]))
... order by population, name;
+--------------+------------+
| name | population |
+--------------+------------+
| Austria | 8747000 |
| South Africa | 55910000 |
| France | 66900000 |
| Turkey | 79510000 |
| Germany | 82670000 |
+--------------+------------+
SELECT 5 rows in set (... sec)
The ``ANY`` :ref:`operator ` returns ``true`` if the defined
comparison is ``true`` for any of the result rows of the right-hand
:ref:`subquery `.
The operator returns ``false`` if the comparison returns ``false`` for all
result rows of the subquery or if the subquery returns no rows.
The operator returns ``NULL`` if:
- The left-hand expression :ref:`evaluates ` to ``NULL``
- There are no matching right-hand values and at least one right-hand value is
``NULL``
.. NOTE::
The following is not supported:
- ``IS NULL`` or ``IS NOT NULL`` as ``comparison``
- Matching as many columns as there are expressions on the left-hand row
e.g. ``(x,y) = ANY (select x, y from t)``
``ALL (subquery)``
------------------
Syntax:
.. code-block:: sql
value comparison ALL (subquery)
Here, ``comparison`` can be any :ref:`basic comparison operator
`. The ``subquery`` must produce result rows with a
single column only.
Here's an example::
cr> select 100 <> ALL (select height from sys.summits) AS x;
+------+
| x |
+------+
| TRUE |
+------+
SELECT 1 row in set (... sec)
The ``ALL`` :ref:`operator ` returns ``true`` if the defined
comparison is ``true`` for all of the result rows of the right-hand
:ref:`subquery `.
The operator returns ``false`` if the comparison returns ``false`` for any
result rows of the subquery.
The operator returns ``NULL`` if:
- The left-hand expression :ref:`evaluates ` to ``NULL``
- No comparison returns ``false`` and at least one right-hand value is ``NULL``
.. highlight:: sh
.. _conf-cluster-settings:
=====================
Cluster-wide settings
=====================
All current applied cluster settings can be read by querying the
:ref:`sys.cluster.settings ` column. Most
cluster settings can be :ref:`changed at runtime
`. This is documented at each setting.
.. _applying-cluster-settings:
Non-runtime cluster-wide settings
---------------------------------
Cluster wide settings which cannot be changed at runtime need to be specified
in the configuration of each node in the cluster.
.. CAUTION::
Cluster settings specified via node configurations are required to be
exactly the same on every node in the cluster for proper operation of the
cluster.
.. _conf_collecting_stats:
Collecting stats
----------------
.. _stats.enabled:
**stats.enabled**
| *Default:* ``true``
| *Runtime:* ``yes``
A boolean indicating whether or not to collect statistical information about
the cluster.
.. CAUTION::
The collection of statistical information incurs a slight performance
penalty, as details about every job and operation across the cluster will
cause data to be inserted into the corresponding system tables.
.. _stats.jobs_log_size:
**stats.jobs_log_size**
| *Default:* ``10000``
| *Runtime:* ``yes``
The maximum number of job records to be kept in the :ref:`sys.jobs_log
` table on each node.
A job record corresponds to a single SQL statement to be executed on the
cluster. These records are used for performance analytics. A larger job log
produces more comprehensive stats, but uses more RAM.
Older job records are deleted as newer records are added, once the limit is
reached.
Setting this value to ``0`` disables collecting job information.
.. _stats.jobs_log_expiration:
**stats.jobs_log_expiration**
| *Default:* ``0s`` (disabled)
| *Runtime:* ``yes``
The job record expiry time in seconds.
Job records in the :ref:`sys.jobs_log ` table are periodically
cleared if they are older than the expiry time. This setting overrides
:ref:`stats.jobs_log_size `.
If the value is set to ``0``, time-based log entry eviction is disabled.
.. NOTE::
If both the :ref:`stats.jobs_log_size `
and
:ref:`stats.jobs_log_expiration `
settings are disabled, jobs will not be recorded.
.. _stats.jobs_log_filter:
**stats.jobs_log_filter**
| *Default:* ``true`` (Include everything)
| *Runtime:* ``yes``
An :ref:`expression ` to determine if a job should be
recorded into ``sys.jobs_log``. The expression must :ref:`evaluate
` to a boolean. If it evaluates to ``true`` the statement
will show up in ``sys.jobs_log`` until it is evicted due to one of the other
rules (expiration or size limit reached).
The expression may reference all columns contained in ``sys.jobs_log``. A
common use case is to include only jobs that took a certain amount of time to
execute::
cr> SET GLOBAL "stats.jobs_log_filter" = $$ended - started > '5 minutes'::interval$$;
SET OK, 1 row affected (... sec)
.. _stats.jobs_log_persistent_filter:
**stats.jobs_log_persistent_filter**
| *Default:* ``false`` (Include nothing)
| *Runtime:* ``yes``
An expression to determine if a job should also be recorded to the regular
``CrateDB`` log. Entries that match this filter will be logged under the
``StatementLog`` logger with the ``INFO`` level.
This is similar to ``stats.jobs_log_filter`` except that these entries are
persisted to the log file. This should be used with caution and shouldn't be
set to an expression that matches many queries as the logging operation will
block on IO and can therefore affect performance.
A common use case is to use this for slow query logging.
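For example, a minimal slow-query-log sketch that records statements running
longer than one minute (the threshold is illustrative):
.. code-block:: sql
SET GLOBAL "stats.jobs_log_persistent_filter" = $$ended - started > '1 minute'::interval$$;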
.. _stats.operations_log_size:
**stats.operations_log_size**
| *Default:* ``10000``
| *Runtime:* ``yes``
The maximum number of operations records to be kept in the
:ref:`sys.operations_log ` table on each node.
A job consists of one or more individual operations. Operations records are
used for performance analytics. A larger operations log produces more
comprehensive stats, but uses more RAM.
Older operations records are deleted as newer records are added, once the
limit is reached.
Setting this value to ``0`` disables collecting operations information.
.. _stats.operations_log_expiration:
**stats.operations_log_expiration**
| *Default:* ``0s`` (disabled)
| *Runtime:* ``yes``
Entries of :ref:`sys.operations_log ` are periodically cleared
if they are older than the specified expiry time. This setting overrides
:ref:`stats.operations_log_size `. If
the value is set to ``0``, time-based log entry eviction is disabled.
.. NOTE::
If both settings :ref:`stats.operations_log_size
` and :ref:`stats.operations_log_expiration
` are disabled, no job information will be
collected.
.. _stats.service.interval:
**stats.service.interval**
| *Default:* ``24h``
| *Runtime:* ``yes``
Defines the interval at which the table statistics used to produce optimal
query execution plans are refreshed.
This field expects a time value either as a ``bigint`` or
``double precision`` or alternatively as a string literal with a time suffix
(``ms``, ``s``, ``m``, ``h``, ``d``, ``w``).
If the value provided is ``0`` then the refresh is disabled.
.. CAUTION::
Using a very small value can cause a high load on the cluster.
.. _stats.service.max_bytes_per_sec:
**stats.service.max_bytes_per_sec**
| *Default:* ``40mb``
| *Runtime:* ``yes``
Specifies the maximum number of bytes per second that can be read on data
nodes to collect statistics. If this is set to a positive number, the
underlying I/O operations of the :ref:`ANALYZE ` statement are
throttled.
If the value provided is ``0`` then the throttling is disabled.
Shard limits
------------
.. _cluster.max_shards_per_node:
**cluster.max_shards_per_node**
| *Default:* 1000
| *Runtime:* ``yes``
The maximum number of open primary and replica shards per node. This setting
is checked upon shard creation; it does not impose a per-node limit.
To limit the number of shards for each node, use
:ref:`cluster.routing.allocation.total_shards_per_node
` setting.
The actual limit being checked is ``max_shards_per_node * number of data nodes``.
Any operations that would result in the creation of additional shard copies
that would exceed this limit are rejected.
For example, if the cluster's computed limit is 1000 and it currently holds
999 open shards, trying to create a new table will fail.
Similarly, if a write operation would lead to the creation of a new
partition, the statement will fail.
Each shard on a node requires some memory and increases the size of the
cluster state. Having too many shards per node will impact the cluster's
stability, so raising the limit above 1000 is discouraged.
.. NOTE::
The maximum number of shards per node setting is also used for the
:ref:`sys-node_checks_max_shards_per_node` check.
.. NOTE::
If a table is created with :ref:`sql-create-table-number-of-replicas`
provided as a range or the default ``0-1`` value, the limit check accounts
only for primary shards and not for possibly expanded replicas; thus, the
actual number of all shards can exceed the limit.
.. _conf_usage_data_collector:
Usage data collector
--------------------
The settings of the Usage Data Collector are read-only and cannot be set at
runtime. Please refer to :ref:`usage_data_collector` for further information
about its usage.
.. _udc.enabled:
**udc.enabled**
| *Default:* ``true``
| *Runtime:* ``no``
``true``: Enables the Usage Data Collector.
``false``: Disables the Usage Data Collector.
.. _udc.initial_delay:
**udc.initial_delay**
| *Default:* ``10m``
| *Runtime:* ``no``
The delay for the first ping after start-up.
This field expects a time value either as a ``bigint`` or
``double precision`` or alternatively as a string literal with a time suffix
(``ms``, ``s``, ``m``, ``h``, ``d``, ``w``).
.. _udc.interval:
**udc.interval**
| *Default:* ``24h``
| *Runtime:* ``no``
The interval at which a UDC ping is sent.
This field expects a time value either as a ``bigint`` or
``double precision`` or alternatively as a string literal with a time suffix
(``ms``, ``s``, ``m``, ``h``, ``d``, ``w``).
.. _udc.url:
**udc.url**
| *Default:* ``https://udc.crate.io/``
| *Runtime:* ``no``
The URL the ping is sent to.
.. _conf_graceful_stop:
Graceful stop
-------------
By default, when the CrateDB process stops it simply shuts down, possibly
making some shards unavailable, which leads to a *red* cluster state and
causes queries that require the now-unavailable shards to fail. In order to
*safely* shut down a CrateDB node, the graceful stop procedure can be used.
The following cluster settings can be used to change the shutdown behaviour
of the nodes of a cluster:
.. _cluster.graceful_stop.min_availability:
**cluster.graceful_stop.min_availability**
| *Default:* ``primaries``
| *Runtime:* ``yes``
| *Allowed values:* ``none | primaries | full``
``none``: No minimum data availability is required. The node may shut down
even if records are missing after shutdown.
``primaries``: At least all primary shards need to be available after the node
has shut down. Replicas may be missing.
``full``: All records and all replicas need to be available after the node
has shut down. Data availability is full.
.. NOTE::
This option is ignored if there is only 1 node in a cluster!
.. _cluster.graceful_stop.timeout:
**cluster.graceful_stop.timeout**
| *Default:* ``2h``
| *Runtime:* ``yes``
Defines the maximum waiting time for the :ref:`reallocation
` process to finish. The ``force`` setting
defines the behaviour when the shutdown process runs into this timeout.
The timeout expects a time value either as a ``bigint`` or
``double precision`` or alternatively as a string literal with a time suffix
(``ms``, ``s``, ``m``, ``h``, ``d``, ``w``).
.. _cluster.graceful_stop.force:
**cluster.graceful_stop.force**
| *Default:* ``false``
| *Runtime:* ``yes``
Defines whether ``graceful stop`` should force stopping of the node if it
runs into the timeout which is specified with the
`cluster.graceful_stop.timeout`_ setting.
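As a sketch, all three graceful stop settings can be adjusted at runtime; the
values below are illustrative only:
.. code-block:: sql
SET GLOBAL TRANSIENT
    "cluster.graceful_stop.min_availability" = 'full',
    "cluster.graceful_stop.timeout" = '1h',
    "cluster.graceful_stop.force" = true;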
.. _conf_bulk_operations:
Bulk operations
---------------
SQL DML statements involving a huge number of rows, like :ref:`sql-copy-from`,
:ref:`sql-insert` or :ref:`ref-update`, can take an enormous amount of time
and resources. The following settings change the behaviour of those queries.
.. _bulk.request_timeout:
**bulk.request_timeout**
| *Default:* ``1m``
| *Runtime:* ``yes``
Defines the timeout of internal shard-based requests involved in the
execution of SQL DML statements over a huge number of rows.
.. _conf_discovery:
Discovery
---------
Data sharding and work splitting are at the core of CrateDB; this is how it
manages to execute very fast queries over incredibly large datasets. In order
for multiple CrateDB nodes to work together, a cluster needs to be formed. The
process of finding other nodes with which to form a cluster is called
discovery. Discovery runs when a CrateDB node starts, as well as whenever a
node cannot reach the master node, and continues until a master node is found
or a new master node is elected.
.. _discovery.seed_hosts:
**discovery.seed_hosts**
| *Default:* ``127.0.0.1``
| *Runtime:* ``no``
In order to form a cluster with CrateDB instances running on other nodes, a
list of seed master-eligible nodes needs to be provided. This setting should
normally contain the addresses of all the master-eligible nodes in the
cluster. In order to seed the discovery process, the nodes listed here must
be live and contactable. This setting contains either an array of hosts or a
comma-delimited string.
By default, a node will bind to the available loopback addresses and scan
local ports between ``4300`` and ``4400`` to try to connect to other nodes
running on the same server. This default behaviour provides local auto
clustering without any configuration.
Each value should be in the form ``host:port`` or ``host`` (where the port
defaults to the setting ``transport.tcp.port``).
.. NOTE::
IPv6 hosts must be bracketed.
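A minimal sketch of this setting in ``crate.yml``; the addresses are
placeholders:
.. code-block:: yaml
discovery.seed_hosts:
  - 10.0.1.101:4300
  - 10.0.1.102
  - "[2001:db8::1]:4300"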
.. _cluster.initial_master_nodes:
**cluster.initial_master_nodes**
| *Default:* ``not set``
| *Runtime:* ``no``
Contains a list of node names, fully-qualified hostnames or IP addresses of
the master-eligible nodes which will vote in the very first election of a
cluster that's bootstrapping for the first time. By default this is not set,
meaning the node expects to join an already formed cluster.
In development mode, with no discovery settings configured, this step is
performed by the nodes themselves, but this auto-bootstrapping is designed
to aid development and is not safe for production. In production you must
explicitly list the names or IP addresses of the master-eligible nodes whose
votes should be counted in the very first election.
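For example, a sketch for bootstrapping a three-node cluster in ``crate.yml``
(the node names are placeholders and must match each node's ``node.name``):
.. code-block:: yaml
cluster.initial_master_nodes:
  - node-1
  - node-2
  - node-3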
.. _discovery.type:
**discovery.type**
| *Default:* ``zen``
| *Runtime:* ``no``
| *Allowed values:* ``zen | single-node``
Specifies whether CrateDB should form a multiple-node cluster. By default,
CrateDB discovers other nodes when forming a cluster and allows other nodes to
join the cluster later. If ``discovery.type`` is set to ``single-node``,
CrateDB forms a single-node cluster and the node won't join any other
clusters. This can be useful for testing. It is not recommended for
production setups. The ``single-node`` mode also skips `bootstrap checks`_.
.. CAUTION::
If a node is started with neither :ref:`initial_master_nodes
` set nor a :ref:`discovery_type `
of ``single-node`` (i.e., the default configuration), it will never join
a cluster even if the configuration is subsequently changed.
It is possible to force the node to forget its current cluster state by
using the :ref:`cli-crate-node` CLI tool. However, be aware that this may
result in data loss.
.. _conf_host_discovery:
Unicast host discovery
......................
As described above, CrateDB has built-in support for statically specifying a
list of addresses that will act as the seed nodes in the discovery process
using the `discovery.seed_hosts`_ setting.
CrateDB also supports several other mechanisms for discovering seed nodes.
Currently, there are two other discovery types: via DNS and via the EC2
API.
When a node starts up with one of these discovery types enabled, it performs a
lookup using the settings for the specified mechanism listed below. The hosts
and ports retrieved from the mechanism will be used to generate a list of
unicast hosts for node discovery.
The same lookup is also performed by all nodes in a cluster whenever the master
is re-elected (see `Cluster Meta Data`).
.. _discovery.seed_providers:
**discovery.seed_providers**
| *Default:* ``not set``
| *Runtime:* ``no``
| *Allowed values:* ``srv``, ``ec2``
See also: `Discovery`_.
.. _conf_dns_discovery:
Discovery via DNS
`````````````````
CrateDB has built-in support for discovery via DNS. To enable DNS discovery,
the ``discovery.seed_providers`` setting needs to be set to ``srv``.
The order of the unicast hosts is defined by the priority, weight and name of
each host defined in the SRV record. For example::
_crate._srv.example.com. 3600 IN SRV 2 20 4300 crate1.example.com.
_crate._srv.example.com. 3600 IN SRV 1 10 4300 crate2.example.com.
_crate._srv.example.com. 3600 IN SRV 2 10 4300 crate3.example.com.
would result in a list of discovery nodes ordered like::
crate2.example.com:4300, crate3.example.com:4300, crate1.example.com:4300
.. _discovery.srv.query:
**discovery.srv.query**
| *Runtime:* ``no``
The DNS query that is used to look up SRV records, usually in the format
``_service._protocol.fqdn``. If not set, the service discovery will not be
able to look up any SRV records.
.. _discovery.srv.resolver:
**discovery.srv.resolver**
| *Runtime:* ``no``
The hostname or IP of the DNS server used to resolve DNS records. If this is
not set, or the specified hostname/IP is not resolvable, the default (system)
resolver is used.
Optionally a custom port can be specified using the format ``hostname:port``.
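Putting both settings together, a sketch of a ``crate.yml`` fragment for DNS
discovery (the query and resolver values are placeholders):
.. code-block:: yaml
discovery.seed_providers: srv
discovery.srv.query: _crate._srv.example.com.
discovery.srv.resolver: 10.0.0.2:53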
.. _conf_ec2_discovery:
Discovery on Amazon EC2
```````````````````````
CrateDB has built-in support for discovery via the EC2 API. To enable EC2
discovery, the ``discovery.seed_providers`` setting needs to be set to
``ec2``.
.. _discovery.ec2.access_key:
**discovery.ec2.access_key**
| *Runtime:* ``no``
The Amazon EC2 access key ID used to authenticate the API calls.
.. _discovery.ec2.secret_key:
**discovery.ec2.secret_key**
| *Runtime:* ``no``
The Amazon EC2 secret key used to authenticate the API calls.
The following settings control the discovery:
.. _discovery.ec2.groups:
**discovery.ec2.groups**
| *Runtime:* ``no``
A list of security groups; either by ID or name. Only instances with the
given group will be used for unicast host discovery.
.. _discovery.ec2.any_group:
**discovery.ec2.any_group**
| *Default:* ``true``
| *Runtime:* ``no``
Defines whether all (``false``) or just any (``true``) security group must
be present for the instance to be used for discovery.
.. _discovery.ec2.host_type:
**discovery.ec2.host_type**
| *Default:* ``private_ip``
| *Runtime:* ``no``
| *Allowed values:* ``private_ip``, ``public_ip``, ``private_dns``, ``public_dns``
Defines via which host type to communicate with other instances.
.. _discovery.ec2.availability_zones:
**discovery.ec2.availability_zones**
| *Runtime:* ``no``
A list of availability zones. Only instances within the given availability
zone will be used for unicast host discovery.
.. _discovery.ec2.tag.name:
**discovery.ec2.tag.**
| *Runtime:* ``no``
EC2 instances for discovery can also be filtered by tags using the
``discovery.ec2.tag.`` prefix plus the tag name.
For example, to filter instances that have the ``environment`` tag with the
value ``dev``, the setting would look like ``discovery.ec2.tag.environment: dev``.
.. _discovery.ec2.endpoint:
**discovery.ec2.endpoint**
| *Runtime:* ``no``
If you have your own compatible implementation of the EC2 API service you can
set the endpoint that should be used.
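As an illustrative sketch, an EC2 discovery configuration in ``crate.yml``
might look like this (the group, zone, and tag values are placeholders):
.. code-block:: yaml
discovery.seed_providers: ec2
discovery.ec2.groups: sg-crate-cluster
discovery.ec2.availability_zones: eu-west-1a
discovery.ec2.tag.environment: dev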
.. _conf_routing:
Routing allocation
------------------
.. _cluster.routing.allocation.enable:
**cluster.routing.allocation.enable**
| *Default:* ``all``
| *Runtime:* ``yes``
| *Allowed values:* ``all | none | primaries | new_primaries``
``all`` allows all :ref:`shard allocations `, the
cluster can allocate all kinds of shards.
``none`` allows no shard allocations at all. No shard will be moved or
created.
``primaries`` allows only primary shards to be moved or created. This
includes existing primary shards.
``new_primaries`` allows allocations for new primary shards only. This means
that, for example, a newly added node will not allocate any replicas. However,
it is still possible to allocate new primary shards for new indices. Whenever
you want to perform a zero-downtime upgrade of your cluster, you need to set
this value before gracefully stopping the first node and reset it to ``all``
after starting the last updated node.
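As a sketch, a zero-downtime upgrade could toggle the setting at runtime as
follows:
.. code-block:: sql
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';
-- ... gracefully stop, upgrade, and restart each node ...
SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'all';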
.. NOTE::
This allocation setting has no effect on the :ref:`recovery
` of primary shards! Even when
``cluster.routing.allocation.enable`` is set to ``none``, nodes will recover
their unassigned local primary shards immediately after restart.
.. _cluster.routing.rebalance.enable:
**cluster.routing.rebalance.enable**
| *Default:* ``all``
| *Runtime:* ``yes``
| *Allowed values:* ``all | none | primaries | replicas``
Enables or disables rebalancing for different types of shards:
- ``all`` allows shard rebalancing for all types of shards.
- ``none`` disables shard rebalancing for any types.
- ``primaries`` allows shard rebalancing only for primary shards.
- ``replicas`` allows shard rebalancing only for replica shards.
.. _cluster.routing.allocation.allow_rebalance:
**cluster.routing.allocation.allow_rebalance**
| *Default:* ``indices_all_active``
| *Runtime:* ``yes``
| *Allowed values:* ``always | indices_primary_active | indices_all_active``
Defines when rebalancing will happen, based on the total state of all the
index shards in the cluster.
Defaults to ``indices_all_active`` to reduce chatter during initial
:ref:`recovery `.
.. _cluster.routing.allocation.cluster_concurrent_rebalance:
**cluster.routing.allocation.cluster_concurrent_rebalance**
| *Default:* ``2``
| *Runtime:* ``yes``
Defines how many concurrent rebalancing tasks are allowed across all nodes.
.. _cluster.routing.allocation.node_initial_primaries_recoveries:
**cluster.routing.allocation.node_initial_primaries_recoveries**
| *Default:* ``4``
| *Runtime:* ``yes``
Defines how many concurrent primary shard recoveries are allowed on a node.
Since primary recoveries use data that is already on disk (as opposed to
inter-node recoveries), recovery should be fast and so this
setting can be higher than :ref:`node_concurrent_recoveries
`.
.. _cluster.routing.allocation.node_concurrent_recoveries:
**cluster.routing.allocation.node_concurrent_recoveries**
| *Default:* ``2``
| *Runtime:* ``yes``
Defines how many concurrent recoveries are allowed on a node.
.. _conf-routing-allocation-balance:
Shard balancing
...............
You can configure how CrateDB attempts to balance shards across a cluster by
specifying one or more property *weights*. CrateDB will consider a cluster to
be balanced when no further allowed action can bring the weighted properties of
each node closer together.
.. NOTE::
Balancing may be restricted by other settings (e.g., :ref:`attribute-based
` and :ref:`disk-based
` shard allocation).
.. _cluster.routing.allocation.balance.shard:
**cluster.routing.allocation.balance.shard**
| *Default:* ``0.45f``
| *Runtime:* ``yes``
Defines the weight factor for shards :ref:`allocated
` on a node (float). Raising this raises the tendency
to equalize the number of shards across all nodes in the cluster.
.. NOTE::
:ref:`cluster.routing.allocation.balance.shard` and
:ref:`cluster.routing.allocation.balance.index` cannot be both set to
``0.0f``.
.. _cluster.routing.allocation.balance.index:
**cluster.routing.allocation.balance.index**
| *Default:* ``0.55f``
| *Runtime:* ``yes``
Defines a factor for the number of shards per index :ref:`allocated
` on a specific node (float). Increasing this value
raises the tendency to equalize the number of shards per index across all
nodes in the cluster.
.. NOTE::
:ref:`cluster.routing.allocation.balance.shard` and
:ref:`cluster.routing.allocation.balance.index` cannot be both set to
``0.0f``.
.. _cluster.routing.allocation.balance.threshold:
**cluster.routing.allocation.balance.threshold**
| *Default:* ``1.0f``
| *Runtime:* ``yes``
Minimal optimization value of operations that should be performed
(non-negative float). Increasing this value will cause the cluster to be less
aggressive about optimising the shard balance.
.. _conf-routing-allocation-attributes:
Attribute-based shard allocation
................................
You can control how shards are allocated to specific nodes by setting
:ref:`custom attributes ` on each node (e.g., server rack
ID or node availability zone). After doing this, you can define
:ref:`cluster-wide attribute awareness ` and
then configure :ref:`cluster-wide attribute filtering
`.
.. SEEALSO::
For an in-depth example of using custom node attributes, check out the
:ref:`multi-zone setup how-to guide `.
.. _conf-routing-allocation-awareness:
Cluster-wide attribute awareness
`````````````````````````````````
To make use of :ref:`custom attributes ` for
:ref:`attribute-based ` :ref:`shard
allocation `, you must configure *cluster-wide
attribute awareness*.
.. _cluster.routing.allocation.awareness.attributes:
**cluster.routing.allocation.awareness.attributes**
| *Runtime:* ``no``
You may define :ref:`custom node attributes ` which can
then be used to do awareness based on the :ref:`allocation
` of a shard and its replicas.
For example, let's say we want to use an attribute named ``rack_id``. We
start two nodes with ``node.attr.rack_id`` set to ``rack_one``. Then we
create a single table with five shards and one replica. The table will be
fully deployed on the current nodes (five primary shards plus one replica
each, making a total of ten shards).
Now, if we start two more nodes with ``node.attr.rack_id`` set to
``rack_two``, CrateDB will relocate shards to even out the number of shards
across the nodes. However, a shard and its replica will not be allocated to
nodes sharing the same ``rack_id`` value.
The ``awareness.attributes`` setting supports using several values.
.. _cluster.routing.allocation.awareness.force.\*.values:
**cluster.routing.allocation.awareness.force.\*.values**
| *Runtime:* ``no``
Attributes on which :ref:`shard allocation ` will be
forced. Here, ``*`` is a placeholder for the awareness attribute, which can
be configured using the :ref:`cluster.routing.allocation.awareness.attributes
` setting.
For example, let's say we configured forced shard allocation for an awareness
attribute named ``zone`` with ``values`` set to ``zone1, zone2``. Start two
nodes with ``node.attr.zone`` set to ``zone1``. Then, create a table with
five shards and one replica. The table will be created, but only five shards
will be allocated (with no replicas). The replicas will only be allocated
when we start one or more nodes with ``node.attr.zone`` set to
``zone2``.
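A sketch of the corresponding ``crate.yml`` fragments for this ``zone``
example (the attribute names and values are illustrative):
.. code-block:: yaml
# On every node:
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
# Per node, e.g. on the first two nodes:
node.attr.zone: zone1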
.. _conf-routing-allocation-filtering:
Cluster-wide attribute filtering
````````````````````````````````
To control how CrateDB uses :ref:`custom attributes ` for
:ref:`attribute-based ` :ref:`shard
allocation `, you must configure *cluster-wide
attribute filtering*.
.. NOTE::
CrateDB will retroactively enforce filter definitions. If a new filter
would prevent newly created matching shards from being allocated to a node,
CrateDB would also move any *existing* matching shards away from that node.
.. _cluster.routing.allocation.include.*:
**cluster.routing.allocation.include.***
| *Runtime:* ``yes``
Only :ref:`allocate shards ` on nodes where at least
**one** of the specified values matches the attribute.
For example::
cluster.routing.allocation.include.zone: "zone1,zone2"
This setting can be overridden for individual tables by the related
:ref:`table setting `.
.. _cluster.routing.allocation.exclude.*:
**cluster.routing.allocation.exclude.***
| *Runtime:* ``yes``
Only :ref:`allocate shards ` on nodes where **none**
of the specified values matches the attribute.
For example::
cluster.routing.allocation.exclude.zone: "zone1"
This setting can be overridden for individual tables by the related
:ref:`table setting `.
Therefore, if a node is excluded from shard allocation by this cluster-level
setting, it can still allocate shards if the table setting allows it.
.. _cluster.routing.allocation.require.*:
**cluster.routing.allocation.require.***
| *Runtime:* ``yes``
Used to specify a number of rules, **all** of which a node must match in
order to :ref:`allocate a shard ` on it.
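For example, using a hypothetical ``storage`` attribute::
cluster.routing.allocation.require.storage: "ssd"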
This setting can be overridden for individual tables by the related
:ref:`table setting `.
.. _conf-routing-allocation-disk:
Disk-based shard allocation
...........................
.. _cluster.routing.allocation.disk.threshold_enabled:
**cluster.routing.allocation.disk.threshold_enabled**
| *Default:* ``true``
| *Runtime:* ``yes``
Prevents :ref:`shard allocation ` on nodes depending
on their disk usage.
.. _cluster.routing.allocation.disk.watermark.low:
**cluster.routing.allocation.disk.watermark.low**
| *Default:* ``85%``
| *Runtime:* ``yes``
Defines the lower disk threshold limit for :ref:`shard allocations
`. New shards will not be allocated on nodes with
disk usage greater than this value. It can also be set to an absolute bytes
value (e.g., ``500mb``) to prevent the cluster from allocating new shards
on nodes with less free disk space than this value.
.. _cluster.routing.allocation.disk.watermark.high:
**cluster.routing.allocation.disk.watermark.high**
| *Default:* ``90%``
| *Runtime:* ``yes``
Defines the higher disk threshold limit for :ref:`shard allocations
`. The cluster will attempt to relocate existing
shards to another node if the disk usage on a node rises above this value. It
can also be set to an absolute bytes value (e.g., ``500mb``) to relocate
shards away from nodes with less free disk space than this value.
.. _cluster.routing.allocation.disk.watermark.flood_stage:
**cluster.routing.allocation.disk.watermark.flood_stage**
| *Default:* ``95%``
| *Runtime:* ``yes``
Defines the threshold at which CrateDB enforces a read-only block on every
index that has at least one :ref:`shard allocated `
on a node with at least one disk exceeding the flood stage.
.. NOTE::
The :ref:`sql-create-table-blocks-read-only-allow-delete` setting is
automatically reset to ``FALSE`` for the affected tables once disk space is
freed and usage falls below the threshold.
The ``cluster.routing.allocation.disk.watermark`` settings may be defined as
percentages or byte values. However, it is not possible to mix the value
types.
By default, the cluster will retrieve information about the disk usage of the
nodes every 30 seconds. This can also be changed by setting the
`cluster.info.update.interval`_ setting.
.. NOTE::
The watermark settings are also used for the
:ref:`sys-node_checks_watermark_low` and :ref:`sys-node_checks_watermark_high` node
check.
Setting ``cluster.routing.allocation.disk.threshold_enabled`` to ``false``
will disable the allocation decider, but the node checks will still be active
and warn users about running low on disk space.
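As a sketch, the watermarks can be adjusted at runtime; the percentages below
are illustrative:
.. code-block:: sql
SET GLOBAL TRANSIENT
    "cluster.routing.allocation.disk.watermark.low" = '80%',
    "cluster.routing.allocation.disk.watermark.high" = '85%';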
.. _cluster.routing.allocation.total_shards_per_node:
**cluster.routing.allocation.total_shards_per_node**
| *Default*: ``-1``
| *Runtime*: ``yes``
Limits the number of primary and replica shards that can be :ref:`allocated
` per node. A value of ``-1`` means unlimited.
Setting this to ``1000``, for example, will prevent CrateDB from assigning
more than 1000 shards per node. A node with 1000 shards would be excluded
from allocation decisions and CrateDB would attempt to allocate shards to
other nodes, or leave shards unassigned if no suitable node can be found.
.. NOTE::
If a table is created with :ref:`sql-create-table-number-of-replicas`
provided as a range or the default ``0-1`` value, the limit check accounts
only for primary shards and not for possibly expanded replicas; thus, the
actual number of all shards can exceed the limit.
.. _indices.recovery:
Recovery
--------
.. _indices.recovery.max_bytes_per_sec:
**indices.recovery.max_bytes_per_sec**
| *Default:* ``40mb``
| *Runtime:* ``yes``
Specifies the maximum number of bytes that can be transferred per second
during :ref:`shard recovery `. Limiting can be
disabled by setting it to ``0``. This setting allows you to control the
network usage of the recovery process. Higher values may result in higher
network utilization, but also a faster recovery process.
.. _indices.recovery.retry_delay_state_sync:
**indices.recovery.retry_delay_state_sync**
| *Default:* ``500ms``
| *Runtime:* ``yes``
Defines the time to wait after an issue caused by cluster state syncing
before retrying to :ref:`recover `.
.. _indices.recovery.retry_delay_network:
**indices.recovery.retry_delay_network**
| *Default:* ``5s``
| *Runtime:* ``yes``
Defines the time to wait after an issue caused by the network before retrying
to :ref:`recover `.
.. _indices.recovery.internal_action_timeout:
**indices.recovery.internal_action_timeout**
| *Default:* ``15m``
| *Runtime:* ``yes``
Defines the timeout for internal requests made as part of the :ref:`recovery
`.
.. _indices.recovery.internal_action_long_timeout:
**indices.recovery.internal_action_long_timeout**
| *Default:* ``30m``
| *Runtime:* ``yes``
Defines the timeout for internal requests made as part of the :ref:`recovery
` that are expected to take a long time. Defaults to
twice :ref:`internal_action_timeout
`.
.. _indices.recovery.recovery_activity_timeout:
**indices.recovery.recovery_activity_timeout**
| *Default:* ``30m``
| *Runtime:* ``yes``
:ref:`Recoveries ` that don't show any activity for
more than this interval will fail. Defaults to
:ref:`internal_action_long_timeout
`.
.. _indices.recovery.max_concurrent_file_chunks:
**indices.recovery.max_concurrent_file_chunks**
| *Default:* ``2``
| *Runtime:* ``yes``
Controls the number of file chunk requests that can be sent in parallel per
:ref:`recovery `. As multiple recoveries are already
running in parallel, controlled by
:ref:`cluster.routing.allocation.node_concurrent_recoveries
`, increasing this
expert-level setting might only help in situations where peer recovery of a
single shard is not reaching the total inbound and outbound peer recovery
traffic as configured by :ref:`indices.recovery.max_bytes_per_sec
`, but is CPU-bound instead, typically
when using transport-level security or compression.
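For instance, a sketch of raising the recovery throughput ceiling at runtime
(the value is illustrative):
.. code-block:: sql
SET GLOBAL TRANSIENT "indices.recovery.max_bytes_per_sec" = '100mb';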
Memory management
-----------------
.. _memory.allocation.type:
**memory.allocation.type**
| *Default:* ``on-heap``
| *Runtime:* ``yes``
Supported values are ``on-heap`` and ``off-heap``. This influences whether
memory is preferably allocated in the heap space or in the off-heap/direct
memory region. Setting this to ``off-heap`` doesn't imply that the heap won't
be used anymore. Most allocations will still happen in the heap space, but
some operations will be allowed to utilize off-heap buffers.
.. warning::
Using ``off-heap`` is considered **experimental**.
.. _memory.operation_limit:
**memory.operation_limit**
| *Default:* ``0``
| *Runtime:* ``yes``
Default value for the :ref:`memory.operation_limit
session setting `. Changing the cluster
setting will only affect new sessions, not existing sessions.
Example statement to update the default value to 1 GB, i.e. 1073741824 bytes::
cr> SET GLOBAL "memory.operation_limit" = 1073741824;
SET OK, 1 row affected (... sec)
Operations that hit this memory limit will trigger a
``CircuitBreakingException``, which can be handled by the application to
inform the user that the particular query consumed too much memory.
Query circuit breaker
---------------------
The query circuit breaker keeps track of the memory used during the
execution of a query. If a query consumes too much memory, or if the cluster
is already near its memory limit, the query is terminated to keep the
cluster working.
.. _indices.breaker.query.limit:
**indices.breaker.query.limit**
| *Default:* ``60%``
| *Runtime:* ``yes``
Specifies the limit for the query breaker. Provided values can either be
absolute values (interpreted as a number of bytes), byte sizes (like ``1mb``)
or percentage of the heap size (like ``12%``). A value of ``-1`` disables
breaking the circuit while still accounting memory usage.
Request circuit breaker
-----------------------
The request circuit breaker allows an estimation of required heap memory per
request. If a single request exceeds the specified amount of memory, an
exception is raised.
.. _indices.breaker.request.limit:
**indices.breaker.request.limit**
| *Default:* ``60%``
| *Runtime:* ``yes``
Specifies the JVM heap limit for the request circuit breaker.
Accounting circuit breaker
--------------------------
Tracks things that are held in memory independent of queries. For example the
memory used by Lucene for segments.
.. _indices.breaker.accounting.limit:
**indices.breaker.accounting.limit**
| *Default:* ``100%``
| *Runtime:* ``yes``
Specifies the JVM heap limit for the accounting circuit breaker.
.. CAUTION::
This setting is deprecated and will be removed in a future release.
.. _stats.breaker.log:
Stats circuit breakers
----------------------
Settings that control the behaviour of the stats circuit breaker. There are two
breakers in place, one for the jobs log and one for the operations log. For
each of them, the breaker limit can be set.
.. _stats.breaker.log.jobs.limit:
**stats.breaker.log.jobs.limit**
| *Default:* ``5%``
| *Runtime:* ``yes``
The maximum memory that can be used from :ref:`CRATE_HEAP_SIZE
` for the :ref:`sys.jobs_log ` table on each
node.
When this memory limit is reached, the job log circuit breaker logs an error
message and clears the :ref:`sys.jobs_log ` table completely.
.. _stats.breaker.log.operations.limit:
**stats.breaker.log.operations.limit**
| *Default:* ``5%``
| *Runtime:* ``yes``
The maximum memory that can be used from :ref:`CRATE_HEAP_SIZE
` for the :ref:`sys.operations_log ` table on
each node.
When this memory limit is reached, the operations log circuit breaker logs an
error message and clears the :ref:`sys.operations_log ` table
completely.
Total circuit breaker
---------------------
The total, or parent, circuit breaker represents the sum of all other circuit
breakers and additionally takes the current heap usage into consideration.
.. _indices.breaker.total.limit:
**indices.breaker.total.limit**
| *Default:* ``95%``
| *Runtime:* ``yes``
The maximum memory that can be used by all aforementioned circuit breakers
together.
Even if an individual circuit breaker doesn't hit its individual limit,
queries might still get aborted if several circuit breakers together would
hit the memory limit configured in ``indices.breaker.total.limit``.
Thread pools
------------
Every node uses a number of thread pools to schedule operations; each pool
is dedicated to specific operations. The most important pools are:
* ``write``: Used for write operations like index, update or delete. The ``type``
defaults to ``fixed``.
* ``search``: Used for read operations like ``SELECT`` statements. The ``type``
defaults to ``fixed``.
* ``refresh``: Used for :ref:`refresh operations `. The ``type``
defaults to ``scaling``.
* ``generic``: For internal tasks like cluster state management. The ``type``
defaults to ``scaling``.
* ``logical_replication``: For logical replication operations. The ``type``
defaults to ``fixed``.
In addition to those pools, there are also ``netty`` worker threads, which
are used to process network requests and many CPU-bound actions like query
analysis and optimization.
The thread pool settings are expert settings which you generally shouldn't
need to touch. They are dynamically sized depending on the number of
available CPU cores. If you're running multiple services on the same machine,
you should instead change the :ref:`processors` setting.
Increasing the number of threads for a pool can result in degraded
performance due to increased context switching and a higher memory footprint.
If you observe idle CPU cores, increasing the thread pool size is rarely the
right course of action. Instead, it can be a sign that:
- Operations are blocked on disk I/O. Increasing the thread pool size could
result in more operations getting queued and blocked on disk I/O, decreasing
throughput rather than increasing it, due to more memory pressure and
additional garbage collection activity.
- Individual operations are running single-threaded. Not all tasks required
to process a SQL statement can be further subdivided and processed in
parallel, but many operations default to using one thread per shard. Because
of this, you can consider increasing the number of shards of a table to
increase the parallelism of a single statement and improve CPU core
utilization. Alternatively, you can try increasing the concurrency on the
client side, to have CrateDB process more SQL statements in parallel.
.. _thread_pool..type:
**thread_pool.<name>.type**
| *Runtime:* ``no``
| *Allowed values:* ``fixed | scaling``
``fixed`` holds a fixed size of threads to handle the requests. It also has a
queue for pending requests if no threads are available.
``scaling`` ensures that a thread pool holds a dynamic number of threads that
are proportional to the workload.
Settings for fixed thread pools
...............................
If the type of a thread pool is set to ``fixed`` there are a few optional
settings.
.. _thread_pool..size:
**thread_pool.<name>.size**
| *Runtime:* ``no``
Number of threads. The default size of the different thread pools depend on
the number of available CPU cores.
.. _thread_pool..queue_size:
**thread_pool.<name>.queue_size**
| *Default write:* ``200``
| *Default search:* ``1000``
| *Runtime:* ``no``
Size of the queue for pending requests. A value of ``-1`` sets it to
unbounded.
If you have burst workloads followed by periods of inactivity, it can make
sense to increase the ``queue_size`` to allow a node to buffer more queries
before rejecting new operations. But be aware that increasing the queue size
under sustained workloads will only increase the system's memory consumption
and likely degrade performance.
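A sketch of a ``crate.yml`` fragment tuning the ``write`` pool for bursty
ingest (the values are illustrative; these settings cannot be changed at
runtime):
.. code-block:: yaml
thread_pool.write.type: fixed
thread_pool.write.queue_size: 500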
.. _overload_protection:
Overload Protection
-------------------
Overload protection settings control how many resources operations like
``INSERT INTO FROM QUERY``, ``UPDATE``, ``DELETE`` or ``COPY`` can use.
The values here serve as a starting point for an algorithm that dynamically
adapts the effective concurrency limit based on the round-trip time of
requests. Whenever one of these settings is updated, the previously calculated
effective concurrency is reset.
Changing these settings will only affect new operations; already running
operations will continue with the previous settings.
.. _overload_protection.dml.initial_concurrency:
**overload_protection.dml.initial_concurrency**
| *Default:* ``5``
| *Runtime:* ``yes``
The initial number of concurrent operations allowed per target node.
.. _overload_protection.dml.min_concurrency:
**overload_protection.dml.min_concurrency**
| *Default:* ``1``
| *Runtime:* ``yes``
The minimum number of concurrent operations allowed per target node.
.. _overload_protection.dml.max_concurrency:
**overload_protection.dml.max_concurrency**
| *Default:* ``100``
| *Runtime:* ``yes``
The maximum number of concurrent operations allowed per target node.
.. _overload_protection.dml.queue_size:
**overload_protection.dml.queue_size**
| *Default:* ``25``
| *Runtime:* ``yes``
How many operations are allowed to queue up.
Metadata
--------
.. _cluster.info.update.interval:
**cluster.info.update.interval**
| *Default:* ``30s``
| *Runtime:* ``yes``
Defines how often the cluster collects metadata information (e.g., disk
usage) if no concrete event triggers a collection.
.. _metadata_gateway:
Metadata gateway
................
The following settings can be used to configure the behavior of the
:ref:`metadata gateway `.
.. _gateway.expected_nodes:
**gateway.expected_nodes**
| *Default:* ``-1``
| *Runtime:* ``no``
The setting ``gateway.expected_nodes`` defines the total number of nodes
expected in the cluster. It is evaluated together with
``gateway.recover_after_nodes`` to decide whether the cluster can start
recovery.
.. CAUTION::
This setting is deprecated and will be removed in a future version.
Use `gateway.expected_data_nodes`_ instead.
.. _gateway.expected_data_nodes:
**gateway.expected_data_nodes**
| *Default:* ``-1``
| *Runtime:* ``no``
The setting ``gateway.expected_data_nodes`` defines the total number of
data nodes expected in the cluster. It is evaluated together with
``gateway.recover_after_data_nodes`` to decide whether the cluster can start
recovery.
.. _gateway.recover_after_time:
**gateway.recover_after_time**
| *Default:* ``5m``
| *Runtime:* ``no``
The ``gateway.recover_after_time`` setting defines the time to wait for
the number of nodes set in ``gateway.expected_data_nodes`` (or
``gateway.expected_nodes``) to become available, before starting the
recovery, once the number of nodes defined in
``gateway.recover_after_data_nodes`` (or ``gateway.recover_after_nodes``)
has already been reached.
This setting is ignored if ``gateway.expected_data_nodes`` or
``gateway.expected_nodes`` are set to 0 or 1.
It also has no effect if ``gateway.recover_after_data_nodes`` is set equal
to ``gateway.expected_data_nodes`` (or ``gateway.recover_after_nodes`` is
set equal to ``gateway.expected_nodes``).
If neither this setting nor ``gateway.expected_nodes`` nor
``gateway.expected_data_nodes`` is explicitly set, the cluster proceeds to
immediate recovery and the default 5 minute waiting time does not apply.
.. _gateway.recover_after_nodes:
**gateway.recover_after_nodes**
| *Default:* ``-1``
| *Runtime:* ``no``
The ``gateway.recover_after_nodes`` setting defines the number of nodes that
need to join the cluster before the cluster state recovery can start.
If this setting is ``-1`` and ``gateway.expected_nodes`` is set, all nodes
will need to be started before the cluster state recovery can start.
Please note that proceeding with recovery when not all nodes are available
could trigger the promotion of shards and the creation of new replicas,
generating disk and network load, which may be unnecessary. You can use a
combination of this setting with ``gateway.recover_after_time`` to
mitigate this risk.
.. CAUTION::
This setting is deprecated and will be removed in CrateDB 5.0.
Use `gateway.recover_after_data_nodes`_ instead.
.. _gateway.recover_after_data_nodes:
**gateway.recover_after_data_nodes**
| *Default:* ``-1``
| *Runtime:* ``no``
The ``gateway.recover_after_data_nodes`` setting defines the number of data
nodes that need to be started before the cluster state recovery can start.
If this setting is ``-1`` and ``gateway.expected_data_nodes`` is set, all
data nodes will need to be started before the cluster state recovery can
start.
Please note that proceeding with recovery when not all data nodes are
available could trigger the promotion of shards and the creation of new
replicas, generating disk and network load, which may be unnecessary. You
can use a combination of this setting with ``gateway.recover_after_time``
to mitigate this risk.
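For a three-node cluster, a sketch of these settings in ``crate.yml`` might
look as follows (the values are illustrative):
.. code-block:: yaml
gateway.expected_data_nodes: 3
gateway.recover_after_data_nodes: 2
gateway.recover_after_time: 5m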
Logical Replication
-------------------
The replication process can be configured with the following settings. These
settings are dynamic and can be changed at runtime.
.. _replication.logical.ops_batch_size:
**replication.logical.ops_batch_size**
| *Default:* ``50000``
| *Min value:* ``16``
| *Runtime:* ``yes``
The maximum number of operations to replicate from the publisher cluster per
poll operation; it represents the number by which a sequence is advanced.
.. _replication.logical.reads_poll_duration:
**replication.logical.reads_poll_duration**
| *Default:* ``50``
| *Runtime:* ``yes``
The maximum time (in milliseconds) to wait for changes per poll operation.
When a subscriber makes a request to a publisher, it has
``reads_poll_duration`` milliseconds to harvest changes from the publisher.
.. _replication.logical.recovery.chunk_size:
**replication.logical.recovery.chunk_size**
| *Default:* ``1MB``
| *Min value:* ``1KB``
| *Max value:* ``1GB``
| *Runtime:* ``yes``
Chunk size to transfer files during the initial recovery of a replicating table.
.. _replication.logical.recovery.max_concurrent_file_chunks:
**replication.logical.recovery.max_concurrent_file_chunks**
| *Default:* ``2``
| *Min value:* ``1``
| *Max value:* ``5``
| *Runtime:* ``yes``
Controls the number of file chunk requests that can be sent in parallel between
clusters during the recovery.
.. hide:
cr> RESET GLOBAL "stats.jobs_log_filter"
RESET OK, 1 row affected (... sec)
cr> RESET GLOBAL "memory.operation_limit"
RESET OK, 1 row affected (... sec)
.. _bootstrap checks: https://cratedb.com/docs/crate/howtos/en/latest/admin/bootstrap-checks.html
.. highlight:: sh
.. vale off
.. _conf-node-settings:
======================
Node-specific settings
======================
Basics
======
.. _cluster.name:
**cluster.name**
| *Default:* ``crate``
| *Runtime:* ``no``
The name of the CrateDB cluster the node should join.
.. _node.name:
**node.name**
| *Runtime:* ``no``
The name of the node. If no name is configured, a random one will be
generated.
.. NOTE::
Node names must be unique in a CrateDB cluster.
.. _node.store_allow_mmap:
**node.store.allow_mmap**
| *Default:* ``true``
| *Runtime:* ``no``
The setting indicates whether or not memory-mapping is allowed.
Node types
==========
CrateDB supports different types of nodes.
The following settings can be used to differentiate nodes upon startup:
.. _node.master:
**node.master**
| *Default:* ``true``
| *Runtime:* ``no``
Whether or not this node is able to get elected as *master* node in the
cluster.
.. _node.data:
**node.data**
| *Default:* ``true``
| *Runtime:* ``no``
Whether or not this node will store data.
Using different combinations of these two settings, you can create four
different types of node. Each type of node is differentiated by what types of
load it will handle.
Tabulating the truth values for ``node.master`` and ``node.data`` produces a
truth table outlining the four different types of node:
+---------------+-----------------------------+------------------------------+
| | **Master** | **No master** |
+---------------+-----------------------------+------------------------------+
| **Data** | Handle all loads. | Handles client requests and |
| | | query execution. |
+---------------+-----------------------------+------------------------------+
| **No data** | Handles cluster management. | Handles client requests. |
+---------------+-----------------------------+------------------------------+
Nodes marked as ``node.master`` will only handle cluster management if they are
elected as the cluster master. All other loads are shared equally.
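For example, a sketch of a ``crate.yml`` fragment for a dedicated
master-eligible node that stores no data:
.. code-block:: yaml
node.master: true
node.data: false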
General
=======
.. _node.sql.read_only:
**node.sql.read_only**
| *Default:* ``false``
| *Runtime:* ``no``
If set to ``true``, the node will only allow SQL statements which result in
read operations.
.. _statement_timeout:
**statement_timeout**
| *Default:* ``0``
| *Runtime:* ``yes``
The maximum duration of any statement before it gets cancelled.
This value is used as the default value for the :ref:`statement_timeout
session setting `.
If set to ``0``, queries are allowed to run indefinitely and don't get
cancelled automatically.
.. NOTE::
Updating this setting won't affect existing sessions, it will only take
effect for new sessions.
.. _statement_max_length:
**statement_max_length**
| *Default:* ``262144``
| *Runtime:* ``no``
The maximum length of a SQL statement.
.. WARNING::
Increasing this can lead to high memory consumption when parsing large
statements and can cause a node to crash with an out-of-memory error.
Networking
==========
.. _conf_hosts:
Hosts
-----
.. _network.host:
**network.host**
| *Default:* ``_local_``
| *Runtime:* ``no``
The IP address CrateDB will bind itself to. This setting sets both the
`network.bind_host`_ and `network.publish_host`_ values.
.. _network.bind_host:
**network.bind_host**
| *Default:* ``_local_``
| *Runtime:* ``no``
This setting determines the address to which CrateDB should bind itself.
.. _network.publish_host:
**network.publish_host**
| *Default:* ``_local_``
| *Runtime:* ``no``
This setting is used by a CrateDB node to publish its own address to the rest
of the cluster.
.. TIP::
Apart from IPv4 and IPv6 addresses, there are some special values that can
be used for all of the above settings:
========================= =================================================
``_local_`` Any loopback addresses on the system, for example
``127.0.0.1``.
``_site_`` Any site-local addresses on the system, for
example ``192.168.0.1``.
``_global_`` Any globally-scoped addresses on the system, for
example ``8.8.8.8``.
``_[INTERFACE]_`` Addresses of a network interface, for example
``_en0_``.
========================= =================================================
.. _conf_ports:
Ports
-----
.. _http.port:
**http.port**
| *Runtime:* ``no``
This defines the TCP port range to which the CrateDB HTTP service will be
bound. It defaults to ``4200-4300``. The first free port in this range is
always used. If this is set to an integer value, it is considered an
explicit single port.
The HTTP protocol is used for the REST endpoint which is used by all clients
except the Java client.
.. _http.publish_port:
**http.publish_port**
| *Runtime:* ``no``
The port HTTP clients should use to communicate with the node. It is
necessary to define this setting if the bound HTTP port (``http.port``) of
the node is not directly reachable from outside, e.g. running it behind a
firewall or inside a Docker container.
.. _transport.tcp.port:
**transport.tcp.port**
| *Runtime:* ``no``
This defines the TCP port range to which the CrateDB transport service will
be bound. It defaults to ``4300-4400``. The first free port in this range is
always used. If this is set to an integer value, it is considered an
explicit single port.
The transport protocol is used for internal node-to-node communication.
.. _transport.publish_port:
**transport.publish_port**
| *Runtime:* ``no``
The port that the node publishes to the cluster for its own discovery. It is
necessary to define this setting when the bound transport port
(``transport.tcp.port``) of the node is not directly reachable from outside,
e.g. running it behind a firewall or inside a Docker container.
.. _psql.port:
**psql.port**
| *Runtime:* ``no``
This defines the TCP port range to which the CrateDB PostgreSQL service will
be bound. It defaults to ``5432-5532``. The first free port in this range is
always used. If this is set to an integer value, it is considered an
explicit single port.
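A sketch of pinning all three interfaces to single, explicit ports in
``crate.yml`` (the values shown match the first ports of the default ranges):
.. code-block:: yaml
http.port: 4200
transport.tcp.port: 4300
psql.port: 5432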
Advanced TCP settings
---------------------
Any interface that uses TCP (Postgres wire, HTTP & Transport protocols) shares
the following settings:
.. _network.tcp.no_delay:
**network.tcp.no_delay**
| *Default:* ``true``
| *Runtime:* ``no``
Enables or disables `Nagle's algorithm`_ for buffering TCP packets. Since
this defaults to ``true``, buffering is disabled by default.
.. _network.tcp.keep_alive:
**network.tcp.keep_alive**
| *Default:* ``true``
| *Runtime:* ``no``
Configures the ``SO_KEEPALIVE`` option for sockets, which determines
whether they send TCP keepalive probes.
.. _network.tcp.reuse_address:
**network.tcp.reuse_address**
| *Default:* ``true`` on non-Windows machines and ``false`` otherwise
| *Runtime:* ``no``
Configures the ``SO_REUSEADDR`` option for sockets, which determines
whether they should reuse the address.
.. _network.tcp.send_buffer_size:
**network.tcp.send_buffer_size**
| *Default:* ``-1``
| *Runtime:* ``no``
The size of the TCP send buffer (`SO_SNDBUF`_ socket option).
By default not explicitly set.
.. _network.tcp.receive_buffer_size:
**network.tcp.receive_buffer_size**
| *Default:* ``-1``
| *Runtime:* ``no``
The size of the TCP receive buffer (`SO_RCVBUF`_ socket option).
By default not explicitly set.
.. NOTE::
Each setting in this section has its counterpart for HTTP and transport.
To provide a protocol-specific setting, remove the ``network`` prefix and use
either ``http`` or ``transport`` instead. For example, ``no_delay`` can be
configured as ``http.tcp.no_delay`` or ``transport.tcp.no_delay``. Please
note that the PostgreSQL interface takes its settings from ``transport``.
Transport settings
------------------
.. _transport.connect_timeout:
**transport.connect_timeout**
| *Default:* ``30s``
| *Runtime:* ``no``
The connect timeout for initiating a new connection.
.. _transport.compress:
**transport.compress**
| *Default:* ``false``
| *Runtime:* ``no``
Set to ``true`` to enable compression (DEFLATE) between all nodes.
.. _transport.ping_schedule:
**transport.ping_schedule**
| *Default:* ``-1``
| *Runtime:* ``no``
Schedule a regular application-level ping message to ensure that transport
connections between nodes are kept alive. Defaults to ``-1`` (disabled). It is
preferable to correctly configure TCP keep-alives instead of using this
feature, because TCP keep-alives apply to all kinds of long-lived connections
and not just to transport connections.
.. _conf-node-settings_paths:
Paths
=====
.. NOTE::
Relative paths are relative to :ref:`CRATE_HOME `.
Absolute paths override this behavior.
.. _path.conf:
**path.conf**
| *Default:* ``config``
| *Runtime:* ``no``
Filesystem path to the directory containing the configuration files
``crate.yml`` and ``log4j2.properties``.
.. _path.data:
**path.data**
| *Default:* ``data``
| *Runtime:* ``no``
Filesystem path to the directory where this CrateDB node stores its data
(table data and cluster metadata).
Multiple paths can be set by using a comma-separated list; each of these
paths will hold full shards (instead of striping data across them). For
example:
.. code-block:: yaml
path.data: /path/to/data1,/path/to/data2
When CrateDB finds striped shards at the provided locations (from CrateDB
<0.55.0), these shards will be migrated automatically on startup.
.. _path.logs:
**path.logs**
| *Default:* ``logs``
| *Runtime:* ``no``
Filesystem path to a directory where log files should be stored.
Can be used as a variable inside ``log4j2.properties``.
For example:
.. code-block:: yaml
appender:
file:
file: ${path.logs}/${cluster.name}.log
.. _path.repo:
**path.repo**
| *Runtime:* ``no``
A list of filesystem or UNC paths where repositories of type
:ref:`sql-create-repo-fs` may be stored.
Without this setting a CrateDB user could write snapshot files to any
directory that is writable by the CrateDB process. To safeguard against this
security issue, the possible paths have to be whitelisted here.
See also :ref:`location ` setting of repository
type ``fs``.
.. SEEALSO::
:ref:`blobs.path `
Plug-ins
========
.. _plugin.mandatory:
**plugin.mandatory**
| *Runtime:* ``no``
A list of plug-ins that are required for a node to start up.
If any plug-in listed here is missing, the CrateDB node will fail to start.
CPU
===
.. _processors:
**processors**
| *Runtime:* ``no``
The number of processors is used to size CrateDB's thread pools
appropriately. If not set explicitly, CrateDB will infer the number from the
processors available on the system.
In environments where CPU availability can be restricted (such as Docker) or
when multiple CrateDB instances are running on the same hardware, the
inferred number might be too high. In such cases, it is recommended to set
the value explicitly.
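For example, when each instance is limited to two cores, the value could be
pinned accordingly (a sketch; the value is illustrative):

.. code-block:: yaml

    processors: 2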
Memory
======
.. _bootstrap.memory_lock:
**bootstrap.memory_lock**
| *Default:* ``false``
| *Runtime:* ``no``
CrateDB performs poorly when the JVM starts swapping: you should ensure that
it *never* swaps. If set to ``true``, CrateDB will use the ``mlockall``
system call on startup to ensure that the memory pages of the CrateDB process
are locked into RAM.
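A minimal sketch for ``crate.yml``; note that ``mlockall`` typically also
requires that the operating system allows the CrateDB process to lock memory
(an assumption about your environment, e.g. the ``memlock`` resource limit):

.. code-block:: yaml

    # Lock the process memory into RAM to prevent swapping.
    bootstrap.memory_lock: true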
Garbage collection
==================
CrateDB logs if JVM garbage collection on different memory pools takes too
long. The following settings can be used to adjust these timeouts:
.. _monitor.jvm.gc.collector.young.warn:
**monitor.jvm.gc.collector.young.warn**
| *Default:* ``1000ms``
| *Runtime:* ``no``
CrateDB will log a warning message if it takes more than the configured
timespan to collect the *Eden Space* (heap).
.. _monitor.jvm.gc.collector.young.info:
**monitor.jvm.gc.collector.young.info**
| *Default:* ``700ms``
| *Runtime:* ``no``
CrateDB will log an info message if it takes more than the configured
timespan to collect the *Eden Space* (heap).
.. _monitor.jvm.gc.collector.young.debug:
**monitor.jvm.gc.collector.young.debug**
| *Default:* ``400ms``
| *Runtime:* ``no``
CrateDB will log a debug message if it takes more than the configured
timespan to collect the *Eden Space* (heap).
.. _monitor.jvm.gc.collector.old.warn:
**monitor.jvm.gc.collector.old.warn**
| *Default:* ``10000ms``
| *Runtime:* ``no``
CrateDB will log a warning message if it takes more than the configured
timespan to collect the *Old Gen* / *Tenured Gen* (heap).
.. _monitor.jvm.gc.collector.old.info:
**monitor.jvm.gc.collector.old.info**
| *Default:* ``5000ms``
| *Runtime:* ``no``
CrateDB will log an info message if it takes more than the configured
timespan to collect the *Old Gen* / *Tenured Gen* (heap).
.. _monitor.jvm.gc.collector.old.debug:
**monitor.jvm.gc.collector.old.debug**
| *Default:* ``2000ms``
| *Runtime:* ``no``
CrateDB will log a debug message if it takes more than the configured
timespan to collect the *Old Gen* / *Tenured Gen* (heap).
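For example, to raise the warning thresholds on a cluster where longer GC
pauses are expected (values illustrative):

.. code-block:: yaml

    monitor.jvm.gc.collector.young.warn: 1500ms
    monitor.jvm.gc.collector.old.warn: 15s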
Authentication
==============
.. _host_based_auth:
Trust authentication
--------------------
.. _auth.trust.http_default_user:
**auth.trust.http_default_user**
| *Default:* ``crate``
| *Runtime:* ``no``
The default user that should be used for authentication when clients connect
to CrateDB via the HTTP protocol without specifying a user via the
``Authorization`` request header.
.. _auth.trust.http_support_x_real_ip:
**auth.trust.http_support_x_real_ip**
| *Default:* ``false``
| *Runtime:* ``no``
If enabled, the HTTP transport will trust the ``X-Real-IP`` header sent by
the client to determine the client's IP address. This is useful when CrateDB
is running behind a reverse proxy or load-balancer. For improved security,
any ``_local_`` IP address (``127.0.0.1`` and ``::1``) defined in this header
will be ignored.
.. warning::
Enabling this setting can be a security risk, as it allows clients to
impersonate other clients by sending a fake ``X-Real-IP`` header.
Host-based authentication
-------------------------
Authentication settings (``auth.host_based.*``) are node settings, which
means that their values apply only to the node on which they are set;
different nodes may therefore have different authentication settings.
.. _auth.host_based.enabled:
**auth.host_based.enabled**
| *Default:* ``false``
| *Runtime:* ``no``
Setting to enable or disable Host Based Authentication (HBA). It is disabled
by default.
.. _jwt_defaults:
JWT Based Authentication
........................
Default global settings for the :ref:`JWT authentication `.
.. _auth.host_based.jwt.iss:
**auth.host_based.jwt.iss**
| *Runtime:* ``no``
Default value for the ``iss`` :ref:`JWT property `.
If ``iss`` is set, user-specific JWT properties are ignored.
.. _auth.host_based.jwt.aud:
**auth.host_based.jwt.aud**
| *Runtime:* ``no``
Default value for the ``aud`` :ref:`JWT property `.
If ``aud`` is set but ``iss`` is not, the global configuration is incomplete
and user-specific JWT properties are used.
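A sketch of global JWT defaults in ``crate.yml``; the issuer URL and the
audience below are placeholders:

.. code-block:: yaml

    auth.host_based.jwt.iss: https://idp.example.org/realms/crate
    auth.host_based.jwt.aud: example_audience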
HBA entries
...........
The ``auth.host_based.config.`` setting is a group setting that can have zero,
one or multiple groups that are defined by their group key (``${order}``) and
their fields (``user``, ``address``, ``method``, ``protocol``, ``ssl``).
.. _${order}:
**${order}:**
| An identifier that is used as a natural order key when looking up the host
| based configuration entries. For example, an order key of ``a`` will be
| looked up before an order key of ``b``. This key guarantees that the entry
| lookup order will remain independent from the insertion order of the
| entries.
The :ref:`admin_hba` setting is a list of predicates that users can specify to
restrict or allow access to CrateDB.
The meaning of the fields is as follows:
.. _auth.host_based.config.${order}.user:
**auth.host_based.config.${order}.user**
| *Runtime:* ``no``
| Specifies an existing CrateDB username; initially, only the ``crate`` user
| (superuser) is available. If no user is specified in the entry, all
| existing users can have access.
.. _auth.host_based.config.${order}.address:
**auth.host_based.config.${order}.address**
| *Runtime:* ``no``
| The client machine addresses that this entry matches and which are allowed
| to authenticate. This field may contain an IPv4 address, an IPv6 address,
| or an IPv4 CIDR mask. For example: ``127.0.0.1`` or ``127.0.0.1/32``. It
| may also contain a hostname or the special ``_local_`` notation, which
| matches both IPv4 and IPv6 connections from localhost. A hostname
| specification that starts with a dot (``.``) matches a suffix of the
| actual hostname, so ``.crate.io`` would match ``foo.crate.io`` but not
| ``crate.io``. If no address is specified in the entry, access to CrateDB
| is open for all hosts.
.. _auth.host_based.config.${order}.method:
**auth.host_based.config.${order}.method**
| *Runtime:* ``no``
| The authentication method to use when a connection matches this entry.
| Valid values are ``trust``, ``cert``, ``password`` and ``jwt``. If no
| method is specified, the ``trust`` method is used by default.
| See :ref:`auth_trust`, :ref:`auth_cert`, :ref:`auth_password` and
| :ref:`auth_jwt` for more information about these methods.
.. _auth.host_based.config.${order}.protocol:
**auth.host_based.config.${order}.protocol**
| *Runtime:* ``no``
| Specifies the protocol for which the authentication entry should be used.
| If no protocol is specified, then this entry will be valid for all
| protocols that rely on host-based authentication (see :ref:`auth_trust`).
.. _auth.host_based.config.${order}.ssl:
**auth.host_based.config.${order}.ssl**
| *Default:* ``optional``
| *Runtime:* ``no``
| Specifies whether the client must use SSL/TLS to connect to the cluster.
| If set to ``on``, the client must connect through SSL/TLS, otherwise it
| is not authenticated. If set to ``off``, the client must *not* connect
| via SSL/TLS, otherwise it is not authenticated. Finally, ``optional``,
| which is the value used when the option is omitted, means that the
| client is authenticated regardless of whether SSL/TLS is used.
Example of config groups:
.. code-block:: yaml

    auth.host_based.config:
      entry_a:
        user: crate
        address: 127.16.0.0/16
      entry_b:
        method: trust
      entry_3:
        user: crate
        address: 172.16.0.0/16
        method: trust
        protocol: pg
        ssl: on
.. _ssl_config:
Secured communications (SSL/TLS)
================================
Secured communications via SSL/TLS allow you to encrypt traffic between
CrateDB nodes and clients connecting to them. Connections are secured using
Transport Layer Security (TLS).
.. _ssl.http.enabled:
**ssl.http.enabled**
| *Default:* ``false``
| *Runtime:* ``no``
Set this to ``true`` to enable secure communication between the CrateDB node
and the client through SSL via the HTTPS protocol.
.. _ssl.psql.enabled:
**ssl.psql.enabled**
| *Default:* ``false``
| *Runtime:* ``no``
Set this to ``true`` to enable secure communication between the CrateDB node
and the client through SSL via the PostgreSQL wire protocol.
.. _ssl.transport.mode:
**ssl.transport.mode**
| *Default:* ``legacy``
| *Runtime:* ``no``
For communication between nodes, choose:
``off``
SSL cannot be used
``legacy``
SSL is not used. If HBA is enabled, transport connections won't be verified.
Any reachable host can establish a connection.
``on``
SSL must be used
.. _ssl.keystore_filepath:
**ssl.keystore_filepath**
| *Runtime:* ``no``
The full path to the node keystore file.
.. _ssl.keystore_password:
**ssl.keystore_password**
| *Runtime:* ``no``
The password used to decrypt the keystore file defined with
``ssl.keystore_filepath``.
.. _ssl.keystore_key_password:
**ssl.keystore_key_password**
| *Runtime:* ``no``
The password entered at the end of the ``keytool -genkey`` command.
.. NOTE::
Optionally trusted CA certificates can be stored separately from the
node's keystore into a truststore for CA certificates.
.. _ssl.truststore_filepath:
**ssl.truststore_filepath**
| *Runtime:* ``no``
The full path to the node truststore file. If not defined, then only a
keystore will be used.
.. _ssl.truststore_password:
**ssl.truststore_password**
| *Runtime:* ``no``
The password used to decrypt the truststore file defined with
``ssl.truststore_filepath``.
.. _ssl.resource_poll_interval:
**ssl.resource_poll_interval**
| *Default:* ``5m``
| *Runtime:* ``no``
The frequency at which SSL files such as keystore and truststore are polled
for changes.
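Putting these together, a node enforcing TLS on all channels might use the
following sketch; the file paths and passwords are placeholders:

.. code-block:: yaml

    ssl.http.enabled: true
    ssl.psql.enabled: true
    ssl.transport.mode: on
    ssl.keystore_filepath: /path/to/keystore.jks
    ssl.keystore_password: changeit
    ssl.truststore_filepath: /path/to/truststore.jks
    ssl.truststore_password: changeit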
Cross-origin resource sharing (CORS)
====================================
Many browsers support the `same-origin policy`_ which requires web applications
to explicitly allow requests across origins. The `cross-origin resource
sharing`_ settings in CrateDB allow for configuring these.
.. _http.cors.enabled:
**http.cors.enabled**
| *Default:* ``false``
| *Runtime:* ``no``
Enable or disable `cross-origin resource sharing`_.
.. _http.cors.allow-origin:
**http.cors.allow-origin**
| *Default:* empty (no origins allowed)
| *Runtime:* ``no``
Define allowed origins of a request. ``*`` allows *any* origin (which can be
a substantial security risk). By prepending a ``/``, the string will be
treated as a :ref:`regular expression `. For
example, ``/https?:\/\/crate.io/`` will allow requests from
``http://crate.io`` and ``https://crate.io``. This setting disallows any
origin by default.
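As a sketch, the following allows requests from a single site only; the
origin below is a placeholder:

.. code-block:: yaml

    http.cors.enabled: true
    http.cors.allow-origin: '/https?:\/\/(www\.)?example\.com/'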
.. _http.cors.max-age:
**http.cors.max-age**
| *Default:* ``1728000`` (20 days)
| *Runtime:* ``no``
Max cache age of a preflight request in seconds.
.. _http.cors.allow-methods:
**http.cors.allow-methods**
| *Default:* ``OPTIONS, HEAD, GET, POST, PUT, DELETE``
| *Runtime:* ``no``
Allowed HTTP methods.
.. _http.cors.allow-headers:
**http.cors.allow-headers**
| *Default:* ``X-Requested-With, Content-Type, Content-Length``
| *Runtime:* ``no``
Allowed HTTP headers.
.. _http.cors.allow-credentials:
**http.cors.allow-credentials**
| *Default:* ``false``
| *Runtime:* ``no``
Add the ``Access-Control-Allow-Credentials`` header to responses.
.. _`same-origin policy`: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
.. _`cross-origin resource sharing`: https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/CORS
Blobs
=====
.. _blobs.path:
**blobs.path**
| *Runtime:* ``no``
Path to a filesystem directory where blob data allocated for this node is
stored.
By default, blobs are stored under the same path as normal data. A relative
path value is interpreted as relative to ``CRATE_HOME``.
.. _ref-configuration-repositories:
Repositories
============
Repositories are used to :ref:`backup ` a CrateDB cluster.
.. _repositories.url.allowed_urls:
**repositories.url.allowed_urls**
| *Runtime:* ``no``
This setting only applies to repositories of type :ref:`sql-create-repo-url`.
With this setting, a list of URLs can be specified which are allowed to be
used when a repository of type ``url`` is created.
Wildcards are supported in the host, path, query, and fragment parts.
This setting is a security measure to prevent access to arbitrary resources.
In addition, the supported protocols can be restricted using the
:ref:`repositories.url.supported_protocols
` setting.
.. _repositories.url.supported_protocols:
**repositories.url.supported_protocols**
| *Default:* ``http``, ``https``, ``ftp``, ``file`` and ``jar``
| *Runtime:* ``no``
A list of protocols that are supported by repositories of type
:ref:`sql-create-repo-url`.
The ``jar`` protocol is used to access the contents of JAR files. For more
info, see the Java `JarURLConnection documentation`_.
See also the :ref:`path.repo ` setting.
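A sketch that restricts ``url`` repositories to one HTTPS location; the URL
is a placeholder:

.. code-block:: yaml

    repositories.url.allowed_urls:
      - https://backups.example.com/cratedb/*
    repositories.url.supported_protocols:
      - https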
.. _`JarURLConnection documentation`: https://docs.oracle.com/javase/8/docs/api/java/net/JarURLConnection.html
Queries
=======
.. _indices.query.bool.max_clause_count:
**indices.query.bool.max_clause_count**
| *Default:* ``8192``
| *Runtime:* ``no``
This setting limits the number of boolean clauses that can be generated by
``!= ANY()``, ``LIKE ANY()``, ``ILIKE ANY()``, ``NOT LIKE ANY()`` and
``NOT ILIKE ANY()`` :ref:`operators ` on arrays in order to
prevent users from executing queries that may result in heavy memory
consumption causing nodes to crash with ``OutOfMemory`` exceptions. Throws
``TooManyClauses`` errors when the limit is exceeded.
.. NOTE::
You can avoid ``TooManyClauses`` errors by increasing this setting. The
number of boolean clauses used can be larger than the number of elements in
the array.
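For example, in ``crate.yml`` (the value is illustrative, and raising it
allows correspondingly more memory-hungry queries):

.. code-block:: yaml

    indices.query.bool.max_clause_count: 16384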
Legacy
=======
.. _legacy.table_function_column_naming:
**legacy.table_function_column_naming**
| *Default:* ``false``
| *Runtime:* ``no``
Since CrateDB 5.0.0, if a table function is not aliased and returns a single
column of a base data type, the table function name is used as the column
name. This setting can be enabled in order to use the naming convention from
before 5.0.0.
The following table functions are affected by this setting:
- :ref:`unnest `
- :ref:`regexp_matches `
- :ref:`generate_series `
When the setting is enabled and a single column is expected to be returned,
the returned column will be named ``col1``, ``groups``, or ``col1``,
respectively.
.. NOTE::
Beware that if this setting is not applied consistently on all nodes in the
cluster, the behaviour will depend on the node handling the query.
.. _conf-node-lang-js:
JavaScript language
===================
.. _lang.js.enabled:
**lang.js.enabled**
| *Default:* ``true``
| *Runtime:* ``no``
Setting to enable or disable :ref:`JavaScript UDF ` support.
.. _conf-fdw:
Foreign Data Wrappers
=====================
.. _fdw.allow_local:
**fdw.allow_local**
| *Default:* ``false``
| *Runtime:* ``no``
Allow access to local addresses via :ref:`Foreign data wrappers
` for all users.
By default, only the ``crate`` superuser is allowed to access foreign servers
that point to ``localhost``.
.. warning::
Changing this to ``true`` can pose a security risk if you do not trust the
users with ``AL`` permissions on the system. They can create foreign servers,
foreign tables and user mappings that allow them to access services running on
the same machine as CrateDB as if connected locally - effectively bypassing
any restrictions set up via :ref:`admin_hba`.
Do **not** change this if you don't understand the implications.
.. _conf-node-attributes:
Custom attributes
=================
The ``node.attr`` namespace is a bag of custom attributes. Custom attributes
can be :ref:`used to control shard allocation
`.
You can create any attribute you want under this namespace, like
``node.attr.key: value``. These attributes use the ``node.attr`` namespace to
distinguish them from core node attribute like ``node.name``.
Custom attributes are not validated by CrateDB, unlike core node attributes.
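For example, nodes might advertise their availability zone and storage type;
the attribute names and values below are arbitrary:

.. code-block:: yaml

    node.attr.zone: us-east-1a
    node.attr.storage: ssd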
.. vale on
.. _plugins: https://github.com/crate/crate/blob/master/devs/docs/plugins.rst
.. _Nagle's algorithm: https://en.wikipedia.org/wiki/Nagle%27s_algorithm
.. _SO_RCVBUF: https://docs.oracle.com/javase/7/docs/api/java/net/StandardSocketOptions.html#SO_RCVBUF
.. _SO_SNDBUF: https://docs.oracle.com/javase/7/docs/api/java/net/StandardSocketOptions.html#SO_SNDBUF

# Copyright (c) 2021-2024, Crate.io Inc.
# Distributed under the terms of the AGPLv3 license, see LICENSE.
from cratedb_toolkit.info.model import InfoElement, LogElement
from cratedb_toolkit.info.util import get_single_value
class Library:
"""
A collection of SQL queries and utilities suitable for diagnostics on CrateDB.
Credits to the many authors and contributors of CrateDB diagnostics utilities,
dashboards, and cheat sheets.
Acknowledgements: Baurzhan Sakhariev, Eduardo Legatti, Georg Traar, Hernan
Cianfagna, Ivan Sanchez Valencia, Karyn Silva de Azevedo, Niklas Schmidtmer,
Walter Behmann.
References:
- https://community.cratedb.com/t/similar-elasticsearch-commands/1455/4
- CrateDB Admin UI.
- CrateDB Grafana General Diagnostics Dashboard.
- Debugging CrateDB - Queries Cheat Sheet.
"""
class Health:
"""
CrateDB health check queries.
"""
backups_recent = InfoElement(
name="backups_recent",
label="Recent Backups",
sql="""
SELECT repository, name, finished, state
FROM sys.snapshots
ORDER BY finished DESC
LIMIT 10;
""",
description="Most recent 10 backups",
)
cluster_name = InfoElement(
name="cluster_name",
label="Cluster name",
sql=r"SELECT name FROM sys.cluster;",
transform=get_single_value("name"),
)
nodes_count = InfoElement(
name="cluster_nodes_count",
label="Total number of cluster nodes",
sql=r"SELECT COUNT(*) AS count FROM sys.nodes;",
transform=get_single_value("count"),
)
nodes_list = InfoElement(
name="cluster_nodes_list",
label="Cluster Nodes",
sql="SELECT * FROM sys.nodes ORDER BY hostname;",
description="Telemetry information for all cluster nodes.",
)
table_health = InfoElement(
name="table_health",
label="Table Health",
sql="SELECT health, COUNT(*) AS table_count FROM sys.health GROUP BY health;",
description="Table health short summary",
)
class JobInfo:
"""
Information distilled from `sys.jobs_log` and `sys.jobs`.
"""
age_range = InfoElement(
name="age_range",
label="Query age range",
description="Timestamps of first and last job",
sql="""
SELECT
MIN(started) AS "first_job",
MAX(started) AS "last_job"
FROM sys.jobs_log;
""",
)
by_user = InfoElement(
name="by_user",
label="Queries by user",
sql=r"""
SELECT
username,
COUNT(username) AS count
FROM sys.jobs_log
GROUP BY username
ORDER BY count DESC;
""",
description="Total number of queries per user.",
)
duration_buckets = InfoElement(
name="duration_buckets",
label="Query Duration Distribution (Buckets)",
sql="""
WITH dur AS (
SELECT
ended-started::LONG AS duration
FROM sys.jobs_log
),
pct AS (
SELECT
[0.25,0.5,0.75,0.99,0.999,1] pct_in,
percentile(duration,[0.25,0.5,0.75,0.99,0.999,1]) as pct,
count(*) cnt
FROM dur
)
SELECT
UNNEST(pct_in) * 100 AS bucket,
cnt - CEIL(UNNEST(pct_in) * cnt) AS count,
CEIL(UNNEST(pct)) duration
---cnt
FROM pct;
""",
description="Distribution of query durations, bucketed.",
)
duration_percentiles = InfoElement(
name="duration_percentiles",
label="Query Duration Distribution (Percentiles)",
sql="""
SELECT
min(ended-started::LONG) AS min,
percentile(ended-started::LONG, 0.50) AS p50,
percentile(ended-started::LONG, 0.90) AS p90,
percentile(ended-started::LONG, 0.99) AS p99,
MAX(ended-started::LONG) AS max
FROM
sys.jobs_log
LIMIT 50;
""",
description="Distribution of query durations, percentiles.",
)
history100 = InfoElement(
name="history",
label="Query History",
sql="""
SELECT
started AS "time",
stmt,
(ended::LONG - started::LONG) AS duration,
username
FROM sys.jobs_log
WHERE stmt NOT ILIKE '%snapshot%'
ORDER BY time DESC
LIMIT 100;
""",
transform=lambda x: list(reversed(x)),
description="Statements and durations of the 100 recent queries / jobs.",
)
history_count = InfoElement(
name="history_count",
label="Query History Count",
sql="""
SELECT
COUNT(*) AS job_count
FROM
sys.jobs_log;
""",
transform=get_single_value("job_count"),
description="Total number of queries on this node.",
)
performance15min = InfoElement(
name="performance15min",
label="Query performance 15min",
sql=r"""
SELECT
CURRENT_TIMESTAMP AS last_timestamp,
(ended / 10000) * 10000 + 5000 AS ended_time,
COUNT(*) / 10.0 AS qps,
TRUNC(AVG(ended::BIGINT - started::BIGINT), 2) AS duration,
UPPER(regexp_matches(stmt,'^\s*(\w+).*')[1]) AS query_type
FROM
sys.jobs_log
WHERE
ended > now() - ('15 minutes')::INTERVAL
GROUP BY 1, 2, 5
ORDER BY ended_time ASC;
""",
description="The query performance within the last 15 minutes, including two metrics: "
"queries per second, and query speed (ms).",
)
running = InfoElement(
name="running",
label="Currently Running Queries",
sql="""
SELECT
started AS "time",
stmt,
(CURRENT_TIMESTAMP::LONG - started::LONG) AS duration,
username
FROM sys.jobs
WHERE stmt NOT ILIKE '%snapshot%'
ORDER BY time;
""",
description="Statements and durations of currently running queries / jobs.",
)
running_count = InfoElement(
name="running_count",
label="Number of running queries",
sql="""
SELECT
COUNT(*) AS job_count
FROM
sys.jobs;
""",
transform=get_single_value("job_count"),
description="Total number of currently running queries.",
)
top100_count = InfoElement(
name="top100_count",
label="Query frequency",
description="The 100 most frequent queries.",
sql="""
SELECT
stmt,
COUNT(stmt) AS stmt_count,
MAX((ended::LONG - started::LONG) ) AS max_duration,
MIN((ended::LONG - started::LONG) ) AS min_duration,
AVG((ended::LONG - started::LONG) ) AS avg_duration,
PERCENTILE((ended::LONG - started::LONG), 0.99) AS p99
FROM sys.jobs_log
GROUP BY stmt
ORDER BY stmt_count DESC
LIMIT 100;
""",
)
top100_duration_individual = InfoElement(
name="top100_duration_individual",
label="Individual Query Duration",
description="The 100 queries by individual duration.",
sql="""
SELECT
(ended::LONG - started::LONG) AS duration,
stmt
FROM sys.jobs_log
ORDER BY duration DESC
LIMIT 100;
""",
unit="ms",
)
top100_duration_total = InfoElement(
name="top100_duration_total",
label="Total Query Duration",
description="The 100 queries by total duration.",
sql="""
SELECT
SUM(ended::LONG - started::LONG) AS total_duration,
stmt,
COUNT(stmt) AS stmt_count
FROM sys.jobs_log
GROUP BY stmt
ORDER BY total_duration DESC
LIMIT 100;
""",
unit="ms",
)
class Logs:
"""
Access `sys.jobs_log` for logging purposes.
"""
"""
TODO: Implement `tail` in one way or another.
-- https://stackoverflow.com/q/4714975
@seut says:
why? whats the issue with sorting it desc by ended? As the table will be computed by results of
all nodes inside the cluster, the natural ordering might not be deterministic.
Ideas::
SELECT * FROM sys.jobs_log OFFSET -10;
SELECT * FROM sys.jobs_log OFFSET (SELECT count(*) FROM sys.jobs_log)-10;
- https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#to-char-expression-format-string
- https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#date-format-format-string-timezone-timestamp
"""
user_queries_latest = LogElement(
name="user_queries_latest",
label="Latest User Queries",
sql=r"""
SELECT
DATE_FORMAT('%Y-%m-%dT%H:%i:%s.%f', started) AS started,
DATE_FORMAT('%Y-%m-%dT%H:%i:%s.%f', ended) AS ended,
classification, stmt, username, node
FROM
sys.jobs_log
WHERE
stmt NOT LIKE '%sys.%' AND
stmt NOT LIKE '%information_schema.%'
ORDER BY ended DESC
LIMIT {limit};
""",
)
class Replication:
"""
Information about logical replication.
"""
# https://github.com/crate/crate/blob/master/docs/admin/logical-replication.rst#monitoring
subscriptions = """
SELECT s.subname, s.subpublications, sr.srrelid::text, sr.srsubstate, sr.srsubstate_reason
FROM pg_subscription s
JOIN pg_subscription_rel sr ON s.oid = sr.srsubid
ORDER BY s.subname;
"""
class Resources:
"""
About system resources.
"""
# TODO: Needs templating.
column_cardinality = """
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE schemaname = '...'
AND tablename IN (...)
AND attname IN (...);
"""
file_descriptors = """
SELECT
name AS node_name,
process['open_file_descriptors'] AS "open_file_descriptors",
process['max_open_file_descriptors'] AS max_open_file_descriptors
FROM sys.nodes
ORDER BY node_name;
"""
heap_usage = """
SELECT
name AS node_name,
heap['used'] / heap['max']::DOUBLE AS heap_used
FROM sys.nodes
ORDER BY node_name;
"""
tcp_connections = """
SELECT
name AS node_name,
connections
FROM sys.nodes
ORDER BY node_name;
"""
# TODO: Q: Why "14"? Is it about only getting information about the `write` thread pool?
# A: Yes, the `write` thread pool will be exposed as the last entry inside this array.
# But this may change in future.
thread_pools = """
SELECT
name AS node_name,
thread_pools[14]['queue'],
thread_pools[14]['active'],
thread_pools[14]['threads']
FROM sys.nodes
ORDER BY node_name;
"""
class Settings:
"""
Reflect cluster settings.
"""
info = """
SELECT
name,
master_node,
settings['cluster']['routing']['allocation']['cluster_concurrent_rebalance']
AS cluster_concurrent_rebalance,
settings['indices']['recovery']['max_bytes_per_sec'] AS max_bytes_per_sec
FROM sys.cluster
LIMIT 1;
"""
class Shards:
"""
Information about shard / node / table / partition allocation and rebalancing.
"""
# https://cratedb.com/docs/crate/reference/en/latest/admin/system-information.html#example
# TODO: Needs templating.
for_table = """
SELECT
schema_name,
table_name,
id,
partition_ident,
num_docs,
primary,
relocating_node,
routing_state,
state,
orphan_partition
FROM sys.shards
WHERE schema_name = '{schema_name}' AND table_name = '{table_name}';
"""
# Identify the location of the shards for each partition.
# TODO: Needs templating.
location_for_partition = """
SELECT table_partitions.table_schema,
table_partitions.table_name,
table_partitions.values[{partition_column}]::TIMESTAMP,
shards.primary,
shards.node['name']
FROM sys.shards
JOIN information_schema.table_partitions ON shards.partition_ident=table_partitions.partition_ident
WHERE table_partitions.table_name = {table_name}
ORDER BY 1,2,3,4,5;
"""
allocation = InfoElement(
name="shard_allocation",
sql="""
SELECT
IF(primary = TRUE, 'primary', 'replica') AS shard_type,
COUNT(*) AS shards
FROM sys.allocations
WHERE current_state != 'STARTED'
GROUP BY 1
""",
label="Shard Allocation",
description="Support identifying issues with shard allocation.",
)
max_checkpoint_delta = InfoElement(
name="max_checkpoint_delta",
sql="""
SELECT
COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0)
AS max_checkpoint_delta
FROM sys.shards;
""",
transform=get_single_value("max_checkpoint_delta"),
label="Delta between local and global checkpoint",
description="If the delta between the local and global checkpoint is significantly large, "
"shard replication might have stalled or slowed down.",
)
# data-hot-2 262
# data-hot-1 146
node_shard_distribution = InfoElement(
name="node_shard_distribution",
label="Shard Distribution",
sql="""
SELECT
node['name'] AS node_name,
COUNT(*) AS num_shards
FROM sys.shards
WHERE primary = true
GROUP BY node_name;
""",
description="Shard distribution across nodes.",
)
not_started = InfoElement(
name="shard_not_started",
label="Shards not started",
sql="""
SELECT *
FROM sys.allocations
WHERE current_state != 'STARTED';
""",
description="Information about shards which have not been started.",
)
not_started_count = InfoElement(
name="shard_not_started_count",
label="Number of shards not started",
description="Total number of shards which have not been started.",
sql="""
SELECT COUNT(*) AS not_started_count
FROM sys.allocations
WHERE current_state != 'STARTED';
""",
transform=get_single_value("not_started_count"),
)
rebalancing_progress = InfoElement(
name="shard_rebalancing_progress",
label="Shard Rebalancing Progress",
sql="""
SELECT
table_name,
schema_name,
recovery['stage'] AS recovery_stage,
AVG(recovery['size']['percent']) AS progress,
COUNT(*) AS count
FROM
sys.shards
GROUP BY table_name, schema_name, recovery_stage;
""",
description="Information about rebalancing progress.",
)
rebalancing_status = InfoElement(
name="shard_rebalancing_status",
label="Shard Rebalancing Status",
sql="""
SELECT node['name'], id, recovery['stage'], recovery['size']['percent'], routing_state, state
FROM sys.shards
WHERE routing_state IN ('INITIALIZING', 'RELOCATING')
ORDER BY id;
""",
description="Information about rebalancing activities.",
)
table_allocation = InfoElement(
name="table_allocation",
label="Table Allocations",
sql="""
SELECT
table_schema, table_name, node_id, shard_id, partition_ident, current_state, decisions, explanation
FROM
sys.allocations;
""",
description="Table allocation across nodes, shards, and partitions.",
)
table_allocation_special = InfoElement(
name="table_allocation_special",
label="Table Allocations Special",
sql="""
SELECT decisions[2]['node_name'] AS node_name, COUNT(*) AS table_count
FROM sys.allocations
GROUP BY decisions[2]['node_name'];
""",
description="Table allocation. Special.",
)
table_shard_count = InfoElement(
name="table_shard_count",
label="Table Shard Count",
sql="""
SELECT
table_schema,
table_name,
SUM(number_of_shards) AS num_shards
FROM
information_schema.table_partitions
WHERE
closed = false
GROUP BY table_schema, table_name;
""",
description="Total number of shards per table.",
)
total_count = InfoElement(
name="shard_total_count",
label="Number of shards",
description="Total number of shards.",
sql="""
SELECT COUNT(*) AS shard_count
FROM sys.shards
""",
transform=get_single_value("shard_count"),
)
# TODO: Are both `translog_uncommitted` items sensible?
translog_uncommitted = InfoElement(
name="translog_uncommitted",
label="Uncommitted Translog",
description="Check if translogs are committed properly by comparing the "
"`flush_threshold_size` with the `uncommitted_size` of a shard.",
sql="""
SELECT
sh.table_name,
sh.partition_ident,
SUM(sh.translog_stats['uncommitted_size']) / POWER(1024, 3) as "translog_uncommitted_in_gib"
FROM information_schema.table_partitions tp
JOIN sys.shards sh USING (table_name, partition_ident)
WHERE sh.translog_stats['uncommitted_size'] > settings['translog']['flush_threshold_size']
GROUP BY 1, 2
ORDER BY 3 DESC;
""",
)
translog_uncommitted_size = InfoElement(
name="translog_uncommitted_size",
label="Total uncommitted translog size",
description="A large number of uncommitted translog operations can indicate issues with shard replication.",
sql="""
SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size
FROM sys.shards;
""",
transform=get_single_value("translog_uncommitted_size"),
unit="bytes",
)

# Datasets API
Provides access to datasets that can easily be consumed by tutorials
and/or production applications.
## Install
```shell
pip install --upgrade 'cratedb-toolkit[datasets]'
```
## Synopsis
```python
from cratedb_toolkit.datasets import load_dataset
dataset = load_dataset("tutorial/weather-basic")
print(dataset.ddl)
```
## Usage
### Built-in datasets
Load example datasets into CrateDB database tables.
```python
from cratedb_toolkit.datasets import load_dataset
# Weather data example.
dataset = load_dataset("tutorial/weather-basic")
dataset.dbtable(dburi="crate://crate@localhost/", table="weather_data").load()
```
```python
from cratedb_toolkit.datasets import load_dataset
# UK wind farm data example.
dataset = load_dataset("tutorial/windfarm-uk-info")
dataset.dbtable(dburi="crate://crate@localhost/", table="windfarms").load()
dataset = load_dataset("tutorial/windfarm-uk-data")
dataset.dbtable(dburi="crate://crate@localhost/", table="windfarm_output").load()
```
### Kaggle
For accessing datasets on Kaggle, you will need an account on their platform.
#### Authentication
Either create a configuration file `~/.kaggle/kaggle.json` in JSON format,
```json
{"username":"acme","key":"134af98bdb0bd0fa92078d9c37ac8f78"}
```
or, alternatively, use these environment variables:
```shell
export KAGGLE_USERNAME=acme
export KAGGLE_KEY=134af98bdb0bd0fa92078d9c37ac8f78
```
#### Acquisition
Load a dataset on Kaggle into a CrateDB database table.
```python
from cratedb_toolkit.datasets import load_dataset
dataset = load_dataset("kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet")
dataset.dbtable(dburi="crate://crate@localhost/", table="kaggle_daily_weather").load()
```
## In Practice
Please refer to these notebooks to learn how `load_dataset` works in practice.
- [How to Build Time Series Applications in CrateDB]
- [Exploratory data analysis with CrateDB]
- [Time series decomposition with CrateDB]
[Exploratory data analysis with CrateDB]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/exploratory_data_analysis.ipynb
[How to Build Time Series Applications in CrateDB]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/dask-weather-data-import.ipynb
[Time series decomposition with CrateDB]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/time-series-decomposition.ipynb