.. _sql_ddl:

========================
Data Definition in Crate
========================


Table Basics
============

To create a table use the ``CREATE TABLE`` command. You must at least specify a name for the
table and names and types of the columns.
See `Data Types` for information about the supported data types.

Let's create a simple table with two columns of type ``integer`` and ``string``::

    cr> create table my_table (
    ...   first_column integer,
    ...   second_column string
    ... );
    CREATE OK (... sec)

A table can be removed by using the ``DROP TABLE`` command::

    cr> drop table my_table;
    DROP OK (... sec)

Constraints
===========

Primary Key
-----------

The primary key constraint combines a unique constraint and a not-null constraint. It also defines
the default routing value used for sharding. Example::

    cr> create table my_table1 (
    ...   first_column integer primary key,
    ...   second_column string
    ... );
    CREATE OK (... sec)

Currently primary key's cannot be auto generated and have to be specified if
data is inserted, otherwise an error is returned.

Defining multiple columns with a primary key constraint is also supported::

    cr> create table my_table1pk (
    ...   first_column integer primary key,
    ...   second_column string primary key,
    ...   third_column string
    ... );
    CREATE OK (... sec)

Or using a alternate syntax::

    cr> create table my_table1pk1 (
    ...   first_column integer,
    ...   second_column string,
    ...   third_column string,
    ...   primary key (first_column, second_column)
    ... );
    CREATE OK (... sec)

Data Types
==========

boolean
-------

A basic boolean type. Accepting ``true`` and ``false`` as values. Example::

    cr> create table my_bool_table (
    ...   first_column boolean
    ... );
    CREATE OK (... sec)

    cr> drop table my_bool_table;
    DROP OK (... sec)

string
------

A text-based basic type containing one or more character. Example::

    cr> create table my_table2 (
    ...   first_column string
    ... );
    CREATE OK (... sec)


number
------

Crate supports a set of number types: ``integer``, ``long``, ``short``, ``double``,
``float`` and ``byte``. All types have the same ranges as corresponding `Java types`_.
You can insert any number for any type, be it a ``float``, ``integer``, or ``byte``
as long as its within the corresponding range.
Example::

    cr> create table my_table3 (
    ...   first_column integer,
    ...   second_column long,
    ...   third_column short,
    ...   fourth_column double,
    ...   fifth_column float,
    ...   sixth_column byte
    ... );
    CREATE OK (... sec)

timestamp
---------

The timestamp type is a special type which maps to a formatted string. Internally it maps to
the UTC milliseconds since 1970-01-01T00:00:00Z stored as ``long``. They are always returned as ``long``.
The default format is dateOptionalTime_ and cannot be changed currently.
Formatted date strings containing timezone offset information will be converted to UTC.
Formated string without timezone offset information will be treated as UTC.
Timestamps will also accept a ``long`` representing UTC milliseconds since the epoch or
a ``float`` or ``double`` representing UTC seconds since the epoch with milliseconds as
fractions. Example::

    cr> create table my_table4 (
    ...   id integer,
    ...   first_column timestamp
    ... );
    CREATE OK (... sec)

    cr> insert into my_table4 (id, first_column) values (0, '1970-01-01T00:00:00');
    INSERT OK, 1 row affected (... sec)

    cr> insert into my_table4 (id, first_column) values (1, '1970-01-01T00:00:00+0100');
    INSERT OK, 1 row affected (... sec)

    cr> insert into my_table4 (id, first_column) values (2, 0);
    INSERT OK, 1 row affected (... sec)

    cr> insert into my_table4 (id, first_column) values (3, 1.0);
    INSERT OK, 1 row affected (... sec)

    cr> insert into my_table4 (id, first_column) values (3, 'wrong');
    ValidationException[Validation failed for first_column: wrong type 'string'. expected: 'timestamp']

object
------

The object type allows to define nested documents instead of old-n-busted flat tables.
An object can contain other fields of any type, even further object columns.
An Object column can be either schemaless or enforce its defined schema.
It can even be used as a kind of json-blob.

Syntax::

    <columnName> OBJECT [ ({DYNAMIC|STRICT|IGNORED}) ] [ AS ( <columnDefinition>* ) ]

The only required part of this column definition is ``OBJECT``.
The object type defining this objects behaviour is optional, if left out ``DYNAMIC`` will be used.
The list of subcolumns is optional as well, if left out, this object will have no schema
(with a schema created on the fly on first inserts in case of ``DYNAMIC``).

Example::

    cr> create table my_table11 (
    ...   title string,
    ...   col1 object,
    ...   col3 object(strict) as (
    ...     age integer,
    ...     name string,
    ...     col31 object as (
    ...       birthday timestamp
    ...     )
    ...   )
    ... );
    CREATE OK (... sec)

strict
......

It can be configured to be ``strict``, rejecting any subcolumn that is not defined upfront
in the schema. As you might have guessed, defining ``strict`` objects without subcolumns results
in an unusable column that will always be null, which is the most useless column one could create.

Example::

    cr> create table my_table12 (
    ...   title string,
    ...   author object(strict) as (
    ...     name string,
    ...     birthday timestamp
    ...   )
    ... );
    CREATE OK (... sec)


dynamic
.......

Another option is ``dynamic``, which means that new subcolumns can be added in this object.

Note that adding new columns to a ``dynamic`` object will affect the schema of the
table. Once a column is added, it shows up in the ``information_schema.columns``
table and its type and attributes are fixed. They will have the type that was guessed by their
inserted/updated value and they will always be ``not_indexed`` which means
they are analyzed with the ``plain`` analyzer, which means as-is.
If a new column ``a`` was added with type ``integer``,
adding strings to this column will result in an error.

Examples::

    cr> create table my_table13 (
    ...   title string,
    ...   author object as (
    ...     name string,
    ...     birthday timestamp
    ...   )
    ... );
    CREATE OK (... sec)

which is exactly the same as::

    cr> create table my_table14 (
    ...   title string,
    ...   author object(dynamic) as (
    ...     name string,
    ...     birthday timestamp
    ...   )
    ... );
    CREATE OK (... sec)

New columns added to ``dynamic`` objects are, once added,
usable as usual subcolumns. One can retrieve them, sort by them
and use them in where clauses.

ignored
.......

The third option is ``ignored`` which results in an object that allows inserting new subcolumns
but this adding will not affect the schema, they are not mapped according to their type,
which is therefor not guessed as well. You can in fact add any value to an added column of the
same name. The first value added does not determine what you can add further,
like with ``dynamic`` objects.
An object configured like this will simply accept and return the columns inserted into it,
but otherwise ignore them.

    cr> create table my_table15 (
    ...   title string,
    ...   details object(ignored) as (
    ...     num_pages integer,
    ...     font_size float
    ...   )
    ... );
    CREATE OK (... sec)


New columns added to ``ignored`` objects can be retrieved as result column in a ``SELECT`` statement,
but one cannot order by them or use them in a where clause. They are simply there for fetching,
nothing else.

array
-----

Crate supports arrays. An array is a collection of another `simple` data type.
The simple data types are:

* boolean
* string
* ip
* all numeric types (integer, long, short, double, float, byte)
* timestamp
* object

Array types are defined as follows::

    cr> create table my_table_arrays (
    ...     tags array(string),
    ...     objects array(object as (age integer, name string))
    ... );
    CREATE OK (... sec)


.. note::

    Currently arrays cannot be nested. Something like array(array(string))
    won't work.

.. note::

    Crash currently has no support to insert arrays. But it is possible to
    insert arrays using the `_sql` REST Endpoint or any of the available crate
    clients.

    E.g. in the python client a python list can be used as an argument to the
    cursors execute method.


.. _sql_ddl_sharding:

Sharding
========

Number of shards
----------------

Crate supports sharding natively, it even uses *5* shards by default if not further defined.
The number of shards can be defined by using the ``CLUSTERED INTO <number> SHARDS`` statement on
table creation. Example::

    cr> create table my_table5 (
    ...   first_column int
    ... ) clustered into 10 shards;
    CREATE OK (... sec)

.. note::

  The number of shards can only be set on table creation, it cannot be changed later on.

Routing
-------

The column used for routing can be freely defined using the ``CLUSTERED BY (<column>)``
statement and is used to route a row to a particular shard. Example::

    cr> create table my_table6 (
    ...   first_column int,
    ...   second_column string
    ... ) clustered by (first_column);
    CREATE OK (... sec)


If primary key constraints are defined, the routing column definition
can be omitted as primary key columns are always used for routing by default.
If the routing column is defined explicitly, it must match a primary key column::

    cr> create table my_table8 (
    ...   first_column int primary key,
    ...   second_column string primary key,
    ...   third_column string
    ... ) clustered by (first_column);
    CREATE OK (... sec)


Example for combining custom routing and shard definition::

    cr> create table my_table9 (
    ...   first_column int primary key,
    ...   second_column string primary key,
    ...   third_column string
    ... ) clustered by (first_column) into 10 shards;
    CREATE OK (... sec)


.. _replication:

Replication
===========

Replication of a table in Crate means that each primary shard of a
table is stored additionally on so called secondary shards. This might
be useful for better read performance and high availability. If not
specified, crate creates one replica, which means that a tables
content is stored twice across the nodes of a cluster.

Defining the number of replicas is done using the
``number_of_replicas`` property.

Example::

    cr> create table my_table10 (
    ...   first_column int,
    ...   second_column string
    ... ) with (number_of_replicas=1);
    CREATE OK (... sec)


The `number_of_replicas` property also accepts an string as parameter that
contains a `range`.

A range is a definition of `minimum number of replicas` to `maximum
number of replicas` depending on the number of nodes in the
cluster. The table below shows some examples.

===== ======================================================================
Range Explanation
===== ======================================================================
0-1   Will create 0 or 1 replicas depending on the number of available nodes
----- ----------------------------------------------------------------------
2-4   Table requires at least 2 replicas to be fully replicated. Will
      create up to 4 if nodes are added.
----- ----------------------------------------------------------------------
0-all Will expand the number of replicas to the available number of nodes.
===== ======================================================================

For details of the range syntax refer to :ref:`number_of_replicas`.

.. note::

  The number of replicas can be changed at any time.


.. _sql_ddl_partitioned_by:

Partitioned Tables
==================

A partitioned table is a virtual table that is internally split up into several
internal *tables*, called `partitions`, by the value of one or more columns. For every distinct combination
of values in those configured columns one separate partition is created.

Creating a partitioned table is done using the :ref:`partitioned_clause`::

    cr> create table partitioned_table (
    ... id long,
    ... title string,
    ... date timestamp
    ... ) partitioned by (date);
    CREATE OK (... sec)

For further details, please refer to :ref:`partitioned_clause`.

.. _fulltext-indices:

.. _indices_and_fulltext:

Indices and fulltext search
===========================

Fulltext indices take the contents of one or more fields and split it
up into tokens that are used for fulltext-search. The transformation
from a text to separate tokens is done by an analyzer. In order to
create fulltext search queries a :ref:`fulltext index with an analyzer
<sql_ddl_index_fulltext>` must be defined for the related columns.

.. _sql_ddl_index_definition:

Index Definition
----------------

In Crate, every column's data is indexed using the ``plain`` index
method by default.  Currently there are 3 choices related to index
definition:

  - `Disable indexing`_

  - `Plain index (Default)`_

  - `Fulltext index with analyzer`_

.. warning::

    Creating an index after a table was already created is currently not supported,
    so think carefully while designing your table definition.


.. _sql_ddl_index_off:

Disable indexing
................

Indexing can be turned off by using the ``INDEX OFF`` column
definition. Without an index the column can never be hit by a query,
and is only available as a result column::

    cr> create table my_table1b (
    ...   first_column string INDEX OFF
    ... );
    CREATE OK (... sec)


.. _sql_ddl_index_plain:

Plain index (Default)
.....................

An index of type ``plain`` is indexing the input data as-is without
analyzing.  Using the ``plain`` index method is the default behaviour
but can also be declared explicitly::

    cr> create table my_table1b1 (
    ...   first_column string INDEX using plain
    ... );
    CREATE OK (... sec)

This results in the same behaviour than without any index declaration::

    cr> create table my_table1b2 (
    ...   first_column string
    ... );
    CREATE OK (... sec)


.. _sql_ddl_index_fulltext:

Fulltext index with analyzer
............................

By defining an index on a column, it's analyzed data is indexed
instead of the raw data.  Thus, depending on the used analyzer,
querying for the exact data may not work anymore.  See
:ref:`builtin-analyzer` for details about available builtin analyzer
or :ref:`sql-ddl-custom-analyzer`.

If no analyzer is specified when using a fulltext index, the
:ref:`standard <standard-analyzer>` analyzer is used::

    cr> create table my_table1c (
    ...   first_column string INDEX using fulltext
    ... );
    CREATE OK (... sec)

Defining the usage of a concrete analyzer is straight forward by
defining the analyzer as a parameter using the ``WITH`` statement::

    cr> create table my_table1d (
    ...   first_column string INDEX using fulltext with(analyzer='english')
    ... );
    CREATE OK (... sec)


Defining a named index column definition
........................................

It's also possible to define an index column which treat the data of a
given column as input.  This is especially useful if you want to
search for both, the exact and analyzed data::

    cr> create table my_table1e (
    ...   first_column string,
    ...   INDEX first_column_ft using fulltext(first_column)
    ... );
    CREATE OK (... sec)

Of course defining a custom analyzer is possible here too::

    cr> create table my_table1f (
    ...   first_column string,
    ...   INDEX first_column_ft using fulltext(first_column) with(analyzer='english')
    ... );
    CREATE OK (... sec)


Defining a composite index
..........................

Defining a composite (or combined) index is done using the same syntax
as above despite multiple columns are given to the ``fulltext`` index
method::

    cr> create table documents (
    ...   title string,
    ...   body string,
    ...   INDEX title_body_ft using fulltext(title, body) with(analyzer='english')
    ... );
    CREATE OK (... sec)

Composite indices can include nested columns within object columns as well::

    cr> create table my_table1g (
    ...   title string,
    ...   author object(dynamic) as (
    ...     name string,
    ...     birthday timestamp
    ...   ),
    ...   INDEX author_title_ft using fulltext(title, author['name'])
    ... );
    CREATE OK (... sec)

.. _sql-ddl-custom-analyzer:

.. _create_custom_analyzer:

Create custom analyzer
----------------------

An analyzer consists of one tokenizer, zero or more token-filters, and
zero or more char-filters.

When a field-content is analyzed to become a stream of tokens, the
char-filter is applied at first.  It is used to filter some special
chars from the stream of characters that make up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them to
finally produce the desired stream of tokens.

With these elements in place, analyzers provide finegrained control
over building a token stream used for fulltext search.  For example
you can use language specific analyzers, tokenizers and token-filters
to get proper search results for data provided in a certain language.

Here is a simple Example::

    cr> create ANALYZER myanalyzer (
    ...   TOKENIZER whitespace,
    ...   TOKEN_FILTERS (
    ...     lowercase,
    ...     kstem
    ...   ),
    ...  CHAR_FILTERS (
    ...     html_strip,
    ...     mymapping WITH (
    ...       type='mapping',
    ...       mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
    ...     )
    ...   )
    ... );
    CREATE OK (... sec)

This example creates an analyzer called ``myanalyzer`` to be used in
index-definitions and index-constraints.  It will use a
:ref:`whitespace-tokenizer` tokenizer, a :ref:`lowercase-tokenfilter`
token-filter and a :ref:`kstem-tokenfilter` token-filter, a
:ref:`htmlstrip-charfilter` char-filter and a custom char-filter that
extends the :ref:`mapping-charfilter` char-filter.

You can use :ref:`builtin-tokenizer`, :ref:`builtin-token-filter` and
:ref:`builtin-char-filter` by just writing their names and you can
extend and parameterize them, see for example the ``mymapping``
char-filter above. You have to give these extended ones a unique name.

Extending Bultin Analyzer
-------------------------

Existing Analyzers can be used to create custom Analyzers by means of extending them.

You can extend and parameterize :ref:`builtin-analyzer` like this::

    cr> create ANALYZER "german_snowball" extends snowball WITH (
    ...   language='german'
    ... );
    CREATE OK (... sec)

If you extend :ref:`builtin-analyzer`, tokenizer, char-filter or
token-filter cannot be defined.  In this case use the parameters
available for the extended :ref:`builtin-analyzer`.

If you extend custom-analyzers, every part of the analyzer that is
ommitted will be taken from the extended one.  Example::

    cr> create ANALYZER e2 EXTENDS myanalyzer (
    ...     TOKENIZER mypattern WITH (
    ...       type='pattern',
    ...       pattern='.*'
    ...     )
    ... );
    CREATE OK (... sec)

This analyzer will use the char-filters and token-filters from
``myanalyzer`` and will override the tokenizer with ``mypattern``.

Analyzer Reference
------------------

.. toctree::
  :maxdepth: 2

  analyzer


.. _sql_ddl_system_columns:

System Columns
==============

On every table Crate implements several implicitly defined system columns. Their names are
reserved and cannot be used as user-defined column names. All system columns are prefixed with
an underscore.

.. _sql_ddl_system_column_version:

_version
  Crate uses an internal versioning for every row, the version number is increased on every write.
  This column can be used for `Optimistic Concurrency Control`_, see :ref:`sql_occ` for usage
  details.

.. _sql_ddl_system_column_score:

_score
  This internal system column is available on all documents retrieved by a ``SELECT`` query.
  It is representing the scoring ratio of the document related to the used query filter and
  makes most sense on fulltext searches.
  The scoring ratio is always related to the highest score determined by a search,
  thus scores are not directly comparable across searches.
  If the query does not include a fulltext search the value is 1.0f in most cases.


.. _Java types: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
.. _dateOptionalTime: http://joda-time.sourceforge.net/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser()
.. _Optimistic Concurrency Control: http://en.wikipedia.org/wiki/Optimistic_concurrency_control


Alter table
===========

The properties of a table can be modified using the `ALTER TABLE` clause::

    cr> alter table my_table1 set (number_of_replicas='0-all');
    ALTER OK (... sec)

In order to set a property to its default value use `reset`::

    cr> alter table my_table1 reset (number_of_replicas);
    ALTER OK (... sec)

.. note::

    Any other table modifications are currently not possible.
