.. _sql_ddl:

========================
Data Definition in Crate
========================


Table Basics
============

To create a table use the ``CREATE TABLE`` command. You must at least specify a name for the
table and names and types of the columns.
See `Data Types` for information about the supported data types.

Let's create a simple table with two columns of type ``integer`` and ``string``::

    cr> create table my_table (
    ...   first_column integer,
    ...   second_column string
    ... )
    CREATE OK (... sec)

A table can be removed by using the ``DROP TABLE`` command::

    cr> drop table my_table
    DROP OK (... sec)

Constraints
===========

Primary Key
-----------

The primary key constraint combines a unique constraint and a not-null constraint. It also defines
the default routing value used for sharding. Example::

    cr> create table my_table1 (
    ...   first_column integer primary key,
    ...   second_column string
    ... )
    CREATE OK (... sec)

Currently primary key's cannot be auto generated and have to be specified if
data is inserted, otherwise an error is returned.

.. note::

  Multiple primary keys are not supported yet.

.. _sql_ddl_index_definition:

Index
-----

At Crate, every column's data is indexed by default using a :ref:`keyword-analyzer`,
indexing the data as a single token.

Disable indexing
................

Indexing can be turned off by using the ``INDEX OFF`` constraint.
Without an index the column can never be hit by a query, and is only available as a result
column::

    cr> create table my_table1b (
    ...   first_column string INDEX OFF
    ... )
    CREATE OK (... sec)

Defining an index analyzer
..........................

By defining an index on a column, it's analyzed data is indexed instead of the raw data.
Thus, depending on the used analyzer, querying for the exact data may not work anymore.
See :ref:`builtin-analyzer` for details about available builtin analyzer or `Create custom
analyzer`_.

If no analyzer is specified, the :ref:`standard <standard-analyzer>` analyzer is used::

    cr> create table my_table1c (
    ...   first_column string INDEX using fulltext
    ... )
    CREATE OK (... sec)

.. note:

    ``fulltext`` is the only indexing method supported right now.

Defining the usage of a concrete analyzer is straight forward by defining the analyzer as a
parameter using the ``WITH`` statement::

    cr> create table my_table1d (
    ...   first_column string INDEX using fulltext with(analyzer='english')
    ... )
    CREATE OK (... sec)

Defining a named index column definition
........................................

It's also possible to define an index column which treat the data of a given column as input.
This is especially useful if you want to search for both, the exact and analyzed data::

    cr> create table my_table1e (
    ...   first_column string,
    ...   INDEX first_column_ft using fulltext(first_column)
    ... )
    CREATE OK (... sec)

Of course defining a custom analyzer is possible here too::

    cr> create table my_table1f (
    ...   first_column string,
    ...   INDEX first_column_ft using fulltext(first_column) with(analyzer='english')
    ... )
    CREATE OK (... sec)

Defining a composite index
..........................

Defining a composite (or combined) index is done using the same syntax as above despite multiple
columns are given to the ``fulltext`` index method::

    cr> create table documents (
    ...   title string,
    ...   body string,
    ...   INDEX title_body_ft using fulltext(title, body) with(analyzer='english')
    ... )
    CREATE OK (... sec)


.. warning::

    Creating a column index/analysis after a table was already created is currently not
    supported, so think carefully while designing your table definition.

Data Types
==========

string
------

A text-based basic type containing one or more character. Example::

    cr> create table my_table2 (
    ...   first_column string
    ... )
    CREATE OK (... sec)


number
------

Crate supports a set of number types: ``integer``,``long``, ``short``, ``double``,
``float`` and ``byte``. All types have the same ranges as corresponding `Java types`_.
Example::

    cr> create table my_table3 (
    ...   first_column integer,
    ...   second_column long,
    ...   third_column short,
    ...   fourth_column double,
    ...   fifth_column float,
    ...   sixth_column byte
    ... )
    CREATE OK (... sec)

timestamp
---------

The timestamp type is a special type which maps to a formatted string. Internally it maps to
a ``long``, adding the parsing from long to string and vice versa. All timestamps a treat as UTC.
The default format is dateOptionalTime_ and cannot be changed currently. It will also accept a
``long`` representing UTC milliseconds since the epoch. Example::

    cr> create table my_table4 (
    ...   first_column timestamp
    ... )
    CREATE OK (... sec)


Sharding
========

Number of shards
----------------

Crate supports sharding natively, it even uses *5* shards by default if not further defined.
The number of shards can be defined by using the ``CLUSTERED INTO <number> SHARDS`` statement on
table creation. Example::

    cr> create table my_table5 (
    ...   first_column int
    ... ) clustered into 10 shards
    CREATE OK (... sec)

.. note::

  The number of shards can only be set on table creation, it cannot be changed later on.

Routing
-------

The column used for routing can be freely defined using the ``CLUSTERED BY (<column>)``
statement and is used to route a row to a particular shard. Example::

    cr> create table my_table6 (
    ...   first_column int primary key,
    ...   second_column string
    ... ) clustered by (first_column)
    CREATE OK (... sec)


By default Crate is using the primary keys for routing the request to the involved shards. So
following two examples resulting in the same behaviour::

    cr> create table my_table7 (
    ...   first_column int primary key,
    ...   second_column string
    ... )
    CREATE OK (... sec)

    cr> create table my_table8 (
    ...   first_column int primary key,
    ...   second_column string
    ... ) clustered by (first_column)
    CREATE OK (... sec)

If no primary is defined an internal generated unique id is used for routing.

.. note::

  It is currently not supported to define a column for routing which is not a primary key or
  member of a composite primary key.

Example for combining custom routing and shard definition::

    cr> create table my_table9 (
    ...   first_column int primary key,
    ...   second_column string
    ... ) clustered by (first_column) into 10 shards
    CREATE OK (... sec)


Replication
===========

By default Crate uses an replication factor of 1. If e.g. a cluster with 2 nodes is set up and
an index is created using 5 shards, each node will have 5 shards.
Defining the number of replicas is done using the ``REPLICAS <number_of_replicas>`` statement.
Example::

    cr> create table my_table10 (
    ...   first_column int,
    ...   second_column string
    ... ) replicas 1
    CREATE OK (... sec)

.. note::

  The number of replicas can be changed at any time.

.. _fulltext-indices:

Indices and fulltext search
============================

Fulltext indices take the contents of one or more fields and split it up into tokens that are used for fulltext-search.
The transformation from a text to separate tokens is done by an analyzer.
In order to create fulltext search queries an index with an analyzer must be defined for the
related columns.
See :ref:`Index definition <sql_ddl_index_definition>` for details.


Create custom analyzer
----------------------

An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.

When a field-content is analyzed to become a stream of tokens, the char-filter is applied at first.
It is used to filter some special chars from the stream of characters that make up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them.

With these elements in place, analyzer provide finegrained control over building a token stream used for fulltext search.
e.G. you can use language specific analyzers, tokenizers and token-filters to get property search results for data
 provided in a certain language.



Create Analyzer Syntax::

    CREATE ANALYZER <analyzer_name> [EXTENDS <analyzer_name>] (
        [
          TOKENIZER <tokenizer_name> [WITH] (
            <tokenizer_property>=<value>,
            ...
          ),
        ]
        [
          TOKEN_FILTERS [WITH] (
            <token_filter_name>
            [ [WITH] (
                <token_filter_property>=<value>,
                ...
              )
            ],
            ...
          ),
        ]
        [
          CHAR_FILTERS [WITH] (
            <char_filter_name>
            [ [WITH] (
                <char_filter_property>=<value>,
                ...
              )
            ],
            ...
          )
        ]
    )

Multiple char filters and token filters are allowed but at maximum one tokenizer. Order does not matter.
A simple Example::

    cr> create ANALYZER myanalyzer (
    ...   TOKENIZER whitespace,
    ...   TOKEN_FILTERS WITH (
    ...     lowercase,
    ...     kstem
    ...   ),
    ...  CHAR_FILTERS (
    ...     html_strip,
    ...     mymapping WITH (
    ...       type='mapping',
    ...       mapping = ['ph=>f', 'qu=>q', 'foo=>bar']
    ...     )
    ...   )
    ... )
    CREATE OK (... sec)

This example creates an analyzer called ``myanalyzer`` to be used in index-definitions and index-constraints.
It will use a :ref:`whitespace-tokenizer` tokenizer, a :ref:`lowercase-tokenfilter` token-filter
and a :ref:`kstem-tokenfilter` token-filter, a :ref:`htmlstrip-charfilter` char-filter
and a custom char-filter that extends the :ref:`mapping-charfilter` char-filter.

You can use :ref:`builtin-tokenizer`, :ref:`builtin-token-filter` and :ref:`builtin-char-filter`
by just writing their names and you can extend and parameterize them,
see for example the ``mymapping`` char-filter above. You have to give these extended ones a unique name.

Custom tokenizers, token-filters and char-filters can be reused in other analyzer-definitions,
but cannot be further extended, this is only possible for builtin ones.

Extending Bultin Analyzer
-------------------------

Existing Analyzers can be used to create custom Analyzers by means of extending them.

You can extend and parameterize :ref:`builtin-analyzer` like this::

    cr> create ANALYZER "german_snowball" extends snowball WITH (
    ...   language='german'
    ... )
    CREATE OK (... sec)

If you extend :ref:`builtin-analyzer`, tokenizer, char-filter or token-filter cannot be defined.
In this case use the parameters available for the extended :ref:`builtin-analyzer`.

If you extend custom-analyzers, every part of the analyzer that is ommitted will be taken from the extended one.
Example::

    cr> create ANALYZER e2 EXTENDS myanalyzer (
    ...     TOKENIZER mypattern (
    ...       type='pattern',
    ...       pattern='.*'
    ...     )
    ... )
    CREATE OK (... sec)

This analyzer will use the char-filters and token-filters from ``myanalyzer``
and will override the tokenizer with ``mypattern``.

Analyzer Reference
------------------

.. toctree::
  :maxdepth: 2

  analyzer


.. _sql_ddl_system_columns:

System Columns
==============

On every table Crate implements several implicitly defined system columns. Their names are
reserved and cannot be used as user-defined column names. All system columns are prefixed with
an underscore and therefore must be quoted on usage.

.. _sql_ddl_system_column_version:

_version
  Crate uses an internal versioning for every row, the version number is increased on every write.
  This column can be used for `Optimistic Concurrency Control`_, see :ref:`sql_occ` for usage
  details.

.. _sql_ddl_system_column_score:

_score
  This internal system column is available on all documents retrieved by a ``SELECT`` query.
  It is representing the scoring ratio of the document related to the used query filter and
  makes most sense on fulltext searches.
  The scoring ratio is always related to the highest score determined by a search,
  thus scores are not directly comparable across searches.
  If the query does not include a fulltext search the value is 1.0f in most cases.


.. _Java types: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
.. _dateOptionalTime: http://joda-time.sourceforge.net/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser()
.. _Optimistic Concurrency Control: http://en.wikipedia.org/wiki/Optimistic_concurrency_control
