.. _sql-analyzer:

==========================================
Builtin Tools for fulltext search in Crate
==========================================

.. _analyzer-overview:

Overview
========

Analyzers are used for creating fulltext-indexes. They take the content of a field and split it
into tokens, which are then searched. Analyzers filter, reorder and/or transform the content of
a field before it becomes the final stream of tokens.

An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.

When the content of a field is analyzed to become a stream of tokens, the char-filters are applied
first. They filter special characters from the stream of characters that makes up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them.

With these elements in place, analyzers provide fine-grained control over building a token stream
used for fulltext search. For example, you can use language-specific analyzers,
tokenizers and token-filters to get proper search results for data provided in a certain language.

The builtin analyzers, tokenizers, token-filters and char-filters are listed below.
They can be used as is or can be extended. See :ref:`fulltext-indices`.
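
The order of the stages described in this overview can be sketched in plain Python. This is only
an illustration of the data flow, not how Crate implements analysis. ``analyze`` and the helper
functions are hypothetical names chosen for the example.

.. code-block:: python

    import re

    def analyze(text, char_filters, tokenizer, token_filters):
        """Apply char-filters, then one tokenizer, then token-filters, in that order."""
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return list(tokens)

    # Hypothetical stand-ins for a char-filter, a tokenizer and two token-filters.
    def strip_tags(text):
        return re.sub(r"<[^>]+>", " ", text)

    def split_on_whitespace(text):
        return text.split()

    def lowercase(tokens):
        return [token.lower() for token in tokens]

    def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "an"})):
        return [token for token in tokens if token not in stopwords]

    tokens = analyze(
        "<p>The Quick brown FOX</p>",
        char_filters=[strip_tags],
        tokenizer=split_on_whitespace,
        token_filters=[lowercase, drop_stopwords],
    )
    print(tokens)  # ['quick', 'brown', 'fox']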

.. _builtin-analyzer:

Builtin Analyzer
================

.. _standard-analyzer:

standard
--------

``type='standard'``

An analyzer of type standard is built using the :ref:`standard-tokenizer` Tokenizer
with the :ref:`standard-tokenfilter` Token Filter,
:ref:`lowercase-tokenfilter` Token Filter, and :ref:`stop-tokenfilter` Token Filter.

Lowercases all tokens, uses *NO* stopwords, and excludes tokens longer than 255 characters.
This analyzer uses unicode text segmentation, which is defined by `UAX#29`_.

Example::

    The quick brown fox jumps Over the lAzY DOG. --> quick, brown, fox, jumps, lazy, dog

Parameters
..........

stopwords
    A list of stopwords to initialize the :ref:`stop-tokenfilter` filter with. Defaults to the
    English stop words.

max_token_length
    The maximum token length. If a token is seen that exceeds this length then it is discarded.
    Defaults to ``255``.


.. _default-analyzer:

default
-------

``type='default'``

This is the same as the :ref:`standard-analyzer` analyzer.


.. _simple-analyzer:

simple
------

``type='simple'``

Uses the :ref:`lowercase-tokenizer` tokenizer.


.. _whitespace-analyzer:

whitespace
----------

``type='whitespace'``

Uses the :ref:`whitespace-tokenizer` tokenizer.


.. _stop-analyzer:

stop
----

``type='stop'``

Uses a :ref:`lowercase-tokenizer` Tokenizer, with :ref:`stop-tokenfilter` Token Filter.

Parameters
..........

stopwords
    A list of stopwords to initialize the :ref:`stop-tokenfilter` filter with. Defaults to the
    English stop words.

stopwords_path
    A path (either relative to config location, or absolute) to a stopwords file configuration.


.. _keyword-analyzer:

keyword
-------

``type='keyword'``

Creates one single token from the field-contents.


.. _pattern-analyzer:

pattern
-------

``type='pattern'``

An analyzer of type pattern that can flexibly separate text into terms via a regular expression.

Parameters
..........

lowercase
    Should terms be lowercased or not. Defaults to true.

pattern
    The regular expression pattern, defaults to ``\W+``.

flags
    The regular expression flags.

.. note::

    The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. ``CASE_INSENSITIVE|COMMENTS``. Check the `Java Pattern API`_
for more details about flag options.
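
The splitting behaviour can be approximated with Python's ``re`` module. This is a rough sketch:
Crate uses Java regular expressions, whose syntax and flags differ in places, and
``pattern_analyze`` is a hypothetical helper, not part of Crate.

.. code-block:: python

    import re

    def pattern_analyze(text, pattern=r"\W+", lowercase=True):
        """Split text wherever the pattern matches and drop empty terms."""
        if lowercase:
            text = text.lower()
        return [term for term in re.split(pattern, text) if term]

    print(pattern_analyze("The quick-brown fox, 42!"))
    # ['the', 'quick', 'brown', 'fox', '42']

Note that the pattern describes the separators: everything between two matches becomes a term.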


.. _language-analyzer:

language
--------

``type='<language-name>'``

The following types are supported:

``arabic``, ``armenian``, ``basque``, ``brazilian``, ``bulgarian``, ``catalan``,
``chinese``, ``cjk``, ``czech``, ``danish``, ``dutch``, ``english``, ``finnish``,
``french``, ``galician``, ``german``, ``greek``, ``hindi``, ``hungarian``, ``indonesian``,
``italian``, ``norwegian``, ``persian``, ``portuguese``, ``romanian``, ``russian``,
``spanish``, ``swedish``, ``turkish``, ``thai``.

Parameters
..........

stopwords
    A list of stopwords to initialize the stop filter with. Defaults to the English stop words.

stopwords_path
    A path (either relative to config location, or absolute) to a stopwords file configuration.

The following analyzers support setting a custom ``stem_exclusion`` list:

``arabic``, ``armenian``, ``basque``, ``brazilian``, ``bulgarian``, ``catalan``, ``czech``,
``danish``, ``dutch``, ``english``, ``finnish``, ``french``, ``galician``, ``german``,
``hindi``, ``hungarian``, ``indonesian``, ``italian``, ``norwegian``, ``portuguese``,
``romanian``, ``russian``, ``spanish``, ``swedish``, ``turkish``.


.. _snowball-analyzer:

snowball
--------

``type='snowball'``

Uses the :ref:`standard-tokenizer` Tokenizer, with :ref:`standard-tokenfilter` filter,
:ref:`lowercase-tokenfilter` filter, :ref:`stop-tokenfilter` filter,
and :ref:`snowball-tokenfilter` filter.

Parameters
..........

stopwords
    A list of stopwords to initialize the stop filter with. Defaults to the English stop words.

language
    See the language-parameter of :ref:`snowball-tokenfilter`.


.. _builtin-tokenizer:

Builtin Tokenizer
=================


.. _standard-tokenizer:

standard
--------

``type='standard'``

A grammar-based tokenizer that works well for most European-language documents. It implements the
Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Parameters
..........

max_token_length
    The maximum token length. If a token is seen that exceeds this length then it is discarded.
    Defaults to ``255``.

.. _edgengram-tokenizer:

edgeNGram
---------

``type='edge_ngram'`` or ``type='edgeNGram'``

This tokenizer is very similar to :ref:`ngram-tokenizer` but only keeps n-grams which start at
the beginning of a token.

Parameters
..........

min_gram
    Minimum size in codepoints of a single n-gram. Defaults to 1.

max_gram
    Maximum size in codepoints of a single n-gram. Defaults to 2.

token_chars
    Character classes to keep in the tokens. The tokenizer splits on characters that do not belong
    to any of these classes. Defaults to ``[]`` (keep all characters).

    **Classes:** letter, digit, whitespace, punctuation, symbol
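
As a minimal sketch of what "n-grams which start at the beginning of a token" means, the
hypothetical helper below returns the prefixes of a string. It does not model ``token_chars``
and is not Crate's implementation.

.. code-block:: python

    def edge_ngrams(text, min_gram=1, max_gram=2):
        """Return the prefixes of text whose length lies in [min_gram, max_gram]."""
        longest = min(max_gram, len(text))
        return [text[:size] for size in range(min_gram, longest + 1)]

    print(edge_ngrams("quick"))                         # ['q', 'qu']
    print(edge_ngrams("quick", min_gram=2, max_gram=4)) # ['qu', 'qui', 'quic']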

.. _keyword-tokenizer:

keyword
-------

``type='keyword'``

Emits the entire input as a single token.

Parameters
..........

buffer_size
    The term buffer size. Defaults to ``256``.

.. _letter-tokenizer:

letter
------

``type='letter'``

Divides text at non-letters.


.. _lowercase-tokenizer:

lowercase
---------

``type='lowercase'``

Performs the function of :ref:`letter-tokenizer` and :ref:`lowercase-tokenfilter` together.
It divides text at non-letters and converts the resulting tokens to lower case.


.. _ngram-tokenizer:

ngram
-----

``type='ngram'`` or ``type='nGram'``

Parameters
..........

min_gram
    Minimum size in codepoints of a single n-gram. Defaults to 1.

max_gram
    Maximum size in codepoints of a single n-gram. Defaults to 2.

token_chars
    Character classes to keep in the tokens. The tokenizer splits on characters that do not belong
    to any of these classes. Defaults to ``[]`` (keep all characters).

    **Classes:** letter, digit, whitespace, punctuation, symbol
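
The effect of ``min_gram`` and ``max_gram`` can be illustrated with a small Python sketch.
``ngrams`` is a hypothetical helper, ``token_chars`` is not modelled, and the order in which the
real tokenizer emits its n-grams may differ from the order shown here.

.. code-block:: python

    def ngrams(text, min_gram=1, max_gram=2):
        """Return every substring of text with a length in [min_gram, max_gram]."""
        grams = []
        for start in range(len(text)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(text):
                    grams.append(text[start:start + size])
        return grams

    print(ngrams("fox"))  # ['f', 'fo', 'o', 'ox', 'x']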


.. _whitespace-tokenizer:

whitespace
----------

``type='whitespace'``

Divides text at whitespace.


.. _pattern-tokenizer:

pattern
-------

``type='pattern'``

Separates text into terms via a regular expression.

Parameters
..........

pattern
    The regular expression pattern, defaults to ``\W+``.

flags
    The regular expression flags.

group
    Which group to extract into tokens. Defaults to -1 (split).

.. note::

    The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. ``CASE_INSENSITIVE|COMMENTS``. Check the `Java Pattern API`_
for more details about flag options.
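
The difference between splitting (``group`` set to -1) and extracting a capture group can be
sketched with Python's ``re`` module. This is an approximation under the assumption that Python
and Java regular expressions agree for the patterns used; ``pattern_tokenize`` is a hypothetical
helper.

.. code-block:: python

    import re

    def pattern_tokenize(text, pattern=r"\W+", group=-1):
        """Split on matches when group is -1, otherwise emit the given group of each match."""
        regex = re.compile(pattern)
        if group == -1:
            pieces, last = [], 0
            for match in regex.finditer(text):
                pieces.append(text[last:match.start()])
                last = match.end()
            pieces.append(text[last:])
            return [piece for piece in pieces if piece]
        return [m.group(group) for m in regex.finditer(text) if m.group(group)]

    print(pattern_tokenize("foo, bar;baz"))                          # ['foo', 'bar', 'baz']
    print(pattern_tokenize("'a' and 'b'", r"'([^']+)'", group=1))    # ['a', 'b']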


.. _uaxemailurl-tokenizer:

uax email url
-------------

``type='uax_url_email'``

Exactly like the :ref:`standard-tokenizer` tokenizer, but tokenizes email addresses and URLs as
single tokens.

Parameters
..........

max_token_length
    The maximum token length. If a token is seen that exceeds this length then it is discarded.
    Defaults to ``255``.


.. _pathhierarchy-tokenizer:

path hierarchy
--------------

``type='path_hierarchy'``

Takes something like this::

    /something/something/else

And produces tokens::

    /something
    /something/something
    /something/something/else

Parameters
..........

delimiter
    The character delimiter to use, defaults to ``/``.

replacement
    An optional replacement character to use. Defaults to the delimiter.

buffer_size
    The buffer size to use, defaults to 1024.

reverse
    Generates tokens in reverse order, defaults to false.

skip
    Controls initial tokens to skip, defaults to 0.
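
The token list shown at the top of this section can be reproduced with a few lines of Python.
The sketch only models ``delimiter`` with default values for everything else; ``replacement``,
``reverse`` and ``skip`` are left out, and ``path_hierarchy`` is a hypothetical helper.

.. code-block:: python

    def path_hierarchy(path, delimiter="/"):
        """Return every prefix of path that ends right before a delimiter, plus the full path."""
        tokens = []
        position = path.find(delimiter, 1)
        while position != -1:
            tokens.append(path[:position])
            position = path.find(delimiter, position + 1)
        if path:
            tokens.append(path)
        return tokens

    for token in path_hierarchy("/something/something/else"):
        print(token)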


.. _builtin-token-filter:

Builtin Token Filter
====================

.. _standard-tokenfilter:

standard
--------

``type='standard'``

Normalizes tokens extracted with the :ref:`standard-tokenizer` Tokenizer.


.. _asciifolding-tokenfilter:

ascii folding
-------------

``type='asciifolding'``

Converts alphabetic, numeric, and symbolic Unicode characters
which are not in the first 127 ASCII characters
(the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

.. _length-tokenfilter:

length
------

``type='length'``

Removes words that are too long or too short for the stream.

Parameters
..........

min
    The minimum token length. Defaults to 0.

max
    The maximum token length. Defaults to ``Integer.MAX_VALUE``.


.. _lowercase-tokenfilter:

lowercase
---------

``type='lowercase'``

Normalizes token text to lower case.

Parameters
..........

language
    For options, see :ref:`language-analyzer` Analyzer.


.. _ngram-tokenfilter:

ngram
-----

``type='ngram'`` or ``type='nGram'``

Parameters
..........

min_gram
    Defaults to 1.

max_gram
    Defaults to 2.


.. _edgengram-tokenfilter:

edge ngram
----------

``type='edgeNGram'`` or ``type='edge_ngram'``

Parameters
..........

min_gram
    Defaults to 1.

max_gram
    Defaults to 2.

side
    Either front or back. Defaults to front.


.. _porterstem-tokenfilter:

porter stem
-----------

``type='porter_stem'``

Transforms the token stream as per the Porter stemming algorithm.

.. note::

    The input to the stemming filter must already be in lower case, so the
    :ref:`lowercase-tokenfilter` token filter or the :ref:`lowercase-tokenizer` tokenizer has to
    run earlier in the analysis chain. For example, when using a custom analyzer, make sure the
    ``lowercase`` filter comes before the ``porter_stem`` filter in the list of filters.


.. _shingle-tokenfilter:

shingle
-------

``type='shingle'``

Constructs shingles (token n-grams), combinations of tokens as a single token, from a token stream.

Parameters
..........

max_shingle_size
    The maximum shingle size. Defaults to 2.

min_shingle_size
    The minimum shingle size. Defaults to 2.

output_unigrams
    If true the output will contain the input tokens (unigrams) as well as the shingles.
    Defaults to true.

output_unigrams_if_no_shingles
    If output_unigrams is false the output will contain the input tokens (unigrams) if no
    shingles are available. Note if output_unigrams is set to true this setting has no effect.
    Defaults to false.

token_separator
    The string to use when joining adjacent tokens to form a shingle. Defaults to " ".
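
How the size limits, ``output_unigrams`` and ``token_separator`` interact can be sketched as
follows. ``shingles`` is a hypothetical helper that ignores ``output_unigrams_if_no_shingles``
and token positions; it is not Crate's implementation.

.. code-block:: python

    def shingles(tokens, min_shingle_size=2, max_shingle_size=2,
                 output_unigrams=True, token_separator=" "):
        """Join runs of adjacent tokens into shingles, optionally keeping the unigrams."""
        result = []
        for start in range(len(tokens)):
            if output_unigrams:
                result.append(tokens[start])
            for size in range(min_shingle_size, max_shingle_size + 1):
                if start + size <= len(tokens):
                    result.append(token_separator.join(tokens[start:start + size]))
        return result

    print(shingles(["please", "divide", "this"]))
    # ['please', 'please divide', 'divide', 'divide this', 'this']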


.. _stop-tokenfilter:

stop
----

``type='stop'``

Removes stop words from token streams.

Parameters
..........

stopwords
    A list of stop words to use. Defaults to english stop words.

stopwords_path
    A path (either relative to config location, or absolute) to a stopwords file configuration.
    Each stop word should be in its own "line" (separated by a line break). The file must be
    UTF-8 encoded.

ignore_case
    Set to true to lower case all words first. Defaults to false.

remove_trailing
    Set to false in order to not ignore the last term of a search if it is a stop word.
    Defaults to true.


.. _worddelimiter-tokenfilter:

word delimiter
--------------

``type='word_delimiter'``

Splits words into subwords and performs optional transformations on subword groups.

Parameters
..........

generate_word_parts
    If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.

generate_number_parts
    If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.

catenate_words
    If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to
    false.

catenate_numbers
    If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to
    false.

catenate_all
    If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to
    false.

split_on_case_change
    If true causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regardless).
    Defaults to true.

preserve_original
    If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to
    false.

split_on_numerics
    If true causes "j2se" to be three tokens; "j" "2" "se". Defaults to true.

stem_english_possessive
    If true causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil".
    Defaults to true.

protected_words
    A list of words protected from being delimited.

protected_words_path
    A relative or absolute path to a file configured with protected words (one on each line).
    If relative, automatically resolves to ``config/`` based location if exists.

type_table
    A custom type mapping table.
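
A few of the splitting rules above can be approximated with a regular expression. This sketch
only covers ASCII letters and digits and only the default behaviour of ``generate_word_parts``,
``generate_number_parts``, ``split_on_case_change`` and ``split_on_numerics``. ``word_parts`` is a
hypothetical helper, not the filter's real algorithm.

.. code-block:: python

    import re

    # A subword is a run of capitals, a (possibly capitalized) lowercase run, or a run of digits.
    SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

    def word_parts(token):
        """Split a token at delimiters, case changes and letter/digit boundaries."""
        return SUBWORD.findall(token)

    print(word_parts("PowerShot"))            # ['Power', 'Shot']
    print(word_parts("500-42"))               # ['500', '42']
    print(word_parts("j2se"))                 # ['j', '2', 'se']
    print("".join(word_parts("wi-fi-4000")))  # 'wifi4000', as with catenate_all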


.. _stemmer-tokenfilter:

stemmer
-------

``type='stemmer'``

A filter that stems words (similar to :ref:`snowball-tokenfilter`, but with more options).

Parameters
..........

language/name
    arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english,
    finnish, french, german, german2, greek, hungarian, italian, kp, kstem, lovins, latvian,
    norwegian, minimal_norwegian, porter, portuguese, romanian, russian, spanish, swedish,
    turkish, minimal_english, possessive_english, light_finnish, light_french, minimal_french,
    light_german, minimal_german, hindi, light_hungarian, indonesian, light_italian,
    light_portuguese, minimal_portuguese, light_russian, light_spanish, light_swedish.


.. _keywordmarker-tokenfilter:

keyword marker
--------------

``type='keyword_marker'``

Protects words from being modified by stemmers. Must be placed before any stemming filters.

Parameters
..........

keywords
    A list of words to use.

keywords_path
    A path (either relative to config location, or absolute) to a list of words.

ignore_case
    Set to true to lower case all words first. Defaults to false.


.. _kstem-tokenfilter:

kstem
-----

``type='kstem'``

A high-performance filter for English.
All terms must already be lowercased (use :ref:`lowercase-tokenfilter` filter) for this filter
to work correctly.


.. _snowball-tokenfilter:

snowball
--------

``type='snowball'``

A filter that stems words using a Snowball-generated stemmer.

Parameters
..........

language
    Possible values: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German,
    German2, Hungarian, Italian, Kp, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian,
    Spanish, Swedish, Turkish.


.. _synonym-tokenfilter:

synonym
-------

``type='synonym'``

Makes it easy to handle synonyms during the analysis process. Synonyms are configured using a
configuration file.

Parameters
..........

synonyms_path
    Path to the synonyms configuration file.

ignore_case
    Defaults to ``false``.

expand
    Defaults to ``true``.


.. _compoundword-tokenfilter:

compound word
-------------

``type='dictionary_decompounder'`` or ``type='hyphenation_decompounder'``

Decomposes compound words.

Parameters
..........

word_list
    A list of words to use.

word_list_path
    A path (either relative to config location, or absolute) to a list of words.

min_word_size
    Minimum word size (integer). Defaults to 5.

min_subword_size
    Minimum subword size (integer). Defaults to 2.

max_subword_size
    Maximum subword size (integer). Defaults to 15.

only_longest_match
    Whether to only match the longest subword (boolean). Defaults to false.


.. _reverse-tokenfilter:

reverse
-------

``type='reverse'``

Reverses each token.


.. _elision-tokenfilter:

elision
-------

``type='elision'``

Removes elisions.

Parameters
..........

articles
    A set of stop word articles, for example ``['j', 'l']`` for content like ``J'aime l'odeur.``
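
The effect of the ``articles`` parameter can be illustrated with a short Python sketch. It only
handles the plain ASCII apostrophe, and ``remove_elisions`` is a hypothetical helper rather than
the filter's actual implementation.

.. code-block:: python

    def remove_elisions(tokens, articles=("j", "l")):
        """Strip a leading article and its apostrophe from each token."""
        result = []
        for token in tokens:
            head, apostrophe, tail = token.partition("'")
            if apostrophe and head.lower() in articles:
                result.append(tail)
            else:
                result.append(token)
        return result

    print(remove_elisions(["J'aime", "l'odeur"]))  # ['aime', 'odeur']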


.. _truncate-tokenfilter:

truncate
--------

``type='truncate'``

Truncates tokens to a specific length.

Parameters
..........

length
    Number of characters to truncate to. Defaults to 10.


.. _unique-tokenfilter:

unique
------

``type='unique'``

Only indexes unique tokens during analysis. By default it is applied to the whole token stream.

Parameters
..........


only_on_same_position
    If set to true, it will only remove duplicate tokens on the same position.
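
The difference between the two modes can be sketched in Python, assuming that the token stream
is represented as ``(term, position)`` pairs. That representation and the ``unique`` helper are
assumptions made for this example only.

.. code-block:: python

    def unique(tokens, only_on_same_position=False):
        """Drop repeated terms, either across the whole stream or per position."""
        seen = set()
        result = []
        for term, position in tokens:
            key = (term, position) if only_on_same_position else term
            if key not in seen:
                seen.add(key)
                result.append((term, position))
        return result

    stream = [("fast", 0), ("quick", 0), ("fast", 0), ("fast", 1)]
    print(unique(stream))
    # [('fast', 0), ('quick', 0)]
    print(unique(stream, only_on_same_position=True))
    # [('fast', 0), ('quick', 0), ('fast', 1)]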


.. _patterncapture-tokenfilter:

pattern capture
---------------

``type='pattern_capture'``

Emits a token for every capture group in the regular expression.

Parameters
..........

preserve_original
    If set to true (the default), the original token is emitted as well.

.. _patternreplace-tokenfilter:

pattern replace
---------------

``type='pattern_replace'``

Handle string replacements based on a regular expression.

Parameters
..........

pattern
    Regular expression whose matches will be replaced.

replacement
    The replacement, can reference the original text with ``$1``-like (the first matched group)
    references.
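
Group references in the replacement can be illustrated with Python's ``re`` module. Note the
assumption behind this sketch: Python writes group references as ``\1`` where the Java syntax
used by the filter writes ``$1``. ``pattern_replace`` is a hypothetical helper.

.. code-block:: python

    import re

    def pattern_replace(tokens, pattern, replacement):
        """Replace every match of pattern inside each token."""
        regex = re.compile(pattern)
        return [regex.sub(replacement, token) for token in tokens]

    # Reorder a date; the Java-style replacement would be "$3.$2.$1".
    print(pattern_replace(["2014-05-01"], r"(\d+)-(\d+)-(\d+)", r"\3.\2.\1"))
    # ['01.05.2014']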


.. _trim-tokenfilter:

trim
----

``type='trim'``

Trims the whitespace surrounding a token.


.. _limittokencount-tokenfilter:

limit token count
-----------------

``type='limit'``

Limits the number of tokens that are indexed per document and field.

Parameters
..........

max_token_count
    The maximum number of tokens that should be indexed per document and field. Defaults to 1.

consume_all_tokens
    If set to true, the filter exhausts the stream even if ``max_token_count`` tokens have been
    consumed already. Defaults to false.

.. _hunspell-tokenfilter:

hunspell
--------

``type='hunspell'``

Basic support for hunspell stemming.
Hunspell dictionaries will be picked up from a dedicated hunspell directory on the filesystem
(defaults to ``<path.conf>/hunspell``).
Each dictionary is expected to have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the \*.aff and \*.dic files (all of which will
automatically be picked up).

Parameters
..........

ignore_case
    If true, dictionary matching will be case insensitive. Defaults to false.

strict_affix_parsing
    Determines whether errors while reading an affix rules file cause an exception or are simply
    ignored. Defaults to true.

locale
    A locale for this filter. If this is unset, the lang or language are used instead - so one
    of these has to be set.

dictionary
    The name of a dictionary. The path to your hunspell dictionaries should be configured
    via ``indices.analysis.hunspell.dictionary.location`` in the ``crate.yml`` config file.

dedup
    If only unique terms should be returned, this needs to be set to true. Defaults to true.

recursion_level
    Configures the recursion level a stemmer can go into. Defaults to 2.
    Some languages (for example czech) give better results when set to 1 or 0,
    so you should test it out.

.. _commongrams-tokenfilter:

common grams
------------

``type='common_grams'``

Generates bigrams for frequently occurring terms. Single terms are still indexed.
It can be used as an alternative to the :ref:`stop-tokenfilter` Token filter when we don’t want
to completely ignore common terms.

Parameters
..........

common_words
    A list of common words to use.

common_words_path
    A path (either relative to config location, or absolute) to a list of common words.
    Each word should be in its own "line" (separated by a line break). The file must be UTF-8
    encoded.

ignore_case
    If true, common words matching will be case insensitive (defaults to false).

query_mode
    Generates bigrams then removes common words and single terms followed by a common word
    (defaults to false).

.. note::

    Either ``common_words`` or ``common_words_path`` must be given.
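
The idea behind this filter can be sketched in Python: every token is kept, and a bigram is
added whenever one of two adjacent tokens is a common word. The ``_`` separator, the output order
and the ``common_grams`` helper are assumptions of this sketch, and ``query_mode`` is not
modelled.

.. code-block:: python

    def common_grams(tokens, common_words, separator="_"):
        """Keep all tokens and add a bigram wherever a common word is involved."""
        common = set(common_words)
        result = []
        for index, token in enumerate(tokens):
            result.append(token)
            if index + 1 < len(tokens):
                following = tokens[index + 1]
                if token in common or following in common:
                    result.append(token + separator + following)
        return result

    print(common_grams(["the", "quick", "brown", "fox"], ["the"]))
    # ['the', 'the_quick', 'quick', 'brown', 'fox']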

.. _normalization-tokenfilter:

normalization
-------------

``type='arabic_normalization'`` or ``type='persian_normalization'``


.. _delimited_payload-tokenfilter:

delimited payload
-----------------

``type='delimited_payload_filter'``

Split tokens up by delimiter (default ``|``) into the real token being indexed and the payload
stored additionally into the index. For example ``Trillian|65535`` will be indexed as ``Trillian``
with ``65535`` as payload.

Parameters
..........

encoding
    How the payload should be interpreted, possible values are ``float`` for float values,
    ``int`` for integer values and ``identity`` for keeping the payload as byte array (string).

delimiter
    The string used to separate the token and its payload.
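
A minimal sketch of the split and of the three encodings, using the example from above.
``split_payload`` is a hypothetical helper, and representing the ``identity`` payload as UTF-8
bytes is an assumption of this example.

.. code-block:: python

    def split_payload(token, encoding, delimiter="|"):
        """Split a token into (term, payload) and decode the payload."""
        term, found, raw = token.partition(delimiter)
        if not found:
            return term, None
        if encoding == "float":
            return term, float(raw)
        if encoding == "int":
            return term, int(raw)
        if encoding == "identity":
            return term, raw.encode("utf-8")
        raise ValueError("unknown encoding: " + encoding)

    print(split_payload("Trillian|65535", encoding="int"))  # ('Trillian', 65535)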


.. _keep-tokenfilter:

keep
----

``type='keep'``

Only keeps tokens listed in this filter's ``keep_words`` or ``keep_words_path`` setting.
All other tokens are removed. This filter works like an inverse :ref:`stop-tokenfilter` filter.

Parameters
..........

keep_words
    A list of words to keep and index as tokens.

keep_words_path
    A path (either relative to config location, or absolute) to a list of words to keep and index.
    Each word should be in its own "line" (separated by a line break). The file must be UTF-8
    encoded.

.. _stemmer_override-tokenfilter:

stemmer override
----------------

``type='stemmer_override'``

Override any previous stemmer that recognizes keywords with a custom mapping,
defined by ``rules`` or ``rules_path``. One of these settings has to be set.

Parameters
..........

rules
    A list of rules for overriding, in the form ``<source>=><replacement>``, e.g. ``"foo=>bar"``.

rules_path
    A path to a file with one rule per line, like above.
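
The rule format can be illustrated by parsing it in Python. Both helpers are hypothetical, and
the sketch only shows the term replacement, not how the filter interacts with later stemmers.

.. code-block:: python

    def parse_rules(rules):
        """Turn rules of the form 'source=>replacement' into a dict."""
        mapping = {}
        for rule in rules:
            source, arrow, replacement = rule.partition("=>")
            if not arrow:
                raise ValueError("missing '=>' in rule: " + rule)
            mapping[source.strip()] = replacement.strip()
        return mapping

    def override_stems(tokens, rules):
        """Replace every token that has a rule and leave the others untouched."""
        mapping = parse_rules(rules)
        return [mapping.get(token, token) for token in tokens]

    print(override_stems(["foo", "baz"], ["foo=>bar"]))  # ['bar', 'baz']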


.. _cjk_bigram-tokenfilter:

cjk bigram
----------

``type='cjk_bigram'``

Handle Chinese, Japanese and Korean (CJK) bigrams.

Parameters
..........

output_unigrams
    Boolean flag to enable a combined unigram+bigram approach.
    Default is ``false``, so single CJK characters that do not form a bigram are passed as unigrams.
    All non-CJK characters are output unmodified.

ignored_scripts
    Scripts to ignore. possible values: ``han``, ``hiragana``, ``katakana``, ``hangul``


.. _cjk_width-tokenfilter:

cjk width
---------

``type='cjk_width'``

A filter that normalizes CJK.


.. _language_stem-tokenfilter:

language stem
-------------

| ``type='arabic_stem'`` or
| ``type='brazilian_stem'`` or
| ``type='czech_stem'`` or
| ``type='dutch_stem'`` or
| ``type='french_stem'`` or
| ``type='german_stem'`` or
| ``type='russian_stem'``

A group of filters that apply language-specific stemmers to the token stream.
To prevent terms from being stemmed, put a :ref:`keywordmarker-tokenfilter` filter before this
filter in the ``token_filter`` chain.

.. _builtin-char-filter:

Builtin Char Filter
===================

.. _mapping-charfilter:

mapping
-------

``type='mapping'``

Parameters
..........

mappings
    A list of mappings as strings of the form ``<source>=><replacement>``, e.g. ``"ph=>f"``.

mappings_path
    A path to a file with one mapping per line, like above.
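
What such a list of mappings does to the character stream can be sketched in Python. Preferring
the longest matching source is an assumption of this sketch, and ``apply_mappings`` is a
hypothetical helper.

.. code-block:: python

    def apply_mappings(text, mappings):
        """Replace each occurrence of a mapping source, trying longer sources first."""
        pairs = [mapping.split("=>", 1) for mapping in mappings]
        pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
        result = []
        index = 0
        while index < len(text):
            for source, replacement in pairs:
                if source and text.startswith(source, index):
                    result.append(replacement)
                    index += len(source)
                    break
            else:
                # No mapping starts here: copy the character unchanged.
                result.append(text[index])
                index += 1
        return "".join(result)

    print(apply_mappings("photo graph", ["ph=>f"]))  # foto graf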


.. _htmlstrip-charfilter:

html strip
----------

``type='html_strip'``

Strips out HTML elements from an analyzed text.


.. _patternreplace-charfilter:

pattern replace
---------------

``type='pattern_replace'``

Manipulates the characters in a string before analysis with a regex.

Parameters
..........

pattern
    Regex whose matches will be replaced.

replacement
    Replacement string; it can reference the matched text with ``$1``-like references (the first
    matched group).


.. _Java Pattern Api: http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary
.. _`UAX#29`: http://www.unicode.org/reports/tr29/