xapian-core  1.4.25
Xapian::TermGenerator Class Reference

Parses a piece of text and generates terms.

#include <termgenerator.h>

Public Types

enum  { FLAG_SPELLING = 128 , FLAG_NGRAMS = 2048 , FLAG_CJK_NGRAM = FLAG_NGRAMS }
 Flags to OR together and pass to TermGenerator::set_flags().
 
enum  stem_strategy {
  STEM_NONE , STEM_SOME , STEM_ALL , STEM_ALL_Z ,
  STEM_SOME_FULL_POS
}
 Stemming strategies, for use with set_stemming_strategy().
 
enum  stop_strategy { STOP_NONE , STOP_ALL , STOP_STEMMED }
 Stopper strategies, for use with set_stopper_strategy().
 
typedef int flags
 For backward compatibility with Xapian 1.2.
 

Public Member Functions

 TermGenerator (const TermGenerator &o)
 Copy constructor.
 
TermGenerator & operator= (const TermGenerator &o)
 Assignment.
 
 TermGenerator ()
 Default constructor.
 
 ~TermGenerator ()
 Destructor.
 
void set_stemmer (const Xapian::Stem &stemmer)
 Set the Xapian::Stem object to be used for generating stemmed terms.
 
void set_stopper (const Xapian::Stopper *stop=NULL)
 Set the Xapian::Stopper object to be used for identifying stopwords.
 
void set_document (const Xapian::Document &doc)
 Set the current document.
 
const Xapian::Document & get_document () const
 Get the current document.
 
void set_database (const Xapian::WritableDatabase &db)
 Set the database to index spelling data to.
 
flags set_flags (flags toggle, flags mask=flags(0))
 Set flags.
 
void set_stemming_strategy (stem_strategy strategy)
 Set the stemming strategy.
 
void set_stopper_strategy (stop_strategy strategy)
 Set the stopper strategy.
 
void set_max_word_length (unsigned max_word_length)
 Set the maximum length word to index.
 
void index_text (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text.
 
void index_text (const std::string &text, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text in a std::string.
 
void index_text_without_positions (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text without positional information.
 
void index_text_without_positions (const std::string &text, Xapian::termcount wdf_inc=1, const std::string &prefix=std::string())
 Index some text in a std::string without positional information.
 
void increase_termpos (Xapian::termpos delta=100)
 Increase the term position used by index_text.
 
Xapian::termpos get_termpos () const
 Get the current term position.
 
void set_termpos (Xapian::termpos termpos)
 Set the current term position.
 
std::string get_description () const
 Return a string describing this object.
 

Detailed Description

Parses a piece of text and generates terms.

This module takes a piece of text and parses it to produce words which are then used to generate suitable terms for indexing. The terms generated are suitable for use with Query objects produced by the QueryParser class.

Member Enumeration Documentation

◆ anonymous enum

anonymous enum

Flags to OR together and pass to TermGenerator::set_flags().

Enumerator
FLAG_SPELLING 

Index data required for spelling correction.

FLAG_NGRAMS 

Generate n-grams for scripts without explicit word breaks.

Spans of characters in such scripts are split into unigrams and bigrams, with the unigrams carrying positional information. Text in other scripts is split into words as normal.

The QueryParser::FLAG_NGRAMS flag needs to be passed to QueryParser.

This mode can also be enabled in 1.2.8 and later by setting environment variable XAPIAN_CJK_NGRAM to a non-empty value (but doing so was deprecated in 1.4.11).

In 1.4.x this feature was specific to CJK (Chinese, Japanese and Korean), but in 1.5.0 it's been extended to other languages. To reflect this change the new and preferred name is FLAG_NGRAMS, which was added as an alias for forward compatibility in Xapian 1.4.23. Use FLAG_CJK_NGRAM instead if you aim to support Xapian < 1.4.23.

Added in Xapian 1.4.23.

FLAG_CJK_NGRAM 

Generate n-grams for scripts without explicit word breaks.

Old name: use FLAG_NGRAMS instead unless you aim to support Xapian < 1.4.23.

Added in Xapian 1.3.4 and 1.2.22.

Member Function Documentation

◆ increase_termpos()

void Xapian::TermGenerator::increase_termpos ( Xapian::termpos  delta = 100)

Increase the term position used by index_text.

This can be used between indexing text from different fields or other places to prevent phrase searches from spanning between them (e.g. between the title and body text, or between two chapters in a book).

Parameters
delta  Amount to increase the term position by (default: 100).
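
To illustrate why a position gap stops phrases from spanning fields, here is a minimal stand-alone sketch. It does not use Xapian: `ToyIndexer`, `phrase_match` and `phrase_spans_fields` are hypothetical names, and the model assumes that a two-word phrase matches only when the words sit at consecutive positions.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a position-aware indexer; this is NOT Xapian code.
struct ToyIndexer {
    unsigned termpos = 0;  // position given to the most recently indexed word
    std::map<std::string, std::vector<unsigned>> positions;

    // Split on whitespace and give each word the next position.
    void index_text(const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) positions[word].push_back(++termpos);
    }

    // Leave a gap so that later words are not adjacent to earlier ones.
    void increase_termpos(unsigned delta = 100) { termpos += delta; }

    // True if `second` occurs at the position directly after `first`.
    bool phrase_match(const std::string& first, const std::string& second) const {
        auto a = positions.find(first);
        auto b = positions.find(second);
        if (a == positions.end() || b == positions.end()) return false;
        for (unsigned p : a->second)
            for (unsigned q : b->second)
                if (q == p + 1) return true;
        return false;
    }
};

// Index a title and then a body, optionally separated by a gap, and report
// whether the phrase "search engines" matches across the field boundary.
bool phrase_spans_fields(bool use_gap) {
    ToyIndexer idx;
    idx.index_text("open source search");       // title
    if (use_gap) idx.increase_termpos();        // gap of 100 positions
    idx.index_text("engines index documents");  // body
    return idx.phrase_match("search", "engines");
}
```

In this model `phrase_spans_fields(false)` reports a match because "search" and "engines" land on adjacent positions, while `phrase_spans_fields(true)` does not.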

◆ index_text() [1/2]

void Xapian::TermGenerator::index_text ( const std::string &  text,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)
inline

Index some text in a std::string.

Parameters
text     The text to index.
wdf_inc  The wdf increment (default 1).
prefix   The term prefix to use (default is no prefix).

◆ index_text() [2/2]

void Xapian::TermGenerator::index_text ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)

Index some text.

Parameters
itor     Utf8Iterator pointing to the text to index.
wdf_inc  The wdf increment (default 1).
prefix   The term prefix to use (default is no prefix).

◆ index_text_without_positions() [1/2]

void Xapian::TermGenerator::index_text_without_positions ( const std::string &  text,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)
inline

Index some text in a std::string without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
text     The text to index.
wdf_inc  The wdf increment (default 1).
prefix   The term prefix to use (default is no prefix).

◆ index_text_without_positions() [2/2]

void Xapian::TermGenerator::index_text_without_positions ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
const std::string &  prefix = std::string() 
)

Index some text without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
itor     Utf8Iterator pointing to the text to index.
wdf_inc  The wdf increment (default 1).
prefix   The term prefix to use (default is no prefix).

◆ set_flags()

flags Xapian::TermGenerator::set_flags ( flags  toggle,
flags  mask = flags(0) 
)

Set flags.

The new value of flags is: (flags & mask) ^ toggle

To just set the flags, pass the new flags in toggle and the default value for mask.

Parameters
toggle  Flags to XOR.
mask    Flags to AND with first.
Returns
The old flags setting.
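
The update rule can be worked through with plain integers. The sketch below does not call Xapian: `apply_flags` is a hypothetical helper that mirrors the documented formula, the two constants copy the values from the enum above, and `unsigned` is used here for the arithmetic rather than the library's own `flags` type.

```cpp
// Hypothetical helper mirroring the documented rule: (flags & mask) ^ toggle.
constexpr unsigned apply_flags(unsigned current, unsigned toggle, unsigned mask = 0) {
    return (current & mask) ^ toggle;
}

// Values copied from the enum documented above.
constexpr unsigned FLAG_SPELLING = 128;
constexpr unsigned FLAG_NGRAMS = 2048;

// Replace all flags: the default mask of 0 discards the old value.
constexpr unsigned replaced = apply_flags(FLAG_SPELLING, FLAG_NGRAMS);

// Turn one flag on and keep the rest: mask its bit out first, then XOR it in.
constexpr unsigned added = apply_flags(FLAG_SPELLING, FLAG_NGRAMS, ~FLAG_NGRAMS);

// Turn one flag off and keep the rest: mask its bit out and XOR nothing.
constexpr unsigned removed = apply_flags(FLAG_SPELLING | FLAG_NGRAMS, 0, ~FLAG_NGRAMS);

// Flip one flag: keep every bit with an all-ones mask, then XOR the flag.
constexpr unsigned flipped = apply_flags(FLAG_SPELLING, FLAG_SPELLING, ~0u);
```

Under these assumptions `replaced` is 2048, `added` is 2176, `removed` is 128 and `flipped` is 0.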

◆ set_max_word_length()

void Xapian::TermGenerator::set_max_word_length ( unsigned  max_word_length)

Set the maximum length word to index.

The limit is on the length of a word prior to stemming and prior to adding any term prefix.

The backends mostly impose a limit on the length of terms (often about 240 bytes), but it's generally useful to have a lower limit to help prevent the index being bloated by useless junk terms from trying to index things like binary data, uuencoded data, ASCII art, etc.

This method was new in Xapian 1.3.1.

Parameters
max_word_length  The maximum length word to index, in bytes in UTF-8 representation. Default is 64.
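
Because the limit counts bytes of the UTF-8 encoding rather than characters, a word containing non-ASCII characters reaches it sooner than its character count suggests. The sketch below is a stand-alone illustration, not Xapian code: `fits_word_limit` is a hypothetical helper, and it assumes that a word exactly at the limit is still indexed while longer words are skipped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical check, NOT part of Xapian. std::string::size() returns the
// number of bytes, which is the unit the documented limit is expressed in.
// Assumption: only words strictly longer than the limit are skipped.
bool fits_word_limit(const std::string& utf8_word,
                     std::size_t max_word_length = 64) {
    return utf8_word.size() <= max_word_length;
}
```

For example, the five-character word "naïve" occupies six bytes in UTF-8 (written as `"na\xc3\xafve"` in source), so it fits a limit of 6 but not a limit of 5.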

◆ set_stemming_strategy()

void Xapian::TermGenerator::set_stemming_strategy ( stem_strategy  strategy)

Set the stemming strategy.

This method controls how the stemming algorithm is applied. It was new in Xapian 1.3.1.

Parameters
strategy  The strategy to use - possible values are:
  • STEM_NONE: Don't perform any stemming - only unstemmed terms are generated.
  • STEM_SOME: Generate both stemmed (with a "Z" prefix) and unstemmed terms. No positional information is stored for stemmed terms. This is the default strategy.
  • STEM_SOME_FULL_POS: Like STEM_SOME but positional information is stored for both stemmed and unstemmed terms. Added in Xapian 1.4.8.
  • STEM_ALL: Generate only stemmed terms (but without a "Z" prefix).
  • STEM_ALL_Z: Generate only stemmed terms (with a "Z" prefix).
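
The list above can be summarised as a small lookup. This is a stand-alone sketch rather than Xapian code: the enum is redeclared locally, `terms_for` is a hypothetical function, the stem is supplied by the caller instead of being computed, and positional information and term prefixes are not modelled (so STEM_SOME and STEM_SOME_FULL_POS look identical here).

```cpp
#include <string>
#include <vector>

// Local copy of the strategy names for illustration only.
enum stem_strategy { STEM_NONE, STEM_SOME, STEM_ALL, STEM_ALL_Z, STEM_SOME_FULL_POS };

// Hypothetical model: which terms one word gives rise to under each strategy.
std::vector<std::string> terms_for(const std::string& word,
                                   const std::string& stem,
                                   stem_strategy strategy) {
    switch (strategy) {
        case STEM_NONE:
            return {word};              // unstemmed only
        case STEM_SOME:
        case STEM_SOME_FULL_POS:
            return {word, "Z" + stem};  // unstemmed plus Z-prefixed stem
        case STEM_ALL:
            return {stem};              // stem only, no prefix
        case STEM_ALL_Z:
            return {"Z" + stem};        // stem only, with Z prefix
    }
    return {};
}
```

With the word "running" and an assumed stem of "run", STEM_SOME yields "running" and "Zrun", STEM_ALL yields "run", and STEM_ALL_Z yields "Zrun".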

◆ set_stopper()

void Xapian::TermGenerator::set_stopper ( const Xapian::Stopper *  stop = NULL)

Set the Xapian::Stopper object to be used for identifying stopwords.

Stemmed forms of stopwords aren't indexed, but unstemmed forms still are so that searches for phrases including stop words still work.

Parameters
stop  The Stopper object to set (default NULL, which means no stopwords).

◆ set_stopper_strategy()

void Xapian::TermGenerator::set_stopper_strategy ( stop_strategy  strategy)

Set the stopper strategy.

This method controls how the stopper is used. It was added in Xapian 1.4.1.

You need to also call set_stopper() for this to have any effect.

Parameters
strategy  The strategy to use - possible values are:
  • STOP_NONE: Don't use the stopper.
  • STOP_ALL: If a word is identified as a stop word, skip it completely.
  • STOP_STEMMED: If a word is identified as a stop word, index its unstemmed form but skip the stem. Unstemmed forms are indexed with positional information by default, so this allows searches for phrases containing stopwords to be supported. This is the default mode.
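
The three strategies can be expressed as a decision over which forms of a word get indexed. The following is a stand-alone sketch, not Xapian code: the enum is redeclared locally, `IndexedForms` and `forms_indexed` are hypothetical names, and the model assumes a stopper has been set and that both stemmed and unstemmed forms would otherwise be generated.

```cpp
// Local copy of the strategy names for illustration only.
enum stop_strategy { STOP_NONE, STOP_ALL, STOP_STEMMED };

// Which forms of a word end up in the index.
struct IndexedForms {
    bool unstemmed;
    bool stemmed;
};

// Hypothetical model of the documented behaviour for a single word.
IndexedForms forms_indexed(bool is_stopword, stop_strategy strategy) {
    // Ordinary words, or any word when the stopper is ignored: index both forms.
    if (!is_stopword || strategy == STOP_NONE) return {true, true};
    // STOP_ALL: the stop word is skipped completely.
    if (strategy == STOP_ALL) return {false, false};
    // STOP_STEMMED: keep the unstemmed form so phrases still work, drop the stem.
    return {true, false};
}
```

In this model a stop word under STOP_STEMMED keeps only its unstemmed form, which is what lets a phrase search containing that word still match.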

◆ set_termpos()

void Xapian::TermGenerator::set_termpos ( Xapian::termpos  termpos)

Set the current term position.

Parameters
termpos  The new term position to set.

The documentation for this class was generated from the following file: