xapian-core  2.0.0
Xapian::TermGenerator Class Reference

Parses a piece of text and generates terms.

#include <termgenerator.h>

Classes

class  Internal
 

Public Types

enum  { FLAG_SPELLING = 128, FLAG_NGRAMS = 2048, FLAG_CJK_NGRAM = FLAG_NGRAMS, FLAG_WORD_BREAKS = 4096 }
 Flags to OR together and pass to TermGenerator::set_flags().

enum  stem_strategy {
  STEM_NONE, STEM_SOME, STEM_ALL, STEM_ALL_Z,
  STEM_SOME_FULL_POS
}
 Stemming strategies, for use with set_stemming_strategy().

enum  stop_strategy { STOP_NONE, STOP_ALL, STOP_STEMMED }
 Stopper strategies, for use with set_stopper_strategy().

typedef int flags
 For backward compatibility with Xapian 1.2.
 

Public Member Functions

 TermGenerator (const TermGenerator &o)
 Copy constructor.

TermGenerator & operator= (const TermGenerator &o)
 Assignment.

 TermGenerator (TermGenerator &&o)
 Move constructor.

TermGenerator & operator= (TermGenerator &&o)
 Move assignment operator.

 TermGenerator ()
 Default constructor.

 ~TermGenerator ()
 Destructor.

void set_stemmer (const Xapian::Stem &stemmer)
 Set the Xapian::Stem object to be used for generating stemmed terms.

void set_stopper (const Xapian::Stopper *stop=NULL)
 Set the Xapian::Stopper object to be used for identifying stopwords.

void set_document (const Xapian::Document &doc)
 Set the current document.

const Xapian::Document & get_document () const
 Get the current document.

void set_database (const Xapian::WritableDatabase &db)
 Set the database to index spelling data to.

flags set_flags (flags toggle, flags mask=flags(0))
 Set flags.

void set_stemming_strategy (stem_strategy strategy)
 Set the stemming strategy.

void set_stopper_strategy (stop_strategy strategy)
 Set the stopper strategy.

void set_max_word_length (unsigned max_word_length)
 Set the maximum length word to index.

void index_text (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, std::string_view prefix={})
 Index some text.

void index_text (std::string_view text, Xapian::termcount wdf_inc=1, std::string_view prefix={})
 Index some text.

void index_text_without_positions (const Xapian::Utf8Iterator &itor, Xapian::termcount wdf_inc=1, std::string_view prefix={})
 Index some text without positional information.

void index_text_without_positions (std::string_view text, Xapian::termcount wdf_inc=1, std::string_view prefix={})
 Index some text without positional information.

void increase_termpos (Xapian::termpos delta=100)
 Increase the term position used by index_text.

Xapian::termpos get_termpos () const
 Get the current term position.

void set_termpos (Xapian::termpos termpos)
 Set the current term position.

void set_termpos_limit (Xapian::termpos termpos_limit)
 Set the term position limit.

std::string get_description () const
 Return a string describing this object.
 

Private Attributes

Xapian::Internal::intrusive_ptr_nonnull< Internal > internal
 

Detailed Description

Parses a piece of text and generates terms.

This module takes a piece of text and parses it to produce words which are then used to generate suitable terms for indexing. The terms generated are suitable for use with Query objects produced by the QueryParser class.

Definition at line 49 of file termgenerator.h.

Member Typedef Documentation

◆ flags

For backward compatibility with Xapian 1.2.

Definition at line 97 of file termgenerator.h.

Member Enumeration Documentation

◆ anonymous enum

anonymous enum

Flags to OR together and pass to TermGenerator::set_flags().

Enumerator
FLAG_SPELLING 

Index data required for spelling correction.

FLAG_NGRAMS 

Generate n-grams for scripts without explicit word breaks.

Spans of characters in such scripts are split into unigrams and bigrams, with the unigrams carrying positional information. Text in other scripts is split into words as normal.

The QueryParser::FLAG_NGRAMS flag needs to be passed to QueryParser.

This mode can also be enabled in 1.2.8 and later by setting environment variable XAPIAN_CJK_NGRAM to a non-empty value (but doing so was deprecated in 1.4.11).

In 1.4.x this feature was specific to CJK (Chinese, Japanese and Korean), but in 2.0.0 it's been extended to other languages. To reflect this change the new and preferred name is FLAG_NGRAMS, which was added as an alias for forward compatibility in Xapian 1.4.23. Use FLAG_CJK_NGRAM instead if you aim to support Xapian < 1.4.23.

Since
Added in Xapian 1.4.23.

FLAG_CJK_NGRAM 

Generate n-grams for scripts without explicit word breaks.

Old name - use FLAG_NGRAMS instead unless you aim to support Xapian < 1.4.23.

Since
Added in Xapian 1.3.4 and 1.2.22.

FLAG_WORD_BREAKS 

Find word breaks for text in scripts without explicit word breaks.

With this option enabled, spans of text written in such scripts are split into words using ICU (which uses heuristics and/or dictionaries to do so). Text in other scripts is split into words as normal.

The QueryParser::FLAG_WORD_BREAKS flag needs to be passed to QueryParser.

Since
Added in Xapian 2.0.0.

Definition at line 100 of file termgenerator.h.
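
To make the FLAG_NGRAMS description concrete, the following is a toy sketch of splitting a span into unigrams and bigrams. It is not Xapian's implementation: the function names are hypothetical, the input is assumed to be well-formed UTF-8, and the order in which the n-grams are emitted is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split well-formed UTF-8 text into one string per code point.
std::vector<std::string> utf8_chars(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Hypothetical model: emit each character as a unigram, followed by the
// bigram it forms with the next character (if there is one).
std::vector<std::string> unigrams_and_bigrams(const std::string& span) {
    std::vector<std::string> chars = utf8_chars(span);
    std::vector<std::string> grams;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        grams.push_back(chars[i]);
        if (i + 1 < chars.size()) {
            grams.push_back(chars[i] + chars[i + 1]);
        }
    }
    return grams;
}
```

For a two-character span this yields the first character, the two-character bigram, and the second character.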

◆ stem_strategy

Stemming strategies, for use with set_stemming_strategy().

Enumerator
STEM_NONE 
STEM_SOME 
STEM_ALL 
STEM_ALL_Z 
STEM_SOME_FULL_POS 

Definition at line 153 of file termgenerator.h.

◆ stop_strategy

Stopper strategies, for use with set_stopper_strategy().

Enumerator
STOP_NONE 
STOP_ALL 
STOP_STEMMED 

Definition at line 158 of file termgenerator.h.

Constructor & Destructor Documentation

◆ TermGenerator() [1/3]

TermGenerator::TermGenerator ( const TermGenerator &  o)
default

Copy constructor.

◆ TermGenerator() [2/3]

TermGenerator::TermGenerator ( TermGenerator &&  o)
default

Move constructor.

◆ TermGenerator() [3/3]

TermGenerator::TermGenerator ( )

Default constructor.

Definition at line 46 of file termgenerator.cc.

◆ ~TermGenerator()

TermGenerator::~TermGenerator ( )

Destructor.

Definition at line 48 of file termgenerator.cc.

Member Function Documentation

◆ get_description()

string TermGenerator::get_description ( ) const

Return a string describing this object.

Definition at line 148 of file termgenerator.cc.

References Xapian::TermGenerator::Internal::cur_pos, internal, Xapian::TermGenerator::Internal::stopper, and Xapian::Internal::str().

Referenced by DEFINE_TESTCASE().

◆ get_document()

const Xapian::Document & TermGenerator::get_document ( ) const

Get the current document.

Definition at line 70 of file termgenerator.cc.

◆ get_termpos()

Xapian::termpos TermGenerator::get_termpos ( ) const

Get the current term position.

Definition at line 130 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ increase_termpos()

void TermGenerator::increase_termpos ( Xapian::termpos  delta = 100)

Increase the term position used by index_text.

This can be used between indexing text from different fields or other places to prevent phrase searches from spanning between them (e.g. between the title and body text, or between two chapters in a book).

Parameters
delta: Amount to increase the term position by (default: 100).

Definition at line 124 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().
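
The effect of the gap can be illustrated with a toy model. This sketch assumes positions advance by one per indexed word and that a two-word phrase needs adjacent positions; PositionCounter and adjacent are hypothetical names and do not reflect Xapian's internals.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: positions are a simple running counter, one per word.
struct PositionCounter {
    unsigned current = 0;

    // Assign consecutive positions to the words and return them.
    std::vector<unsigned> index(const std::vector<std::string>& words) {
        std::vector<unsigned> positions;
        for (std::size_t i = 0; i < words.size(); ++i) {
            positions.push_back(++current);
        }
        return positions;
    }

    // Mirrors the documented default delta of 100.
    void increase(unsigned delta = 100) { current += delta; }
};

// In this model two terms can form a phrase only at adjacent positions.
inline bool adjacent(unsigned first, unsigned second) {
    return second == first + 1;
}
```

Indexing a two-word title, calling increase(), and then indexing the body leaves the last title word and the first body word 101 positions apart in this model, so they are never adjacent.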

◆ index_text() [1/2]

void TermGenerator::index_text ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
std::string_view  prefix = {} 
)

Index some text.

Parameters
itor: Utf8Iterator pointing to the text to index.
wdf_inc: The wdf increment (default 1).
prefix: The term prefix to use (default is no prefix).

Definition at line 108 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ index_text() [2/2]

void Xapian::TermGenerator::index_text ( std::string_view  text,
Xapian::termcount  wdf_inc = 1,
std::string_view  prefix = {} 
)
inline

Index some text.

Parameters
text: The text to index.
wdf_inc: The wdf increment (default 1).
prefix: The term prefix to use (default is no prefix).

Definition at line 249 of file termgenerator.h.

◆ index_text_without_positions() [1/2]

void TermGenerator::index_text_without_positions ( const Xapian::Utf8Iterator &  itor,
Xapian::termcount  wdf_inc = 1,
std::string_view  prefix = {} 
)

Index some text without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
itor: Utf8Iterator pointing to the text to index.
wdf_inc: The wdf increment (default 1).
prefix: The term prefix to use (default is no prefix).

Definition at line 116 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ index_text_without_positions() [2/2]

void Xapian::TermGenerator::index_text_without_positions ( std::string_view  text,
Xapian::termcount  wdf_inc = 1,
std::string_view  prefix = {} 
)
inline

Index some text without positional information.

Just like index_text, but no positional information is generated. This means that the database will be significantly smaller, but that phrase searching and NEAR won't be supported.

Parameters
text: The text to index.
wdf_inc: The wdf increment (default 1).
prefix: The term prefix to use (default is no prefix).

Definition at line 279 of file termgenerator.h.

◆ operator=() [1/2]

TermGenerator & TermGenerator::operator= ( const TermGenerator &  o)
default

Assignment.

◆ operator=() [2/2]

TermGenerator & TermGenerator::operator= ( TermGenerator &&  o)
default

Move assignment operator.

◆ set_database()

void TermGenerator::set_database ( const Xapian::WritableDatabase &  db)

Set the database to index spelling data to.

Definition at line 76 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_document()

void TermGenerator::set_document ( const Xapian::Document &  doc)

Set the current document.

Definition at line 63 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ set_flags()

TermGenerator::flags TermGenerator::set_flags ( flags  toggle,
flags  mask = flags(0) 
)

Set flags.

The new value of flags is: (flags & mask) ^ toggle

To just set the flags, pass the new flags in toggle and the default value for mask.

Parameters
toggle: Flags to XOR.
mask: Flags to AND with first.
Returns
The old flags setting.

Definition at line 82 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().
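
The update rule (flags & mask) ^ toggle can be checked with plain integer arithmetic. The sketch below does not call Xapian: apply_flags is a hypothetical stand-in for the rule, and the two flag values are copied from the enum documented for this class.

```cpp
// Flag values copied from the enum documented for this class.
constexpr int FLAG_SPELLING = 128;
constexpr int FLAG_NGRAMS = 2048;

// Hypothetical stand-in for the documented rule: (flags & mask) ^ toggle.
constexpr int apply_flags(int current, int toggle, int mask = 0) {
    return (current & mask) ^ toggle;
}

// Default mask (0): the old flags are discarded, so toggle is the new value.
static_assert(apply_flags(FLAG_NGRAMS, FLAG_SPELLING) == FLAG_SPELLING,
              "default mask replaces the flags");

// Set one flag and keep the rest: mask that bit out, then toggle it on.
static_assert(apply_flags(FLAG_NGRAMS, FLAG_SPELLING, ~FLAG_SPELLING) ==
                  (FLAG_NGRAMS | FLAG_SPELLING),
              "set a single flag");

// Clear one flag and keep the rest: mask that bit out and toggle nothing.
static_assert(apply_flags(FLAG_NGRAMS | FLAG_SPELLING, 0, ~FLAG_SPELLING) ==
                  FLAG_NGRAMS,
              "clear a single flag");

// Flip one flag and keep the rest: keep every bit, then toggle that one.
static_assert(apply_flags(FLAG_NGRAMS | FLAG_SPELLING, FLAG_SPELLING, ~0) ==
                  FLAG_NGRAMS,
              "flip a single flag");
```

With the real class, the same toggle and mask arguments would be passed to set_flags(), which returns the previous flags.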

◆ set_max_word_length()

void TermGenerator::set_max_word_length ( unsigned  max_word_length)

Set the maximum length word to index.

The limit is on the length of a word prior to stemming and prior to adding any term prefix.

The backends mostly impose a limit on the length of terms (often of about 240 bytes), but it's generally useful to have a lower limit to help prevent the index being bloated by useless junk terms from trying to index things like binary data, uuencoded data, ASCII art, etc.

Parameters
max_word_length: The maximum length word to index, in bytes in UTF-8 representation. Default is 64.
Since
Added in Xapian 1.3.1.

Definition at line 102 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_stemmer()

void TermGenerator::set_stemmer ( const Xapian::Stem &  stemmer)

Set the Xapian::Stem object to be used for generating stemmed terms.

Definition at line 51 of file termgenerator.cc.

References stemmer.

Referenced by DEFINE_TESTCASE(), main(), make_netstats1_db(), and make_tg_db().

◆ set_stemming_strategy()

void TermGenerator::set_stemming_strategy ( stem_strategy  strategy)

Set the stemming strategy.

This method controls how the stemming algorithm is applied.

Parameters
strategy: The strategy to use - possible values are:
  • STEM_NONE: Don't perform any stemming - only unstemmed terms are generated.
  • STEM_SOME: Generate both stemmed (with a "Z" prefix) and unstemmed terms. No positional information is stored for stemmed terms. This is the default strategy.
  • STEM_SOME_FULL_POS: Like STEM_SOME but positional information is stored for both stemmed and unstemmed terms. Added in Xapian 1.4.8.
  • STEM_ALL: Generate only stemmed terms (but without a "Z" prefix).
  • STEM_ALL_Z: Generate only stemmed terms (with a "Z" prefix).
Since
Added in Xapian 1.3.1.

Definition at line 90 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE(), main(), and make_netstats1_db().
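
The strategies listed above can be summarised as a small model of which terms each one produces for a single word. This is a sketch, not Xapian code: the enum is a local copy of the documented names, terms_for is hypothetical, and the stem is supplied by the caller rather than computed. Positional information, case folding and field prefixes are ignored.

```cpp
#include <string>
#include <vector>

// Local copy of the documented enumerator names, for illustration only.
enum stem_strategy {
    STEM_NONE, STEM_SOME, STEM_ALL, STEM_ALL_Z, STEM_SOME_FULL_POS
};

// Hypothetical helper: list the terms the strategy descriptions imply
// for one word and its stem.
std::vector<std::string> terms_for(stem_strategy strategy,
                                   const std::string& word,
                                   const std::string& stem) {
    switch (strategy) {
        case STEM_NONE:
            return {word};              // unstemmed only
        case STEM_SOME:
        case STEM_SOME_FULL_POS:
            return {word, "Z" + stem};  // unstemmed plus Z-prefixed stem
        case STEM_ALL:
            return {stem};              // stem only, no prefix
        case STEM_ALL_Z:
            return {"Z" + stem};        // stem only, Z-prefixed
    }
    return {};
}
```

In this model STEM_SOME and STEM_SOME_FULL_POS produce the same terms; per the descriptions above they differ only in which terms carry positional information.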

◆ set_stopper()

void TermGenerator::set_stopper ( const Xapian::Stopper *  stop = NULL)

Set the Xapian::Stopper object to be used for identifying stopwords.

Stemmed forms of stopwords aren't indexed, but unstemmed forms still are so that searches for phrases including stop words still work.

Parameters
stop: The Stopper object to set (default NULL, which means no stopwords).

Definition at line 57 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_stopper_strategy()

void TermGenerator::set_stopper_strategy ( stop_strategy  strategy)

Set the stopper strategy.

The method controls how the stopper is used.

You need to also call set_stopper() for this to have any effect.

Parameters
strategy: The strategy to use - possible values are:
  • STOP_NONE: Don't use the stopper.
  • STOP_ALL: If a word is identified as a stop word, skip it completely.
  • STOP_STEMMED: If a word is identified as a stop word, index its unstemmed form but skip the stem. Unstemmed forms are indexed with positional information by default, so this allows searches for phrases containing stopwords to be supported. (This is the default mode).
Since
Added in Xapian 1.4.1.

Definition at line 96 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().
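
As a sketch of the three stopper strategies, the decision for a single word can be modelled as two booleans. The names below are hypothetical, the enum is a local copy of the documented values, and the model assumes a stemming strategy under which both the unstemmed form and the stem would otherwise be indexed.

```cpp
// Local copy of the documented enumerator names, for illustration only.
enum stop_strategy { STOP_NONE, STOP_ALL, STOP_STEMMED };

// Which forms of a word end up in the index in this model.
struct StopDecision {
    bool index_unstemmed;
    bool index_stem;
};

// Hypothetical helper following the strategy descriptions above.
inline StopDecision decide(stop_strategy strategy, bool is_stopword) {
    if (!is_stopword || strategy == STOP_NONE) {
        return {true, true};    // stopper has no effect on this word
    }
    if (strategy == STOP_ALL) {
        return {false, false};  // skip the stopword completely
    }
    return {true, false};       // STOP_STEMMED: keep unstemmed, skip stem
}
```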

◆ set_termpos()

void TermGenerator::set_termpos ( Xapian::termpos  termpos)

Set the current term position.

Parameters
termpos: The new term position to set.

Definition at line 136 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

◆ set_termpos_limit()

void TermGenerator::set_termpos_limit ( Xapian::termpos  termpos_limit)

Set the term position limit.

Parameters
termpos_limit: Upper bound on term positions that can be added.

By default the only limit is the maximum value of the Xapian::termpos type.

Since
Added in Xapian 2.0.0.

Definition at line 142 of file termgenerator.cc.

Referenced by DEFINE_TESTCASE().

Member Data Documentation

◆ internal

Xapian::Internal::intrusive_ptr_nonnull<Internal> Xapian::TermGenerator::internal
private

Reference counted internals.

Definition at line 54 of file termgenerator.h.

Referenced by get_description().


The documentation for this class was generated from the following files: