xapian-core 1.4.22
Iterator returning unigrams and bigrams.
#include <cjk-tokenizer.h>
Public Member Functions
CJKTokenIterator (const std::string &s)
CJKTokenIterator (const Xapian::Utf8Iterator &it_)
CJKTokenIterator ()
const std::string & operator* () const
CJKTokenIterator & operator++ ()
bool unigram () const
    Is this a unigram?
const Xapian::Utf8Iterator & get_utf8iterator () const
bool operator== (const CJKTokenIterator &other) const
bool operator!= (const CJKTokenIterator &other) const
Private Member Functions
void init ()
    Call to set current_token at the start.
Private Attributes
Xapian::Utf8Iterator it
unsigned offset = 0
    Offset to penultimate Unicode character in current_token.
std::string current_token
Iterator returning unigrams and bigrams.
Definition at line 56 of file cjk-tokenizer.h.
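To make "unigrams and bigrams" concrete, here is a minimal standalone sketch. It does not use or reproduce Xapian's code: `unigrams_and_bigrams` is a hypothetical helper, and the order shown (each character, then the pair it forms with the next character) is an assumption, since this page does not state the order in which CJKTokenIterator yields its tokens. The sketch also skips the CJK and word-character checks that init() and operator++() reference.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Xapian: list each character of `text`
// as a unigram, followed by the bigram it forms with the next character.
// Works on UTF-32 so that one element is one Unicode character.
std::vector<std::u32string> unigrams_and_bigrams(const std::u32string& text)
{
    std::vector<std::u32string> tokens;
    for (std::size_t i = 0; i < text.size(); ++i) {
        tokens.push_back(text.substr(i, 1));      // unigram
        if (i + 1 < text.size())
            tokens.push_back(text.substr(i, 2));  // bigram with the next character
    }
    return tokens;
}
```

Under these assumptions, a three-character input yields five tokens: three unigrams and two bigrams.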
CJKTokenIterator::CJKTokenIterator (const std::string &s) [inline, explicit]
Definition at line 71 of file cjk-tokenizer.h.
CJKTokenIterator::CJKTokenIterator (const Xapian::Utf8Iterator &it_) [inline, explicit]
Definition at line 75 of file cjk-tokenizer.h.
CJKTokenIterator::CJKTokenIterator () [inline]
Definition at line 79 of file cjk-tokenizer.h.
const Xapian::Utf8Iterator & CJKTokenIterator::get_utf8iterator () const [inline]
Definition at line 90 of file cjk-tokenizer.h.
Referenced by Xapian::parse_terms().
void CJKTokenIterator::init () [private]
Call to set current_token at the start.
Definition at line 96 of file cjk-tokenizer.cc.
References Xapian::Unicode::append_utf8(), CJK::codepoint_is_cjk(), and Xapian::Unicode::is_wordchar().
bool CJKTokenIterator::operator!= (const CJKTokenIterator &other) const [inline]
Definition at line 98 of file cjk-tokenizer.h.
const std::string & CJKTokenIterator::operator* () const [inline]
Definition at line 81 of file cjk-tokenizer.h.
CJKTokenIterator & CJKTokenIterator::operator++ ()
Definition at line 109 of file cjk-tokenizer.cc.
References Xapian::Unicode::append_utf8(), CJK::codepoint_is_cjk(), and Xapian::Unicode::is_wordchar().
bool CJKTokenIterator::operator== (const CJKTokenIterator &other) const [inline]
Definition at line 92 of file cjk-tokenizer.h.
References current_token.
bool CJKTokenIterator::unigram () const [inline]
Is this a unigram?
Definition at line 88 of file cjk-tokenizer.h.
Referenced by Xapian::parse_terms().
std::string CJKTokenIterator::current_token [private]
Definition at line 65 of file cjk-tokenizer.h.
Referenced by operator==().
Xapian::Utf8Iterator CJKTokenIterator::it [private]
Definition at line 57 of file cjk-tokenizer.h.
unsigned CJKTokenIterator::offset = 0 [private]
Offset to penultimate Unicode character in current_token.
If current_token has one Unicode character, this is 0.
Definition at line 63 of file cjk-tokenizer.h.
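As an illustration of what a byte offset to the penultimate character means for a UTF-8 token, the sketch below scans a string backwards for character start bytes. `penultimate_offset` is a hypothetical helper written for this example, not a function in Xapian, and it assumes the input is valid UTF-8. How CJKTokenIterator itself maintains offset is not shown on this page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xapian: byte offset of the
// second-to-last UTF-8 character in `token`, or 0 if the token
// holds fewer than two characters. Assumes `token` is valid UTF-8.
std::size_t penultimate_offset(const std::string& token)
{
    std::size_t starts_seen = 0;
    for (std::size_t i = token.size(); i > 0; --i) {
        unsigned char byte = static_cast<unsigned char>(token[i - 1]);
        // Continuation bytes look like 10xxxxxx; any other byte starts a character.
        if ((byte & 0xC0) != 0x80 && ++starts_seen == 2)
            return i - 1;
    }
    return 0;
}
```

For a token holding exactly two characters the result is 0, the same value as for a single-character token, so the offset alone would not distinguish the two cases under this reading.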