static bool result = ((p = getenv("XAPIAN_CJK_NGRAM")) != NULL && *p);
if (p < 0x2E80)
    return false;
return ((p >= 0x2E80 && p <= 0x2EFF) ||
        (p >= 0x3000 && p <= 0x9FFF) ||
        (p >= 0xA700 && p <= 0xA71F) ||
        (p >= 0xAC00 && p <= 0xD7AF) ||
        (p >= 0xF900 && p <= 0xFAFF) ||
        (p >= 0xFE30 && p <= 0xFE4F) ||
        (p >= 0xFF00 && p <= 0xFFEF) ||
        (p >= 0x20000 && p <= 0x2A6DF) ||
        (p >= 0x2F800 && p <= 0x2FA1F));
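The range test in the listing can also be written as a lookup over a small table, which may be easier to audit against the Unicode block charts. The sketch below is only an illustration: `CodepointRange`, `kUnbrokenRanges` and `in_unbroken_ranges` are hypothetical names, not part of Xapian, and the ranges are copied from the listing rather than independently verified.

```cpp
// Inclusive code point range (hypothetical helper type, not Xapian API).
struct CodepointRange {
    unsigned first;
    unsigned last;
};

// Ranges copied verbatim from the listing's return expression.
constexpr CodepointRange kUnbrokenRanges[] = {
    {0x2E80, 0x2EFF},   {0x3000, 0x9FFF},   {0xA700, 0xA71F},
    {0xAC00, 0xD7AF},   {0xF900, 0xFAFF},   {0xFE30, 0xFE4F},
    {0xFF00, 0xFFEF},   {0x20000, 0x2A6DF}, {0x2F800, 0x2FA1F},
};

// Table-driven equivalent of the chained comparisons.
inline bool in_unbroken_ranges(unsigned p) {
    // Fast path: everything below the first range is rejected at once,
    // mirroring the early return in the listing.
    if (p < 0x2E80)
        return false;
    for (const CodepointRange& r : kUnbrokenRanges) {
        if (p >= r.first && p <= r.last)
            return true;
    }
    return false;
}
```

The early return keeps the common case (ASCII and most alphabetic scripts) to a single comparison; the table only matters for code points at or above U+2E80.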
current_token.resize(0);
offset = current_token.size();
current_token.resize(0);
current_token.resize(0);
current_token.erase(0, offset);
Unicode and UTF-8 related classes and functions.
void append_utf8(std::string &s, unsigned ch)
Append the UTF-8 representation of a single Unicode character to a std::string.
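For readers unfamiliar with the encoding, here is a minimal sketch of what appending a code point as UTF-8 involves. It is a stand-in written for illustration, not Xapian's implementation: the name `append_utf8_sketch` is hypothetical, and silently ignoring values above U+10FFFF is an assumption of this sketch.

```cpp
#include <string>

// Illustrative stand-in (not Xapian's code): append the UTF-8 encoding of
// code point ch to s. Values above 0x10FFFF are ignored (assumption).
inline void append_utf8_sketch(std::string& s, unsigned ch) {
    if (ch < 0x80) {
        // 1 byte: 0xxxxxxx
        s += static_cast<char>(ch);
    } else if (ch < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        s += static_cast<char>(0xC0 | (ch >> 6));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else if (ch < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        s += static_cast<char>(0xE0 | (ch >> 12));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    } else if (ch <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        s += static_cast<char>(0xF0 | (ch >> 18));
        s += static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (ch & 0x3F));
    }
}
```

Every code point in the CJK ranges tested by `is_unbroken_script` falls in the 3-byte or 4-byte branch.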
bool is_unbroken_script(unsigned p)
Iterator returning unigrams and bigrams.
NgramIterator & operator++()
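To make "unigrams and bigrams" concrete, the sketch below generates both for a run of code points. It is not the `NgramIterator` implementation: `unigrams_and_bigrams` is a hypothetical free function, it works on `std::u32string` rather than UTF-8, and the emission order (each character followed by the pair starting at it) is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical illustration (not Xapian's NgramIterator): for each position,
// emit the single character, then the two-character sequence starting there
// if a following character exists.
inline std::vector<std::u32string>
unigrams_and_bigrams(const std::u32string& text) {
    std::vector<std::u32string> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        out.push_back(text.substr(i, 1));      // unigram
        if (i + 1 < text.size())
            out.push_back(text.substr(i, 2));  // bigram
    }
    return out;
}
```

Under these assumptions a three-character run yields five tokens: three unigrams and two overlapping bigrams.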
Handle text without explicit word breaks.
An iterator which returns Unicode character values from a UTF-8 encoded string.
bool is_wordchar(unsigned ch)
Test if a given Unicode character is a "word character".
void get_unbroken(Xapian::Utf8Iterator &it)
void init()
Call to set current_token at the start.
Various assertion macros.
bool is_ngram_enabled()
Should we use the n-gram code?
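The listing's `getenv("XAPIAN_CJK_NGRAM")` fragment suggests the switch is "variable is set and non-empty", cached in a function-local static. A minimal sketch of that test follows; `env_flag_set` is a hypothetical helper name, not Xapian API, and it deliberately omits the caching so the result tracks later changes to the environment.

```cpp
#include <cstdlib>

// Hypothetical helper (not Xapian API): true if the named environment
// variable exists and its value is not the empty string.
inline bool env_flag_set(const char* name) {
    const char* p = std::getenv(name);
    return p != nullptr && *p != '\0';
}
```

Note that with this rule any non-empty value enables the flag, including `0` or `false`; only an unset or empty variable disables it.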