xapian-core
1.4.26
Functions associated with handling Unicode characters.
Namespaces
    Internal
Enumerations
enum category {
    UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
    MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
    COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER,
    OTHER_NUMBER, SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR,
    CONTROL, FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
    DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION,
    INITIAL_QUOTE_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION,
    MATH_SYMBOL, CURRENCY_SYMBOL, MODIFIER_SYMBOL, OTHER_SYMBOL
}
    Each Unicode character is in exactly one of these categories.
Functions
unsigned nonascii_to_utf8(unsigned ch, char *buf)
    Convert a single non-ASCII Unicode character to UTF-8.
unsigned to_utf8(unsigned ch, char *buf)
    Convert a single Unicode character to UTF-8.
void append_utf8(std::string &s, unsigned ch)
    Append the UTF-8 representation of a single Unicode character to a std::string.
category get_category(unsigned ch)
    Return the category which a given Unicode character falls into.
bool is_wordchar(unsigned ch)
    Test if a given Unicode character is a "word character".
bool is_whitespace(unsigned ch)
    Test if a given Unicode character is a whitespace character.
bool is_currency(unsigned ch)
    Test if a given Unicode character is a currency symbol.
unsigned tolower(unsigned ch)
    Convert a Unicode character to lowercase.
unsigned toupper(unsigned ch)
    Convert a Unicode character to uppercase.
std::string tolower(const std::string &term)
    Convert a UTF-8 std::string to lowercase.
std::string toupper(const std::string &term)
    Convert a UTF-8 std::string to uppercase.
enum Xapian::Unicode::category
Each Unicode character is in exactly one of these categories.
The Unicode standard calls this the "General Category", and uses a "Major, minor" convention to derive a two-letter code.
inline void Xapian::Unicode::append_utf8(std::string &s, unsigned ch)
Append the UTF-8 representation of a single Unicode character to a std::string.
Definition at line 332 of file unicode.h.
References to_utf8().
Referenced by Term::as_positional_unbroken(), DEFINE_TESTCASE(), description_append(), NgramIterator::init(), NgramIterator::operator++(), Xapian::QueryParser::Internal::parse_query(), Xapian::QueryParser::Internal::parse_term(), Xapian::parse_terms(), tolower(), and toupper().
inline category Xapian::Unicode::get_category(unsigned ch)
Return the category which a given Unicode character falls into.
Definition at line 338 of file unicode.h.
References Xapian::Unicode::Internal::get_character_info().
Referenced by is_currency(), is_whitespace(), and is_wordchar().
inline bool Xapian::Unicode::is_currency(unsigned ch)
Test if a given Unicode character is a currency symbol.
Definition at line 371 of file unicode.h.
References CURRENCY_SYMBOL, and get_category().
Referenced by DEFINE_TESTCASE(), prefix_needs_colon(), Xapian::snippet_check_leading_nonwordchar(), and Xapian::snippet_check_trailing_nonwordchar().
inline bool Xapian::Unicode::is_whitespace(unsigned ch)
Test if a given Unicode character is a whitespace character.
Definition at line 361 of file unicode.h.
References CONTROL, get_category(), LINE_SEPARATOR, PARAGRAPH_SEPARATOR, and SPACE_SEPARATOR.
Referenced by DEFINE_TESTCASE(), Xapian::SnipPipe::drain(), is_not_whitespace(), Xapian::QueryParser::Internal::parse_query(), Xapian::MSet::Internal::snippet(), and U_isalpha().
inline bool Xapian::Unicode::is_wordchar(unsigned ch)
Test if a given Unicode character is a "word character".
Definition at line 343 of file unicode.h.
References COMBINING_SPACING_MARK, CONNECTOR_PUNCTUATION, DECIMAL_DIGIT_NUMBER, ENCLOSING_MARK, get_category(), LETTER_NUMBER, LOWERCASE_LETTER, MODIFIER_LETTER, NON_SPACING_MARK, OTHER_LETTER, OTHER_NUMBER, TITLECASE_LETTER, and UPPERCASE_LETTER.
Referenced by Xapian::check_wordchar(), DEFINE_TESTCASE(), Xapian::SnipPipe::drain(), get_unbroken(), NgramIterator::init(), is_not_whitespace(), is_not_wordchar(), NgramIterator::operator++(), Xapian::QueryParser::Internal::parse_query(), Xapian::QueryParser::Internal::parse_term(), and Xapian::parse_terms().
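The References lists for is_currency(), is_whitespace() and is_wordchar() show which categories each predicate consults. The sketch below expresses those three tests as functions of a category value, using one bit per category. It is only an illustration under assumptions: the enum is a local mirror of Xapian::Unicode::category, the *_category() helpers are hypothetical names, and this page does not state whether Xapian itself uses bit masks or plain comparisons.

```cpp
// Local mirror of the category enum documented above, in the same order.
enum category {
    UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
    MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
    COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER,
    OTHER_NUMBER, SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR,
    CONTROL, FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
    DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION,
    INITIAL_QUOTE_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION,
    MATH_SYMBOL, CURRENCY_SYMBOL, MODIFIER_SYMBOL, OTHER_SYMBOL
};

// One bit per category listed under "References" for is_whitespace().
const unsigned WHITESPACE_MASK =
    (1u << CONTROL) | (1u << SPACE_SEPARATOR) |
    (1u << LINE_SEPARATOR) | (1u << PARAGRAPH_SEPARATOR);

// One bit per category listed under "References" for is_wordchar().
const unsigned WORDCHAR_MASK =
    (1u << UPPERCASE_LETTER) | (1u << LOWERCASE_LETTER) |
    (1u << TITLECASE_LETTER) | (1u << MODIFIER_LETTER) |
    (1u << OTHER_LETTER) | (1u << NON_SPACING_MARK) |
    (1u << ENCLOSING_MARK) | (1u << COMBINING_SPACING_MARK) |
    (1u << DECIMAL_DIGIT_NUMBER) | (1u << LETTER_NUMBER) |
    (1u << OTHER_NUMBER) | (1u << CONNECTOR_PUNCTUATION);

// Hypothetical helpers: each tests a category, not a character.
inline bool is_whitespace_category(category c) {
    return ((WHITESPACE_MASK >> c) & 1u) != 0;
}
inline bool is_wordchar_category(category c) {
    return ((WORDCHAR_MASK >> c) & 1u) != 0;
}
inline bool is_currency_category(category c) {
    return c == CURRENCY_SYMBOL;
}
```

Note that CONNECTOR_PUNCTUATION (which includes the underscore) counts as a word character here, while DASH_PUNCTUATION does not, and that CONTROL is treated as whitespace.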
unsigned Xapian::Unicode::nonascii_to_utf8(unsigned ch, char *buf)
Convert a single non-ASCII Unicode character to UTF-8.
This is intended mainly as a helper method for to_utf8().
Parameters
    ch    The character (which must be >= 128, i.e. non-ASCII) to write to buf.
    buf   The buffer to write the character to; it must have space for at least 4 bytes.
Definition at line 39 of file utf8itor.cc.
Referenced by Xapian::Unicode::Internal::get_delta(), and to_utf8().
inline unsigned Xapian::Unicode::to_utf8(unsigned ch, char *buf)
Convert a single Unicode character to UTF-8.
Parameters
    ch    The character to write to buf.
    buf   The buffer to write the character to; it must have space for at least 4 bytes.
Definition at line 321 of file unicode.h.
References nonascii_to_utf8().
Referenced by append_utf8().
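To make the 4-byte buffer requirement concrete, here is a minimal standalone sketch of UTF-8 encoding in the shape of to_utf8() and append_utf8(). It is not Xapian's implementation: encode_utf8() and append_code_point() are hypothetical names, the return value is assumed to be the number of bytes written, and the input is assumed to be a code point no greater than 0x10FFFF (surrogates are not checked).

```cpp
#include <string>

// Hypothetical stand-in for to_utf8(): write the UTF-8 encoding of ch to
// buf (which must have room for 4 bytes) and return the bytes written.
// Assumes ch <= 0x10FFFF; surrogate code points are not rejected.
inline unsigned encode_utf8(unsigned ch, char* buf) {
    if (ch < 0x80) {            // ASCII: a single byte, the fast path.
        buf[0] = static_cast<char>(ch);
        return 1;
    }
    if (ch < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xC0 | (ch >> 6));
        buf[1] = static_cast<char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xE0 | (ch >> 12));
        buf[1] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = static_cast<char>(0x80 | (ch & 0x3F));
        return 3;
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xF0 | (ch >> 18));
    buf[1] = static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
    buf[2] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    buf[3] = static_cast<char>(0x80 | (ch & 0x3F));
    return 4;
}

// Hypothetical counterpart of append_utf8(): encode into a local 4-byte
// buffer, then append exactly the bytes that were written.
inline void append_code_point(std::string& s, unsigned ch) {
    char buf[4];
    s.append(buf, encode_utf8(ch, buf));
}
```

Four bytes is enough because no code point up to U+10FFFF needs a longer UTF-8 sequence. Splitting the ASCII case from the rest mirrors the division of labour described above, where to_utf8() delegates non-ASCII input to nonascii_to_utf8().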
inline unsigned Xapian::Unicode::tolower(unsigned ch)
Convert a Unicode character to lowercase.
Definition at line 376 of file unicode.h.
References Xapian::Unicode::Internal::get_case_type(), Xapian::Unicode::Internal::get_character_info(), and Xapian::Unicode::Internal::get_delta().
Referenced by Xapian::check_wordchar(), DEFINE_TESTCASE(), AuthorValueRangeProcessor::operator()(), AuthorRangeProcessor::operator()(), Xapian::QueryParser::Internal::parse_query(), Xapian::QueryParser::Internal::parse_term(), and Xapian::parse_terms().
inline std::string Xapian::Unicode::tolower(const std::string &term)
Convert a UTF-8 std::string to lowercase.
Definition at line 393 of file unicode.h.
References append_utf8().
inline unsigned Xapian::Unicode::toupper(unsigned ch)
Convert a Unicode character to uppercase.
Definition at line 384 of file unicode.h.
References Xapian::Unicode::Internal::get_case_type(), Xapian::Unicode::Internal::get_character_info(), and Xapian::Unicode::Internal::get_delta().
Referenced by DEFINE_TESTCASE().
inline std::string Xapian::Unicode::toupper(const std::string &term)
Convert a UTF-8 std::string to uppercase.
Definition at line 405 of file unicode.h.
References append_utf8().