xapian-core  1.4.25
Xapian::Unicode Namespace Reference

Functions associated with handling Unicode characters.

Namespaces

 Internal
 

Enumerations

enum  category {
  UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
  MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
  COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER, OTHER_NUMBER,
  SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR, CONTROL,
  FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
  DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION, INITIAL_QUOTE_PUNCTUATION,
  FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION, MATH_SYMBOL, CURRENCY_SYMBOL,
  MODIFIER_SYMBOL, OTHER_SYMBOL
}
 Each Unicode character is in exactly one of these categories.
 

Functions

unsigned nonascii_to_utf8 (unsigned ch, char *buf)
 Convert a single non-ASCII Unicode character to UTF-8.
 
unsigned to_utf8 (unsigned ch, char *buf)
 Convert a single Unicode character to UTF-8.
 
void append_utf8 (std::string &s, unsigned ch)
 Append the UTF-8 representation of a single Unicode character to a std::string.
 
category get_category (unsigned ch)
 Return the category which a given Unicode character falls into.
 
bool is_wordchar (unsigned ch)
 Test if a given Unicode character is a "word character".
 
bool is_whitespace (unsigned ch)
 Test if a given Unicode character is a whitespace character.
 
bool is_currency (unsigned ch)
 Test if a given Unicode character is a currency symbol.
 
unsigned tolower (unsigned ch)
 Convert a Unicode character to lowercase.
 
unsigned toupper (unsigned ch)
 Convert a Unicode character to uppercase.
 
std::string tolower (const std::string &term)
 Convert a UTF-8 std::string to lowercase.
 
std::string toupper (const std::string &term)
 Convert a UTF-8 std::string to uppercase.
 

Detailed Description

Functions associated with handling Unicode characters.

Enumeration Type Documentation

◆ category

Each Unicode character is in exactly one of these categories.

The Unicode standard calls this the "General Category", and uses a "Major, minor" convention to derive a two-letter code.

Enumerator
UNASSIGNED 

Other, not assigned (Cn)

UPPERCASE_LETTER 

Letter, uppercase (Lu)

LOWERCASE_LETTER 

Letter, lowercase (Ll)

TITLECASE_LETTER 

Letter, titlecase (Lt)

MODIFIER_LETTER 

Letter, modifier (Lm)

OTHER_LETTER 

Letter, other (Lo)

NON_SPACING_MARK 

Mark, nonspacing (Mn)

ENCLOSING_MARK 

Mark, enclosing (Me)

COMBINING_SPACING_MARK 

Mark, spacing combining (Mc)

DECIMAL_DIGIT_NUMBER 

Number, decimal digit (Nd)

LETTER_NUMBER 

Number, letter (Nl)

OTHER_NUMBER 

Number, other (No)

SPACE_SEPARATOR 

Separator, space (Zs)

LINE_SEPARATOR 

Separator, line (Zl)

PARAGRAPH_SEPARATOR 

Separator, paragraph (Zp)

CONTROL 

Other, control (Cc)

FORMAT 

Other, format (Cf)

PRIVATE_USE 

Other, private use (Co)

SURROGATE 

Other, surrogate (Cs)

CONNECTOR_PUNCTUATION 

Punctuation, connector (Pc)

DASH_PUNCTUATION 

Punctuation, dash (Pd)

OPEN_PUNCTUATION 

Punctuation, open (Ps)

CLOSE_PUNCTUATION 

Punctuation, close (Pe)

INITIAL_QUOTE_PUNCTUATION 

Punctuation, initial quote (Pi)

FINAL_QUOTE_PUNCTUATION 

Punctuation, final quote (Pf)

OTHER_PUNCTUATION 

Punctuation, other (Po)

MATH_SYMBOL 

Symbol, math (Sm)

CURRENCY_SYMBOL 

Symbol, currency (Sc)

MODIFIER_SYMBOL 

Symbol, modifier (Sk)

OTHER_SYMBOL 

Symbol, other (So)

Definition at line 220 of file unicode.h.
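
The "Major, minor" naming described above can be made concrete with a small lookup table. The following is only a sketch: `category_sketch` is a local stand-in that copies the enumerator order listed on this page, and `category_code()` is a hypothetical helper, not a function provided by Xapian.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Local mirror of the enumerators listed above, in the same order.
// A stand-in for illustration only, not the declaration from unicode.h.
enum category_sketch {
    UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
    MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
    COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER, OTHER_NUMBER,
    SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR, CONTROL,
    FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
    DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION,
    INITIAL_QUOTE_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION,
    MATH_SYMBOL, CURRENCY_SYMBOL, MODIFIER_SYMBOL, OTHER_SYMBOL
};

// Two-letter General Category codes, indexed by the enumerator order above.
static const char* const kCategoryCodes[] = {
    "Cn", "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Me", "Mc", "Nd", "Nl", "No",
    "Zs", "Zl", "Zp", "Cc", "Cf", "Co", "Cs", "Pc", "Pd", "Ps", "Pe", "Pi",
    "Pf", "Po", "Sm", "Sc", "Sk", "So"
};

// Hypothetical helper: return the two-letter code for a category value.
inline std::string category_code(category_sketch c) {
    const std::size_t n = sizeof(kCategoryCodes) / sizeof(kCategoryCodes[0]);
    assert(static_cast<std::size_t>(c) < n);
    return kCategoryCodes[c];
}
```

The first letter of each code names the major class (L for Letter, P for Punctuation, and so on) and the second letter names the minor class within it.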

Function Documentation

◆ append_utf8()

void Xapian::Unicode::append_utf8 ( std::string &  s,
unsigned  ch 
)
inline

Append the UTF-8 representation of a single Unicode character to a std::string.

◆ get_category()

category Xapian::Unicode::get_category ( unsigned  ch)
inline

Return the category which a given Unicode character falls into.

Definition at line 338 of file unicode.h.

References Xapian::Unicode::Internal::get_character_info().

Referenced by is_currency(), is_whitespace(), and is_wordchar().

◆ is_currency()

bool Xapian::Unicode::is_currency ( unsigned  ch)
inline

Test if a given Unicode character is a currency symbol.

Definition at line 371 of file unicode.h.

References CURRENCY_SYMBOL, and get_category().

Referenced by DEFINE_TESTCASE(), prefix_needs_colon(), Xapian::snippet_check_leading_nonwordchar(), and Xapian::snippet_check_trailing_nonwordchar().
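
The reference lists above indicate that the character tests are thin wrappers around get_category(). A minimal sketch of that pattern follows. Everything in it is hypothetical: `toy_category` is a reduced stand-in enum, `toy_get_category()` only knows ASCII plus the euro sign, and the bitmask used for the multi-category test is an assumed technique with an assumed category set (letters and digits only), not something this page documents.

```cpp
// Reduced stand-in for the category enum: only the values this sketch needs.
enum toy_category {
    TOY_OTHER,
    TOY_UPPERCASE_LETTER,
    TOY_LOWERCASE_LETTER,
    TOY_DECIMAL_DIGIT_NUMBER,
    TOY_SPACE_SEPARATOR,
    TOY_CONTROL,
    TOY_CURRENCY_SYMBOL
};

// Toy classifier: covers ASCII plus U+20AC (euro sign) only.
// A real classifier would consult the Unicode character database.
inline toy_category toy_get_category(unsigned ch) {
    if (ch >= 'A' && ch <= 'Z') return TOY_UPPERCASE_LETTER;
    if (ch >= 'a' && ch <= 'z') return TOY_LOWERCASE_LETTER;
    if (ch >= '0' && ch <= '9') return TOY_DECIMAL_DIGIT_NUMBER;
    if (ch == ' ') return TOY_SPACE_SEPARATOR;
    if (ch < 0x20 || ch == 0x7F) return TOY_CONTROL;
    if (ch == '$' || ch == 0x20AC) return TOY_CURRENCY_SYMBOL;
    return TOY_OTHER;
}

// Single-category test: compare the category against one enumerator.
inline bool toy_is_currency(unsigned ch) {
    return toy_get_category(ch) == TOY_CURRENCY_SYMBOL;
}

// Multi-category test: one bit per category value, then test the bit
// selected by the character's category.
inline bool toy_is_wordchar(unsigned ch) {
    const unsigned mask = (1u << TOY_UPPERCASE_LETTER) |
                          (1u << TOY_LOWERCASE_LETTER) |
                          (1u << TOY_DECIMAL_DIGIT_NUMBER);
    return ((mask >> toy_get_category(ch)) & 1u) != 0;
}
```

With this toy classifier, `toy_is_currency('$')` and `toy_is_currency(0x20AC)` are true, while `toy_is_wordchar(' ')` is false.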

◆ is_whitespace()

bool Xapian::Unicode::is_whitespace ( unsigned  ch)
inline

Test if a given Unicode character is a whitespace character.

◆ is_wordchar()

bool Xapian::Unicode::is_wordchar ( unsigned  ch)
inline

Test if a given Unicode character is a "word character".

◆ nonascii_to_utf8()

unsigned Xapian::Unicode::nonascii_to_utf8 ( unsigned  ch,
char *  buf 
)

Convert a single non-ASCII Unicode character to UTF-8.

This is intended mainly as a helper method for to_utf8().

Parameters
ch   The character (which must be >= 128) to write to buf.
buf  The buffer to write the character to - it must have space for (at least) 4 bytes.
Returns
The length of the resultant UTF-8 character in bytes.

Definition at line 39 of file utf8itor.cc.

Referenced by Xapian::Unicode::Internal::get_delta(), and to_utf8().
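
To illustrate the contract described above (a non-ASCII code point in, between 2 and 4 bytes out, length returned), here is a standalone sketch of standard UTF-8 encoding. `sketch_nonascii_to_utf8()` is a hypothetical name and this is not the code from utf8itor.cc; the upper bound of U+10FFFF enforced by the assertion is this sketch's own restriction, and surrogate code points are not rejected.

```cpp
#include <cassert>

// Hypothetical stand-in for the documented contract: encode ch (non-ASCII)
// into buf and return the number of bytes written.
// buf must have room for at least 4 bytes.
inline unsigned sketch_nonascii_to_utf8(unsigned ch, char* buf) {
    assert(ch >= 0x80 && ch <= 0x10FFFF);
    if (ch < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xC0 | (ch >> 6));
        buf[1] = static_cast<char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xE0 | (ch >> 12));
        buf[1] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = static_cast<char>(0x80 | (ch & 0x3F));
        return 3;
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xF0 | (ch >> 18));
    buf[1] = static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
    buf[2] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    buf[3] = static_cast<char>(0x80 | (ch & 0x3F));
    return 4;
}
```

The 4-byte buffer requirement follows from the longest form: a code point at or above U+10000 needs one lead byte and three continuation bytes.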

◆ to_utf8()

unsigned Xapian::Unicode::to_utf8 ( unsigned  ch,
char *  buf 
)
inline

Convert a single Unicode character to UTF-8.

Parameters
ch   The character to write to buf.
buf  The buffer to write the character to - it must have space for (at least) 4 bytes.
Returns
The length of the resultant UTF-8 character in bytes.

Definition at line 321 of file unicode.h.

References nonascii_to_utf8().

Referenced by append_utf8().

◆ tolower() [1/2]

unsigned Xapian::Unicode::tolower ( unsigned  ch)
inline

Convert a Unicode character to lowercase.

◆ tolower() [2/2]

std::string Xapian::Unicode::tolower ( const std::string &  term)
inline

Convert a UTF-8 std::string to lowercase.

Definition at line 393 of file unicode.h.

References append_utf8().
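
Since the string overload references append_utf8(), a plausible shape for it is: decode each code point, map it, and append the result as UTF-8. The sketch below shows that loop in a self-contained form. All names are hypothetical, the per-character mapping covers only ASCII and Latin-1 capitals, and the decoder assumes well-formed UTF-8 input; none of this is taken from unicode.h.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy per-character mapping: ASCII A-Z and Latin-1 capitals only
// (U+00C0..U+00DE, skipping U+00D7, the multiplication sign).
inline unsigned toy_tolower(unsigned ch) {
    if (ch >= 'A' && ch <= 'Z') return ch + 0x20;
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7) return ch + 0x20;
    return ch;
}

// Append ch to s as UTF-8, with a one-byte fast path for ASCII.
inline void toy_append_utf8(std::string& s, unsigned ch) {
    if (ch < 0x80) {
        s += static_cast<char>(ch);
        return;
    }
    if (ch < 0x800) {
        s += static_cast<char>(0xC0 | (ch >> 6));
    } else if (ch < 0x10000) {
        s += static_cast<char>(0xE0 | (ch >> 12));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (ch >> 18));
        s += static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    }
    // Every multi-byte form ends with the low six bits.
    s += static_cast<char>(0x80 | (ch & 0x3F));
}

// Decode one code point starting at term[i] and advance i past it.
// Assumes well-formed UTF-8; no recovery for invalid sequences.
inline unsigned toy_decode_utf8(const std::string& term, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(term[i++]);
    if (lead < 0x80) return lead;
    unsigned extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    unsigned ch = lead & (0x3Fu >> extra);
    while (extra-- > 0) {
        assert(i < term.size());
        ch = (ch << 6) | (static_cast<unsigned char>(term[i++]) & 0x3F);
    }
    return ch;
}

// String-level conversion: decode, map, re-encode.
inline std::string toy_tolower(const std::string& term) {
    std::string result;
    result.reserve(term.size());
    for (std::size_t i = 0; i < term.size(); )
        toy_append_utf8(result, toy_tolower(toy_decode_utf8(term, i)));
    return result;
}
```

Working on code points rather than bytes matters here: lowercasing the bytes of a multi-byte sequence individually would corrupt it, whereas this loop only ever rewrites whole characters.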

◆ toupper() [1/2]

unsigned Xapian::Unicode::toupper ( unsigned  ch)
inline

Convert a Unicode character to uppercase.

◆ toupper() [2/2]

std::string Xapian::Unicode::toupper ( const std::string &  term)
inline

Convert a UTF-8 std::string to uppercase.

Definition at line 405 of file unicode.h.

References append_utf8().