xapian-core  1.4.31
Xapian::Unicode Namespace Reference

Functions associated with handling Unicode characters.

Enumerations

enum category {
  UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
  MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
  COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER, OTHER_NUMBER,
  SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR, CONTROL,
  FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
  DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION, INITIAL_QUOTE_PUNCTUATION,
  FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION, MATH_SYMBOL, CURRENCY_SYMBOL,
  MODIFIER_SYMBOL, OTHER_SYMBOL
}
 Each Unicode character is in exactly one of these categories.

Functions

unsigned nonascii_to_utf8 (unsigned ch, char *buf)
 Convert a single non-ASCII Unicode character to UTF-8.
unsigned to_utf8 (unsigned ch, char *buf)
 Convert a single Unicode character to UTF-8.
void append_utf8 (std::string &s, unsigned ch)
 Append the UTF-8 representation of a single Unicode character to a std::string.
category get_category (unsigned ch)
 Return the category which a given Unicode character falls into.
bool is_wordchar (unsigned ch)
 Test if a given Unicode character is a "word character".
bool is_whitespace (unsigned ch)
 Test if a given Unicode character is a whitespace character.
bool is_currency (unsigned ch)
 Test if a given Unicode character is a currency symbol.
unsigned tolower (unsigned ch)
 Convert a Unicode character to lowercase.
unsigned toupper (unsigned ch)
 Convert a Unicode character to uppercase.
std::string tolower (const std::string &term)
 Convert a UTF-8 std::string to lowercase.
std::string toupper (const std::string &term)
 Convert a UTF-8 std::string to uppercase.

Detailed Description

Functions associated with handling Unicode characters.

Enumeration Type Documentation

◆ category

Each Unicode character is in exactly one of these categories.

The Unicode standard calls this the "General Category", and uses a "Major, minor" convention to derive a two-letter code (for example, "Letter, uppercase" gives Lu).

Enumerator
UNASSIGNED 

Other, not assigned (Cn).

UPPERCASE_LETTER 

Letter, uppercase (Lu).

LOWERCASE_LETTER 

Letter, lowercase (Ll).

TITLECASE_LETTER 

Letter, titlecase (Lt).

MODIFIER_LETTER 

Letter, modifier (Lm).

OTHER_LETTER 

Letter, other (Lo).

NON_SPACING_MARK 

Mark, nonspacing (Mn).

ENCLOSING_MARK 

Mark, enclosing (Me).

COMBINING_SPACING_MARK 

Mark, spacing combining (Mc).

DECIMAL_DIGIT_NUMBER 

Number, decimal digit (Nd).

LETTER_NUMBER 

Number, letter (Nl).

OTHER_NUMBER 

Number, other (No).

SPACE_SEPARATOR 

Separator, space (Zs).

LINE_SEPARATOR 

Separator, line (Zl).

PARAGRAPH_SEPARATOR 

Separator, paragraph (Zp).

CONTROL 

Other, control (Cc).

FORMAT 

Other, format (Cf).

PRIVATE_USE 

Other, private use (Co).

SURROGATE 

Other, surrogate (Cs).

CONNECTOR_PUNCTUATION 

Punctuation, connector (Pc).

DASH_PUNCTUATION 

Punctuation, dash (Pd).

OPEN_PUNCTUATION 

Punctuation, open (Ps).

CLOSE_PUNCTUATION 

Punctuation, close (Pe).

INITIAL_QUOTE_PUNCTUATION 

Punctuation, initial quote (Pi).

FINAL_QUOTE_PUNCTUATION 

Punctuation, final quote (Pf).

OTHER_PUNCTUATION 

Punctuation, other (Po).

MATH_SYMBOL 

Symbol, math (Sm).

CURRENCY_SYMBOL 

Symbol, currency (Sc).

MODIFIER_SYMBOL 

Symbol, modifier (Sk).

OTHER_SYMBOL 

Symbol, other (So).
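
The "Major, minor" convention can be illustrated with a small lookup table. The following is a minimal sketch, not part of the Xapian API: it declares its own copy of the enumeration in a hypothetical `sketch` namespace (so that it compiles without Xapian), and the helper names `general_category_code()` and `major_class()` are invented for illustration. The table relies on the enumerator order listed above.

```cpp
#include <cstddef>

namespace sketch {

// Local copy of the enumeration, in the order documented above.
// (Assumption: with Xapian available, Xapian::Unicode::category would be used.)
enum category {
    UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
    MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
    COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER, OTHER_NUMBER,
    SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR, CONTROL,
    FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
    DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION,
    INITIAL_QUOTE_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION,
    MATH_SYMBOL, CURRENCY_SYMBOL, MODIFIER_SYMBOL, OTHER_SYMBOL
};

// Two-letter General Category codes, indexed by enumerator value.
constexpr const char* kCodes[] = {
    "Cn", "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Me", "Mc", "Nd",
    "Nl", "No", "Zs", "Zl", "Zp", "Cc", "Cf", "Co", "Cs", "Pc",
    "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So"
};
static_assert(sizeof(kCodes) / sizeof(kCodes[0]) == OTHER_SYMBOL + 1,
              "one code per enumerator");

// Hypothetical helper: the full two-letter code, e.g. "Lu".
inline const char* general_category_code(category c) {
    return kCodes[c];
}

// Hypothetical helper: the major class only (L, M, N, Z, C, P or S).
inline char major_class(category c) {
    return kCodes[c][0];
}

}  // namespace sketch
```

Grouping by the first letter is a convenient way to test for a whole major class, for instance treating every category whose code starts with `P` as punctuation.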

Function Documentation

◆ nonascii_to_utf8()

unsigned Xapian::Unicode::nonascii_to_utf8 (unsigned ch, char *buf)

Convert a single non-ASCII Unicode character to UTF-8.

This is intended mainly as a helper function for to_utf8().

Parameters
ch   The character (which must be non-ASCII, i.e. >= 128) to write to buf.
buf  The buffer to write the character to; it must have space for at least 4 bytes.
Returns
The length of the resultant UTF-8 character in bytes.

Referenced by to_utf8().
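
The 4-byte buffer requirement follows from the UTF-8 encoding itself: a non-ASCII code point occupies two, three or four bytes depending on its magnitude. The sketch below shows that branching in a standalone function. It is an illustration of standard UTF-8 encoding, not Xapian's implementation; the `sketch` namespace and the name `encode_nonascii()` are hypothetical, and the sketch assumes the input is at most U+10FFFF and does not reject surrogates.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-in, not Xapian's code: encode a code point >= 0x80
// into buf (which must hold at least 4 bytes) and return the byte count.
inline unsigned encode_nonascii(unsigned ch, char* buf) {
    assert(ch >= 0x80 && ch <= 0x10FFFF);
    if (ch < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xC0 | (ch >> 6));
        buf[1] = static_cast<char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xE0 | (ch >> 12));
        buf[1] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = static_cast<char>(0x80 | (ch & 0x3F));
        return 3;
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xF0 | (ch >> 18));
    buf[1] = static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
    buf[2] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    buf[3] = static_cast<char>(0x80 | (ch & 0x3F));
    return 4;
}

}  // namespace sketch
```

For example, U+00E9 encodes to the two bytes C3 A9, U+20AC to the three bytes E2 82 AC, and U+1F600 to the four bytes F0 9F 98 80.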

◆ to_utf8()

unsigned Xapian::Unicode::to_utf8 (unsigned ch, char *buf)  [inline]

Convert a single Unicode character to UTF-8.

Parameters
ch   The character to write to buf.
buf  The buffer to write the character to; it must have space for at least 4 bytes.
Returns
The length of the resultant UTF-8 character in bytes.

References nonascii_to_utf8().

Referenced by append_utf8().
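
A typical calling pattern for a function of this shape is to encode into a 4-byte stack buffer and then append exactly the number of bytes returned. The sketch below shows that pattern without depending on Xapian: `encode_utf8()` is a hypothetical stand-in with the same signature as to_utf8(), and `encode_all()` is an invented helper. In code linked against Xapian, the stand-in would presumably be replaced by a call to Xapian::Unicode::to_utf8().

```cpp
#include <cstddef>
#include <string>

namespace sketch {

// Hypothetical stand-in with the same shape as to_utf8(ch, buf):
// writes 1 to 4 bytes into buf and returns how many were written.
// Assumes ch <= 0x10FFFF; surrogates are not rejected.
inline unsigned encode_utf8(unsigned ch, char* buf) {
    if (ch < 0x80) {
        // ASCII fast path: a single byte, unchanged.
        buf[0] = static_cast<char>(ch);
        return 1;
    }
    const unsigned len = ch < 0x800 ? 2 : (ch < 0x10000 ? 3 : 4);
    // Fill continuation bytes from the end, 6 bits at a time.
    for (unsigned i = len - 1; i > 0; --i) {
        buf[i] = static_cast<char>(0x80 | (ch & 0x3F));
        ch >>= 6;
    }
    // Lead byte prefix: 110xxxxx, 1110xxxx or 11110xxx.
    static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
    buf[0] = static_cast<char>(lead[len] | ch);
    return len;
}

// Invented helper: encode n code points into one UTF-8 std::string.
inline std::string encode_all(const unsigned* cps, std::size_t n) {
    std::string out;
    char buf[4];  // large enough for any single encoded character
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned len = encode_utf8(cps[i], buf);
        out.append(buf, len);  // append only the bytes actually written
    }
    return out;
}

}  // namespace sketch
```

The buffer is not NUL-terminated by the encoder, which is why the returned length is passed to `append()` rather than treating `buf` as a C string.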