Functions associated with handling Unicode characters.

Enumerations
enum	category { UNASSIGNED , UPPERCASE_LETTER , LOWERCASE_LETTER , TITLECASE_LETTER , MODIFIER_LETTER , OTHER_LETTER , NON_SPACING_MARK , ENCLOSING_MARK , COMBINING_SPACING_MARK , DECIMAL_DIGIT_NUMBER , LETTER_NUMBER , OTHER_NUMBER , SPACE_SEPARATOR , LINE_SEPARATOR , PARAGRAPH_SEPARATOR , CONTROL , FORMAT , PRIVATE_USE , SURROGATE , CONNECTOR_PUNCTUATION , DASH_PUNCTUATION , OPEN_PUNCTUATION , CLOSE_PUNCTUATION , INITIAL_QUOTE_PUNCTUATION , FINAL_QUOTE_PUNCTUATION , OTHER_PUNCTUATION , MATH_SYMBOL , CURRENCY_SYMBOL , MODIFIER_SYMBOL , OTHER_SYMBOL }
	Each Unicode character is in exactly one of these categories.

Functions
unsigned	nonascii_to_utf8 (unsigned ch, char *buf)
	Convert a single non-ASCII Unicode character to UTF-8.

unsigned	to_utf8 (unsigned ch, char *buf)
	Convert a single Unicode character to UTF-8.

void	append_utf8 (std::string &s, unsigned ch)
	Append the UTF-8 representation of a single Unicode character to a std::string.

category	get_category (unsigned ch)
	Return the category which a given Unicode character falls into.

bool	is_wordchar (unsigned ch)
	Test if a given Unicode character is a "word character".

bool	is_whitespace (unsigned ch)
	Test if a given Unicode character is a whitespace character.

bool	is_currency (unsigned ch)
	Test if a given Unicode character is a currency symbol.

unsigned	tolower (unsigned ch)
	Convert a Unicode character to lowercase.

unsigned	toupper (unsigned ch)
	Convert a Unicode character to uppercase.

std::string	tolower (const std::string &term)
	Convert a UTF-8 std::string to lowercase.

std::string	toupper (const std::string &term)
	Convert a UTF-8 std::string to uppercase.

Detailed Description

Functions associated with handling Unicode characters.

Enumeration Type Documentation

enum Xapian::Unicode::category

Each Unicode character is in exactly one of these categories.

The Unicode standard calls this the "General Category", and uses a "Major, minor" convention to derive a two letter code.

Enumerator
UNASSIGNED	Other, not assigned (Cn)
UPPERCASE_LETTER	Letter, uppercase (Lu)
LOWERCASE_LETTER	Letter, lowercase (Ll)
TITLECASE_LETTER	Letter, titlecase (Lt)
MODIFIER_LETTER	Letter, modifier (Lm)
OTHER_LETTER	Letter, other (Lo)
NON_SPACING_MARK	Mark, nonspacing (Mn)
ENCLOSING_MARK	Mark, enclosing (Me)
COMBINING_SPACING_MARK	Mark, spacing combining (Mc)
DECIMAL_DIGIT_NUMBER	Number, decimal digit (Nd)
LETTER_NUMBER	Number, letter (Nl)
OTHER_NUMBER	Number, other (No)
SPACE_SEPARATOR	Separator, space (Zs)
LINE_SEPARATOR	Separator, line (Zl)
PARAGRAPH_SEPARATOR	Separator, paragraph (Zp)
CONTROL	Other, control (Cc)
FORMAT	Other, format (Cf)
PRIVATE_USE	Other, private use (Co)
SURROGATE	Other, surrogate (Cs)
CONNECTOR_PUNCTUATION	Punctuation, connector (Pc)
DASH_PUNCTUATION	Punctuation, dash (Pd)
OPEN_PUNCTUATION	Punctuation, open (Ps)
CLOSE_PUNCTUATION	Punctuation, close (Pe)
INITIAL_QUOTE_PUNCTUATION	Punctuation, initial quote (Pi)
FINAL_QUOTE_PUNCTUATION	Punctuation, final quote (Pf)
OTHER_PUNCTUATION	Punctuation, other (Po)
MATH_SYMBOL	Symbol, math (Sm)
CURRENCY_SYMBOL	Symbol, currency (Sc)
MODIFIER_SYMBOL	Symbol, modifier (Sk)
OTHER_SYMBOL	Symbol, other (So)
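
The "Major, minor" convention in the table above can be sketched as a lookup from enumerator to two-letter code. This is only an illustration: the `sketch` namespace, the local copy of the enum, and the helpers `category_code()` and `major_class()` are hypothetical names that are not part of the documented API, and the enumerator order simply follows the listing above.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Local mirror of the enumerators listed above, in the same order.
// Illustrative copy only, not the library's own declaration.
enum category {
    UNASSIGNED, UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER,
    MODIFIER_LETTER, OTHER_LETTER, NON_SPACING_MARK, ENCLOSING_MARK,
    COMBINING_SPACING_MARK, DECIMAL_DIGIT_NUMBER, LETTER_NUMBER,
    OTHER_NUMBER, SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR,
    CONTROL, FORMAT, PRIVATE_USE, SURROGATE, CONNECTOR_PUNCTUATION,
    DASH_PUNCTUATION, OPEN_PUNCTUATION, CLOSE_PUNCTUATION,
    INITIAL_QUOTE_PUNCTUATION, FINAL_QUOTE_PUNCTUATION, OTHER_PUNCTUATION,
    MATH_SYMBOL, CURRENCY_SYMBOL, MODIFIER_SYMBOL, OTHER_SYMBOL
};

// Two-letter General Category codes, indexed by the enum above.
static const char *const codes[] = {
    "Cn", "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Me", "Mc", "Nd",
    "Nl", "No", "Zs", "Zl", "Zp", "Cc", "Cf", "Co", "Cs", "Pc",
    "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So"
};

// Hypothetical helper: return the two-letter code for a category.
inline std::string category_code(category c) {
    const unsigned n = sizeof(codes) / sizeof(codes[0]);
    assert(static_cast<unsigned>(c) < n);
    return codes[c];
}

// Hypothetical helper: the "Major" class is the first letter of the code
// (L, M, N, Z, C, P or S).
inline char major_class(category c) {
    return category_code(c)[0];
}

}  // namespace sketch
```

With such a table, a test like "is this any kind of punctuation?" reduces to comparing `major_class(c)` with `'P'`.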

unsigned Xapian::Unicode::nonascii_to_utf8 (unsigned ch, char *buf)

Convert a single non-ASCII Unicode character to UTF-8.

This is intended mainly as a helper method for to_utf8().

Parameters

ch	The character (which must be >= 128, i.e. non-ASCII) to write to buf.
buf	The buffer to write the character to - it must have space for (at least) 4 bytes.

Referenced by to_utf8().
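
To show why a 4-byte buffer is enough, here is a minimal standalone sketch of encoding a non-ASCII code point as UTF-8. It is not the library's implementation: `encode_nonascii_sketch()` is a hypothetical name, and it assumes the return value is the number of bytes written, which the text above does not state. Returning 0 for values above U+10FFFF is a choice made for this sketch only, and surrogate code points are not rejected.

```cpp
#include <cassert>

// Hypothetical stand-in for a non-ASCII UTF-8 encoder.
// Writes 2 to 4 bytes to buf and returns how many were written
// (assumed contract), or 0 if ch is beyond U+10FFFF.
unsigned encode_nonascii_sketch(unsigned ch, char *buf) {
    assert(ch >= 0x80);  // ASCII is expected to be handled by the caller.
    if (ch < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xC0 | (ch >> 6));
        buf[1] = static_cast<char>(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xE0 | (ch >> 12));
        buf[1] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        buf[2] = static_cast<char>(0x80 | (ch & 0x3F));
        return 3;
    }
    if (ch <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        buf[0] = static_cast<char>(0xF0 | (ch >> 18));
        buf[1] = static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
        buf[2] = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
        buf[3] = static_cast<char>(0x80 | (ch & 0x3F));
        return 4;
    }
    return 0;  // Outside the Unicode range; nothing written.
}
```

The longest branch writes exactly four bytes, which matches the buffer size required above.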

unsigned Xapian::Unicode::to_utf8 (unsigned ch, char *buf)

inline

Convert a single Unicode character to UTF-8.

Parameters

ch	The character to write to buf.
buf	The buffer to write the character to - it must have space for (at least) 4 bytes.

Referenced by append_utf8().
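
The cross-references above (an inline to_utf8() that refers to nonascii_to_utf8(), and append_utf8() that refers to to_utf8()) suggest a layering with an inline fast path for ASCII and a helper for everything else. The sketch below illustrates that layering under stated assumptions; it is not the library's code. All names in the `sketch_inline` namespace are hypothetical, and the return value is assumed to be the number of bytes written.

```cpp
#include <cassert>
#include <string>

namespace sketch_inline {

// Hypothetical slow path for code points in [0x80, 0x10FFFF].
unsigned encode_multibyte(unsigned ch, char *buf) {
    assert(ch >= 0x80 && ch <= 0x10FFFF);
    // Length of the encoded sequence in bytes.
    const unsigned len = ch < 0x800 ? 2 : (ch < 0x10000 ? 3 : 4);
    // Fill continuation bytes from the end, six payload bits each.
    for (unsigned i = len - 1; i > 0; --i) {
        buf[i] = static_cast<char>(0x80 | (ch & 0x3F));
        ch >>= 6;
    }
    // Lead byte: length marker bits plus the remaining payload bits.
    static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
    buf[0] = static_cast<char>(lead[len] | ch);
    return len;
}

// Hypothetical inline wrapper: ASCII is a single byte copied as-is,
// anything else is delegated to the helper.
inline unsigned encode_utf8(unsigned ch, char *buf) {
    if (ch < 0x80) {
        buf[0] = static_cast<char>(ch);
        return 1;
    }
    return encode_multibyte(ch, buf);
}

// Hypothetical append helper: encode into a 4-byte stack buffer,
// then append only the bytes that were written.
inline void append_codepoint(std::string &s, unsigned ch) {
    char buf[4];
    const unsigned len = encode_utf8(ch, buf);
    s.append(buf, len);
}

}  // namespace sketch_inline
```

Keeping the ASCII check inline means the common single-byte case costs one comparison and one store, while the multi-byte logic stays out of line.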