xapian-core 2.0.0
Xapian::Utf8Iterator Class Reference
An iterator which returns Unicode character values from a UTF-8 encoded string.
#include <unicode.h>
Public Types
| typedef std::input_iterator_tag | iterator_category | We implement the semantics of an STL input_iterator. |
| typedef unsigned | value_type | |
| typedef size_t | difference_type | |
| typedef value_type * | pointer | |
| typedef value_type | reference | |
Public Member Functions
| const char * | raw () const | Return the raw const char* pointer for the current position. |
| size_t | left () const | Return the number of bytes left in the iterator's buffer. |
| void | assign (const char *p_, size_t len) | Assign a new string to the iterator. |
| void | assign (std::string_view s) | Assign a new string to the iterator. |
| | Utf8Iterator (const char *p_, size_t len) | Create an iterator given a pointer and a length. |
| | Utf8Iterator (std::string_view s) | Create an iterator given a string. |
| | Utf8Iterator () noexcept | Create an iterator which is at the end of its iteration. |
| unsigned | operator* () const noexcept | Get the current Unicode character value pointed to by the iterator. |
| Utf8Iterator | operator++ (int) | Move forward to the next Unicode character. |
| Utf8Iterator & | operator++ () | Move forward to the next Unicode character. |
| bool | operator== (const Utf8Iterator &other) const noexcept | Test two Utf8Iterators for equality. |
| bool | operator!= (const Utf8Iterator &other) const noexcept | Test two Utf8Iterators for inequality. |
Private Member Functions
| bool | calculate_sequence_length () const noexcept |
| unsigned | get_char () const |
| | Utf8Iterator (const unsigned char *p_, const unsigned char *end_, unsigned seqlen_) |
| unsigned | strict_deref () const noexcept |
Private Attributes
| const unsigned char * | p |
| const unsigned char * | end |
| unsigned | seqlen |
Detailed Description
An iterator which returns Unicode character values from a UTF-8 encoded string.
Member Typedef Documentation
| typedef size_t Xapian::Utf8Iterator::difference_type |
| typedef std::input_iterator_tag Xapian::Utf8Iterator::iterator_category |
We implement the semantics of an STL input_iterator.
| typedef value_type* Xapian::Utf8Iterator::pointer |
| typedef unsigned Xapian::Utf8Iterator::value_type |
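Constructor & Destructor Documentation
| Xapian::Utf8Iterator::Utf8Iterator (const unsigned char *p_, const unsigned char *end_, unsigned seqlen_) | inline, private |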
| Xapian::Utf8Iterator::Utf8Iterator (const char *p_, size_t len) | inline |
Create an iterator given a pointer and a length.
The iterator will return characters from the start of the string when next called. The string is not copied into the iterator, so it must remain valid while the iteration is in progress.
| p_ | A pointer to the start of the string to read. |
| len | The length of the string to read. |
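The lifetime requirement follows from the iterator holding only pointers into the caller's buffer. The sketch below illustrates that non-owning pattern with a hypothetical `ByteCursor` type, which is a stand-in written for this example and not part of Xapian. Its `raw()` and `left()` members mirror the accessors documented on this page, but it steps one byte at a time rather than one UTF-8 sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for illustration only; not Xapian's Utf8Iterator.
// It stores pointers into the caller's buffer and never copies the bytes.
struct ByteCursor {
    const unsigned char* p;
    const unsigned char* end;

    ByteCursor(const char* p_, std::size_t len)
        : p(reinterpret_cast<const unsigned char*>(p_)), end(p + len) {}

    // Pointer to the current position, analogous to raw().
    const char* raw() const { return reinterpret_cast<const char*>(p); }

    // Number of bytes remaining in the buffer, analogous to left().
    std::size_t left() const { return static_cast<std::size_t>(end - p); }

    // Step forward one byte (a real UTF-8 iterator steps one sequence).
    void advance() { if (p != end) ++p; }
};

// The buffer must outlive the cursor: here `text` is owned by the caller
// and stays alive for the whole function body.
inline std::size_t bytes_left_after_one_step(const std::string& text) {
    ByteCursor c(text.data(), text.size());
    c.advance();
    return c.left();
}
```

Destroying or reallocating the string while such a cursor is still in use would leave `p` and `end` dangling, which is why the buffer has to remain valid for the whole iteration.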
| Xapian::Utf8Iterator::Utf8Iterator (std::string_view s) | inline, explicit |
Create an iterator given a string.
The iterator will return characters from the start of the string when next called. The string is not copied into the iterator, so it must remain valid while the iteration is in progress.
| s | The string to read. Must not be modified while the iteration is in progress. |
This parameter is of type std::string_view, so you can pass any type which converts implicitly to it, such as std::string or a const char* pointing to a nul-terminated string.
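The conversions mentioned above are standard C++ behaviour for any function taking `std::string_view`. The sketch below shows them with a hypothetical `byte_length()` helper standing in for such a function (it is not a Xapian API), and also shows the explicit-length form for data containing embedded nul bytes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper standing in for any function taking std::string_view.
inline std::size_t byte_length(std::string_view s) { return s.size(); }

inline std::size_t from_std_string() {
    std::string owned = "caf\xc3\xa9";    // "café" encoded as UTF-8: 5 bytes
    return byte_length(owned);             // std::string converts implicitly
}

inline std::size_t from_c_string() {
    const char* literal = "caf\xc3\xa9";  // nul-terminated; length found via the nul
    return byte_length(literal);
}

inline std::size_t with_embedded_nul() {
    // A const char* conversion stops at the first nul byte, so pass an
    // explicit length when the data may contain embedded nuls.
    const char data[] = {'a', '\0', 'b'};
    return byte_length(std::string_view(data, sizeof data));
}
```

In each case the view only refers to the original bytes, so the same lifetime rule applies as for the pointer-and-length form.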
| Xapian::Utf8Iterator::Utf8Iterator () | inline, noexcept |
Create an iterator which is at the end of its iteration.
Member Function Documentation
| void Xapian::Utf8Iterator::assign (const char *p_, size_t len) | inline |
Assign a new string to the iterator.
The iterator will forget the string it was iterating through, and return characters from the start of the new string when next called. The string is not copied into the iterator, so it must remain valid while the iteration is in progress.
| p_ | A pointer to the start of the string to read. |
| len | The length of the string to read. |
Definition at line 73 of file unicode.h.
References p.
Referenced by Xapian::SnipPipe::drain().
| void Xapian::Utf8Iterator::assign (std::string_view s) | inline |
Assign a new string to the iterator.
The iterator will forget the string it was iterating through, and return characters from the start of the new string when next called. The string is not copied into the iterator, so it must remain valid while the iteration is in progress.
| s | The string to read. Must not be modified while the iteration is in progress. |
Definition at line 93 of file unicode.h.
References assign().
Referenced by assign().
| bool Xapian::Utf8Iterator::calculate_sequence_length () const | private, noexcept |
Definition at line 66 of file utf8itor.cc.
References bad_cont(), and p.
| unsigned Xapian::Utf8Iterator::get_char () const | private |
| size_t Xapian::Utf8Iterator::left () const | inline |
Return the number of bytes left in the iterator's buffer.
Definition at line 60 of file unicode.h.
References p.
Referenced by Xapian::break_words(), and Xapian::parse_terms().
| bool Xapian::Utf8Iterator::operator!= (const Utf8Iterator &other) const | inline, noexcept |
Test two Utf8Iterators for inequality.
| other | The Utf8Iterator to compare this one with. |
Definition at line 206 of file unicode.h.
References p.
| unsigned Xapian::Utf8Iterator::operator* () const | noexcept |
Get the current Unicode character value pointed to by the iterator.
If an invalid UTF-8 sequence is encountered, the bytes comprising it are returned one at a time as their byte values until valid UTF-8 or the end of the input is reached.
This handling applies to invalid byte sequences, truncated UTF-8 sequences, overlong sequences and (since Xapian 2.0.0) surrogate code points encoded as UTF-8.
If you want to reject or otherwise distinguish invalid UTF-8 sequences, see the strict_deref() method.
Returns unsigned(-1) if the iterator has reached the end of its buffer.
Definition at line 109 of file utf8itor.cc.
References p.
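The fallback behaviour described above can be sketched as a standalone decoder. This is an illustration written for this page and not Xapian's implementation: the names `valid_sequence_length()` and `decode_lenient()` are hypothetical, and the validity checks (including rejecting values above U+10FFFF) are assumptions based on the usual UTF-8 rules.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Illustrative sketch only; not Xapian's implementation.
// Returns the length (1-4) of a valid UTF-8 sequence starting at s[i], or 0
// if the bytes there are invalid, truncated, overlong, or encode a surrogate.
inline std::size_t valid_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    auto is_cont = [&](std::size_t k) {
        return k < s.size() && (byte(k) & 0xC0) == 0x80;
    };
    unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;
    if (b0 < 0xC2) return 0;  // stray continuation byte or overlong 2-byte lead
    if (b0 < 0xE0) return is_cont(i + 1) ? 2 : 0;
    if (b0 < 0xF0) {
        if (!is_cont(i + 1) || !is_cont(i + 2)) return 0;
        unsigned char b1 = byte(i + 1);
        if (b0 == 0xE0 && b1 < 0xA0) return 0;   // overlong 3-byte form
        if (b0 == 0xED && b1 >= 0xA0) return 0;  // surrogate U+D800..U+DFFF
        return 3;
    }
    if (b0 < 0xF5) {
        if (!is_cont(i + 1) || !is_cont(i + 2) || !is_cont(i + 3)) return 0;
        unsigned char b1 = byte(i + 1);
        if (b0 == 0xF0 && b1 < 0x90) return 0;   // overlong 4-byte form
        if (b0 == 0xF4 && b1 >= 0x90) return 0;  // above U+10FFFF
        return 4;
    }
    return 0;
}

// Decode a whole buffer, emitting the raw byte value for each byte that is
// not part of a valid sequence.
inline std::vector<unsigned> decode_lenient(std::string_view s) {
    std::vector<unsigned> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = valid_sequence_length(s, i);
        unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (len == 0) {  // invalid: emit this byte's value and resynchronise
            out.push_back(b0);
            ++i;
            continue;
        }
        unsigned ch = b0;
        if (len == 2) ch &= 0x1F;
        else if (len == 3) ch &= 0x0F;
        else if (len == 4) ch &= 0x07;
        for (std::size_t k = 1; k < len; ++k)
            ch = (ch << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        out.push_back(ch);
        i += len;
    }
    return out;
}
```

With this sketch, the valid three-byte sequence `"\xe2\x82\xac"` decodes to the single value 0x20AC, while the overlong pair `"\xc0\x80"` comes back as the two byte values 0xC0 and 0x80.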
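| Utf8Iterator Xapian::Utf8Iterator::operator++ (int) | inline |
Move forward to the next Unicode character.
| Utf8Iterator & Xapian::Utf8Iterator::operator++ () | inline |
Move forward to the next Unicode character.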
| bool Xapian::Utf8Iterator::operator== (const Utf8Iterator &other) const | inline, noexcept |
Test two Utf8Iterators for equality.
| other | The Utf8Iterator to compare this one with. |
Definition at line 197 of file unicode.h.
References p.
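Equality comparison against a default-constructed iterator is what makes the usual input-iterator loop work. The sketch below shows that pattern with a hypothetical `ByteIter` class, which is a simplified stand-in and not Xapian's class: it walks bytes rather than characters, and the detail that the position pointer is reset to null at the end of the buffer is an assumption made so that the comparison with the sentinel succeeds.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in (not Xapian's class): iterates bytes, not characters.
class ByteIter {
    const unsigned char* p = nullptr;
    const unsigned char* end = nullptr;

  public:
    ByteIter() = default;  // end-of-iteration sentinel

    explicit ByteIter(std::string_view s)
        : p(s.empty() ? nullptr
                      : reinterpret_cast<const unsigned char*>(s.data())),
          end(p ? p + s.size() : nullptr) {}

    // Current value, or unsigned(-1) once the end has been reached.
    unsigned operator*() const { return p ? *p : unsigned(-1); }

    ByteIter& operator++() {
        if (p && ++p == end) p = nullptr;  // collapse to the sentinel state
        return *this;
    }

    // Only the current position is compared.
    bool operator==(const ByteIter& o) const { return p == o.p; }
    bool operator!=(const ByteIter& o) const { return p != o.p; }
};

// Typical loop shape: iterate until equal to a default-constructed iterator.
inline std::size_t count_bytes(std::string_view s) {
    std::size_t n = 0;
    for (ByteIter it(s); it != ByteIter(); ++it) ++n;
    return n;
}
```

Because only the position is compared, two iterators over different buffers compare equal once both are exhausted, which is the property the loop relies on.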
| const char * Xapian::Utf8Iterator::raw () const | inline |
Return the raw const char* pointer for the current position.
Definition at line 55 of file unicode.h.
References p.
Referenced by Xapian::break_words(), Xapian::SnipPipe::drain(), and Xapian::QueryParser::Internal::parse_term().
| unsigned Xapian::Utf8Iterator::strict_deref () const | private, noexcept |
Get the current Unicode character value pointed to by the iterator.
If an invalid UTF-8 sequence is encountered, the bytes comprising it are returned one at a time with the top bit set (so the caller can distinguish them from the same values arising from valid UTF-8) until valid UTF-8 or the end of the input is reached.
This handling applies to invalid byte sequences, truncated UTF-8 sequences, overlong sequences and (since Xapian 2.0.0) surrogate code points encoded as UTF-8.
Returns unsigned(-1) if the iterator has reached the end of its buffer.
Definition at line 122 of file utf8itor.cc.
References p.
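A caller of a strict dereference has to separate three kinds of result: valid code points, flagged invalid bytes, and the end-of-buffer value. The sketch below shows one way to do that. It assumes that "top bit" means bit 31 of a 32-bit unsigned value, and the helper names are hypothetical rather than part of Xapian.

```cpp
#include <cstdint>

// Assumption for illustration: the "top bit" is bit 31 of a 32-bit value.
constexpr std::uint32_t kInvalidFlag = 0x80000000u;
// unsigned(-1) as a 32-bit value, used to signal the end of the buffer.
constexpr std::uint32_t kEndOfInput = 0xFFFFFFFFu;

// Flag a byte taken from an invalid sequence.
constexpr std::uint32_t tag_invalid(unsigned char byte) {
    return kInvalidFlag | byte;
}

// The end-of-input value also has its top bit set, so test for it first.
constexpr bool is_end(std::uint32_t v) { return v == kEndOfInput; }

constexpr bool is_invalid_byte(std::uint32_t v) {
    return !is_end(v) && (v & kInvalidFlag) != 0;
}

// Recover the original byte value from a flagged result.
constexpr unsigned char raw_byte(std::uint32_t v) {
    return static_cast<unsigned char>(v & 0xFFu);
}
```

Valid Unicode code points never exceed 0x10FFFF, so a flagged byte such as 0x800000E9 cannot be confused with the code point U+00E9.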