|
xapian-core
2.0.0
|
Calculate edit distances to a target string. More...
#include <editdistance.h>
Collaboration diagram for EditDistanceCalculator:Public Member Functions | |
| EditDistanceCalculator (std::string_view target_) | |
| Constructor. More... | |
| ~EditDistanceCalculator () | |
| int | operator() (const std::string &candidate, int max_distance) const |
| Calculate edit distance for a sequence. More... | |
Private Types | |
| typedef unsigned long long | freqs_bitmap |
| The type to use for the occurrence bitmaps. More... | |
Private Member Functions | |
| EditDistanceCalculator & | operator= (const EditDistanceCalculator &)=delete |
| Don't allow assignment. More... | |
| EditDistanceCalculator (const EditDistanceCalculator &)=delete | |
| Don't allow copying. More... | |
| int | calc (const unsigned *ptr, int len, int max_distance) const |
| Calculate edit distance. More... | |
Private Attributes | |
| std::vector< unsigned > | target |
| Target in UTF-32. More... | |
| size_t | target_bytes |
| std::vector< unsigned > | utf32 |
| Current candidate in UTF-32. More... | |
| int * | array = nullptr |
| freqs_bitmap | target_freqs = 0 |
| Occurrence bitmap for target sequence. More... | |
| freqs_bitmap | target_freqs2 = 0 |
| Second occurrence bitmap for target sequence. More... | |
Static Private Attributes | |
| static constexpr unsigned | FREQS_MASK = sizeof(freqs_bitmap) * 8 - 1 |
Calculate edit distances to a target string.
Edit distance is defined as the minimum number of edit operations required to move from one sequence to another. The edit operations considered are:
Definition at line 43 of file editdistance.h.
|
private |
The type to use for the occurrence bitmaps.
There will be a trade-off between how good the bound is and how many bits we use. We currently use an unsigned long long, which is typically 64 bits and seems to work pretty well (it makes sense it does for English, as each letter maps to a different bit).
At least on x86-64 and English, a 32-bit type seems to give an identical cycle count to a 64-bit type when profiled with cachegrind, but a 64-bit type is likely to work better for languages which have more than 32 commonly-used word characters.
FIXME: Profile other architectures and languages.
Definition at line 74 of file editdistance.h.
|
privatedelete |
Don't allow copying.
|
inlineexplicit |
Constructor.
| target_ | Target string to calculate edit distances to. |
Definition at line 115 of file editdistance.h.
References FREQS_MASK, target, target_freqs, and target_freqs2.
|
inline |
Definition at line 127 of file editdistance.h.
References array.
|
private |
Calculate edit distance.
Internal helper - the cheap case is inlined from the header.
Definition at line 196 of file editdistance.cc.
References add_popcount().
Referenced by operator()().
|
inline |
Calculate edit distance for a sequence.
| candidate | String to calculate edit distance for. |
| max_distance | The greatest edit distance that's interesting to us. If the true edit distance is > max_distance, any value > max_distance may be returned instead (which allows the edit distance algorithm to avoid work for poor matches). The value passed for subsequent calls to this method on the same object must be the same or less. |
Definition at line 145 of file editdistance.h.
References calc(), target, target_bytes, and utf32.
|
privatedelete |
Don't allow assignment.
|
mutableprivate |
Definition at line 58 of file editdistance.h.
Referenced by ~EditDistanceCalculator().
|
staticconstexprprivate |
Definition at line 101 of file editdistance.h.
Referenced by EditDistanceCalculator().
|
private |
Target in UTF-32.
Definition at line 51 of file editdistance.h.
Referenced by EditDistanceCalculator(), and operator()().
|
private |
Definition at line 53 of file editdistance.h.
Referenced by operator()().
|
private |
Occurrence bitmap for target sequence.
We set the bit corresponding to each codepoint in the word and then xor the bitmaps for the target and candidate and count the bits to compute a lower bound on the edit distance.
Rather than flagging each Unicode code point uniquely, we reduce the code points modulo the number of available bits which can only reduce the bound we calculate.
Definition at line 86 of file editdistance.h.
Referenced by EditDistanceCalculator().
|
private |
Second occurrence bitmap for target sequence.
We set the bit corresponding to each codepoint in the word which we see a second time, which enables us to calculate a tighter edit distance lower bound when there's more than one occurence of a character in the target and/or candidate word.
Rather than flagging each Unicode code point uniquely, we reduce the code points modulo the number of available bits which can only reduce the bound we calculate.
Definition at line 99 of file editdistance.h.
Referenced by EditDistanceCalculator().
|
mutableprivate |
Current candidate in UTF-32.
Definition at line 56 of file editdistance.h.
Referenced by operator()().