xapian-core  2.0.0
Public Member Functions | Private Types | Private Member Functions | Private Attributes | Static Private Attributes | List of all members
EditDistanceCalculator Class Reference

Calculate edit distances to a target string. More...

#include <editdistance.h>

+ Collaboration diagram for EditDistanceCalculator:

Public Member Functions

 EditDistanceCalculator (std::string_view target_)
 Constructor. More...
 
 ~EditDistanceCalculator ()
 
int operator() (const std::string &candidate, int max_distance) const
 Calculate edit distance for a sequence. More...
 

Private Types

typedef unsigned long long freqs_bitmap
 The type to use for the occurrence bitmaps. More...
 

Private Member Functions

EditDistanceCalculatoroperator= (const EditDistanceCalculator &)=delete
 Don't allow assignment. More...
 
 EditDistanceCalculator (const EditDistanceCalculator &)=delete
 Don't allow copying. More...
 
int calc (const unsigned *ptr, int len, int max_distance) const
 Calculate edit distance. More...
 

Private Attributes

std::vector< unsigned > target
 Target in UTF-32. More...
 
size_t target_bytes
 
std::vector< unsigned > utf32
 Current candidate in UTF-32. More...
 
int * array = nullptr
 
freqs_bitmap target_freqs = 0
 Occurrence bitmap for target sequence. More...
 
freqs_bitmap target_freqs2 = 0
 Second occurrence bitmap for target sequence. More...
 

Static Private Attributes

static constexpr unsigned FREQS_MASK = sizeof(freqs_bitmap) * 8 - 1
 

Detailed Description

Calculate edit distances to a target string.

Edit distance is defined as the minimum number of edit operations required to move from one sequence to another. The edit operations considered are:

Definition at line 43 of file editdistance.h.

Member Typedef Documentation

◆ freqs_bitmap

typedef unsigned long long EditDistanceCalculator::freqs_bitmap
private

The type to use for the occurrence bitmaps.

There will be a trade-off between how good the bound is and how many bits we use. We currently use an unsigned long long, which is typically 64 bits and seems to work pretty well (it makes sense it does for English, as each letter maps to a different bit).

At least on x86-64 and English, a 32-bit type seems to give an identical cycle count to a 64-bit type when profiled with cachegrind, but a 64-bit type is likely to work better for languages which have more than 32 commonly-used word characters.

FIXME: Profile other architectures and languages.

Definition at line 74 of file editdistance.h.

Constructor & Destructor Documentation

◆ EditDistanceCalculator() [1/2]

EditDistanceCalculator::EditDistanceCalculator ( const EditDistanceCalculator )
privatedelete

Don't allow copying.

◆ EditDistanceCalculator() [2/2]

EditDistanceCalculator::EditDistanceCalculator ( std::string_view  target_)
inlineexplicit

Constructor.

Parameters
target_Target string to calculate edit distances to.

Definition at line 115 of file editdistance.h.

References FREQS_MASK, target, target_freqs, and target_freqs2.

◆ ~EditDistanceCalculator()

EditDistanceCalculator::~EditDistanceCalculator ( )
inline

Definition at line 127 of file editdistance.h.

References array.

Member Function Documentation

◆ calc()

int EditDistanceCalculator::calc ( const unsigned *  ptr,
int  len,
int  max_distance 
) const
private

Calculate edit distance.

Internal helper - the cheap case is inlined from the header.

Definition at line 196 of file editdistance.cc.

References add_popcount().

Referenced by operator()().

◆ operator()()

int EditDistanceCalculator::operator() ( const std::string &  candidate,
int  max_distance 
) const
inline

Calculate edit distance for a sequence.

Parameters
candidateString to calculate edit distance for.
max_distanceThe greatest edit distance that's interesting to us. If the true edit distance is > max_distance, any value > max_distance may be returned instead (which allows the edit distance algorithm to avoid work for poor matches). The value passed for subsequent calls to this method on the same object must be the same or less.
Returns
The edit distance between candidate and the target.

Definition at line 145 of file editdistance.h.

References calc(), target, target_bytes, and utf32.

◆ operator=()

EditDistanceCalculator& EditDistanceCalculator::operator= ( const EditDistanceCalculator )
privatedelete

Don't allow assignment.

Member Data Documentation

◆ array

int* EditDistanceCalculator::array = nullptr
mutableprivate

Definition at line 58 of file editdistance.h.

Referenced by ~EditDistanceCalculator().

◆ FREQS_MASK

constexpr unsigned EditDistanceCalculator::FREQS_MASK = sizeof(freqs_bitmap) * 8 - 1
staticconstexprprivate

Definition at line 101 of file editdistance.h.

Referenced by EditDistanceCalculator().

◆ target

std::vector<unsigned> EditDistanceCalculator::target
private

Target in UTF-32.

Definition at line 51 of file editdistance.h.

Referenced by EditDistanceCalculator(), and operator()().

◆ target_bytes

size_t EditDistanceCalculator::target_bytes
private

Definition at line 53 of file editdistance.h.

Referenced by operator()().

◆ target_freqs

freqs_bitmap EditDistanceCalculator::target_freqs = 0
private

Occurrence bitmap for target sequence.

We set the bit corresponding to each codepoint in the word and then xor the bitmaps for the target and candidate and count the bits to compute a lower bound on the edit distance.

Rather than flagging each Unicode code point uniquely, we reduce the code points modulo the number of available bits which can only reduce the bound we calculate.

Definition at line 86 of file editdistance.h.

Referenced by EditDistanceCalculator().

◆ target_freqs2

freqs_bitmap EditDistanceCalculator::target_freqs2 = 0
private

Second occurrence bitmap for target sequence.

We set the bit corresponding to each codepoint in the word which we see a second time, which enables us to calculate a tighter edit distance lower bound when there's more than one occurence of a character in the target and/or candidate word.

Rather than flagging each Unicode code point uniquely, we reduce the code points modulo the number of available bits which can only reduce the bound we calculate.

Definition at line 99 of file editdistance.h.

Referenced by EditDistanceCalculator().

◆ utf32

std::vector<unsigned> EditDistanceCalculator::utf32
mutableprivate

Current candidate in UTF-32.

Definition at line 56 of file editdistance.h.

Referenced by operator()().


The documentation for this class was generated from the following files: