|
xapian-core
2.0.0
|
Xapian::Weight subclass implementing the tf-idf weighting scheme. More...
#include <weight.h>
Public Types | |
| enum class | wdf_norm : unsigned char { NONE = 1 , BOOLEAN = 2 , SQUARE = 3 , LOG = 4 , PIVOTED = 5 , LOG_AVERAGE = 6 , AUG_LOG = 7 , SQRT = 8 , AUG_AVERAGE = 9 , MAX = 10 , AUG = 11 } |
| Wdf normalizations. More... | |
| enum class | idf_norm : unsigned char { NONE = 1 , TFIDF = 2 , SQUARE = 3 , FREQ = 4 , PROB = 5 , PIVOTED = 6 , GLOBAL_FREQ = 7 , LOG_GLOBAL_FREQ = 8 , INCREMENTED_GLOBAL_FREQ = 9 , SQRT_GLOBAL_FREQ = 10 } |
| Idf normalizations. More... | |
| enum class | wt_norm : unsigned char { NONE = 1 } |
| Weight normalizations. More... | |
Public Member Functions | |
| TfIdfWeight (const std::string &normalizations) | |
| Construct a TfIdfWeight. More... | |
| TfIdfWeight (const std::string &normalizations, double slope, double delta) | |
| Construct a TfIdfWeight. More... | |
| TfIdfWeight (wdf_norm wdf_normalization, idf_norm idf_normalization, wt_norm wt_normalization) | |
| Construct a TfIdfWeight. More... | |
| TfIdfWeight (wdf_norm wdf_norm_, idf_norm idf_norm_, wt_norm wt_norm_, double slope, double delta) | |
| Construct a TfIdfWeight. More... | |
| TfIdfWeight () | |
| Construct a TfIdfWeight using the default normalizations ("ntn"). More... | |
| std::string | name () const |
| Return the name of this weighting scheme, e.g. More... | |
| std::string | serialise () const |
| Return this object's parameters serialised as a single string. More... | |
| TfIdfWeight * | unserialise (const std::string &serialised) const |
| Unserialise parameters. More... | |
| double | get_sumpart (Xapian::termcount wdf, Xapian::termcount doclen, Xapian::termcount uniqterm, Xapian::termcount wdfdocmax) const |
| Calculate the weight contribution for this object's term to a document. More... | |
| double | get_maxpart () const |
| Return an upper bound on what get_sumpart() can return for any document. More... | |
| TfIdfWeight * | create_from_parameters (const char *params) const |
| Create from a human-readable parameter string. More... | |
Public Member Functions inherited from Xapian::Weight | |
| Weight () | |
| Default constructor, needed by subclass constructors. More... | |
| virtual | ~Weight () |
| Virtual destructor, because we have virtual methods. More... | |
| virtual double | get_sumextra (Xapian::termcount doclen, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax) const |
| Calculate the term-independent weight component for a document. More... | |
| virtual double | get_maxextra () const |
| Return an upper bound on what get_sumextra() can return for any document. More... | |
Private Member Functions | |
| TfIdfWeight * | clone () const |
| Clone this object. More... | |
| void | init (double factor) |
| Allow the subclass to perform any initialisation it needs to. More... | |
| double | get_wdfn (Xapian::termcount wdf, Xapian::termcount len, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax, wdf_norm wdf_normalization) const |
| double | get_idfn (idf_norm idf_normalization) const |
| double | get_wtn (double wt, wt_norm wt_normalization) const |
Private Attributes | |
| wdf_norm | wdf_norm_ |
| The parameter for normalization for the wdf. More... | |
| idf_norm | idf_norm_ |
| The parameter for normalization for the idf. More... | |
| wt_norm | wt_norm_ |
| The parameter for normalization for the document weight. More... | |
| double | wqf_factor |
| The factor to multiply with the weight. More... | |
| double | idfn |
| Normalised IDF value (document-independent). More... | |
| double | param_slope |
| Parameters slope and delta in the Piv+ normalization weighting formula. More... | |
| double | param_delta |
Additional Inherited Members | |
Static Public Member Functions inherited from Xapian::Weight | |
| static const Weight * | create (const std::string &scheme, const Registry &reg=Registry()) |
| Return the appropriate weighting scheme object. More... | |
Protected Types inherited from Xapian::Weight | |
| enum | stat_flags { COLLECTION_SIZE = 0 , RSET_SIZE = 0 , AVERAGE_LENGTH = 4 , TERMFREQ = 1 , RELTERMFREQ = 1 , QUERY_LENGTH = 0 , WQF = 0 , WDF = 2 , DOC_LENGTH = 8 , DOC_LENGTH_MIN = 16 , DOC_LENGTH_MAX = 32 , WDF_MAX = 64 , COLLECTION_FREQ = 1 , UNIQUE_TERMS = 128 , TOTAL_LENGTH = 256 , WDF_DOC_MAX = 512 , UNIQUE_TERMS_MIN = 1024 , UNIQUE_TERMS_MAX = 2048 , DB_DOC_LENGTH_MIN = 4096 , DB_DOC_LENGTH_MAX = 8192 , DB_UNIQUE_TERMS_MIN = 16384 , DB_UNIQUE_TERMS_MAX = 32768 , DB_WDF_MAX = 65536 , IS_BOOLWEIGHT_ = static_cast<int>(0x80000000) } |
| Stats which the weighting scheme can use (see need_stat()). More... | |
Protected Member Functions inherited from Xapian::Weight | |
| void | need_stat (stat_flags flag) |
| Tell Xapian that your subclass will want a particular statistic. More... | |
| Weight (const Weight &) | |
| Don't allow copying. More... | |
| Xapian::doccount | get_collection_size () const |
| The number of documents in the collection. More... | |
| Xapian::doccount | get_rset_size () const |
| The number of documents marked as relevant. More... | |
| Xapian::doclength | get_average_length () const |
| The average length of a document in the collection. More... | |
| Xapian::doccount | get_termfreq () const |
| The number of documents which this term indexes. More... | |
| Xapian::doccount | get_reltermfreq () const |
| The number of relevant documents which this term indexes. More... | |
| Xapian::termcount | get_collection_freq () const |
| The collection frequency of the term. More... | |
| Xapian::termcount | get_query_length () const |
| The length of the query. More... | |
| Xapian::termcount | get_wqf () const |
| The within-query-frequency of this term. More... | |
| Xapian::termcount | get_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the shard. More... | |
| Xapian::termcount | get_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the shard. More... | |
| Xapian::termcount | get_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the shard. More... | |
| Xapian::totallength | get_total_length () const |
| Total length of all documents in the collection. More... | |
| Xapian::termcount | get_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the shard. More... | |
| Xapian::termcount | get_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the shard. More... | |
| Xapian::termcount | get_db_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the database. More... | |
| Xapian::termcount | get_db_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the database. More... | |
| Xapian::termcount | get_db_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the database. More... | |
| Xapian::termcount | get_db_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the database. More... | |
| Xapian::termcount | get_db_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the database. More... | |
Xapian::Weight subclass implementing the tf-idf weighting scheme.
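enum class Xapian::TfIdfWeight::idf_norm : unsigned char
strong |
Idf normalizations.
enum class Xapian::TfIdfWeight::wdf_norm : unsigned char
strong |
Wdf normalizations.
enum class Xapian::TfIdfWeight::wt_norm : unsigned char
strong |
Weight normalizations.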
|
inline explicit |
Construct a TfIdfWeight.
| normalizations | A three character string indicating the normalizations to be used for the tf(wdf), idf and document weight. (default: "ntn") |
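The normalizations string is read one character at a time: the first character selects the normalization for the wdf, the second the normalization for the idf, and the third the normalization for the document weight.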
Implementing support for more normalizations of each type would require extending the backend to track more statistics.
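As an illustration of how a three-character normalization string maps onto the three enums, here is a minimal standalone sketch. It is not Xapian's parser: the types WdfNorm, IdfNorm, WtNorm, Normalizations and the function parse_normalizations() are hypothetical names introduced here, and only the letters of the documented default "ntn" are handled, on the assumption that 'n' selects NONE and 't' selects TFIDF.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical cut-down copies of the enums, limited to the values needed
// for the documented default "ntn".
enum class WdfNorm : unsigned char { NONE = 1 };
enum class IdfNorm : unsigned char { NONE = 1, TFIDF = 2 };
enum class WtNorm : unsigned char { NONE = 1 };

// Hypothetical holder for the three decoded normalizations.
struct Normalizations {
    WdfNorm wdf;
    IdfNorm idf;
    WtNorm wt;
};

// Decode a normalization string: character 0 is the wdf normalization,
// character 1 the idf normalization, character 2 the weight normalization.
// Assumption: 'n' means NONE and 't' means TFIDF; other letters are rejected
// because this sketch does not reproduce the full letter table.
Normalizations parse_normalizations(const std::string& s) {
    if (s.size() != 3)
        throw std::invalid_argument("expected exactly three characters");
    if (s[0] != 'n')
        throw std::invalid_argument("wdf code not handled by this sketch");
    if (s[2] != 'n')
        throw std::invalid_argument("weight code not handled by this sketch");
    IdfNorm idf = IdfNorm::NONE;
    switch (s[1]) {
        case 'n': idf = IdfNorm::NONE; break;
        case 't': idf = IdfNorm::TFIDF; break;
        default:
            throw std::invalid_argument("idf code not handled by this sketch");
    }
    return Normalizations{WdfNorm::NONE, idf, WtNorm::NONE};
}

// The documented default "ntn": no wdf normalization, tf-idf style idf,
// no weight normalization.
Normalizations default_normalizations() {
    Normalizations n = parse_normalizations("ntn");
    assert(n.idf == IdfNorm::TFIDF);
    return n;
}
```

Calling default_normalizations() yields the same combination that the enum-based constructor would receive as (wdf_norm::NONE, idf_norm::TFIDF, wt_norm::NONE), under the letter assumptions stated above.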
Xapian::TfIdfWeight::TfIdfWeight ( const std::string & normalizations, double slope, double delta )
Construct a TfIdfWeight.
| normalizations | A three character string indicating the normalizations to be used for the tf(wdf), idf and document weight. (default: "ntn") |
| slope | Extra parameter for "Pivoted" tf normalization. (default: 0.2) |
| delta | Extra parameter for "Pivoted" tf normalization. (default: 1.0) |
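The normalizations string is read one character at a time: the first character selects the normalization for the wdf, the second the normalization for the idf, and the third the normalization for the document weight.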
Implementing support for more normalizations of each type would require extending the backend to track more statistics.
Definition at line 103 of file tfidfweight.cc.
|
inline |
Construct a TfIdfWeight.
| wdf_normalization | The normalization for the wdf. |
| idf_normalization | The normalization for the idf. |
| wt_normalization | The normalization for the document weight. |
Implementing support for more normalizations of each type would require extending the backend to track more statistics.
Xapian::TfIdfWeight::TfIdfWeight ( wdf_norm wdf_norm_, idf_norm idf_norm_, wt_norm wt_norm_, double slope, double delta )
Construct a TfIdfWeight.
| wdf_norm_ | The normalization for the wdf. |
| idf_norm_ | The normalization for the idf. |
| wt_norm_ | The normalization for the document weight. |
| slope | Extra parameter for "Pivoted" tf normalization. (default: 0.2) |
| delta | Extra parameter for "Pivoted" tf normalization. (default: 1.0) |
Implementing support for more normalizations of each type would require extending the backend to track more statistics.
Definition at line 110 of file tfidfweight.cc.
References AUG, AUG_AVERAGE, Xapian::Weight::AVERAGE_LENGTH, Xapian::Weight::COLLECTION_FREQ, Xapian::Weight::COLLECTION_SIZE, Xapian::Weight::DOC_LENGTH, Xapian::Weight::DOC_LENGTH_MAX, Xapian::Weight::DOC_LENGTH_MIN, GLOBAL_FREQ, idf_norm_, INCREMENTED_GLOBAL_FREQ, LOG_AVERAGE, LOG_GLOBAL_FREQ, MAX, Xapian::Weight::need_stat(), NONE, param_delta, param_slope, PIVOTED, SQRT_GLOBAL_FREQ, Xapian::Weight::TERMFREQ, Xapian::Weight::UNIQUE_TERMS, Xapian::Weight::WDF, Xapian::Weight::WDF_DOC_MAX, Xapian::Weight::WDF_MAX, wdf_norm_, and Xapian::Weight::WQF.
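To show how slope and delta might enter the "Pivoted" wdf normalization, here is a standalone sketch. The formula is an assumption based on the usual statement of the Piv+ scheme, not a copy of tfidfweight.cc, and pivoted_wdfn() is a hypothetical free function. The default arguments mirror the documented defaults (slope 0.2, delta 1.0).

```cpp
#include <cassert>
#include <cmath>

// Assumed Piv+ style wdf normalization:
//   wdfn = (1 + ln(1 + ln(wdf))) / (1 - slope + slope * doclen / avg_len) + delta
// A term that does not occur in the document (wdf == 0) contributes nothing.
double pivoted_wdfn(unsigned wdf, double doclen, double average_length,
                    double slope = 0.2, double delta = 1.0) {
    assert(average_length > 0.0);
    if (wdf == 0) return 0.0;
    // Document length relative to the collection average.
    double normlen = doclen / average_length;
    // Pivoted length normalization: equals 1 for an average-length document.
    double norm_factor = 1.0 / (1.0 - slope + slope * normlen);
    double damped_tf = 1.0 + std::log(1.0 + std::log(static_cast<double>(wdf)));
    return damped_tf * norm_factor + delta;
}
```

Under this formula a larger slope penalises long documents more strongly, while delta adds a fixed floor to every matching document regardless of its length.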
|
inline |
Construct a TfIdfWeight using the default normalizations ("ntn").
Definition at line 1023 of file weight.h.
Referenced by clone(), and unserialise().
|
private virtual |
Clone this object.
This method allocates and returns a copy of the object it is called on.
If your subclass is called FooWeight and has parameters a and b, then you would implement FooWeight::clone() like so:
FooWeight * FooWeight::clone() const { return new FooWeight(a, b); }
Note that the returned object will be deallocated by Xapian after use with "delete". If you want to handle the deletion in a special way (for example when wrapping the Xapian API for use from another language) then you can define a static operator delete method in your subclass as shown here: https://trac.xapian.org/ticket/554#comment:1
Implements Xapian::Weight.
Definition at line 152 of file tfidfweight.cc.
References idf_norm_, param_delta, param_slope, TfIdfWeight(), wdf_norm_, and wt_norm_.
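The FooWeight pattern described above can be sketched as a self-contained program. The Weight class below is a minimal stand-in written for this example, not Xapian::Weight (which has further virtual methods), and clone_owned() is a hypothetical helper that plays the role of Xapian taking ownership of the copy and releasing it with delete.

```cpp
#include <cassert>
#include <memory>

// Minimal stand-in for the base class; only what the clone pattern needs.
class Weight {
  public:
    virtual ~Weight() = default;
    // Allocate and return a copy of the object this is called on.
    virtual Weight* clone() const = 0;
};

// Example subclass with two parameters, a and b, as in the text above.
class FooWeight : public Weight {
    double a, b;

  public:
    FooWeight(double a_, double b_) : a(a_), b(b_) {}

    // Covariant return type: a FooWeight* is returned through the Weight
    // interface. The copy is built from the parameters alone.
    FooWeight* clone() const override { return new FooWeight(a, b); }

    double get_a() const { return a; }
    double get_b() const { return b; }
};

// Hypothetical caller-side helper: the caller owns the returned copy and
// frees it with delete (here via std::unique_ptr).
std::unique_ptr<Weight> clone_owned(const Weight& w) {
    Weight* copy = w.clone();
    assert(copy != nullptr && copy != &w);
    return std::unique_ptr<Weight>(copy);
}
```

Because the copy is destroyed through a base-class pointer, the stand-in base class declares a virtual destructor, matching the virtual ~Weight() listed for Xapian::Weight.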
|
virtual |
Create from a human-readable parameter string.
| params | string containing weighting scheme parameter values. |
Reimplemented from Xapian::Weight.
Definition at line 350 of file tfidfweight.cc.
References idf_norm_tab, keyword(), NONE, p, Xapian::Weight::Internal::param_name(), Xapian::parameter_error(), and wdf_norm_tab.
double Xapian::TfIdfWeight::get_idfn ( idf_norm idf_normalization ) const
private |
Definition at line 287 of file tfidfweight.cc.
References FREQ, Xapian::Weight::get_collection_freq(), Xapian::Weight::get_collection_size(), Xapian::Weight::get_termfreq(), GLOBAL_FREQ, INCREMENTED_GLOBAL_FREQ, LOG_GLOBAL_FREQ, NONE, PIVOTED, PROB, SQRT_GLOBAL_FREQ, SQUARE, and TFIDF.
Referenced by init().
|
virtual |
Return an upper bound on what get_sumpart() can return for any document.
This information is used by the matcher to perform various optimisations, so strive to make the bound as tight as possible.
Implements Xapian::Weight.
Definition at line 218 of file tfidfweight.cc.
References Xapian::Weight::get_doclength_lower_bound(), Xapian::Weight::get_wdf_upper_bound(), get_wdfn(), get_wtn(), idfn, wdf_norm_, wqf_factor, and wt_norm_.
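One way to obtain such a bound, sketched here for the simplest case, is to evaluate the per-document formula at the upper bound on the wdf. This is an illustration under assumptions rather than the code in tfidfweight.cc: it covers only the case where the wdf normalization is the identity and the idf and query factor are non-negative, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Identity wdf normalization (the NONE case): non-decreasing in wdf.
double wdfn_none(unsigned wdf) { return static_cast<double>(wdf); }

// Assumed per-document contribution: normalised wdf times normalised idf
// times the query-side factor.
double sumpart(unsigned wdf, double idfn, double wqf_factor) {
    return wdfn_none(wdf) * idfn * wqf_factor;
}

// Upper bound: the same formula evaluated at the largest wdf that can occur.
// This is only valid when the formula is non-decreasing in wdf, hence the
// non-negativity preconditions.
double maxpart(unsigned wdf_upper_bound, double idfn, double wqf_factor) {
    assert(idfn >= 0.0 && wqf_factor >= 0.0);
    return sumpart(wdf_upper_bound, idfn, wqf_factor);
}

// Check that no document's contribution exceeds the bound.
bool bound_holds(const std::vector<unsigned>& wdfs, unsigned wdf_upper_bound,
                 double idfn, double wqf_factor) {
    const double bound = maxpart(wdf_upper_bound, idfn, wqf_factor);
    return std::all_of(wdfs.begin(), wdfs.end(), [&](unsigned wdf) {
        return sumpart(wdf, idfn, wqf_factor) <= bound;
    });
}
```

The bound is tight in this case because a document whose wdf equals the upper bound attains it exactly. Length-dependent normalizations would also need a bound on the document length, which is presumably why get_doclength_lower_bound() appears in the references above.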
|
virtual |
Calculate the weight contribution for this object's term to a document.
The parameters give information about the document which may be used in the calculations:
| wdf | The within document frequency of the term in the document. You need to call need_stat(WDF) if you use this value. |
| doclen | The document's length (unnormalised). You need to call need_stat(DOC_LENGTH) if you use this value. |
| uniqterms | Number of unique terms in the document. You need to call need_stat(UNIQUE_TERMS) if you use this value. |
| wdfdocmax | Maximum wdf value in the document. You need to call need_stat(WDF_DOC_MAX) if you use this value. |
You can rely on wdf <= doclen if you call both need_stat(WDF) and need_stat(DOC_LENGTH) - this is trivially true for terms, but Xapian also ensures it's true for OP_SYNONYM, where the wdf is approximated.
Implements Xapian::Weight.
Definition at line 206 of file tfidfweight.cc.
References get_wdfn(), get_wtn(), idfn, wdf_norm_, wqf_factor, and wt_norm_.
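The references above suggest that the contribution is assembled from a normalised wdf, the precomputed idfn, a weight normalization and wqf_factor. The sketch below shows that composition for the default "ntn" scheme only. It assumes the textbook tf-idf definitions (wdfn = wdf, idfn = ln(N / termfreq), no weight normalization); TermState, idf_tfidf() and sumpart_ntn() are hypothetical names, not part of Xapian.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the per-term state prepared before matching;
// the field names mirror the private attributes listed above.
struct TermState {
    double wqf_factor;  // the factor to multiply with the weight
    double idfn;        // normalised idf value, independent of the document
};

// Assumed idf for the TFIDF normalization: ln(N / termfreq), where N is the
// number of documents in the collection and termfreq the number of documents
// indexed by the term.
double idf_tfidf(double collection_size, double termfreq) {
    assert(termfreq > 0.0 && termfreq <= collection_size);
    return std::log(collection_size / termfreq);
}

// Assumed per-document contribution for "ntn":
//   wdf normalization NONE    -> wdfn = wdf
//   weight normalization NONE -> wtn  = wdfn * idfn
// and the result is scaled by wqf_factor.
double sumpart_ntn(const TermState& st, unsigned wdf) {
    double wdfn = static_cast<double>(wdf);
    double wt = wdfn * st.idfn;
    return wt * st.wqf_factor;
}
```

For example, with 100 documents, a term indexing 10 of them, wqf_factor 2 and wdf 3, this gives 3 * ln(10) * 2. Only the wdf changes from document to document, so everything else can be computed once per term.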
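double Xapian::TfIdfWeight::get_wdfn ( Xapian::termcount wdf, Xapian::termcount len, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax, wdf_norm wdf_normalization ) const
private |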
Definition at line 228 of file tfidfweight.cc.
References AUG, AUG_AVERAGE, AUG_LOG, BOOLEAN, Xapian::Weight::get_average_length(), LOG, LOG_AVERAGE, MAX, NONE, param_delta, param_slope, PIVOTED, rare, SQRT, and SQUARE.
Referenced by get_maxpart(), and get_sumpart().
double Xapian::TfIdfWeight::get_wtn ( double wt, wt_norm wt_normalization ) const
private |
Definition at line 336 of file tfidfweight.cc.
Referenced by get_maxpart(), and get_sumpart().
|
private virtual |
Allow the subclass to perform any initialisation it needs to.
| factor | Any scaling factor (e.g. from OP_SCALE_WEIGHT). If the Weight object is for the term-independent weight supplied by get_sumextra()/get_maxextra(), then init(0.0) is called (starting from Xapian 1.2.11 and 1.3.1 - earlier versions failed to call init() for such Weight objects). |
Implements Xapian::Weight.
Definition at line 159 of file tfidfweight.cc.
References get_idfn(), Xapian::Weight::get_wqf(), idf_norm_, idfn, and wqf_factor.
|
virtual |
Return the name of this weighting scheme, e.g. "bm25+".
This is the name that the weighting scheme gets registered under when passed to Xapian::Registry::register_weighting_scheme().
As a result:
For 1.4.x and earlier we recommended returning the full namespace-qualified name of your class here, but now we recommend returning just the name in lower case, e.g. "foo" instead of "FooWeight", "bm25+" instead of "Xapian::BM25PlusWeight".
If you don't want to support creation via Weight::create() or the remote backend, you can use the default implementation which simply returns an empty string.
Reimplemented from Xapian::Weight.
Definition at line 172 of file tfidfweight.cc.
|
virtual |
Return this object's parameters serialised as a single string.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Reimplemented from Xapian::Weight.
Definition at line 178 of file tfidfweight.cc.
References idf_norm_, param_delta, param_slope, serialise_double(), wdf_norm_, and wt_norm_.
|
virtual |
Unserialise parameters.
This method unserialises parameters serialised by the serialise() method and allocates and returns a new object initialised with them.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Note that the returned object will be deallocated by Xapian after use with "delete". If you want to handle the deletion in a special way (for example when wrapping the Xapian API for use from another language) then you can define a static operator delete method in your subclass as shown here: https://trac.xapian.org/ticket/554#comment:1
| serialised | A string containing the serialised parameters. |
Reimplemented from Xapian::Weight.
Definition at line 189 of file tfidfweight.cc.
References rare, TfIdfWeight(), and unserialise_double().
|
private |
The parameter for normalization for the idf.
Definition at line 863 of file weight.h.
Referenced by clone(), init(), serialise(), and TfIdfWeight().
|
private |
Normalised IDF value (document-independent).
Definition at line 871 of file weight.h.
Referenced by get_maxpart(), get_sumpart(), and init().
double Xapian::TfIdfWeight::param_delta
private |
Definition at line 874 of file weight.h.
Referenced by clone(), get_wdfn(), serialise(), and TfIdfWeight().
|
private |
Parameters slope and delta in the Piv+ normalization weighting formula.
Definition at line 874 of file weight.h.
Referenced by clone(), get_wdfn(), serialise(), and TfIdfWeight().
|
private |
The parameter for normalization for the wdf.
Definition at line 861 of file weight.h.
Referenced by clone(), get_maxpart(), get_sumpart(), serialise(), and TfIdfWeight().
|
private |
The factor to multiply with the weight.
Definition at line 868 of file weight.h.
Referenced by get_maxpart(), get_sumpart(), and init().
|
private |
The parameter for normalization for the document weight.
Definition at line 865 of file weight.h.
Referenced by clone(), get_maxpart(), get_sumpart(), and serialise().