Omega 1.4.18 (2021-01-14): indexers: * omindex: + Add default MIME mapping for application/rtf. IANA have registrations for text/rtf and (more recently) application/rtf (it seems because newer versions of the RTF format can contain 8-bit data) so we now recognise application/rtf by default and handle it the same way as text/rtf. Current libmagic seems to always return text/rtf (no matches for application/rtf in magic.mgc) and we continue to map extension rtf to text/rtf, so this change is mainly future-proofing against libmagic future changes. + Add support for indexing OpenXPS, which is effectively the same as XPS internally in ways we care about, but it uses a different mimetype and a different filename extension. omega: * Explicitly use OR for MORELIKE queries. Since 1.3.0 the default value of DEFAULTOP has been AND, which typically makes MORELIKE queries much less useful since they'll only match documents containing all the terms from the query expansion. We now explicitly insert " OR " between the terms if DEFAULTOP hasn't been set to OR, which makes them work much more like they did in 1.2.x. * Make $stoplist and $unstem consider all query strings by always passing the new Xapian::QueryParser::FLAG_ACCUMULATE flag. * Add $foreach command which works like $map, but just concatenates the evaluated results rather than adding tabs to turn them into an OmegaScript list. * Extend $include{} to allow handling failure to open the specified file via an optional second argument which if specified will be evaluated and returned instead. Patch from Gaurav Arora. * Support multiple MORELIKE parameters - we now form an RSet from all the specified documents and use that to generate the query to run (previously only one of multiple MORELIKE parameters was used). Omega 1.4.17 (2020-08-21): documentation: * Document comment format supported by scriptindex index scripts. 
We've supported comments on a line by themselves and introduced with a # since scriptindex was first added back in 2002, but it seems to have never actually been documented before now. omega: * Check for SERVER_PROTOCOL=INCLUDED before anything which might throw an exception so that if it is set we suppress the Content-Type: when reporting such exceptions. Spotted by Gaurav Arora. * Report get_description() for Xapian::Error exceptions instead of get_msg(). This means we now report the exception's type, context (useful for network errors), and errno information. * Avoid leaking MyStopper object. The object essentially has the lifespan of omega itself, but becomes unreachable when the QueryParser object is destroyed. To make it easier to use leak-checking tools, hand ownership of this object to the QueryParser object. testsuite: * omegatest: Tell leak sanitizer not to report leaks for allocations which aren't explicitly released on exit - the OS will reclaim all memory from the process at this point and explicitly releasing everything just takes time for no real benefit. We will still see leaks of objects which become unreachable during a run. Omega 1.4.16 (2020-06-08): indexers: * Fix handling of XML empty tag syntax when there's a quoted parameter right before the closing `/>`. This caused `` to treat the body text as the document title. Spotted by Gaurav Arora. * omindex: Fix killing of filter child process if the parent process receives a signal. Spotted by Gaurav Arora. omega: * Reject $setrelevant without an argument list. This has never been documented as allowed, and previously crashed with a segfault. Fixes #802, reported by Gaurav Arora. * If there's an error opening the databases we now close any we managed to open successfully before the error so that things like $dbsize can't end up reporting values for a subset of the specified databases.
portability: * Use our own autoconf cache variable namespace (xo_cv_ prefix instead of ac_cv_) to avoid colliding with standard autoconf macro use if config.site or a shared config.cache is used. The former case caused a build failure for the OpenBSD port with 1.4.15, reported by Lucas R. Omega 1.4.15 (2020-02-24): documentation: * Update documentation about how to add a new format to omindex. Patch from Bruno Baruffaldi. indexers: * Check for a BOM on HTML files, which for HTML5 should determine the encoding. omega: * Allow $if{COND} without any actions which is useful as a way to evaluate something but ignore the result if you just want the side effects. Indeed we were already recommending to use it if you want to ignore the return value of $log. Fixes bug introduced in 1.4.14, reported by tuftedocelot. * Add OmegaScript support for $jsonbool{COND} for encoding a boolean value for use in JSON. This is equivalent to $if{COND,true,false} but more readable. * Add OmegaScript support for $jsonobject{} which allows producing a JSON object from an OmegaScript map. * Allow specifying a format to $jsonarray{} so it is no longer restricted to producing an array of strings. * Add $keys{MAP} OmegaScript command which gives a sorted list of the keys from an OmegaScript map. portability: * Simplify probes for snprintf. The broken snprintf in libbsd in Linux libc4 is from ~25 years ago so way too ancient to matter now, and all callers already handle the pre-ISO semantics of returning -1 for an undersize buffer so we don't need to run a test program to probe for this at configure time, which is more cross-compile friendly. * Avoid deprecation warning on recent Linux. We were including sys/sysctl.h if it existed, which it does on Linux but we don't actually use it there. Including it now warns that it is deprecated, so skip including it under Linux. Reported on IRC by kumaran. Omega 1.4.14 (2019-11-23): documentation: * Improve omindex --help docs for --duplicates. 
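The iso-8859-15 entry in the indexers section of this release (1.4.14) describes the charset as a variant of iso-8859-1 with 8 characters changed. As a rough illustration, the sketch below compares the two charsets byte by byte. It relies on Python's own codec tables, not on Omega's built-in converter, and the function name differing_bytes is made up for this example.

```python
# Compare Python's ISO-8859-1 and ISO-8859-15 codecs byte by byte.
# Assumption: Python's codec tables stand in for Omega's built-in
# conversion code, which is not shown in these release notes.

def differing_bytes():
    """Return {byte value: (iso-8859-1 char, iso-8859-15 char)} for
    every byte value which the two charsets decode differently."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("iso8859-1")
        latin9 = raw.decode("iso8859-15")
        if latin1 != latin9:
            diffs[value] = (latin1, latin9)
    return diffs


if __name__ == "__main__":
    # List each changed byte with its old and new character.
    for value, (old, new) in sorted(differing_bytes().items()):
        print(f"0x{value:02X}: {old!r} -> {new!r}")
```

Byte 0xA4 is the one which becomes the euro currency symbol; every other byte value not listed decodes identically in both charsets.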
indexers: * Add built-in support for iso-8859-15 so we can handle it without iconv. This charset is a variant of iso-8859-1 with 8 characters changed, most notably including the euro currency symbol. It's the most commonly seen charset we didn't have built-in support for. omega: * Fix error handling in $lookup. We now check for errors from cdb_init() and cdb_get(). We've never checked for errors from cdb_init(), while for cdb_get() this bug was introduced by a warning fix in 1.2.20. Omega 1.4.13 (2019-10-14): documentation: * Document that $log will start to return an error message in 1.5.0, and that one can wrap it using a $if with no action now to be future-proof. indexers: * Optimise converting us-ascii to UTF-8 to do nothing, like we already do when converting UTF-8 to UTF-8. * scriptindex: + Add new 'gap' action which provides a way to leave a gap in the term positions between fields to prevent phrases and positional operators from matching across fields. templates: * Future-proof use of $log against changes in 1.5.0. Omega 1.4.12 (2019-07-23): documentation: * Improve docs for OmegaScript $hitlist{}. * Fix RST formatting errors in omega docs. * Clarify use of Q prefix for unique ID terms - it was described as "reserved", but the use of "Q" is really just a convention (and in fact omindex uses "U" not "Q"). * Clarify scriptindex's weight action takes parameter >= 0. * Correct typo in OmegaScript $add parameter documentation. indexers: * omindex: + Fix typo in mimetypes used for Apple iWork documents ("apply" instead of "apple") which meant that these documents weren't actually being indexed. Patch from Bruno Baruffaldi. + Pipe input to ps2pdf as this accepts input on stdin. Possibility pointed out by Gaurav Arora. * scriptindex: + If parsedate action's format includes %z adjust for the timezone if possible (this requires the non-POSIX tm_gmtoff member of struct tm) and flag an error for other platforms. 
+ If parsedate action's format includes %Z flag an error as that doesn't seem to be usefully supported by strptime() anywhere. + Fix parsedate action to treat formats without a timezone as being UTC instead of localtime. + Add date=unixutc. The existing date=unix works in localtime which is unhelpful if you want to use it on the output of parsedate since that's in UTC; date=unixutc is just like date=unix except it always works in UTC. + The date action now emits a warning for invalid values. The documentation used to say "invalid values are ignored at present", but it's more helpful to flag bad data than quietly ignore it. + We now check the date action's parameter at script parse time and unknown values result in an error and nothing being indexed. Previously an unknown format uselessly resulted in the terms D, M and Y literally being added to every document. + The split action now supports a new "prefixes" split style. This gives all the prefixes from the split, so split=/,prefixes on a file path gives all parent directories. omega: * Remove documented limitation of $subdb and $subid - the implementation assumed that each omega database name corresponded to a single Xapian database, and if a database name referred to a stub database file expanding to multiple Xapian databases then they would misbehave. Such cases are now handled properly as well. * Extend $addfilter to support adding negated filters via a new optional second argument which specifies the type of filter to add. * Stop $sort from needlessly ensuring the match has run. * Handle corner case of nested $hitlist gracefully instead of potentially entering an infinite loop. testsuite: * omegatest: Avoid setting TZ globally during tests as that hides bugs where behaviour depends on the local timezone when it shouldn't. * omegatest: Support testing when built using LeakSanitizer by suppressing leak reports for cached compiled pcre regular expressions.
These aren't released when the program exits but aren't memory leaks. build system: * Remove outdated deprecation warning suppression which was there to support building from git in the run up to 1.3.2 - a development version which is nearly 5 years ago now. portability: * Fix problems with fallback strptime() implementation which was being included in the wrong binary, and was lacking a required const_cast on the return value. * Rework setenv() compatibility handling. Now that Solaris 9 is dead we can assume setenv() is provided by Unix-like platforms (POSIX requires it). For other platforms, provide a compatibility implementation of setenv() so the compatibility code is encapsulated in one place rather than replicated at every use. Omega 1.4.11 (2019-03-02): indexers: * omindex: + outlookmsg2html: Handle Subject, Date, and From headers. omega: * In $div and $mod we were converting a non-zero denominator from string to int twice for no good reason. testsuite: * omegatest: Fix testcase which was failing if the local timezone was behind UTC. This testcase was added in 1.4.10. * omegatest: Tweak to not fail when $time not supported - it seems that the OS time functions we use report an error on GNU Hurd for unknown reasons. build system: * Sync up probes for OS time functions in omega's configure with those in xapian-core which may solve $time not being supported on GNU Hurd. portability: * Add missing includes of <cerrno>. Fixes #776, reported by Matthieu Gautier. * Stop using htonl()/ntohl() in a non-network context which should improve portability to platforms without a POSIX-like socket API. Omega 1.4.10 (2019-02-12): documentation: * Use https for URLs where supported. indexers: * omindex: + Index .apxl and .kth files as Apple Keynote. The .apxl extension is used for the XML files inside .key bundles/directories which hold the text content of the presentation, and by handling them we can index .key directories more usefully. 
It seems they are also sometimes found by themselves. Keynote themes have a .kth extension, and key2text can also handle these. + Pipe input to pdftotext, pdfinfo and dpkg. These tools all support piping an input file on stdin, which can be a little more efficient when we already have the file open (e.g. to determine its type using libmagic, or to calculate its checksum). + An empty string for the start directory is now flagged as an error. Previously `/` was used instead, which is unlikely to be what is wanted (and `/` can be explicitly specified if that really is what is wanted). + Fix emulation of stderr redirection when the indexer's stderr has been closed. We try to avoid using the shell when running external filters, and emulate 2>/dev/null in commands, but if the indexer's stderr was closed this emulation was buggy and would give the filter a closed stderr instead of one redirected to /dev/null. + When emulating redirection to /dev/null, we now open /dev/null once and dup that fd each time which is a little more efficient and simplifies the code. * scriptindex: + date=unix is now a no-op for empty input - previously it would unhelpfully add boolean date terms for 1970-01-01. + Warn for empty filename in LOAD action. Previously this gave a slightly confusing error: "Couldn't load file '': No such file or directory" + Unknown command-line options now cause scriptindex to give a non-zero exit status. testsuite: * omegatest: Add testcase for SPAN.n on different slots. * omegatest: Update expected QueryParser output for the xapian-core change to produce flatter Query trees. build system: * Use AM_ICONV to detect iconv() which should handle non-system install of GNU libiconv properly. Fixes #775, reported by Ryan Schmidt. portability: * Provide fall-back strptime() implementation for platforms which don't provide it, using the C++11 std::get_time() function.
We use strptime() directly where it's available as some older C++11 compilers seem to lack std::get_time() (GCC 4.8 for example). This is used by the parsedate action, which was added in 1.4.6. Omega 1.4.9 (2018-11-02): indexers: * omindex: + Try harder to avoid opening a file being indexed more than once by reusing the file descriptor in more cases. + Hint to the OS not to cache output from external filters which require using a temporary file. * scriptindex: + If the LOAD action successfully opens a file but hits a read error the error message now reports the file name correctly. Previously it would report the partial file contents read so far instead of the file name. portability: * We no longer call posix_fadvise() with POSIX_FADV_NOREUSE under Linux, since it's still not implemented there. We also now only call posix_fadvise() with POSIX_FADV_DONTNEED right before we close the file descriptor under Linux. Omega 1.4.8 (2018-10-25): documentation: * Assorted minor documentation improvements. indexers: * omindex: + Improve date handling in .eml files. We now handle a "Date:" header without the day of the week, which is allowed by RFC822 and RFC2822 (though seems rare in practice). If the date can't be parsed, we now just omit the date information rather than failing to process the file. + Add support for indexing Apple iWork documents (Keynote (.key), Numbers (.numbers) and Pages (.pages)) using libetonyek. Currently only the file variants are handled since omindex doesn't currently support indexing a directory as a document. + Index Visio files using vsd2xhtml. + Extend --filter to support filters which produce SVG as output. + Handle SVG embedded in XML with svg: namespace prefix. + Add --read-filters option to read a list of filters from a file, each line of which is a rule as passed to --filter. Based on a patch from Gaurav Arora. 
+ Add new --mime-type-match option which allows specifying a MIME Content-Type for a given shell filename pattern (with the special Content-Type values "ignore" and "skip" supported, as for --mime-type). + Adjust --mime-type to allow ':' in the extension. A valid MIME Content-Type can't contain a colon, so if the argument to --mime-type contains more than one colon it makes more sense to split at the *last* colon (we used to split at the first), as an extension could conceivably contain a colon. Mostly this change is for consistency with the new --mime-type-match option, where the leafname pattern could reasonably contain a colon. + Remove failed entries for ignored files. If a file is mapped to pseudo-mimetype "ignore" then remove any existing failure record for it so that we don't potentially end up with a lot of cruft failure records for files we are no longer trying to index. + If a file fails to index due to failing to allocate enough memory we now try to flag it as failed to index so it will be skipped by default on future runs. This should help to avoid indexing getting stuck on problematic files. + Add a "pages" field with the number of pages in the document where we know how to determine this (currently only for PDF files for which pdfinfo reports this information). + Handle initially empty database exactly the same way as when --overwrite is specified. This probably has no user-visible consequences, but it's cleaner for the handling to be exactly the same. * scriptindex: + Improve scriptindex diagnostic messages. All diagnostics are now labelled as "error", "warning" or "note" as appropriate, and we now consistently report "FILE:LINE:" (and also "COLUMN:" in most cases) to make it clearer where the problem lies. + Add new "split" action which splits the text on a specified delimiter and executes the following actions for each piece. Based on a patch by Gaurav Arora.
+ Missing whitespace after the closing " on an action argument is now flagged as an error. Previously scriptindex would attempt to parse the following characters as the next action. + Support C-like escapes for quoted parameter values. Notably this means it is now possible to include `"` in quoted parameter values. omega: + Value-based date range filters can now be specified via CGI parameters START.N, END.N and/or SPAN.N where N is a value slot number, allowing multiple concurrent filters on different slots to be specified. + Support YYYY and YYYYMM limits in term-based date ranges. Previously value-based date ranges supported these as limits, but term-based date ranges gave an error. + Add stem_strategy option and deprecate existing stem_all option in favour of this new more versatile option. + Support "natural" $sort option via new flag "#" which sorts embedded natural numbers in numerical order. + Support numeric $sort option via new flag "n", similar to GNU sort -n. + Rewrite field parsing to be more efficient, and store fields in an unordered_map for faster lookup. testsuite: * htmlparsetest: Test whitespace collapsing. portability: * omegatest: Avoid "set -". The autoconf manual notes that POSIX no longer requires this, and that with traditional shells it resets -v and -x which makes debugging harder. * omegatest: Fix shell printf quoting issues which were a latent bug on macOS. * Drop special handling for Compaq C++. We never actually achieved a working build using it, and I can find no evidence that this compiler still exists, let alone that it was updated for C++11 which we now require. Omega 1.4.7 (2018-07-19): omega: * New OmegaScript $unique command. The existing $uniq only removes adjacent entries (like the Unix uniq command) so to fully remove duplicates you need a sorted input. Sometimes it is desirable to remove duplicates from an unsorted list without changing the order of the entries which are left, so add $unique to do that. 
If the list is sorted already, then $uniq is more efficient. * Fix $map to cleanly reject a single argument. templates: * templates/query: Merge multiple entries in the term frequency information, which came from searching several prefixes by default. Reported by Alistair Buxton on #xapian-discuss. * When multiple words with the same stem are in the query string we now fully eliminate duplicates when showing term frequency information. Omega 1.4.6 (2018-07-02): general: * Fix generate_sample() (used by OmegaScript $truncate and omindex) to return an empty sample instead of throwing an exception when the requested sample size is less than the size of the truncation indicator string. Patch from Addy. Fixes https://trac.xapian.org/ticket/754 reported by Gaurav Arora. documentation: * Use terminology "value slot number" instead of "value number". * Stop talking about "probabilistic terms" and "probabilistic queries" - we've supported other families of weighting schemes since 1.3.2. indexers: * Check for the HTML5 doctype or legacy doctype declaration and use default charset UTF-8 if either is present. Previously we always used ISO-8859-1, which is correct for older HTML versions, but not for HTML5. * omindex: + When running commands without going through the shell, emulate shell exit codes 127 (for command not found) and 126 (for other cases where we fail to run the command). This means the "missing filter" handling should now work properly for such commands. Noted by Gaurav Arora. + Index POD files despite minor formatting errors. We now pass --errors=stderr to pod2text so that minor formatting errors don't prevent us from indexing a file. (It may seem that --errors=none is a better option, but for podlators < 4.11 that results in an ERRATA section in the generated text version which we then end up indexing; 4.11 fixed that but we can't assume that's in use). Reported by Gaurav Arora. 
* scriptindex: + Avoid some unnecessary copying of Action objects by making use of C++11 features. + Consistently send errors to stderr - some were sent to stdout. Patch from Gaurav Arora. + Add new "hextobin" action. Based on a patch from Gaurav Arora. + Warn about non-integer arg to hash. + Fix hash action without an argument, which was failing with an assertion. Based on a patch by Gaurav Arora: https://github.com/xapian/xapian/pull/189 + Reject 'hash' with argument < 6. The hashing truncates and then adds a 6 character hash of the removed part, so can't produce a result shorter than 6 characters. Patch from Gaurav Arora. + Look for alphanumerics when parsing index actions. None of the current index actions contain digits, but we give more helpful error messages this way. + Deprecate allowing spaces around = in scripts. This was never documented as supported, and leads to a missing argument quietly swallowing the next action rather than using an empty value or giving an error. Reported by Gaurav Arora in https://github.com/xapian/xapian/pull/182 + In boolean and unique actions, add a colon between prefix and term when the term starts with a colon. This means the mapping is reversible, and matches what omega actually does in this case when it tries to reverse the mapping. Thanks to Andy Chilton for pointing out this corner case. + Add parsedate and valuepacked actions. Together these assist adding date values for sorting and date range filtering. Based on a patch from Gaurav Arora. + Use DB_RETRY_LOCK to wait if the database is already in use rather than sleeping for a second and retrying. On most platforms this means we make a blocking request for the lock, and even on platforms where that's not supported, we now sleep and retry inside libxapian, and without having to throw and catch an exception each time. omega: * $freq: Speed up some cases by avoiding throwing and catching an exception when we know the MSet has no term frequency information. 
* $sort: New OmegaScript command which does a string sort on an OmegaScript list, with u (unique) and r (reverse) options. * $cond: New OmegaScript multi-way conditional. Inspired by LISP's COND, this provides a neater way to write a cascade of $if checks. * $switch: New OmegaScript multi-way conditional which provides an even neater way to write a cascade of $if{$eq{X,VALUE1},$if{$eq{X,VALUE2},...}}. * $subdb and $subid: New commands which report the subdatabase name and the docid in that subdatabase. * $termprefix and $unprefix: New OmegaScript commands which expose the existing code inside omega for splitting up a term. * Use str() to convert time_t to string, which is simpler code and faster than using snprintf(). testsuite: * omegatest: Fix message when faketime is not installed - we were misreporting this case as "faketime not working". * omegatest: Add feature tests of $map. * Add testcases for XML charset. We already handle both default and specified charsets for XML, but we didn't have any testcases for it. build system: * configure: Fix potentially confusing messages suggesting snprintf was added in C90 - it was actually standardised in C99. * Improve handling of multitarget rule stamp files. Clean them on "make maintainer-clean" and ship them so that --enable-maintainer-mode when building from a tarball doesn't needlessly rerun the multitarget rules. portability: * Check for EAGAIN as well as EINTR from select(). The Linux select(2) man page says: "Portable programs may wish to check for EAGAIN and loop, just as with EINTR" and that seems to be necessary for Cygwin at least. packaging: * Use https for tarball URLs in .spec files. This provides protection against MITM attacks on people building packages using these spec files, and is also slightly more efficient as the http: URLs redirect to the https: versions anyway.
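The omindex entry for 1.4.6 above describes emulating shell exit codes 127 and 126 when a command is run without going through the shell. A minimal sketch of that idea follows, assuming Python's subprocess module in place of omindex's own C++ code; the function name run_without_shell is invented for this illustration, and treating every other start-up failure as 126 follows the wording of the entry rather than the real implementation.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run argv directly (no shell) and return a shell-style exit status.

    Hypothetical helper: mirrors the behaviour described in the 1.4.6
    entry, not omindex's actual implementation.
    """
    try:
        return subprocess.run(argv).returncode
    except FileNotFoundError:
        # A shell reports 127 for "command not found".
        return 127
    except OSError:
        # Any other failure to start the command (for example, the file
        # exists but isn't executable) is reported as 126.
        return 126


if __name__ == "__main__":
    print(run_without_shell(["no-such-filter-command"]))
    print(run_without_shell([sys.executable, "-c", "raise SystemExit(3)"]))
```

With the status emulated this way, a caller which checks for 127 to detect a missing filter needs no separate code path for commands started without a shell.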
Omega 1.4.5 (2017-10-16): documentation: * Direct users towards $set{flag_spelling_correction,true} rather than the deprecated $set{spelling,true} (which is slated for removal in 1.5.0). * Fix typo in docs. indexers: * omindex: + Check file size before calling libmagic to get the mime type, since reading the file size is a much cheaper check and we can skip the libmagic test if the file is empty or larger than the specified maximum size. Patch from caiyulun. * scriptindex: + Reject index scripts with multiple "unique" actions. We don't handle this case sensibly, and it doesn't seem like it really has a use, so better to give an error for people who do this inadvertently. omega: * New $seterror command to set the error message. Implemented by Gaurav Arora. * Make $highlight more efficient. Patch from Vivek Pal. templates: * query: Use $prettyurl for the URL shown at the end of each match (previously we only used it on the URL shown as a fallback when the document has no title). Split off from changes by Vivek Pal in https://github.com/xapian/xapian/pull/161 testsuite: * omegatest: Tell faketime to freeze the clock - previously the clock ran on from the specified fake time, and on a slow and/or heavily loaded machine a test taking more than a second might fail due to this. * Start adding feature tests for scriptindex (so far, checking that specifying multiple 'unique' actions results in an error). Omega 1.4.4 (2017-04-19): indexers: * omindex: + 1.4.3 added a new --sample option, but contrary to the documentation the default behaviour was to take the sample from the meta description (which was the hard-wired behaviour in 1.4.2 and earlier). The default has now been changed to take the sample from the body. + Index .shtm, .xhtml and .xhtm as HTML by default - .shtm is another extension used for server-parsed HTML (in addition to the more common .shtml), and .xhtm and .xhtml are XHTML. + Fix fallback lookup for extension containing upper case. 
User mappings worked, but built-in extension to MIME type mappings were effectively being ignored (because the result of the function call was not being checked). Bug introduced in 1.3.4. + Fix term-based date ranges, broken by changes in 1.4.2. Found and diagnosed by Gaurav Arora. + Handle date range with start after end better - with term-based ranges, this used to generate a bogus filter, but now just generates Dlatest. + Use Y-term when range starts/ends at year start/end. Previously we used 12 M-terms for these cases. + Use full leap-year check when constructing term-based date ranges - previous code was good until 2100, but even then it would only result in an extra term being included for a non-existent February 29th in rare cases. omega: * New OmegaScript command $cgiparams which returns a list of the parameter names. * Handle tab in a CGI parameter name in the same way as space. Mostly this is a way to avoid having tabs in CGI parameter names - they aren't useful, but if they could have tabs in we can't put CGI parameter names in a list. templates: * query: Fix highlighting of matching terms. We were using both $snippet and $highlight, which results in double highlighting and HTML escaping, most noticeable by literal <strong> and </strong> appearing around matching terms in the rendered HTML snippet. Reported by Mark Thomas on xapian-discuss. build system: * If gen-mimemap failed after creating mimemap.h, the rule wouldn't get rerun. Omega 1.4.3 (2017-01-25): indexers: * omindex: + Add support for indexing vCard files if Perl and its Text::vCard module are available. + Recognise application/x-rpm as alternative type since libmagic reports this rather than application/x-redhat-package-manager. + Use official MIME type application/vnd.debian.binary-package for debian packages. We used to map .deb and .udeb to application/x-debian-package, but in 2014 (after we added that support for .deb) an official type was registered with IANA. 
We now map extensions .deb and .udeb to the official type, but the unofficial type is still recognised (older versions of libmagic probably report it, and users may be mapping to it). + Handle PHP as MIME type text/x-php. The main difference this makes is that PHP files which don't have extension '.php' (e.g. .phtml, .phps, .php5, .ph4, etc) get identified by libmagic as text/x-php and will now be indexed. It also means that the user can now more easily configure different filters for HTML and PHP. + Don't use meta description as sample by default. Now we have dynamic snippets (via $snippet), the body text is a better default. Also generated HTML sometimes has unhelpful content in the meta description. To get the previous behaviour, use the new omindex command line option: --sample=description Omega 1.4.2 (2016-12-26): documentation: * Replace auto-generated list of the supported MIME types with an auto-generated table showing the extensions that are mapped to each MIME type by default. Partly addresses #569, reported by catkin. indexers: * omindex: Add support for indexing markdown files (extension .md or .markdown, mime-type text/markdown, using "markdown" to convert to HTML). testsuite: * Add support for "make installcheck" to run tests against installed version. build system: * configure: Fail with clear error with xapian-core < 1.4.0. portability: * Fix GCC -Wimplicit-fallthrough warning. * Add missing <ctime> for time_t. * Avoid snprintf for formatting fixed-width integers - it results in warnings about possible output truncation with GCC7 (which aren't actually possible due to limited input range) and it's a bit heavyweight for this job anyway. Omega 1.4.1 (2016-10-21): documentation: * Document bug in how $filters encodes DOCIDORDER=A. * Suggest DOCIDORDER=X for DONT_CARE. * Correct mentions of C++ API method MSet::get_snippet() to MSet::snippet(). * Fix typo in Omega 1.4.0 NEWS entry. Patch from James Aylett.
indexers: * omindex: Also index leafname with _ and & replaced by spaces. Literal spaces are often avoided in filenames, and "hello_world.txt" ought to be searchable via "hello" and "world". Partly addresses #618, reported by Julien Pfefferkorn. omega: * Add support for sorting by more than one value - e.g. SORT=+1,-2 * Add $msizelower and $msizeupper which provide access to the lower and upper bounds on the number of matches. * Add support for $set{weighting,coord}. * Add weightingpurefilter option. Normally a query consisting only of filter terms won't have relevance weights calculated. This new option allows you to specify a weighting scheme to use for such queries, with the same values supported as for the existing weighting option. For example, $set{weightingpurefilter,coord} will weight such queries by how many filter terms match each document. * $filters now includes DATEVALUE, which means we'll force the first page when reloading or changing page starting from existing URLs upon upgrade to 1.4.1, but the exact same existing URL could be for a search without the date filter where we want to force the first page, so there's an inherent ambiguity there. Forcing first page in this case seems the least problematic side-effect. Omission noted by Gaurav Arora. testsuite: * Add feature test for boolprefix and prefix maps. * Add more feature tests for $filters. build system: * GCC 4.7 is now enforced as the minimum version. * Drop unused configure check for symbol visibility. * Drop compiler options that are no longer useful: + -fshow-column is the default in all GCC versions we now support (checked as GCC 4.6). + -Wno-long-long is no longer necessary now that we require C++11 where "long long" is a standard type. portability: * Fix build on platforms which don't provide timegm(), such as Cygwin. Reported on xapian-discuss by John Bankert. Omega 1.4.0 (2016-06-24): documentation: * Clarify $allterms and $terms documentation.
Make it clearer how they differ, and document that $allterms without a parameter list gives all terms indexing the current hit. Noted by Andy Chilton. Omega 1.3.7 (2016-06-01): indexers: * Make named entity look-up (e.g. &eacute; -> 233) use the same keyword-lookup table approach we already use for HTML tags and built-in MIME content-types, rather than a std::map, which makes it faster while using less memory. Omega 1.3.6 (2016-05-09): documentation: * Fix overview.rst processing in VPATH build. Our workaround for lack of an include path in docutils was only handling the first include in the file. omega: * Implement $match command for omegascript. Patch from Richhiey Thomas. templates: * Lower case all HTML tags, attributes and values; explicitly close <option> tags. Patches from Vivek Pal and Nirmal Singhania. * Migrate Omega Templates to HTML5. Patch from Nirmal Singhania. * templates/query: Remove stray double quote from generated URL for spelling suggestion when THRESHOLD is set. Patch from Nirmal Singhania. * templates/opensearch: Change response feeds to support OpenSearch 1.1. Patch from Nirmal Singhania. testsuite: * Update omegatest - the order of subqueries has changed in some cases, due to the "grouping" changes in the C++ API. build system: * Drop workaround for old git master before 1.3.2. Omega 1.3.5 (2016-04-01): This release includes all changes from 1.2.23 which are relevant. omega: * Add optional prefix argument to $terms. * $snippet now uses MSet::snippet() instead of the Snipper class. * Add $contains{STRING1,STRING2}. Contributed by Ayush Gupta. * Add support for negated boolean filter terms, specified by CGI parameter "N". * Support a direction prefix on SORT: '+' for ascending, '-' for descending. SORTREVERSE set to non-0 now flips the direction. Fixes #697, reported by Andy Chilton. build system: * Need to AC_SUBST probed value of ZLIB_LIBS. Noted by Paul Wise.
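The SORT direction prefix from the 1.3.5 entry above, together with the comma-separated form (e.g. SORT=+1,-2) added in 1.4.1, can be illustrated with a minimal Python sketch. This is not Omega's code: the function names are hypothetical, and the direction used when no prefix is given is an assumption exposed as a parameter rather than something these notes state.

```python
def parse_sort_spec(spec, default_ascending=True):
    """Parse a SORT-style value such as "+1,-2" into (slot, ascending) pairs.

    default_ascending is an assumption: these notes don't say which
    direction an unprefixed value number gets.
    """
    keys = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        ascending = default_ascending
        if part[0] in "+-":
            # '+' means ascending, '-' means descending.
            ascending = (part[0] == "+")
            part = part[1:]
        keys.append((int(part), ascending))
    return keys


def apply_sortreverse(keys, sortreverse):
    """Flip every direction when SORTREVERSE is non-zero."""
    if sortreverse:
        return [(slot, not ascending) for slot, ascending in keys]
    return keys
```

For example, parse_sort_spec("+1,-2") gives value slot 1 ascending followed by value slot 2 descending, and passing that result through apply_sortreverse with a non-zero flag inverts both directions.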
portability: * omegatest: Test faketime actually works, and if it doesn't work skip testcases which use it. On OS X 10.11, faketime from homebrew doesn't seem to work, probably due to the new "System Integrity Protection". Fixes part of #707, reported by James Aylett. Omega 1.3.4 (2016-01-01): This release includes all changes from 1.2.22 which are relevant. documentation: * The lists of recognised MIME types and of ignored extensions are now generated along with the corresponding source code from a single master list. Partly addresses #569, reported by Charles Atkinson. * Note when $json and $jsonarray were added. indexers: * omindex: + Avoid using the shell to run most external commands as it's unnecessary overhead. For the built-in filters, the only cases which now use a shell are where we run two unzip commands. For user-specified commands, a simple and slightly conservative test is used, which should avoid a shell in most common cases where it isn't needed. Notably, environment variables set before the command are handled. + Track files which couldn't be indexed in the user metadata and skip them by default on subsequent runs to avoid the costs of repeatedly running a filter on a file it can't handle. Run omindex with --retry-failed to retry such files. + Overhaul the "per-site" terms: - 'H' prefix is hostname as before, except that if the term would be > 240 bytes (unlikely but possible) the end is hashed in the same way 'U' prefix terms are. - 'P' terms are now added for every directory level, not just the start URL's path. - A new 'J' prefix term is added with the start URL (less any trailing '/'), which means all files indexed from a particular "site" are now indexed by one term. See #376. + Add 'skip' pseudo-mimetype which extensions can be mapped to, and they will then be reported and skipped (to complement the existing 'ignore' pseudo-mimetype which causes files with the specified extension to be quietly ignored).
+ Treat a command of 'true' specially as meaning make the text extraction a no-op (as actually running /bin/true effectively would). This provides a way to index some file types by only meta-data. Fixes #519, reported by Brian Burton. + Add support for wildcard mimetypes */* and *. Combined with filter command ``true`` for indexing by meta-data only, you can specify a fall back case of indexing by meta-data only using ``--filter '*:true'``. From a suggestion by Brian Burton on xapian-discuss. + Index message/rfc822 and message/news. These are individually saved email messages and news articles. + Index archived web page formats MAFF and MHTML. + Handle .xla, yet another XL extension. + Handle metadata in LibreOffice HTML export (dcterms.subject, dcterms.description, dcterms.creator and dcterms.contributor). + Use zlib's gzopen() instead of invoking "gzip -dc" for compressed Abiword documents. omega: * Add options argument to $transform. * Cache compiled regexps used in $transform. * Add $ord OmegaScript command which returns the Unicode codepoint for the first character of a UTF-8 string. * Add $chr OmegaScript command which returns the UTF-8 string for given Unicode codepoint. * Add $csv OmegaScript command which escapes a string for use as a field in a CSV file ("always quote" mode inspired by patch from Gaurav Arora.) * New $filters encoding which avoids collisions. We also compare CGI parameter xFILTERS to what $filters would have returned in previous releases, so that on upgrades old format serialised filters are handled correctly. * Fix $jsonarray not to prepend ']' to the first array element. * Skip weighting scheme setup for a pure date range query - it won't be weighted anyway, so we can avoid having to parse weighting scheme parameters, etc. * Use value ranges when date range filtering by value. Should be more efficient than a MatchDecider, and will automatically take advantage of any future value range optimisations in xapian-core. 
* Add default_db and default_template config options. These allow the default database name and default template to be set via the config file, rather than being stuck with the respective defaults of "default" and "query". Fixes #310, reported by Marco Hennigs. * Add support for non-exclusive filters. Fixes #234, reported by Thomas Viehmann. testsuite: * Add start of testsuite for omega CGI. build system: * configure script now defaults to looking for xapian-config-1.3. This is now automatically done for development series (odd middle component of the version number), but not for stable series (even middle component). Fixes #695, reported by Jorge C. Leitão. * Don't pointlessly link omega binary with libmagic (as we have since 1.3.1). portability: * Fix "make check" compilation failure on platforms without timegm(). Omega 1.3.3 (2015-06-01): This release includes all changes from 1.2.20-1.2.21 which are relevant. documentation: * INSTALL: IRIX is past EOL so drop information about IRIX make. indexers: * omindex: + Add support for %f in command passed to --filter to allow specifying commands where the input file is not the final argument. Fixes #570, reported by Charles Atkinson. + Allow --filter to handle commands which produce output in a temporary file rather than on stdout. + Allow --filter to specify the character set of the output the filter produces. + Handle application/vnd.ms-excel, text/x-perl and application/x-dvi via default --filter settings instead of hardcoded cases (now possible thanks to the new abilities that --filter has). + Add support for specifying a MIME subtype of '*' in --filter arguments. + Add --track-ctime option to allow omindex to pick up changes to file ownership and permissions. + Index terms from the leafname with an 'F' prefix, rather than treating them as more body text. (Fixes #633, reported by Emmanuel Garette) omega: * Fix handling of multiple P.<prefix> fields - previously only the first seen was used.
These fields are also now taken into account when deciding if the query has changed. $query now returns an OmegaScript list with one entry for each CGI parameter passed. templates: * templates/query: Fix setting of prefix map for P - in 1.3.2, this would fail to also search in the subject. Now it also searches in the subject and topic. build system: * configure: Fix typo in message: 'libmagic-devl' -> 'libmagic-devel' portability: * Require a compiler with good C++11 support, like xapian-core now does. * Now we require C++11, just include <cstdint> for uint32_t. * Link omindex-list with our (GNU) getopt for platforms which don't use GNU libc. Thanks to James Aylett. * Add timegm.cc to scriptindex_SOURCES to fix build on platforms which don't provide timegm(). * Suppress bogus uninitialised variable warning with -Os under GCC 4.7.2. packaging: Omega 1.3.2 (2014-11-24): This release includes all changes from 1.2.16-1.2.19 which are relevant. documentation: * docs/overview.rst: Document built-in list of stopwords. * docs/termprefixes.rst: Update for renaming of 'brass' backend to 'glass'. indexers: * omindex: + The starting URL wasn't previously URL encoded. In 1.2.18, a minimally intrusive fix was implemented. In 1.3.2, we now encode the starting URL as we do for the rest of the filename. + Don't assume .doc is application/msword but let libmagic decide, since .doc files may actually be RTF, and sometimes people use .doc for plain-text documentation. + Add support for indexing 'topic' and 'created date' meta-data for OpenDocument format and HTML. + Index "topic" for PDF documents. + Commit changes and exit, rather than skipping the current file on most unexpected errors reading directories or initialising libmagic - otherwise we can end up deleting a lot of database entries on errors like EHOSTDOWN when indexing network mounts. + Add --opendir-sleep=SECS option to allow working around problems with indexing files on Microsoft DFS shares.
+ If we get ENOTDIR trying to index a file, skip it quietly (unless in verbose mode) as we already do if we get ENOENT, since ENOTDIR is what we get if the file and the directory it was in got removed between us getting the filename and trying to open it. + Handle ENOENT, ENOTDIR and EACCES from readdir(). + If we've already opened the file (as we often will have if using a modern libmagic with magic_descriptor() available), then use fstat() on that fd rather than stat()/lstat() on the pathname. + Pass error message string and errno value in ReadError exceptions. + Report strerror(errno) if we can't read a file. + Filtering via text/html now handles HTML documents which specify a charset. + Add support for indexing Microsoft Publisher files using pub2xhtml. + Restrict the length of what we consider to be an extension, currently to 7 characters or whatever the longest extension in the mime_map is if it is longer. + Avoid '//' in temporary filenames (cosmetic only). * omindex-list: New tool to list URLs of all the documents in a database (or list of databases) indexed by omindex. omega: * Allow setting query expansion scheme to "bo1". * Make the $json and $jsonarray force the text to be valid UTF-8, since otherwise the output isn't valid JSON. * Check parameters to $set{weighting,bm25 ...} and $set{weighting,trad ...} converted OK. Based on patch from Aarsh Shah. * Add support to $set{weighting,...} for bb2, dlh, dph, ifb2, ineb2, inl2, lm, pl2 when we're built against a xapian-core which is new enough to have these schemes. * Add $snippet to generate a snippet of text tailored to the search. build system: * configure: Enable GCC's -Woverloaded-virtual warning. portability: * Ship common/safewinsock2.h, needed under mingw. Omega 1.3.1 (2013-05-03): This release includes all changes from 1.2.10-1.2.15 which are relevant. documentation: * INSTALL,configure: Provide hints as to what package to install for magic.h. 
indexers: * The HTML parser now explicitly handles <APPLET>, <OBJECT> and <TR>. * Use a generated compact and efficient table to convert HTML tag names to enum codes - this is both faster and smaller than the approach we were using, with the benefit that the table is auto-generated. * Always use our built-in conversion code for the character sets it can handle (previously we'd use iconv if available; now we only use iconv for other character sets). This gives us more consistent results, and in particular means we now handle BOMs better (at least when using GNU iconv). * A lot of data labelled as "iso-8859-1" is actually "windows-1252". The two only differ in characters which are control characters in iso-8859-1, so assume the latter when we see the former. * omindex: + Extend --filter to handle commands which produce HTML on stdout. + Don't report an error if a file is deleted (or renamed) between us reading the directory entry for it and trying to read the file itself by default. In --verbose mode, the situation is still reported, but now with a specific message. + If omindex receives any of the signals SIGHUP, SIGINT, SIGQUIT or SIGTERM, then kill any active external filter child process, then handle the signal as we did before. If setpgid() is available, put each external filter in its own process group and kill the whole process group when we get a signal. + Use magic_descriptor() if the version of libmagic we're building against is new enough to have it. This eliminates an extra opening of a file being indexed in certain cases. + Use rst2html to handle .rst and .rest files. omega: * Add new $json and $jsonarray OmegaScript commands to support producing JSON output. * Add $truncate command which truncates a string after a word. * Add support for $set{weighting,tfidf} to allow the new TfIdfWeight weighting scheme to be used. build system: * configure: Now looks for libmagic in MAGIC_PREFIX, to allow building with libmagic installed in a non-standard location. 
* Remove support for 'configure --enable-quiet', 'make QUIET=' and 'make QUIET=y' - automake now supports 'configure --enable-silent-rules', 'make V=1' and 'make V=0' which are broadly equivalent and more standard. portability: * tmpdir.cc: Add safeunistd.h for rmdir, required by GCC 4.7 (reported by Gaurav Arora). Omega 1.3.0 (2012-03-14): general: * Make libmagic a required dependency. documentation: * docs/termprefixes.html: Document how to map a user prefix to multiple term prefixes. * docs/overview.html: Improve documentation of htdig_noindex. indexers: * omindex: + Index title with an 'S' prefix rather than no prefix. + If the document with the highest existing docid before the run was updated, we were reporting it as "added", but now we correctly report it as "updated". + Catch and report std::exception explicitly, so failing to allocate memory is no longer reported as "Unknown exception". * scriptindex: + Remove special error handling case noting that index=nopos was replaced with indexnopos - this was removed in 1.1.0 so there's been enough time to upgrade. omega: * DEFAULTOP now defaults to AND rather than OR, since that matches what pretty much every search engine does these days. Closes ticket#512. * Allow mapping a query string prefix to more than one term prefix (which xapian-core has supported since 1.0.4). * Add support for search inputs for multiple probabilistic prefixes, with support for per-prefix stemmers. * Drop legacy support for handling '.' separated terms in xP - that changed in Omega 0.9.7, more than 5 years ago now. * Remove support for OLDP CGI parameter which was superseded by xP approximately a decade ago, and isn't even documented! * Drop special handling for R-prefixed terms in $prettyterm - we stopped generating these in Xapian 1.0. templates: * templates/query: + We now map unprefixed queries to include S-prefixed terms to match the change in omindex to prefixing terms from the title with S.
You may want to make the same update to your own templates. + Set up prefixes for 'author:' and 'title:'. packaging: * xapian-omega.spec: We're ABI compatible within a release series so make dependency on xapian-core-libs >= rather than =. Omega 1.2.23 (2016-03-28): documentation: * Update links to Xapian website and trac to use https, which is now supported, thanks to James Aylett. indexers: * Fix HTML/XML entity decoding to be O(n) not O(n²) - processing HTML/XML with a lot of entities is now much faster. templates: * Remove unused country code to name maps. These were intended as examples, but they aren't very useful as such, and really just bloat the templates needlessly. Omega 1.2.22 (2015-12-29): documentation: * Stop maintaining ChangeLog files. They make merging patches harder, and stop 'git cherry-pick' from working as it should. The git repo history should be sufficient for complying with GPLv2 2(a). * Clarify help text for omindex --mime-type option. * docs/omegascript.rst: + Fix documentation of $last to say it's the MSet index *one beyond* the end of the current page. Reported by Andrew Chilton. + Clarify that $split and $substr work in bytes. Previously we said "characters" which could be taken as meaning they work with UTF-8 characters. + Update documentation for $filters - it was missing these CGI parameters from the list of those serialised: COLLAPSE, DOCIDORDER, SORT, SORTREVERSE, SORTAFTER + Explicitly note user can use $setmap to create their own maps. * docs/overview.rst: + SVG extraction is built-in too. + Expand paragraph about command `false`. Note the versions where explicit support was added, and that this will also work with any version on Unix, where `false` is a command. + Document `cdb_dir`. * docs/cgiparams.rst: Document behaviour if xDB is not set. * Change "characters" to "bytes" in a few places to clarify that we don't mean Unicode code points. indexers: * omindex: + Add '--title-size' option. 
+ Handle .oft the same way as .msg - it's some sort of template email, and has essentially the same format. omega: * Make $querydescription ensure the match has been run, so that it includes filters. * Avoid $allterms, $cgilist, $filterterms and $terms being O(n²) in the number of items in the returned list. * If xFILTERS is not set, don't force the first page as that's unhelpful if someone fails to set it in their template. * When environment variable SERVER_PROTOCOL is set to INCLUDED (as it is when we're being included in a page), we already suppress the HTTP headers, but now we suppress the blank line after the header too. * Support option flag_cjk_ngram if built against xapian-core >= 1.2.22. testsuite: * Add test coverage for parsing of HTML entities. build system: * Fix error reporting if PCRE isn't installed. Fixes #693, reported by lhz7370. portability: * Avoid warning when building with glibc >= 2.21. * Don't provide our own implementation of sleep() under __WIN32__ if there already is one - mingw provides one, and in some situations it seems to clash with ours. Reported to xapian-discuss by John Alveris. * Stop trying to use O_STREAMING - the patch to implement it was never merged into the Linux kernel, and I can't find any evidence that other platforms implement it. The constant value O_STREAMING used now seems to be used for the part of O_SYNC which isn't covered by O_DSYNC, which seems likely to hurt performance if anything. Omega 1.2.21 (2015-05-20): documentation: * docs/overview.rst: Document 'E' prefixed boolean terms for filtering by extension (see #668, reported by bramvdh). * docs/encodings.rst: Add a document about character encoding, as suggested by James Aylett in #550. indexers: * omindex: + outlookmsg2html: Fix handling of message/rfc822 subparts. omega: * $prettyurl now decodes valid UTF-8 sequences, and some additional ASCII characters in the path part: []@!$&'()*+.;= (Fixes #550 and #644, reported by catkin and terencz.) 
* $prettyurl now leaves the query and fragment parts of the URL alone and won't decode an escaped "/" (omindex doesn't create URLs with any of these, so we only risk breaking other URLs which have them). * Drop compilation date and time from output when run from the command line - they prevent reproducible builds and the version number is sufficient information. templates: * templates/query: When listing matching terms, don't make the commas italic. * templates/query: Eliminate blank line before <html>. * templates/xml: Add XML declaration. * templates/godmode: Specify charset utf-8 in the content-type. build system: * Link test programs with libtool's '-no-install' or '-no-fast-install', like we already do in xapian-core, which means that libtool doesn't need to generate shell script wrappers for them on most platforms. portability: * Add spaces between literal strings and macros which expand to literal strings for C++11 compatibility. * Remove 'register' as it's deprecated and clang spits out warnings because of that. Any modern compiler likely just ignores it as an optimisation hint anyway. Omega 1.2.20 (2015-03-04): documentation: * docs/cgiparams.rst: Improve wording of docs for SORT parameter. * docs/omegascript.rst: Update documentation references to DATE1, DATE2, and DAYSMINUS which were renamed in 0.6.x and the compatibility aliases removed in 1.0.0. indexers: * omindex: + Ignore extensions .msi and .msp, which are Microsoft installer files, but which libmagic sometimes incorrectly identifies as application/msword. + Interpret a command of "false" in "--filter" as meaning to ignore files with that MIME type. omega: * Handle CGI parameter [=0 as [=1. templates: * templates/xml: Update handling of DATE1, DATE2 and DAYSMINUS which were renamed in 0.6.x and the compatibility aliases removed in 1.0.0. 
build system: * configure: Use pkg-config in preference to determine flags needed to compile and link with PCRE, as this will just work when cross-compiling (at least under MXE). * configure: Define MINGW_HAS_SECURE_API under mingw to get _putenv_s() declared in stdlib.h. * Enable automake option 'subdir-objects' to avoid warning from newer automake. portability: * Avoid doing link tests with libmagic in configure as they fail on mingw due to not automatically picking up libraries which libmagic itself depends on. Omega 1.2.19 (2014-10-21): documentation: * docs/overview.rst: Note that pdftotext is part of poppler as well as xpdf. (Noted by Paul Wise) Omega 1.2.18 (2014-06-22): indexers: * omindex: + Work around libmagic returning a MIME content-type of "Composite Document File V2 Document[...]" or "application/CDFV2-corrupt" by returning a more suitable filetype based on looking at the file's extension. + The starting URL wasn't previously URL encoded. In 1.3.2, this will be fixed by URL encoding it as we do for the rest of the path; for the 1.2 branch we only URL encode it if it contains a character <= 31 or at least one of '#', '%', ':' or '?'. This avoids a one-off reindex of every document in the database in cases which work OK in practice. + When we skip a file because it exceeds the configured size limit, include that size limit in the message. omega: * Add support for setting the query expansion scheme to use. portability: * Don't compile in unixperm.cc - it isn't currently used, and it fails to build with mingw. (fixes #635, reported by Alexis Denis) * Fix warning when built with GCC 4.7.2 using -Os. * Removed unused inline function, fixing compiler warning. Omega 1.2.17 (2014-01-29): documentation: * docs/overview.html: Add Abiword as an example use of --filter, based on patch from Frank J Bruzzaniti (fixes #383). portability: * Fix "no previous declaration" warning on platforms which don't have mkdtemp().
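The 1.2.18 rule above for deciding whether the starting URL gets URL encoded can be written as a small predicate. This is a minimal Python sketch of the rule as worded in that entry, not the omindex source; the function name is hypothetical.

```python
def start_url_needs_encoding(start_url):
    """Return True if the starting URL would be URL encoded under the
    1.2.18 rule: it contains a character <= 31, or at least one of
    '#', '%', ':' or '?'."""
    return any(ord(ch) <= 31 or ch in "#%:?" for ch in start_url)
```

A starting URL such as "/docs/" is left alone under this rule, which is what avoids the one-off reindex for databases whose starting URL already works in practice.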
Omega 1.2.16 (2013-12-04): indexers: * omindex: + Fix off-by-one when finding documents to delete which would sometimes cause omindex to fail to delete documents from the database when they weren't refound during an index update. + Decode dates in xlsx files. + Ignore extensions 'adm', 'cur', and 'ico' by default. + Group-readable files which are owner-readable but not world-readable should still get a "readable by owner" term added. Reported by Emmanuel Garette. build system: * Compress source tarballs with xz instead of gzip. * configure: Sync compiler warning flag machinery against xapian-core. The changes are special handling for clang, passing -fshow-column where supported, and handling for new warning flags in GCC 4.6 and 4.7. Omega 1.2.15 (2013-04-16): omega: * Don't pointlessly link utf8convert.o into the omega CGI. Omega 1.2.14 (2013-03-14): indexers: * omindex: + Correct "max" -> "min" when reserving space for shared strings in .xlsx files. This just means we now reserve a more appropriate amount of space to start with. + Ignore .com files by default. Omega 1.2.13 (2013-01-09): indexers: * omindex: + Extracting text using external filters now works for filenames containing a newline character - previously the newline got lost during escaping for the shell. + Fix segfault when -F option without a ':' is passed. + Skip a file if we get a read error while calculating the MD5 checksum (used for duplicate detection) - previously we used a checksum of the file up to that point. + Avoid rereading SVG and Atom files when we calculate their MD5 checksums. + Improve --help output and man page, most notably: - Say explicitly that --sample-size accepts the same formats as --max-size. - Note default size limit on files to index is unlimited. + When generating a sample for a CSV file, limit the size we pre-allocate to the CSV file size if that's smaller than the requested sample size, in case the user sets that limit very high.
omega: * Fix to decode %-encoded character at the end of the query string. build system: * INCLUDES is now deprecated in automake, so use AM_CPPFLAGS instead. Omega 1.2.12 (2012-06-27): No changes since 1.2.11 except to bump the version - this release was made to fix an incorrect library version information update in xapian-core 1.2.11. Omega 1.2.11 (2012-06-26): indexers: * Change HTML parser's handling of multiple <body> tags and of text outside of <body> to match the behaviour of modern web browsers. (ticket#599) * omindex: + Add command line option to control the size of the document sample stored. Patch from Mihai Bivol. + Rework .xlsx parsing to substitute the shared strings into the positions they are used in, so that the sample actually matches what appears in the spreadsheet, and to index calculated cell contents. + Improve handling of headers and footers in OpenDocument documents. + pdftotext outputs a formfeed between each page, which messes up our "empty body" check, so trim any trailing formfeeds before this check. build system: * Don't explicitly link indirect shared library dependencies on FreeBSD, OpenBSD, and Solaris. Omega 1.2.10 (2012-05-09): indexers: * Add support for CDATA to HTML/XML parser. * omindex: + Add --max-size option, based on patch from ndaley in ticket#587. + Add support for atom feed files, patch from Mihai Bivol in ticket#595. + If the document with the highest existing docid before the run was updated, we were reporting it as "added", but now we correctly report it as "updated". (Backported from 1.3.0). + Catch and report std::exception explicitly, so failing to allocate memory is no longer reported as "Unknown exception". (Backported from 1.3.0). * scriptindex: portability: * Fix to build with GCC 4.7 by adding cast to rlim_t to fix error about C++11 compatibility (reported by Gaurav Arora). 
Omega 1.2.9 (2012-03-08): documentation: * docs/overview.html: + Document that libmagic is used to determine the MIME type if the extension isn't known. Partly addresses ticket#569. + We now limit time as well as CPU and memory for external filters. indexers: * Our HTML parser now ignores sections bracketed by <!--UdmComment--> and <!--/UdmComment-->, like we already do for <!--htdig_noindex-->. * omindex: Add more extensions to the default ignore list: bin dat db fon jar lnk pyc pyd pyo sqlite sqlite3 sqlite-journal tmp ttf Omega 1.2.8 (2011-12-13): documentation: * scriptindex.cc: Add link to http://xapian.org/docs/omega/scriptindex.html to --help output (and so also to the man page which is generated from this). * omegascript.html: Add note to discourage use of percentage scores. indexers: * omindex: + If we don't get any data from an external filter for 5 minutes, give up - it has probably ended up blocked indefinitely. + Improve --help output (and man page which is generated from it). Closes bug#572. * scriptindex: + If no rules are found in the index script, report an error and give up - this is inevitably the result of a mistake, and adding empty documents to the database isn't helpful. omega: + Add new $prettyurl{} command which undoes RFC3986 URL escaping which doesn't affect semantics in practice. Partly addresses ticket#550. + Replace URL decoder with new implementation which handles various corner cases better. Fixes bug#578. + If CGI parameter P has trailing spaces, we now remove them all rather than leaving one. templates: * templates/query: HTML escape topterms. * templates/godmode: HTML escape the contents of document values. * templates/query: Don't show the percentage score in the default template. testsuite: * Add new urlenctest unit test of URL encoding and decoding. 
portability: * configure: Sync changes from xapian-core: Don't pass -Wshadow for GCC < 4.1; don't pass -Wstrict-null-sentinel for GCC 4.0.x; only enable symbol visibility on platforms where it is supported. packaging: * xapian-omega.spec: Package outlookmsg2html helper. Omega 1.2.7 (2011-08-10): documentation: * docs/termprefixes.html: Document how to map a user prefix to multiple term prefixes. * docs/overview.html: Improve documentation of htdig_noindex. omega: * Improve $version output from "Xapian - xapian-omega 1.2.7" to "xapian-omega 1.2.7". packaging: * xapian-omega.spec: We're ABI compatible within a release series so make dependency on xapian-core-libs >= rather than =. Omega 1.2.6 (2011-06-12): documentation: * docs/omegascript.html: Correct the documentation of the colours used by $highlight{}. * docs/overview.html: Add using unoconv as more complex example of using --filter (ticket#324). templates: * templates/query: + Make search query input type=search. + Autofocus the search query input (using HTML autofocus attribute with Javascript fallback for older browsers). (ticket#544) portability: * Fix a compiler warning. Omega 1.2.5 (2011-04-04): documentation: * Add index page which links to all the other documentation pages. * INSTALL: Copy new Multi-Arch section from xapian-core/INSTALL. Replace VPATH section with better equivalent from xapian-core/INSTALL. * docs/omegascript.html: Minor improvements. indexers: * The HTML parser no longer uses an exception to signify it has finished in the normal case as exceptions are typically costly to handle. In tests, this made omindex ~0.23% faster when indexing a lot of HTML files. * omindex: + Add --ignore-exclusions option, which will index HTML files despite meta robots tags, etc - omindex is often used in environments where such exclusions aren't relevant. + Fix to compile with older versions of libmagic which don't have MAGIC_MIME_TYPE (e.g. on Ubuntu hardy).
+ Tell xls2csv to separate fields with spaces rather than commas, and not to quote them. Fixes indexing of numeric fields, and means we don't need to use our CSV parser to get a sample. + Add whitespace between chunks of text extracted from Microsoft Office 2007 formats to prevent words in adjacent chunks from being run together. + Encode reserved characters in URLs - links to files with names containing '#' and '?' now work. + Handle .xlr extension the same way as .xls (later Microsoft Works versions apparently produce such files which are really the same format). + Index filename extension with new standard prefix E. + Just report the mimetype as unknown instead of saying "unknown Office 2007 MIME subtype". + Ignore *.css and *.js by default too. + Messages reporting skipping files are now more consistent and always report the filename. + New --empty-docs option to allow documents we extract no body text from to be indexed (existing behaviour), skipped, or reported and then indexed. omega: * Fix double Content-Type header in some error reporting situations (regression introduced in 1.2.4). * Update $url's URL encoding to follow RFC3986. * Allow QueryParser flags to be set from OmegaScript (ticket#418). The FLAG_SPELLING_CORRECTION flag can now be set using $opt{flag_spelling_correction,1} - the old $opt{spelling,true} way to enable this flag still works, but is now deprecated. templates: * templates/emptydocs,templates/godmode,templates/opensearch,templates/query, templates/xml: Add missing escaping. Some of these instances may allow cross-site scripting, so upgrading your templates is recommended, especially if you have any sensitive cookies set on the domain Omega is running on. * templates/xml: + Try $field{caption} (which is what omindex sets) before $field{title} when getting a value for the hit tag's title attribute - this is consistent with how the query template gets the title. + Add new 'type' attribute which gives $field{type}.
+ Add 'DBSize' attribute to <result> element. + Fix double escaping of matching terms. This is only likely to affect cases where a matching term contains '&'. + Remove support for undocumented HILITECLASS CGI variable. There's no evidence I can find using Google code search or web search that this has been used anywhere, and it's difficult to handle escaping it properly in the face of all the ways it could reasonably be used. portability: * Fix to compile on Microsoft Windows (ticket#350). Omega 1.2.4 (2010-12-19): documentation: * Minor documentation improvements. indexers: * Some iconv implementations (such as that on Mac OS X) don't handle many of the commonly seen mis-punctuated charset names (e.g. UTF16, UTF_16). We now check for this if iconv fails, fix up the charset name, and retry. * The built-in character encoding converter now handles spaces in charset names. * Use O_NOATIME if available and either the file is owned by the current euid, or the current euid is 0 (i.e. we're running as root). This avoids updating the access time of files we index which saves time. Fixes ticket#222. * Report get_description() for Xapian exceptions, which provides additional information above get_msg(). * Add boolean terms with add_boolean_term() so they get wdf of 0 and don't contribute to document length. * omindex: + Escape wildcard patterns being passed to unzip - in the unlikely event that one of these matched files in or under the current directory, we might fail to extract all the files we wanted to. + Add explicit support for indexing CSV files (better samples than from using '-Mcsv:text/plain'). + Add support for indexing .msg files from Microsoft Outlook (using the Perl module Email::Outlook::Message). (ticket#334) + Improve --help for --mime-type option. + Optionally use libmagic to detect MIME types for files for which we have no extension mapping, which allows us to handle files with a misleading extension, or no extension at all.
(ticket#114) + Add new --filter option which allows the user to specify new filters provided they return UTF-8 text on stdout. + If a filter command isn't installed, previously we wouldn't try it again for the same file extension - now we won't try it again for the same mime-type. + Index the leafname of the file (without any extension) as extra keywords. + Extract author from HTML, OpenDocument, and PDF files. Index it with an A prefix, and add it as a field. + Add support for indexing text and metadata from SVG files. + Extract metadata from Microsoft Office 2007 file formats. + Index text in headers and footers for .odt and .docx files. + Use the CSV parser to generate a nicer sample for files of type application/vnd.ms-excel. + Add support for indexing Debian and RPM package files (ticket#493). + Make the memory limit for filter processes the size of physical memory, which is a little less arbitrary than 7/8 of this value (ticket#424). + Under --duplicate=ignore, fix so that old documents which aren't seen get deleted, which wasn't implemented before (to suppress this deletion, pass -p as well). + Rename the short option for --version from -v to -V for consistency with scriptindex and many other packages, and to free up -v as the short option for --verbose. For backward compatibility, "omindex -v" is handled specially and still reports the version. + Add --verbose option, and disable the less interesting output unless it is specified. + Deprecate "--preserve-nonduplicates" in favour of new long option "--no-delete" which does the same thing, but has a clearer name. + The deletion of documents pass at the end of indexing is now more efficient. We track how many documents in the database we haven't seen so we can stop once we've found them all (a particularly big improvement if there are no documents to delete), and we now use a PostingIterator over all documents which avoids needing to catch an exception for every gap in the used document ids. 
+ Quietly ignore files with mimetype set to "ignore". The initial list of extensions set to ignore is: .a .dll .dylib .exe .lib .o .obj .so + Index file owner and read permissions, to allow finding documents with a particular owner, and so searches can be restricted to documents a user is able to read. + Add file size as a document value, so you can sort on it and filter by it. * scriptindex: + Fix file descriptor leak if the LOADFILE action is used on something which isn't a file. omega: * Make sure we write out HTTP headers when reporting an error early on. * Extend $field to take an optional DOCID argument, rather than always using the context from $hitlist. * Add new $emptydocs command which returns a list of documents with doclength zero. * Add support for size: range filtering. Currently the end points of the range have to be specified in bytes (e.g. size:102400..204800 for 100-200KB). templates: * templates/emptydocs: New template which lists documents with doclength zero. build system: * configure: Probe for any options needed to enable large file support. Handling files >= 2GB isn't especially useful, but more importantly this is needed to allow omindex to index files on filing systems with 64 bit inodes on some platforms (e.g. 32-bit Linux). * Use -no-undefined on platforms which need it to dynamically link such as cygwin (need to do this taken from ticket#282). portability: * Fix to compile with Sun C++. Omega 1.2.3 (2010-08-24): documentation: * docs/termprefixes.html: Update "flint and quartz" to "flint and chert" as quartz is no longer supported. Give exact term length limit for flint and chert. packaging: * xapian-omega.spec: Don't run autoreconf - it's no longer required. Omega 1.2.2 (2010-06-27): portability: * Apply getopt portability fixes from xapian-core 1.2.0, fixing build failures on Mac OS X (and probably some other platforms with non-GNU getopt implementations). 
(ticket#469) Omega 1.2.1 (2010-06-22): This release includes all changes from 1.0.21 which are relevant. Omega 1.2.0 (2010-04-28): This release includes all changes from 1.0.20 which are relevant. build system: * configure: Tell libtool not to link in deplibs on platforms where we know they aren't needed. * configure: On Linux, extract the library search path from ldconfig which gives us the default entries reliably. Omega 1.1.5 (2010-04-15): This release includes all changes from 1.0.19 which are relevant. Omega 1.1.4 (2010-02-15): This release includes all changes from 1.0.18 which are relevant. omega: * Use the optimised integer to string conversion routines from xapian-core. Omega 1.1.3 (2009-11-18): This release includes all changes from 1.0.15-1.0.17 which are relevant. templates: * templates/query: If JavaScript is available, convert $field{modtime} to a string on the client-side so that the timezone is correct. If JavaScript isn't available, fall back to the existing behaviour of using UTC. (ticket#314) build system: * configure: Default to looking for xapian-config-1.1 unless XAPIAN_CONFIG is specified. Omega 1.1.2 (2009-07-23): This release includes all changes from 1.0.14 which are relevant. indexers: * omindex: + Handle the "macroenabled" versions of MS Office 2007 files too (ticket#290). + Extract pptx notesSlides and comments, if present. (ticket#290). Omega 1.1.1 (2009-06-09): This release includes all changes from 1.0.13 which are relevant. indexers: * omindex: + Check the last modification time of files before reindexing (ticket#342). + Add "--spelling" option to index spelling correction data. * scriptindex: + Add new "spell" action for indexing spelling correction data (ticket#296). omega: * Add $suggestion and $opt{spelling} to provide access to spelling correction (ticket#296). * Add $opt{weighting} to allow the weighting scheme and parameters to be specified (ticket#298). 
* If SERVER_PROTOCOL in the environment is set to INCLUDED, then our output is being included in another page (e.g. using SSI) so suppress the output of any HTTP headers. templates: * templates/query: Offer any spelling correction QueryParser gives. build system: * configure: Sync warning flags used with GCC with xapian-core apart from -Woverloaded-virtual which fires for MyHtmlParser::parse_html(). That probably should be tidied up at some point, but not right now. Omega 1.1.0 (2009-04-23): indexers: * scriptindex: + Make deprecated "index=nopos" an error. omega: * New OmegaScript command $transform{} which performs regular expression substitutions using the PCRE library (which is now required to build Omega). (ticket#231) build system: * The build system is now bootstrapped with newer versions of autoconf and libtool which should produce smaller files and speed up configure and make. Omega 1.0.23 (2011-01-14): indexers: * omindex: + Escape wildcard patterns being passed to unzip - in the unlikely event that one of these matched files in or under the current directory, we might fail to extract all the files we wanted to when indexing document formats like OpenDocument which use a zip file container. + The parser for OpenDocument metadata wasn't initialising its "state" field. Often you'd be lucky and it would be initialised to zero, but this could have caused misparsing of metadata in some cases. * scriptindex: Fix file descriptor leak if the LOADFILE action is used on something that isn't a file. * If fstat() fails when trying to load a file, preserve the errno value from the fstat call to report to the user. portability: * configure: Probe for any options needed to enable large file support. Handling files >= 2GB isn't especially useful, but more importantly this is needed to allow omindex to index files on filing systems with 64 bit inodes on some platforms (e.g. 32-bit Linux). 
* Add -no-undefined to AM_LDFLAGS on platforms which need it to dynamically link such as cygwin (need to do this taken from ticket#282). Omega 1.0.22 (2010-10-03): portability: * Fix to compile with Sun C++. Omega 1.0.21 (2010-05-18): portability: * Fix build failure in freemem.cc on Microsoft Windows. Omega 1.0.20 (2010-04-27): portability: * Fix build failure on Mac OS X and possibly some other platforms (regression caused by fix for getopt-related warnings on Cygwin in 1.0.19). Omega 1.0.19 (2010-04-15): portability: * Fix getopt-related warning on Cygwin. Omega 1.0.18 (2010-02-14): indexers: * Make the default charset "utf-8" not "UTF-8" as we lower case explicitly specified character sets to compare to see if we need to reparse. Previously XML documents which explicitly specified their character set as UTF-8 would cause a needless restart of the parser. * omindex: + Increase the wdf boost for the document title from 2 to 5, since 2 isn't really enough. * scriptindex: + Don't abort with "Unknown Exception" if indexing is disallowed or we hit </body> for a document which had an overridden character set. Fixes ticket#410. Omega 1.0.17 (2009-11-18): indexers: * omindex: + On Linux, change the memory limit on external filters to use _SC_PHYS_PAGES since _SC_AVPHYS_PAGES excludes pages used by the OS cache and so will often report a really low value. Fixes Debian bug#548987 and ticket#358. + Fix likely crash when reading output from external filter program if read() is interrupted by a signal. + Fix potential crash when indexing PostScript files (fixed by using delete[] (not delete) for array allocated by new[]). testsuite: * utf8converttest: Charset "8859_1" isn't understood by Solaris libiconv, and isn't a standard charset name, so just test it when using our built-in converter and GNU libc. portability: * Fix build failure on Mac OS X 10.6.
* Also check for socketpair() in -lxnet if it isn't found without, which enables resource limits on external filter programs called by omindex on Solaris, and possibly some other platforms. Fixes ticket#412. Omega 1.0.16 (2009-09-10): * omega: Fix cross-site scripting vulnerability in reporting of exceptions (CVE-2009-2947). Omega 1.0.15 (2009-08-26): general: * omegascript.vim: The list of OmegaScript commands in the vim mode was rather out of date, and a few commands were misclassified. Fix both problems and avoid future recurrences by automatically generating those lists from the command list in query.cc. documentation: * omegascript.html: Document that $date uses UTC. (ticket#314) templates: * query: Link to "xapian.org" rather than "www.xapian.org". * inc/toptermsjs: Use double-quotes rather than single quotes for parameter values on the <script> tag. portability: * omindex: Implement correct handling of paths when calling external filter programs on Microsoft Windows. Omega 1.0.14 (2009-07-21): indexers: * omindex: Make sure that output is flushed after every message, not just after some of them. portability: * Avoid infinite loop in omindex and scriptindex when reading files under Cygwin with automatic end of line translation enabled. This same bug can also manifest on Unix platforms if the file is truncated by another process while being read. Omega 1.0.13 (2009-05-23): indexers: * omindex: + If the filter program needed for a file format isn't installed, report this explicitly when skipping subsequent files with the extension instead of misleadingly reporting "Unknown extension". + Make -s actually work as a short-form for --stemmer (as documented by "omindex --help" and "man omindex"). + Drop the copyright info from the output of --version as it's perennially out of date and we don't report it for any other Xapian programs. 
* scriptindex: + Add new "valuenumeric" action to add a document value using Xapian::sortable_serialise() to allow numeric sorting (ticket#260). build system: * configure: Enable more GCC warnings - "-Wstrict-null-sentinel" for 4.0+, "-Wlogical-op -Wmissing-declarations" for 4.3+. Omega 1.0.12 (2009-04-19): omega: * $log now retries a partial write, or one interrupted by a system call. build system: * configure: Fix iconv parameter type probe not to implicitly cast a string literal to char* - this is a warning under GCC currently, but the user could pass -Werror explicitly in CXXFLAGS, and this could be promoted to an error in future GCC versions, and may already be so for some other compilers. * Overriding CXXFLAGS at make-time (e.g. "make CXXFLAGS=-Os") no longer overrides any flags required for building with Xapian. * We now actually use the compiler warning flags which configure detects. Omega 1.0.11 (2009-03-15): documentation: * cgiparams.html: Note the technique of using a stub database file to allow a default of searching over multiple databases. indexers: * omindex: + Add support for indexing Microsoft Office 2007 formats and XPS files (bug#290). + Fix the extraction of metadata from OpenDocument formats. + Fix "-l" which would previously always cause a segmentation fault if used ("--depth-limit" wasn't affected). build system: * configure: The output of g++ --version changed format (again) with GCC 4.3 which meant configure got "g++" for the version. Instead use the (hopefully) more robust technique of using g++ -E to pull out __GNUC__ and __GNUC_MINOR__. * configure: Turn on _FORTIFY_SOURCE where available (as we do in xapian-core). portability: * Fix to compile when RLIMIT_AS isn't available (as on NetBSD and OpenBSD). Instead use RLIMIT_VMEM or RLIMIT_DATA if either is available, else don't try to limit the memory the filter process can use.
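The RLIMIT_AS fallback described in the 1.0.11 portability note above can be sketched roughly as follows. This is a Python illustration only, not omindex's actual C++ code: the function name `pick_memory_limit`, the idea of passing in a namespace of constants, and the choice to try RLIMIT_VMEM before RLIMIT_DATA are all assumptions made for the example.

```python
from types import SimpleNamespace

# Preference order: RLIMIT_AS first, then the two fallbacks named in the
# NEWS entry. Trying RLIMIT_VMEM before RLIMIT_DATA is an assumption.
PREFERRED_LIMITS = ("RLIMIT_AS", "RLIMIT_VMEM", "RLIMIT_DATA")


def pick_memory_limit(constants):
    """Return the name of the first limit that `constants` defines, else None.

    `constants` is any object exposing RLIMIT_* attributes, for example
    Python's `resource` module on Unix (hypothetical usage, not Omega code).
    A result of None means: don't try to limit the filter's memory at all.
    """
    for name in PREFERRED_LIMITS:
        if hasattr(constants, name):
            return name
    return None


# Simulated platforms; the attribute values are arbitrary placeholders.
with_rlimit_as = SimpleNamespace(RLIMIT_AS=9, RLIMIT_DATA=2)
without_rlimit_as = SimpleNamespace(RLIMIT_DATA=2)
no_limits = SimpleNamespace()
```

On a Unix system the same function could be handed the standard `resource` module, and the returned name looked up with `getattr` before calling `resource.setrlimit` in the child process.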
Omega 1.0.10 (2008-12-23): build system: * This release now uses newer versions of the autotools (autoconf 2.62 -> 2.63; automake 1.10.1 -> 1.10.2). The newer autoconf fixes a regression in autoconf 2.62 (and so Omega 1.0.7) with detecting the endian-ness of some platforms. Omega 1.0.9 (2008-10-31): documentation: * docs/overview.html: Document HTML parsing a bit, including robots meta and htdig_noindex. omega: * omega: Catch std::exception and report what its what() method returns. * omega: Remove undocumented and non-functional support for numeric sorting via CGI parameter SORT=#<slot> (SORT=<slot> works as before). build system: * configure: Sync warning flag handling changes from xapian-core to eliminate many warnings from GCC 4.3. Omega 1.0.8 (2008-09-04): documentation: * Fix a few typos and improve wording in a few places. indexers: * omindex: + If the character encoding is specified using <meta http-equiv=...> in an HTML document then reparse the document if it isn't the encoding we're already using so that any preceding <title> is converted correctly (bug#292). + Convert text from meta tag parameters to UTF-8 (bug#293). + Handle <meta charset="..."> (new in HTML 5). + Fix bug in HTML tag parameter parsing which was probably just a small performance penalty in real world cases, but could perhaps result in parsing bogus extra parameters in carefully contrived situations. portability: * Add missing <signal.h>, noted on FreeBSD by Henrik Brix Andersen. Omega 1.0.7 (2008-07-14): documentation: * omegascript.html,scriptindex.html: Fix empty titles. indexers: * omindex: + When indexing text files, handle UCS-2 and UTF-16 text files with a byte-order mark (BOM), and ignore any UTF-8 "byte-order" mark. + The built-in conversion code (used when iconv isn't available) now handles UCS-2/UTF-16 with and without a BOM, and also the explicit BE and LE forms. omega: * Overhaul the $highlight colour combinations since some were rather unreadable (Debian bug 484456). 
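The byte-order mark handling listed under omindex for 1.0.7 above can be illustrated with a small sketch. It is written in Python using the standard `codecs` constants and is not Omega's converter; the function name `decode_text_with_bom` and the `default_encoding` parameter are hypothetical, and UTF-32 BOMs are deliberately not considered.

```python
import codecs


def decode_text_with_bom(data, default_encoding="utf-8"):
    """Decode bytes, using a leading BOM to pick the encoding when present."""
    if data.startswith(codecs.BOM_UTF8):
        # A UTF-8 "BOM" carries no byte-order information, so just skip it.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        # FF FE: little-endian UTF-16 (or UCS-2) follows.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        # FE FF: big-endian UTF-16 (or UCS-2) follows.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: fall back to the caller's default (an assumption of this sketch).
    return data.decode(default_encoding)
```

For example, `decode_text_with_bom(b"\xff\xfeh\x00i\x00")` decodes the payload as little-endian UTF-16, while input starting with the UTF-8 mark simply has those three bytes dropped.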
build system: * configure: Synchronise code for working out warning flags used for builds with that used for xapian-core, which in particular handles different output formats from "gcc --version". portability: * configure: Fix header checks to pre-include <sys/types.h> which Mac OS X needs for some other headers to work. * configure: Fix probing for iconv to work better when iconv isn't found (previously this only worked on Mac OS X with fink). * Fix compilation error on FreeBSD, introduced in 1.0.5. * In omega, cast size to unsigned before division to avoid a warning about signed overflow. packaging: * xapian-omega.spec: Remove "www." from xapian.org and oligarchy.co.uk URLs. Omega 1.0.6 (2008-03-17): documentation: * docs/omegascript.html: Improve formatting. indexers: * omindex: + Add support for DjVu files. + If we get an error trying to read a directory entry, report it to the user rather than ignoring it. omega: * New OmegaScript commands $addfilter, $lower, $upper. portability: * Check "defined HAVE_SYSMP" rather than just "HAVE_SYSMP". This doesn't change behaviour, but fixes a compile warning on platforms other than Linux and IRIX. Omega 1.0.5 (2007-12-21): documentation: * Convert .txt docs to reStructuredText which we process to produce HTML. * Add a note inviting suggestions for additional reliable filter programs. * overview.html: omindex hasn't generated "W"-prefix terms since 0.9.7, so remove the documentation saying it does. indexers: * omindex: + If a file's extension isn't found in the mime_map and contains uppercase ASCII characters, check for the lower cased extension (so .PDF and .Pdf behave the same way as .pdf, unless you deliberately add different mappings for them). + '-f' is documented by --help as a short option for '--follow', but wasn't previously actually recognised.
+ Limit filter programs to 7/8 of free physical memory on platforms where we know how to determine this statistic (currently at least Linux, FreeBSD, IRIX, HP-UX; probably Solaris and a few others too). This helps to prevent runaway filters from causing a denial of service (bug#111). + Avoid rereading uncompressed AbiWord documents in order to calculate their MD5 checksums. * scriptindex: + Now inserts a ':' between prefix and term, using the same criteria which Xapian::QueryParser does. + The 'BOOLEAN' action now ignores an empty input rather than adding just the prefix as a term. + The 'UNIQUE' action now issues a warning for empty input but otherwise ignores it. portability: * Add explicit includes of C headers needed to build with the latest snapshots of GCC 4.3. Omega 1.0.4 (2007-10-30): omega: * If an OmegaScript template specifies the same field name as both a boolean and a probabilistic term prefix then previously the boolean setting would be ignored (e.g. $setmap{prefix,foo,A}$setmap{boolprefix,foo,H}). Now this generates an error. If you set prefixes in your templates, you may wish to check them over before upgrading. Omega 1.0.3 (2007-09-28): general: * Distribution tarballs are now in the POSIX "ustar" format since it saves a few KB and we need to use it for xapian-core anyway. documentation: * Expand the output of 'mbox2omega --help' and refer the reader to it from docs/scriptindex.txt. indexers: * omindex: + Add support for indexing AbiWord documents and TeX DVI files. + Impose a 5 minute CPU time limit on filter programs to prevent problems if a filter program goes into an infinite loop on a malformed input. Partly addresses bug#111. * scriptindex: + Fix line number tracking in dump files. omega: * Add $muldiv{A,B,C} which calculates int(A*B/C). * Fix bug in decimal fraction in $size for files >= 1M in size. templates: * query: + Set HTML charset to utf-8 since that's what databases now are by default.
+ Restyle to use CSS to draw a "score bar" instead of using images. + Rework the layout of each hit. + Add popup hints on mouse-over for various items. + Tidy up some HTML gremlins. Omega 1.0.2 (2007-07-05): documentation: * scriptindex.txt: Fix typo. indexers: * omindex: + If --url isn't passed, default to "/", but print a warning noting that this default has been used (at least for now). + Report files that aren't indexed because their extensions aren't recognised. build system: * Value of XAPIAN_CONFIG supplied to configure is now passed to distcheck, to ensure that it works with uninstalled copies of Xapian. portability: * Fix test programs to build with a development snapshot of GCC 4.3. Omega 1.0.1 (2007-06-11): documentation: * overview.txt: As of 1.0.0, we no longer use pstotext for PostScript, but instead use ps2pdf followed by pdftotext (since this works for Unicode). * scriptindex.txt: Document that you can delete a document by supplying a new document which only contains the unique term. indexers: * Fix bug in HTML parser - if the text between two tags consisted entirely of whitespace it would just be ignored which could run words together if the tags didn't produce implicit whitespace. This bug dates back to at least Omega 0.8.2. * omindex: Under Linux (and probably some other platforms) struct dirent can tell us the type of a directory entry for some filing systems, so make use of this to avoid calling stat() (or lstat()) unnecessarily - when indexing /usr/share/doc on my Linux box, this saves about 14000 explicit calls to stat() (leaving about 7000). omega: * Fix handling of query parsing errors (broken by changes in 1.0.0). packaging: * The required automake version has been lowered to 1.8.3, so RPMs can now be built on RHEL 4 and SLES 9. Omega 1.0.0 (2007-05-17): general: * Omega and the indexers now work in UTF-8. 
If iconv() is available, omindex will use it to convert documents from other formats, otherwise it has built-in support for UTF-8 and ISO-8859-1; omindex knows how to run the various external filter programs to generate UTF-8 output; scriptindex assumes input is already in UTF-8. * Change the project name (used to name tarballs, and default installation paths) to "xapian-omega" since that's what the RPMs and Debian packages already use (there's a Rogue-like game called Omega). documentation: * docs/overview.txt: Document what each of the OmegaScript templates does. * docs/quickstart.txt: Assorted minor improvements. * docs/termprefixes.txt: Document new 'Z' prefix, and that the 'R' and 'W' prefixes are no longer used by Xapian. * docs/cgiparams.txt: FMT isn't limited to just `a-z' - the actual restriction is that it may not contain `..'. * docs/scriptindex.txt: Explicitly note that index=nopos is deprecated (scriptindex already emits a warning). * NEWS: Add note that Omega < 0.8.0 NEWS entries are in the xapian-core NEWS file. * TODO: Updated. indexers: * Updated to use the new Xapian::TermGenerator class. This means that the indexing strategy has changed. * "--help" now reports the default stemming language (i.e. "english"). * Implement new sample generating function which normalises all runs of whitespace to a single space, and fixes invalid UTF-8 in the sample. * omindex: + We now index PostScript by converting to PDF with ps2pdf and then indexing that. This allows us to index PostScript files containing Unicode characters outside of ISO-8859-1, and also means we now get metadata from PostScript files. The downside is it is quite a bit slower. + Add support for indexing MS Works documents using wps2text (part of libwps). + Don't index empty files. * scriptindex: + Fix optimisation of "load truncate=N" to actually work! + The "truncate" action knows not to chop off a multibyte UTF-8 character. 
+ Update short option list for scriptindex to match documented usage (-h, -V and -s were not working). + Remove -q and -u options - they no longer do anything and are only accepted for compatibility with really old versions (0.6.1 and earlier for -q; 0.7.5 and earlier for -u). omega: * Add an alternative implementation of date range filtering which uses a MatchDecider. This allows everything that the existing implementation does, plus you can support sorting on a choice of dates (e.g. first published or last updated), and filtering works to a resolution of a minute rather than a day. Set CGI parameter DATEVALUE to enable this, and to specify the value to use. Since omindex now adds the last modified date as value 0, this will work with omindex. * Enhance $substr{} to accept a negative length (meaning to count back from the end of the string). * New CGI parameters to allow finer control of sorting and ranking - SORTAFTER and DOCIDORDER. * The sorting options are now encoded in $filters so Omega can automatically reset to page 1 if they are changed. * Add new OmegaScript $weight command which returns the raw document weight - mostly useful for debugging purposes. * $topterms{} now generates unstemmed terms. * $prettyterm{TERM} has been updated to fit with changes to the term generation strategy. * Add 'you' and 'your' as stopwords. * $filesize{SIZE} enhanced to return a decimal point for K, M, and G (e.g. "2.1K" and "4.0M" rather than "2K" and "4M"); $filesize{0} is now "0 bytes"; $filesize{1} is now "1 byte"; $filesize{SIZE} where SIZE is negative is now "". * Remove $freqs as it has been deprecated for ages. * Remove support for xB, xDATE1, xDATE2, xDAYSMINUS, and xDEFAULTOP which were deprecated in favour of xFILTER in 0.7.5 (over 3 years ago). 
* Remove deprecated aliases for CGI parameters (deprecated in 0.6.3 or 0.6.5, more than 3.5 years ago): RAW_SEARCH (now RAWSEARCH), DATE1 (now START), DATE2 (now END), DAYSMINUS (now SPAN but with slightly different semantics), and MIN_HITS (now MINHITS). * Remove "bias_weight" and "bias_halflife" CGI parameters since they rely on Enquire::set_bias() which has been removed. templates: * The 'query' template no longer uses $topterms by default. * New 'topterms' template provides a query template with $topterms support. * Template fragments which aren't intended for direct use have been moved to an "inc" subdirectory. testsuite: * md5test: Add tests for MD5 code. build system: * `./configure --enable-quiet' already allows you to specify at configure time to pass `--quiet' to libtool. Now you can override this at make-time by using `make QUIET=' (to turn off `--quiet') or `make QUIET=y' (to turn on `--quiet'). * configure: Disable probes for f77, gcj, and rc completely by preventing the probe code from even appearing in configure - this reduces the size of configure by 29% and should speed it up significantly. portability: * Fixed to build with GCC 4.3 snapshot. * We now make use of the safe*.h portability headers from xapian-core. * Ensure that the result of snprintf is zero terminated since MSVC's snprintf is broken (by design it seems). * configure: xapian-config --cxxflags now includes -ptused for SGI's C++ compiler, so we don't need to probe for it here. * configure: Perform a link test for posix_fadvise to fix misdetection on HP-UX. Omega 0.9.10 (2007-03-04): documentation: * docs/omegascript.txt: Rewrite introductory paragraph. Note that whitespace is significant, and add explicit warning to $setmap. * docs/termprefixes.txt: Expand section on boolean prefixes, showing how to generate them using scriptindex, and how to allow them to be selected in an HTML form. indexers: * omindex: Generate correct MD5 checksums on big-endian platforms. 
omega: * Fix $substr{} with negative start to actually work. * Fix $substr{} to never cause a C++ exception. packaging: * omega.spec.in: Remove "." from the end of the Summary. Omega 0.9.9 (2006-11-09): documentation: * Ship our custom INSTALL file rather than the generic one from autoconf which we've accidentally been shipping instead since 0.9.5. indexers: * scriptindex: The "date" action no longer modifies the value it operates on (it was never meant to!) omega: * Report an error if $setmap is called with an even number of parameters. An incorrect example in the documentation used to suggest this, so it's particularly useful to catch this case. packaging: * RPMs: Prevent binaries getting an rpath for /usr/lib64 on FC6. Omega 0.9.8 (2006-11-02): omega: * $substr where the start is negative and longer than the string (e.g. $substr{abcd,-5,1}) wasn't working as intended. build system: * configure: Tell AC_CHECK_HEADERS to suppress its backward compatibility mode, so it only checks headers with the compiler. This speeds up configure a little, and is what we do elsewhere. * configure: Warning flags for GCC weren't actually getting used. Fix this to work and use the same warning flags for GCC and Intel C++ as xapian-core does. Fix all the warnings this uncovered! * omega,omindex,scriptindex: Remove some old unused code. portability: * Ensure that we always pass an unsigned char value to isupper(), toupper(), etc as they are undefined on other values (glibc makes them work for signed char values too, but this is an extension). * configure: Pass magic options to SGI's C++ compiler to allow linking of templates to work. * configure: IRIX doesn't allow stdint.h to be included from C++ so we need a smarter configure test than AC_CHECK_HEADERS. * Fix warnings from SGI's C++ compiler. Omega 0.9.7 (2006-10-10): documentation: * omegascript.txt: Note that (by design) an omegascript template can't contain an infinite loop. 
* termprefixes.txt: "$setmap{title,S}" should be "$setmap{prefix,title,S}". * Use the default paths to the database directories and the omega CGI binary in examples. * README: Update reference to "CVS" to say "SVN". indexers: * Don't get confused by "a<b" in Javascript in a <script> tag. Fixes bug#91. * Support htdig's "ignore this bit" comments. * Don't generate terms with more than 3 trailing symbols ('-', '+', or '#'). * omindex: + Add the file last modified time as value #0. + Generate an MD5 checksum of each file indexed and store it in value #1 to allow duplicates to be collapsed. + Store the file's last modified time in the document data as "modtime" so it shows up in search results (and tweak the query template so the display of this information looks nicer). Don't add "modtime" field if the timestamp is (time_t)-1. + Run pdfinfo once and pull out the fields we want using string operations, instead of running it twice filtered through sed. + Parse the XML from OpenDocument and OpenOffice using new subclasses of HtmlParser. Only extract meta.xml once. + Add "size" field to document data. + Run xls2csv on MS Excel files, run catppt on MS Powerpoint files, and also index MS Word templates (.dot) the same way as .doc files. + Don't generate 'W' terms since omega doesn't use them. + If a filter program isn't installed, then don't try it again for the same extension (not perfect but an improvement - previously we indexed an empty document!) + If popen() fails, treat it as a read error. * scriptindex: + Add new "load" action to allow the contents of an external file to be loaded and parsed. + Fix check for whether a record has content in the case where the same field is processed more than once. omega: * Add $pack and $unpack OmegaScript commands to allow big endian binary values to be encoded and decoded (for use with omindex's lastmod in value #1). 
* omega.conf: Fix code which reads omega.conf to be line based as documented rather than the wacky whitespace based scheme that was actually implemented. Also we now allow "#" comments and blank lines in omega.conf. * Fix $highlight{} to work with capitalised words (it used to work but regressed in 0.8.2). * Use '\t' to separate terms in xP since filter terms might contain '.'. Fixes bug#87. testsuite: * Add htmlparsetest which tests the MyHtmlParser class. build system: * Makefile.am: Make use of the dist_ prefix to avoid having to list files in EXTRA_DIST as well as in *_SCRIPTS, *_DATA, and man_MANS. * Makefile.am: Prefer $(sysconfdir) to @sysconfdir@ since the former can be overridden on the "make" command line. portability: * xapian-config will now switch Sun's C++ compiler into ANSI C++ compliant mode, so remove all the special case bits of code added for just this one compiler. * omindex: Fix escaping of filenames to cast characters to "unsigned char" so that isalnum() works correctly everywhere. Not a security hole as dangerous characters were still being escaped. * Call pclose() not fclose() on a FILE* obtained from popen(). This bug could cause us to run out of file descriptors on some platforms. * configure: Check for strftime. packaging: * omega.spec.in: Include documentation in the RPM package. Omega 0.9.6 (2006-05-15): documentation: * docs/omegascript.txt: Clarified description of $now. indexers: * scriptindex: Fix "index" and "indexnopos" without a prefix to set the weight correctly (bug introduced in 0.9.5). omega: * Added new OmegaScript commands $filterterms and $substr. portability: * configure: Update snprintf detection to match xapian-core. * Fix MSVC warnings. packaging: * omega.spec.in: Create and package /var/lib/omega/cdb and /var/log/omega. Omega 0.9.5 (2006-04-08): documentation: * README: Add pointer to documentation. * Added man pages for omindex and scriptindex, generated using help2man. 
indexers: * scriptindex: + If we fail to open the index script, die with an error (previously we acted as if an empty file was specified). + Warn about a useless "weight" action, even if it's followed by another non-useless action (e.g. "field") - previously we only warned if it was last or followed only by other useless actions. + Warn if "unique=<prefix>" is used without a corresponding "boolean=<prefix>" on the same line. + Warn that "index=nopos" is deprecated and should be replaced by "indexnopos". + Add explanatory text "(note that actions are executed from left to right)" when reporting useless actions. + Added new "hash" command to allow hashed terms to be generated from long URLs like omindex does. * htdig2omega.script,mbox2omega.script: Make use of the new scriptindex "hash" command. * dbi2omega: Check DBIDRIVER environmental variable to allow a driver other than mysql to be specified without modifying the script. omega: * Fix $opt[fieldnames] handling. Previously it would try to kick in if you didn't set fieldnames but set any alphabetically later option! The symptom was that $field{} would stop working (bug#72). portability: * omindex,omega: Tweaks for MSVC compilation. Omega 0.9.4 (2006-02-21): documentation: * COPYING: Updated FSF address. Omega 0.9.3 (2006-02-16): documentation: * overview.txt: The U prefix (URL term) was grouped with the date searching prefixes, but it makes more sense to group it with the prefixes relating to parts of the URL (H for hostname, P for path, etc). * overview.txt: Add pointer to documentation of the supported query syntax. * omegascript.txt: Improve descriptions of $cgi, $collapsed, $value, $version. * termprefixes.txt: Fix typo. indexers: * omindex: add --preserve-nonduplicates / -p option to not delete any documents that aren't updated, in replace duplicates mode (so that multiple runs of omindex on different subsites don't stomp on each other). 
* omindex,scriptindex: Add "--stemmer" option to omindex and scriptindex to allow the stemming language to be set. Fixes bug#11. * omindex,scriptindex: More consistent --help and --version output. * omindex: Add support for OpenDocument format mimetypes and extensions out of the box. Previously you could index them but had to pass a "-m" option for each OpenDocument filename extension you wanted to handle. * scriptindex: The "-q" option no longer actually controls anything. Just ignore it for backwards compatibility (and don't document it in --help). omega: * If executing an OmegaScript command causes a Xapian exception to be thrown, catch it and copy the error message into error_msg (which is read by the $error command). This allows such errors to be reported in a nicer way. * Added "SORTREVERSE" CGI parameter which allows the sort order to be reversed when sorting on a value. Removed "SORTBANDS" CGI parameter since it no longer does anything. * Added $find{LIST,STRING} to return the subscript of the first occurrence of string STRING in list LIST. * Added $lookup{CDBFILE,KEY} OmegaScript command to perform a lookup in a CDB file. * Added new feature which allows you to avoid storing fieldnames in every document. Instead you just store the field values, one per line, and add something like "$set{fieldnames,$split{caption sample url}}" to the OmegaScript template to specify the fieldnames to use. This can save a lot of disk space for a large database. * Add new "$split{}" OmegaScript command which splits a string to give an OmegaScript list. * Fix $url{} to escape "+" to "%2b". Also fix encoding of top-bit-set characters on platforms where char is signed by default. * Speed up $highlight{} - only compare terms which are the same length. * Reduce memory usage if a lot of documents are marked as relevant. templates: * query: Make the page title shorter so there's more chance it will fit on icon bars, etc. * opensearch: Add missing escaping.
* godmode: If a non-existent docid is specified, report the error and prompt the user to enter another docid. Fixes bug#60. portability: * omega: Fix printf type mismatch on 64 bit platforms. * omega: Cast time_t to unsigned long to avoid problems on 64bit platforms. * Use snprintf where available. * Write top-bit set characters using \xXX notation to avoid warnings from Intel's C++ compiler. Omega 0.9.2 (2005-07-15): * omega: Changed $highlight so if OPEN and CLOSE aren't specified, they default to highlighting each word from the query with a different background colour like gmane does (previous default was to use '<strong>' and '</strong>'). * omega: Call QueryParser::set_database() as this is now used to decide what to do for terms like "C#". * omega: Added the ability to set boolean prefixes for the QueryParser by setting a "boolprefix" map in the omegascript template. * omega: Added $length{} and $stoplist{} commands to OmegaScript. * scriptindex: Fix infinite loop if there's no newline at the end of a dumpfile. * docs/termprefixes.txt: Explain how to use termprefixes with scriptindex and omega, since that's what most people will want to know. * docs/omegascript.txt: Use standard "S" prefix for title in example for $setmap, rather than "XT". Omega 0.9.1 (2005-06-06): * Releases are now created using libtool 1.5.18 and automake 1.9.5. * Updated RPM packaging. Omega 0.9.0 (2005-05-13): * Updated for 0.9.0 API changes. * omindex/scriptindex: Generate terms like "c#". * Added mbox2omega script which allows a mail folder to be indexed using scriptindex. Mostly it's an example as there's no mechanism included to show the full original message. omega: * The configuration file is now looked for differently - you can now set the environmental variable OMEGA_CONFIG_FILE. See docs/overview.txt for details. * $highlight can now highlight terms like "C#". * Add new template 'opensearch' to implement basic opensearch feeds of search results. 
omindex: * URL hashing previously depended on sizeof(long) so databases weren't totally portable between platforms. This is now fixed, but to do so we've had to break compatibility with databases built on platforms with 64 bit longs with URLs > 228 bytes. * Removed useless "DUPE_duplicate" option. * Added support for indexing Perl "pod" documentation using pod2text. * Replaced -l/--no-recurse with -l/--depth-limit which takes an argument allowing recursion to be restricted to any depth, not just 0 or infinity! * Extend -M/--mime-type to allow an existing mapping to be removed by omitting the type. * Fixed code so that we get lstat() prototype on Linux systems where we have posix_fadvise(). scriptindex: * Improved handling of extra blank lines in dump file. * Strip multiple \r characters from end of line. * Complain if a dump file doesn't appear to have been = escaped correctly. * Flush database after each input file to ensure all changes from a file make it in. documentation: * docs/omegascript.txt: Clarify $field description slightly. * docs/cgiparams.txt,docs/omegascript.txt: Fixed 3 references to OmXxxx classes. * docs/termprefixes.txt: Added a single document covering all aspects of term prefixes. * docs/omegascript.txt: Moved $collapsed into correct place alphabetically! * docs/cgiparams.txt,docs/overview.txt: Improved description of how B filters are handled when building the query. * docs/scriptindex.txt: Note that actions are applied in the specified order. Omega 0.8.5 (2004-12-23): * README,INSTALL: Proper installation instructions. * omega: If an exception is thrown, make sure that the HTTP headers get written so that we don't cause "500 Internal Server Error". This problem was introduced by the change to allow a user specified Content-Type in 0.8.0. Partly addresses bug#60. * scriptindex: Fixed "Unknown Exception" when trying to "unhtml" text which contains "</body>" (bug#61). This bug was introduced in 0.8.4.
* omindex/scriptindex: <h1> - <h6> and </h1> - </h6> now leave a space in the dumped HTML. This bug was introduced in 0.8.4 - before that any tag left a space in the dumped HTML. * omindex: Only try to delete removed documents in "replace duplicates" mode (which is the default). * omindex: Change behaviour of crawler such that it doesn't follow symbolic links any more. The new "--follow" command line option turns following of symlinks back on. * dbi2omega: Add a comment to the start of the file detailing what dbi2omega does. Omega 0.8.4 (2004-12-08): * omindex,scriptindex: Improved HTML to text conversion - now we strip leading and trailing whitespace and convert all other consecutive groups of whitespace to a single space. Also the parser now knows that some tags should be regarded as word breaks and some shouldn't (previously all tags were treated as word breaks). * omindex: Removed bogus extra line from code which was meant to truncate samples, titles, etc at a word boundary, but has never actually worked! * omindex: Added hooks for indexing the following formats: OpenOffice (requires unzip), MS Word (requires antiword), Wordperfect (requires wpd2text), RTF (requires unrtf). * omindex: If a filename to be passed to a filter program has a leading "-", protect it from possible interpretation as an option by prepending "./". * omega: When there's only a boolean query we promote it to be the query. Tweaked so we use boolean weights in this case. * omega: Use Query::empty() instead of the now deprecated Query::is_empty(). * omega,omindex,scriptindex: Use the new Database/WritableDatabase constructors. * templates/godmode: Finished off godmode template. * Compile everything as C++. * Check snprintf actually works - some older versions don't implement C90 snprintf semantics. * XAPIAN_FLAGS already links with xapianqueryparser so remove -lxapianqueryparser from omega_LDADD as it was causing link errors on cygwin. 
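The two text-handling rules described for 0.8.4 (whitespace normalisation during HTML to text conversion, and protecting filenames with a leading "-") can be sketched as follows. This is only an illustrative Python sketch, not Omega's actual C++ code; the function names normalise_whitespace and protect_filename are hypothetical, and Python's notion of whitespace (which includes some non-ASCII characters) is assumed to be close enough for illustration.

```python
def normalise_whitespace(text: str) -> str:
    """Strip leading and trailing whitespace and collapse every other
    run of consecutive whitespace to a single space.

    Hypothetical helper: str.split() with no argument splits on runs of
    whitespace and discards empty leading/trailing pieces, so joining
    the pieces with one space gives the behaviour described above.
    """
    return " ".join(text.split())


def protect_filename(filename: str) -> str:
    """Prepend "./" to a filename with a leading "-" so that a filter
    program can't interpret it as a command line option.

    Hypothetical helper: other filenames are returned unchanged.
    """
    if filename.startswith("-"):
        return "./" + filename
    return filename


if __name__ == "__main__":
    print(normalise_whitespace("  Hello \n\t world  "))  # prints "Hello world"
    print(protect_filename("-draft.doc"))                 # prints "./-draft.doc"
    print(protect_filename("report.doc"))                 # prints "report.doc"
```

Prepending "./" keeps the path pointing at the same file in the current directory while ensuring the argument no longer starts with "-".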
Omega 0.8.3 (2004-09-20): * scriptindex: --version now actually reports the version. --help now exits with status 0 rather than status 1. * RPM packaging: Updated. The most notable change is that the RPM is now called xapian-omega because there's already an omega RPM (in Fedora Core at least) which is a game. Also htdig2omega and htdig2omega.script are now included in the RPM. * Install htdig2omega.script in ${prefix}/share/omega/ rather than ${prefix}/share/. Omega 0.8.2 (2004-09-13): * omega: $highlight now handles accented characters (bug#9). * omega: Use new checkatleast parameter to Enquire::get_mset to implement MINHITS. * omindex: When running with "replace duplicates" mode (the default), detect documents removed since the last indexing run and delete them from the database (bug #34). * omindex: Use the new WritableDatabase::replace_document(term, doc) method. * scriptindex: Report index script file name and line number when reporting errors in it. Added warning for redundant actions, such as "truncate" as the last action in a rule. * templates/query: Always report if the database is not found - previously we only did so if there was a query. * templates/query: Fixed missing </center> tag which happened in certain cases. * docs/omegascript.txt: Added note that $add{$hit,1} gives the "hit number". * Now includes htdig2omega and htdig2omega.script which allow you to crawl remote websites with ht://dig, then build a searchable index of them with Xapian and Omega. * Link with -lxapianqueryparser, not -lomqueryparser. Omega 0.8.1 (2004-06-30): * omindex: Renamed hash() to hash_string() to avoid colliding with something on IRIX. * omega: Changed MORELIKE to pick up to 40 terms, rather than up to 6 (feedback on the mailing list suggests this gives much better results). * scriptindex: Added explicit catch for std::bad_alloc.
Omega 0.8.0 (2004-04-19): * scriptindex: Change default to *not* overwriting the database (use --overwrite if you really want to do this); -u is now accepted but ignored. * scriptindex: Use getopt for option parsing. * omindex: Added --overwrite option which forces an existing database to be deleted before indexing begins. * templates/xml: Correct spelling of `relavence' to `relevance'. NB: if you're parsing the XML output, you'll need to fix this spelling in your parser! * templates/xml: Now set HTTP header: "Content-Type: application/xml". * templates/xml: Remove unused OmegaScript code: `$set{topterms,$or{$ne{$msize,0},$query}}'. * indextext.cc,omindex.cc,scriptindex.cc: Updated to use add_term() instead of add_term_nopos(). * omega: Added $httpheader OmegaScript command to allow arbitrary HTTP headers and alternative Content-Type headers to be specified. * omega: If the probabilistic query was bad, don't try to run the match. * omega: Don't crash if there's a date filter but no probabilistic query. * omindex/scriptindex: Raw terms with a multicharacter prefix are now indexed with a : inserted (e.g. as XFOO:Rterm). This matches what the query parser does. * omindex/scriptindex: Don't create R terms for terms which start with a digit. * omindex: Use O_STREAMING and/or posix_fadvise() when reading files to be indexed (if available). This helps to keep the Xapian database in cache, and should greatly improve indexing throughput. * docs/scriptindex.txt: Make more explicit that boolean produces a *single* boolean term. * docs/cgiparams.txt: Note that START and END should be in the format YYYYMMDD. For NEWS entries for Omega versions prior to 0.8.0, see the xapian-core NEWS file.