Add support for a new format to Omega

We can add support for a new file format to Omega through an external filter or a library. For this, we must follow a series of steps.

First of all, we need a MIME type for the new file format. Omega uses MIME types to identify the format of a file and to decide how to handle it. The official registry is at http://www.iana.org/assignments/media-types/ but not all file types have an official MIME type. In that case, a de facto standard "x-" prefixed MIME type often exists. A good way to look for one is to ask the file utility to identify a sample file (Omega uses the same library as file to identify files when it does not recognise the extension):

file --mime-type example.fb2

which responds:

example.fb2: text/xml

Sometimes file just returns a generic answer (most commonly text/plain or application/octet-stream) and occasionally it misidentifies a file. In either case, we can associate the file extension with a particular MIME type in 'mimemap.tokens'. If multiple extensions are used for a format (such as htm and html for HTML) then add an entry for each.

When indexing a filename which has an extension in upper-case or mixed-case, omindex first checks for an exact match for the extension. If none is found, it converts the extension to lower-case and tries again. So just add the extension in lower-case unless different cases actually have different meanings.
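The two-step lookup described above can be sketched as follows. This is only an illustration of the exact-match-then-lower-case rule, not Omega's actual code: the function name lookup_mime_type and the use of a std::map are assumptions made for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper (not Omega's real implementation): look up the MIME
// type for a file extension, trying an exact match first and then retrying
// with the extension converted to lower-case.
// Returns an empty string when neither lookup succeeds.
std::string lookup_mime_type(const std::map<std::string, std::string>& mime_map,
                             const std::string& ext)
{
    // 1. Exact match, so that case-sensitive entries take priority.
    auto it = mime_map.find(ext);
    if (it != mime_map.end())
        return it->second;

    // 2. Fall back to the lower-cased extension.
    std::string lower = ext;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    it = mime_map.find(lower);
    if (it != mime_map.end())
        return it->second;

    return std::string();
}
```

With a map that only contains an entry for fb2, a lookup for FB2 falls through to the second step and still finds it, which is why a single lower-case entry is normally enough.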

In this example, text/xml is too broad, so we can associate fb2 with application/x-fictionbook+xml, which is much more specific.

Extracted data variables

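In order to add a new filter and index a document, you will need to fill some C++ variables in index_file.cc. The example code in the "Using an external filter" section sets dump, title, author, keywords, topic and sample.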

It is not necessary to fill all the variables, but try to fill as many as you can.

Using an external filter

To add a new filter to Omega we have to follow a series of steps:

  1. The first job is to find a good external filter. Some formats have several filters to choose from. The attributes which interest us are reliably extracting the text with word breaks in the right places, and supporting Unicode (ideally as UTF-8). If you have several choices, try them on some sample files.

    The ideal (and simplest) case is that you have a filter which can produce plain text output in UTF-8. It may require special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.

    It is most efficient if the filter program can write to stdout, but output to a temporary file works too.

    For example, if we want to use python2text for handling text/x-python, we should use python2text --utf8 --stdout.

  2. Then, we need to add the filter to Omega. Omega lets you specify additional external filters on the command line using --filter=M[,[T][,C]]:CMD, which processes files with MIME Content-Type M through command CMD and expects output (on stdout or in a temporary file) with format T (Content-Type or file extension; currently txt (default), html or svg) in character encoding C (default: UTF-8). For example:

    --filter=text/x-foo,text/html,utf-16:'foo2utf16 --content %f %t'
    

    In this case, text/x-foo files are handled by foo2utf16, which writes HTML encoded as UTF-16 to a temporary file. Note that %f will be replaced with the filename and %t with a temporary output file (omindex creates it at runtime, with an extension which reflects the expected output format). This tells omindex to index files with content-type text/x-foo by running

    foo2utf16 --content path/to/file path/to/temporary/file.html
    

    If you don't include %f, then the filename of the file to be extracted will be appended to the command, separated by a space. If you don't use %t, then omindex will expect output on stdout. Use %% should you need a literal % in the command.

    If you specify false as the command in --filter, omindex will skip files with the specified MIME type. If you specify true as the command in --filter, omindex won't try to extract text from the file, but will index it such that it can be searched for via metadata which comes from the filing system (filename, extension, mime content-type, last modified time, size).

    If we want to add the filter permanently, we can add a new entry to index_add_default_filters in 'index_file.cc'. Continuing the example:

    index_command("text/x-foo", Filter("foo2utf16 --content %f %t", "text/html", "utf-16"));
    

    There are more options that we can use for Filter (see 'index_file.h').

  3. In some cases, you will have to run several programs for each file or do some extra work. You will then either need to put together a script which fits what omindex supports, or else modify the source code in index_file.cc by adding a test for the new MIME type to the long if/else-if chain inside the index_mimetype function. New formats should generally go at the end, unless they are very common.

    } else if (mimetype == "text/x-foo") {
    

    The filename of the file is in file. The code you add should set the variables described in the Extracted data variables section above.

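    // Get the path of a temporary file for the filter's HTML output.
    string tmpfile = get_tmpfile("tmp.html");
    if (tmpfile.empty())
      return;
    // Build the command line: foo2utf16 --content <input> <output>
    string cmd = "foo2utf16 --content";
    append_filename_argument(cmd, file);
    append_filename_argument(cmd, tmpfile);
    MyHtmlParser p;
    try {
      // The filter writes to tmpfile, so anything on stdout is discarded.
      (void)stdout_to_string(cmd);
      dump = file_to_string(tmpfile);
      // Parse the HTML output, which is encoded as UTF-16.
      p.parse_html(dump, "UTF-16", false);
      unlink(tmpfile.c_str());
    } catch (const ReadError&) {
      // The command failed: report it, clean up and skip this file.
      skip_cmd_failed(urlterm, context, cmd, d.get_size(), d.get_mtime());
      unlink(tmpfile.c_str());
      return;
    } catch (...) {
      // Always remove the temporary file before propagating other errors.
      unlink(tmpfile.c_str());
      throw;
    }
    // Copy the parsed text and metadata into the extracted data variables.
    dump = p.dump;
    title = p.title;
    author = p.author;
    keywords = p.keywords;
    topic = p.topic;
    sample = p.sample;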
    

    The stdout_to_string function runs a command and captures its output as a C++ std::string. If the command is not installed on PATH, omindex detects this automatically and disables support for the mimetype in the current run, so it will only try the first file of each such type.

    If UTF-8 output is not supported, pick the best (or only!) supported encoding and then convert the output to UTF-8. To do this, once you have dump, convert it like so (replacing "UTF-16" with the character set which is produced):

    convert_to_utf8(dump, "UTF-16");
    

    In the HTML example above no explicit conversion is needed, since MyHtmlParser will convert the text of the file to UTF-8 if necessary.
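The placeholder rules from step 2 (%f, %t, %% and the appended filename when %f is absent) can be sketched as a small function. This is a minimal illustration, not omindex's implementation: the name expand_filter_command is hypothetical, unknown % codes are simply passed through (an assumption of this sketch), and filenames are inserted verbatim, whereas real code must quote them for the shell as append_filename_argument does in the example above.

```cpp
#include <string>

// Hypothetical helper (not omindex's actual code): expand the placeholders
// in a filter command template.
//   %f -> input filename
//   %t -> temporary output filename
//   %% -> literal %
// If the template contains no %f, the input filename is appended after a
// space. Filenames are NOT shell-quoted here, to keep the sketch short.
std::string expand_filter_command(const std::string& tmpl,
                                  const std::string& file,
                                  const std::string& tmpfile)
{
    std::string cmd;
    bool seen_f = false;
    for (std::string::size_type i = 0; i < tmpl.size(); ++i) {
        // Copy ordinary characters (and a trailing lone %) unchanged.
        if (tmpl[i] != '%' || i + 1 == tmpl.size()) {
            cmd += tmpl[i];
            continue;
        }
        const char code = tmpl[++i];
        switch (code) {
        case 'f':
            cmd += file;
            seen_f = true;
            break;
        case 't':
            cmd += tmpfile;
            break;
        case '%':
            cmd += '%';
            break;
        default:
            // Unknown code: keep it as written (assumption of this sketch).
            cmd += '%';
            cmd += code;
            break;
        }
    }
    if (!seen_f) {
        cmd += ' ';
        cmd += file;
    }
    return cmd;
}
```

Called with the template foo2utf16 --content %f %t, the input path/to/file and the temporary file path/to/temporary/file.html, this produces the command line shown in step 2. A template without %f, such as python2text --utf8 --stdout, simply gets the filename appended.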

If you find a reliable external filter or library and think it might be useful to other people, please let us know about it.

Submitting a patch

Once you are happy with how your handler/filter works, please submit a patch so we can include it in future releases (creating a new trac ticket and attaching the patch is best). Before doing so, please also update docs/overview.rst so that it covers the new format.

It would be really useful if you could supply some sample files with a licence which allows redistribution, so we can test the filters on them. Ideally these should contain non-ASCII characters so that we know Unicode support works.

You can read more about how to contribute to Xapian.

If you have problems you can ask for help on the IRC channel or mailing list.