Recoll has an Application Programming Interface, usable both for indexing and searching, currently accessible from the Python language.
Another less radical way to extend the application is to write filters for new types of documents.
The processing of metadata attributes for documents (fields) is highly configurable.
Recoll filters cooperate to translate from the multitude of input document formats, simple ones such as OpenDocument or Acrobat, or compound ones such as Zip or email, into the final Recoll indexing input format, which may be text/plain or text/html. Most filters are executable programs or scripts. A few filters are coded in C++ and live inside recollindex. This latter kind will not be described here.
There are currently (as of Recoll 1.18, and since 1.13) two kinds of external executable filters:
Simple filters (exec filters) run once and exit. They can be bare programs like antiword, or scripts using other programs. They are very simple to write, because they just need to print the converted document to the standard output. Their output can be text/plain or text/html.
Multiple filters (execm filters) run as long as their master process (recollindex) is active. They can process multiple files (sparing the process startup time, which can be very significant), or multiple documents per file (e.g. for zip or chm files). They communicate with the indexer through a simple protocol, but are nevertheless a bit more complicated than the older kind. Most new filters are written in Python, using a common module to handle the protocol. There is one exception, rclimg, which is written in Perl. The subdocuments output by these filters can be directly indexable (text or HTML), or they can be other simple or compound documents that will need to be processed by another filter.
In both cases, filters deal with regular file system files, and can process either a single document or a linear list of documents in each file. Recoll is responsible for performing up-to-date checks, dealing with more complex embedding, and other upper-level issues.
In the extreme case of a simple filter returning a document in text/plain format, no metadata can be transferred from the filter to the indexer. Generic metadata, like document size or modification date, will be gathered and stored by the indexer.
Filters that produce text/html format can return an arbitrary amount of metadata inside HTML meta tags. These will be processed according to the directives found in the fields configuration file.
The filters that can handle multiple documents per file return a single piece of data to identify each document inside the file. This piece of data, called an ipath element, will be sent back by Recoll to extract the document at query time, for previewing, or for creating a temporary file to be opened by a viewer.
The following section describes the simple filters, and the next one gives a few explanations about the execm ones. You could conceivably write a simple filter with only the elements in this manual. This will not be the case for the execm ones, for which you will have to look at the code.
Recoll simple filters are usually shell-scripts, but this is in no way necessary. Extracting the text from the native format is the difficult part. Outputting the format expected by Recoll is trivial. Happily enough, most document formats have translators or text extractors which can be called from the filter. In some cases the output of the translating program is completely appropriate, and no intermediate shell-script is needed.
Filters are called with a single argument which is the source file name. They should output the result to stdout.
When writing a filter, you should decide if it will output plain text or HTML. Plain text is simpler, but you will not be able to add metadata or vary the output character encoding (this will be defined in a configuration file). Additionally, some formatting may be easier to preserve when previewing HTML. Actually, the deciding factor is metadata: Recoll has a way to extract metadata from the HTML header and use it for field searches.
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the filter if the operation is for indexing or previewing. Some filters use this to output a slightly different format, for example stripping uninteresting repeated keywords (i.e. Subject: for email) when indexing. This is not essential.
You should look at one of the simple filters, for example rclps for a starting point.
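As a purely illustrative sketch (the rclsample name is invented, and a real filter would run a format-specific extractor rather than just copying the input), a minimal simple filter written in Python could look like this:

#!/usr/bin/env python3
# Hypothetical minimal simple filter (illustration only). A real filter
# would run a format-specific text extractor instead of copying the input.
import sys

def main():
    if len(sys.argv) != 2:
        print("usage: rclsample <filename>", file=sys.stderr)
        return 1
    # Recoll calls the filter with the source file name as single argument.
    with open(sys.argv[1], "rb") as f:
        data = f.read()
    # Print the "converted" document to stdout as text/plain. The character
    # set is declared in the mimeconf entry, not here.
    sys.stdout.buffer.write(data)
    return 0

if __name__ == "__main__":
    sys.exit(main())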
Don't forget to make your filter executable before testing!
If you can program and want to write an execm filter, it should not be too difficult to make sense of one of the existing modules. For example, look at rclzip, which uses Zip file paths as identifiers (ipath), and rclics, which uses an integer index. Also have a look at the comments inside the internfile/mh_execm.h file and possibly at the corresponding module.
execm filters sometimes need to make a choice for the nature of the ipath elements that they use in communication with the indexer. Here are a few guidelines:
Use ASCII or UTF-8 (if the identifier is an integer print it, for example, like printf %d would do).
If at all possible, the data should make some kind of sense when printed to a log file to help with debugging.
Recoll uses a colon (:) as a separator to store a complex path internally (for deeper embedding). Colons inside the ipath elements output by a filter will be escaped, but would be a bad choice as a filter-specific separator (mostly, again, for debugging issues).
In any case, the main goal is that it should be easy for the filter to extract the target document, given the file name and the ipath element.
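As an illustration of this last point (this is a standalone sketch, not the actual rclzip code), a zip-style handler can simply use the archive member name as the ipath element, which makes query-time extraction a direct lookup:

# Illustration only: the archive member name doubles as the ipath element.
import sys
from zipfile import ZipFile

def list_ipaths(filename):
    # At indexing time, each member name is emitted as the ipath of the
    # corresponding subdocument.
    with ZipFile(filename) as zf:
        return zf.namelist()

def extract_one(filename, ipath):
    # At query time, Recoll sends the same ipath back, and the document
    # is recovered with a single lookup.
    with ZipFile(filename) as zf:
        return zf.read(ipath)

if __name__ == "__main__":
    for ip in list_ipaths(sys.argv[1]):
        print(ip)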
execm filters will also produce a document with a null ipath element. Depending on the type of document, this may have some associated data (e.g. the body of an email message), or none (typical for an archive file). If it is empty, this document will be useful anyway for some operations, as the parent of the actual data documents.
There are two elements that link a file to the filter which should process it: the association of file to mime type and the association of a mime type with a filter.
The association of files to mime types is mostly based on name suffixes. The types are defined inside the mimemap file. Example:
.doc = application/msword
If no suffix association is found for the file name, Recoll will try to execute the file -i command to determine a mime type.
The association of file types to filters is performed in the mimeconf file. A sample will probably be of better help than a long explanation:
[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
      mimetype = text/plain ; charset=utf-8
application/ogg = exec rclogg
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
application/x-chm = execm rclchm
The fragment specifies that:
application/msword files are processed by executing the antiword program, which outputs text/plain encoded in utf-8.
application/ogg files are processed by the rclogg script, with default output type (text/html, with encoding specified in the header, or utf-8 by default).
text/rtf is processed by unrtf, which outputs text/html. The iso-8859-1 encoding is specified because it is not the utf-8 default, and is not output by unrtf in the HTML header section.
application/x-chm is processed by a persistent filter. This is determined by the execm keyword.
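Putting the two configuration files together: hooking up a hypothetical new simple filter (the rclsample name, the .xyz suffix and the mime type below are invented for illustration) would need one line in mimemap and one in the [index] section of mimeconf:

# In mimemap:
.xyz = application/x-xyz

# In mimeconf, [index] section:
application/x-xyz = exec rclsample; mimetype = text/plain; charset=utf-8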
The output HTML could be very minimal like the following example:
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
  </head>
  <body>
    Some text content
  </body>
</html>
You should take care to escape some characters inside the text by transforming them into appropriate entities. At the very minimum, "&" should be transformed into "&amp;", and "<" should be transformed into "&lt;". This is not always properly done by translating programs which output HTML, and of course never by those which output plain text.
When encapsulating plain text in an HTML body, the display of a preview may be improved by enclosing the text inside <pre> tags.
The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.
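As a sketch of how a Python filter might handle the escaping, the <pre> wrapping and the charset declaration at once (the wrap_as_html function name is invented for illustration):

import html

def wrap_as_html(text, charset="UTF-8"):
    # Escape '&', '<' and '>' so the text cannot be mistaken for markup,
    # then enclose it in <pre> tags to preserve line breaks in the preview.
    # The charset declared here must match the actual encoding of the text.
    return ('<html><head>\n'
            '<meta http-equiv="Content-Type" content="text/html;charset=%s">\n'
            '</head><body><pre>\n'
            '%s\n'
            '</pre></body></html>\n' % (charset, html.escape(text)))

print(wrap_as_html("AT&T <probably> works"))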
Recoll will process meta tags inside the header as possible document field candidates. Document fields can be processed by the indexer in different ways, for searching or displaying inside query results. This is described in a following section.
By default, the indexer will process the standard header fields if they are present: title, meta/description, and meta/keywords are both indexed and stored for query-time display.
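For example, a filter could output a header like the following (the values are invented for illustration); the title, description and keywords would then be indexed and stored without any extra configuration:

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    <title>Sample document title</title>
    <meta name="description" content="A short abstract of the document">
    <meta name="keywords" content="sample, illustration, recoll">
  </head>
  ...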
A predefined non-standard meta tag will also be processed by Recoll without further configuration: if a date tag is present and has the right format, it will be used as the document date (for display and sorting), in preference to the file modification date. The date format should be as follows:
<meta name="date" content="YYYY-mm-dd HH:MM:SS"> or <meta name="date" content="YYYY-mm-ddTHH:MM:SS">
Example:
<meta name="date" content="2013-02-24 17:50:00">
Filters also have the possibility to "invent" field names. These should also be output as meta tags:
<meta name="somefield" content="Some textual data" />
You can embed HTML markup inside the content of custom fields, for improving the display inside result lists. In this case, add a (wildly non-standard) markup attribute to tell Recoll that the value is HTML and should not be escaped for display.
<meta name="somefield" markup="html" content="Some <i>textual</i> data" />
As written above, the processing of fields is described in a further section.