4.3. API

4.3.1. Interface elements

A few elements in the interface are specific and need an explanation.

udi

A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, its length is restricted (to 200 bytes), which is why a regular URI cannot be used. The structure and contents of the udi are defined by the application and are opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to the maximum length, with the suppressed part replaced by a hash value.
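
The truncate-and-hash idea can be illustrated with a minimal sketch. The function name make_udi, the "|" separator and the use of SHA-1 are assumptions made for this example only; this is not the algorithm actually used by the Recoll file system indexer.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text above


def make_udi(path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a udi (as bytes) from a file path and an internal path.

    Hypothetical scheme, for illustration only.
    """
    raw = (path + "|" + ipath).encode("utf-8")
    if len(raw) <= max_bytes:
        return raw
    # Replace the suppressed tail with a hash of the complete value, so that
    # two long paths sharing the same prefix still get distinct udis.
    digest = hashlib.sha1(raw).hexdigest().encode("ascii")  # 40 bytes
    return raw[: max_bytes - len(digest)] + digest
```

Any scheme works as long as the application always derives the same udi for the same document, since the index engine treats the value as opaque.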

ipath

This data value (set as a field in the Doc object) is stored, along with the URL, but not indexed by Recoll. Its contents are not interpreted, and its use is up to the application. For example, the Recoll internal file system indexer stores the part of the document access path internal to the container file (ipath in this case is a list of subdocument sequential numbers). url and ipath are returned in every search result and permit access to the original document.

Stored and indexed fields

The fields file inside the Recoll configuration defines which document fields are either "indexed" (searchable), "stored" (retrievable with search results), or both.

Data for an external indexer should be stored in a separate index, not the one used by the Recoll internal file system indexer (except if the latter is not used at all). The reason is that the main document indexer's purge pass would remove all of the other indexer's documents, as they were not seen during indexing. Conversely, the main indexer's documents would probably be a problem for the external indexer's purge operation.

4.3.2. Python interface

4.3.2.1. Introduction

Recoll versions after 1.11 define a Python programming interface, both for searching and indexing. The indexing portion has seen little use, but the searching one is used in the Recoll Ubuntu Unity Lens and Recoll Web UI.

The API is inspired by the Python database API specification. There were two major changes in recent Recoll versions:

  • The basis for the Recoll API changed from Python database API version 1.0 (Recoll versions up to 1.18.1), to version 2.0 (Recoll 1.18.2 and later).
  • The recoll module became a package (with an internal recoll module) as of Recoll version 1.19, in order to add more functions. For existing code, this only changes the way the interface must be imported.

We will mostly describe the new API and package structure here. A paragraph at the end of this section will explain a few differences and ways to write code compatible with both versions.

The Python interface can be found in the source package, under python/recoll.

The python/recoll/ directory contains the usual setup.py. After configuring the main Recoll code, you can use the script to build and install the Python module:

            cd recoll-xxx/python/recoll
            python setup.py build
            python setup.py install
          

The normal Recoll installer installs the Python API along with the main code.

When installing from a repository, and depending on the distribution, the Python API can sometimes be found in a separate package.

4.3.2.2. Recoll package

The recoll package contains two modules:

  • The recoll module contains functions and classes used to query (or update) the index.

  • The rclextract module contains functions and classes used to access document data.

4.3.2.3. The recoll module

Functions
connect(confdir=None, extra_dbs=None, writable = False)
The connect() function connects to one or several Recoll index(es) and returns a Db object.
  • confdir may specify a configuration directory. The usual defaults apply.
  • extra_dbs is a list of additional indexes (Xapian directories).
  • writable decides if we can index new data through this connection.
This call initializes the recoll module, and it should always be performed before any other call or object creation.
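
As a hedged illustration of the call, the sketch below wraps connect() in a small helper. The helper name open_index is an assumption; it only passes the arguments that were actually given, so that the defaults of connect() apply otherwise. The import is done inside the function so that the file can be loaded on a system where the Recoll Python package is not installed.

```python
def open_index(confdir=None, extra_dbs=None):
    """Connect to the default index, optionally adding extra indexes.

    open_index is a hypothetical helper; only recoll.connect() and its
    confdir / extra_dbs parameters come from the documentation above.
    """
    from recoll import recoll  # requires the Recoll Python package

    kwargs = {}
    if confdir is not None:
        kwargs["confdir"] = confdir
    if extra_dbs:
        # extra_dbs is documented as a list of Xapian index directories.
        kwargs["extra_dbs"] = list(extra_dbs)
    return recoll.connect(**kwargs)
```

A read-only connection is enough for searching, which is why writable is left at its default here.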
Classes
The Db class

A Db object is created by a connect() function and holds a connection to a Recoll index.

Methods

Db.close()
Closes the connection. You can't do anything with the Db object after this.
Db.query(), Db.cursor()
These aliases return a blank Query object for this index.
Db.setAbstractParams(maxchars, contextwords)
Sets the parameters used to build snippets (sets of keyword-in-context text fragments). maxchars defines the maximum total size of the abstract. contextwords defines how many terms are shown around the keyword.
Db.termMatch(match_type, expr, field='', maxlen=-1, casesens=False, diacsens=False, lang='english')
Expands an expression against the index term list. This performs the basic function of the GUI term explorer tool. match_type can be one of wildcard, regexp or stem. Returns a list of terms expanded from the input expression.
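
A possible way to use termMatch() is sketched below. The helper name expand_terms and the default limit of 20 are assumptions; db is expected to be a Db object returned by connect().

```python
def expand_terms(db, expr, match_type="wildcard", maxlen=20):
    """Return index terms matching expr (hypothetical helper).

    db is a Db object; match_type must be one of the documented values.
    """
    if match_type not in ("wildcard", "regexp", "stem"):
        raise ValueError("unsupported match_type: %r" % (match_type,))
    # maxlen is passed through as documented in the termMatch() signature.
    return list(db.termMatch(match_type, expr, maxlen=maxlen))
```

For instance, expand_terms(db, "flo*") would ask the index for terms matching the wildcard expression flo*.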
The Query class

A Query object (equivalent to a cursor in the Python DB API) is created by a Db.query() call. It is used to execute index searches.

Methods

Query.sortby(fieldname, ascending=True)
Sort results by fieldname, in ascending or descending order. Must be called before executing the search.
Query.execute(query_string, stemming=1, stemlang="english")
Starts a search for query_string, a Recoll search language string.
Query.executesd(SearchData)
Starts a search for the query defined by the SearchData object.
Query.fetchmany(size=query.arraysize)
Fetches the next Doc objects in the current search results, and returns them as an array of the required size, which is by default the value of the arraysize data member.
Query.fetchone()
Fetches the next Doc object from the current search results.
Query.close()
Closes the query. The object is unusable after the call.
Query.scroll(value, mode='relative')
Adjusts the position in the current result set. mode can be relative or absolute.
Query.getgroups()
Retrieves the expanded query terms as a list of pairs. Meaningful only after execute() or executesd(). In each pair, the first entry is a list of user terms (of size one for simple terms, or more for group and phrase clauses), and the second is a list of query terms as derived from the user terms and used in the Xapian Query.
Query.getxquery()
Returns the Xapian query description as a Unicode string. Meaningful only after execute() or executesd().
Query.highlight(text, ishtml = 0, methods = object)
Inserts <span class="rclmatch">, </span> tags around the match areas in the input text and returns the modified text. ishtml can be set to indicate that the input text is HTML and that HTML special characters should not be escaped. methods, if set, should be an object with methods startMatch(i) and endMatch(), which will be called for each match and should return a begin and an end tag respectively.
Query.makedocabstract(doc, methods = object)
Create a snippets abstract for doc (a Doc object) by selecting text around the match terms. If methods is set, will also perform highlighting. See the highlight method.
Query.__iter__() and Query.next()
These make the Query object iterable, so that constructs like for doc in query: will work.
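
The methods argument of highlight() and makedocabstract() can be illustrated with the following sketch. The class name BoldTags and the helper abstracts_with_tags are assumptions; only the startMatch(i) / endMatch() protocol, the iteration over the query and makedocabstract() come from the descriptions above.

```python
import itertools


class BoldTags(object):
    """Tag provider: Recoll calls these methods for each match area."""

    def startMatch(self, idx):
        # idx is the argument passed by Recoll; it is not used here.
        return "<b>"

    def endMatch(self):
        return "</b>"


def abstracts_with_tags(query, limit=5):
    """Return highlighted abstracts for the first results (hypothetical helper).

    query is a Query object on which a search was already executed.
    """
    tags = BoldTags()
    return [query.makedocabstract(doc, methods=tags)
            for doc in itertools.islice(query, limit)]
```

Using a custom object this way replaces the default span tags with whatever markup the application needs.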

Data descriptors

Query.arraysize
Default number of records processed by fetchmany (r/w).
Query.rowcount
Number of records returned by the last execute.
Query.rownumber
Next index to be fetched from results. Normally increments after each fetchone() call, but can be set/reset before the call to effect seeking (equivalent to using scroll()). Starts at 0.
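
As an example of combining rowcount, scroll() and fetchmany(), here is a minimal sketch of result paging. The helper name fetch_page and the page size of 10 are assumptions; query is a Query object on which a search was already executed.

```python
def fetch_page(query, page, page_size=10):
    """Return the Doc objects of a zero-based result page (hypothetical helper)."""
    start = page * page_size
    if start >= query.rowcount:
        return []  # past the end of the result set
    # Seek to the first result of the page, then fetch at most one page,
    # without asking for more records than the result set holds.
    query.scroll(start, mode="absolute")
    return query.fetchmany(min(page_size, query.rowcount - start))
```

Seeking with an absolute position makes the helper independent of what was fetched before the call.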
The Doc class

A Doc object contains index data for a given document. The data is extracted from the index when searching, or set by the indexer program when updating. The Doc object has many attributes to be read or set by its user. It matches exactly the Rcl::Doc C++ object. Some of the attributes are predefined, but, especially when indexing, others can be set, and their names will be processed as field names by the indexing configuration. Inputs can be specified as Unicode or strings. Outputs are Unicode objects. All dates are specified as Unix timestamps, printed as strings. Please refer to the rcldb/rcldoc.h C++ file for a description of the predefined attributes.

At query time, only the fields that are defined as stored, either by default or in the fields configuration file, will be meaningful in the Doc object. In particular, the document text is not available this way. See the rclextract module for accessing document contents.

Methods

get(key), [] operator
Retrieve the named doc attribute.
getbinurl()
Retrieve the URL in byte array format (no transcoding), for use as a parameter to a system call.
items()
Return a dictionary of doc object keys/values.
keys()
Return the list of doc object keys (attribute names).
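
The accessors above can be combined as in the following sketch, which collects the non-empty fields of a result. The helper name stored_fields is an assumption; doc is a Doc object returned by a query.

```python
def stored_fields(doc):
    """Return {name: value} for the non-empty fields listed by doc.keys().

    Hypothetical helper; it relies only on the keys() and get() methods.
    """
    fields = {}
    for name in doc.keys():
        value = doc.get(name)
        if value:  # skip fields which are empty for this document
            fields[name] = value
    return fields
```

Which names appear depends on the fields defined as stored in the configuration.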
The SearchData class

A SearchData object allows building a query by combining clauses, for execution by Query.executesd(). It can be used instead of the query language approach. The interface is expected to change somewhat, so it is not documented in detail for now.

Methods

addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', qstring=string, slack=0, field='', stemming=1, subSearch=SearchData)

4.3.2.4. The rclextract module

Index queries do not provide document content (only a partial and imprecise reconstruction is performed to show the snippets text). In order to access the actual document data, the data extraction part of the indexing process must be performed (subdocument access and format translation). This is not trivial in general. The rclextract module currently provides a single class which can be used to access the data content for result documents.

Classes
The Extractor class

Methods

Extractor(doc)
An Extractor object is built from a Doc object, output from a query.
Extractor.textextract(ipath)
Extract document defined by ipath and return a Doc object. The doc.text field has the document text converted to either text/plain or text/html according to doc.mimetype. The typical use would be as follows:
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing
Extractor.idoctofile(ipath, targetmtype, outfile='')
Extracts document into an output file, which can be given explicitly or will be created as a temporary file to be deleted by the caller. Typical use:
qdoc = query.fetchone()
extractor = rclextract.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
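
Because a temporary file created by idoctofile() has to be deleted by the caller, it can be convenient to wrap the call in a context manager. The following is a minimal sketch; the name extracted_file is an assumption, and extractor is expected to be an Extractor object built from qdoc.

```python
import contextlib
import os


@contextlib.contextmanager
def extracted_file(extractor, qdoc):
    """Yield the name of a temporary copy of the document, then delete it.

    Hypothetical helper around Extractor.idoctofile().
    """
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        yield filename
    finally:
        # The temporary file is ours to remove once the caller is done.
        if os.path.exists(filename):
            os.remove(filename)
```

The file name is then only valid inside the with block, for example: with extracted_file(extractor, qdoc) as name: followed by the code which opens or copies the file.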

4.3.2.5. Example code

The following sample would query the index with a user language string. See the python/samples directory inside the Recoll source for other examples. The recollgui subdirectory has a very embryonic GUI which demonstrates the highlighting and data extraction functions.

#!/usr/bin/env python

from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    # Build the abstract with the Query.makedocabstract() method described above
    abstract = query.makedocabstract(doc).encode('utf-8')
    print abstract
    print


4.3.2.6. Compatibility with the previous version

The following code fragments can be used to ensure that code can run with both the old and the new API (as long as it does not use the new abilities of the new API of course).

Adapting to the new package structure:


try:
    from recoll import recoll
    from recoll import rclextract
    hasextract = True
except ImportError:
    import recoll
    hasextract = False

Adapting to the change of nature of the next Query member. The same test can be used to choose to use the scroll() method (new) or set the next value (old).


rownum = query.next if isinstance(query.next, int) else \
         query.rownumber
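
The same test applied to seeking could look like the sketch below. The helper name seek is an assumption; it sets the next value with the old API and calls scroll() with the new one.

```python
def seek(query, index):
    """Move to result number index with either API version (hypothetical helper)."""
    if isinstance(getattr(query, "next", None), int):
        # Old API: next is an integer data member holding the next index.
        query.next = index
    else:
        # New API: next is a method (or absent), so use scroll().
        query.scroll(index, mode="absolute")
```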