Monday, December 16, 2013

Learning PyLucene by Example

Apache Lucene is a a full text search framework built in Java. It has many appealing features such as high-performance indexing, powerful and efficient search algorithms, and cross-platform solution. PyLucene is a Python extension for using (Java) Lucene. Its goal is to allow users to use Lucene for text indexing and searching within Python. PyLucene is not a port but a Python wrapper around Java Lucene, which embeds Lucene running in a JVM into a Python process.

This is a quick guide on PyLucene. We show code snippets for full-text searching on bible versers, which are stored in a dictionary data structure. There are three steps in buliding the index: create an index, fill the index and close resources. In the first step, we choose StandardAnalyzer as the analyzer, SimpleFSDirectory as (file) storage scheme for our IndexWriter. In the second step, each verse (document) is labelled with five fields that serve as index for future search. The text of each verse is labelled as "Text."  Our search will be primarily on this field. However, the text of each verse is for indexing only but not stored (Field.Store.NO) in the index, since all verses are already stored in our main data store (bible dictionary). Label "Testament" allows us to distinguish if the verse is in Old or New testament. The last three fields: book, chapter, verse are keys to the data store that allow us to retrieve the text of the specified verse. Once the index is built, we close all the resources in the last step.

The snippet for building the index is as follows:

def make_index():
    '''Make index from data source -- bible
    Some global variables used:
        bible: a dictionary that stores all bible verses
        OTbooks: a list of books in old testament
        NTbooks: a list of books in new testament
        chapsInBook: a list of number of chapters in each book
    '''
    lucene.initVM()
    path = raw_input("Path for index: ")
    # 1. create an index
    index_path = File(path)
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(index_path)
    config = IndexWriterConfig(Version.LUCENE_35, analyzer)
    writer = IndexWriter(index, config)

    # 2 construct documents and fill the index
    for book in bible.keys():
        if book in OTbooks:
            testament = "Old"
        else:
            testament = "New"
        for chapter in xrange(1, chapsInBook[book]+1):
            for verse in xrange(1, len(bible[book][chapter])+1):
                verse_text = bible[book][chapter][verse]
                doc = Document()
                doc.add(Field("Text", verse_text, Field.Store.NO, Field.Index.ANALYZED))
                doc.add(Field("Testament", testament, Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Book", book, Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Chapter", str(chapter), Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Verse", str(verse), Field.Store.YES, Field.Index.ANALYZED))
                writer.addDocument(doc)

    # 3. close resources
    writer.close()
    index.close()


There are five steps in our simple search: open the index, parse the query string, search the index, display results and close resources. In the first step, we use IndexReader to open the index built before for our simple search. The query string (kwds) is parsed by the QueryParser in the second step. The search job is done in step three by IndexSearcher, and the results are stored in the object hits. In step four, we get (getField) the book, chapter, and verse fields from the documents returned in the previous step, which allow us to retrieve the bible verses from the data store and display them. Finally, we close all resources in step five.

The snippet for searching and displaying results is as follows:

def search(indexDir, kwds):
    '''Simple Search
    Input paramenters:
        1. indexDir: directory name of the index
        2. kwds: query string for this simple search
    display_verse(): procedure to display the specified bible verse
    '''
    lucene.initVM()
    # 1. open the index
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(File(indexDir)
    reader = IndexReader.open(index)
    n_docs = reader.numDocs()

    # 2. parse the query string
    queryparser = QueryParser(Version.LUCENE_35, "Text", analyzer)
    query = queryparser.parse(kwds)

    # 3. search the index
    searcher = IndexSearcher(reader)
    hits = searcher.search(query, n_docs).scoreDocs

    # 4. display results
    for i, hit in enumerate(hits):
        doc = searcher.doc(hit.doc)
        book = doc.getField('Book').stringValue()
        chapter = doc.getField('Chapter').stringValue()
        verse = doc.getField('Verse').stringValue()
        display_verse(book, int(chapter), int(verse))


    # 5. close resources
    searcher.close()



Wednesday, December 11, 2013

Using Java Within Python Applications -- Part II

In the second part, we exam the question: how to add JAR files to the sys.path at runtime? One's first thought is that setting CLASSPATH (an environment variable) should resolve the issue. Why would one bother to add JAR files at runtime? This is the one of the topics in the Jython book entitled "working with CLASSPATH." It provides two good reasons along with a solution for this question.

The first reason is to make end users' life easier for not to know anything about environment variables. The second and more compelling reason is "when there is no normal user account to provide environment variables." The issue "add JAR files to the sys.path at runtime" in Jython is similar to "load classes (JAR files) at runtime" in the Java world. Fortunately, solution exists for the Java case. The classPathHacker (Listing B-11) presented in Jython book is a translation of that solution from the Java world.

In the second part of this blog we will go through a more practical example to finish our tour on using Java within Python applications. The Java example (SnS.java) selected here is modified from the APIExamples.java that comes with the JSword.

The only task provided by SnS class is defined in the searchAndShow method, which takes a string (kwds) as key words, searches (bible.find()) through all books in the bible, and returns (html formatted) findings in a string (result.toString()). There are over a dozen jar files in the distribution of JSword package. One may start JSword application with a shell script (BiblDesktop.sh) that sets environment variables (including CLASSPATH) properly then initiates the GUI application.

In this example, we use Jython to add JAR files at runtime, to call SnS, and to print search results in text form. It is obvious that we have full knowledge of which JAR files are needed before starting the application, the example here is for educational purpose.

Here is the SnS Java class, which requires a few JAR files:

import java.net.URL;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.crosswire.common.util.NetUtil;
import org.crosswire.common.util.ResourceUtil;
import org.crosswire.common.xml.SAXEventProvider;
import org.crosswire.common.xml.TransformingSAXEventProvider;
import org.crosswire.common.xml.XMLUtil;
import org.crosswire.jsword.book.Book;
import org.crosswire.jsword.book.BookData;
import org.crosswire.jsword.book.BookException;
import org.crosswire.jsword.book.Books;
import org.crosswire.jsword.passage.Key;
import org.crosswire.jsword.passage.Passage;
import org.crosswire.jsword.passage.RestrictionType;
import org.xml.sax.SAXException;

/**
 * All the methods in this class highlight some are of the API and how to use it.
 *
 * @see gnu.lgpl.License for license details.
 *      The copyright to this program is held by it's authors.
 * @author Joe Walker [joe at eireneh dot com]
 */
public class SnS
{
    /**
     * The name of a Bible to find
     */
    private static final String BIBLE_NAME = "KJV"; //$NON-NLS-1$

    /**
     * An example of how to do a search and then get text for each range of verses.
     * @throws BookException
     * @throws SAXException
     *
     * @param kwds keyword to search                         JSL
     * @return search results is returned in String format   JSL
     *
     */
    public String searchAndShow(String kwds) throws BookException, SAXException
    {
        Book bible = Books.installed().getBook(BIBLE_NAME);

        Key key = bible.find(kwds); //$NON-NLS-1$

        // Here is an example of how to iterate over the ranges and get the text for each
        // The key's iterator would have iterated over verses.

        // The following shows how to use a stylesheet of your own choosing
        String path = "xsl/cswing/simple.xsl"; //$NON-NLS-1$
        URL xslurl = ResourceUtil.getResource(path);

        Iterator rangeIter = ((Passage) key).rangeIterator(RestrictionType.CHAPTER); // Make ranges break on chapter boundaries.
        //
        // prepare for result
        //      using a StringBuilder to hold all search results
        //              JSL
        //
        StringBuilder result = new StringBuilder();
        while (rangeIter.hasNext())
        {
            Key range = (Key) rangeIter.next();
            BookData data = new BookData(bible, range);
            SAXEventProvider osissep = data.getSAXEventProvider();
            SAXEventProvider htmlsep = new TransformingSAXEventProvider(NetUtil.toURI(xslurl), osissep);
            String text = XMLUtil.writeToString(htmlsep);
            result.append(text);
        }
        return result.toString();           // search results --- JSL
    }

    public static void main(String[] args) throws BookException, SAXException
    {
        SnS examples = new SnS();
        System.out.println(examples.searchAndShow("+what +wilt +thou"));
    }

}


As we mentioned before the classPathHacker is used for adding JAR files at runtime. Let cphacker.py be the file where classPathHacker is defined. Here is our Jython codes, where the file directory (dir) for storing all jar files is passed by the command line paremeter sys.argv. The main steps in this Jython codes are: add (jarLoad.addFile()) JAR files one by one at runtime, import the SnS class, create an SnS object (example), and let the object complete its task (searchAndShow).

import sys, glob
from cphacker import *

#
# preparation for Jar files (classpath) first
#
#   we need to do this prior to import the SnS class
#       this is because SnS import lots of classes in
#           those jar files
#       it does not work if you move this section to the main
#
jarLoad = classPathHacker()
dir = sys.argv[1]
jars = dir + "/*.jar"
jarfiles = glob.glob(jars)
for jar in jarfiles:
    jarLoad.addFile(jar)
jarLoad.addFile('.')

import SnS

if __name__ == "__main__":
    example = SnS()
    print example.searchAndShow("+what +wilt +thou").encode("utf-8")



There is one thing we need to address before closing this post: the
classPathHacker described in Jython book does not work for Jython 2.5.2. A slightly modified version can be obtained from glasblog, which works fine under

Using Java Within Python Applications -- Part I

There are two common approaches for using Java within Python applications. One is to construct a bridge between the Java virtual machine (JVM) and the Python interpreter, where Java and Python codes run separately. JPype, Py4J, JCC and Pyjnius are some examples in this category. The other, as Jython did, is to run Python within a JVM. Jython is described as:
an implementation of the Python programming language which is
designed to run on the Java(tm) Platform.
It is perceived that Jython is easier to install and to use compared with other options. The following examples give a brief tour of Jython. In our first example, we will see some Java objects (java.lang.Math) being used within Jython.

>>> from java.lang import Math
>>> Math.max(317, 220)
317L
>>> Math.pow(2, 4)
16.0

In our second example, we will create a Java object (Person) and use it within a Jython application.

Definition of the Person object: Person.java

public class Person {
    private String firstName;
    private String lastName;
    private int age;
    public Person(String firstName, String lastName, int age){
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }
    public String getFirstName() {
        return firstName;
    }
    public void setFirstName(String firstName) {
        this.firstName = firstName;
    }
    public int getAge() {
        return age;
    }
    public void setAge(int age) {
        this.age = age;
    }
}

Using Person.java in Jython (we need to compile the java code first):

>>> import Person
>>> john = Person("john", "dole", 27)
>>> john.getFirstName()
u'john'
>>> john.getAge()
27
>>> john.setFirstName("alias")
>>> john.getFirstName()
u'alias'

Our first two examples demonstrate that there is no distinction in using Java or Python objects within Jython: there is no need to start up JVM for using any Java objects, since Jython did that for us.

It is possible to extend (subclass) Java classes via Jython classes. Our third example, taken from the Jython book, show this.

The Java code that defines two methods: Calculator.java

/**
* Java calculator class that contains two simple methods
*/
public class Calculator {
    public Calculator(){
    }
    public double calculateTip(double cost, double tipPercentage){
        return cost * tipPercentage;
    }
    public double calculateTax(double cost, double taxPercentage){
        return cost * taxPercentage;
    }
}


The Python code (with minor corrections) to extend the Java class: JythonCalc.py

import Calculator
from java.lang import Math
class JythonCalc(Calculator):
    def __init__(self):
        pass
    def calculateTotal(self, cost, tip, tax):
        return cost + self.calculateTip(cost, tip) + self.calculateTax(cost, tax)
if __name__ == "__main__":
    calc = JythonCalc()
    cost = 23.75
    tip = .15
    tax = .07
    print "Starting Cost: ", cost
    print "Tip Percentage: ", tip
    print "Tax Percentage: ", tax
    print Math.round(calc.calculateTotal(cost, tip, tax))


The result will be:

Starting Cost:  23.75
Tip Percentage:  0.15
Tax Percentage:  0.07
29


One question remains: Is Jython the same language as Python? A short answer is Yes. In fact, there are differences. The main one is that the current version of Jython (v2.5) cannot use CPython extension modules written in C. If one wants to use such a module, one should look for an equivalent written in pure Java or Python. However, it is claimed that future release will eliminate this difference.