Monday, December 16, 2013

Learning PyLucene by Example

Apache Lucene is a a full text search framework built in Java. It has many appealing features such as high-performance indexing, powerful and efficient search algorithms, and cross-platform solution. PyLucene is a Python extension for using (Java) Lucene. Its goal is to allow users to use Lucene for text indexing and searching within Python. PyLucene is not a port but a Python wrapper around Java Lucene, which embeds Lucene running in a JVM into a Python process.

This is a quick guide on PyLucene. We show code snippets for full-text searching on bible versers, which are stored in a dictionary data structure. There are three steps in buliding the index: create an index, fill the index and close resources. In the first step, we choose StandardAnalyzer as the analyzer, SimpleFSDirectory as (file) storage scheme for our IndexWriter. In the second step, each verse (document) is labelled with five fields that serve as index for future search. The text of each verse is labelled as "Text."  Our search will be primarily on this field. However, the text of each verse is for indexing only but not stored (Field.Store.NO) in the index, since all verses are already stored in our main data store (bible dictionary). Label "Testament" allows us to distinguish if the verse is in Old or New testament. The last three fields: book, chapter, verse are keys to the data store that allow us to retrieve the text of the specified verse. Once the index is built, we close all the resources in the last step.

The snippet for building the index is as follows:

def make_index():
    '''Make index from data source -- bible
    Some global variables used:
        bible: a dictionary that stores all bible verses
        OTbooks: a list of books in old testament
        NTbooks: a list of books in new testament
        chapsInBook: a list of number of chapters in each book
    '''
    lucene.initVM()
    path = raw_input("Path for index: ")
    # 1. create an index
    index_path = File(path)
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(index_path)
    config = IndexWriterConfig(Version.LUCENE_35, analyzer)
    writer = IndexWriter(index, config)

    # 2 construct documents and fill the index
    for book in bible.keys():
        if book in OTbooks:
            testament = "Old"
        else:
            testament = "New"
        for chapter in xrange(1, chapsInBook[book]+1):
            for verse in xrange(1, len(bible[book][chapter])+1):
                verse_text = bible[book][chapter][verse]
                doc = Document()
                doc.add(Field("Text", verse_text, Field.Store.NO, Field.Index.ANALYZED))
                doc.add(Field("Testament", testament, Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Book", book, Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Chapter", str(chapter), Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Verse", str(verse), Field.Store.YES, Field.Index.ANALYZED))
                writer.addDocument(doc)

    # 3. close resources
    writer.close()
    index.close()


There are five steps in our simple search: open the index, parse the query string, search the index, display results and close resources. In the first step, we use IndexReader to open the index built before for our simple search. The query string (kwds) is parsed by the QueryParser in the second step. The search job is done in step three by IndexSearcher, and the results are stored in the object hits. In step four, we get (getField) the book, chapter, and verse fields from the documents returned in the previous step, which allow us to retrieve the bible verses from the data store and display them. Finally, we close all resources in step five.

The snippet for searching and displaying results is as follows:

def search(indexDir, kwds):
    '''Simple Search
    Input paramenters:
        1. indexDir: directory name of the index
        2. kwds: query string for this simple search
    display_verse(): procedure to display the specified bible verse
    '''
    lucene.initVM()
    # 1. open the index
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(File(indexDir)
    reader = IndexReader.open(index)
    n_docs = reader.numDocs()

    # 2. parse the query string
    queryparser = QueryParser(Version.LUCENE_35, "Text", analyzer)
    query = queryparser.parse(kwds)

    # 3. search the index
    searcher = IndexSearcher(reader)
    hits = searcher.search(query, n_docs).scoreDocs

    # 4. display results
    for i, hit in enumerate(hits):
        doc = searcher.doc(hit.doc)
        book = doc.getField('Book').stringValue()
        chapter = doc.getField('Chapter').stringValue()
        verse = doc.getField('Verse').stringValue()
        display_verse(book, int(chapter), int(verse))


    # 5. close resources
    searcher.close()



1 comment:

  1. Hello,
    Is there a method to check for multiple keywords.

    ReplyDelete