Apache Lucene is a full-text search framework written in Java. Its appealing features include high-performance indexing, powerful and efficient search algorithms, and cross-platform support. PyLucene is a Python extension for using (Java) Lucene. Its goal is to let users index and search text with Lucene from within Python. PyLucene is not a port but a Python wrapper around Java Lucene: it embeds Lucene, running in a JVM, into a Python process.
This is a quick guide to PyLucene. We show code snippets for full-text searching on Bible verses, which are stored in a dictionary data structure. There are three steps in building the index: create an index, fill the index, and close resources. In the first step, we choose StandardAnalyzer as the analyzer and SimpleFSDirectory as the (file) storage scheme for our IndexWriter. In the second step, each verse (document) is labelled with five fields that serve as the index for future searches. The text of each verse is labelled "Text". Our search will be primarily on this field. However, the text of each verse is indexed but not stored (Field.Store.NO) in the index, since all verses are already stored in our main data store (the bible dictionary). The "Testament" field lets us distinguish whether the verse is in the Old or the New Testament. The last three fields (book, chapter, verse) are keys into the data store that allow us to retrieve the text of the specified verse. Once the index is built, we close all the resources in the last step.
The snippet for building the index is as follows:
import lucene
from lucene import (Document, Field, File, IndexWriter, IndexWriterConfig,
                    SimpleFSDirectory, StandardAnalyzer, Version)

def make_index():
    '''Make index from data source -- bible
    Some global variables used:
    bible: a dictionary that stores all bible verses
    OTbooks: a list of books in old testament
    NTbooks: a list of books in new testament
    chapsInBook: number of chapters in each book, indexed by book
    '''
    lucene.initVM()
    path = raw_input("Path for index: ")
    # 1. create an index
    index_path = File(path)
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(index_path)
    config = IndexWriterConfig(Version.LUCENE_35, analyzer)
    writer = IndexWriter(index, config)
    # 2. construct documents and fill the index
    for book in bible.keys():
        if book in OTbooks:
            testament = "Old"
        else:
            testament = "New"
        for chapter in xrange(1, chapsInBook[book]+1):
            for verse in xrange(1, len(bible[book][chapter])+1):
                verse_text = bible[book][chapter][verse]
                doc = Document()
                # the verse text is searchable but not stored in the index
                doc.add(Field("Text", verse_text, Field.Store.NO, Field.Index.ANALYZED))
                doc.add(Field("Testament", testament, Field.Store.YES, Field.Index.ANALYZED))
                # stored keys for looking the verse up in the bible dictionary
                doc.add(Field("Book", book, Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Chapter", str(chapter), Field.Store.YES, Field.Index.ANALYZED))
                doc.add(Field("Verse", str(verse), Field.Store.YES, Field.Index.ANALYZED))
                writer.addDocument(doc)
    # 3. close resources
    writer.close()
    index.close()
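The loops above assume that bible is a nested dictionary keyed by book, then chapter, then verse number. As a minimal sketch of that layout, the following flattens such a dictionary into the five-field records that step 2 turns into documents. The two sample verses, the helper name iter_verse_records, and its parameter names are illustrative assumptions, not part of the original program, and no Lucene calls are involved:

```python
# Tiny stand-in for the real data store (assumed layout: book -> chapter -> verse -> text).
sample_bible = {
    "Genesis": {1: {1: "In the beginning God created the heaven and the earth."}},
    "John": {11: {35: "Jesus wept."}},
}
sample_ot_books = ["Genesis"]


def iter_verse_records(bible, ot_books):
    """Yield one dict per verse holding the five field values used in the index."""
    for book in sorted(bible):
        testament = "Old" if book in ot_books else "New"
        for chapter in sorted(bible[book]):
            for verse in sorted(bible[book][chapter]):
                yield {
                    "Text": bible[book][chapter][verse],
                    "Testament": testament,
                    "Book": book,
                    "Chapter": str(chapter),  # stored as strings, as in make_index()
                    "Verse": str(verse),
                }


records = list(iter_verse_records(sample_bible, sample_ot_books))
```

Each record corresponds to one Document; only the last four values would be stored in the index, while "Text" would be analyzed and then discarded.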
There are five steps in our simple search: open the index, parse the query string, search the index, display results, and close resources. In the first step, we use IndexReader to open the index built earlier. The query string (kwds) is parsed by the QueryParser in the second step. The search itself is done in step three by IndexSearcher, and the results are stored in the object hits. In step four, we read (via getField) the book, chapter, and verse fields from the documents returned in the previous step, which allows us to retrieve the Bible verses from the data store and display them. Finally, we close all resources in step five.
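The query string in step two may contain more than one keyword. Lucene's classic query syntax marks a required term with a leading + and restricts a term to a field with field:term, so a string such as +love +faith +Testament:New asks for New Testament verses that contain both words. The helper below is a hypothetical convenience, not part of PyLucene; it only builds such a string and assumes plain single-word keywords with no special characters that would need escaping:

```python
def build_query_string(keywords, testament=None):
    """Build a query string in which every keyword is required.

    keywords:  list of single words to look for in the default field
    testament: optional "Old" or "New" to restrict the Testament field
    """
    clauses = ["+" + kw for kw in keywords]
    if testament is not None:
        clauses.append("+Testament:" + testament)
    return " ".join(clauses)


both_words = build_query_string(["love", "faith"])
new_only = build_query_string(["love", "faith"], testament="New")
```

The resulting string can be passed as kwds to the search function, where QueryParser interprets it against the "Text" field by default.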
The snippet for searching and displaying results is as follows:
import lucene
from lucene import (File, IndexReader, IndexSearcher, QueryParser,
                    SimpleFSDirectory, StandardAnalyzer, Version)

def search(indexDir, kwds):
    '''Simple Search
    Input parameters:
    1. indexDir: directory name of the index
    2. kwds: query string for this simple search
    display_verse(): procedure to display the specified bible verse
    '''
    lucene.initVM()
    # 1. open the index
    analyzer = StandardAnalyzer(Version.LUCENE_35)
    index = SimpleFSDirectory(File(indexDir))
    reader = IndexReader.open(index)
    n_docs = reader.numDocs()
    # 2. parse the query string
    queryparser = QueryParser(Version.LUCENE_35, "Text", analyzer)
    query = queryparser.parse(kwds)
    # 3. search the index (ask for at most n_docs hits, i.e. all matches)
    searcher = IndexSearcher(reader)
    hits = searcher.search(query, n_docs).scoreDocs
    # 4. display results
    for hit in hits:
        doc = searcher.doc(hit.doc)
        book = doc.getField('Book').stringValue()
        chapter = doc.getField('Chapter').stringValue()
        verse = doc.getField('Verse').stringValue()
        display_verse(book, int(chapter), int(verse))
    # 5. close resources (closing the searcher does not close the reader it was given)
    searcher.close()
    reader.close()
    index.close()
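The search function calls display_verse(), which is not shown in this guide. A minimal sketch of what it might look like follows, assuming the nested bible dictionary used when building the index. The sample data, the format_verse helper, and the output layout are assumptions made for illustration only:

```python
# Stand-in for the global data store (assumed layout: book -> chapter -> verse -> text).
bible = {
    "John": {11: {35: "Jesus wept."}},
}


def format_verse(book, chapter, verse):
    """Return a one-line reference plus text, e.g. 'John 11:35  Jesus wept.'"""
    return "%s %d:%d  %s" % (book, chapter, verse, bible[book][chapter][verse])


def display_verse(book, chapter, verse):
    """Print the verse identified by the keys stored in the index."""
    print(format_verse(book, chapter, verse))


line = format_verse("John", 11, 35)
```

Because the index stores only the keys, the verse text always comes from the dictionary, so the index stays small and the data store remains the single source of the text.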