Thursday, June 26, 2014

Musings on OpenLibrary and my first encounter with Solr 1.4.1

OpenLibrary is an ambitious project that runs at openlibrary.org. As the name suggests, it is an open(-sourced) project. The code lives on GitHub at https://github.com/internetarchive/openlibrary/.

The developers working on the project are extremely talented. The number of technologies involved, and the way they have been implemented, is impressive. Users can search for a book (of course), read it online (using the beautiful BookReader client-side app), and also borrow a book (I haven't used this feature).

One of my clients wanted me to integrate the OpenLibrary project into their website. They already had some parts working: BookReader was functional, but the feature for searching inside a book wasn't.

OpenLibrary uses Solr as its search engine. It is said to be the most powerful search backend. The previous developer was no longer around, which was a big issue, and there was not much documentation for my task.

I realised from the scripts that Solr 1.4.1 was to be used. After reading more of the BookReader code, I realised that when we searched, it made a call to Solr similar to:

<server>:8984/solr/inside/select?rows=1&wt=json&fl=ia,body_length,page_count&hl=true&hl.fl=body&hl.fragsize=0&hl.maxAnalyzedChars=-1&hl.usePhraseHighlighter=true&hl.simple.pre={{{&hl.simple.post=}}}&q.op=AND&q=ia%3A<id_of_opened_book>

It makes a similar second call with q set to ia%3A<id_of_opened_book>%20AND%20<searched_query>.
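Pieced together from the BookReader call above, both requests can be reproduced with a small helper. Only the parameter list comes from the call above; the function name and argument layout are my own. A minimal sketch in Python:

```python
from urllib.parse import urlencode

def build_inside_query(base, book_id, query=None):
    """Build the BookReader-style URL against the Solr 'inside' core.

    With query=None this is the first call (book metadata only); with a
    query it is the second call, which returns highlighted snippets.
    """
    q = f"ia:{book_id}" if query is None else f"ia:{book_id} AND {query}"
    params = {
        "rows": 1,
        "wt": "json",
        "fl": "ia,body_length,page_count",
        "hl": "true",
        "hl.fl": "body",
        "hl.fragsize": 0,
        "hl.maxAnalyzedChars": -1,
        "hl.usePhraseHighlighter": "true",
        "hl.simple.pre": "{{{",
        "hl.simple.post": "}}}",
        "q.op": "AND",
        "q": q,  # kept last so the query is easy to spot in the URL
    }
    return base + "?" + urlencode(params)
```

Fetching `build_inside_query("http://<server>:8984/solr/inside/select", "<id_of_opened_book>", "whale")` with any HTTP client then yields the highlighted JSON response.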

In this second call, we get the highlighted results, arranged as JSON. The next task is to locate and highlight the queried words on the ebook. For this, we have an XML file produced by OCR; in this case, ABBYY reader was used. The queried words are located using the OCR XML file and highlighted on the ebook.
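The coordinate lookup can be sketched roughly like this. The element and attribute names (charParams with l/t/r/b bounding-box attributes, one element per character) follow the ABBYY FineReader XML layout as I understand it, so treat this as an assumption-laden illustration, not OpenLibrary's actual code:

```python
import xml.etree.ElementTree as ET

def word_boxes(abbyy_xml, word):
    """Return bounding boxes (l, t, r, b) for each occurrence of `word`.

    Assumes ABBYY-style <charParams l=".." t=".." r=".." b=".."> elements,
    one per character; whitespace characters separate words.
    """
    root = ET.fromstring(abbyy_xml)
    boxes = []
    current = []  # (char, attrs) pairs for the word being assembled

    def flush():
        if current:
            text = "".join(c for c, _ in current)
            if text.lower() == word.lower():
                l = min(int(a["l"]) for _, a in current)
                t = min(int(a["t"]) for _, a in current)
                r = max(int(a["r"]) for _, a in current)
                b = max(int(a["b"]) for _, a in current)
                boxes.append((l, t, r, b))
            current.clear()

    for cp in root.iter("charParams"):
        ch = cp.text or " "
        if ch.strip():
            current.append((ch, cp.attrib))
        else:
            flush()
    flush()
    return boxes
```

The returned boxes can then be drawn as highlight overlays on the rendered page image in BookReader.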

Now the only thing that remains is to get Solr working for full-text search. For this, OpenLibrary makes a call to a PHP file called abby_to_text.php, which basically reads the OCR file and extracts paragraphs from it. The extracted text gets saved into Solr.
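In spirit, the extraction step looks something like the sketch below. This is a rough Python rendering of what abby_to_text.php presumably does, not the actual script; the page/par/charParams element names are assumptions based on ABBYY-style output (the real files also carry an XML namespace, which I have ignored here):

```python
import xml.etree.ElementTree as ET

def extract_text(abbyy_xml):
    """Collect the page count and full body text from OCR XML.

    Paragraphs are taken from <par> elements; each <charParams>
    contributes one character (empty ones count as spaces).
    """
    root = ET.fromstring(abbyy_xml)
    pages = root.findall(".//page")
    paragraphs = []
    for par in root.iter("par"):
        text = "".join(cp.text or " " for cp in par.iter("charParams"))
        paragraphs.append(text.strip())
    body = "\n\n".join(p for p in paragraphs if p)
    return {"page_count": len(pages), "body": body, "body_length": len(body)}
```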

To save into Solr, we build an XML document containing at least the required fields, as mentioned in schema.xml.
The schema I am using is at https://github.com/internetarchive/openlibrary/blob/master/conf/solr-biblio/inside/conf/schema.xml.
The required fields are:
<field name="ia" type="string" required="true" />
<field name="body" type="textgen" required="true" compressed="true" />
<field name="body_length" type="int" required="true" />
<field name="page_count" type="int" indexed="true" required="true" />
Here ia is the book id and body is the text of the book.
Also, you need to commit after adding documents so that the results show up immediately in the Solr admin.
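Putting the pieces together, the update document can be built and posted to Solr's XML update handler, followed by a commit. The field names come straight from the schema above; the helper names and the host/port are mine:

```python
from xml.sax.saxutils import escape

def build_add_xml(ia, body, page_count):
    """Build a Solr <add> document with the four required schema fields."""
    return (
        "<add><doc>"
        f"<field name='ia'>{escape(ia)}</field>"
        f"<field name='body'>{escape(body)}</field>"
        f"<field name='body_length'>{len(body)}</field>"
        f"<field name='page_count'>{page_count}</field>"
        "</doc></add>"
    )

def post_to_solr(update_url, ia, body, page_count):
    """POST the document, then a <commit/> so it is visible immediately."""
    from urllib.request import Request, urlopen
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    for payload in (build_add_xml(ia, body, page_count), "<commit/>"):
        urlopen(Request(update_url, payload.encode("utf-8"), headers)).read()
```

For example, `post_to_solr("http://<server>:8984/solr/inside/update", book_id, body, page_count)` indexes one book into the inside core.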

The important thing here is that this schema belongs to the inside core, which was also new to me.
More problems came from the old Solr version; 1.4.1 is more than four years old.

But anyway, it was a good learning experience.




Converting Markdown to reST format

Generally we write a README on GitHub in Markdown, but when building docs for Read the Docs or packaging for PyPI, we need reST files.

For this, Pandoc comes in handy.

Just run the command

pandoc --from=markdown --to=rst --output=install.rst install.md

and the reST file is ready. Awesome!