Thursday, June 26, 2014

Musings on OpenLibrary and my first encounter with Solr 1.4.1

OpenLibrary is an ambitious project that runs at openlibrary.org. As the name says, it's an open(-sourced) project. The code lives on GitHub at https://github.com/internetarchive/openlibrary/.

The developers working on the project are extremely talented. The range of technologies, and the way they have implemented them, is awesome. Users can search for a book (of course), read it online (using the beautiful BookReader client-side app) and also borrow a book (I haven't used this feature).

One of my clients wanted me to integrate the OpenLibrary project into their website. They already had some parts working: BookReader ran fine, but searching inside a book didn't.

OpenLibrary uses Solr as its search engine; it is said to be one of the most powerful search backends. The previous developer was no longer around, which was a big issue, and there wasn't much documentation for my task either.

I realised from the scripts that Solr 1.4.1 was the version to use. After reading more of the BookReader code, I saw that a search made a call to Solr similar to:

<server>:8984/solr/inside/select?rows=1&wt=json&fl=ia,body_length,page_count&hl=true&hl.fl=body&hl.fragsize=0&hl.maxAnalyzedChars=-1&hl.usePhraseHighlighter=true&hl.simple.pre={{{&hl.simple.post=}}}&q.op=AND&q=ia%3A<id_of_opened_book>

It then makes a similar second call with q set to ia%3A<id_of_opened_book>%20AND%20<searched_query>.
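To make this concrete, here is a rough Python sketch of that second call; the server URL, book id and query are placeholders, not values from the actual setup:

    import requests

    # Placeholder values; <server> and the book id depend on your deployment.
    SOLR_URL = "http://localhost:8984/solr/inside/select"
    book_id = "id_of_opened_book"
    query = "some words"

    params = {
        "rows": 1,
        "wt": "json",
        "fl": "ia,body_length,page_count",
        "hl": "true",
        "hl.fl": "body",
        "hl.fragsize": 0,
        "hl.maxAnalyzedChars": -1,
        "hl.usePhraseHighlighter": "true",
        "hl.simple.pre": "{{{",
        "hl.simple.post": "}}}",
        "q.op": "AND",
        "q": "ia:%s AND %s" % (book_id, query),
    }
    result = requests.get(SOLR_URL, params=params).json()
    # Highlighted snippets, wrapped in {{{ }}}, live under the "highlighting" key.
    snippets = result["highlighting"]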

This second call returns the highlighted results, arranged as JSON. The next task is to locate and highlight the queried words on the ebook. For this we have the XML output of an OCR run; in this case, ABBYY FineReader was used. The queried words are located using the OCR XML file and highlighted on the ebook.

Now the only thing remaining is to get Solr working for full-text search. For this, OpenLibrary calls a PHP file named abby_to_text.php, which basically reads the OCR file and extracts paragraphs from it. The result gets saved into Solr.
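I haven't reproduced abby_to_text.php here, but the gist of it in Python terms is roughly this (the file name is made up, and the real script keeps the paragraph structure rather than producing one big string):

    import xml.etree.ElementTree as ET

    # Made-up file name; in reality this is the ABBYY OCR XML output.
    tree = ET.parse("book_abbyy.xml")

    # Collapse every text node in the OCR XML into one body string.
    body = " ".join(t.strip() for t in tree.getroot().itertext() if t.strip())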

To save into Solr, we build an XML document containing at least the required fields, as mentioned in schema.xml.
The schema I am using is at https://github.com/internetarchive/openlibrary/blob/master/conf/solr-biblio/inside/conf/schema.xml.
The required fields are:
 <field name="ia" type="string" required="true" />
   <field name="body" type="textgen" required="true" compressed="true" />
   <field name="body_length" type="int" required="true" />
   <field name="page_count" type="int" indexed="true" required="true" />
Here ia is the book id and body is the text of the book.
Also, you need to commit after posting so that you can immediately see the results in the Solr admin.
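As a minimal sketch of the indexing step (assuming Solr listens on port 8984 as in the search calls above, and using a made-up book id and page count):

    import requests
    from xml.sax.saxutils import escape

    UPDATE_URL = "http://localhost:8984/solr/inside/update"
    HEADERS = {"Content-Type": "text/xml"}

    body = "full text of the book, extracted from the OCR XML"
    doc = """<add><doc>
      <field name="ia">some_book_id</field>
      <field name="body">%s</field>
      <field name="body_length">%d</field>
      <field name="page_count">320</field>
    </doc></add>""" % (escape(body), len(body))

    # Post the document, then commit so it shows up in the Solr admin right away.
    requests.post(UPDATE_URL, data=doc.encode("utf-8"), headers=HEADERS)
    requests.post(UPDATE_URL, data="<commit/>", headers=HEADERS)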

The important thing here is that this schema belongs to the inside core, which was also new to me.
More problems came from the old Solr version; 1.4.1 is more than four years old.

But anyway, it was a good learning experience.




Converting Markdown to reST format

Generally we write our README on GitHub in Markdown, but when building Read the Docs pages or a PyPI package, we need reST docs.

This is where Pandoc comes in handy.

Just run the command

pandoc --from=markdown --to=rst --output=install.rst install.md

and the reST file is ready. Awesome!

Friday, April 25, 2014

Continuous Integration and Deployment using Bamboo and AWS Elastic Beanstalk

A walk-through of setting up Bamboo for CI and CD

Bamboo is a popular Atlassian product. Let's set up Bamboo and go through the steps I took.
  • Install Bamboo on an EC2 instance
    • Configure it to run on port 80 instead of its default.
    • Make sure the system has enough memory; I am using an m1.small instance.
    • Bamboo has a startup script; use it, and mind the permissions thing. :P
  • For CI -
    • Check out the code
      • I used a post-push hook to trigger the build plan on Bamboo automatically
    • Install dependencies
      • Remember to clean the cache and remove node_modules before installing
    • Run tests
      • I used the Bamboo Mocha plugin for that. Ample documentation is provided for it
    • That's it!
  • For CD -
    • Set up the deployment server -
      • We are using AWS Elastic Beanstalk, our app being a Node.js one.
    • The deployment process is tricky. Manually, we would have to initialize the repo and feed in a lot of details. To do it automatically:
      • Initialize the repository with the AWSDevTools-RepositorySetup.sh script. It adds git aliases, giving us the git aws.push command.
      • The deploy script looks for a file named aws_credentials_file in the .elasticbeanstalk dir under the deploying user's home folder. So one task is to copy that file into place during each deployment (see the sketch after this list).
    • The rest is simple.
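For that copy step, here is a small sketch of what the deployment task could run; the source path is an assumption about where the credentials file is staged:

    import os
    import shutil

    # Assumed staging location for the credentials file.
    SOURCE = "/opt/bamboo/secrets/aws_credentials_file"

    # The deploy script expects the file under ~/.elasticbeanstalk of the
    # user running the deployment.
    target_dir = os.path.join(os.path.expanduser("~"), ".elasticbeanstalk")
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copy(SOURCE, os.path.join(target_dir, "aws_credentials_file"))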

This blog post also has a lot of important details that helped me: http://blog.pedago.com/2014/02/18/build-and-deploy-with-grunt-bamboo-and-elastic-beanstalk/

The next step is to include code coverage; I will cover that in the next blog post.

Wednesday, January 29, 2014

Python? But why Python?

Well, I have been using Python for the last three years. It has been my main development language, along with JavaScript.

Often, someone comes around and asks: why Python? My answer usually comes from my personal programming tasks, where Python is very handy compared to other languages I have encountered, like PHP or Java.

Well, here is a post that answers this question exactly. Check it here.

Here are the points -

  • Efficient - has generators (see the sketch below)
  • Fast
  • Broad use
  • Not just a language, has a number of implementations
  • Easy
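To illustrate the first point: generators produce values lazily, so large sequences can be processed in constant memory. A tiny sketch:

    def squares(n):
        # Yields one value at a time instead of building a full list in memory.
        for i in range(n):
            yield i * i

    total = sum(squares(10 ** 6))  # memory use stays flat however large n gets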

Cheers!!

Yet another post on Redis

While working on a project, we used Redis as a queue via python-rq. Running redis-cli, I used the following commands:


  • keys *
  • type <key name>
  • and then, according to the type (hash, list), I would query the data
Some entries were quite easy to understand:
  • rq:workers
  • rq:queue:failed
  • rq:queue:default
  • and a success one as well
But apart from these, there were several entries named rq:job:<job_id>. After much reading, I found the internal workings described at http://python-rq.org/contrib/.

It says that whenever a function call gets enqueued, rq:
  • pushes the job's id onto the queue, in my case the default one
  • adds a hash object holding the job instance's data
So when a dequeue happens, it:
  • pops a job id from the queue
  • fetches the job data
  • executes the function and saves the result as a hash key on success
  • else saves the job in the failed queue along with the stack trace
All of this is described on the python-rq site; a minimal sketch follows.
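For a concrete picture, here is a minimal python-rq sketch; the count_words function and its mytasks module are made up (rq needs the task to live in an importable module, not in __main__):

    from redis import Redis
    from rq import Queue

    from mytasks import count_words  # hypothetical task module

    q = Queue(connection=Redis())  # this is the "default" queue
    job = q.enqueue(count_words, "hello rq world")

    # This id is what shows up in redis-cli as rq:job:<job_id>; the same id
    # is pushed onto the rq:queue:default list.
    print(job.id)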

There are two kinds of errors I saw:
  • RuntimeError: maximum recursion depth exceeded while calling a Python object - this happened in queue.py of the python-rq module; I think it was caused when control hit the maximum recursion limit because the job hashes discussed above were missing during dequeue
  • Socket closed on remote end - the server closes a client connection after 300 seconds of inactivity; in my case I didn't want that, so I let connections stay open forever by changing the timeout value to 0 in /etc/redis/redis.conf
Go Redis!!

Sunday, January 19, 2014

Python Decorators - The correct way to do it

I was going through Graham Dumpleton's blog post, "How you implemented your Python decorator is wrong". The main points discussed were:


  • Decorators can be functions as well as classes.
    As a class:

    class function_wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
        def __call__(self, *args, **kwargs):
            return self.wrapped(*args, **kwargs) 
    @function_wrapper
    def function():
        pass 

    As a function:

    def function_wrapper(wrapped):
        def _wrapper(*args, **kwargs):
            return wrapped(*args, **kwargs)
        return _wrapper 
    @function_wrapper
    def function():
        pass 

  • Use the functools.wraps decorator; it preserves the original function's attributes, such as __name__ and __doc__ (see the demo after this list).
    In functions:

    import functools 
    def function_wrapper(wrapped):
        @functools.wraps(wrapped)
        def _wrapper(*args, **kwargs):
            return wrapped(*args, **kwargs)
        return _wrapper 
    @function_wrapper
    def function():
        pass 

    In class-based decorators, use functools.update_wrapper:

    import functools 
    class function_wrapper(object):
        def __init__(self, wrapped):
            self.wrapped = wrapped
            functools.update_wrapper(self, wrapped)
        def __call__(self, *args, **kwargs):
            return self.wrapped(*args, **kwargs)
  • Python 2.7 preserves the argument specification (inspect.getargspec) only for function-based decorators, not for class-based ones.
  • They don't preserve the function's source code for inspection (inspect.getsource).
  • Decorators cannot be applied on top of other decorators that are implemented as descriptors.
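As a quick demo of why functools.wraps matters (a sketch of my own, not taken from Graham's post verbatim): without it, the wrapper's metadata shadows the wrapped function's.

    def naive_wrapper(wrapped):
        def _wrapper(*args, **kwargs):
            return wrapped(*args, **kwargs)
        return _wrapper

    @naive_wrapper
    def function():
        """My docstring."""
        pass

    print(function.__name__)  # prints "_wrapper", not "function"
    print(function.__doc__)   # prints None; the docstring is lost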

Saturday, January 11, 2014

Failed to attach to key daemon - Error in Shrew Soft VPN

Using Shrew Soft VPN on Ubuntu 12.04, I often face this error:

Failed to attach to key daemon 

After some googling I found this post on the Ubuntu Forums.

It says that this error appears because the IKE daemon isn't running. Start it with:

sudo iked

Hope this helps.