MicTadLo: 2014

Friday, 19 September 2014

PyMongo almost 2x faster with PyPy

Install PyPy and MongoDB with pacman under Arch Linux based distros:
```
$ sudo pacman -Sy
$ sudo pacman -S pypy mongodb
```

Create a new virtualenv (see howto) and install PyMongo:

$ mkvirtualenv testPyPy --python=/usr/bin/pypy
$ pip install pymongo

Start MongoDB:

$ mkdir /home/mictadlo/databases/
$ mongod --dbpath /home/mictadlo/databases/

I found a PyMongo benchmark script here:

import sys
import os
import pymongo
import time
import random

from datetime import datetime

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

job_id = '1'

if len(sys.argv) < 2:
    sys.exit("You must supply the item_number argument")
elif len(sys.argv) > 2:
    job_id = sys.argv[2]

documents_number = int(sys.argv[1])
batch_number = 5 * 1000

job_name = 'Job#' + job_id
start = datetime.now()

# obtain a mongo connection (pymongo.Connection is the PyMongo 2.x API;
# PyMongo 3 and later use pymongo.MongoClient instead)
connection = pymongo.Connection("mongodb://localhost", safe=True)

# obtain a handle to the random database
db = connection.random
collection = db.randomData

# pre-allocate one batch; every slot is overwritten before the batch is sent
batch_documents = [i for i in range(batch_number)]

for index in range(documents_number):
    try:
        # random timestamp between min_date and max_date
        date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
        value = random.random()
        document = {
            'created_on': date,
            'value': value,
        }
        batch_documents[index % batch_number] = document
        # send the batch once it is full
        if (index + 1) % batch_number == 0:
            collection.insert(batch_documents)
        index += 1
        # progress report every 100000 documents
        if index % 100000 == 0:
            print job_name, ' inserted ', index, ' documents.'
    except:
        print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
        raise
print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
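
One thing to keep in mind with the script above: it only sends a batch when all 5000 slots are full, so a document count that is not a multiple of 5000 would leave the last documents unsent. Here is a minimal sketch of the batching step on its own, written for Python 3. The helper names `random_document` and `batches` are my own, the date is built with `timedelta` rather than the `mktime` round trip, and the commented-out `MongoClient` / `insert_many` lines assume PyMongo 3 or later, not the version benchmarked in this post:

```python
import random
from datetime import datetime, timedelta

MIN_DATE = datetime(2012, 1, 1)
MAX_DATE = datetime(2013, 1, 1)
DELTA_SECONDS = (MAX_DATE - MIN_DATE).total_seconds()


def random_document():
    """Build one document with a random date in 2012 and a random value."""
    offset = timedelta(seconds=int(round(random.random() * DELTA_SECONDS)))
    return {'created_on': MIN_DATE + offset, 'value': random.random()}


def batches(total, batch_size):
    """Yield lists of documents; the last list may be shorter than batch_size."""
    batch = []
    for _ in range(total):
        batch.append(random_document())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the leftover documents
        yield batch


# 12 documents in batches of 5 give two full batches and one partial batch.
print([len(batch) for batch in batches(12, 5)])

# With a running server, each batch could be sent like this
# (assumption: PyMongo 3 or later is installed):
#     import pymongo
#     collection = pymongo.MongoClient('mongodb://localhost').random.randomData
#     for batch in batches(12000000, 5000):
#         collection.insert_many(batch)
```

Using a generator keeps the batching logic separate from the database call, so it can be checked without MongoDB running.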

PyPy is clearly faster than CPython, as the results below show: about 1.5x in wall-clock time (710 s vs. 464 s) and about 2.3x in user CPU time (6m55s vs. 3m3s).

CPython:

$ python --version
Python 2.7.8

$ time python test_pymongo.py 12000000
Job#1  inserted  12000000  in  709.949361 s

real    11m50.123s
user    6m55.263s
sys     0m48.803s

PyPy:

$ python --version
Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
[PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]

$ time python test_pymongo.py 12000000
Job#1  inserted  12000000  in  464.130798 s

real    7m44.711s
user    3m2.693s
sys     0m41.667s
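
To put a number on the difference, here is a small sketch that computes the speedup from the timings printed above. The values are copied from the two runs; the helper `to_seconds` and the dictionary names are made up for this example:

```python
def to_seconds(minutes, seconds):
    """Convert the 'XmY.ZZZs' format printed by time into seconds."""
    return minutes * 60 + seconds


# Timings copied from the two runs above.
cpython = {
    'script': 709.949361,
    'real': to_seconds(11, 50.123),
    'user': to_seconds(6, 55.263),
}
pypy = {
    'script': 464.130798,
    'real': to_seconds(7, 44.711),
    'user': to_seconds(3, 2.693),
}

# Speedup = CPython time divided by PyPy time.
speedup = {key: cpython[key] / pypy[key] for key in cpython}

for key in ('script', 'real', 'user'):
    print('%-6s %.2fx' % (key, speedup[key]))
```

The wall-clock speedup comes out at about 1.53x and the user CPU speedup at about 2.27x. A good part of the wall-clock time is presumably spent waiting for MongoDB, which PyPy cannot speed up.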

Virtualenvwrapper makes it easier to use Python's virtualenv

Install the following packages. I still need to use Python 2. If you want to use Python 3, just remove the 2 from the package names below:
```
$ sudo pacman -Sy
$ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
```

Set up virtualenvwrapper in your .bashrc:

$ nano ~/.bashrc
export WORKON_HOME=$HOME/.virtualenvs
source /usr/bin/virtualenvwrapper.sh

Create a virtualenv for a project, e.g. myProject:

$ mkvirtualenv myProject --python=/usr/bin/python2
$ pip install sphinx

Other virtualenv commands:

$ workon myProject        # activate the virtualenv called myProject

$ deactivate              # deactivate the current virtualenv

$ rmvirtualenv myProject  # delete the virtualenv called myProject

How to install Pandoc on Arch Linux based distros

Add the haskell-core repository above extra in /etc/pacman.conf:

$ sudo nano /etc/pacman.conf
[haskell-core]
Server = http://xsounds.org/~haskell/core/$arch

Remove alex, ghc, happy and any other Haskell packages that were installed from the extra repository.

Add the repository key, then install Pandoc and TeX Live (needed to generate PDFs):

$ sudo pacman-key -r 4209170B
$ sudo pacman-key --lsign-key 4209170B
$ sudo pacman -Scc
$ sudo pacman -Syy
$ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
$ sudo pacman -S texlive-core

I am using Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Sphinx uses reStructuredText (rst) instead of Markdown (md), and so do I. Here is a comparison between Markdown and reStructuredText. The Try pandoc site offers to convert between different text syntax formats.
Here is a little reStructuredText syntax:
```
Title
=====

Heading 1
---------

Heading 2
`````````

Heading 3
'''''''''

Heading 4
.........

Heading 5
~~~~~~~~~

Heading 6
*********

Heading 7
+++++++++

Heading 8
^^^^^^^^^

*Italic*

**bold**
```
ReStructuredText’s cheat sheet
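
In reStructuredText the underline has to be at least as long as the title text, which is easy to get wrong when typing it by hand. A small sketch that generates headings like the ones above (the helper name `rst_heading` is made up for illustration):

```python
def rst_heading(text, char='='):
    """Return an rst section title: the text plus an underline of equal length."""
    return text + '\n' + char * len(text)


print(rst_heading('Title', '='))
print()
print(rst_heading('Heading 1', '-'))
```

Which character means which level is not fixed in rst; the levels follow the order in which the underline characters first appear in the document.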

Converting reStructuredText to PDF and DOCX:

$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
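
If several output formats are needed regularly, the two commands can be wrapped in a few lines of Python. This is only a sketch: the function names `pandoc_command` and `convert` are my own, and the flags are copied verbatim from the commands above (newer pandoc releases may no longer accept `--smart`, in which case it has to be dropped from the list):

```python
import subprocess


def pandoc_command(source, target):
    """Build the pandoc argument list used in this post for one output file."""
    return ['pandoc', '-V', 'geometry:a4paper', '-f', 'rst',
            '--toc', '--smart', '-o', target, source]


def convert(source, targets):
    """Run pandoc once per target file; raises if pandoc reports an error."""
    for target in targets:
        subprocess.check_call(pandoc_command(source, target))


# Show the command without running it.
print(' '.join(pandoc_command('test.rst', 'test.pdf')))

# To actually convert (requires pandoc on the PATH):
#     convert('test.rst', ['test.pdf', 'test.docx'])
```

Pandoc picks the output format from the file extension given to `-o`, so only the target name changes between the calls.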

Converting reStructuredText to other formats.

Sunday, 2 February 2014

SnpEff's result HTML file parsing

Recently, I had to run SnpEff several times. SnpEff is an application that annotates and predicts the effects of variants on genes. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results can be found below and the code can be found here.

Number of effects by type and region

Type

Type (alphabetical order)	Count	Percent
DOWNSTREAM	45,227	31.475%
INTERGENIC	37,384	26.016%
INTRON	6,307	4.389%
NON_SYNONYMOUS_CODING	6,800	4.732%
NON_SYNONYMOUS_START	1	0.001%
SPLICE_SITE_ACCEPTOR	18	0.013%
SPLICE_SITE_DONOR	31	0.022%
START_LOST	1	0.001%
STOP_GAINED	355	0.247%
STOP_LOST	295	0.205%
SYNONYMOUS_CODING	3,619	2.519%
SYNONYMOUS_STOP	63	0.044%
UPSTREAM	43,593	30.337%

Region

Type (alphabetical order)	Count	Percent
DOWNSTREAM	45,227	31.475%
EXON	11,134	7.748%
INTERGENIC	37,384	26.016%
INTRON	6,307	4.389%
SPLICE_SITE_ACCEPTOR	18	0.013%
SPLICE_SITE_DONOR	31	0.022%
UPSTREAM	43,593	30.337%

I could copy and paste just the table information that I am interested in, but I prefer to automate such things so that they can be reused in other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.

To parse SnpEff's HTML file I decided to use a Python library called beautifulsoup4. The code to parse the above HTML example can be found below and here.

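from bs4 import BeautifulSoup

# effect name -> (count, percent), both kept as strings
test = {}

with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

    # look inside every <a name="effects"> anchor
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        # skip the first three rows
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            # only rows with name, count and percent cells are of interest
            if len(tds) >= 3:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(),
                    str(tds[2].text).strip())

# the parentheses make this line work under Python 2 and Python 3
print("#Intron = " + str(test['INTRON'][0]))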

This is the output of the code above:

$ python test.py 
#Intron = 6,307
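
The parser keeps the counts and percentages as strings such as 6,307 and 4.389%, so they need one more step before they can be used in calculations. A minimal sketch of that step; the helpers `parse_count` and `parse_percent` are my own names, and the small `test` dictionary below is filled by hand with values from the table above instead of being read from the HTML file:

```python
def parse_count(text):
    """Turn a count with thousands separators, e.g. '45,227', into an int."""
    return int(text.replace(',', ''))


def parse_percent(text):
    """Turn a percentage string, e.g. '31.475%', into a float."""
    return float(text.rstrip('%'))


# Sample values copied from the table in this post,
# in the same shape the BeautifulSoup code produces.
test = {
    'INTRON': ('6,307', '4.389%'),
    'DOWNSTREAM': ('45,227', '31.475%'),
}

numeric = {name: (parse_count(count), parse_percent(percent))
           for name, (count, percent) in test.items()}

print(numeric['INTRON'])
```

With numbers instead of strings it becomes possible, for example, to sum the counts or to sort the effects by frequency.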