Friday, 19 September 2014

PyMongo almost 2x faster with PyPy

  1. Install PyPy and MongoDB with pacman under Arch Linux based distros:

    $ sudo pacman -Sy
    $ sudo pacman -S pypy mongodb
  2. Create a new virtualenv (see howto) and install PyMongo:

    $ mkvirtualenv testPyPy --python=/usr/bin/pypy
    $ pip install pymongo
  3. Start MongoDB:

    $ mkdir /home/mictadlo/databases/
    $ mongod --dbpath /home/mictadlo/databases/
  4. I found this PyMongo benchmark code here:

    import sys
    import time
    import random
    
    from datetime import datetime
    
    import pymongo
    
    min_date = datetime(2012, 1, 1)
    max_date = datetime(2013, 1, 1)
    delta = (max_date - min_date).total_seconds()
    
    job_id = '1'
    
    if len(sys.argv) < 2:
        sys.exit("You must supply the item_number argument")
    elif len(sys.argv) > 2:
        job_id = sys.argv[2]
    
    documents_number = int(sys.argv[1])
    batch_number = 5 * 1000
    
    job_name = 'Job#' + job_id
    start = datetime.now()
    
    # obtain a mongo connection
    connection = pymongo.Connection("mongodb://localhost", safe=True)
    
    # obtain a handle to the random database
    db = connection.random
    collection = db.randomData
    
    # preallocate the batch buffer
    batch_documents = [i for i in range(batch_number)]
    
    for index in range(documents_number):
        try:
            date = datetime.fromtimestamp(time.mktime(min_date.timetuple())
                                          + int(round(random.random() * delta)))
            value = random.random()
            document = {
                'created_on': date,
                'value': value,
            }
            batch_documents[index % batch_number] = document
            if (index + 1) % batch_number == 0:
                collection.insert(batch_documents)
            index += 1
            if index % 100000 == 0:
                print job_name, ' inserted ', index, ' documents.'
        except:
            print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
            raise
    print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
  5. As the results below show, PyPy is almost 2x faster than CPython in user CPU time, and about 1.5x faster in wall-clock time:

    • CPython:

      $ python --version
      Python 2.7.8
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  709.949361 s
      
      real    11m50.123s
      user    6m55.263s
      sys     0m48.803s
    • PyPy:

      $ python --version
      Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
      [PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  464.130798 s
      
      real    7m44.711s
      user    3m2.693s
      sys     0m41.667s
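Pulling the numbers together, the speedups implied by the timings above can be checked with a couple of lines of Python (the figures are copied verbatim from the runs above):

```python
# Speedups implied by the benchmark timings above.
cpython_wall, pypy_wall = 709.949361, 464.130798           # script-reported seconds
cpython_user, pypy_user = 6 * 60 + 55.263, 3 * 60 + 2.693  # `time` user seconds

print("wall-clock speedup: %.2fx" % (cpython_wall / pypy_wall))  # 1.53x
print("user-CPU speedup:   %.2fx" % (cpython_user / pypy_user))  # 2.27x
```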

Virtualenvwrapper makes it easier to use Python's virtualenv

  1. Install the following packages. I still need to use Python 2; if you want to use Python 3 instead, just remove the 2 from the package names below:
    $ sudo pacman -Sy
    $ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
  2. Introduce virtualenv in your .bashrc:
    $ nano ~/.bashrc
    export WORKON_HOME=$HOME/.virtualenvs
    source /usr/bin/virtualenvwrapper.sh
  3. Create a project e.g. myProject in virtualenv:
    $ mkvirtualenv myProject --python=/usr/bin/python2
    $ pip install sphinx
  4. Other virtualenv commands:
  • $ workon myProject        # activate the virtualenv called myProject
  • $ deactivate              # deactivate the current virtualenv
  • $ rmvirtualenv myProject  # delete the virtualenv called myProject
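After workon, it is worth confirming that the right interpreter is actually active. A quick sketch (the path in the comment assumes the WORKON_HOME set above):

```python
# Run inside the activated virtualenv to see which interpreter is in use.
import sys

print(sys.executable)       # expected: ~/.virtualenvs/myProject/bin/python
print(sys.version_info[0])  # 2 when created with --python=/usr/bin/python2
```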

How to install Pandoc on Arch Linux based distros

  1. Add the haskell-core repository above extra in /etc/pacman.conf:
    $ sudo nano /etc/pacman.conf
    [haskell-core]
    Server = http://xsounds.org/~haskell/core/$arch
  2. Remove alex, ghc, happy and any other Haskell packages installed from the extra repository from your system
  3. Add key and install Pandoc and Texlive (needed to generate PDF):
    $ sudo pacman-key -r 4209170B
    $ sudo pacman-key --lsign-key 4209170B
    $ sudo pacman -Scc
    $ sudo pacman -Syy
    $ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
    $ sudo pacman -S texlive-core 
  4. I am using Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Sphinx uses reStructuredText (rst) instead of Markdown (md). Here is a comparison between Markdown and reStructuredText, and the Try pandoc site offers conversion between different text syntax formats.
    Here is a little reStructuredText syntax:
    Title
    =====
    
    Heading 1
    ---------
    
    Heading 2
    `````````
    
    Heading 3
    '''''''''
    
    Heading 4
    .........
    
    Heading 5
    ~~~~~~~~~
    
    Heading 6
    *********
    
    Heading 7
    +++++++++
    
    Heading 8
    ^^^^^^^^^
    
    *Italic*
    
    **bold**
    ReStructuredText’s cheat sheet
  5. Converting reStructuredText to PDF and DOCX:
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
  6. Converting reStructuredText to other formats.

Sunday, 2 February 2014

SnpEff's HTML result file parsing

Recently, I had to use the SnpEff application, which annotates and predicts the effects of variants on genes, multiple times. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results can be found below, and the code can be found here.


Number of effects by type and region
Type (alphabetical order)   Count Percent
DOWNSTREAM   45,227 31.475%
INTERGENIC   37,384 26.016%
INTRON   6,307 4.389%
NON_SYNONYMOUS_CODING   6,800 4.732%
NON_SYNONYMOUS_START   1 0.001%
SPLICE_SITE_ACCEPTOR   18 0.013%
SPLICE_SITE_DONOR   31 0.022%
START_LOST   1 0.001%
STOP_GAINED   355 0.247%
STOP_LOST   295 0.205%
SYNONYMOUS_CODING   3,619 2.519%
SYNONYMOUS_STOP   63 0.044%
UPSTREAM   43,593 30.337%

Region (alphabetical order)   Count Percent
DOWNSTREAM   45,227 31.475%
EXON   11,134 7.748%
INTERGENIC   37,384 26.016%
INTRON   6,307 4.389%
SPLICE_SITE_ACCEPTOR   18 0.013%
SPLICE_SITE_DONOR   31 0.022%
UPSTREAM   43,593 30.337%


I could copy and paste just the table information that I am interested in, but I like to have things done automatically, as they can then be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, while XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.

For parsing SnpEff's HTML file I decided to use a Python library called beautifulsoup4; the code to parse the above HTML example can be found below and here.

from bs4 import BeautifulSoup

test = {}

with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

    # the effects tables sit inside the <a name="effects"> anchor
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            if len(tds) > 0:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(),
                                                  str(tds[2].text).strip())

print "#Intron = " + str(test['INTRON'][0])
 
And this is the output of the above code:

$ python test.py 
#Intron = 6,307
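Once the effect counts are in a dictionary, they are easy to dump to a tab-delimited file for downstream use. A small sketch with the standard csv module; the effects dictionary below just mimics the shape of the test dictionary built above:

```python
import csv

# effect type -> (count, percent), same shape as the parsed `test` dict
effects = {
    'INTRON': ('6,307', '4.389%'),
    'EXON': ('11,134', '7.748%'),
}

with open('effects.tsv', 'w') as out:
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(['type', 'count', 'percent'])
    for effect in sorted(effects):
        count, percent = effects[effect]
        writer.writerow([effect, count, percent])

print(open('effects.tsv').read())
```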




Saturday, 21 December 2013

Bioinformatics: Too many formats


In bioinformatics, many formats are not properly defined or implemented, which makes it difficult to use them across different tools. The cartoon below describes the current situation in the bioinformatics field. Many new file formats were born to solve the problems of previous formats or to introduce new features, but they were then no longer compatible with their predecessors (e.g. GFF2 vs. GFF3).
 
Source: http://xkcd.com/927/

A person or an organisation creating a new format often provides, at the same time, a library for it in one particular language, for example C, while other programmers work in different languages, e.g. Python. A wrapper therefore has to be written so that Python can use the C library. This has its problems: each time the C library is improved, the Python wrapper has to be updated as well, which delays access to the new improvements or bug fixes. It would be great if all bioinformatics libraries used the "Simplified Wrapper and Interface Generator" (SWIG, http://www.swig.org), a software development tool that connects programs written in C and C++ with a variety of high-level programming languages such as Python.

Sometimes it is possible to use comma- or tab-delimited file formats. This works, for example, with the FASTA format, but not with the GFF3 format, which requires you to define relationships between entities. The solution would be to use formats like the eXtensible Markup Language (XML, www.w3.org/TR/REC-xml), JavaScript Object Notation (JSON, http://json.org) or YAML Ain't Markup Language (YAML, http://www.yaml.org). Many programming languages provide parsers for these three formats.

Storing the same information in the smallest file size can be achieved with YAML and JSON; XML files are bigger, mainly because of XML's closing tags. XML and JSON also have binary formats: Efficient XML Interchange (EXI, www.w3.org/TR/exi/) and Binary JSON (BSON, http://bsonspec.org). EXI makes XML data up to a hundred times smaller, increasing processing speed and also the transmission speed of XML across existing networks (16 November 2013, http://www.agiledelta.com/product_efx.html). BSON, on the other hand, provides more efficient encoding/decoding than JSON, but the file size might be bigger than that of plain JSON.
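The size argument is easy to demonstrate with the standard library. A toy sketch, where the record and its field names are made up for illustration and the XML equivalent is written by hand:

```python
import json

# the same toy feature record as JSON and as hand-written equivalent XML
record = {"seqid": "chr1", "start": 1000, "end": 2000, "strand": "+"}

as_json = json.dumps(record)
as_xml = ('<feature><seqid>chr1</seqid><start>1000</start>'
          '<end>2000</end><strand>+</strand></feature>')

print(len(as_json), len(as_xml))  # the closing tags make the XML version longer
```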

Saturday, 16 November 2013

LiteIDE, a new IDE for the Go programming language

Google created an open-source programming language called Go, which tries to combine the performance of a compiled language like C with the development speed of a language like Python.

In my opinion it is also important to have a smart IDE which provides help with coding. One of the biggest helps, I think, is code completion. If you are new to a programming language, it helps you understand other libraries quicker by suggesting function names and arguments.

Google provides Eclipse plugins for Android development (ADT) and for their web app programming language Dart. To my surprise, they do not provide any Eclipse plugin for Go. However, I found one Eclipse plugin called Goclipse, but it looks to me like it could use more love, because the last release was in May 2011.

Recently, I found an IDE for Go called LiteIDE, which is open source and cross-platform. It provides code completion for Go, which works very nicely, as the picture below shows. A full feature list of LiteIDE can be found here.

On my laptop I use Sabayon Linux, which has a package for Go; however, it is not the latest version. Below are step-by-step instructions on how to install Go together with LiteIDE.

Step 1: Install Go
$ cd Downloads
$ wget -c http://go.googlecode.com/files/go1.1.2.linux-amd64.tar.gz
$ tar xfvz go1.1.2.linux-amd64.tar.gz
$ mkdir -p ~/apps/go_packages
$ mv go ~/apps

Step 2: Install LiteIDE
$ cd Downloads
$ git clone https://github.com/visualfc/liteide
$ cd liteide/build/
$ export QTDIR=/usr/lib/qt4
$ sh build_linux.sh
$ mv liteide/ $HOME/apps

Step 3: Set up environment variables in .bashrc
export APPS=/home/uqmlore1/apps
export GOROOT=$APPS/go
export GOPATH=$APPS/go_packages
export GOOS=linux
export GOARCH=amd64
export LITEIDE=$APPS/liteide/bin
export PATH=$GOROOT/bin:$GOPATH/bin:$LITEIDE:$PATH

Step 4: Test Go and LiteIDE
$ . ~/.bashrc
$ go version
go version go1.1.2 linux/amd64

$ liteide
Have fun!