Friday, 19 September 2014

PyMongo almost 2x faster with PyPy

  1. Install PyPy and MongoDB with pacman under Arch Linux based distros:

    $ sudo pacman -Sy
    $ sudo pacman -S pypy mongodb
  2. Create a new virtualenv (see howto) and install PyMongo:

    $ mkvirtualenv testPyPy --python=/usr/bin/pypy
    $ pip install pymongo
  3. Start MongoDB:

    $ mkdir /home/mictadlo/databases/
    $ mongod --dbpath /home/mictadlo/databases/
  4. I found this PyMongo benchmark code here:

    import sys
    import time
    import random
    
    from datetime import datetime
    
    import pymongo
    
    min_date = datetime(2012, 1, 1)
    max_date = datetime(2013, 1, 1)
    delta = (max_date - min_date).total_seconds()
    
    job_id = '1'
    
    if len(sys.argv) < 2:
        sys.exit("You must supply the item_number argument")
    elif len(sys.argv) > 2:
        job_id = sys.argv[2]
    
    documents_number = int(sys.argv[1])
    batch_number = 5 * 1000
    
    job_name = 'Job#' + job_id
    start = datetime.now()
    
    # obtain a mongo connection
    connection = pymongo.Connection("mongodb://localhost", safe=True)
    
    # obtain a handle to the random database
    db = connection.random
    collection = db.randomData
    
    # preallocate the batch buffer
    batch_documents = [i for i in range(batch_number)]
    
    for index in range(documents_number):
        try:
            date = datetime.fromtimestamp(time.mktime(min_date.timetuple())
                                          + int(round(random.random() * delta)))
            value = random.random()
            document = {
                'created_on': date,
                'value': value,
            }
            batch_documents[index % batch_number] = document
            if (index + 1) % batch_number == 0:
                collection.insert(batch_documents)
            index += 1
            if index % 100000 == 0:
                print job_name, ' inserted ', index, ' documents.'
        except:
            print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
            raise
    print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
  5. As the results below show, PyPy is almost 2x faster than CPython in user CPU time, and about 1.5x faster in wall-clock time:

    • CPython:

      $ python --version
      Python 2.7.8
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  709.949361 s
      
      real    11m50.123s
      user    6m55.263s
      sys     0m48.803s
    • PyPy:

      $ python --version
      Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
      [PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  464.130798 s
      
      real    7m44.711s
      user    3m2.693s
      sys     0m41.667s
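Pulling the numbers together, the speedups implied by the timings above can be checked with a couple of lines of Python (the figures are copied verbatim from the runs above):

```python
# Speedups implied by the benchmark timings above.
cpython_wall, pypy_wall = 709.949361, 464.130798           # script-reported seconds
cpython_user, pypy_user = 6 * 60 + 55.263, 3 * 60 + 2.693  # `time` user seconds

print("wall-clock speedup: %.2fx" % (cpython_wall / pypy_wall))  # 1.53x
print("user-CPU speedup:   %.2fx" % (cpython_user / pypy_user))  # 2.27x
```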

Virtualenvwrapper makes it easier to use Python's virtualenv

  1. Install the following packages. I still need to use Python 2; if you want to use Python 3 instead, just remove the 2 from the package names below:
    $ sudo pacman -Sy
    $ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
  2. Introduce virtualenv in your .bashrc:
    $ nano ~/.bashrc
    export WORKON_HOME=$HOME/.virtualenvs
    source /usr/bin/virtualenvwrapper.sh
  3. Create a project e.g. myProject in virtualenv:
    $ mkvirtualenv myProject --python=/usr/bin/python2
    $ pip install sphinx
  4. Other virtualenv commands:
  • $ workon myProject        # activate the virtualenv called myProject
  • $ deactivate              # deactivate the current virtualenv
  • $ rmvirtualenv myProject  # delete the virtualenv called myProject
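After workon, it is worth confirming that the right interpreter is actually active. A quick sketch (the path in the comment assumes the WORKON_HOME set above):

```python
# Run inside the activated virtualenv to see which interpreter is in use.
import sys

print(sys.executable)       # expected: ~/.virtualenvs/myProject/bin/python
print(sys.version_info[0])  # 2 when created with --python=/usr/bin/python2
```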

How to install Pandoc on Arch Linux based distros

  1. Add the haskell-core repository above extra in /etc/pacman.conf:
    $ sudo nano /etc/pacman.conf
    [haskell-core]
    Server = http://xsounds.org/~haskell/core/$arch
  2. Remove alex, ghc, happy and any other Haskell packages installed from the extra repository from your system
  3. Add key and install Pandoc and Texlive (needed to generate PDF):
    $ sudo pacman-key -r 4209170B
    $ sudo pacman-key --lsign-key 4209170B
    $ sudo pacman -Scc
    $ sudo pacman -Syy
    $ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
    $ sudo pacman -S texlive-core 
  4. I am using Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Sphinx uses reStructuredText (rst) instead of Markdown (md). Here is a comparison between Markdown and reStructuredText, and the Try pandoc site offers conversion between different text syntax formats.
    Here is a little reStructuredText syntax:
    Title
    =====
    
    Heading 1
    ---------
    
    Heading 2
    `````````
    
    Heading 3
    '''''''''
    
    Heading 4
    .........
    
    Heading 5
    ~~~~~~~~~
    
    Heading 6
    *********
    
    Heading 7
    +++++++++
    
    Heading 8
    ^^^^^^^^^
    
    *Italic*
    
    **bold**
    ReStructuredText’s cheat sheet
  5. Converting reStructuredText to PDF and DOCX:
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
  6. Converting reStructuredText to other formats.

Sunday, 2 February 2014

SnpEff's HTML result file parsing

Recently, I had to use the SnpEff application, which annotates and predicts the effects of variants on genes, multiple times. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results can be found below, and the code can be found here.


Number of effects by type and region
Type (alphabetical order)   Count Percent
DOWNSTREAM   45,227 31.475%
INTERGENIC   37,384 26.016%
INTRON   6,307 4.389%
NON_SYNONYMOUS_CODING   6,800 4.732%
NON_SYNONYMOUS_START   1 0.001%
SPLICE_SITE_ACCEPTOR   18 0.013%
SPLICE_SITE_DONOR   31 0.022%
START_LOST   1 0.001%
STOP_GAINED   355 0.247%
STOP_LOST   295 0.205%
SYNONYMOUS_CODING   3,619 2.519%
SYNONYMOUS_STOP   63 0.044%
UPSTREAM   43,593 30.337%

Region (alphabetical order)   Count Percent
DOWNSTREAM   45,227 31.475%
EXON   11,134 7.748%
INTERGENIC   37,384 26.016%
INTRON   6,307 4.389%
SPLICE_SITE_ACCEPTOR   18 0.013%
SPLICE_SITE_DONOR   31 0.022%
UPSTREAM   43,593 30.337%


I could copy and paste just the table information that I am interested in, but I like to have things done automatically, as they can then be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, while XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.

For parsing SnpEff's HTML file I decided to use a Python library called beautifulsoup4; the code to parse the above HTML example can be found below and here.

from bs4 import BeautifulSoup

test = {}

with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

    # the effects tables sit inside the <a name="effects"> anchor
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            if len(tds) > 0:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(),
                                                  str(tds[2].text).strip())

print "#Intron = " + str(test['INTRON'][0])
 
And this is the output of the above code:

$ python test.py 
#Intron = 6,307
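Once the effect counts are in a dictionary, they are easy to dump to a tab-delimited file for downstream use. A small sketch with the standard csv module; the effects dictionary below just mimics the shape of the test dictionary built above:

```python
import csv

# effect type -> (count, percent), same shape as the parsed `test` dict
effects = {
    'INTRON': ('6,307', '4.389%'),
    'EXON': ('11,134', '7.748%'),
}

with open('effects.tsv', 'w') as out:
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(['type', 'count', 'percent'])
    for effect in sorted(effects):
        count, percent = effects[effect]
        writer.writerow([effect, count, percent])

print(open('effects.tsv').read())
```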




Saturday, 21 December 2013

Bioinformatics: Too many formats


In bioinformatics, many formats are not properly defined or implemented, which makes it difficult to use them across different tools. The cartoon below describes the current situation in the bioinformatics field. Many new file formats were born to solve the problems of previous formats or to introduce new features, but they were then no longer compatible with their predecessors (e.g. GFF2 vs. GFF3).
 
Source: http://xkcd.com/927/

A person or an organisation creating a new format often provides, at the same time, a library for it in one particular language, for example C, while other programmers work in different languages, e.g. Python. A wrapper therefore has to be written so that Python can use the C library. This has its problems: each time the C library is improved, the Python wrapper has to be updated as well, which delays access to the new improvements or bug fixes. It would be great if all bioinformatics libraries used the "Simplified Wrapper and Interface Generator" (SWIG, http://www.swig.org), a software development tool that connects programs written in C and C++ with a variety of high-level programming languages such as Python.

Sometimes it is possible to use comma- or tab-delimited file formats. This works, for example, with the FASTA format, but not with the GFF3 format, which requires you to define relationships between entities. The solution would be to use formats like the eXtensible Markup Language (XML, www.w3.org/TR/REC-xml), JavaScript Object Notation (JSON, http://json.org) or YAML Ain't Markup Language (YAML, http://www.yaml.org). Many programming languages provide parsers for these three formats.

Storing the same information in the smallest file size can be achieved with YAML and JSON; XML files are bigger, mainly because of XML's closing tags. XML and JSON also have binary formats: Efficient XML Interchange (EXI, www.w3.org/TR/exi/) and Binary JSON (BSON, http://bsonspec.org). EXI makes XML data up to a hundred times smaller, increasing processing speed and also the transmission speed of XML across existing networks (16 November 2013, http://www.agiledelta.com/product_efx.html). BSON, on the other hand, provides more efficient encoding/decoding than JSON, but the file size might be bigger than that of plain JSON.
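The size argument is easy to demonstrate with the standard library. A toy sketch, where the record and its field names are made up for illustration and the XML equivalent is written by hand:

```python
import json

# the same toy feature record as JSON and as hand-written equivalent XML
record = {"seqid": "chr1", "start": 1000, "end": 2000, "strand": "+"}

as_json = json.dumps(record)
as_xml = ('<feature><seqid>chr1</seqid><start>1000</start>'
          '<end>2000</end><strand>+</strand></feature>')

print(len(as_json), len(as_xml))  # the closing tags make the XML version longer
```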

Saturday, 16 November 2013

LiteIDE, a new IDE for the Go programming language

Google created an open-source programming language called Go, which tries to combine the performance of a compiled language like C with the development speed of a language like Python.

In my opinion it is also important to have a smart IDE which provides help with coding. One of the biggest helps, I think, is code completion. If you are new to a programming language, it helps you understand other libraries quicker by suggesting function names and arguments.

Google provides Eclipse plugins for Android development (ADT) and for their web app programming language Dart. To my surprise, they do not provide any Eclipse plugin for Go. However, I found one Eclipse plugin called Goclipse, but it looks to me like it could use more love, because the last release was in May 2011.

Recently, I found an IDE for Go called LiteIDE, which is open source and cross-platform. It provides code completion for Go, which works very nicely, as the picture below shows. A full feature list of LiteIDE can be found here.

On my laptop I use Sabayon Linux, which has a package for Go; however, it is not the latest version. Below are step-by-step instructions on how to install Go together with LiteIDE.

Step 1: Install Go
$ cd Downloads
$ wget -c http://go.googlecode.com/files/go1.1.2.linux-amd64.tar.gz
$ tar xfvz go1.1.2.linux-amd64.tar.gz
$ mkdir -p ~/apps/go_packages
$ mv go ~/apps

Step 2: Install LiteIDE
$ cd Downloads
$ git clone https://github.com/visualfc/liteide
$ cd liteide/build/
$ export QTDIR=/usr/lib/qt4
$ sh build_linux.sh
$ mv liteide/ $HOME/apps

Step 3: Set up environment variables in .bashrc
export APPS=/home/uqmlore1/apps
export GOROOT=$APPS/go
export GOPATH=$APPS/go_packages
export GOOS=linux
export GOARCH=amd64
export LITEIDE=$APPS/liteide/bin
export PATH=$GOROOT/bin:$GOPATH/bin:$LITEIDE:$PATH

Step 4: Test Go and LiteIDE
$ . ~/.bashrc
$ go version
go version go1.1.2 linux/amd64

$ liteide
Have fun!