Friday, 19 September 2014

PyMongo almost 2x faster with PyPy

  1. Install PyPy and MongoDB with pacman under Arch Linux based distros:

    $ sudo pacman -Sy
    $ sudo pacman -S pypy mongodb
  2. Create a new virtualenv (see the virtualenvwrapper how-to below) and install PyMongo:

    $ mkvirtualenv testPyPy --python=/usr/bin/pypy
    $ pip install pymongo
  3. Start MongoDB:

    $ mkdir /home/mictadlo/databases/
    $ mongod --dbpath /home/mictadlo/databases/
  4. I found a PyMongo benchmark script and saved it as test_pymongo.py:

    import sys
    import pymongo
    import time
    import random
    
    from datetime import datetime
    
    min_date = datetime(2012, 1, 1)
    max_date = datetime(2013, 1, 1)
    delta = (max_date - min_date).total_seconds()
    
    job_id = '1'
    
    if len(sys.argv) < 2:
        sys.exit("You must supply the item_number argument")
    elif len(sys.argv) > 2:
        job_id = sys.argv[2]   
    
    documents_number = int(sys.argv[1])
    batch_number = 5 * 1000
    
    job_name = 'Job#' + job_id
    start = datetime.now()
    
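    # obtain a mongo connection (MongoClient supersedes the deprecated
    # pymongo.Connection and acknowledges writes by default, like safe=True)
    connection = pymongo.MongoClient("mongodb://localhost")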
    
    # obtain a handle to the random database
    db = connection.random
    collection = db.randomData
    
    # preallocated buffer that holds one batch of documents
    batch_documents = [None] * batch_number
    
    for index in range(documents_number):
        try:
            date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
            value = random.random()
            document = {
                'created_on': date,
                'value': value,
            }
            batch_documents[index % batch_number] = document
            # flush a full batch with a single bulk insert
            if (index + 1) % batch_number == 0:
                collection.insert(batch_documents)
            inserted = index + 1
            if inserted % 100000 == 0:
                print job_name, ' inserted ', inserted, ' documents.'
        except:
            print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
            raise
    # insert the last, partial batch when documents_number is not a
    # multiple of batch_number
    remainder = documents_number % batch_number
    if remainder:
        collection.insert(batch_documents[:remainder])
    print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
  5. PyPy is about 1.5x faster than CPython in wall-clock time (710 s vs. 464 s) and uses less than half the user CPU time, as the results below show:

    • CPython:

      $ python --version
      Python 2.7.8
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  709.949361 s
      
      real    11m50.123s
      user    6m55.263s
      sys     0m48.803s
    • PyPy:

      $ python --version
      Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
      [PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
      
      $ time python test_pymongo.py 12000000
      Job#1  inserted  12000000  in  464.130798 s
      
      real    7m44.711s
      user    3m2.693s
      sys     0m41.667s
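
The speedup can be checked directly from the timings above. This is a minimal sketch with the numbers copied from the two runs; the helper name `to_seconds` is only illustrative, and the code is written to run on both Python 2.7 and Python 3.

```python
# Timings copied from the two benchmark runs above.
def to_seconds(minutes, seconds):
    """Convert a `time`-style XmY.YYYs reading to seconds."""
    return minutes * 60 + seconds


cpython = {"script": 709.949361, "real": to_seconds(11, 50.123), "user": to_seconds(6, 55.263)}
pypy = {"script": 464.130798, "real": to_seconds(7, 44.711), "user": to_seconds(3, 2.693)}

# speedup = CPython time / PyPy time, per measurement
speedup = dict((key, cpython[key] / pypy[key]) for key in cpython)

for key in ("script", "real", "user"):
    print("%-6s %.2fx" % (key, speedup[key]))
```

With these numbers the wall-clock speedup comes out at about 1.5x, while the user CPU time speedup is about 2.3x. A large part of the wall-clock time is presumably spent waiting for MongoDB rather than running Python code.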

Virtualenvwrapper makes it easier to use Python's virtualenv

  1. Install the following packages. I still need Python 2; if you want Python 3, drop the 2 from the package names below:
    $ sudo pacman -Sy
    $ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
  2. Add virtualenvwrapper to your .bashrc:
    $ nano ~/.bashrc
    export WORKON_HOME=$HOME/.virtualenvs
    source /usr/bin/virtualenvwrapper.sh
  3. Create a virtualenv for a project, e.g. myProject, and install a package into it:
    $ mkvirtualenv myProject --python=/usr/bin/python2
    $ pip install sphinx
  4. Other virtualenv commands:
  • $ workon myProject        # activate the virtualenv called myProject
  • $ deactivate              # deactivate the current virtualenv
  • $ rmvirtualenv myProject  # delete the virtualenv called myProject
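
After `workon`, it can be handy to confirm from inside Python which interpreter is actually active. The following is a small sketch, not part of virtualenvwrapper itself; the function names `in_virtualenv` and `describe_interpreter` are made up for this example, and it only uses the standard library.

```python
import platform
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer virtualenv
    # releases and the stdlib venv module set sys.base_prefix instead.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base_prefix != sys.prefix


def describe_interpreter():
    """Collect the facts that tell CPython and PyPy environments apart."""
    return {
        "implementation": platform.python_implementation(),  # e.g. CPython or PyPy
        "version": platform.python_version(),
        "executable": sys.executable,
        "in_virtualenv": in_virtualenv(),
    }


info = describe_interpreter()
for key in sorted(info):
    print("%s: %s" % (key, info[key]))
```

Run it once after `workon myProject` and once after `deactivate` to see the `executable` path and the `in_virtualenv` flag change.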

How to install Pandoc on Arch Linux based distros

  1. Add the haskell-core repository above extra in /etc/pacman.conf:
    $ sudo nano /etc/pacman.conf
    [haskell-core]
    Server = http://xsounds.org/~haskell/core/$arch
  2. Remove alex, ghc, happy and any other Haskell packages that were installed from the extra repository.
  3. Add the repository key and install Pandoc and TeX Live (needed to generate PDFs):
    $ sudo pacman-key -r 4209170B
    $ sudo pacman-key --lsign-key 4209170B
    $ sudo pacman -Scc
    $ sudo pacman -Syy
    $ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
    $ sudo pacman -S texlive-core 
  4. I use Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Like Sphinx, I write in reStructuredText (rst) instead of Markdown (md). Comparisons between Markdown and reStructuredText are available online, and the Try pandoc site converts between different text markup formats.
    Here is a little reStructuredText syntax:
    Title
    =====
    
    Heading 1
    ---------
    
    Heading 2
    `````````
    
    Heading 3
    '''''''''
    
    Heading 4
    .........
    
    Heading 5
    ~~~~~~~~~
    
    Heading 6
    *********
    
    Heading 7
    +++++++++
    
    Heading 8
    ^^^^^^^^^
    
    *Italic*
    
    **bold**
    See also the reStructuredText cheat sheet.
  5. Converting reStructuredText to PDF and DOCX:
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
    $ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
  6. Converting reStructuredText to other formats.
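
When several output formats are needed, the pandoc calls from step 5 can be generated instead of typed by hand. This is only a sketch: `pandoc_command` and `conversion_commands` are hypothetical helper names, the flags are the ones used above, and nothing is executed here. As far as I know, newer pandoc releases no longer accept `--smart`, hence the switch to leave it out.

```python
def pandoc_command(source, target, smart=True):
    """Build the argument list for one rst -> target conversion."""
    command = ["pandoc", "-V", "geometry:a4paper", "-f", "rst", "--toc"]
    if smart:
        # --smart is the option used in the post; newer pandoc releases
        # dropped it, so pass smart=False there.
        command.append("--smart")
    command.extend(["-o", target, source])
    return command


def conversion_commands(source, formats=("pdf", "docx")):
    """One pandoc command per output format, named after the source file."""
    stem = source.rsplit(".", 1)[0]
    return [pandoc_command(source, "%s.%s" % (stem, fmt)) for fmt in formats]


for command in conversion_commands("test.rst"):
    print(" ".join(command))
```

Each returned list can be handed to `subprocess.check_call` to run the conversion, which avoids shell quoting problems with file names that contain spaces.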

Sunday, 2 February 2014

Parsing SnpEff's result HTML file

Recently, I had to run SnpEff several times. SnpEff is an application that annotates and predicts the effects of variants on genes. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results is shown below.


Number of effects by type and region

By type:

Type (alphabetical order)   Count   Percent
DOWNSTREAM                 45,227   31.475%
INTERGENIC                 37,384   26.016%
INTRON                      6,307    4.389%
NON_SYNONYMOUS_CODING       6,800    4.732%
NON_SYNONYMOUS_START            1    0.001%
SPLICE_SITE_ACCEPTOR           18    0.013%
SPLICE_SITE_DONOR              31    0.022%
START_LOST                      1    0.001%
STOP_GAINED                   355    0.247%
STOP_LOST                     295    0.205%
SYNONYMOUS_CODING           3,619    2.519%
SYNONYMOUS_STOP                63    0.044%
UPSTREAM                   43,593   30.337%

By region:

Type (alphabetical order)   Count   Percent
DOWNSTREAM                 45,227   31.475%
EXON                       11,134    7.748%
INTERGENIC                 37,384   26.016%
INTRON                      6,307    4.389%
SPLICE_SITE_ACCEPTOR           18    0.013%
SPLICE_SITE_DONOR              31    0.022%
UPSTREAM                   43,593   30.337%


I could copy and paste just the table information that I am interested in, but I like to have things done automatically, so that they can be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe information that can be shared between applications.

To parse SnpEff's HTML file I decided to use a Python library called beautifulsoup4. The code to parse the above HTML example can be found below.

from bs4 import BeautifulSoup

# effect name -> (count, percent), both kept as the strings found in the HTML
test = {}

with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

    # the tables of interest sit inside the anchor named "effects"
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        # skip the leading header rows
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            if len(tds) > 0:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(),
                    str(tds[2].text).strip())

print("#Intron = " + str(test['INTRON'][0]))
 
And this is the output from the above code:

$ python test.py 
#Intron = 6,307
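
The parsed values are still strings such as '6,307' and '4.389%'. To sort or sum them they have to be converted to numbers first. Below is a minimal sketch of that step, assuming the dictionary has the (count, percent) shape built by the parsing code; `to_count` and `to_percent` are hypothetical helper names, and the sample entries are copied from the table above so that the example runs without the HTML file.

```python
def to_count(text):
    """'6,307' -> 6307"""
    return int(text.replace(",", ""))


def to_percent(text):
    """'4.389%' -> 4.389"""
    return float(text.strip().rstrip("%"))


# Sample entries copied from the table above, in the (count, percent)
# shape that the parsing code stores in `test`.
test = {
    "INTRON": ("6,307", "4.389%"),
    "DOWNSTREAM": ("45,227", "31.475%"),
    "STOP_GAINED": ("355", "0.247%"),
}

effects = dict((name, (to_count(count), to_percent(percent)))
               for name, (count, percent) in test.items())

# list the effects from most to least frequent
for name in sorted(effects, key=lambda n: effects[n][0], reverse=True):
    print("%-12s %7d %7.3f%%" % (name, effects[name][0], effects[name][1]))
```

Keeping the conversion in two small functions makes it easy to reuse for other SnpEff tables that follow the same count and percent layout.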