-
Install PyPy and MongoDB with pacman under Arch Linux based distros:
$ sudo pacman -Sy
$ sudo pacman -S pypy mongodb
-
Create a new virtualenv (see howto) and install PyMongo:
$ mkvirtualenv testPyPy --python=/usr/bin/pypy
$ pip install pymongo
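To confirm that the new virtualenv really runs PyPy and not CPython, a small check like the following can be run with the virtualenv's `python`. This is only a sketch; it relies on the standard-library `platform` module and prints whatever interpreter happens to be active.

```python
import platform
import sys

# Name of the running interpreter, e.g. "CPython" or "PyPy".
implementation = platform.python_implementation()

# Version of the Python language the interpreter implements.
version = platform.python_version()

print(implementation, version)
print(sys.executable)  # path of the interpreter binary in use
```

Inside the testPyPy virtualenv the first line should start with PyPy.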
-
Start MongoDB:
$ mkdir /home/mictadlo/databases/
$ mongod --dbpath /home/mictadlo/databases/
-
I found a PyMongo benchmark script here:
import sys
import os
import pymongo
import time
import random
from datetime import datetime

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

job_id = '1'

if len(sys.argv) < 2:
    sys.exit("You must supply the item_number argument")
elif len(sys.argv) > 2:
    job_id = sys.argv[2]

documents_number = int(sys.argv[1])
batch_number = 5 * 1000

job_name = 'Job#' + job_id
start = datetime.now()

# obtain a mongo connection
connection = pymongo.Connection("mongodb://localhost", safe=True)

# obtain a handle to the random database
db = connection.random
collection = db.randomData

batch_documents = [i for i in range(batch_number)]

for index in range(documents_number):
    try:
        date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
        value = random.random()
        document = {
            'created_on' : date,
            'value' : value,
        }
        batch_documents[index % batch_number] = document
        if (index + 1) % batch_number == 0:
            collection.insert(batch_documents)
        index += 1
        if index % 100000 == 0:
            print job_name, ' inserted ', index, ' documents.'
    except:
        print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
        raise

print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
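The script above only inserts when a batch of 5000 documents is full, so a document count that is not a multiple of 5000 would leave the last documents uninserted. The following is a minimal sketch of the same batching idea that also flushes the remainder. The function name `insert_in_batches` and the `insert_batch` callable are my own placeholders, not part of PyMongo; with a recent PyMongo release, something like `collection.insert_many` could probably be passed in its place.

```python
def insert_in_batches(documents, insert_batch, batch_size=5000):
    """Hand documents to insert_batch in lists of at most batch_size items.

    Returns the number of documents handed over.
    """
    batch = []
    total = 0
    for document in documents:
        batch.append(document)
        if len(batch) == batch_size:
            insert_batch(batch)
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        insert_batch(batch)
        total += len(batch)
    return total


# Stand-in sink (a plain list) so the sketch runs without a MongoDB server.
received = []
count = insert_in_batches(
    ({"value": i} for i in range(12)), received.append, batch_size=5
)
print(count, [len(b) for b in received])
```

With 12 documents and a batch size of 5 the sink receives batches of 5, 5 and 2 documents.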
-
PyPy finished the run about 1.5x faster than CPython (464 s vs. 710 s) and needed less than half of the user CPU time, as the results below show:
-
CPython:
$ python --version
Python 2.7.8
$ time python test_pymongo.py 12000000
Job#1 inserted 12000000 in 709.949361 s

real 11m50.123s
user 6m55.263s
sys 0m48.803s
-
PyPy:
$ python --version
Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
[PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
$ time python test_pymongo.py 12000000
Job#1 inserted 12000000 in 464.130798 s

real 7m44.711s
user 3m2.693s
sys 0m41.667s
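The speedup can be worked out directly from the two runs. This small sketch only uses the timings printed above; the dictionary layout is my own choice.

```python
# Timings copied from the two runs reported in this post, in seconds.
cpython = {"script": 709.949361, "user": 6 * 60 + 55.263}
pypy = {"script": 464.130798, "user": 3 * 60 + 2.693}

# Ratio > 1 means PyPy was faster than CPython for that measurement.
speedup = {key: cpython[key] / pypy[key] for key in cpython}

for key in ("script", "user"):
    print("%s: %.2fx" % (key, speedup[key]))
```

The "script" ratio is the time the benchmark measured itself, and the "user" ratio is the user CPU time reported by `time`. The rest of the wall-clock time is largely spent waiting for MongoDB, which PyPy cannot speed up.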
-
Friday, 19 September 2014
PyMongo almost 2x faster with PyPy
Virtualenvwrapper makes it easier to use Python's virtualenv
-
Install the following packages. I still need to use Python 2. If you want to use Python 3, just remove the 2 from the package names below:
$ sudo pacman -Sy
$ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
-
Add virtualenvwrapper to your .bashrc:
$ nano ~/.bashrc
export WORKON_HOME=$HOME/.virtualenvs
source /usr/bin/virtualenvwrapper.sh
-
Create a project e.g. myProject in virtualenv:
$ mkvirtualenv myProject --python=/usr/bin/python2
$ pip install sphinx
- Other virtualenv commands:
-
$ workon myProject # activate the virtualenv called myProject
-
$ deactivate # deactivate the current virtualenv
-
$ rmvirtualenv myProject # delete the virtualenv called myProject (deactivate it first)
How to install Pandoc on Arch Linux based distros
-
Add the haskell-core repository above extra in /etc/pacman.conf:
$ sudo nano /etc/pacman.conf
[haskell-core]
Server = http://xsounds.org/~haskell/core/$arch
- Remove alex, ghc, happy and any other Haskell packages that were installed from the extra repository.
-
Add key and install Pandoc and Texlive (needed to generate PDF):
$ sudo pacman-key -r 4209170B
$ sudo pacman-key --lsign-key 4209170B
$ sudo pacman -Scc
$ sudo pacman -Syy
$ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
$ sudo pacman -S texlive-core
-
I am using Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Sphinx uses reStructuredText (rst) instead of markdown (md), and so do I. Here is a comparison between markdown and reStructuredText. The Try pandoc site offers conversion between different text markup formats.
Here is a little reStructuredText syntax:
ReStructuredText’s cheat sheet
Title
=====
Heading 1
---------
Heading 2
`````````
Heading 3
'''''''''
Heading 4
.........
Heading 5
~~~~~~~~~
Heading 6
*********
Heading 7
+++++++++
Heading 8
^^^^^^^^^
*Italic* **bold**
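In reStructuredText the underline of a heading has to be at least as long as the heading text. If headings are generated from a script, a tiny helper keeps the two in sync. The name `rst_heading` is just my placeholder for this sketch.

```python
def rst_heading(title, underline_char="="):
    """Return an rst heading: the title followed by a matching underline."""
    return "%s\n%s" % (title, underline_char * len(title))


print(rst_heading("Title"))
print(rst_heading("Heading 1", "-"))
```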
-
Converting reStructuredText to PDF and DOCX:
$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
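When the same file has to be converted to several formats, the two commands above can be built from a script. This is a sketch under a few assumptions: the helper name `pandoc_command` is made up, the flags are the ones used above, and `smart` is optional because, as far as I know, newer pandoc releases no longer accept `--smart`. The sketch only prints the commands; running them (for example with `subprocess.check_call`) requires pandoc to be installed.

```python
def pandoc_command(source, target, smart=True):
    """Build the pandoc argument list used in this post for one output file."""
    command = ["pandoc", "-V", "geometry:a4paper", "-f", "rst", "--toc"]
    if smart:
        # Flag from the original commands; may need dropping on newer pandoc.
        command.append("--smart")
    command += ["-o", target, source]
    return command


commands = [pandoc_command("test.rst", target) for target in ("test.pdf", "test.docx")]
for command in commands:
    print(" ".join(command))
```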
- Converting reStructuredText to other formats.
Sunday, 2 February 2014
SnpEff's result HTML file parsing
Recently, I had to use the SnpEff application several times. It
annotates and predicts the effects of variants on genes. One of
SnpEff's result files is an HTML file. A snippet of SnpEff's results
can be found below and the code can be found here.
Number of effects by type and region
I could copy and paste just the table information that I am interested in, but I like to have things done automatically, so that they can be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.
[Table: columns Type and Region; cell contents not recoverable]
To parse SnpEff's HTML file I decided to use a Python library called beautifulsoup4. The code to parse the above HTML example can be found below and here.
from bs4 import BeautifulSoup

test = {}

with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            if len(tds) > 0:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(), str(tds[2].text).strip())

print "#Intron = " + str(test['INTRON'][0])
And this is the output from the above code:
$ python test.py
#Intron = 6,307
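The count comes back as the string 6,307, so it needs a small conversion before it can be used in calculations. The sketch below shows the same row-by-row idea without beautifulsoup4, using only the standard-library `html.parser` on a made-up table. The HTML snippet, its column layout, and the names `RowCollector` and `parse_effects` are all my assumptions for illustration (only the INTRON count is taken from the output above); it is not the real SnpEff report.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for the SnpEff report table.
SAMPLE_HTML = """
<a name="effects">
<table>
  <tr><th>Type</th><th>Count</th><th>Percent</th></tr>
  <tr><td>INTRON</td><td>6,307</td><td>made-up %</td></tr>
  <tr><td>EXON</td><td>1,024</td><td>made-up %</td></tr>
</table>
</a>
"""


class RowCollector(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only contain <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_effects(html):
    """Map effect type -> (count as int, raw text of the third column)."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    return {
        row[0]: (int(row[1].replace(",", "")), row[2])
        for row in collector.rows
        if len(row) >= 3
    }


effects = parse_effects(SAMPLE_HTML)
print("#Intron =", effects["INTRON"][0])
```

For real reports beautifulsoup4 is the more forgiving choice, because it copes better with messy HTML. The `int(text.replace(",", ""))` conversion works the same way on the strings it returns.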