- Install PyPy and MongoDB with pacman under Arch Linux based distros:
$ sudo pacman -Sy
$ sudo pacman -S pypy mongodb
- Create a new virtualenv (see howto) and install PyMongo:
$ mkvirtualenv testPyPy --python=/usr/bin/pypy
$ pip install pymongo
- Start MongoDB:
$ mkdir /home/mictadlo/databases/
$ mongod --dbpath /home/mictadlo/databases/
- I found a PyMongo benchmark script here:
import sys
import os
import pymongo
import time
import random
from datetime import datetime

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

job_id = '1'
if len(sys.argv) < 2:
    sys.exit("You must supply the item_number argument")
elif len(sys.argv) > 2:
    job_id = sys.argv[2]

documents_number = int(sys.argv[1])
batch_number = 5 * 1000
job_name = 'Job#' + job_id
start = datetime.now()

# obtain a mongo connection
connection = pymongo.Connection("mongodb://localhost", safe=True)

# obtain a handle to the random database
db = connection.random
collection = db.randomData

batch_documents = [i for i in range(batch_number)]

for index in range(documents_number):
    try:
        date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
        value = random.random()
        document = {
            'created_on': date,
            'value': value,
        }
        batch_documents[index % batch_number] = document
        # insert only when a batch is full; a final partial batch is never inserted
        if (index + 1) % batch_number == 0:
            collection.insert(batch_documents)
        index += 1
        if index % 100000 == 0:
            print job_name, ' inserted ', index, ' documents.'
    except:
        print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
        raise

print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
- PyPy is clearly faster than CPython, as the results below show: about 1.5x in wall-clock time (710 s vs. 464 s) and about 2.3x in user CPU time (6m55s vs. 3m3s).
- CPython:
$ python --version
Python 2.7.8
$ time python test_pymongo.py 12000000
Job#1 inserted 12000000 in 709.949361 s

real 11m50.123s
user 6m55.263s
sys 0m48.803s
- PyPy:
$ python --version
Python 2.7.6 (3cf384e86ef7, Jun 27 2014, 00:09:47)
[PyPy 2.4.0-alpha0 with GCC 4.9.0 20140604 (prerelease)]
$ time python test_pymongo.py 12000000
Job#1 inserted 12000000 in 464.130798 s

real 7m44.711s
user 3m2.693s
sys 0m41.667s
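The speed-up factors can be checked with a few lines of Python. This is only a worked calculation on the timings printed above; the helper name to_seconds and the dictionary keys are made up for this sketch.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmY.ZZZs' reading from `time` into seconds."""
    return minutes * 60 + seconds


# Timings copied from the two runs above.
cpython = {"script": 709.949361, "user": to_seconds(6, 55.263)}
pypy = {"script": 464.130798, "user": to_seconds(3, 2.693)}

# Ratio > 1 means PyPy needed less time than CPython.
speedup = {key: cpython[key] / pypy[key] for key in cpython}

for key in ("script", "user"):
    print("%s time: PyPy is %.2fx faster" % (key, speedup[key]))
```

The script time includes waiting for MongoDB, which is the same for both interpreters, so the user CPU time is probably the better indicator of how much faster the Python side runs.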
MicTadLo
Open source - programming - Linux - digital photography -
Friday, 19 September 2014
PyMongo almost 2x faster with PyPy
Virtualenvwrapper makes it easier to use Python's virtualenv
- Install the following packages. I still need to use Python 2. If you want to use Python 3, just remove the 2 below:
$ sudo pacman -Sy
$ sudo pacman -S python-virtualenvwrapper python2-virtualenv python2-pip
- Set up virtualenvwrapper in your .bashrc:
$ nano ~/.bashrc
export WORKON_HOME=$HOME/.virtualenvs
source /usr/bin/virtualenvwrapper.sh
- Create a project, e.g. myProject, in virtualenv:
$ mkvirtualenv myProject --python=/usr/bin/python2
$ pip install sphinx
- Other virtualenv commands:
$ workon myProject # activate the virtualenv called myProject
$ deactivate # deactivate the current virtualenv
$ rmvirtualenv myProject # delete the virtualenv called myProject
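To confirm from inside Python that workon really activated an environment, a small check like the following can help. It is a minimal sketch, not part of virtualenvwrapper; the function name in_virtualenv is made up here, and the check relies on the attributes that virtualenv and venv are known to set on the sys module.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv sets sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # The venv module (and newer virtualenv releases) make sys.base_prefix
    # differ from sys.prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print(sys.executable)
print("inside a virtualenv: %s" % in_virtualenv())
```

Run it once after workon myProject and once after deactivate; the interpreter path and the answer should change.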
How to install Pandoc on Arch Linux based distros
- Add haskell-core above extra in /etc/pacman.conf:
$ sudo nano /etc/pacman.conf
[haskell-core]
Server = http://xsounds.org/~haskell/core/$arch
- Remove alex, ghc, happy and any other Haskell packages from your system that were installed from the extra repository.
- Add the key and install Pandoc and TeX Live (needed to generate PDF):
$ sudo pacman-key -r 4209170B
$ sudo pacman-key --lsign-key 4209170B
$ sudo pacman -Scc
$ sudo pacman -Syy
$ sudo pacman -S haskell-pandoc haskell-pandoc-citeproc haskell-pandoc-types
$ sudo pacman -S texlive-core
- I am using Sphinx, a tool that makes it easy to create intelligent and beautiful documentation. Both Sphinx and I use reStructuredText (rst) instead of Markdown (md). Here is a comparison between Markdown and reStructuredText. The Try pandoc site can convert between different text syntax formats.
Here is a little reStructuredText syntax:
ReStructuredText’s cheat sheet
Title
=====
Heading 1
---------
Heading 2
`````````
Heading 3
'''''''''
Heading 4
.........
Heading 5
~~~~~~~~~
Heading 6
*********
Heading 7
+++++++++
Heading 8
^^^^^^^^^
*Italic* **bold**
- Converting reStructuredText to PDF and DOCX:
$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.pdf test.rst
$ pandoc -V geometry:a4paper -f rst --toc --smart -o test.docx test.rst
- Converting reStructuredText to other formats.
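In the cheat sheet above every underline is exactly as long as its heading text. reStructuredText requires the underline to be at least that long, so when headings are generated by a script it is easiest to derive the underline from the text. The helper below is a small sketch; the name rst_heading is made up for illustration.

```python
def rst_heading(text, underline_char="="):
    """Return the heading text followed by an underline of the same length."""
    return text + "\n" + underline_char * len(text)


print(rst_heading("Title", "="))
print(rst_heading("Heading 1", "-"))
```

Which character stands for which heading level is decided by the order in which the characters first appear in the document, so the mapping in the cheat sheet is a convention rather than a fixed rule.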
Sunday, 2 February 2014
SnpEff's result HTML file parsing
Recently, I had to use the SnpEff application several times. It annotates and predicts the effects of variants on genes. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results can be found below, and the code can be found here.
Number of effects by type and region
I could copy and paste just the table information that I am interested in, but I like to have things done automatically so that they can be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.
Type | Region
---|---
For parsing the SnpEff's HTML file I decided to use a Python library called beautifulsoup4 and the code to parse the above HTML example can be found below and here.
from bs4 import BeautifulSoup

test = {}
with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

# the table of interest sits inside the anchor named "effects"
for a in soup.find_all('a', attrs={'name': 'effects'}):
    # skip the first three rows and read the cells of the remaining ones
    for tr in a.find_all('tr')[3:]:
        tds = tr.find_all('td')
        if len(tds) > 0:
            test[str(tds[0].text).strip()] = (str(tds[1].text).strip(), str(tds[2].text).strip())

print "#Intron = " + str(test['INTRON'][0])
And this is the output from the above code:
$ python test.py
#Intron = 6,307
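If beautifulsoup4 is not available, the same idea can be sketched with html.parser from the Python 3 standard library. This is a different technique from the one used above, not a drop-in replacement: it reads every table row instead of only those inside the effects anchor, and the HTML snippet, the EXON row and the percentages are hypothetical stand-ins for a real SnpEff report. It also turns the count into an integer by dropping the thousands separator.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the relevant part of a SnpEff report.
SAMPLE_HTML = """
<a name="effects">
<table>
  <tr><th>Type</th><th>Count</th><th>Percent</th></tr>
  <tr><td>INTRON</td><td>6,307</td><td>52.1%</td></tr>
  <tr><td>EXON</td><td>1,024</td><td>8.5%</td></tr>
</table>
</a>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None


def parse_counts(html):
    """Map the first cell of each row to the integer in its second cell."""
    collector = RowCollector()
    collector.feed(html)
    return {row[0]: int(row[1].replace(",", ""))
            for row in collector.rows if len(row) >= 2}


counts = parse_counts(SAMPLE_HTML)
print("#Intron =", counts["INTRON"])
```

Storing the count as an integer makes it easier to compare or sum values from several reports later on.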
Saturday, 21 December 2013
Bioinformatics: Too many formats
In bioinformatics, many formats are not properly defined or implemented, which makes it difficult to use them across different tools. The cartoon below describes the current situation in the bioinformatics field. Many new file formats were created to solve problems in previous file formats or to introduce new features. They were then no longer compatible with the previous formats (for example, GFF2 and GFF3).
Source: http://xkcd.com/927/
A person or an organisation that creates a new format often provides, at the same time, a library in one particular language, for example C. However, other programmers use different programming languages, such as Python. Therefore a wrapper has to be written so that Python can use the C library. This has its problems: each time the C library is improved, the Python wrapper also has to be updated, which delays the use of new improvements or bug fixes. It would be great if all bioinformatics libraries used the “Simplified Wrapper and Interface Generator” (SWIG, http://www.swig.org), a software development tool that connects programs written in C and C++ with a variety of high-level programming languages such as Python.
Sometimes it is possible to use comma- or tab-delimited file formats. This works, for example, for the FASTA format but not for the GFF3 format, as GFF3 requires you to define relationships between entities. The solution would be to use formats like eXtensible Markup Language (XML, www.w3.org/TR/REC-xml), JavaScript Object Notation (JSON, http://json.org) or YAML Ain't Markup Language (YAML, http://www.yaml.org). Many programming languages provide parsers for these three formats. YAML and JSON store the same information in the smallest file size. XML files are bigger, mainly because of XML's closing tags. XML and JSON each have a binary format: Efficient XML Interchange (EXI, www.w3.org/TR/exi/) and Binary JSON (BSON, http://bsonspec.org). EXI makes XML data up to a hundred times smaller, which increases processing speed and the transmission speed of XML across existing networks (16 November 2013, http://www.agiledelta.com/product_efx.html). BSON, on the other hand, provides more efficient encoding and decoding than JSON, but the file size might be bigger than that of plain JSON.
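The size difference between JSON and XML can be illustrated with the Python standard library. The sketch below serialises the same two records both ways and compares the lengths. The gene and mRNA records are hypothetical, the parent field only imitates the kind of relationship GFF3 expresses, and the XML layout (one child element per field) is an assumption; an attribute-based layout would be more compact.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical features; "parent" links the mRNA to its gene.
features = [
    {"id": "gene1", "type": "gene", "start": 1000, "end": 9000},
    {"id": "mRNA1", "type": "mRNA", "start": 1050, "end": 9000, "parent": "gene1"},
]

# Compact JSON without extra whitespace.
json_text = json.dumps(features, separators=(",", ":"))

# XML with one element per field, so every field name appears twice
# (opening and closing tag).
root = ET.Element("features")
for feature in features:
    node = ET.SubElement(root, "feature")
    for key, value in feature.items():
        ET.SubElement(node, key).text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), "characters as JSON")
print(len(xml_text), "characters as XML")
```

Both versions keep the parent relationship, which a flat comma- or tab-delimited line can only express by convention.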
Saturday, 16 November 2013
LiteIDE, a new IDE for the Go programming language
Google created an open-source programming language called GO, which tries to combine the performance of a compiled language like C with the development speed of a language like Python.
In my opinion, it is also important to have a smart IDE which helps with coding. One of the biggest helps, I think, is code completion. If you are new to a programming language, it helps you to understand other libraries more quickly by suggesting function names and arguments.
Google provides Eclipse plugins for Android development (ADT) and for their web app programming language Dart. To my surprise, they do not provide an Eclipse plugin for GO. I did find one Eclipse plugin called Goclipse, but it looks to me as if it needs more love, because the last release was in May 2011.
Recently, I found an IDE for GO called LiteIDE, which is open source and cross-platform. It provides code completion for GO, which works very nicely, as the picture below shows. A full feature list of LiteIDE can be found here.
On my laptop I use Sabayon Linux, which has a package for GO; however, it is not the latest version of GO. Below are step-by-step instructions on how to install GO together with LiteIDE.
1. Step: Install GO
$ cd Downloads
$ wget -c http://go.googlecode.com/files/go1.1.2.linux-amd64.tar.gz
$ tar xfvz go1.1.2.linux-amd64.tar.gz
$ mkdir -p ~/apps/go_packages
$ mv go ~/apps
2. Step: Install LiteIDE
$ cd Downloads
$ git clone https://github.com/visualfc/liteide
$ cd liteide/build/
$ export QTDIR=/usr/lib/qt4
$ sh build_linux.sh
$ mv liteide/ $HOME/apps
3. Step: Set up environment variable in .bashrc
export APPS=/home/uqmlore1/apps
export GOROOT=$APPS/go
export GOPATH=$APPS/go_packages
export GOOS=linux
export GOARCH=amd64
export LITEIDE=$APPS/liteide/bin/
export PATH=$GOROOT/bin:$GOPATH/bin:$LITEIDE:$PATH
4. Test GO and LiteIDE
$ . ~/.bashrc
$ go version
go version go1.1.2 linux/amd64
$ liteide
Have fun!
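If go version is not found after reloading .bashrc, the environment variables from step 3 are the first thing to check. The following Python sketch does that check; it is only an illustration, and the function name missing_go_settings is made up. It takes the environment as a mapping so that it can be tried with test values as well as with os.environ.

```python
import os

REQUIRED = ("GOROOT", "GOPATH")


def missing_go_settings(env):
    """Return a list of problems found in env (a dict-like mapping)."""
    problems = [name + " is not set" for name in REQUIRED if not env.get(name)]
    goroot = env.get("GOROOT")
    if goroot:
        # The go binary lives in $GOROOT/bin, which must be listed in PATH.
        go_bin = os.path.join(goroot, "bin")
        if go_bin not in env.get("PATH", "").split(os.pathsep):
            problems.append(go_bin + " is not on PATH")
    return problems


for problem in missing_go_settings(os.environ):
    print(problem)
```

No output means that GOROOT and GOPATH are set and that the bin directory of GOROOT is on the PATH.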