MicTadLo: February 2014

Recently, I had to use SnpEff several times. SnpEff is an application that annotates variants and predicts their effects on genes. One of SnpEff's result files is an HTML file. A snippet of SnpEff's results is shown below, and the code can be found here.

Number of effects by type and region

Type

Type (alphabetical order)	Count	Percent
DOWNSTREAM	45,227	31.475%
INTERGENIC	37,384	26.016%
INTRON	6,307	4.389%
NON_SYNONYMOUS_CODING	6,800	4.732%
NON_SYNONYMOUS_START	1	0.001%
SPLICE_SITE_ACCEPTOR	18	0.013%
SPLICE_SITE_DONOR	31	0.022%
START_LOST	1	0.001%
STOP_GAINED	355	0.247%
STOP_LOST	295	0.205%
SYNONYMOUS_CODING	3,619	2.519%
SYNONYMOUS_STOP	63	0.044%
UPSTREAM	43,593	30.337%

Region

Type (alphabetical order)	Count	Percent
DOWNSTREAM	45,227	31.475%
EXON	11,134	7.748%
INTERGENIC	37,384	26.016%
INTRON	6,307	4.389%
SPLICE_SITE_ACCEPTOR	18	0.013%
SPLICE_SITE_DONOR	31	0.022%
UPSTREAM	43,593	30.337%

I could copy and paste just the table information that I am interested in, but I like to automate such tasks so that the code can be reused in other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe data that can be shared between applications. More information about the difference between HTML and XML can be found here.
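As a rough illustration of that difference, here is a minimal sketch that uses only Python's standard library. The HTML snippet and the CellCollector class are made up for this example and are not taken from SnpEff. The snippet leaves its td tags unclosed, which a browser tolerates, so a strict XML parser rejects it while a lenient HTML parser still recovers the cell text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: the <td> tags are never closed, which HTML tolerates
snippet = "<table><tr><td>INTRON<td>6,307</tr></table>"


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell (illustrative helper only)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # A new <td> starts a new cell, even if the previous one was not closed
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        # Any of these closing tags ends the current cell
        if tag in ("td", "tr", "table"):
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


# A strict XML parser refuses the snippet because the tags do not match up
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# The HTML parser accepts the same input
collector = CellCollector()
collector.feed(snippet)
collector.close()

print(xml_ok)           # False
print(collector.cells)  # ['INTRON', '6,307']
```

Real-world HTML reports are rarely well-formed XML, which is one reason to reach for a dedicated HTML parser.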

To parse SnpEff's HTML file I decided to use a Python library called beautifulsoup4. The code to parse the above HTML example can be found below and here.

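from bs4 import BeautifulSoup

# Maps each effect name to a (count, percent) tuple of strings
test = {}

with open("SnpEff.html") as f:
    # The "lxml" parser requires the lxml package to be installed
    soup = BeautifulSoup(f, "lxml")

    # Look for table rows inside the anchor named "effects"
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        # Skip the first three rows, which are not data rows
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            # Only rows with name, count and percent cells are stored
            if len(tds) >= 3:
                test[str(tds[0].text).strip()] = (str(tds[1].text).strip(),
                    str(tds[2].text).strip())

print("#Intron = " + str(test['INTRON'][0]))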

And this is the output from the above code:

$ python test.py 
#Intron = 6,307
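Note that the parsed values are still strings, with a thousands separator in the counts and a trailing percent sign. If numbers are needed, one possible way to convert them is sketched below. The helper names parse_count and parse_percent are my own for this example, and the sample dictionary just copies two rows from the tables above in the same shape as the test dictionary.

```python
def parse_count(text):
    """Turn a count such as '6,307' into the integer 6307."""
    return int(text.replace(",", ""))


def parse_percent(text):
    """Turn a percentage such as '4.389%' into the float 4.389."""
    return float(text.strip().rstrip("%"))


# Two sample rows copied from the tables above: name -> (count, percent)
test = {"INTRON": ("6,307", "4.389%"), "EXON": ("11,134", "7.748%")}

# Convert every count to an integer so that it can be used in arithmetic
counts = {name: parse_count(count) for name, (count, percent) in test.items()}

print(counts["INTRON"] + counts["EXON"])  # 17441
```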

Sunday, 2 February 2014

SnpEff's result HTML file parsing