Sunday, 2 February 2014

SnpEff's result HTML file parsing

Recently, I had to use SnpEff several times. SnpEff is an application that annotates variants and predicts their effects on genes. One of SnpEff's result files is an HTML file. An example snippet of SnpEff's results can be found below and the code can be found here.


Number of effects by type and region
Type

Type (alphabetical order)   Count    Percent
DOWNSTREAM                  45,227   31.475%
INTERGENIC                  37,384   26.016%
INTRON                       6,307    4.389%
NON_SYNONYMOUS_CODING        6,800    4.732%
NON_SYNONYMOUS_START             1    0.001%
SPLICE_SITE_ACCEPTOR            18    0.013%
SPLICE_SITE_DONOR               31    0.022%
START_LOST                       1    0.001%
STOP_GAINED                    355    0.247%
STOP_LOST                      295    0.205%
SYNONYMOUS_CODING            3,619    2.519%
SYNONYMOUS_STOP                 63    0.044%
UPSTREAM                    43,593   30.337%

Region

Type (alphabetical order)   Count    Percent
DOWNSTREAM                  45,227   31.475%
EXON                        11,134    7.748%
INTERGENIC                  37,384   26.016%
INTRON                       6,307    4.389%
SPLICE_SITE_ACCEPTOR            18    0.013%
SPLICE_SITE_DONOR               31    0.022%
UPSTREAM                    43,593   30.337%


I could copy and paste just the table information that I am interested in, but I like to have things done automatically, as the code can be reused for other projects. In my previous blog post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe information that can be shared between applications. More information about the difference between HTML and XML can be found here.

To parse SnpEff's HTML file I decided to use a Python library called beautifulsoup4. The code to parse the above HTML example can be found below and here.

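from bs4 import BeautifulSoup

# Maps each effect name to a (count, percent) tuple of strings.
test = {}

with open("SnpEff.html") as f:
    # The "lxml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(f, "lxml")

    # Look for table rows inside the anchor named "effects".
    for a in soup.find_all('a', attrs={'name': 'effects'}):
        # Skip the first three (non-data) rows.
        for tr in a.find_all('tr')[3:]:
            tds = tr.find_all('td')
            # A data row has three cells: name, count and percent.
            if len(tds) >= 3:
                test[tds[0].text.strip()] = (tds[1].text.strip(),
                                             tds[2].text.strip())

# print() with a single argument works in both Python 2 and Python 3.
print("#Intron = " + test['INTRON'][0])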
 
And this is the output from the above code:

$ python test.py 
#Intron = 6,307
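
The values stored in the test dictionary are still strings such as "6,307" and "4.389%", so they need converting before doing any arithmetic with them. Below is a minimal sketch of one way to do that. The helper names parse_count and parse_percent and the sample dictionary are my own assumptions for illustration; they are not part of SnpEff or beautifulsoup4.

```python
def parse_count(text):
    """Convert a count such as '6,307' to the integer 6307."""
    return int(text.replace(",", ""))


def parse_percent(text):
    """Convert a percentage such as '4.389%' to the float 4.389."""
    return float(text.strip().rstrip("%"))


# Hypothetical sample in the same shape as the `test` dictionary above:
# effect name -> (count string, percent string).
sample = {
    "INTRON": ("6,307", "4.389%"),
    "STOP_GAINED": ("355", "0.247%"),
}

# Build a new dictionary with numeric values instead of strings.
numeric = {
    name: (parse_count(count), parse_percent(percent))
    for name, (count, percent) in sample.items()
}

# Counts can now be added together.
print(numeric["INTRON"][0] + numeric["STOP_GAINED"][0])
```

With the real data, the sample dictionary would simply be replaced by the test dictionary filled in by the parsing code.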