Recently, I have had to use SnpEff several times. SnpEff is an
application that annotates variants and predicts their effects on
genes. One of SnpEff's result files is an HTML file. A snippet of
SnpEff's results is shown below, and the code can be found
here.
Number of effects by type and region

Type

Type (alphabetical order) | Count  | Percent
DOWNSTREAM                | 45,227 | 31.475%
INTERGENIC                | 37,384 | 26.016%
INTRON                    | 6,307  | 4.389%
NON_SYNONYMOUS_CODING     | 6,800  | 4.732%
NON_SYNONYMOUS_START      | 1      | 0.001%
SPLICE_SITE_ACCEPTOR      | 18     | 0.013%
SPLICE_SITE_DONOR         | 31     | 0.022%
START_LOST                | 1      | 0.001%
STOP_GAINED               | 355    | 0.247%
STOP_LOST                 | 295    | 0.205%
SYNONYMOUS_CODING         | 3,619  | 2.519%
SYNONYMOUS_STOP           | 63     | 0.044%
UPSTREAM                  | 43,593 | 30.337%

Region

Type (alphabetical order) | Count  | Percent
DOWNSTREAM                | 45,227 | 31.475%
EXON                      | 11,134 | 7.748%
INTERGENIC                | 37,384 | 26.016%
INTRON                    | 6,307  | 4.389%
SPLICE_SITE_ACCEPTOR      | 18     | 0.013%
SPLICE_SITE_DONOR         | 31     | 0.022%
UPSTREAM                  | 43,593 | 30.337%

I could copy and paste just the table information that I am interested in, but I prefer to automate things so that they can be reused in other projects. In my previous blog
post I wrote that it is possible to parse XML files, but I did not mention that it is also possible to parse HTML files. Web browsers render HTML files, whereas XML is used to describe data that can be shared between applications. More information about the difference between HTML and XML can be found
here.
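One practical consequence of that difference is that a strict XML parser can reject markup that an HTML parser accepts without complaint. The following is a minimal sketch using only the Python 3 standard library; the snippet string and the names `parses_as_xml` and `TextCollector` are made up for illustration and are not part of SnpEff or of the code in this post.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: fine as HTML, but not well-formed XML
# because <br> is never closed.
snippet = "<p>Count<br>6,307</p>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Collect the text chunks that the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(snippet)
collector.close()

print(parses_as_xml(snippet))  # False: the XML parser rejects the unclosed <br>
print(collector.chunks)        # ['Count', '6,307']
```

This is why an HTML-aware parser is usually the safer choice for report files that were written for web browsers.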
To parse SnpEff's HTML file I decided to use a Python library called
beautifulsoup4. The code to parse the above HTML example can be found below and
here.
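from bs4 import BeautifulSoup

# Maps each effect or region name to a (count, percent) tuple of strings.
test = {}
with open("SnpEff.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Look for the tables inside the <a name="effects"> anchor.
for a in soup.find_all('a', attrs={'name': 'effects'}):
    # Skip the first three rows, which do not hold data in this report.
    for tr in a.find_all('tr')[3:]:
        tds = tr.find_all('td')
        if len(tds) > 0:
            test[tds[0].text.strip()] = (tds[1].text.strip(),
                                         tds[2].text.strip())

print("#Intron = " + test['INTRON'][0])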
And this is the output from the above code:
$ python test.py
#Intron = 6,307
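
Note that the parsed values are still strings such as "6,307" and "4.389%", so they need converting before any arithmetic. Here is a minimal sketch of one way to do that; the helper names `parse_count` and `parse_percent` are hypothetical, and the two rows are copied by hand from the table above rather than read from the HTML file.

```python
def parse_count(text):
    """Turn a count such as '6,307' into the integer 6307."""
    return int(text.replace(",", ""))


def parse_percent(text):
    """Turn a percentage such as '4.389%' into the float 4.389."""
    return float(text.strip().rstrip("%"))


# Same shape as the `test` dictionary above: name -> (count, percent).
# These two rows are copied from the table in this post.
test = {"INTRON": ("6,307", "4.389%"), "EXON": ("11,134", "7.748%")}

numeric = {name: (parse_count(count), parse_percent(percent))
           for name, (count, percent) in test.items()}

# Counts can now be added up like any other numbers.
print(numeric["INTRON"][0] + numeric["EXON"][0])  # 17441
```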