In bioinformatics, many file formats are not properly defined or
consistently implemented, which makes it difficult to exchange them
between different tools. The cartoon below describes the current
situation in the field. Many new file formats were created to solve
the problems of earlier formats or to introduce new features, but they
were then no longer compatible with their predecessors (for example,
GFF3 with GFF2).
Source: http://xkcd.com/927/
A person or an organisation creating a new format usually provides, at
the same time, a library in one particular language, for example C.
However, other programmers use different programming languages, such
as Python.
Therefore, a wrapper has to be written so that Python can use the C
library. This has its problems: each time the C library is improved,
the Python wrapper also has to be updated, which delays the use of new
improvements or bug fixes. It would be great if all bioinformatics
libraries used the “Simplified Wrapper and Interface Generator” (SWIG,
http://www.swig.org), a software development tool that connects
programs written in C and C++ with a variety of high-level programming
languages such as Python.
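To make the maintenance burden concrete, here is a minimal sketch of a hand-written wrapper in Python. It uses the standard `ctypes` module rather than SWIG, and it wraps `strlen()` from the C standard library purely as a stand-in for a bioinformatics C library. The function name `sequence_length` is hypothetical, and the snippet assumes a POSIX-like system where the C library can be loaded this way.

```python
import ctypes
import ctypes.util

# Load the C standard library (assumption: a POSIX-like system).
# If find_library() returns None, CDLL(None) falls back to the symbols
# already loaded into the current process, which include libc on POSIX.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# The wrapper has to restate the C signature: size_t strlen(const char *s).
# If the C side changed, these two lines would need a matching update.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def sequence_length(sequence: str) -> int:
    """Hypothetical Python-facing wrapper around the C function strlen()."""
    return libc.strlen(sequence.encode("ascii"))


print(sequence_length("ACGTACGT"))
```

The argument and return types are declared by hand on the Python side. This is the kind of duplicated information that can fall out of step with the C library, and that an interface generator such as SWIG is meant to produce automatically.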
Sometimes it is possible to use comma- or tab-delimited file formats.
This works, for example, with the FASTA format but does not work with
the GFF3 format, as it requires you to define relationships between
entities. The solution would be to use formats like eXtensible Markup
Language (XML, www.w3.org/TR/REC-xml), JavaScript Object Notation
(JSON, http://json.org) or YAML Ain't Markup Language (YAML,
http://www.yaml.org). Many programming languages provide parsers for
these three formats. Of the three, YAML and JSON can store the same
information in the smallest file size.
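As a minimal sketch of why a nested format helps with relationships between entities, the following Python snippet stores a few made-up features with parent/child links as JSON. The feature IDs, the coordinates, and the `build_tree` helper are hypothetical and only illustrate the idea; real GFF3 records carry more columns and attributes.

```python
import json

# Hypothetical, simplified feature records; IDs and coordinates are made up.
features = [
    {"id": "gene1", "type": "gene", "start": 100, "end": 900, "parent": None},
    {"id": "mrna1", "type": "mRNA", "start": 100, "end": 900, "parent": "gene1"},
    {"id": "exon1", "type": "exon", "start": 100, "end": 400, "parent": "mrna1"},
    {"id": "exon2", "type": "exon", "start": 600, "end": 900, "parent": "mrna1"},
]


def build_tree(records):
    """Nest each record under its parent and return the top-level features."""
    nodes = {r["id"]: {**r, "children": []} for r in records}
    roots = []
    for node in nodes.values():
        parent_id = node.pop("parent")
        if parent_id is None:
            roots.append(node)
        else:
            nodes[parent_id]["children"].append(node)
    return roots


tree = build_tree(features)

# The nesting survives a round trip through JSON text.
text = json.dumps(tree, indent=2)
restored = json.loads(text)
print(text)
```

A flat comma- or tab-delimited row has no natural place for the `children` list, whereas JSON, YAML, and XML can all express this nesting directly.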
XML files tend to be larger, mainly because of XML's closing tags. XML
and JSON each have a binary counterpart: Efficient XML Interchange
(EXI, www.w3.org/TR/exi/) and Binary JSON (BSON, http://bsonspec.org).
EXI makes XML data up to a hundred times smaller, which increases both
the processing speed and the transmission speed of XML across existing
networks (16 November 2013, http://www.agiledelta.com/product_efx.html).
BSON, on the other hand, provides more efficient encoding and decoding
than JSON, but the file size might be bigger than that of the plain
JSON format.
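To get a rough feel for the size difference between the text formats, the sketch below serialises the same made-up records as compact JSON and as XML, using only the Python standard library. The records and the one-element-per-field XML layout are assumptions chosen for illustration. An attribute-based XML layout would be more compact, and real results depend on the data.

```python
import json
import xml.etree.ElementTree as ET

# Made-up records, for illustration only.
records = [
    {"id": "exon1", "type": "exon", "start": 100, "end": 400},
    {"id": "exon2", "type": "exon", "start": 600, "end": 900},
]

# Compact JSON: no whitespace after the separators.
json_text = json.dumps(records, separators=(",", ":"))

# XML with one child element per field (an assumed layout).
root = ET.Element("features")
for record in records:
    feature = ET.SubElement(root, "feature")
    for key, value in record.items():
        ET.SubElement(feature, key).text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

# Compare the encoded sizes of the two serialisations.
json_size = len(json_text.encode("utf-8"))
xml_size = len(xml_text.encode("utf-8"))
print("JSON bytes:", json_size)
print("XML bytes:", xml_size)
```

Every field name appears twice in this XML layout, once in the opening tag and once in the closing tag, which is where most of the extra bytes come from.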