Saturday, 21 December 2013

Bioinformatics: Too many formats


In bioinformatics, many file formats are poorly defined or inconsistently implemented, which makes it difficult to exchange data between different tools. The cartoon below describes the current situation in the field. Many new file formats were created to solve problems with earlier formats or to introduce new features, but they were then no longer compatible with their predecessors (for example, GFF2 and GFF3).
 
Source: http://xkcd.com/927/

A person or organisation that creates a new format usually also provides a library in one particular language, for example C, while other programmers work in different languages such as Python. A wrapper therefore has to be written so that Python can use the C library. This has its problems: each time the C library is improved, the Python wrapper also has to be updated, which delays access to the new improvements or bug fixes. It would be great if all bioinformatics libraries used the “Simplified Wrapper and Interface Generator” (SWIG, http://www.swig.org), a software development tool that connects programs written in C and C++ with a variety of high-level programming languages such as Python.

Sometimes it is possible to use comma- or tab-delimited file formats. This works, for example, for the simple records of the FASTA format, but not well for the data in the GFF3 format, which requires relationships between entities to be defined. A solution would be to use formats like eXtensible Markup Language (XML, www.w3.org/TR/REC-xml), JavaScript Object Notation (JSON, http://json.org) or YAML Ain't Markup Language (YAML, http://www.yaml.org). Many programming languages provide parsers for these three formats. Of the three, YAML and JSON tend to give the smallest files for the same information; XML files tend to be the largest, mainly because of XML's closing tags.

XML and JSON each have a binary counterpart: Efficient XML Interchange (EXI, www.w3.org/TR/exi/) and Binary JSON (BSON, http://bsonspec.org). EXI is reported to make XML data up to a hundred times smaller, which increases processing speed and the transmission speed of XML across existing networks (16 November 2013, http://www.agiledelta.com/product_efx.html). BSON, on the other hand, offers more efficient encoding and decoding than JSON, but the file may be bigger than the plain JSON equivalent.
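To make the point about relationships and file size a little more concrete, here is a minimal Python sketch using only the standard library. The gene model (gene1, mRNA1, exon1, exon2), its coordinates and the field names are made up for illustration and do not come from any real annotation or from the GFF3 specification. The XML layout, one child element per field, is just one possible choice; storing the fields as attributes would give a different size.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical gene model: one gene, one mRNA, two exons.
# The parent/child relationship is expressed by nesting.
gene = {
    "id": "gene1", "type": "gene", "start": 1000, "end": 9000,
    "children": [
        {
            "id": "mRNA1", "type": "mRNA", "start": 1050, "end": 9000,
            "children": [
                {"id": "exon1", "type": "exon", "start": 1050, "end": 1500, "children": []},
                {"id": "exon2", "type": "exon", "start": 3000, "end": 3902, "children": []},
            ],
        }
    ],
}


def to_xml(feature):
    """Convert a nested feature dict into an XML element (one tag per field)."""
    element = ET.Element("feature")
    for key, value in feature.items():
        if key != "children":
            ET.SubElement(element, key).text = str(value)
    for child in feature["children"]:
        element.append(to_xml(child))
    return element


def count_features(feature):
    """Count a feature and all of its descendants."""
    return 1 + sum(count_features(child) for child in feature["children"])


# Serialise the same information in both formats, without extra whitespace.
json_text = json.dumps(gene, separators=(",", ":"))
xml_text = ET.tostring(to_xml(gene), encoding="unicode")

# Reading the JSON back gives the nested structure again.
restored = json.loads(json_text)

print("features:", count_features(restored))
print("JSON characters:", len(json_text))
print("XML characters:", len(xml_text))
```

For this toy record the XML output is longer than the JSON output, mostly because every field needs an opening and a closing tag. Real files will differ depending on how the schema is designed.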