Parse maf files

From Bio++ Wiki
Jump to: navigation, search

This recipe demonstrate how to parse MAF files using the Bio++ MAF parser, in combination with the boost iostreams library.

First we need to include what we need to read files: <source lang="cpp"> // From the STL:

  1. include <iostream>
  2. include <iomanip>
  3. include <string>
  4. include <memory>

using namespace std;

//From boost:

  1. include <boost/iostreams/device/file.hpp>
  2. include <boost/iostreams/filtering_stream.hpp>
  3. include <boost/iostreams/filter/gzip.hpp> //OR
  4. include <boost/iostreams/filter/bzip2.hpp> //OR
  5. include <boost/iostreams/filter/zlib.hpp>

using namespace boost::iostreams;

// From SeqLib:

  1. include <Seq/MafAlignmentParser.h>
  2. include <Seq/SequenceWithQuality.h> //This will be needed to deal with quality scores.

using namespace bpp; </source>

Maf files contain genome alignments, and are therefore quite large. They usually come as compressed files that we can parse directly in a C++ program. This can be achieved thanks to the boost iostreams library, which are fully compatible with the standard iostreams: <source lang="cpp"> string inputFile = "primates-6ways-chr1.maf.gz"; filtering_istream stream; stream.push(gzip_decompressor()); stream.push(file_source(inputFile)); MafAlignmentParser parser(&stream); </source>

The best way to parse a MAF file is to iterate over syntheny blocks it contains. In this example loop, we are interested in the quality scores attached to a special species: <source lang="cpp"> string species = "Ptro"; while (MafBlock* block = parser.nextBlock()) {

 for (unsigned int i = 0; i < block->getAlignment().getNumberOfSequences(); ++i) {
   string name = block->getAlignment().getSequence(i).getName();
   if (name == species) {
     //We retrieve the sequence as a MafSequence object:
     const MafSequence* mafSeq      = &dynamic_cast<const MafSequence&>(block->getAlignment().getSequence(i));
     //Now we get the attached scores:
     const SequenceQuality* qscores = &dynamic_cast<const SequenceQuality&>(
         mafSeq->getAnnotation(SequenceQuality::QUALITY_SCORE));
     //Here we deal with the scores:
     map<int, unsigned int> counts = VectorTools::countValues(qscores->getScores());
     //etc/
   }
 }
 //Get read of this block to avoid memory leaking:
 delete block;

} </source>