Reconstructing ancestral sequences

From Bio++ Wiki
Revision as of 09:03, 14 February 2011 by Julien (talk | contribs) (Added category.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

This tutorial shows how to reconstruct ancestral sequences using likelihood methods. It assumes that you have a sequence alignment (as a SiteContainer object) and a phylogenetic tree (as a Tree object).

Headers needed

<source lang="cpp">

  1. include <Bpp/Seq/Containers.all> //Sequence containers
  2. include <Bpp/Phyl/Tree.h> //Phylogenetic trees
  3. include <Bpp/Phyl/Models.all> //Sequence evolution models
  4. include <Bpp/Phyl/Likelihood.all> //Likelihood methods, including ancestral states reconstruction
  5. include <Bpp/Numeric/Probs.all> //Rate distributions
  6. include <Bpp/Numeric/VectorTools.h> //For the Shannon entropy


Define a sequence evolution model and fit it

You can chose any homogeneous or non-homogeneous model to reconstruct ancestral sequences. Currently mixed models are not supported. In order to use the reconstruction method, we will need a double-recursive likelihood class:

<source lang="cpp"> Tree* tree = ... SiteContainer* sites = ... SubstitutionModel* model = new JTT92(AlphabetTools::PROTEIN_ALPHABET); DiscreteDistribution* rDist = new GammaDiscreteDistribution(4, 0.354); rDist->aliasParameter("beta", "alpha"); DRTreeLikelihood* drtl = new DRHomogeneousTreeLikelihood(model, rDist, *tree, *sites); </source>

In most cases you will like to fit the model to the data, and estimate parameters. This requires the use of the OptimizationTools::optimizeNumericalParameter function. For doing so, it might be a good idea to use a RHomogeneousTreeLikelihood, as it will be less memory greedy and faster to optimize, and in a second step build the DR likelihood object directly from estimated parameters.

Reconstruct ancestral sequences using the marginal method

This method implements the marginal reconstruction of Yang (ref). It reconstructs sequences at requested nodes independently, and therefore can be inaccurate when several ancestral sequences are to be reconstructed.

<source lang="cpp"> MarginalAncestralStateReconstruction mar(drtl); int nodeId = 25; //for example Sequence* seq = mar.getAncestralSequenceForNode(nodeId); </source>

This will retrieve the ancestral sequence for the node with id 25. In this case, the state with maximum posterior probability for each site will be used. In order to assess the uncertainty in the ancestral reconstruction, it is possible to retrieve the posterior probabilities for each state at each site:

<source lang="cpp"> vector< vector<double> > probs; Sequence* seq = mar.getAncestralSequenceForNode(nodeId, probs, false); </source>

The uncertainty is the reconstruction can for example be assessed with the Shannon entropy: <source lang="cpp"> vector<double> entropy(probs.size()); for (size_t i = 0; i < probs.size(); ++i) {

 entropy[i] = VectorTools::shannon(probs[i], 20),

} </source>

It is also possible to sample the ancestral sequence from the posterior probabilities: <source lang="cpp"> Sequence* seq = mar.getAncestralSequenceForNode(nodeId, 0, true); </source>

The full set of ancestral sequences can be retrieved as a SiteContainer object, with or without sampling: <source lang="cpp"> bool sample = true; SiteContainer* ancSites = mar.getAncestralSequences(sample); </source>

Note: for now, joint reconstruction is not available.