Tree likelihood

From Bio++ Wiki
Revision as of 21:11, 20 April 2013 by Julien (talk | contribs) (Computing the likelihood of a sequence alignment)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Computing the likelihood of a sequence alignment

Note: This chapter describes the new likelihood framework, under active development, available only under the "newlik" branch for the GIT repository. The code described here IS NOT YET available in the stable version of Bio++ !

Simple or double recursive algorithm

Two algorithms exist for computing the likelihood of a sequence alignment given a phylogenetic tree and model of sequence evolution. These algorithms are called "simple recursive" (R) and "double recursive" (DR). They are described in detail in Felsenstein's book [1]. The R algorithm uses less memory than the DR one, and is faster as it involves only one recursion to compute the likelihood. The DR algorithm however, enables the direct computation of likelihood change after a nearest neighbor interchange (NNI), and is therefore used when the topology of the tree is optimized. Another nice property of the DR algorithm is that it allows the fast computation of ancestral sequences and substitution mapping, while this would involve additional recursion with a R algorithm.

The classes performing both type of computations are called RTreeLikelihood and DRTreeLikelihood. They both inherit from the generic TreeLikelihood interface, and therefore have many methods in common. The difference between these two classes lies in their internal data structure, used for storing all conditional likelihood. These are called RTReeLikelihoodData and DRTreeLikelihoodData. You will most likely never have to worry about those classes, as they are used internally by the likelihood classes, unless you want to create your own likelihood class.

The TreeLikelihood classes allow you to compute the likelihood of one or several data sets, according to a certain model of sequence evolution. In Bio++, this model is described in a class implementing the SubstitutionProcess interface, described in the next section.

Substitution process

The substitution process describes how sequences evolve along a phylogenetic tree. Several types of modeling are supported, via distinct implementations, which all share the same SubstitutionProcess interface. Whatever the complexity of the process, it will always require a phylogenetic tree to be provided. In Bio++, trees are implemented in the TreeTemplate<Node> class. Such trees can have branch lengths, which are actually parameters of the substitution process. To ease the manipulation of these parameters, the SubstitutionProcess classes require a ParametrizableTree object, which can be instanciated from a TreeTemplate object:

TreeTemplate<Node>* tree = /* most likely read from a file */
ParametrizableTree* pTree = new ParametrizableTree(tree);

The ParametrizableTree class manipulates the list of branch length parameters and synchronize it, when needed, with the corresponding TreeTemplate object.

The SubstitutionProcess classes are tightly bound to SubstitutionModel classes. They will typically own one or more instance of those. While the SubstitutionModel class computes the so-called transition probabilities between all states of a model (defined after a sequence alphabet), for any time, the SubstitutionProcess class will provide such probabilities for each branch of the underlying tree. The SubstitutionProcess class will notably be in charge of computing the probabilities only when necessary (with a call to the corresponding method of the appropriate SubstitutionModel class), and store the result for later recalling. It therefore has to listen to any parameter change and make sure the computed probabilities are always correct. The SubstitutionProcess class also allows to compute other related quantities, in a similar fashion, including derivatives of such probabilities acccording to time, generator of the Markov model, equilibrium frequencies, and so on.

Several implementations of SubstitutionProcess are available, and are described in the following subsections.

Simple substitution process

The SimpleSubstitutionProcess class implements the simplest model of sequence evolution, defined by a single SubstitutionModel object passed as argument. The process assumes that all branches and all sites share the same subsitution model (modle homogeneous in time and in sites).

RAS model

The RateAcrossSitesSubstitutionProcess class implements Yang's Rate-across-sites model of rate variation. It is very similar to the SimpleSubstitutionProcess, but takes as argument as extra parameter a RateDistribution object, describing the discrete distribution of rates to use (typically a discretized gamma distribution).


References

  1. Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Mass.