Project

The aim of the Bio++ Project is to provide re-usable code for the rapid development of robust applications in the fields of sequence analysis, phylogenetics, molecular evolution and population genetics.

Bio++ is designed in an extensible object-oriented way, in the C++ language.

Available methods

Sequence analysis

  • Sequence and Site objects, with support for various Alphabets (DNA, RNA, proteins, codons, any 'Word' of a given size).
  • Several containers available for inner storage, with several implementations. Support for alignments.
  • Various I/O formats supported: Fasta, Mase, Clustal, Phylip, DCSE, GenBank (sequence only), etc.
  • Sequence manipulation: truncation, concatenation, sub-sequences, etc.
  • In silico molecular biology: (reverse) transcription, translation, replication.
  • Several genetic codes available: standard and mitochondrial (vertebrates, echinoderms, yeast and other invertebrates).
  • Amino acid properties: volume, polarity and charge + physico-chemical distance (Miyata and Grantham) + import from any AAIndex entry.
  • Consensus sequences.
  • Pairwise alignment.
  • Similarity score computation.
  • Sequence bootstrap.
  • Homogeneity test (Bowker's test).
  • NGS tools: sequence quality scores, file formats (Phred, FastQ, MAF, etc.).
  • etc.
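
The "in silico molecular biology" operations listed above can be illustrated with a minimal sketch in plain C++. It does not use the Bio++ classes: the function names complementBase, reverseComplement and transcribe are hypothetical helpers chosen for this example, and the sketch assumes upper-case DNA with only A, C, G, T, N and the gap character '-'.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Bio++): complement of a single DNA base.
// Assumes upper-case input; N and '-' are returned unchanged.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'N': return 'N';
        case '-': return '-';
        default: throw std::invalid_argument("not a supported DNA character");
    }
}

// Reverse the sequence, then complement every base.
std::string reverseComplement(const std::string& dna) {
    std::string out(dna.rbegin(), dna.rend());
    std::transform(out.begin(), out.end(), out.begin(), complementBase);
    return out;
}

// Transcription of the coding strand: replace every T with U.
std::string transcribe(const std::string& dna) {
    std::string rna(dna);
    std::replace(rna.begin(), rna.end(), 'T', 'U');
    return rna;
}
```

For instance, reverseComplement("ATGC") gives "GCAT" and transcribe("ATGT") gives "AUGU". In a library such as Bio++ the same operations are tied to Alphabet objects, so that the valid characters are checked against the alphabet rather than hard-coded as here.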

Phylogenetics and molecular evolution

Data structure and IO

  • Phylogenetic trees.
  • IO from newick files, with support for multiple entries.
  • Support for NHX and Nexus formats.

Phylogenetic reconstruction methods

  • Parsimony (NNI)
  • Distance matrices estimation and I/O to files in Phylip format.
  • Distance methods: (U/W)PGMA, NJ, BioNJ.
  • Maximum likelihood (NNI, including a PhyML-like algorithm).
  • Mixed distance/ML tree reconstruction (iterative approaches).
  • Tree consensus methods, bipartitions, bootstrap value computations.
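
As an illustration of the distance estimation step that precedes methods such as NJ or UPGMA, the following sketch computes a pairwise distance under the Jukes-Cantor model, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites. It is written in plain C++ and is not the Bio++ implementation: the function name jc69Distance is hypothetical, and ignoring every column that contains a gap is an assumption of this example.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Bio++): Jukes-Cantor distance between
// two aligned nucleotide sequences, in expected substitutions per site.
double jc69Distance(const std::string& a, const std::string& b) {
    if (a.size() != b.size())
        throw std::invalid_argument("sequences must be aligned (same length)");

    std::size_t compared = 0;  // columns without gaps
    std::size_t differing = 0; // compared columns where the bases differ
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == '-' || b[i] == '-') continue; // assumption: skip gapped columns
        ++compared;
        if (a[i] != b[i]) ++differing;
    }
    if (compared == 0)
        throw std::invalid_argument("no comparable sites");

    const double p = static_cast<double>(differing) / static_cast<double>(compared);
    // The correction is undefined once p reaches 3/4 (saturation).
    if (p >= 0.75) return std::numeric_limits<double>::infinity();
    return -0.75 * std::log(1.0 - 4.0 * p / 3.0);
}
```

Applying such a function to every pair of sequences in an alignment yields the symmetric distance matrix that the distance methods take as input.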

Substitution models

  • JC, K80, T92, F84, HKY85, TN93, GTR and more for nucleotides.
  • JC, DSO78, JTT92 + any PAML-formatted model description for proteins, with possibility to estimate equilibrium frequencies.
  • Various codon models: Muse & Gaut 1994, Yang & Nielsen 1998, Goldman & Yang 1994 + user-defined.
  • Support for rate-across-sites models, with virtually any probability distribution, allowing for invariant classes.
  • Covarion models.
  • Models including gaps.
  • Global clock tree likelihood models.
  • Virtually any kind of non-homogeneous model is supported!
  • Mixed models, including PAML's M1, M2, M3, M7 and M8.
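
To make concrete what a nucleotide substitution model provides, the sketch below evaluates the closed-form substitution probabilities of the K80 model along a branch. It is plain C++ and independent of the Bio++ classes: the struct and function names are hypothetical, and the parameterization is an assumption of this example (d is the branch length in expected substitutions per site, kappa is the ratio of the transition rate to the rate of each transversion).

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical container (not part of Bio++) for the three distinct
// entries of the K80 probability matrix.
struct K80Probabilities {
    double same;         // the base is unchanged
    double transition;   // A<->G or C<->T
    double transversion; // each of the two possible transversions
};

// K80 substitution probabilities for branch length d and ratio kappa.
K80Probabilities k80Probabilities(double d, double kappa) {
    if (d < 0.0 || kappa <= 0.0)
        throw std::invalid_argument("d must be >= 0 and kappa must be > 0");

    const double e1 = std::exp(-4.0 * d / (kappa + 2.0));
    const double e2 = std::exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0));
    return {0.25 + 0.25 * e1 + 0.5 * e2,
            0.25 + 0.25 * e1 - 0.5 * e2,
            0.25 - 0.25 * e1};
}
```

With kappa = 1 the expressions reduce to the JC model, and for any d the four probabilities of a row (same + transition + 2 × transversion) sum to 1. More general models such as GTR have no such closed form and are usually handled through an eigen decomposition of the rate matrix.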

Molecular evolution tools

  • Parameter estimation under maximum likelihood.
  • Ancestral state reconstruction: marginal likelihood.
  • (Weighted) substitution mapping.
  • Sequence simulation under any substitution model, homogeneous or not.


Population genetics

  • A new file format to deal with codominant markers and bio-sequence data for individuals.
  • Import and export methods with various population genetics software.
  • Specific containers for polymorphism data.
  • Diversity and polymorphism statistics for codominant and sequence data.
  • Estimation of Wright F-statistics and pairwise genetic distance on codominant markers.
  • Statistics on synonymous and non-synonymous sites for coding sequences.
  • Various 'Neutrality' statistics on sequence data (Tajima, Fu and Li, Rand and Kann ...).
  • Various measures of linkage disequilibrium.
  • etc.
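
As an example of a polymorphism statistic on sequence data, the following sketch counts the segregating sites S of an alignment and derives Watterson's estimator, theta = S / a_n with a_n = 1 + 1/2 + ... + 1/(n-1) for n sequences. It is plain C++ and does not use the Bio++ polymorphism containers: both function names are hypothetical, and treating every character (including a gap) as an ordinary state is a simplifying assumption of this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Bio++): number of alignment columns
// in which at least two sequences differ.
std::size_t countSegregatingSites(const std::vector<std::string>& alignment) {
    if (alignment.empty()) return 0;
    const std::size_t length = alignment.front().size();
    for (const std::string& seq : alignment)
        if (seq.size() != length)
            throw std::invalid_argument("sequences must have equal length");

    std::size_t segregating = 0;
    for (std::size_t col = 0; col < length; ++col) {
        for (std::size_t row = 1; row < alignment.size(); ++row) {
            if (alignment[row][col] != alignment[0][col]) {
                ++segregating; // one difference is enough for this column
                break;
            }
        }
    }
    return segregating;
}

// Watterson's theta for the whole alignment (not divided by its length).
double wattersonTheta(const std::vector<std::string>& alignment) {
    if (alignment.size() < 2)
        throw std::invalid_argument("need at least two sequences");
    double harmonic = 0.0; // a_n = sum of 1/i for i = 1 .. n-1
    for (std::size_t i = 1; i < alignment.size(); ++i)
        harmonic += 1.0 / static_cast<double>(i);
    return static_cast<double>(countSegregatingSites(alignment)) / harmonic;
}
```

For the three sequences AAAA, AAAT and AATT, two columns are segregating and a_3 = 1.5, so the estimate is 2 / 1.5. Neutrality statistics such as Tajima's D build on this quantity by comparing it with the average number of pairwise differences.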


Genomics tools

  • Efficient parser for read sequences with quality scores (FastQ).
  • Efficient customizable parser for genome alignments in MAF format.
  • Classes and tools for handling features (GFF and GTF).
  • etc.


Numerical calculus

  • Numerical tools: extended functions (log, factorial, etc.)
  • Vector tools: element-wise functions, statistics (mean, var, sd, correlation, information theory)
  • Classes for matrices implementation.
  • Linear algebra: eigen decomposition, LU decomposition, inversion, etc.
  • Random number generation: Quick & Dirty (32 bits only), Wichmann and Hill, Knuth. Samplers from probability distributions (uniform, normal, gamma, etc.).
  • Function object implementation, with first and second order derivatives.
  • Numerical derivatives computation.
  • Optimization algorithms: golden section search, Brent's algorithm, Powell's method and the downhill simplex method, as well as derivative-based methods such as conjugate gradient and Newton's method. These methods are implemented as objects, using the event-driven Optimizer interface (works with Function objects).
  • Statistics: DataTable object, with I/O from CSV files, probability distributions, sampling and simulations.
  • Kernel density methods.
  • etc.
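
Among the optimization algorithms listed above, golden section search is the simplest to show in a few lines. The sketch below is plain C++ built on std::function and is not the Bio++ Optimizer interface: the function name goldenSectionMinimize and its parameters are hypothetical, and it assumes that a < b and that the function has a single minimum on [a, b].

```cpp
#include <cmath>
#include <functional>

// Hypothetical helper (not part of Bio++): minimize a unimodal function
// on [a, b] by golden section search; returns the estimated minimizer.
double goldenSectionMinimize(const std::function<double(double)>& f,
                             double a, double b, double tolerance = 1e-6) {
    const double invPhi = (std::sqrt(5.0) - 1.0) / 2.0; // 1/phi, about 0.618

    // Two interior points, placed symmetrically in the bracket.
    double c = b - invPhi * (b - a);
    double d = a + invPhi * (b - a);
    double fc = f(c);
    double fd = f(d);

    while (b - a > tolerance) {
        if (fc < fd) {
            // The minimum lies in [a, d]; the old c becomes the new d.
            b = d;
            d = c;
            fd = fc;
            c = b - invPhi * (b - a);
            fc = f(c);
        } else {
            // The minimum lies in [c, b]; the old d becomes the new c.
            a = c;
            c = d;
            fc = fd;
            d = a + invPhi * (b - a);
            fd = f(d);
        }
    }
    return (a + b) / 2.0;
}
```

Each iteration reuses one of the two previous function evaluations, so only one new evaluation is needed per step. Minimizing (x - 2)^2 on [0, 5] with this function returns a value within the tolerance of 2.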

Utils

  • Files: working on file paths, getting file extensions and names, testing existence, open and store in string arrays, etc.
  • Text: convert text to any other type and vice versa, remove spaces, tokenize, switch between upper/lower case, etc.
  • Applications: read options from a file or command line.
  • etc.

Graphical components for GUI development

These classes are developed using the Qt library.

  • Tree canvas and controllers

The PhyView phylogenetic editor is the first program to use the library. It features several methods such as tree editing, rerooting, branch length editing and subtree sampling, and allows data to be associated with a tree.