©Copyright 1999 - 2008 by PhyML Development Team.
The software PhyML is provided “as is” without warranty of any kind. In no event shall the authors or their employers be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. All parts of the source and documentation, except where indicated, are distributed under the GNU public licence. See http://www.opensource.org for details.

1 Availability

2 Authors

3 Overview

PhyML is a software package whose primary task is to estimate maximum likelihood phylogenies from alignments of nucleotide or amino-acid sequences. It provides a wide range of options that were designed to facilitate standard phylogenetic analyses. The main strength of PhyML lies in the large number of substitution models coupled to various options to search the space of phylogenetic tree topologies, going from very fast and efficient methods to slower but generally more accurate approaches. It also implements two methods to evaluate branch supports in a sound statistical framework (the non-parametric bootstrap and the approximate likelihood ratio test).

PhyML was designed to process moderate to large data sets. In theory, alignments with up to 4,000 sequences, each 2,000,000 characters long, can be analysed. In practice however, the amount of memory required to process a data set is proportional to the product of the number of sequences by their length. Hence, a large number of sequences can only be processed provided that they are short. Likewise, PhyML can handle long sequences provided that they are not numerous. With most standard personal computers, the “comfort zone” for PhyML generally lies around 100-500 sequences less than 10,000 characters long. For larger data sets, we recommend using other software such as RAxML, GARLI or Treefinder (http://www.treefinder.de).

4 Bug report

While PhyML is, of course, bug-free (!) (please read the disclaimer carefully...), if you ever come across an issue, please feel free to report it using the discussion group web site at the following address: https://groups.google.com/forum/?fromgroups#!forum/phyml-forum. Alternatively, you can send an email to s.guindon@auckland.ac.nz. Do not forget to mention the version of PhyML and the program options you are using.

5 Installing PhyML

5.1 Sources and compilation

The sources of the program are available free of charge from http://stephaneguindon.github.io/phyml-downloads/. The compilation on UNIX-like systems is fairly standard. It is described in the ‘INSTALL’ file that comes with the sources. In a command-line window, go to the directory that contains the sources and type:

./configure;
make clean;
make V=0;

By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a version of PhyML that can run through a debugging tool (such as ddd) or a profiling tool (such as gprof) using the following instructions:

./configure --enable-debug;
make clean;
make V=0;

5.2 Installing PhyML on UNIX-like systems (including Mac OS)

Copy the PhyML binary file into the directory of your choice. For the operating system to be able to locate the program, this directory must be listed in the environment variable PATH. To achieve this, add export PATH=/your_path/:$PATH to the .bashrc or the .bash_profile file located in your home directory (your_path is the path to the directory that contains the PhyML binary).

5.3 Installing PhyML on Microsoft Windows

Copy the files phyml.exe and phyml.bat into the same directory. To launch PhyML, click on the icon corresponding to phyml.bat. Clicking on the icon for phyml.exe works too, but the dimensions of the window will not fit PhyML's PHYLIP-like interface.

5.4 Installing the parallel version of PhyML

Bootstrap analysis can run on multiple processors. Each processor analyses one bootstrapped data set. Therefore, the computing time needed to perform R bootstrap replicates is divided by the number of processors available.

This feature of PhyML relies on the MPI (Message Passing Interface) library. To use it, your computer must have MPI installed on it. In case MPI is not installed, you can download it from http://www.mcs.anl.gov/research/projects/mpich2/. Once MPI is installed, it is necessary to launch the MPI daemon. This can be done by entering the following instruction: mpd &. Note however that in most cases, the MPI daemon will already be running on your server, so you most likely do not need to worry about this. You can then just go to the phyml/ directory (the directory that contains the src/, examples/ and doc/ folders) and enter the commands below:

./configure --enable-mpi;
make clean;
make;

A binary file named phyml-mpi has now been created in the src/ directory and is ready to use with MPI. A typical MPI command-line which uses 4 CPUs is given below:

mpirun -n 4 ./phyml-mpi -i myseq -b 100

Please read Section 6.4 of this document for more information.

5.5 Installing PhyML-BEAGLE

PhyML can use the BEAGLE library for the likelihood computation. BEAGLE provides a significant speed-up: the single-core version of PhyML-BEAGLE can be up to 10 times faster than PhyML on a single core and up to 150 times faster on Graphical Processing Units. PhyML-BEAGLE will eventually have all of the features of PhyML, even though at the moment the bootstrap and the invariant site options are not available. Also, please note that in some cases, the final log-likelihoods reported by PhyML and PhyML-BEAGLE may not exactly match, though the differences observed are very minor (on the order of 10^-4).

In order to install PhyML-BEAGLE, you first need to download and install the BEAGLE library available from https://code.google.com/p/beagle-lib/. Then run the following commands:

./configure --enable-beagle;
make clean;
make;

A binary file named phyml-beagle will be created in the src/ directory. The interface to phyml-beagle (i.e., the command-line options or the PHYLIP-like interface) is identical to that of PhyML.

6 Program usage

PhyML has three distinct user interfaces. The first corresponds to a PHYLIP-like text interface that makes the choice of the options self-explanatory. The command-line interface is well-suited for people who are familiar with PhyML options or for running PhyML in batch mode. The XML interface is more sophisticated. It allows the user to analyse partitioned data using flexible mixture models of evolution.

6.1 PHYLIP-like interface

The default is to use the PHYLIP-like text interface by simply typing ‘phyml’ in a command-line window or by clicking on the PhyML icon (see Section 5.3). After entering the name of the input sequence file, a list of sub-menus helps the users set up the analysis. There are currently four distinct sub-menus:

  1. Input Data: specify whether the input file contains amino-acid or nucleotide sequences, what the sequence format is (see Section 7) and how many data sets should be analysed.

  2. Substitution Model: selection of the Markov model of substitution.

  3. Tree Searching: selection of the tree topology searching algorithm.

  4. Branch Support: selection of the method that is used to measure branch support.

The ‘+’ and ‘-’ keys are used to move forward and backward in the sub-menu list. Once the model parameters have been defined, typing ‘Y’ (or ‘y’) launches the calculations. The meaning of some options may not be obvious to users who are not familiar with phylogenetics. In such situations, we strongly recommend using the default options: as long as the format of the input sequence file is correctly specified (sub-menu Input Data), the default settings are the safest choice for non-expert users. The different options provided within each sub-menu are described in what follows.

6.1.1 Input Data sub-menu

Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see Section 7). Type D to change settings.

PHYLIP format comes in two flavours: interleaved or sequential (see Section 7). Type I to select one of the two formats.

If the input sequence file contains more than one data set, PhyML can analyse each of them in a single run of the program. Type M to change settings.

This option allows you to append a string that identifies the current PhyML run. Say for instance that you want to analyse the same data set with two models. You can then ‘tag’ the first PhyML run with the name of the first model while the second run is tagged with the name of the second model.

6.1.2 Substitution model sub-menu

PhyML implements a wide range of substitution models: JC69, K80, F81, F84, HKY85, TN93, GTR and custom for nucleotides; LG, WAG, Dayhoff, JTT, Blosum62, mtREV, rtREV, cpREV, DCMut, VT, mtMAM and custom for amino acids. Cycle through the list of nucleotide or amino-acid substitution models by typing M. Both nucleotide and amino-acid lists include a ‘custom’ model.

The custom option provides the most flexible way to specify the nucleotide substitution model. The model is defined by a string made of six digits. The default string is ‘000000’, which means that the six relative rates of nucleotide changes (AC, AG, AT, CG, CT and GT) are equal. The string ‘010010’ indicates that the rates AG and CT are equal and distinct from AC=AT=CG=GT. This model corresponds to HKY85 (default) or K80 if the nucleotide frequencies are all set to 0.25. ‘010020’ and ‘012345’ correspond to the TN93 and GTR models respectively. The digit string therefore defines groups of relative substitution rates. The initial rate within each group is set to 1.0, which corresponds to F81 (JC69 if the base frequencies are equal). Users also have the opportunity to define their own initial rate values. These rates are then either optimised (option ‘O’) or fixed to their initial values. The custom option can be used to implement all substitution models that are special cases of GTR. Table [tab:modelcode] gives the correspondence between the ‘standard’ name of the model (see http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg) and the custom model code.

The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid substitution model that is not hard-coded in PhyML. The symmetric part of the rate matrix, as well as the equilibrium amino-acid frequencies, are given in a file whose name is given as input to the program. The format of this file is described in Section 7.4.
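The six-digit code of the custom nucleotide model can be read as a grouping of the six relative rates. The short Python sketch below is only an illustration of that reading, not PhyML code; the names RATE_NAMES and rate_groups are hypothetical.

```python
# Order of the six relative rates, as listed in the text.
RATE_NAMES = ("AC", "AG", "AT", "CG", "CT", "GT")


def rate_groups(code):
    """Group the six relative rates by the digit assigned to each of them."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("a custom model code is a string of six digits")
    groups = {}
    for name, digit in zip(RATE_NAMES, code):
        groups.setdefault(digit, []).append(name)
    return groups


if __name__ == "__main__":
    # '010010': transitions (AG, CT) share one rate, transversions another.
    print(rate_groups("010010"))
    # '012345': every rate is in its own group (GTR).
    print(rate_groups("012345"))
```

Rates that fall in the same group share a single parameter, so the number of groups gives the number of distinct relative rates in the model.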

For nucleotide sequences, optimising equilibrium frequencies means that the values of these parameters are estimated in the maximum likelihood framework. When the custom model option is selected, it is also possible to give the program a user-defined nucleotide frequency distribution at equilibrium (option E). For protein sequences, the stationary amino-acid frequencies are either those defined by the substitution model or those estimated by counting the number of different amino-acids observed in the data. Hence, the meaning of the F option depends on the type of the data to be processed.

Fix or estimate the transition/transversion ratio in the maximum likelihood framework. This option is only available when DNA sequences are to be analysed under the K80, HKY85 or TN93 models. PhyML defines this parameter in the same way as PAML does. Therefore, its value does not correspond to the ratio between the expected number of transitions and the expected number of transversions during a unit of time. This last definition is the one used in PHYLIP. PAML’s manual gives more detail about the distinction between the two definitions (http://abacus.gene.ucl.ac.uk/software/paml.html).
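To make the difference between the two definitions concrete, the sketch below converts a rate ratio (kappa, the PAML-style parameter) into the ratio of expected numbers of transitions and transversions under HKY85. It assumes the usual HKY85 parameterisation, in which the rate of change to nucleotide j is proportional to kappa times its frequency for transitions and to its frequency for transversions; the function name is hypothetical.

```python
def expected_ts_tv_ratio(kappa, freq_a, freq_c, freq_g, freq_t):
    """Ratio of expected transitions to expected transversions under HKY85.

    kappa is the transition/transversion *rate* ratio; the four frequencies
    are the equilibrium nucleotide frequencies and should sum to one.
    """
    purines = freq_a + freq_g
    pyrimidines = freq_c + freq_t
    transitions = kappa * (freq_a * freq_g + freq_c * freq_t)
    transversions = purines * pyrimidines
    return transitions / transversions


if __name__ == "__main__":
    # With equal frequencies (K80), the expected ratio is kappa / 2.
    print(expected_ts_tv_ratio(4.0, 0.25, 0.25, 0.25, 0.25))
```

With equal base frequencies, a rate ratio of 4 therefore corresponds to two expected transitions per expected transversion, which is why values reported under the two conventions should not be compared directly.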

The proportion of invariable sites, i.e., the expected frequency of sites that do not evolve, can be fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each site in the sequence may accumulate substitutions at some point during its evolution, even if no differences across sequences are actually observed at that site. Users can also fix this parameter to any value in the [0.0,1.0] range or estimate it from the data in the maximum-likelihood framework.

Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete gamma distribution. Type R to switch this option on or off. The different categories of this discrete distribution correspond to different (relative) rates of evolution. The number of categories of this distribution is set to 4 by default. It is probably not wise to go below this number. Larger values are generally preferred. However, the computational burden involved is proportional to the number of categories (i.e., an analysis with 8 categories will generally take twice the time of the same analysis with only 4 categories). Note that the likelihood will not necessarily increase as the number of categories increases. Hence, the number of categories should be kept below a “reasonable” number, say 20. The default number of categories can be changed by typing C.

The middle of each discretized substitution rate class can be determined using the mean or the median. PAML, MrBayes and RAxML use the mean. However, the median is generally associated with greater likelihoods than the mean. This conclusion is based on our analysis of several real-world data sets extracted from TreeBase. Despite this, the default option in PhyML is to use the mean in order to make PhyML likelihoods comparable to those of other phylogenetic software.

The shape of the gamma distribution determines the range of rate variation across sites. Small values, typically in the [0.1,1.0] range, correspond to large variability. Larger values correspond to moderate to low heterogeneity. The gamma shape parameter can be fixed by the user or estimated via maximum-likelihood. Type A to select one or the other option.

6.1.3 Tree searching sub-menu

By default, the tree topology is optimised in order to maximise the likelihood. However, it is also possible to avoid any topological alteration. This option is useful when one wants to compute the likelihood of a tree given as input (see below). Type O to select between these two options.

PhyML proposes three different methods to estimate tree topologies. The default approach is to use simultaneous NNI. This option corresponds to the original PhyML algorithm. The second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree topologies compared to NNI but is also significantly slower. The third approach, termed BEST, simply estimates the phylogeny using both methods and returns the better of the two solutions. Type S to choose among these three options.

When the SPR or the BEST options are selected, it is possible to use random trees, rather than BioNJ or a user-defined tree, as starting trees. If this option is turned on (type R to change), five trees, corresponding to five random starts, will be estimated. The output tree file will contain the best tree found among those five. The number of random starts can be modified by typing N. Setting the number of random starting trees to N means that the analysis will take (slightly more than) N times the time required for a standard analysis where only one (BioNJ) starting tree is used. However, the analysis of real data sets shows that the best trees estimated using the random start option almost systematically have higher likelihoods than those inferred using a single starting tree.

When the tree topology optimisation option is turned on, PhyML proceeds by refining an input tree. By default, this input tree is estimated using BioNJ. The alternative option is to use a parsimony tree. We found this option especially useful when analysing large data sets with NNI moves, as it generally leads to greater likelihoods than those obtained when starting from a BioNJ tree. The user can also input her/his own tree. This tree should be in Newick format (see Section 7). This option is useful when one wants to evaluate the likelihood of a given tree with a fixed topology, using PhyML. Type U to choose among these options.

6.1.4 Branch support sub-menu

The support of the data for each internal branch of the phylogeny can be estimated using non-parametric bootstrap. By default, this option is switched off. Typing B switches on the bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The larger this number, the more precise the bootstrap support estimates are. However, a phylogeny is estimated for each bootstrap replicate. Hence, the time needed to analyse N bootstrap replicates corresponds to N times the time spent on the analysis of the original data set. N=100 is generally considered a reasonable number of replicates.

When the bootstrap option is switched off (see above), approximate likelihood branch supports are estimated. This approach is considerably faster than the bootstrap one. However, the two methods estimate different quantities, and conducting a fair comparison between the two criteria is not straightforward. The estimation of approximate likelihood branch support comes in multiple flavours. The default is set to aBayes, corresponding to the approximate Bayes method. The approximate likelihood ratio test (aLRT) and Shimodaira–Hasegawa aLRT (SH-aLRT) statistics are the other available options.

6.2 Command-line interface

An alternative to the PHYLIP-like interface is the command-line interface. Users who do not need to modify the default parameters can launch the program with the ‘phyml -i seq_file_name’ command. The list of all command-line arguments and how to use them is given in the ‘Help’ section, which is displayed when entering the ‘phyml --help’ command. The available command-line options are described in what follows.

6.3 XML interface

6.4 Parallel bootstrap

Bootstrapping is a highly parallelizable task. Indeed, bootstrap replicates are independent from one another, so each replicate can be analysed separately. Modern computers often have more than one CPU, and each CPU can be used to process a bootstrap sample. Using this parallel strategy, performing R bootstrap replicates on C CPUs ‘costs’ about the same amount of computation time as processing R/C bootstrap replicates on a single CPU. In other words, for a given number of replicates, the computation time is divided by C compared to the non-parallel approach.

PhyML sources must be compiled with specific options to turn on the parallel option (see Section 5.4). Once the binary file (phyml-mpi) has been generated, running a bootstrap analysis with, say, 100 replicates on 2 CPUs can be done by typing the following commands:

mpd &
mpirun -np 2 ./phyml-mpi -i seqfile -b 100

The first command launches the MPI daemon while the second launches the analysis. Note that launching the daemon needs to be done only once. The output files are similar to the ones generated using the standard, non-parallel, analysis (see Section 7). Note that running the program in batch mode, i.e.:

mpirun -np 2 ./phyml-mpi -i seqfile -b 100 &

will probably NOT work. I do not know how to run an MPI process in batch mode yet. Suggestions welcome... Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs requested in the mpirun command.
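The constraint on the number of replicates is easy to check before launching a long run. The following Python sketch builds the mpirun command line shown in Section 5.4 and refuses replicate counts that are not a multiple of the number of CPUs. The function name and its default binary path are assumptions made for the illustration; nothing here is part of PhyML itself.

```python
def mpirun_command(seq_file, replicates, cpus, binary="./phyml-mpi"):
    """Return the mpirun argument list for a parallel bootstrap analysis."""
    if cpus < 1 or replicates < 1:
        raise ValueError("replicates and cpus must be positive")
    if replicates % cpus != 0:
        raise ValueError(
            "the number of bootstrap replicates must be a multiple "
            "of the number of CPUs"
        )
    return ["mpirun", "-n", str(cpus), binary,
            "-i", seq_file, "-b", str(replicates)]


if __name__ == "__main__":
    cmd = mpirun_command("myseq", 100, 4)
    print(" ".join(cmd))
    # Each CPU then processes replicates // cpus bootstrap samples.
    print(100 // 4, "replicates per CPU")
```

The returned list can be passed to subprocess.run as is, which avoids quoting problems with file names.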

7 Inputs & outputs for command-line and PHYLIP interface

PhyML reads data from standard text files, without the need for any particular file name extension.

7.1 Sequence formats


5 80
seq1  CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
seq2  CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
seq3  RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT
seq4  RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT

ATCKGCTTTTGGCAGGAAAT
ATCKGCTTTTGGCGGGAAAT
AGCKGCTGTTG?????????
ATCTGCTTTTGGCGGGAAAT
ATCTGCTTTTGGCGGGAAAT

5 40
seq1  CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG
seq2  CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG
seq3  RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX
seq4  RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX
seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx

[fig:aligntree]


[ This is a comment ]
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=10 NCHAR=20;
FORMAT DATATYPE=DNA;
MATRIX
tax1       ?ATGATTTCCTTAGTAGCGG
tax2       CAGGATTTCCTTAGTAGCGG
tax3       ?AGGATTTCCTTAGTAGCGG
tax4       ?????????????GTAGCGG
tax5       CAGGATTTCCTTAGTAGCGG
tax6       CAGGATTTCCTTAGTAGCGG
tax7       ???GATTTCCTTAGTAGCGG
tax8       ????????????????????
tax9       ???GGATTTCTTCGTAGCGG
tax10      ???????????????AGCGG;
END;

[ This is a comment ]
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=10 NCHAR=20;
FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3";
MATRIX
tax1       ?0320333113302302122
tax2       10220333113302302122
tax3       ?0220333113302302122
tax4       ?????????????2302122
tax5       10220333113302302122
tax6       10220333113302302122
tax7       ???20333113302302122
tax8       ????????????????????
tax9       ???22033313312302122
tax10      ???????????????02122;
END;

[ This is a comment ]
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=10 NCHAR=20;
FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03";
MATRIX
tax1       ??00030200030303010103030002030002010202
tax2       0100020200030303010103030002030002010202
tax3       ??00020200030303010103030002030002010202
tax4       ??????????????????????????02030002010202
tax5       0100020200030303010103030002030002010202
tax6       0100020200030303010103030002030002010202
tax7       ??????0200030303010103030002030002010202
tax8       ????????????????????????????????????????
tax9       ??????0202000303030103030102030002010202
tax10      ??????????????????????????????0002010202;
END;

[fig:nexus]

Alignments of DNA or protein sequences must be in PHYLIP or NEXUS sequential or interleaved format (Figures [fig:aligntree] and [fig:nexus]). For PHYLIP-formatted sequence alignments, the first line of the input file contains the number of species and the number of characters, in free format, separated by blank characters. One slight difference with PHYLIP format deals with sequence name lengths. While PHYLIP format limits this length to ten characters, PhyML can read sequence names up to one hundred characters long. Blanks and the symbols “(),:” are not allowed within sequence names because the Newick tree format makes special use of these symbols. Another slight difference with PHYLIP format is that actual sequences must be separated from their names by at least one blank character.

A PHYLIP input sequence file may also contain more than a single data set. Each of these data sets must be in PHYLIP format and two successive alignments must be separated by an empty line. Processing multiple data sets requires toggling the ‘M’ option in the Input Data sub-menu or using the ‘-n’ command-line option and entering the number of data sets to analyse. The multiple data set option can be used to process re-sampled data that were generated using a non-parametric procedure such as cross-validation or jackknife (a bootstrap option is already included in PhyML). This option is also useful in multiple gene studies, even if fitting the same substitution model to all data sets may not be suitable.
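The rules on sequence names can be checked before running an analysis. The sketch below is a minimal Python reader for the simplest case only: a sequential PHYLIP file in which each sequence sits on one line after its name. It is not PhyML's parser, and the function names are hypothetical.

```python
FORBIDDEN = set(" \t(),:")  # blanks and the symbols reserved by Newick


def valid_name(name, max_length=100):
    """Check a sequence name against the rules stated in the text."""
    return 0 < len(name) <= max_length and not (set(name) & FORBIDDEN)


def read_sequential_phylip(text):
    """Return {name: sequence} for a one-line-per-sequence PHYLIP data set."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_seq, n_char = (int(x) for x in lines[0].split())
    records = {}
    for line in lines[1:1 + n_seq]:
        name, rest = line.split(None, 1)   # name, then at least one blank
        sequence = "".join(rest.split())   # drop blanks inside the sequence
        if not valid_name(name):
            raise ValueError("invalid sequence name: " + name)
        if len(sequence) != n_char:
            raise ValueError("unexpected length for " + name)
        records[name] = sequence
    if len(records) != n_seq:
        raise ValueError("unexpected number of sequences")
    return records


if __name__ == "__main__":
    print(read_sequential_phylip("2 8\nseq1  CCATCTCA\nseq2  RCAT--CA\n"))
```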

PhyML can also process alignments in NEXUS format. Although not all the options provided by this format are supported by PhyML, a few specific features are exploited. Of course, this format can handle nucleotide and protein sequence alignments in sequential or interleaved format. It is also possible to use custom alphabets, replacing the standard 4-state and 20-state alphabets for nucleotides and amino acids respectively. Examples of a 4-state custom alphabet are given in Figure [fig:nexus]. Each state must here correspond to one digit or more. The set of states must be a list of consecutive numbers starting from 0. For instance, the list “0, 1, 3, 4” is not a valid alphabet. Each state in the symbol list must be separated from the next one by a space. Hence, alphabets with a large number of states can be easily defined by using two-digit numbers (starting with 00 up to 19 for a 20-state alphabet). Most importantly, this feature gives the opportunity to analyse data sets made of presence/absence character states (use the symbols=0 1 option for such data). Alignments made of custom-defined states will be processed using the Jukes and Cantor model. Other options of the program (e.g., number of rate classes, tree topology search algorithm) are freely configurable. Note that, at the moment, the maximum number of different states is set to 22 in order to save memory space. It is however possible to lift this threshold by modifying the value of the variable T_MAX_ALPHABET in the file ‘utilities.h’. The program will then have to be re-compiled.
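The rules for custom alphabets can be summarised in a few lines of Python. This is a sketch of the checks described above, not the code PhyML uses; the function name is hypothetical, and the requirement that all symbols have the same width is an assumption drawn from the two examples in Figure [fig:nexus].

```python
def parse_symbols(symbols, max_states=22):
    """Validate the content of a NEXUS SYMBOLS="..." list of digit states."""
    states = symbols.split()
    if not states or not all(s.isdigit() for s in states):
        raise ValueError("states must be space-separated digit strings")
    if len({len(s) for s in states}) != 1:
        raise ValueError("all states should use the same number of digits")
    if [int(s) for s in states] != list(range(len(states))):
        raise ValueError("states must be consecutive and start from 0")
    if len(states) > max_states:
        raise ValueError("too many states for the default T_MAX_ALPHABET")
    return states


if __name__ == "__main__":
    print(parse_symbols("0 1 2 3"))
    print(parse_symbols("00 01 02 03"))
    print(parse_symbols("0 1"))  # presence/absence data
```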

7.1.1 Gaps and ambiguous characters

Gaps correspond to the ‘-’ symbol. They are systematically treated as unknown characters “on the grounds that we don’t know what would be there if something were there” (J. Felsenstein, PHYLIP main documentation). The likelihood at these sites is summed over all the possible states (i.e., nucleotides or amino acids) that could actually be observed at these particular positions. Note however that columns of the alignment that display only gaps or unknown characters are simply discarded because they do not carry any phylogenetic information (they are equally well explained by any model). PhyML also handles ambiguous characters such as R for A or G (purines) and Y for C or T (pyrimidines). Tables [tab:ambigunt] and [tab:ambiguaa] give the list of valid characters/symbols and the corresponding nucleotides or amino acids.

Character  Nucleotide      Character         Nucleotide
A          Adenosine       Y                 C or T
G          Guanosine       K                 G or T
C          Cytidine        B                 C or G or T
T          Thymidine       D                 A or G or T
U          Uridine (=T)    H                 A or C or T
M          A or C          V                 A or C or G
R          A or G          - or N or X or ?  unknown
W          A or T                            (=A or C or G or T)
S          C or G

[tab:ambigunt]

Character  Amino-Acid       Character    Amino-Acid
A          Alanine          L            Leucine
R          Arginine         K            Lysine
N or B     Asparagine       M            Methionine
D          Aspartic acid    F            Phenylalanine
C          Cysteine         P            Proline
Q or Z     Glutamine        S            Serine
E          Glutamic acid    T            Threonine
G          Glycine          W            Tryptophan
H          Histidine        Y            Tyrosine
I          Isoleucine       V            Valine
L          Leucine          - or X or ?  unknown
K          Lysine                        (can be any amino acid)

[tab:ambiguaa]
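Summing the likelihood over all states compatible with an observed character is commonly implemented by giving each tip a vector of zeros and ones. The Python sketch below encodes the nucleotide table in this way. It illustrates the idea only and is not taken from PhyML's sources; the names NUCLEOTIDE_CODES and tip_vector are hypothetical.

```python
# Nucleotides compatible with each valid character (gaps are treated as unknown).
NUCLEOTIDE_CODES = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "X": "ACGT", "?": "ACGT", "-": "ACGT",
}


def tip_vector(character):
    """Return 1.0 for each of A, C, G, T compatible with the character."""
    allowed = NUCLEOTIDE_CODES[character.upper()]
    return [1.0 if state in allowed else 0.0 for state in "ACGT"]


def is_uninformative(column):
    """True when a column contains only gaps or unknown characters."""
    return all(NUCLEOTIDE_CODES[c.upper()] == "ACGT" for c in column)


if __name__ == "__main__":
    print(tip_vector("R"))           # A or G
    print(tip_vector("-"))           # any nucleotide
    print(is_uninformative("-?N-"))  # such a column would be discarded
```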

7.1.2 Specifying outgroup sequences

PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In order to do so, sequence names that display a ‘*’ character will be automatically considered as belonging to the outgroup.

The topology of the rooted tree is exactly the same as the unrooted version of the same tree. In other words, PhyML first ignores the distinction between ingroup and outgroup sequences, builds a maximum likelihood unrooted tree and then tries to add the root. If the outgroup has more than one sequence, the position of the root might be ambiguous. In such situations, PhyML tries to identify the most relevant position of the root by considering which edge provides the best separation between ingroup and outgroup taxa (i.e., we are trying to make the outgroup “as monophyletic as possible”).

7.2 Tree format

PhyML can read one or several phylogenetic trees from an input file. This option is accessible through the Tree Searching sub-menu or the ‘-u’ argument from the command line. Input trees are generally used as initial maximum likelihood estimates to be subsequently adjusted by the tree searching algorithm. Trees can be either rooted or unrooted and multifurcations are allowed. Taxa names must, of course, match the corresponding sequence names.

((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01);
((seq3,seq2),seq1,(seq4,seq5));

[fig:trees]

7.3 Multiple alignments and trees

Single or multiple sequence data sets may be used in combination with single or multiple input trees. When the number of data sets is one (nD=1) and there is only one input tree (nT=1), then this tree is simply used as input for the single data set analysis. When nD=1 and nT>1, each input tree is used successively for the analysis of the single alignment. PhyML then outputs the tree with the highest likelihood. If nD>1 and nT=1, the same input tree is used for the analysis of each data set. The last combination is nD>1 and nT>1. In this situation, the i-th tree in the input tree file is used to analyse the i-th data set. Hence, nD and nT must be equal here.
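The four combinations can be written down as a small lookup. The Python sketch below returns, for a given data set, the positions (counted from zero) of the input trees that would be used; the function name is hypothetical and the code only restates the rules above.

```python
def trees_for_data_set(data_set_index, n_data_sets, n_trees):
    """Return the 0-based indices of the input trees used for one data set."""
    if n_data_sets < 1 or n_trees < 1:
        raise ValueError("at least one data set and one tree are required")
    if n_data_sets == 1:
        # nD=1: every input tree is tried on the single alignment.
        return list(range(n_trees))
    if n_trees == 1:
        # nD>1, nT=1: the same tree is used for every data set.
        return [0]
    if n_trees != n_data_sets:
        raise ValueError("nD and nT must be equal when both exceed one")
    # nD>1, nT>1: the i-th tree goes with the i-th data set.
    return [data_set_index]


if __name__ == "__main__":
    print(trees_for_data_set(0, 1, 3))
    print(trees_for_data_set(2, 4, 1))
    print(trees_for_data_set(2, 4, 4))
```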

7.4 Custom amino-acid rate model

The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded in PhyML. This model must be time-reversible. Hence, the matrix of substitution rates is symmetrical. The format of the rate matrix with the associated stationary frequencies is identical to the one used in PAML. An example is given below:

0.55
0.51 0.64
0.74 0.15 5.43
1.03 0.53 0.27 0.03
0.91 3.04 1.54 0.62 0.10
1.58 0.44 0.95 6.17 0.02 5.47
1.42 0.58 1.13 0.87 0.31 0.33 0.57
0.32 2.14 3.96 0.93 0.25 4.29 0.57 0.25
0.19 0.19 0.55 0.04 0.17 0.11 0.13 0.03 0.14
0.40 0.50 0.13 0.08 0.38 0.87 0.15 0.06 0.50 3.17
0.91 5.35 3.01 0.48 0.07 3.89 2.58 0.37 0.89 0.32 0.26
0.89 0.68 0.20 0.10 0.39 1.55 0.32 0.17 0.40 4.26 4.85 0.93
0.21 0.10 0.10 0.05 0.40 0.10 0.08 0.05 0.68 1.06 2.12 0.09 1.19
1.44 0.68 0.20 0.42 0.11 0.93 0.68 0.24 0.70 0.10 0.42 0.56 0.17 0.16
3.37 1.22 3.97 1.07 1.41 1.03 0.70 1.34 0.74 0.32 0.34 0.97 0.49 0.55 1.61
2.12 0.55 2.03 0.37 0.51 0.86 0.82 0.23 0.47 1.46 0.33 1.39 1.52 0.17 0.80 4.38
0.11 1.16 0.07 0.13 0.72 0.22 0.16 0.34 0.26 0.21 0.67 0.14 0.52 1.53 0.14 0.52 0.11
0.24 0.38 1.09 0.33 0.54 0.23 0.20 0.10 3.87 0.42 0.40 0.13 0.43 6.45 0.22 0.79 0.29 2.49
2.01 0.25 0.20 0.15 1.00 0.30 0.59 0.19 0.12 7.82 1.80 0.31 2.06 0.65 0.31 0.23 1.39 0.37 0.31

8.66 4.40 3.91 5.70 1.93 3.67 5.81 8.33 2.44 4.85 8.62 6.20 1.95 3.84 4.58 6.95 6.10 1.44 3.53 7.09

The entry on the i-th row and j-th column of this matrix corresponds to the rate of substitutions between amino acids i and j. Only the part of the matrix below the diagonal is given, since the matrix is symmetrical. The last line in the file gives the stationary frequencies and must be separated from the rate matrix by one blank line. The ordering of the amino acids is alphabetical, i.e., Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr and Val.
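A file in this layout can be expanded into a full symmetric matrix with a few lines of Python. The sketch below is an illustration rather than PhyML's own reader: it assumes that the numbers are simply read in order (lower triangle first, then the frequencies) and it rescales the frequencies so that they sum to one, which is a choice made for this example. The function name is hypothetical.

```python
def read_rate_file(text, n_states=20):
    """Return (symmetric rate matrix, normalised frequencies) from the text."""
    numbers = [float(token) for token in text.split()]
    n_rates = n_states * (n_states - 1) // 2
    if len(numbers) != n_rates + n_states:
        raise ValueError("unexpected number of values in the rate file")
    rates = [[0.0] * n_states for _ in range(n_states)]
    k = 0
    for i in range(1, n_states):        # rows below the diagonal
        for j in range(i):
            rates[i][j] = rates[j][i] = numbers[k]
            k += 1
    frequencies = numbers[n_rates:]
    total = sum(frequencies)
    return rates, [f / total for f in frequencies]


if __name__ == "__main__":
    # Toy three-state example using the same layout as the 20-state file.
    toy = "0.5\n0.2 0.3\n\n50 30 20\n"
    matrix, freqs = read_rate_file(toy, n_states=3)
    print(matrix)
    print(freqs)
```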

7.5 Topological constraint file

PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In order to do so, one should use the --constraint_file file_name command-line option, where file_name lists the topological constraints. Such constraints are straightforward to define. For instance, the following constraints:

((A,B,C),(D,E,F));

indicate that taxa A, B and C belong to the same clade. D, E and F also belong to the same clade and the two clades hence defined should not overlap. Under these two constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that the constraints are defined using a multifurcating tree in NEWICK format. Note that this tree does not need to display the whole list of taxa. For instance, while the only taxa involved in specifying topological constraints above are A, B, C, D, E & F, the actual data set could include more than these six taxa only.
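Whether a given tree is compatible with a constraint tree can be checked by comparing the groups of taxa they define. The Python sketch below does this for topologies written without branch lengths or node labels, treating both trees as unrooted. It is one possible way to carry out the check, not the procedure implemented in PhyML, and the function names are hypothetical.

```python
def clades(newick):
    """Return the set of leaves below each pair of parentheses."""
    stack, found, name = [], [], ""
    for ch in newick:
        if ch == "(":
            stack.append(set())
        elif ch in ",);":
            if name.strip() and stack:
                stack[-1].add(name.strip())
            name = ""
            if ch == ")":
                done = stack.pop()
                found.append(frozenset(done))
                if stack:
                    stack[-1] |= done
        else:
            name += ch
    return found


def satisfies(tree, constraint):
    """True if every constraint clade is a split of the (unrooted) tree."""
    constraint_clades = clades(constraint)
    taxa = frozenset().union(*constraint_clades)
    # Restrict the tree clades to the taxa named in the constraint.
    restricted = {c & taxa for c in clades(tree)}
    for clade in constraint_clades:
        if clade == taxa:
            continue
        if clade not in restricted and (taxa - clade) not in restricted:
            return False
    return True


if __name__ == "__main__":
    constraint = "((A,B,C),(D,E,F));"
    print(satisfies("((A,B),D,((E,F),C));", constraint))
    print(satisfies("(((A,B),C),G,(D,(E,F)));", constraint))
```

The first tree is the invalid example given in the text; the second contains an extra taxon G that the constraint does not mention.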

PhyML tree topology search algorithms all rely on improving a starting tree. By default, BioNJ is the method of choice for building this tree. However, there is no guarantee that the phylogeny estimated in this way complies with the topological constraints. While it is probably possible to implement BioNJ with topological constraints, we have not done so yet. Instead, the same multifurcating tree that defines the topological constraints should also be used as starting tree using the -u (--inputtree) option. Altogether, the command line should look like the following: -u=file_name --constraint_file=file_name. It is not possible to use as input tree a non-binary phylogeny that is distinct from that provided in the constraint tree file. However, any binary tree compatible with the constraint tree can be used as input tree.

7.6 Output files

Sequence file name: ‘seq’

Output file name           Content
seq_phyml_tree             ML tree
seq_phyml_stats            ML model parameters
seq_phyml_boot_trees       ML trees – bootstrap replicates
seq_phyml_boot_stats       ML model parameters – bootstrap replicates
seq_phyml_rand_trees       ML trees – multiple random starts
seq_phyml_ancestral_seq    ML trees – ancestral sequences

[tab:output]

Table [tab:output] presents the list of files resulting from an analysis. Basically, each output file name can be divided into three parts. The first part is the sequence file name, the second part corresponds to the extension ‘_phyml_’ and the third part is related to the file content. When launched with the default options, PhyML only generates two files: the tree file and the model parameter file. The estimated maximum likelihood tree is in standard Newick format (see Figure [fig:trees]). The model parameters file, or statistics file, displays the maximum likelihood estimates of the substitution model parameters, the likelihood of the maximum likelihood phylogenetic model, and other important information concerning the settings of the analysis (e.g., type of data, name of the substitution model, starting tree, etc.). Two additional output files are created if bootstrap supports were evaluated. These files simply contain the maximum likelihood trees and the substitution model parameters estimated from each bootstrap replicate. Such information can be used to estimate sampling errors around each parameter of the phylogenetic model. When the random tree option is turned on, the maximum likelihood trees estimated from each random starting tree are printed in a separate tree file (see the seq_phyml_rand_trees row of Table [tab:output]).
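The naming scheme can be summarised with a small Python helper. This is a sketch built only from the table and the paragraph above; the function name and its two flags are hypothetical, and the ancestral sequence file is left out because the text does not say which option triggers it.

```python
def expected_output_files(seq_file, bootstrap=False, random_starts=False):
    """List the output file names expected for a given sequence file name."""
    names = [seq_file + "_phyml_tree", seq_file + "_phyml_stats"]
    if bootstrap:
        names += [seq_file + "_phyml_boot_trees", seq_file + "_phyml_boot_stats"]
    if random_starts:
        names.append(seq_file + "_phyml_rand_trees")
    return names

print(expected_output_files("seq", bootstrap=True))
```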

PhyML estimates ancestral sequences by calculating the marginal probability of each character state at each internal node of the phylogeny. These probabilities are given in the file seq_phyml_ancestral_seq. The bulk of this file is a table where each row corresponds to a site in the original alignment and a number identifying an internal node. It is relatively straightforward to identify which number corresponds to which node in the tree by examining the information provided at the beginning of seq_phyml_ancestral_seq. This section of the file displays the tree structure in terms of a list of node numbers rather than in the NEWICK format. For instance, the tree (A,B,(C,D)); corresponds to the following list of nodes:


Node nums:   0   4  (dir:  0) names = 'A'      '(null)';
Node nums:   4   2  (dir:  1) names = '(null)' 'B';
Node nums:   4   5  (dir:  2) names = '(null)' '(null)';
Node nums:   5   1  (dir:  0) names = '(null)' 'C';
Node nums:   5   3  (dir:  1) names = '(null)' 'D';

The two integers following Node nums are the node numbers. They are displayed in a recursive manner. The number on the left column is that of the ancestral node while the one on the right column is the direct descendant. The following columns are the node names. These names are set to null except for the tip nodes, where the corresponding taxon names are displayed.
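The node list can be turned into an edge list and a table of tip names with a few lines of Python. The sketch below assumes that the lines look exactly like the five shown above; the regular expression and the function name are illustrative only.

```python
import re

NODE_RE = re.compile(
    r"Node nums:\s*(\d+)\s+(\d+)\s+\(dir:\s*(\d+)\)\s+names = '([^']*)'\s+'([^']*)';"
)

def parse_node_list(text):
    """Return (edges, names): (ancestor, descendant) pairs and node number -> taxon name."""
    edges, names = [], {}
    for line in text.splitlines():
        m = NODE_RE.search(line)
        if m is None:
            continue                       # not a "Node nums" line
        anc, desc = int(m.group(1)), int(m.group(2))
        edges.append((anc, desc))
        for num, name in ((anc, m.group(4)), (desc, m.group(5))):
            if name != "(null)":           # only tip nodes carry a name
                names[num] = name
    return edges, names

example = """
Node nums:   0   4  (dir:  0) names = 'A'      '(null)';
Node nums:   4   2  (dir:  1) names = '(null)' 'B';
Node nums:   4   5  (dir:  2) names = '(null)' '(null)';
Node nums:   5   1  (dir:  0) names = '(null)' 'C';
Node nums:   5   3  (dir:  1) names = '(null)' 'D';
"""
edges, names = parse_node_list(example)
print(edges)
print(names)
```

Nodes that never receive a name (4 and 5 here) are the internal nodes for which ancestral states are reported.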

7.7 Treatment of invariable sites with fixed branch lengths

PhyML allows users to give an input tree with fixed topology and branch lengths and find the proportion of invariable sites that maximises the likelihood (option -o r). These two options can be considered as conflicting since branch lengths depend on the proportion of invariants. Hence, changing the proportion of invariants implies that branch lengths are changing too. More formally, let l denote the length of a branch, i.e., the expected number of substitutions per site, and p be the proportion of invariants. We have l = (1 - p)lʹ, where lʹ is the expected number of substitutions per _variable_ site. When asked to optimize p but leave l unchanged, PhyML does the following:

  1. Calculate lʹ = l/(1 - p) and leave lʹ unchanged throughout the optimization.

  2. Find the value of p that maximises the likelihood. Let p* denote this value.

  3. Set l* = (1 - p*)lʹ and print out the tree with l* (instead of l).

PhyML therefore assumes that the user wants to fix the branch lengths measured at _variable_ sites only (i.e., lʹ is fixed). This is the reason why the branch lengths in the input and output trees do differ despite the use of the -o r option. While we believe that this approach relies on a sound rationale, it is not perfect. In particular, the original transformation of branch lengths (lʹ = l/(1 - p)) relies on a default value for p which is set to 0.2 in practice. It is difficult to justify the use of this value rather than another one. One suggestion proposed by Bart Hazes is to avoid fixing the branch lengths altogether and rather estimate the value of a scaling factor applied to each branch length in the input tree (option --constrained_lens). We agree that this solution probably matches most users' expectations very well, i.e., “find the best value of p while constraining the ratio of branch lengths to be that given in the input tree”. Please feel free to send us your suggestions regarding this problem by posting on the forum (http://groups.google.com/group/phyml-forum).
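The effect of this procedure on a single branch can be worked out numerically. The Python sketch below applies steps 1 and 3 only; the optimisation of step 2 is not performed and p* is simply passed in. The function name is hypothetical, and the default of 0.2 is the value quoted in the text.

```python
DEFAULT_PINV = 0.2   # default value of p mentioned in the text

def rescale_branch_length(l, p_star, p_init=DEFAULT_PINV):
    """Return l* given the input length l and the optimised proportion p*."""
    l_variable = l / (1.0 - p_init)        # step 1: length measured at variable sites
    return (1.0 - p_star) * l_variable     # step 3: length printed in the output tree

# An input branch of length 0.1 with an optimised proportion of invariants of 0.4:
# l' = 0.1 / 0.8 = 0.125 and l* = 0.6 * 0.125 = 0.075.
print(rescale_branch_length(0.1, 0.4))
```

The output length equals the input length only when p* happens to equal the default value.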

8 Inputs & outputs for the XML interface

8.1 Mixture models in PhyML

[sec:mixtures]

PhyML implements a wide range of mixture models. The discrete gamma model is arguably the most popular of these models in phylogenetics. However, in theory, mixture models are not restricted to the description of the variation of substitution rates across sites. For instance, if there are good reasons to believe that the relative rates of substitution between nucleotides vary along the sequence alignments, it makes sense to use a mixture of GTR models. Consider the case where substitutions between A and C occur at a high rate in some regions of the alignment and at a low rate elsewhere. A mixture with two classes, each class having its own GTR rate matrix, would then be suitable. The likelihood at any site of the alignment is then obtained by averaging the likelihoods obtained for each GTR rate matrix, with the same weight given to each of these matrices.

PhyML implements a generic framework that allows users to define mixtures on substitution rates, rate matrices and nucleotide or amino-acid equilibrium frequencies. Each class of the mixture model is built by assembling a substitution rate, a rate matrix and a vector of equilibrium frequencies. For instance, let {R1,R2,R3} be a set of substitution rates, {M1,M2} a set of rate matrices and {F1,F2} a set of vectors of equilibrium frequencies. One could then define the first class of the mixture model as 𝒞1={R1,M1,F1}, a second class as 𝒞2={R2,M1,F1}, and a third class as 𝒞3={R3,M2,F2}. If R1, R2 and R3 correspond to slow, medium and fast substitution rates, then this mixture model allows the fast evolving rates to have their own vector of equilibrium frequencies and rate matrix, distinct from that found at the medium or slow evolving sites. The likelihood at any given site Ds of the alignment is then:

$$\Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s \mid \mathcal{C}_s = c)\,\Pr(\mathcal{C}_s = c),$$

where Pr(𝒞s=c) is obtained by multiplying the probability (density) of the three components (i.e., rate, matrix, frequencies). For instance, Pr(𝒞1={R1,M1,F1})=Pr(R1)×Pr(M1)×Pr(F1). We therefore assume here that substitution rates, rate matrices and equilibrium frequencies are independent from one another.

Note that, using the same substitution rates, rate matrices and vector of equilibrium frequencies, it is possible to construct many other mixture models. For instance, the mixture model with the largest number of classes can be created by considering all the combinations of these three components. We would then get a mixture of 3×2×2=12 classes, corresponding to all the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies.
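The construction of classes and the averaging over classes can be sketched in Python. The component names are those used in the text; the conditional likelihood values are made up for the example and the function name is hypothetical.

```python
from itertools import product

rates = ["R1", "R2", "R3"]
matrices = ["M1", "M2"]
freqs = ["F1", "F2"]

# The three-class mixture used in the text.
classes = [("R1", "M1", "F1"), ("R2", "M1", "F1"), ("R3", "M2", "F2")]

# The largest mixture: every combination of rate, matrix and frequency vector.
all_classes = list(product(rates, matrices, freqs))
print(len(all_classes))   # 3 x 2 x 2 combinations

def site_likelihood(cond_lk, class_prob):
    """Sum over classes of Pr(D_s | class) * Pr(class)."""
    return sum(cond_lk[c] * class_prob[c] for c in class_prob)

# Made-up conditional likelihoods and equal class probabilities.
cond_lk = dict(zip(classes, [0.01, 0.02, 0.03]))
class_prob = {c: 1.0 / len(classes) for c in classes}
print(site_likelihood(cond_lk, class_prob))
```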

8.2 Partitions

We first introduce some terms of vocabulary that have not been presented before. A partitioned data set, also referred to as partition, is a set of partition elements. Typically, a partitioned data set will be made of a set of distinct gene alignments. A partition element will then correspond to one (or several) of these gene alignments. Note that the biology literature often uses the term partition to refer to an element of a partitioned data set. We thus use here instead the mathematical definition of the terms ‘partition’ and ‘partition element’.

Phylogenetic models usually assume individual columns of an alignment to evolve independently from one another. Codon-based models are exceptions to this rule since the substitution process applies here to triplets of consecutive sites of coding sequences. The non-independence of the substitution process at the three coding positions (due to the specificities of the genetic code) can therefore be accounted for. Assuming that sites evolve independently does not mean that a distinct model is fitted to each site of the alignment. Estimating the parameters of these models would not make much sense in practice due to the very limited amount of phylogenetic signal conveyed by individual sites. Site independence means instead that the columns of the observed alignment were sampled randomly from the same “population of columns”. The stochasticity of the substitution process running along the tree is deemed responsible for the variability of site patterns.

Some parameters of the phylogenetic model are considered to be common to all the sites in the alignment. The tree topology is typically one such parameter. The transition/transversion ratio is also generally assumed to be the same for all columns. Other parameters can vary from site to site. The rate at which substitutions accumulate is one of these parameters. Hence, different sites can have distinct rates. However, such rates are all “drawn” from the same probabilistic distribution (generally a discrete Gamma density). Hence, while different sites may have distinct rates of evolution, they all share the same distribution of rates.

This reasoning also applies on a larger scale. When analysing multiple genes, one can indeed assume that the same mechanism generated the different site patterns observed for every gene. Here again, we can assume that all the genes share the same underlying tree topology (commonly referred to as the “species tree”). Other parameters of the phylogenetic model, such as branch lengths for instance, might be shared across genes. However, due to the specificities of the gene evolution processes, some model parameters need to be adjusted for each gene separately. To sum up, the phylogenetic analysis of partitioned data requires flexible models with parameters, or distributions of parameters, shared across several partition elements and other parameters estimated separately for each element of the partition.

The likelihood of a data set made of the concatenation of n sequence alignments noted D(1), D(2), ..., D(n) is then obtained as follows:

$$\Pr(D^{(1)}, D^{(2)}, \ldots, D^{(n)}) = \prod_{i=1}^{n} \Pr(D^{(i)}) = \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D_s^{(i)}),$$

where Li is the number of site columns in partition element i. Pr(Ds(i)) is then obtained using Equation [equ:mixtlk], i.e., by summing over the different classes of the mixture model that applies to site s for partition element i. Hence, the joint probability of all the partition elements is here broken down into the product of the likelihoods of every site for each partition element. As noted just above, any given component of the mixture model at a given site is shared by the other sites that belong to the same partition element and, for some of them, by sites in other partition elements (e.g., the same tree topology is shared by all the sites, throughout all the partition elements).
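In practice such products are accumulated on the log scale. The Python sketch below computes the log-likelihood of a partitioned data set from per-site, per-class quantities that are assumed to be already available; the nested-list data layout and the numbers are assumptions made for the example.

```python
import math

def log_likelihood(partition):
    """partition: list of partition elements; each element is a list of sites;
    each site is a list of (Pr(D_s | class), Pr(class)) pairs."""
    total = 0.0
    for element in partition:                 # product over partition elements
        for site in element:                  # product over sites
            total += math.log(sum(lk * w for lk, w in site))   # sum over classes
    return total

# Two partition elements with one site each and two classes per site.
data = [
    [[(0.1, 0.5), (0.3, 0.5)]],   # site likelihood 0.2
    [[(0.4, 0.5), (0.6, 0.5)]],   # site likelihood 0.5
]
print(log_likelihood(data))       # log(0.2) + log(0.5)
```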

PhyML implements a wide variety of partition models. The only parameter that is constrained to be shared by all the partition elements is the tree topology. This constraint makes sense when considering distantly related taxa, typically inter-species data. For closely related taxa, i.e., when analysing intra-species or population-level data, not all the genes might have the same evolutionary history. Recombination events combined with the incomplete lineage sorting phenomenon can generate discrepancies between the gene trees and the underlying species tree. The phylogenetic software packages BEST, STEM and *BEAST are dedicated to the estimation of species tree phylogenies from the analysis of multi-gene data and allow gene-tree topologies to vary across genes.

Aside from the tree topology that is common to all the sites and all the partition elements, other parameters of the phylogenetic model can be either shared across partition elements or estimated separately for each of these. When analysing three partition elements, A, B and C for instance, PhyML can fit a model where the same set of branch lengths applies to A and B while C has its own estimated lengths. The same goes for the substitution model: the same GTR model, with identical parameter values, can be fitted to A and C and JC69 for instance can be used for B. The sections below give more detailed information on the range of models available and how to set up the corresponding XML configuration files to implement them.

8.3 Combining mixture and partitions in PhyML: the theory

The rationale behind mixture models as implemented in PhyML lies in (1) the definition of suitable rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly of these components so as to create the classes of a mixture. The main idea behind partitioned analysis in PhyML lies in (1) the hypothesis of statistical independence of the different data partition elements and (2) the possibility for distinct data partition elements to share model components such as rate matrices, equilibrium frequencies or distributions of rates across sites. More formally, the likelihood of a data set made of n partition elements is written as follows:

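$$\Pr(D^{(1)}, D^{(2)}, \ldots, D^{(n)}) = \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D_s^{(i)}) = \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D_s^{(i)} \mid \mathcal{C} = c)\,\Pr(\mathcal{C} = c),$$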

where Li is the number of sites in partition element i and Ki is the number of classes in the mixture model that applies to this same partition element. Each class of a mixture is made of a rate matrix M, a vector of equilibrium frequencies F and a relative rate of substitution R. Branch lengths L and tree topology τ are also required for the calculation of the likelihood. Hence we have:

$$\Pr(D^{(1)}, D^{(2)}, \ldots, D^{(n)}) = \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D_s^{(i)} \mid \mathcal{C} = c)\,\Pr(\mathcal{C} = c) = \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m=1}^{\mathcal{M}_i} \sum_{f=1}^{\mathcal{F}_i} \sum_{r=1}^{\mathcal{R}_i} \Pr(D_s^{(i)} \mid M_m^{(i)}, F_f^{(i)}, R_r^{(i)}, L^{(i)}, \tau)\,\Pr(M_m^{(i)}, F_f^{(i)}, R_r^{(i)})\,\mathbf{1}(m,f,r,i)$$

where ℳi, ℱi and ℛi are the numbers of rate matrices, vectors of equilibrium frequencies and relative rates that apply to partition element i respectively. 𝟏(m,f,r,i) is an indicator function that takes value 1 if the combination Mm, Ff and Rr is actually defined in the model for this particular partition element i. Its value is 0 otherwise. In the example given in section [sec:mixtures] {R1,R2,R3} is the set of substitution rates, {M1,M2} the set of rate matrices and {F1,F2} the set of vectors of equilibrium frequencies. We then define the first class of the mixture model as 𝒞1={R1,M1,F1}, a second class as 𝒞2={R2,M1,F1} and the third as 𝒞3={R3,M2,F2}. Hence, we have 𝟏(1,1,1,i), 𝟏(1,1,2,i) and 𝟏(2,2,3,i) equal to one while the nine other values that this indicator function takes, corresponding to the possible combinations of two vectors of frequencies, two matrices and three rates, are all zero.

As stated before, our implementation assumes that the different components of a mixture are independent. In other words, we have Pr(Mm(i),Ff(i),Rr(i))=Pr(Mm(i))×Pr(Ff(i))×Pr(Rr(i)). In practice, the joint probability Pr(Mm(i),Ff(i),Rr(i)) is obtained as follows:

$$\Pr(M_m^{(i)}, F_f^{(i)}, R_r^{(i)}) = \frac{\Pr(M_m^{(i)})\,\Pr(F_f^{(i)})\,\Pr(R_r^{(i)})}{\sum_{m,f,r} \Pr(M_m^{(i)})\,\Pr(F_f^{(i)})\,\Pr(R_r^{(i)})\,\mathbf{1}(m,f,r,i)}$$

The probabilities Pr(Mm(i)), Pr(Ff(i)) and Pr(Rr(i)), also called ‘weights’, can be fixed or estimated from the data.
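The normalisation above can be illustrated with a short Python sketch. It takes the weights of the individual components and the list of classes that are actually defined (the combinations for which the indicator function equals one) and returns the class probabilities. The weight values are invented for the example and the function name is hypothetical.

```python
def class_weights(classes, p_rate, p_matrix, p_freq):
    """Normalised probability of each defined class (rate, matrix, frequencies)."""
    raw = {(r, m, f): p_rate[r] * p_matrix[m] * p_freq[f] for r, m, f in classes}
    total = sum(raw.values())          # sum restricted to the defined combinations
    return {c: w / total for c, w in raw.items()}

classes = [("R1", "M1", "F1"), ("R2", "M1", "F1"), ("R3", "M2", "F2")]
weights = class_weights(
    classes,
    p_rate={"R1": 0.3, "R2": 0.4, "R3": 0.3},
    p_matrix={"M1": 0.5, "M2": 0.5},
    p_freq={"F1": 0.5, "F2": 0.5},
)
for c, w in weights.items():
    print(c, round(w, 3))
```

Because only three of the twelve possible combinations are defined here, the raw products sum to 0.25 and the division rescales them so that the class probabilities sum to one.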

8.4 The XML format and its use in PhyML

The few paragraphs below are largely inspired by the Wikipedia page that describes the XML format (http://en.wikipedia.org/wiki/XML). XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. An XML document is divided into markup and content, which may be distinguished by the application of simple syntactic rules. Generally, strings that constitute markup begin with the character ‘<’ and end with a ‘>’. Strings of characters that are not markup are content:

<markup>
 content
</markup>

A markup construct that begins with ‘<’ and ends with ‘>’ is called a tag. Tags come in three flavors: (1) start-tags (e.g., <section>), (2) end-tags (e.g., </section>) and (3) empty-element tags (e.g., <line-break />). An element either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start- and end-tags, if any, are the element’s content, and may contain markup, including other elements, which are called child elements. A start-tag or empty-element tag can also carry attributes, i.e., name/value pairs. In the following example, the element img has two attributes, src and alt: <img src="madonna.jpg" alt="Foligno Madonna, by Raphael"/>. Another example would be <step number="3">Connect A to B.</step> where the name of the attribute is “number” and the value is “3”.
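These notions map directly onto what a generic XML parser returns. The snippet below uses Python's standard xml.etree.ElementTree module, purely as an illustration of the vocabulary; PhyML relies on its own parser.

```python
import xml.etree.ElementTree as ET

step = ET.fromstring('<step number="3">Connect A to B.</step>')
print(step.tag)                # name of the element
print(step.attrib["number"])   # value of the attribute called "number"
print(step.text)               # content found between the start- and end-tags
```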

In practice, building a mixture model in an XML file readable by PhyML is relatively straightforward. The first step is to define the different components of each class of the mixture. Consider for instance that the fitted model will have a Gamma distribution with four classes plus a proportion of invariants. The rate component of the mixture can then be specified using the following XML code:


<siterates id="SiteRates1">
  <weights  id="Distrib" family="gamma+inv" alpha=".1" optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
  </weights>
  <instance id="R1" init.value="1.0"/>
  <instance id="R2" init.value="1.0"/>
  <instance id="R5" init.value="0.0"/>
  <instance id="R3" init.value="1.0"/>
  <instance id="R4" init.value="1.0"/>
</siterates>

In the example above, the <siterates> component completely defines a model of substitution rate variation across sites. This component has a particular identity, i.e., a name associated to it (“SiteRates1” here), which is not mandatory. This <siterates> component has six sub-components. The first is the <weights> component, followed by five <instance> components. The <weights> component defines the type of distribution that characterizes the variation of rates across sites. A discrete Gamma plus invariants is used here. Two parameters specify this distribution: the gamma shape parameter and the proportion of invariants. Their initial values are set by using the corresponding attributes and attribute values (alpha=0.1 and pinv=0.4). Also, PhyML can optimise these parameters so as to maximise the likelihood of the whole phylogenetic model (optimise.pinv=yes and optimise.alpha=yes). The following five <instance> components define the rate classes themselves. The id attribute is here mandatory and must be unique to each class. Note that one of the initial (relative) rates (init.value attribute) is set to zero. The corresponding rate class (the third in this example) will then correspond to the invariant site category.

Having specified the part of the phylogenetic model that describes the variation of rates across sites, we can now move on to build the rest of the model. The component below defines two substitution models:


<ratematrices id="RateMatrices">
  <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/>
  <instance id="M2" model="GTR" optimise.rr="yes"/>
</ratematrices>

This <ratematrices> component sets out a list of substitution models (HKY85 and GTR here). Here again, the different elements in this list correspond to the <instance> sub-components. Each instance must have a unique id attribute for a reason that will become obvious shortly. The remaining attributes and their functions are described in Section [sec:xmlratematrices].

The next “ingredient” in our phylogenetic model is a set of vectors of nucleotide frequencies. The <equfreqs> component below specifies two such vectors:


<equfreqs id="EquFreq">
  <instance id="F1"/>
  <instance id="F2"/>
</equfreqs>

Now, we need to assemble these three components (rate variation across sites, rate matrices and vectors of equilibrium frequencies) into a mixture model. The <partitionelem> component below defines one such model:


<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="M1, M1, M1, M2, M2"/>
  <mixtureelem list="F1, F2, F1, F2, F2"/>
</partitionelem>

The <partitionelem> component defines a particular partition element. In this example, the partition element corresponds to the sequence file called nucleic.txt, which is an alignment of nucleotide sequences (see the data.type attribute value). The <mixtureelem> components are sub-components of the <partitionelem> component. Each <mixtureelem> has a list attribute. Each such list gives the IDs of components that have been defined before. For instance, the first <mixtureelem> refers to the five classes of the <siterates> component. The ordering of the different terms in these lists matters a lot since it is directly related to the elements in each class of the mixture model. Hence, the first element in the list attribute of the first <mixtureelem>, the first element in the list attribute of the second <mixtureelem> and the first element in the list attribute of the third <mixtureelem> together define the first class of the mixture model. Therefore, the mixture model defined above has five classes: 𝒞1={R1,M1,F1}, 𝒞2={R2,M1,F2}, 𝒞3={R3,M1,F1}, 𝒞4={R4,M2,F2} and 𝒞5={R5,M2,F2}.
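The column-wise reading of the list attributes can be reproduced with a few lines of Python. The sketch below parses the partition element shown above with the standard xml.etree.ElementTree module and zips the three lists together; it only illustrates the rule and is not how PhyML itself processes the file. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """
<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="M1, M1, M1, M2, M2"/>
  <mixtureelem list="F1, F2, F1, F2, F2"/>
</partitionelem>
"""

def mixture_classes(partitionelem):
    """Return one tuple of component ids per class of the mixture."""
    columns = [[x.strip() for x in me.get("list").split(",")]
               for me in partitionelem.findall("mixtureelem")]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all list attributes must have the same length")
    return list(zip(*columns))   # n-th class = n-th term of every list

classes = mixture_classes(ET.fromstring(XML))
for k, c in enumerate(classes, start=1):
    print(f"C{k} =", c)
```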

8.5 Setting up mixture and partition models in PhyML: the basics

Mixture models are particularly relevant to the analysis of partitioned data. Indeed, some features of evolution are gene-specific (e.g., substitution rates vary across genes). Models that can accommodate such variation, as mixture models do, are therefore relevant in this context. However, other evolutionary features are shared across loci (e.g., genes located in the same genomic region usually have similar GC contents). As a consequence, some components of mixture models need to be estimated separately for each partition element while others should be shared by different partition elements.

Below is a simple example with a partitioned data set made of two elements:


<branchlengths id="BranchLens">
  <instance id="L1"/>
  <instance id="L2"/>
</branchlengths>

<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="L1, L1, L1, L1, L1"/>
</partitionelem>

<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="L2, L2, L2, L2, L2"/>
</partitionelem>

Mixture elements with names R1, ..., R5 refer to the Γ4+I model defined previously (see Section [sec:XML format]). The <branchlengths> XML component defines a mixture element that had not been introduced before. It defines vectors of branch lengths that apply to the estimated phylogeny. Two instances of such vectors are defined: L1 and L2. When examining the two partition elements (<partitionelem> component), it appears that L1 is associated with Part1 while L2 is associated with Part2. Hence, branch lengths will be estimated separately for these two partition elements.

Note that a given partition element can only have one branchlengths instance associated to it. For instance, the example given below is not valid:


<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="L1, L1, L1, L2, L2"/>
</partitionelem>

In other words, mixtures of branch lengths are forbidden. One reason for this restriction is that mixtures of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets of branch lengths have the same likelihood). But mostly, combining mixtures of branch lengths with mixtures of rates appears to be a deadly combination. Consider for instance the following model:


<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3"/>
  <mixtureelem list="L1, L2, L3"/>
</partitionelem>

It is here impossible to tell apart branch lengths and substitution rates. Such a model is strongly non-identifiable and therefore not relevant.
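A small numerical check makes the problem concrete. Under the JC69 model, the probability of observing the same nucleotide at both ends of a branch depends only on the product of the relative rate and the branch length, so a class with rate 2 and length 0.1 cannot be distinguished from a class with rate 1 and length 0.2. JC69 and the numbers below are chosen only for illustration; the function name is hypothetical.

```python
import math

def jc69_same(rate, length):
    """JC69 probability that both ends of a branch carry the same nucleotide."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * length / 3.0)

a = jc69_same(rate=2.0, length=0.1)
b = jc69_same(rate=1.0, length=0.2)
print(a, b, math.isclose(a, b))
```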

In the example given above, the same Γ4+I model (i.e., the same gamma shape parameter and proportion of invariants) applies to the two partition elements. It is possible to use two distinct Γ4+I models instead using the following XML code:


<siterates id="SiteRates1">
  <weights  id="Distrib1" family="gamma+inv" alpha=".1" optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
  </weights>
  <instance id="R1" init.value="1.0"/>
  <instance id="R2" init.value="1.0"/>
  <instance id="R5" init.value="0.0"/>
  <instance id="R3" init.value="1.0"/>
  <instance id="R4" init.value="1.0"/>
</siterates>

<siterates id="SiteRates2">
  <weights  id="Distrib2" family="gamma+inv" alpha=".1" optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
  </weights>
  <instance id="R6"  init.value="1.0"/>
  <instance id="R7"  init.value="1.0"/>
  <instance id="R8"  init.value="0.0"/>
  <instance id="R9"  init.value="1.0"/>
  <instance id="R10" init.value="1.0"/>
</siterates>

<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
  <mixtureelem list="R1, R2, R3, R4, R5"/>
  <mixtureelem list="L1, L1, L1, L1, L1"/>
</partitionelem>

<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
  <mixtureelem list="R6, R7, R8, R9, R10"/>
  <mixtureelem list="L2, L2, L2, L2, L2"/>
</partitionelem>

SiteRates1 and SiteRates2 here define two distinct Γ4+I models. Each of these models applies to one of the two partition elements (nucleic1.txt and nucleic2.txt), allowing them to display different patterns of rate variation across sites.

8.6 XML options

8.6.1 phyml component

Options:

8.6.2 topology component

Options:


<topology>
  <instance id="T1" init.tree="bionj" optimise.tree="true" search="spr"/>
</topology>

8.6.3 ratematrices component

[sec:xmlratematrices] Options:

The ratematrices component has the attribute optimise.weights=yes/no (default is no). If optimise.weights=yes, then the probabilities (or weights) of each matrix in the set of matrices defined by this component (see Equation [equ:weights]) will be estimated from the data.


  <ratematrices id="RM1" optimise.weights="yes">
    <instance id="M1" model="custom" model.code="000000"/>
    <instance id="M2" model="GTR" optimise.rr="yes"/>
    <instance id="M3" model="WAG"/>
  </ratematrices>

8.6.4 equfreqs component

Options:

The equfreqs component has the attribute optimise.weights=yes/no (default is no). If optimise.weights=yes, then the probabilities (or weights) of each vector of equilibrium frequencies in the set of vectors defined by this component (see Equation [equ:weights]) will be estimated from the data.


  <equfreqs id="EF1" optimise.weights="yes">
    <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/>
    <instance id="F2" aa.freqs="empirical"/>
    <instance id="F3" optimise.freq="yes"/>
  </equfreqs>

8.6.5 branchlengths component

Options:


  <branchlengths id="BL1">
    <instance id="L1" optimise.lens="yes"/>
    <instance id="L2"/>
    <instance id="L3" optimise.lens="false"/>
  </branchlengths>

8.6.6 siterates component

Options:

A siterates component generally includes a weights element that specifies the probabilistic distribution of the relative rates. The available options for such an element are:


  <siterates id="SR1">
    <instance id="R1" init.value="1.0"/>
    <instance id="R2" init.value="1.0"/>
    <instance id="R3" init.value="1.0"/>
    <instance id="R4" init.value="1.0"/>
    <weights  id="D1" family="gamma" optimise.alpha="yes" optimise.pinv="no">
    </weights>
  </siterates>

8.6.7 partitionelem and mixtureelem components

Options:

Each partitionelem element should include exactly four mixtureelem elements, corresponding to branch lengths, equilibrium frequencies, substitution rate model and tree topology. The order in which the mixtureelem elements are given does not matter, though exceptions apply for the Γ+I model (see below). The n-th element in the list attribute of each mixtureelem defines the n-th class of the mixture model. In the example given below, the first class of the mixture is made of the following elements: T1, F1, R1 and L1, the second class is made of T1, F1, R2 and L1, etc.


  <partitionelem id="partition1" file.name="./small_p1.nxs" data.type="nt" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1"/>
    <mixtureelem list="F1, F1, F1, F1"/>
    <mixtureelem list="R1, R2, R3, R4"/>
    <mixtureelem list="L1, L1, L1, L1"/>
  </partitionelem>

In general, the ordering of the mixtureelem elements does not matter. However, when the model has invariable sites, then the corresponding class should not be first in the list of classes provided by mixtureelem. For instance, in the example above, if the rates are defined as follows:


  <siterates id="SR1">
    <instance id="R1" init.value="0.0"/>
    <instance id="R2" init.value="1.0"/>
    <instance id="R3" init.value="1.0"/>
    <instance id="R4" init.value="1.0"/>
    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" optimise.pinv="no">
    </weights>
  </siterates>

then R1 corresponds to the invariable rate class (as init.value=0.0). As R1 is first in the mixtureelem list (see the mixtureelem element listing R1, R2, R3, R4 in the partitionelem example given above), PhyML will print out an explicit error message and bail out. One way to avoid this shortcoming is to define the mixtureelem list as R4, R2, R3, R1 instead.
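A configuration script can apply a reordering of this kind automatically. The Python sketch below moves every rate class whose initial value is zero to the end of the list. This gives R2, R3, R4, R1 rather than the swap R4, R2, R3, R1 suggested above, but in both cases the invariable class is no longer first. The function name is hypothetical. If the other mixtureelem lists differ from class to class, the same permutation has to be applied to them.

```python
def invariable_last(rate_ids, init_values):
    """Reorder rate class ids so that classes with a zero initial rate come last."""
    variable = [r for r in rate_ids if init_values[r] != 0.0]
    invariable = [r for r in rate_ids if init_values[r] == 0.0]
    return variable + invariable

init_values = {"R1": 0.0, "R2": 1.0, "R3": 1.0, "R4": 1.0}
order = invariable_last(["R1", "R2", "R3", "R4"], init_values)
print(", ".join(order))
```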

8.7 A simple example: GTR + Γ4 + I

The example below provides all the required options to fit a Γ4+I model to a single alignment of nucleotide sequences under the GTR model of substitution using a SPR search for the best tree topology. The phyml component sets the name for the analysis to simple.example, meaning that each output file will display this particular string of characters. Also, the tree and statistics file names will begin with p1.output. The tree topology will be estimated so as to maximise the likelihood and the topology search algorithm used here is SPR, as indicated by the value of the corresponding attribute (i.e., search=spr). Only one vector of branch lengths will be used here since only one partition element will be processed. Hence, the <branchlengths> component only has one <instance> sub-component. Also, a single GTR model will apply to all the classes of the mixture model: the <ratematrices> component has only one <instance> sub-component, corresponding to this particular substitution model. The next component, <equfreqs>, indicates that a single vector of equilibrium frequencies will apply here. Next, the <siterates> component has five <instance> sub-components. Four of these correspond to the non-zero relative rates of evolution as defined by a discrete Gamma distribution. The last one (<instance id="R5" value="0.0"/>) defines the class of the mixture corresponding to invariable sites. The <weights> component indicates that a Γ+I model will be fitted here. The shape parameter of the Gamma distribution and the proportion of invariants will be estimated from the data. The <partitionelem> gives information about the sequence alignment (the corresponding file name, the type of data and the alignment format). The <mixtureelem> components next define the mixture model. Each class of the fitted model corresponds to one column, with the first column made of the following elements: T1, M1, F1, R1 and L1. The second class of the mixture is made of T1, M1, F1, R2, L1 and so forth.


<phyml runid="simple.example" output.file="p1.output">

  <topology>
    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
    search="spr"/>
  </topology>

  <branchlengths id="BL1">
    <instance id="L1" optimise.lens="yes"/>
  </branchlengths>

  <ratematrices id="RM1">
    <instance id="M1" model="GTR"/>
  </ratematrices>

  <equfreqs id="EF1">
    <instance id="F1"/>
  </equfreqs>

  <siterates id="SR1">
    <instance id="R1" value="1.0"/>
    <instance id="R2" value="1.0"/>
    <instance id="R3" value="1.0"/>
    <instance id="R4" value="1.0"/>
    <instance id="R5" value="0.0"/>
    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" \
    optimise.pinv="yes">
    </weights>
  </siterates>

  <partitionelem id="partition_elem1" file.name=\
  "./p1.seq" data.type="nt" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1, T1"/>
    <mixtureelem list="M1, M1, M1, M1, M1"/>
    <mixtureelem list="F1, F1, F1, F1, F1"/>
    <mixtureelem list="R1, R2, R3, R4, R5"/>
    <mixtureelem list="L1, L1, L1, L1, L1"/>
  </partitionelem>

</phyml>
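
The column-wise reading of the <mixtureelem> lists described above can be sketched in a few lines of Python. This is only an illustration, not part of PhyML: the function name mixture_classes is hypothetical, and the XML fragment is a trimmed copy of the example with the backslash line continuations removed so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the partition element from the example above
# (line continuations removed so that the fragment is plain XML).
XML = """
<phyml>
  <partitionelem id="partition_elem1" file.name="./p1.seq" data.type="nt" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1, T1"/>
    <mixtureelem list="M1, M1, M1, M1, M1"/>
    <mixtureelem list="F1, F1, F1, F1, F1"/>
    <mixtureelem list="R1, R2, R3, R4, R5"/>
    <mixtureelem list="L1, L1, L1, L1, L1"/>
  </partitionelem>
</phyml>
"""


def mixture_classes(partition):
    """Return one tuple of component ids per class (i.e., per column)."""
    rows = [
        [item.strip() for item in elem.get("list").split(",")]
        for elem in partition.findall("mixtureelem")
    ]
    # Every row must list the same number of classes.
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all <mixtureelem> lists must have the same length")
    # Transpose rows into columns: one column per class of the mixture.
    return list(zip(*rows))


root = ET.fromstring(XML)
classes = mixture_classes(root.find("partitionelem"))
for number, members in enumerate(classes, start=1):
    print(f"class {number}: {', '.join(members)}")
```

Each printed line lists the topology, rate matrix, equilibrium frequencies, relative rate and branch length vector that make up one class of the mixture.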

8.8 A second example: LG4X

The example below shows how to fit the LG4X model to a given alignment of amino-acid sequences (file M587.nex.Phy). LG4X is a mixture model with four classes. Each class has its own rate and corresponding frequency (hence the use of the FreeRate model below, see the <siterates> component). In the particular example given here, the initial rate values and frequencies are set by the user. These parameters will then be optimized by PhyML (optimise.freerates=yes). Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by the user (note that these matrices can be downloaded from the following web address: http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt. They are also provided in the examples/lg4x/ directory of the PhyML package).

<phyml run.id="lg4x" output.file="M587.tests" branch.test="no">

  <!-- Tree topology: start with a user-supplied tree (file user_tree.txt) -->
  <topology>
    <instance id="T1" init.tree="user" file.name="user_tree.txt" \
    search="spr" optimise.tree="no"/>
  </topology>


  <!-- Four rate matrices, read from files -->
  <ratematrices id="RM1">
    <instance id="M1" model="customaa" ratematrix.file="X1.mat"/>
    <instance id="M2" model="customaa" ratematrix.file="X2.mat"/>
    <instance id="M3" model="customaa" ratematrix.file="X3.mat"/>
    <instance id="M4" model="customaa" ratematrix.file="X4.mat"/>
  </ratematrices>

  <!-- Freerate model of variation of rates across sites -->
  <siterates id="SR1">
    <instance id="R1" init.value="0.197063"/>
    <instance id="R2" init.value="0.750275"/>
    <instance id="R3" init.value="1.951569"/>
    <instance id="R4" init.value="5.161586"/>
    <weights  id="D1" family="freerates" optimise.freerates="yes">
      <instance appliesto="R1" value="0.422481"/>
      <instance appliesto="R2" value="0.336848"/>
      <instance appliesto="R3" value="0.180132"/>
      <instance appliesto="R4" value="0.060539"/>
    </weights>
  </siterates>

  <!-- Amino-acid equilibrium freqs. are given by the models -->
  <equfreqs id="EF1">
    <instance id="F1" aa.freqs="model"/>
    <instance id="F2" aa.freqs="model"/>
    <instance id="F3" aa.freqs="model"/>
    <instance id="F4" aa.freqs="model"/>
  </equfreqs>


  <!-- One vector of branch lengths -->
  <branchlengths id="BL1" >
    <instance id="L1" optimise.lens="yes"/>
  </branchlengths>


  <!-- Mixture model assemblage -->
  <partitionelem id="partition1" file.name="M587.nex.Phy" \
  data.type="aa" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1"/>
    <mixtureelem list="M1, M2, M3, M4"/>
    <mixtureelem list="F1, F2, F3, F4"/>
    <mixtureelem list="R1, R2, R3, R4"/>
    <mixtureelem list="L1, L1, L1, L1"/>
  </partitionelem>

</phyml>

In order to fit the LG4X model to the protein sequence file provided in the examples/ directory, simply type ./phyml --xml=../examples/lg4x/lg4x.xml (assuming the PhyML binary is installed in the src/ directory). You can of course slightly tweak the file ../examples/lg4x/lg4x.xml and use it as a template to fit this model to another data set.
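
When adapting the template to another data set, it can be useful to check the initial rates and weights typed into the <siterates> component. The short Python sketch below is a plain arithmetic check on the values of the example above; it makes no claim about how PhyML normalises these parameters internally.

```python
# Initial relative rates and class weights copied from the <siterates>
# component of the LG4X example.
rates = [0.197063, 0.750275, 1.951569, 5.161586]
weights = [0.422481, 0.336848, 0.180132, 0.060539]

# The weights are class frequencies, so they are expected to sum to one.
total_weight = sum(weights)

# Average relative rate across the four classes of the mixture.
mean_rate = sum(r * w for r, w in zip(rates, weights))

print(f"sum of weights:     {total_weight:.6f}")
print(f"weighted mean rate: {mean_rate:.6f}")
```

For the values of this example, both quantities come out very close to 1.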

8.9 An example with multiple partition elements

The example below gives the complete XML file to specify the analysis of three partition elements, corresponding to the nucleotide sequence files small_p1_pos1.seq, small_p1_pos2.seq and small_p1_pos3.seq in interleaved PHYLIP format. small_p1_pos1.seq is fitted with the HKY85 model of substitution (with the transition/transversion ratio being estimated from the data), combined with a Γ4 model of rate variation across sites (with the gamma shape parameter being estimated from the data). small_p1_pos2.seq is fitted with a custom substitution model with the constraint AG=CT. The nucleotide frequencies are set to 1/4 here. The model does not allow substitution rates to vary across sites. small_p1_pos3.seq is fitted using a GTR model combined with a Γ4+I model of rate variation across sites. Note that the equilibrium nucleotide frequencies for the fourth and fifth classes of the mixture are set to be equal to those estimated from the first partition element (i.e., F1). The initial phylogeny is built using BioNJ and the tree topology is to be estimated using an NNI search algorithm.


<phyml runid="nnisearch" output.file="small_p1_output">

  <topology>
    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
    search="nni"/>
  </topology>

  <branchlengths id="BL1">
    <instance id="L1" optimise.lens="yes"/>
    <instance id="L2"/>
    <instance id="L3"/>
  </branchlengths>

  <ratematrices id="RM1">
    <instance id="M1" model="HKY85" optimise.tstv="yes"/>
    <instance id="M2" model="custom" model.code="102304" \
    optimise.rr="yes"/>
    <instance id="M3" model="GTR"/>
  </ratematrices>

  <equfreqs id="EF1">
    <instance id="F1"/>
    <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/>
    <instance id="F3"/>
  </equfreqs>

  <siterates id="SR1">
    <instance id="R1" value="1.0"/>
    <instance id="R2" value="1.0"/>
    <instance id="R3" value="1.0"/>
    <instance id="R4" value="1.0"/>
    <weights  id="D1" family="gamma" optimise.alpha="yes" \
    optimise.pinv="no">
    </weights>
  </siterates>

  <siterates id="SR2">
    <instance id="R8" value="1.0"/>
    <weights  id="D2" family="gamma" optimise.alpha="yes" \
    optimise.pinv="yes">
    </weights>
  </siterates>

  <siterates id="SR3">
    <instance id="R10" value="1.0"/>
    <instance id="R11" value="1.0"/>
    <instance id="R12" value="1.0"/>
    <instance id="R13" value="1.0"/>
    <instance id="R14" value="1.0"/>
    <weights  id="D3" family="gamma" optimise.alpha="yes" \
    optimise.pinv="yes">
    </weights>
  </siterates>

  <partitionelem id="partition_elem1" file.name=\
  "./small_p1_pos1.seq" data.type="nt" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1"/>
    <mixtureelem list="M1, M1, M1, M1"/>
    <mixtureelem list="F1, F1, F1, F1"/>
    <mixtureelem list="R1, R2, R3, R4"/>
    <mixtureelem list="L1, L1, L1, L1"/>
  </partitionelem>

  <partitionelem id="partition_elem2" file.name=\
  "./small_p1_pos2.seq" data.type="nt" interleaved="yes">
    <mixtureelem list="T1"/>
    <mixtureelem list="M2"/>
    <mixtureelem list="R8"/>
    <mixtureelem list="F2"/>
    <mixtureelem list="L2"/>
  </partitionelem>

  <partitionelem id="partition_elem3" file.name=\
  "./small_p1_pos3.seq" data.type="nt" interleaved="yes">
    <mixtureelem list="T1, T1, T1, T1, T1"/>
    <mixtureelem list="M3, M3, M3, M3, M3"/>
    <mixtureelem list="R10, R11, R12, R13, R14"/>
    <mixtureelem list="F3, F3, F3, F1, F1"/>
    <mixtureelem list="L3, L3, L3, L3, L3"/>
  </partitionelem>

</phyml>

8.10 Branch lengths with invariants and partitioned data

Accommodating models with invariable sites that apply to only some elements of a partitioned data set, when these elements share the same set of edge lengths, can lead to inconsistencies. Consider for instance a partitioned data set with two elements. Assume that these two elements share the same set of edge lengths. Also, consider that GTR+I applies to the first element and HKY applies to the second. Now, for a given edge, the expected number of substitutions per site for the first element of the partition is equal to (1-p)l, where p is the estimated proportion of invariants and l is the maximum-likelihood estimate of the length of that specific edge. For the second element of the partition, the expected number of substitutions per site is equal to l, rather than (1-p)l. While the values of l are common to the two elements, matching the specification of the input model, the actual edge lengths do differ across the two partition elements. Please be aware that, due to the programming structure implemented in PhyML, the program will only return one value here, which will be equal to (1-p)l.
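
As a purely illustrative numerical example (the values p = 0.2 and l = 0.1 are made up and do not come from any real analysis), the two expected numbers of substitutions per site along the same edge would be:

```latex
\begin{align*}
\text{first element (GTR+I):} \quad & (1-p)\,l = (1 - 0.2) \times 0.1 = 0.08, \\
\text{second element (HKY):} \quad & l = 0.1 .
\end{align*}
```

Under these assumed values, the single edge length returned would be 0.08, although the second partition element effectively evolves along an edge of length 0.1.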

9 Citing PhyML

The “default citation” for PhyML is:

The “historic citation” for PhyML is:

10 Other programs in the PhyML package

PhyML is a software package that provides tools to tackle problems other than estimating maximum likelihood phylogenies. Installing these tools and processing data sets are explained in the following sections.

10.1 PhyTime

PhyTime is a program that estimates node ages and substitution rates using a Bayesian approach. The performance and main features of this software are described in two articles (see Section 10.1.9).

It relies on a Gibbs sampler which outperforms the “standard” Metropolis-Hastings algorithm implemented in a number of phylogenetic software packages. The details and performance of this approach are described in the following article:

10.1.1 Installing PhyTime

Compiling PhyTime is straightforward on Unix-like machines (i.e., Linux and MacOS systems). PhyTime is not readily available for Windows machines but compilation should be easy on this system too. In the ‘phyml’ directory, where the ‘src/’ and ‘doc/’ directories are located, enter the following commands:

./configure --enable-phytime;
make clean;
make;

This set of commands generates a binary file called phytime which can be found in the ‘src/’ directory.

10.1.2 Running PhyTime

Passing options and running PhyTime on your data set is quite similar to running PhyML in command-line mode. The main differences between the two programs are explained below:

A typical PhyTime command-line should look like the following:

./phytime -i seqname -u treename --calibration=calibration_file -m GTR -c 8

Assuming the file ‘seqname’ contains DNA sequences in PHYLIP or NEXUS format, ‘treename’ is the rooted input tree in Newick format and ‘calibration_file’ is a set of calibration nodes, PhyTime will estimate the posterior distribution of node times and substitution rates under the assumption that the substitution process follows a GTR model with 8 classes of rates in the Gamma distribution of rates across sites. The model parameter values are estimated by a Gibbs sampling technique. This algorithm tries different values of the model parameters and records the most probable ones. By default, 10^6 values for each parameter are collected. These values are recorded every 10^3 samples. These settings can be modified using the appropriate command-line options (see below).

10.1.3 Upper bounds of model parameters

The maximum expected number of substitutions per site along a given branch is set to 1.0. Since calibration times provide prior information about the time scale considered, it is possible to use that information to define an upper bound for the substitution rate. This upper bound is equal to the ratio of the maximum value for a branch length (1.0) to the amount of time elapsed since the oldest calibration point (i.e., the minimum of the lower bounds taken over the whole set of calibration points) (see footnote 2). It is important to keep in mind that the upper bound of the average substitution rate depends on the time unit used in the calibration priors. The value of the upper bound is printed on screen at the start of the execution.
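
The dependence of this upper bound on the time unit can be illustrated with the short Python sketch below. It implements only the simplified ratio described above and ignores the extra parameter of the actual formula; the function name rate_upper_bound and the calibration age of 100 million years are hypothetical.

```python
# Maximum expected number of substitutions per site along a branch.
MAX_BRANCH_LENGTH = 1.0


def rate_upper_bound(oldest_calibration_time):
    """Simplified upper bound: maximum branch length / oldest calibration time."""
    if oldest_calibration_time <= 0:
        raise ValueError("the calibration time must be positive")
    return MAX_BRANCH_LENGTH / oldest_calibration_time


# The same (hypothetical) oldest calibration expressed in two time units.
bound_per_myr = rate_upper_bound(100.0)      # time unit: million years
bound_per_year = rate_upper_bound(100.0e6)   # time unit: years

print(f"upper bound, time in million years: {bound_per_myr:g}")
print(f"upper bound, time in years:         {bound_per_year:g}")
```

The two bounds describe the same constraint, but their numerical values differ by a factor of one million, which is why the time unit of the calibration priors matters when reading the value printed on screen.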

PhyTime implements two models that allow rates to be autocorrelated. The strength of autocorrelation is governed by a parameter whose value is estimated from the data. However, it is necessary to set an appropriate upper bound for this parameter prior to running the analysis. The maximum value is set such that the correlation between the rate at the beginning and at the end of a branch of length 1.0 calendar time unit is not different from 0. Here again the upper bound for the model parameter depends on the time unit. It is important to choose this unit so that a branch of length 1.0 calendar unit can be considered as short. For this reason, .

10.1.4 PhyTime specific options

Besides the --calibration option, there are other command-line options that are specific to PhyTime:

10.1.5 PhyTime output

The program PhyTime generates two output files. The file called ‘your_seqfile_phytime.XXXX.stats’, where XXXX is a randomly generated integer, lists the node times and branch relative rates sampled during the estimation process. It also gives the sampled values for other parameters, such as the autocorrelation of rates (parameter ‘Nu’), and the rate of evolution (parameter ‘EvolRate’) amongst others. This output file can be analysed with the program Tracer from the BEAST package (http://beast.bio.ed.ac.uk/Main_Page). The second file is called ‘your_seqfile_phytime.XXXX.trees’. It is the list of trees that were collected during the estimation process, i.e., phylogenies sampled from the posterior density of trees. This file can be processed using the software TreeAnnotator, also part of the BEAST package (see http://beast.bio.ed.ac.uk/Main_Page) in order to generate confidence sets for the node time estimates.

Important information is also displayed on the standard output of PhyTime (the standard output generally corresponds to the terminal window from which PhyTime was launched). The first column of this output gives the current generation, or run, of the chain. It starts at 1 and goes up to 1E+6 by default (use --chain_len to change this value, see above). The second column gives the time elapsed in seconds since the sampling began. The third column gives the log likelihood of the phylogenetic model (i.e., ‘Felsenstein’s likelihood’). The fourth column gives the logarithm of the joint prior probability of substitution rates along the tree and node heights. The fifth column gives the current sampled value of the EvolRate parameter along with the corresponding Effective Sample Size (ESS) (see Section 10.1.7) for this parameter. The sixth column gives the tree height and the corresponding ESS. The seventh column gives the value of the autocorrelation parameter followed by the corresponding ESS. The eighth column gives the values of the birth rate parameter that governs the birth-rate model of species divergence dates. The last column of the standard output gives the minimum of the ESS values taken over the whole set of node height estimates. It provides useful information when one has to decide whether or not the sample size is large enough to draw valid conclusions, i.e., decide whether the chain was run for long enough (see Section 11.2 for more detail about adequate chain length).

10.1.6 ClockRate vs. EvolRate

The average rate of evolution along a branch is broken into two components. One is called ClockRate and is the same throughout the tree. The other is called EvolRate and corresponds to a weighted average of branch-specific rates. The model of rate evolution implemented in PhyTime forces the branch-specific rate values to be greater than one. As a consequence, ClockRate is usually smaller than EvolRate.

In more mathematical terms, let μ be the value of ClockRate, ri be the value of the relative rate along branch i and Δi the time elapsed along branch i. The value of EvolRate is then given by:

$$\text{EvolRate} = \mu \, \frac{\sum_{i=1}^{2n-3} r_i \Delta_i}{\sum_{i=1}^{2n-3} \Delta_i}.$$

It is clear from this equation that multiplying each r_i by a constant and dividing μ by the same constant does not change the value of EvolRate. The r_i values and μ are then confounded, or non-identifiable, and only the value of EvolRate can be estimated from the data.
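
This invariance can be checked numerically. The Python sketch below evaluates the expression for EvolRate on made-up values (the clock rate, relative rates, branch durations and the constant c are all hypothetical) before and after rescaling; the function evol_rate is an illustration and is not taken from the PhyTime source code.

```python
def evol_rate(clock_rate, relative_rates, durations):
    """EvolRate = mu * sum(r_i * Delta_i) / sum(Delta_i)."""
    weighted = sum(r * d for r, d in zip(relative_rates, durations))
    return clock_rate * weighted / sum(durations)


# Hypothetical values for a tree with n = 3 tips, hence 2n - 3 = 3 branches.
mu = 0.01                    # ClockRate
r = [1.2, 2.0, 1.5]          # relative rates r_i (all greater than one)
delta = [10.0, 5.0, 20.0]    # time elapsed along each branch, Delta_i

c = 4.0  # arbitrary positive constant

original = evol_rate(mu, r, delta)
# Multiply every r_i by c and divide mu by c.
rescaled = evol_rate(mu / c, [c * ri for ri in r], delta)

print(f"EvolRate (original values): {original:.6f}")
print(f"EvolRate (rescaled values): {rescaled:.6f}")
```

Both calls return the same value up to floating-point rounding, which illustrates why the data only inform the product summarised by EvolRate and not μ and the r_i values separately.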

10.1.7 Effective sample size

The MCMC technique generates samples from a target distribution (in our case, the joint posterior density of parameters). Due to the Markovian nature of the method, these samples are not independent. The ESS is the estimated number of independent measurements obtained from a set of (usually dependent) measurements. It is calculated using the following formula:

$$\text{ESS} = N \left( \frac{1-r}{1+r} \right),$$

where N is the length of the chain (i.e., the ‘raw’ or ‘correlated’ sample size) and r is the autocorrelation value, which is obtained using the following formula:

$$r = \frac{1}{(N-k)\,\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k} - \mu_x),$$

where μx and σx are the mean and standard deviation of the Xi values respectively and k is the lag. The value of r that is used in PhyTime corresponds to the case where k=1, which therefore gives a first order approximation of the ‘average’ autocorrelation value (i.e., the autocorrelation averaged over the set of possible values of the lag).
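
The two formulas above translate directly into code. The Python sketch below is a minimal illustration and is not extracted from PhyTime; in particular, it assumes that the variance is computed with divisor N (population variance), which the text does not specify, and the function names are hypothetical.

```python
def autocorrelation(samples, lag=1):
    """Lag-k autocorrelation r, following the formula above."""
    n = len(samples)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(samples)")
    mean = sum(samples) / n
    # Assumption: sigma_x^2 is the population variance (divisor N).
    variance = sum((x - mean) ** 2 for x in samples) / n
    if variance == 0.0:
        raise ValueError("autocorrelation is undefined for constant samples")
    cross = sum(
        (samples[i] - mean) * (samples[i + lag] - mean) for i in range(n - lag)
    )
    return cross / ((n - lag) * variance)


def effective_sample_size(samples):
    """ESS = N (1 - r) / (1 + r), with r taken at lag k = 1."""
    r = autocorrelation(samples, lag=1)
    return len(samples) * (1.0 - r) / (1.0 + r)


# Toy, strongly correlated "chain" of five sampled values.
chain = [1.0, 2.0, 3.0, 4.0, 5.0]
print(f"r   = {autocorrelation(chain):.4f}")
print(f"ESS = {effective_sample_size(chain):.4f}")
```

For this toy chain, r = 0.5 and the ESS is 5/3, i.e., the five correlated values carry about as much information as fewer than two independent draws. Note that the expression is undefined when r = -1.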

10.1.8 Prior distributions of model parameters

Any Bayesian analysis requires specifying a prior distribution of model parameters. The outcome of the data analysis, i.e., the posterior distribution, is influenced by the priors. This is especially true if the signal conveyed by the data is weak. While some have argued that the specification of priors relies more on arbitrary decisions than sound scientific reasoning, choosing relevant prior distributions is in fact fully integrated in the process of building the model that generates the observed data. In particular, the problem of estimating divergence times naturally lends itself to hierarchical Bayesian modelling. Based on the hypothesis that rates of evolution are conserved (to some extent) throughout the course of evolution, the hierarchical Bayesian approach provides an adequate framework for inferring substitution rates and divergence dates separately. Hence, in this situation, it makes good sense to use what is known about a relatively well-defined feature of the evolution of genetic sequences (the “molecular clock” hypothesis combined with stochastic variations of rates across lineages) to build a prior distribution on rates along edges.

10.1.9 Citing PhyTime

The “default citation” is:

An earlier article also described some of the methods implemented in PhyTime:

10.2 PhyloGeo

PhyloGeo is a program that implements the competition-dispersal phylogeography model described in Ranjard, Welch, Paturel and Guindon “Modelling competition and dispersal in a statistical phylogeographic framework”. Accepted for publication in Systematic Biology.

It implements a Markov Chain Monte Carlo approach that samples from the posterior distribution of the three parameters of interest in this model, namely the competition intensity λ, the dispersal bias parameter σ and the overall dispersal rate τ. The data consist of a phylogeny with node heights proportional to their ages and geographical locations for every taxon in this tree.

10.2.1 Installing PhyloGeo

Compiling PhyloGeo is straightforward on Unix-like machines (i.e., Linux and MacOS systems). PhyloGeo is not readily available for Windows machines but compilation should be easy on this system too. In the ‘phyml’ directory, where the ‘src/’ and ‘doc/’ directories are located, enter the following commands:

./configure --enable-geo;
make clean;
make;

This set of commands generates a binary file called phylogeo which can be found in the ‘src/’ directory.

10.2.2 Running PhyloGeo

PhyloGeo takes as input a rooted tree file in Newick format and a file giving the geographical locations for all the tips of the phylogeny. Here is an example of a valid tree, with the corresponding spatial locations just below:

(((unicaA:1.30202,unicaB:1.30202):1.34596,(((nitidaC:0.94617,(nitidaA:0.31497,
nitidaB:0.31497):0.63120):0.18955,(((mauiensisA:0.00370,mauiensisB:0.00370):0.20068,
(pilimauiensisA:0.05151,pilimauiensisB:0.05151):0.15287):0.78769,(brunneaA:0.10582,
brunneaB:0.10582):0.88625):0.14365):0.80126,(((molokaiensisA:0.03728,
molokaiensisB:0.03728):0.71371,(deplanataA:0.01814,deplanataB:0.01814):0.73285):0.34764,
((parvulaA:0.20487,parvulaB:0.20487):0.40191,(kauaiensisA:0.24276,
kauaiensisB:0.24276):0.36401):0.49186):0.83835):0.71099):1.38043,
(nihoaA:0.05673,nihoaB:0.05673):3.97168);
nihoaA                  23.062222   161.926111
nihoaB                  23.062222   161.926111
kauaiensisA             22.0644445  159.5455555
kauaiensisB             22.0644445  159.5455555
unicaA                  21.436875   158.0524305
unicaB                  21.436875   158.0524305
parvulaA                21.436875   158.0524305
parvulaB                21.436875   158.0524305
molokaiensisA           20.90532    156.6499
molokaiensisB           20.90532    156.6499
deplanataA              20.90532    156.6499
deplanataB              20.90532    156.6499
brunneaA                20.90532    156.6499
brunneaB                20.90532    156.6499
mauiensisA              20.90532    156.6499
mauiensisB              20.90532    156.6499
pilimauiensisA          20.90532    156.6499
pilimauiensisB          20.90532    156.6499
nitidaA                 19.7362     155.6069
nitidaB                 19.7362     155.6069
nitidaC                 19.7362     155.6069

In order to run PhyloGeo, enter the following command: ./phylogeo ./tree_file ./spatial_location_file > phylogeo_output. PhyloGeo will then print out the sampled values of the model parameters in the file phylogeo_output. This file can then be used to generate the marginal posterior densities of the model parameters. In particular, evidence for competition corresponds to values of λ smaller than 1.0. Please see the original article for more information on how to interpret the model parameters.
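
Before launching an analysis, it may help to verify that every tip of the tree has an entry in the spatial location file. The Python sketch below is one possible way to do so and is not part of the PhyML package: the helper names are hypothetical, the tip-name extraction relies on a simple regular expression rather than a full Newick parser, and the input is a shortened, made-up tree that merely reuses taxon names and coordinates from the example above.

```python
import re


def tip_names(newick):
    """Extract tip labels: names that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^\s():,;]+)\s*:", newick))


def read_locations(text):
    """Map each taxon name to its pair of coordinates."""
    locations = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name, first, second = fields
        locations[name] = (float(first), float(second))
    return locations


# Shortened, hypothetical inputs reusing names from the example above.
tree = ("((unicaA:1.30202,unicaB:1.30202):1.34596,"
        "(nihoaA:0.05673,nihoaB:0.05673):3.97168);")
coords = """
nihoaA   23.062222  161.926111
nihoaB   23.062222  161.926111
unicaA   21.436875  158.0524305
unicaB   21.436875  158.0524305
"""

tips = tip_names(tree)
locations = read_locations(coords)
missing = sorted(tips - set(locations))
print("tips without a location:", missing)
```

An empty list means that each tip found in the tree has a matching line in the location data.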

10.2.3 Citing PhyloGeo

Ranjard, L., Welch D., Paturel M. and Guindon S. “Modelling competition and dispersal in a statistical phylogeographic framework”. 2014. Systematic Biology.

11 Recommendations on program usage

11.1 PhyML

The choice of the tree searching algorithm among those provided by PhyML is generally a tough one. The fastest option relies on local and simultaneous modifications of the phylogeny using NNI moves. More thorough explorations of the space of topologies are also available through the SPR options. As these two classes of tree topology moves involve different computational burdens, it is important to determine which option is the most suitable for the type of data set or analysis one wants to perform. Below is a list of recommendations for typical phylogenetic analyses.

  1. Single data set, unlimited computing time. The best option here is probably to use a SPR search (i.e., straight SPR or best of SPR and NNI). If the focus is on estimating the relationships between species, it is a good idea to use more than one starting tree to decrease the chance of getting stuck in a local maximum of the likelihood function. Using NNIs is appropriate if the analysis does not mainly focus on estimating the evolutionary relationships between species (e.g. a tree is needed to estimate the parameters of codon-based models later on). Branch supports can be estimated using bootstrap and approximate likelihood ratios.

  2. Single data set, restricted computing time. The three tree searching options can be used depending on the computing time available and the size of the data set. For small data sets (i.e., < 50 sequences), NNI will generally perform well provided that the phylogenetic signal is strong. It is relevant to estimate a first tree using NNI moves and examine the reconstructed phylogeny in order to have a rough idea of the strength of the phylogenetic signal (the presence of small internal branch lengths is generally considered as a sign of a weak phylogenetic signal, especially when sequences are short). For larger data sets (> 50 sequences), a SPR search is recommended if there is good evidence of a lack of phylogenetic signal. Bootstrap analysis will generally involve large computational burdens. Estimating branch supports using approximate likelihood ratios therefore provides an interesting alternative here.

  3. Multiple data sets, unlimited computing time. Comparative genomic analyses sometimes rely on building phylogenies from the analysis of a large number of gene families. Here again, the NNI option is the most relevant if the focus is not on recovering the most accurate picture of the evolutionary relationships between species. Slower SPR-based heuristics should be used when the topology of the tree is an important parameter of the analysis (e.g., identification of horizontally transferred genes using phylogenetic tree comparisons). Internal branch support is generally not a crucial parameter of the multiple data set analyses. Using approximate likelihood ratio is probably the best choice here.

  4. Multiple data sets, limited computing time. The large amount of data to be processed in a limited time generally requires the use of the fastest tree searching and branch support estimation methods. Hence, NNI and approximate likelihood ratios rather than SPR and non-parametric bootstrap are generally the most appropriate here.

Another important point is the choice of the substitution model. While default options generally provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the best-fit substitution model. This pre-analysis can be done using popular software such as Modeltest or ProtTest for instance. These programs generally recommend the use of a discrete gamma distribution to model the substitution process as variability of rates among sites is a common feature of molecular evolution. The choice of the number of rate classes to use for this distribution is also an important one. While the default is set to four categories in PhyML, it is recommended to use a larger number of classes if possible in order to best approximate the patterns of rate variation across sites. Note however that run times are directly proportional to the number of classes of the discrete gamma distribution. Here again, a pre-analysis with the simplest model should help the user to determine the number of rate classes that represents the best trade-off between computing time and fit of the model to the data.

11.2 PhyTime

Analysing a data set using PhyTime should involve three steps based on the following questions: (1) do the priors seem to be adequate, (2) can I use the fast approximation of the likelihood, and (3) how long shall I run the program for? I explain below how to provide answers to these questions.

12 Frequently asked questions

  1. PhyML crashes before reading the sequences. What’s wrong?

  2. The program crashes after reading the sequences. What’s wrong?

  3. Does PhyML handle outgroup sequences?

  4. Does PhyML estimate clock-constrained trees?

  5. Can PhyML analyse partitioned data, such as multiple gene sequences?

13 Acknowledgements

The development of PhyML since 2000 has been supported by the Centre National de la Recherche Scientifique (CNRS) and the Ministère de l’Éducation Nationale.


  1. The rate matrix corresponds here to the symmetrical matrix giving the so-called “exchangeability rates”.

  2. The actual formula involves an extra parameter which does not need to be introduced here.