Tree_Simulation_and_Trait

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Summary of Document

In this document, we’ll go through how we can quickliy simulate trees so we can learn how to do important phylogenetic techniques and methods, such as: tree manipulation, rerooting, signal estimation, trait mapping, etc.

First, you should load the required packages. You should install if you do not have them yet.

library(ape)
library(phytools)

## Loading required package: maps

library(ggtree)

## Warning: package 'ggtree' was built under R version 4.5.1

## ggtree v3.16.3 Learn more at https://yulab-smu.top/contribution-tree-data/
## 
## Please cite:
## 
## Shuangbin Xu, Lin Li, Xiao Luo, Meijun Chen, Wenli Tang, Li Zhan, Zehan
## Dai, Tommy T. Lam, Yi Guan, Guangchuang Yu. Ggtree: A serialized data
## object for visualization of a phylogenetic tree and annotation data.
## iMeta 2022, 1(4):e56. doi:10.1002/imt2.56

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:ape':
## 
##     rotate

library(viridis)

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:maps':
## 
##     unemp

library(ggplot2)
library(geiger)

First, we’ll need to simulate a tree

We can use the rtree function to do this. We’ll keep things pretty default, but there are ways to simulate trees with certain types of structures, etc.

We’ll also set a “seed”. This is a number which will allow us to make the same tree. You might not want to do this if you’re wanting to test many differnet possible trees.

The ‘cex’ argument adjusts font size of labels

set.seed(425)
tree<-rtree(20)
plot(tree, cex=0.8)

There are many different ways to visualize a tree, as you can see below

plot(tree, cex=0.8, type="fan")

plot(tree, cex=0.8, type="cladogram")

plot(tree, cex=0.8, type="phylogram")

plot(tree, cex=0.8, type="radial")

plot(tree, cex=0.8, type="tidy")

plot(tree, cex=0.8, type="unrooted")

Rerooting our phylogeny

In order to re-root (or root) our phylogeny, we need to specify which node, tip, or group of tips we want to root it on. for example, we can root it with a tip name.

We can root on a group of tips, if you give it a lots

root.tree<-root(tree, outgroup = "t11")
plot(root.tree, cex=0.8)

If we want to root on a node, we have to know what the node numbers are. Running the nodelabels() function after plotting the tree lets us know what those are.

plot(tree, cex=0.8)
nodelabels()

We can then root like this

root.tree<-root(tree, node=29)
plot(root.tree, cex=0.8)

We can also add titles to our trees, and even not show our tips

plot(tree, show.tip.label = F)
title("simulated tree")

Simulating Continuous Data

Often in comparative phylogenetics work, we are interested in the evolution of continuous traits. These are things like size, mass, width, genome size, etc.

There are many ways to investigate plotting and assessing the trait. We can use Brownian motion to simulate the trait (see for more information: https://en.wikipedia.org/wiki/Brownian_motion)

We can simulate the traits for each tip, and indicate what we want the bounds of the trait to be, and to see how variable we want the trait to be

trait<-fastBM(tree, bounds=c(0, 10), sig2=2)

We can see there is a distribution

hist(trait, breaks=10, col="slateblue", xlab="Trait Value (units)")

One of the most useful ways to visualize these traits on a tree is to use colors. This is usually in the form of a heat map. To do this, we’ll need to make our color palette. Below, we’ll make the palette and then plot them for our tip labels.

cols<-setNames(heat.colors(length(trait))[rank(trait)], 
               names(trait))
plot(tree, tip.color=cols)

However, as you can see, these color scan be difficult to see in this fashion. I often suggest using a different color palette, like the viridis palette from the viridis package. We’ll plot this and then make the labels a bit bigger

cols<-setNames(viridis(length(trait))[rank(trait)], 
               names(trait))
plot(tree, tip.color=cols, cex=1.5)

Plotting Continuous Traits on Branches

Now tip labels, are fine and all, but sometimes it’s better to plot them as branches. We can then even have an estimate for the ancestral condition (through ancestral character state reconstruction). We can do this with the phytools package

coltree<-contMap(tree, trait, res=250, fsize=1, lwd=4)

coltree

## Object of class "contMap" containing:
## 
## (1) A phylogenetic tree with 20 tips and 19 internal nodes.
## 
## (2) A mapped continuous trait on the range (0.012295, 2.343992).

However, as before, the palette is not ideal. This has issues with color blindness and grayscale. We can try viridis again. Here smaller traits are with darker colors, larger traits with lighter colors

coltree$cols[]<-viridis(length(coltree$cols))
plot(coltree)

Trimming tips

Sometimes you’ll end up with trait datasets that do not include all of the tips/species that you have on your tree. It’s pretty easy to remove those. In order to demo this, we’ll simulate a tree with more tips than our trait data simulated above. The tree will have ten extra tips, and we’ll use our original trait data (20 species). The way this tree is simulated, we’ll have the same names for the first 20 things

set.seed(425)
tree.big<-rtree(30)
plot(tree.big)

We have to first come up with a list of tips which are found in the tree, but not the dataset. It is important that the larger item be first, then the trait data

to.remove<-setdiff(tree.big$tip.label, names(trait))
to.remove

##  [1] "t28" "t27" "t30" "t24" "t25" "t26" "t22" "t21" "t29" "t23"

We can see it provides t21-t30 We can then trim the tree with that information using the drop.tip function. We give the tree we want to trim, and the names we want to cut out. We can also use the keep.tip function to keep tips rather than trim them; It functionally works the same, but inverse. We’ll trim it below, then plot the color mapped one

treetrim<-drop.tip(tree.big, to.remove)
coltree2<-contMap(treetrim, trait, res=250, fsize=0.9, lwd=6)

We’ll change the colors again

coltree2$cols[]<-viridis(length(coltree2$cols))
plot(coltree2, lwd=6)

Measuring phylogenetic signal for a continuous trait

As traits often evolve along the phylogeny (as we had our code do), we should use a metric to measure how strong the relationship is. For example, we can use Pagel’s Lambda, Blomberg’s K, etc. We can also test to see how the trait fits a model of Brownian Motion (discussed above) or Ornstein-Uhlenbeck (link to article: https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process)

We can run a test using the fitcontinuous() function from the geiger package. We’ll save the values in objects here for tests of how well our trait data fits the Brownian Motion (BM), Ornstein-Uhlenbeck (OH) and Lambda tests.

bm.test<-fitContinuous(tree, trait, model="BM")
ou.test<-fitContinuous(tree, trait, model="OU")

## Warning in fitContinuous(tree, trait, model = "OU"): Non-ultrametric tree with
## OU model, using VCV method.

## Warning in cache$dat - mu: Recycling array of length 1 in vector-array arithmetic is deprecated.
##   Use c() or as.vector() instead.
## Warning in cache$dat - mu: Recycling array of length 1 in vector-array arithmetic is deprecated.
##   Use c() or as.vector() instead.

## Warning in fitContinuous(tree, trait, model = "OU"): 
## Parameter estimates appear at bounds:
##  alpha

lam.test<-fitContinuous(tree, trait, model="lambda")

We can then look at the results to see which has the best AICc value. We could also use AIC or lik. scores, but the AICc corrects for the fact models have different numbers of parameters (affects our degrees of freedom). All of the below have similar values, with OU having slightly lower values for AICc. As it’s pretty close, they all suggest similar patterns, which makes sense, as BM and OU are very similar random walk models (with some more complexity in OU), and lambda is a metric that assesses Brownian Motion.

bm.test

## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.922003
##  z0 = 0.653800
## 
##  model summary:
##  log-likelihood = -26.522334
##  AIC = 57.044669
##  AICc = 57.750551
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  number of iterations with same best fit = 100
##  frequency of best fit = 1.000
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

ou.test

## GEIGER-fitted comparative model of continuous data
##  fitted 'OU' model parameters:
##  alpha = 2.718282
##  sigsq = 3.267498
##  z0 = 1.151627
## 
##  model summary:
##  log-likelihood = -22.871608
##  AIC = 51.743217
##  AICc = 53.243217
##  free parameters = 3
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 40
##  number of iterations with same best fit = NA
##  frequency of best fit = NA
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

lam.test

## GEIGER-fitted comparative model of continuous data
##  fitted 'lambda' model parameters:
##  lambda = 0.424419
##  sigsq = 0.366614
##  z0 = 0.789089
## 
##  model summary:
##  log-likelihood = -24.966572
##  AIC = 55.933145
##  AICc = 57.433145
##  free parameters = 3
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  number of iterations with same best fit = 21
##  frequency of best fit = 0.210
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

While we can use fitcontinuous for Lambda, we can test for lambda or Blomberg’s K using the phylosig() function from the phytools package

phylam<-phylosig(tree, trait, method="lambda")
phyk<-phylosig(tree, trait, method="K")

We can then see the values here

phylam

## 
## Phylogenetic signal lambda : 0.424415 
## logL(lambda) : -24.9666

phyk

## 
## Phylogenetic signal K : 0.38459

What if we want to work with Discrete data?

Discrete data is very common as well, but can be a bit trickier to visualize, especially for ancestral nodes. We’ll simulate some data here based on an ER model. It’ll have 3 traits, if the seed works correctly. We’ll also make a dataframe version of it, as it’ll be needed for what we’re going to do

set.seed(7)
discrete<-rTraitDisc(tree, model="ER", k=3)
discrete

## t16 t12  t3 t17 t14  t4 t19 t15 t13  t7  t5  t1 t20  t9 t11 t18  t8  t6  t2 t10 
##   A   A   C   C   A   A   A   B   B   A   B   B   B   B   A   B   B   B   B   B 
## Levels: A B C

disdat<-as.data.frame(discrete)

We can then plot it with some color

dis.tree<-plotTree.datamatrix(tree, disdat, fsize=0.8)

## Loading required package: RColorBrewer

We can also plot it as the tip labels like this, but it’s not as visually appealing (in my opinion)

plot(tree, tip.color = c("orchid", "red", 'slateblue')[discrete])

we can make other discrete maps with the contmap, but it defeats the purpose of it being a “cont” map

contMap(tree, discrete)

We can work with other packages to do nice plots, such as ggtree. We can make a nice looking tree accounting for branch length differences with the dots, and make the font bold and italic

test.tree<-ggtree(tree, layout="rectangular")+
  geom_tiplab(align=T, fontface="bold.italic")

## Warning: `aes_()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`
## ℹ The deprecated feature was likely used in the ggtree package.
##   Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning in fortify(data, ...): Arguments in `...` must be used.
## ✖ Problematic arguments:
## • as.Date = as.Date
## • yscale_mapping = yscale_mapping
## • hang = hang
## ℹ Did you misspell an argument name?

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggtree package.
##   Please report the issue at <https://github.com/YuLab-SMU/ggtree/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

test.tree

We can then make a heatmap plot like before, but with a few more options. These heatmaps can also be used to plot continuous variables, and even many variables. We’ll keep default colors here, but we’ll add a new legend. The backslash + n adds a new line

gheatmap(test.tree, disdat, width=0.1,
         offset=0.2,
         custom_column_labels = "State",
         legend_title = "Wing\nType",
         color="black")

We want to be able to assess ancestral condition in discrete characters too. We can do this through simmaps. The below would be one simmap. This could vary though, depending on the support

disc<-make.simmap(tree, discrete)

## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##            A          B          C
## A -0.4860471  0.2566204  0.2294267
## B  0.2566204 -0.2566204  0.0000000
## C  0.2294267  0.0000000 -0.2294267
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##         A         B         C 
## 0.3333333 0.3333333 0.3333333

## Done.

plotSimmap(disc)

## no colors provided. using the following legend:
##         A         B         C 
##   "black" "#DF536B" "#61D04F"

The shift in colors can give us an estimate on where changes in the discrete come from, but it’s better to do many simulations to give us some level of confidence. We can set up a model like below, and run 1000 simulations given our trait data and the model. the summary gives us information on how often we see each state.

er_model<-fitMk(tree, discrete, model="ER", pi="fitzjohn")
testing_smap<-simmap(er_model, nsim=1000)
summary(testing_smap)

## 1000 trees with a mapped discrete character with states:
##  A, B, C 
## 
## trees have 5.199 changes between states on average
## 
## changes are of the following types:
##        A,B   A,C   B,A   B,C   C,A   C,B
## x->y 1.454 1.068 2.051 0.212 0.215 0.199
## 
## mean total time spent in each state is:
##              A          B         C    total
## raw  5.0105887 10.8669394 0.7243291 16.60186
## prop 0.3018089  0.6545617 0.0436294  1.00000

( We can then set up viridis colors for how many states we have (that’s the levels(discrete) part), and then we can map the figure

cols2<-setNames(viridis(length(levels(discrete)), begin=0.2, end=0.9), 
                levels(discrete))
plot(summary(testing_smap),colors=cols2,ftype="i")
legend("topleft",c("A","B", "C"),
       pch=21,pt.bg=cols2,pt.cex=2)

The resulting pie charts at nodes tell us the probability of each state at that position.

Tree_Simulation_and_Trait_Mapping

Carl Hjelmen

2025-11-14