NOTE: This document is a work in progress. If you see any errors, please reach out to either Alex or myself so we can address them.
This is a collection of some of the most used statistical procedures and visualization techniques in the Smith Lab. Generally speaking, this document is structured from the perspective of someone who wants to analyse community ecology data (and is completely biased towards the procedures I performed during my PhD), but that does not mean these procedures are only useful in that context, so even if you are not doing a community ecology project, you might find this tome useful.
Here, you will find brief descriptions of what each procedure is, links to useful resources, and step-by-step instructions on how to perform them in R. Depending on what year you are reading this, some of the R functions we use here may have changed in how they work, but these instructions should give you at least a starting point. You can find the R and package versions used during the writing of this document in Section 5.
I have broken down every procedure into its steps and displayed the code and outputs for you to follow along. I have annotated every code chunk with brief explanations of what most things in the code do in case you are not familiar with a specific command. If I wrote a comment explaining what a command does, I likely did not repeat that explanation when the same code reappears later on.
If you are not comfortable reading R code, here is a quick guide on how to do so.
One of the first things you will see everywhere is the assignment arrow, <-. This means that whatever is to the right of that arrow gets stored in the name we write on the left, and every time you call that name, all of the values assigned to it show up. For example, if you write x <- 3, every time you tell R to run x, the number 3 will appear.
#________________________________
# Assign the number 3 to x
x <- 3
#________________________________
# Display whatever is stored in x
x
[1] 3
You will also constantly run into functions, such as mean(). The things that go inside the parentheses are the parameters of the function (such as your data file, any tweaking you want to do to how the function does its calculations, etc.). Their values are assigned with an = sign and are separated by commas, so every function will look something like function(x = something, y = something_else).
Finally, you will see a lot of the piping operator %>%, which comes from the magrittr package. Base R has its own operator (|>), but it's slower than %>% (as of the writing of this document), though you'd only notice the difference in speed if you are running procedures that take a long time to finish (so use whichever one you like!). Piping operators can be read as "then" (e.g. do this, then do that, then do that, etc.). For example, data_frame %>% select(variable_x) %>% distinct() can be read as "take my data frame called 'data_frame', then select variable 'x' from it, then show me the distinct values of 'x'".
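If you want to try reading a pipe yourself, here is a quick toy example using the starwars data frame that comes with dplyr (dplyr is loaded as part of the tidyverse below; this snippet is purely for practising how to read pipes as "then").
#________________________________
# Take the starwars data frame, then keep only the species column, then show
# the distinct species
library(dplyr)
starwars %>%
  select(species) %>%
  distinct()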
Now we are ready to start. First, let's load some packages that we will be using throughout this document.
# This is an alternative way of calling your R packages to your session instead
# of writing "library()" for every package that you have. Either one is fine,
# but I added this here so you know it exists.
lapply(
  c(
    "tidyverse", # For data manipulation and graphics (also loads magrittr pipe)
    "ggtext",    # For easier axis title writing when we need sub or superscripts
    "here",      # For easier file pathing in the project directory
    "readxl"     # For importing Excel files
  ),
  library,                # Function to be applied over values in vector above
  character.only = TRUE)  # Need this for it to work for some reason
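For reference, this is exactly equivalent to writing the more familiar individual calls (the lapply() above already loads everything, so you would use one style or the other, not both):
#________________________________
# The same package loading written out one library() call at a time
library(tidyverse) # For data manipulation and graphics (also loads magrittr pipe)
library(ggtext)    # For easier axis title writing when we need sub or superscripts
library(here)      # For easier file pathing in the project directory
library(readxl)    # For importing Excel files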
Alpha diversity is a measure of the local diversity of species in any given place, often talked about in terms of number of species but sometimes taking into account their relative abundances as well. We will cover a couple of ways that you are likely to analyse alpha diversity data.
GAMs (generalized additive models) basically do magic, but since this magic can be complex I will only give you a brief and simplified explanation of how they work. If, however, you are interested in learning more about them, I highly recommend Noam Ross's GAMs in R course (which at the time of this writing is completely free!), Gavin Simpson's YouTube workshop on GAMs, and if you really want to know more about the math behind them, Simon Wood's book.
In short, GAMs are models that can learn from the data what the best shape to represent them is. I will try to impart on you why I called this magic with an example.
Most models assume beforehand the behaviour of the data in question, and the math behind them (calculation of parameters) and the subsequent test statistics and p-values that evaluate the model as a description of the data are all based on that assumed shape. Take linear models for instance – by definition, they assume a linear relationship between the variables in question. If the variables you are working with do not have a linear relationship, you are bound to miss perhaps important behaviours of the system you are trying to represent with your model. For example, suppose you have a positive hump-shaped relationship between two variables (let's say species diversity and elevation), and then while doing your statistical analyses you fit a linear model. By doing so, you will be able to capture the positive nature of the relationship (species diversity increases with elevation), but you will not capture the humped nature of that relationship (in the diversity and elevation hypothetical here, there would be a dramatic increase of diversity at a certain point which would then stop happening, and you would completely miss that since your linear model would tell you that diversity is changing in a linear, one-step-at-a-time fashion). On the other hand, if your data behave linearly, a linear model would capture the nature of the relationship between the two variables very well, since it would assume the correct behaviour of the data. Remember, however, how I said at the beginning of this that GAMs can learn from the data what the best shape to represent them is? This means that if the data behave linearly, our GAM will come out looking and behaving like a linear model, and if our data behave "humped-shapedly", our GAM will take that form instead!
Briefly, this is done through things called basis functions, which when put together make up a GAM. Imagine that you take your entire elevation gradient and you split it, for instance, into 10 equidistant sections. Now suppose you fit, for instance, a cubic regression model to the diversity and elevation data for each of those 10 sections, and that you could then combine the results of each of those sections to ultimately describe the diversity-elevation gradient in one grand model. Well, that is more or less how GAMs work! The basis functions are the functions used throughout the 10 sections of the gradient, and the GAM is the overall model that combines the results from each of the sections!
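If you want to actually see what basis functions look like, here is a minimal sketch using made-up data (the x values and the choice of a cubic regression basis are arbitrary and purely illustrative).
#________________________________
# Minimal sketch of basis functions, with made-up data. smoothCon() from mgcv
# builds the basis matrix behind a smooth term: each column is one basis
# function evaluated along x, and the fitted smooth is a weighted sum of them.
library(mgcv)
toy <- data.frame(x = seq(0, 10, length.out = 100)) # Hypothetical gradient
toy_basis <- smoothCon(s(x, k = 5, bs = "cr"), data = toy)[[1]] # Cubic regression basis
matplot(toy$x, toy_basis$X, type = "l",
        xlab = "x", ylab = "Basis function value") # One curve per basis function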
The assumptions of GAMs are the same as those for parametric linear models, which you might already have been exposed to in undergraduate statistics courses. There are three main assumptions: 1) the error terms follow the distribution specified for the model; 2) the observations (and therefore the error terms) are independent of one another; and 3) the error terms have constant variance along the range of fitted values.
Assumption number 1 might have been presented to you in the past as the assumption of “normality”, and you might have also heard things like “make sure your data are normal!”, or “are your data normally distributed?”, or “you should transform your data to make them normal.”, or any number of sentences like that. This stems from the fact that statistics courses tend to emphasize parametric tests, which tend to assume a Gaussian distribution (known as the normal distribution) of the error terms (if the data are normally distributed, this gets reflected in the error terms, which is what matters for modeling. For more information on this, see Montgomery’s book, or any other basic statistics book). In GAMs, we can choose the error distribution we think is most appropriate to our data based on our knowledge of its nature, so we are not bound by the normal distribution and generally do not need to do any transformations to our data. For example, count data is made up of integers (whole numbers, not decimals) and cannot have negative values (it is impossible for you to count -10 oranges – what would that even mean?). The Gaussian distribution goes from -\(\infty\) to +\(\infty\) and includes decimal numbers, so any model using the normal distribution to predict count data would be predicting nonsensical things (such as counting -10.5 oranges!). Therefore, it is crucial that you choose the correct error distribution for your model, because otherwise your results just will not make any sense (you will get outputs and p-values and everything else, but they will be very poor representations of the system you are trying to describe!).
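To make the family choice concrete, here is a small hedged sketch of what a count-data model would look like; the beetle counts and elevations below are simulated, so this is only to show where the family argument goes (the real model we fit later in this section uses family = "gaussian").
#________________________________
# Hypothetical example: counts (integers, never negative) call for a count
# distribution such as Poisson rather than Gaussian. All data here are made up.
library(mgcv)
set.seed(1)
fake_counts <- data.frame(elevation = runif(50, 0, 1500)) # Simulated elevations
fake_counts$beetles <- rpois(50, lambda = exp(2 - fake_counts$elevation / 1000)) # Simulated counts
count_gam <- gam(beetles ~ s(elevation, k = 5),
                 data = fake_counts,
                 family = "poisson", # Count-appropriate error distribution
                 method = "REML")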
Assumption 2 refers to the relationship between each of your data points, and whether they are influenced by (or related to) each other. In technical terms, the two points must not be drawn from different populations if the assumption of independence is to be upheld (they must be from the same population). Of course, this is a border-line impossible thing to control (especially in field studies!), since for all intents and purposes, things are connected to each other with varying degrees of intensity, so nothing is truly independent. However, statistically speaking, what you are trying to see is whether the behaviour of each data point is independent from others or not, and you can minimize this through your experimental design or by accounting for that dependence in your model. For instance, suppose we want to measure the average concentration of red blood cells in a population of 50 students. If, when you collect your data, you take measurements from only 25 students, but measure each of them twice, suddenly your data points are not independent, because every two observations will be closely related to each other, whereas if you had sampled each of the 50 students once, it is less likely that the data would be related. The consequence of violating this assumption is that you are at an increased risk of making a Type I error, which is finding a pattern that suggests a relationship between two variables that doesn’t really exist. I’ll illustrate what that means with another example. Say you wanted to model the relationship between brain size and body mass in a community of mammals, and you go ahead and draw little dots on a graph (your data points) and then you draw a line of best fit, which then shows you a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size. Now suppose you later realized that the upper half of the mammals on the graph (the heaviest and brainiest ones) are all part of the same genus, and that the bottom half mammals all belong to a different genus. Well, suddenly you can’t tell if the pattern you are seeing from your line of best fit is due to the relationship between brain size and body mass or if it’s just because members of each of those genera have big brains paired with big bodies and small brains paired with small bodies, and that pairing has nothing to do with the relationship between brain and body size (which in this case means that if you were to add data from different genera to your graph you would stop seeing a relationship between brain and body size). Thus, you may be seeing a pattern that isn’t really reflecting the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper, which we will revisit in Section 4.3.
Assumption 3 has to do with variance, and it is at the root of statistical thinking. Suppose you want to see how variable “A” is affected by variable “B” (say blueberry yield and the application of a specific herbicide concentration). The most basic way to think about this is in a signal vs. noise setting. If there is a relationship between “A” and “B” (i.e. a signal exists between blueberry yield and that concentration of that herbicide), it is generally obscured by noise (variance caused by many other things such as rainfall, efficiency of application of the herbicide, blueberry plant health, etc.), and we use statistics to quantify the noise and the signal so we can see how big the signal actually is (i.e. how strong the relationship between “A” and “B” is). Now suppose what you want is to see how variable “A” is affected along differing levels by variable “B” (i.e. how blueberry yield is affected along a gradient of concentrations of this herbicide). Here, there may be differing levels of noise along the gradient (e.g. suppose lower concentrations have smaller variance because at those levels the herbicide doesn’t really do anything to the plants, but that variance increases as concentrations increase because different plants have different tolerances to the herbicide), so you need to ensure you either account for this change in variance in your experimental design or in your model, otherwise you are at a higher risk of making a Type II error, which is finding no evidence of a relationship between “A” and “B” when there actually is one (i.e. too much noise for you to find the signal). If your violation of constant variance comes from unequal sample sizes along the factor levels you have chosen, the consequences may be a bit different. Suppose you found low variance in the lower concentrations of herbicide and high variance in high concentrations – there are two scenarios that can arise here. 1) If you have fewer samples for the low herbicide concentrations (low variance associated with fewer samples) and many more samples of the high herbicide concentrations (high variance associated with higher number of samples), you are at risk of Type II error; 2) If you have fewer samples for the high herbicide concentrations (higher variance associated with fewer samples) and more samples of the low herbicide concentrations (low variance associated with higher number of samples), you are now at risk of making a Type I error.
There are ways to check how our data meet (or violate) these assumptions when fitting our GAMs, and knowing whether or not you violate them is important because we can then address them methodologically, or at the very least, we can discuss how our conclusions may be affected by these violations (if they exist).
First, we will load the packages required to complete the entire procedure.
library(mgcv) # For fitting GAMs
library(gratia) # For cleaner-looking assumption plots
Your data for GAMs should be laid out in the "each row is an observation and each column is a variable" format. For instance, suppose you wanted to get information about the inhabitants of a galaxy far far away such as their name, birth year (before the battle of Yavin), weight, height, etc. An intuitive way to lay out those data would be to make every row represent a specific inhabitant and every column a variable about that inhabitant, so that in the end you would have something like this:
# This data set is part of the dplyr package
# Displays data about all Star Wars characters
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Your own data need to be laid out in the same way, where each row is an observation (e.g. individual organisms, or collection sites) and each column is a variable about that observation (e.g. elevation, temperature, humidity, etc.).
We will be using the following Staphylinidae elevation and heat tolerance data from Costa Rica to learn how to run GAMs in R.
#______________________________________________________________________________
# Importing the data frame
staph_ctmax_elevation_data <- read.csv(
  here("data",
       "raw",
       "OTU_named_190220.csv")) # Imports the data from working directory
#______________________________________________________________________________
# Displaying the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just makes this output look nicer
# A tibble: 57 × 3
bin elevation ctmax
<chr> <int> <dbl>
1 OTU-01 1300 37.5
2 OTU-01 1300 36.5
3 OTU-01 1300 37
4 OTU-01 1300 36.5
5 OTU-01 1300 36.5
6 OTU-01 1300 34
7 OTU-01 1200 36
8 OTU-02 1200 36
9 OTU-03 1200 36
10 OTU-04 1300 39.5
# ℹ 47 more rows
We will be using the gam()
function from the mgcv
package to fit our model and the gam.check()
function from the mgcv
package to check whether our error terms meet or violate the model assumptions. If you want, you can use the appraise()
function from the gratia
package to have ggplot2
versions of the graphs you get from gam.check()
. For this example, we can expect a Gaussian distribution of our error terms because CTmax is not an integer and can be both positive and negative (theoretically, you could have a species with a CTmax of -10°C, who knows!).
#______________________________________________________________________________
# Model fitting
staph_ctmax_gam <- gam(
  formula = ctmax ~ s(elevation, # s() specifies variable as smooth (wiggly)
                      k = 5),    # Specifies number of basis functions
  data = staph_ctmax_elevation_data, # Data frame with our variables of interest
  family = "gaussian", # Specifies the expected error distribution
  method = "REML"      # Algorithm that penalizes over-fitting of the data
)
#______________________________________________________________________________
# Checking assumptions
gam.check(staph_ctmax_gam)
Method: REML Optimizer: outer newton
full convergence after 5 iterations.
Gradient range [-3.56664e-10,1.45528e-11]
(score 116.2551 & scale 3.236838).
Hessian positive definite, eigenvalue range [0.7384314,27.53047].
Model rank = 5 / 5
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(elevation) 4.00 2.81 0.95 0.32
In this output, we see 4 plots and some information printed to the console. The plots will help us check our assumptions, and the console log will tell us whether the model converged (and how many iterations that took), the model rank, and the results of the basis dimension (k) check, which flags whether the number of basis functions we chose may be too low.
You can often fix a violation of constant variance by increasing the number of basis functions in the k
argument in the gam()
function (I don’t know for sure, but it looks like this test is for whether the residuals behave randomly within each basis function rather than among all functions. I say this because of what you will see below).
The plots came out looking a bit squished here because they are printed as individual plots in RStudio, but here they got combined into the space of one and I didn’t try to fix it (sorry!). At this point, however, is where I would use the appraise()
function from gratia
to have better-looking plots.
# Prints model diagnostic plots
appraise(staph_ctmax_gam)
We see in Figure 1 that our residuals mostly fall very close to the line in the “QQ plot of residuals”, which indicates that our error terms mostly behave in a Gaussian way. We can see in the “Histogram of residuals” plot that our residuals seem to be left skewed, which accounts for why not all residuals fell within the line in the “QQ plot”. Some statisticians may recommend we transform our data to address this skew, but I will leave it up to you whether to do it or not. I tend to think that if the vast majority of residuals behave according to the distribution we chose (and the ones that don’t are not absurd-looking outliers), we can go ahead with our model and then in the discussion we can talk about the outliers that did not fall perfectly within the expected distribution.
We see in the “residuals vs linear predictors” plot that there is a sideways cone shape to our residuals (looks like >). This indicates that there is no constant variance along the values of y, and that where CTmaxs are higher (looks like above 40°C), there tends to be less variation than where CTmaxs are lower (below 40°C).
If we see any pattern with the mean of the residuals vs fits plot (the middle line doesn’t cut the dots in half horizontally), the assumption of independence may be violated. This does not seem to be the case with our data, but we will explore this a bit more in Section 4.2.
So, we know that our model is going to yield results that make sense (we chose the correct error distribution), but that we may be at risk of making some type of error because we don't have constant variance. This being said, it does NOT mean that our data are bad and that therefore modeling them is a fruitless venture. For one, just by trying to get a model going, we have learned that CTmax does not vary equally along its range, and that where we find things with CTmaxs above 40°C, we tend to find only things with high CTmax, but where we find things with CTmaxs below 40°C, we tend to find all kinds of CTmaxs, which biologically speaking is an extremely interesting find! All we have to be cautious of is overstating whatever results we find about the relationship and what shape it shows. Because our test statistic in the gam.check() function did not indicate that we violated constant variance, we should not worry too much about it (could be because it tests variance within each basis function instead of across the entire gradient, I don't know), though we should keep it in mind!
Now, we will extract the results from our fitted model.
summary(staph_ctmax_gam) # function from base R that extracts model results
Family: gaussian
Link function: identity
Formula:
ctmax ~ s(elevation, k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.9211 0.2383 163.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(elevation) 2.806 3.342 37.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.689 Deviance explained = 70.5%
-REML = 116.26 Scale est. = 3.2368 n = 57
This output is broken up into three main parts: the model description (the error distribution family, the link function, and the model formula), the parametric coefficients (here just the intercept), and the approximate significance of the smooth terms, followed by the adjusted R-squared and the deviance explained.
Now let’s visualize the model using the base R graphical output (it gives us some information that we can’t really get with the gratia
package, as far as I know).
plot(staph_ctmax_gam,
     residuals = TRUE, # Shows the residuals
     pch = 1, # Selects the shape of the dots
     shift = coef(staph_ctmax_gam)[1], # Shifts the scale of y to include the intercept
     seWithMean = TRUE, # Adds the error associated with the mean to the se
     shade = TRUE, # Makes 95% confidence interval into shaded area
     xlab = "Elevation (m)") # Specifies label of x axis
We see in Figure 2 evidence that CTmax decreases with elevation, but that the variance between the CTmax values increases with elevation. This is extremely interesting biologically speaking, because we see evidence that CTmax is important for Staphylinidae at lower elevations, but that it may not be the case at higher elevations given that we tend to find a larger range of values there!
Now let’s make the above plot with the ggplot2
package, which allows for great customization while being intuitive in how it works. We have already loaded ggplot2
when we loaded the tidyverse
package in Section 1, so we are good to go. Note that the ggplot2
method of plotting GAMs (as far as I know) doesn’t use the mgcv
package to generate its model, so you may find some minor discrepancies in the base R plot of our mgcv
model (Figure 2) and the ggplot2
plot (Figure 3).
ggplot(staph_ctmax_elevation_data, # Data frame that has our data
aes(x = elevation, # Our x variable
y = ctmax)) + # Our y variable
geom_point() + # Choosing to plot our data in the form of points
geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
formula = y ~ s(x, # Specifies formula of our model
k = 5), # Number of basis functions
se = TRUE, # Whether we want the gray 95% CI or not (we do!)
method.args = list(family = "gaussian")) + # Error distribution
labs(x = "Elevation (m)", # Customizing x and y axis labels
y = "CT~max~ (°C)") + # Text between ~~ will be subscript (markdown syntax)
theme_classic() + # Removes gray grid in background and adds lines to axes
theme(axis.title = element_text(size = 12), # Font size for all axis titles
axis.title.y = element_markdown(), # Allows markdown in y axis label
axis.text = element_text(size = 10)) # Font size for all axis text
So in conclusion, we found evidence of a significant effect of elevation on Staphylinidae CTmax, but because of the lack of constant variance, we should be cautious when claiming certainties about the specifics of the model (e.g. the shape of the relationship). This lack of constant variance could be due to a violation in the independence assumption (here, I’m speaking in biological terms), which we will address in Section 4.
Beta diversity is a measure of the differences between communities (numbers of species and what species are there) in different places (e.g. traps, habitats, field sites, etc.). We will cover a visualization technique that allows us to see those differences in community composition and statistical tests that go along with it.
The NMDS (non-metric multidimensional scaling) is a visualization technique that allows us to see the differences in community composition (what kinds of species you find) between your treatments (levels of your factor of interest). It is not a statistical test, so it should always be accompanied by a test such as PERMANOVA or ANOSIM so that we can say whether the differences we see in the NMDS are statistically significant or not.
First, we will load the packages required to complete the entire procedure.
library(vegan) # For community analyses
Let’s also pre-assign a colourblind palette so we can use it for the graphs we will make.
#______________________________________________________________________________
# Colour palette
cbpalette <- c(
  "#999999",
  "#E69F00",
  "#56B4E9",
  "#009E73",
  "#F0E442",
  "#0072B2",
  "#D55E00",
  "#CC79A7"
)
It is possible that you will have one big Excel file with all your data in it, which is great. This will have to be transformed into at least two separate datasheets, but fear not: this can all be done in R and you don't have to worry about having several Excel files to keep track of. However, it is also possible you have the data frames already set up for your analyses somewhere, in which case you can skip this step.
Here we will import an example datasheet that might look like yours, taken from Kembel and Cahill Jr (2011).
#______________________________________________________________________________
# Importing the data frame
community_df <- read_excel(
  here("data",
       "raw",
       "community_df.xlsx")) # Imports the data from working directory
#______________________________________________________________________________
# Displaying the data frame
# Prints the top 10 rows of the data frame if table is massive
community_df
# A tibble: 331 × 16
plot_number species abundance sla leaf_area leaf_thickness leaf_tissue_dens
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9 Viola_… 10 839. 1.05 0.2 0.00596
2 7 Planta… 20 548. 0.256 0.223 0.00817
3 11 Potent… 40 471. 4.28 0.343 0.00619
4 3 Andros… 20 450. 0.255 0.36 0.00618
5 14 Campan… 10 402. 0.998 0.247 0.0101
6 7 Poa_sa… 50 385. 0.347 0.203 0.0128
7 17 Orthoc… 60 371. 0.653 0.204 0.0132
8 10 Astrag… 10 359. 2.03 0.237 0.0118
9 9 Draba_… 10 348. 0.600 0.307 0.00938
10 9 Descur… 10 347. 0.886 0.29 0.00993
# ℹ 321 more rows
# ℹ 9 more variables: srl <dbl>, root_tissue_dens <dbl>, root_diam <dbl>,
# habitat <chr>, site <chr>, slope <dbl>, aspect <dbl>, slope_position <dbl>,
# rel_moisture <dbl>
As a quick side note, notice how each row represents the abundance of one species caught in a specific plot. Yours might be numbers of a species by trap, or individual specimen data, which can easily be turned into a data frame like ours if the meta data is available.
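If your data are individual specimen records, a hedged sketch of that conversion could look like this (the data frame and the column names trap_id and species are made up for illustration; count() comes from dplyr, already loaded via the tidyverse in Section 1).
#________________________________
# Hypothetical specimen-level records: one row per individual caught
specimen_records <- tibble(trap_id = c(1, 1, 1, 2, 2),
                           species = c("sp_A", "sp_A", "sp_B", "sp_A", "sp_C"))
# Counting rows per trap and species gives one abundance row per combination
specimen_records %>%
  count(trap_id, species, name = "abundance")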
Now that we have the data frame loaded in R, we can start preparing the building blocks required to run the NMDS. First, we will create two data frames: 1) the community matrix, which will contain the number of species collected in each of our collection plots (or traps, or GPS coordinate grids, really depends on your project), and 2) the metadata, containing information about each of the collection plots.
The species matrix can be generated as follows, and should look like this:
matrix <- community_df %>%
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  select(plot_number, # Puts plot_number as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically
#______________________________________________________________________________
# Display the matrix
matrix
# A tibble: 17 × 77
plot_number Achillea_millefolium Allium_textile Amelanchier_alnifolia
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 0
2 2 10 10 0
3 3 30 0 0
4 5 0 0 0
5 6 0 0 0
6 7 0 0 0
7 8 0 0 0
8 9 50 0 10
9 10 40 0 0
10 11 40 30 0
11 12 0 20 0
12 13 40 10 10
13 14 0 10 10
14 15 40 10 0
15 16 30 0 0
16 17 10 0 0
17 18 40 0 0
# ℹ 73 more variables: Androsace_occidentalis <dbl>, Anemone_patens <dbl>,
# Antennaria_neglecta <dbl>, Antennaria_parvifolia <dbl>,
# Arnica_fulgens <dbl>, Arnica_sororia <dbl>, Artemisia_cana <dbl>,
# Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
# Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
# Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
# Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>, …
The community metadata can be generated as follows, assuming each plot row has a unique value for each of the variables. It’s possible you have several measurements of some abiotic variable, so in that case you’d need to average it so that you only have one value per plot number. Please see the printed table at the end, because yours should look like it.
metadata <- community_df %>%
  group_by(plot_number) %>% # Applies whatever comes next to each level of plot_number
  distinct(habitat, # Selects one of each distinct value of the below variables
           site,
           slope,
           aspect,
           slope_position,
           rel_moisture) %>%
  arrange(plot_number) %>% # Orders by plot_number in ascending fashion
  ungroup() # Makes whole table independent from group_by variable, important later
#______________________________________________________________________________
# Display the meta data sheet
# Prints the entire metadata table
metadata
# A tibble: 17 × 7
plot_number habitat site slope aspect slope_position rel_moisture
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Mixedgrass Onefour 0 270 3 1
2 2 Mixedgrass Onefour 20 130 1.5 2
3 3 Mixedgrass Onefour 5 90 1 2
4 5 Mixedgrass Onefour 5 130 2 1
5 6 Mixedgrass Onefour 1 90 3 1
6 7 Mixedgrass Onefour 8 0 2 2
7 8 Fescue Kinsella 5 100 3 2
8 9 Fescue Kinsella 5 160 1.5 3
9 10 Fescue Kinsella 18 155 2.5 2
10 11 Fescue Kinsella 12 180 2 1
11 12 Fescue Kinsella 5 164 3 3
12 13 Fescue Kinsella 15 220 2.5 3
13 14 Fescue Kinsella 3 160 3 3
14 15 Fescue Kinsella 5 160 3 3
15 16 Fescue Kinsella 6 150 1.5 3
16 17 Fescue Kinsella 4 140 2.5 2
17 18 <NA> <NA> NA NA NA NA
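As a side note on the averaging point made above: our data already have one value of each abiotic variable per plot, but if (hypothetically) something like slope had been measured several times per plot, collapsing it to one value per plot_number could look like this.
#________________________________
# Hypothetical sketch only: average a repeatedly measured variable per plot
community_df %>%
  group_by(plot_number) %>%
  summarize(mean_slope = mean(slope, na.rm = TRUE)) # One averaged value per plot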
Now we are ready to run the NMDS.
To run the NMDS, we will use the metaMDS()
function from the vegan
package.
#______________________________________________________________________________
# Setting seed to ensure reproducible results since nmds is a permutational
# procedure. This makes it so that the random numbers picked to generate these
# results will always be the same. Number you pick can be anything.
set.seed(2023)
#______________________________________________________________________________
# NMDS
community_nmds <- metaMDS(matrix %>%
                            select(!1)) # Omits column 1 (plot_number)
Square root transformation
Wisconsin double standardization
Run 0 stress 0.06758156
Run 1 stress 0.07097971
Run 2 stress 0.06482428
... New best solution
... Procrustes: rmse 0.07340193 max resid 0.2350523
Run 3 stress 0.0675814
Run 4 stress 0.06467223
... New best solution
... Procrustes: rmse 0.01147297 max resid 0.0343962
Run 5 stress 0.067413
Run 6 stress 0.0674132
Run 7 stress 0.06482413
... Procrustes: rmse 0.01147154 max resid 0.03440339
Run 8 stress 0.0653533
Run 9 stress 0.06787984
Run 10 stress 0.06756966
Run 11 stress 0.06787951
Run 12 stress 0.06535327
Run 13 stress 0.06741359
Run 14 stress 0.06758146
Run 15 stress 0.06467203
... New best solution
... Procrustes: rmse 0.0003099958 max resid 0.0009114333
... Similar to previous best
Run 16 stress 0.06788061
Run 17 stress 0.06467214
... Procrustes: rmse 0.0002691596 max resid 0.0007917589
... Similar to previous best
Run 18 stress 0.07051754
Run 19 stress 0.06482403
... Procrustes: rmse 0.01154534 max resid 0.0353847
Run 20 stress 0.06600201
*** Best solution repeated 2 times
This function iteratively attempts to condense the multidimensional data frame (many columns with many rows to be compared) into a two-dimensional set of points (essentially x and y coordinates) while trying to stay true to the original differences between each point of the data frame. This is what all these “runs” are in the output above. You can imagine that the more you try to force this multidimensional set of differences into two dimensions (a regular graph), the more stress you create. In fact, this is a measure you can get! Stress is a measure of how well the NMDS plot is representing the differences in our original data set. We can plot what is called a stress plot of our NMDS to see how much stress our NMDS had when trying to condense our differences.
stressplot(community_nmds) # Generates the stress plot
Visually, what you want to see is this “staircase” style graph (Figure 4). We can also directly get the stress value by extracting it from our metaMDS()
results with community_nmds$stress
, which in our case is 0.065. According to Kruskal, you want the stress value to be less than 0.2 (low stress). Often when the stress is too high, our metaMDS()
function will fail to converge, so at the end of the output above where you see all of the iterations it will say something like “failed to converge”. Conversely, when the stress is too low (too close to 0), it will also fail to converge. When that is the case, you can play around with reducing or increasing the size of your plots (pooling or splitting them apart tends to work). Since our example converged nicely, we can then plot our results.
To plot the NMDS with ggplot2, the first thing we need to do is create a data frame that combines the results from our metaMDS()
function with our metadata data frame.
data_scores <- as.data.frame(community_nmds$points) %>% # Extracts NMDS plotting points
  mutate(plot_number = metadata$plot_number, # Attaches columns from metadata table
         habitat = metadata$habitat)
Now let’s view the table
data_scores # Prints the table
MDS1 MDS2 plot_number habitat
1 -1.0309795 0.2629909 1 Mixedgrass
2 -0.7136177 -0.3885305 2 Mixedgrass
3 -0.6773968 -0.5183459 3 Mixedgrass
4 -1.6868373 0.3290805 5 Mixedgrass
5 -1.5790633 -0.2702064 6 Mixedgrass
6 -1.2286524 -0.0570536 7 Mixedgrass
7 -1.0491372 0.1059923 8 Fescue
8 0.9419983 0.4102748 9 Fescue
9 0.7731372 0.1122020 10 Fescue
10 0.2347052 -0.1745242 11 Fescue
11 0.1459202 0.3860476 12 Fescue
12 0.8159381 0.3035745 13 Fescue
13 0.6712101 0.1708772 14 Fescue
14 1.0346944 -0.3985301 15 Fescue
15 1.2593688 -0.3332723 16 Fescue
16 0.6240574 0.3493586 17 Fescue
17 1.4646544 -0.2899354 18 <NA>
This next step is optional, but I like it, so I’m including it here. In it, we generate a data frame that contains the coordinates of the centroids for each group, so when we make the plot, we can make the centroids stand out from the other data.
centroid <- data_scores %>%
  group_by(habitat) %>%
  summarize(MDS1 = mean(MDS1),
            MDS2 = mean(MDS2)) %>%
  drop_na(habitat)
At this point, we are ready to have a ggplot2
plot of our NMDS with our main variable of interest. In this case, as you might have guessed from the last table we created, we are interested in how the community of species differs between two habitats: mixed grass and fescue. So let's see what that looks like!
ggplot(data = data_scores %>%
drop_na(habitat), # Removes rows with missing metadata
aes(x = MDS1,
y = MDS2,
colour = habitat)) +
stat_ellipse(level = 0.95, # Specifies the confidence interval of the ellipsoids
             show.legend = FALSE) +
# coord_fixed() + # Optional. Makes x and y units change at same rate
geom_point(size = 2) +
geom_point(data = centroid,
aes(fill = habitat),
size = 5,
shape = 21,
colour = "black",
show.legend = FALSE) +
labs(colour = "Habitat") +
scale_colour_manual(values = cbpalette) +
scale_fill_manual(values = cbpalette) +
theme_classic() + # Gets rid of grey background grid and adds lines to axes
theme(text = element_text(size = 12),
axis.text.x = element_text(size = 10),
legend.position = "right")
There we have it, Figure 5 is a nice-looking ggplot2
NMDS plot!
Visually, it looks like the two communities are different since there is little overlap between the ellipsoids, but we cannot tell that just by looking at it (well… sometimes you can, like here!), so we need a statistical test that will give us a p-value for whether they are different or not.
I learned how to make NMDS plots in ggplot2
through Pat Schloss’s amazing Riffomonas YouTube channel, which I HIGHLY recommend you check out!
An additional resource that you might find useful is the tutorial Steven Kembel made on how he analyzed the data we used from his paper.
ANOSIM and PERMANOVA are multivariate (many y variables) statistical tests that tell us whether there is evidence that two or more groups are significantly different from one another. In the Smith Lab, very often our groups will be different habitats, and the many y variables will be the species found in each habitat. The ANOSIM (analysis of similarity) can only have one grouping variable, while the PERMANOVA (permutational analysis of variance) can have many explanatory variables. Both analyses are a form of ANOVA, but non-parametric (they both work with any error distribution, so the residuals from your data don't need to have a normal distribution). Please see Anderson and Walsh's paper if you are interested in a deep dive of their differences and what each of them does (plus another one called the Mantel test) and the GUSTA ME (guide to statistical analysis in microbial ecology) website and paper for more guides about these analyses.
First, a bit of background. Like the NMDS, ANOSIM uses ranks as its measures rather than what is called a Euclidean distance (the length between the two points in question). For instance, suppose you have three buildings, and you want to measure the distance between each pair of them. Suppose then that you found the distance between buildings A and B is 15 m, between A and C is 10 m, and between B and C is 5 m. These are, by definition, the Euclidean distances between those buildings. Now when you write those distances in a column and sort it from farthest to closest, you may have something like this:
# A tibble: 3 × 2
`Building combination` `Distance (m)`
<chr> <dbl>
1 A to B 15
2 A to C 10
3 B to C 5
Suppose now that we add a new column to this table that will represent the rank of these distances from farthest to closest (literally as you may think in terms of a race – if you got there in the shortest amount of time, you are ranked first, and so on). Then our table will now look like this.
# A tibble: 3 × 3
`Building combination` `Distance (m)` Rank
<chr> <dbl> <dbl>
1 A to B 15 1
2 A to C 10 2
3 B to C 5 3
The ANOSIM will do its calculations based on the rank column rather than the distance column, which is why it’s a non-parametric test, since parametric tests assume a distribution of your error terms, which means you are assuming population mean and variance values, which means you need Euclidean-distance-like numbers to calculate them. Ranks, on the other hand, don’t have a distribution, only a range, so any analysis that uses ranks instead of Euclidean distances does not assume a distribution, which is the case for the ANOSIM. Note that the other assumptions that tend to come along with parametric tests (constant variance and independence) don’t apply here because both of them are about the behaviours of the variance in your residuals as they pertain to some distribution, and ranks don’t have a distribution.
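If it helps to see the building example in code form (the distances are the made-up values from the text above), the ranking step looks like this.
#________________________________
# The building example as a small table; rank() gives 1 to the smallest value,
# so we negate the distances to rank from farthest (1) to closest (3)
building_distances <- tibble(`Building combination` = c("A to B", "A to C", "B to C"),
                             `Distance (m)` = c(15, 10, 5)) %>%
  mutate(Rank = rank(-`Distance (m)`))
building_distances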
The only assumption the ANOSIM has is that the range of ranks between the groups being compared is equal or at least very similar. In other words, the sample size between the groups needs to be equal or very similar.
I need to give you another bit of background before I dive into the assumptions. This test is called permutational analysis of variance because it runs an ANOVA-like test on our data, then it shuffles our data around, runs the test again, and then repeats this process many times (we can set how many times it does this, usually it’s 999). This, therefore, explains its only assumption – that each datum (singular form of data by the way!) is exchangeable with another datum under the null hypothesis.
What this means intuitively is the following: suppose I have two populations, one that belongs to species "A" and one that belongs to species "B". Now suppose I wanted to measure the average height of each population. These two species need to, therefore, be treated separately from one another (or their relation accounted for in the model) when we do our analysis because they are two different populations (species "A" and species "B"), and individuals from species "A" might have extremely different properties than those of species "B" (just like we talked about in Section 2.1.2). Suppose then that I had a bucket with individuals of species "A", from which I randomly picked 100 individuals, measured their height, tossed them back into the bucket, and that I did this many times. Which individuals appear in that sequence of 100 should not matter to my overall results of average height, no matter how many rounds of 100 draws I do. We can therefore say that members of species "A" are exchangeable with one another. Now what if I added a bunch of individuals from species "B" into that bucket? Suddenly, which individuals are drawn matters, because if in round one 60 draws are from species "B", and in round two only 10 are from species "B", your average values of height for each set of 100 draws will be completely different. In the same way, the PERMANOVA analysis shuffles our data around, does tests, shuffles them again, tests again, and so on, and this is why the assumption of data objects being exchangeable with one another is made in the first place.
Like ANOSIM, PERMANOVA has no assumptions of distribution or constant variance: it works on a dissimilarity matrix (here Bray-Curtis) rather than on raw data (unlike ANOSIM, it uses the dissimilarities themselves rather than their ranks; see Section 3.2.2.1), and its p-values come from the permutations described above rather than from a theoretical distribution.
We will use the same set of packages we used in Section 3.1.
The data needs to be set up in the exact same way we did for the NMDS example, so if you’ve already done that, great! If you haven’t, please refer to Section 3.1.2.2 to see how to do it. The one thing we must do here is preemptively remove the rows with NA from our “metadata” data frame, because both ANOSIM and PERMANOVA will simply not run if there are rows with NAs.
metadata_nona <- metadata %>%
  drop_na(habitat)
After we do this, we must then also remove plot_number 18 (the one with NA metadata) from our “matrix” data frame because now “metadata” has 16 rows while “matrix” has 17.
matrix_nona <- matrix %>% # nona stands for "no NA", and it also means grandma in Italian!
  filter(plot_number != 18) # This is the only row with NA metadata
To run the ANOSIM, we will use the anosim()
function from the vegan
package. The distance
argument specifies which distance calculation will be used. There are many kinds, but we will use the Bray-Curtis Dissimilarity Index, which generates difference metrics while accounting for species abundance, which we have in this data set.
#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)
#______________________________________________________________________________
# ANOSIM
anosim_results <- anosim(matrix_nona %>%
                           select(!1), # Drops the plot_number column
                         grouping = metadata_nona$habitat,
                         distance = "bray") # Specifies the distance calculation method
#______________________________________________________________________________
# Display ANOSIM results
anosim_results
Call:
anosim(x = matrix_nona %>% select(!1), grouping = metadata_nona$habitat, distance = "bray")
Dissimilarity: bray
ANOSIM statistic R: 0.7222
Significance: 0.002
Permutation: free
Number of permutations: 999
We get two values of interest from this output: the ANOSIM statistic R and the significance value. The first is a value that ranges from 0 to 1. The closer it is to 1, the more different the groups are, while the closer they are to 0, the more similar they are. The latter is essentially a p-value. We can see that we have a significance value of 0.002 and an ANOSIM statistic R value of 0.72, so we can conclude that we see strong evidence that the fescue and mixed grass communities are significantly different from one another, just like what the NMDS plot suggested!
The PERMANOVA runs very much like an ANOVA, in that we need to specify a model equation with a y variable (here it’ll be our matrix instead of a single variable) and all of the x variables that we want to add to the analysis. This means that we are able to measure the effects of multiple x variables on our community data, including the interactions between them!
To run a PERMANOVA, we will use the adonis2()
function from the vegan
package.
#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)
#______________________________________________________________________________
# PERMANOVA
permanova_results <- adonis2(matrix_nona %>% select(!1) ~ habitat,
                             data = metadata_nona, # Specifies where the explanatory variables come from
                             method = "bray")
#______________________________________________________________________________
# Display PERMANOVA results
permanova_results
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = matrix_nona %>% select(!1) ~ habitat, data = metadata_nona, method = "bray")
Df SumOfSqs R2 F Pr(>F)
habitat 1 1.7186 0.42686 10.427 0.002 **
Residual 14 2.3076 0.57314
Total 15 4.0262 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you are familiar with ANOVA tables, you might have noticed that the PERMANOVA output looks very much like the output from a regular ANOVA. We see that p = 0.002, so the PERMANOVA agrees with the ANOSIM, showing us evidence that the communities between the fescue and mixed grass habitats are significantly different from each other!
We will now run the PERMANOVA again, only this time we will add the other meta-data variables to our model to showcase the main difference between using a PERMANOVA and an ANOSIM.
#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)
#______________________________________________________________________________
# PERMANOVA
interactions_permanova_results <- adonis2(
  matrix_nona %>% select(!1) ~ # ~ is the way R represents the = sign in an equation
    habitat*slope*slope_position*rel_moisture, # Looking for interaction effects
  data = metadata_nona, # Specifies where the explanatory variables come from
  method = "bray")
#______________________________________________________________________________
# Display PERMANOVA results
interactions_permanova_results
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = matrix_nona %>% select(!1) ~ habitat * slope * slope_position * rel_moisture, data = metadata_nona, method = "bray")
Df SumOfSqs R2 F Pr(>F)
habitat 1 1.7186 0.42686 7.3350 0.014 *
slope 1 0.1193 0.02964 0.5092 0.762
slope_position 1 0.2223 0.05521 0.9487 0.441
rel_moisture 1 0.2178 0.05410 0.9296 0.472
habitat:slope 1 0.1522 0.03779 0.6495 0.638
habitat:slope_position 1 0.1550 0.03849 0.6615 0.611
slope:slope_position 1 0.1617 0.04017 0.6903 0.615
habitat:rel_moisture 1 0.0876 0.02175 0.3737 0.866
slope:rel_moisture 1 0.1137 0.02823 0.4851 0.746
slope_position:rel_moisture 1 0.1499 0.03722 0.6396 0.636
habitat:slope:slope_position 1 0.1868 0.04640 0.7974 0.515
habitat:slope:rel_moisture 1 0.0773 0.01920 0.3300 0.894
slope:slope_position:rel_moisture 1 0.1954 0.04854 0.8341 0.513
Residual 2 0.4686 0.11639
Total 15 4.0262 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see here that despite all the other variables and interactions, “habitat” is the only variable that showed evidence of a significant effect on the two communities (p = 0.014). Therefore, in this case, using either ANOSIM or PERMANOVA would’ve made no difference to what we found, but if one of the other variables or interactions showed evidence of a significant effect, we would not have been able to capture that with the ANOSIM!
In short, nestedness is a measure of how much of one community is just a subset of another, while turnover is a measure of how much one community is completely different from another. If you think back to our NMDS graphs (see Section 3.1.2.4), you may have an easier time visualizing what I'm saying. Suppose we have two communities, and each is represented by an ellipsoid (just like in the NMDS graph). Now suppose one community is simply a subset of the other – then what you would see is a little ellipsoid inside a bigger ellipsoid, which would indicate that the little one is a subset of the larger one. If the two communities are completely different, you would see two ellipsoids that don't overlap at all. There are different ways to measure the nestedness and turnover components of communities, but we will only learn one of those ways here. If you want to learn more about beta diversity partitioning into nestedness and turnover, check out Baselga's theoretical paper on the topic, and Baselga and Orme's paper about the package we will be using in this section!
First, let’s load the packages we will need to complete the entire procedure.
library(betapart)
Warning: package 'betapart' was built under R version 4.3.2
We will be using the same data we used in Section 3.1.2.2, but here we will create a presence/absence matrix. I will print the data here so you see what it looks like.
community_df
# A tibble: 331 × 16
plot_number species abundance sla leaf_area leaf_thickness leaf_tissue_dens
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9 Viola_… 10 839. 1.05 0.2 0.00596
2 7 Planta… 20 548. 0.256 0.223 0.00817
3 11 Potent… 40 471. 4.28 0.343 0.00619
4 3 Andros… 20 450. 0.255 0.36 0.00618
5 14 Campan… 10 402. 0.998 0.247 0.0101
6 7 Poa_sa… 50 385. 0.347 0.203 0.0128
7 17 Orthoc… 60 371. 0.653 0.204 0.0132
8 10 Astrag… 10 359. 2.03 0.237 0.0118
9 9 Draba_… 10 348. 0.600 0.307 0.00938
10 9 Descur… 10 347. 0.886 0.29 0.00993
# ℹ 321 more rows
# ℹ 9 more variables: srl <dbl>, root_tissue_dens <dbl>, root_diam <dbl>,
# habitat <chr>, site <chr>, slope <dbl>, aspect <dbl>, slope_position <dbl>,
# rel_moisture <dbl>
Now let’s create our presence/absence matrix.
#______________________________________________________________________________
# Making a presence absence community matrix
presence_matrix <- community_df %>%
  mutate(abundance = replace(abundance, # Column to be changed
                             abundance > 0, # Condition of change
                             1)) %>% # Value of change (turns > 0 values into 1)
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  column_to_rownames("plot_number") %>% # Stores plot_number as row names
  select(sort(colnames(.))) # Orders remaining columns alphabetically
#______________________________________________________________________________
# Displaying the data frame
presence_matrix %>%
  as_tibble()
# A tibble: 17 × 76
Achillea_millefolium Allium_textile Amelanchier_alnifolia
<dbl> <dbl> <dbl>
1 0 0 0
2 1 1 0
3 1 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 1 0 1
9 1 0 0
10 1 1 0
11 0 1 0
12 1 1 1
13 0 1 1
14 1 1 0
15 1 0 0
16 1 0 0
17 1 0 0
# ℹ 73 more variables: Androsace_occidentalis <dbl>, Anemone_patens <dbl>,
# Antennaria_neglecta <dbl>, Antennaria_parvifolia <dbl>,
# Arnica_fulgens <dbl>, Arnica_sororia <dbl>, Artemisia_cana <dbl>,
# Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
# Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
# Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
# Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>, …
We are now ready to run our beta partition calculations.
We will calculate the beta diversity partition for our fescue and mixed grass communities using the beta.multi()
function from the betapart
package. This calculation is done for the communities across our plots, which is why we do one analysis for all the data; the conclusions we draw about the habitats are inferred afterwards and can be coupled with NMDS visualizations.
#______________________________________________________________________________
# Overall community beta diversity partition
betapartition <- beta.multi(presence_matrix)
#______________________________________________________________________________
# Displaying the results
betapartition
$beta.SIM
[1] 0.8182466
$beta.SNE
[1] 0.04983409
$beta.SOR
[1] 0.8680807
As you can see from the output, we get three values: beta.SIM (the turnover value), beta.SNE (the nestedness value), and beta.SOR (the overall beta diversity, which is the sum of the other values). These values clearly indicate that our community is mostly characterized by turnover, since the beta.SIM value (0.8182466) is much higher than the beta.SNE value (0.0498341). This perfectly matches what we saw with our NMDS plot in Section 3.1.2.4!
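If you want to see the turnover vs. nestedness idea in an even more stripped-down form, here is a toy example (entirely made-up communities, not our data): the second community is a pure subset of the first (nestedness), while the third shares no species with the first (turnover). beta.pair() from betapart gives the pairwise partition.
#________________________________
# Toy presence/absence matrix: rows are communities, columns are species
toy_communities <- matrix(c(1, 1, 1, 1, 0, 0,   # "full" has species a-d
                            1, 1, 0, 0, 0, 0,   # "subset" only has a and b
                            0, 0, 0, 0, 1, 1),  # "different" has e and f only
                          nrow = 3, byrow = TRUE,
                          dimnames = list(c("full", "subset", "different"),
                                          c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e", "sp_f")))
beta.pair(toy_communities, index.family = "sorensen") # Pairwise beta.sim, beta.sne, beta.sor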
Phylogenetic analyses tell us about the level of relatedness between the species in the communities we are studying and how that relatedness affects any trait data we have collected about that community. Phylogenies are hypotheses about how these relationships exist and can be constructed in a variety of ways. If you want to learn more about how they are built and how they work, check out Hall's "phylogenetic trees made easy" book. For the analyses we are about to learn, all you need to know is that we will have to choose a phylogeny, and from it we will use what are called branch lengths – the phylogenetic distance between two taxa is the shortest path along the tree between them. To visualize this, imagine you put your finger on the tip of the phylogeny that represents one of your taxa, then you go up its nodes (the points at which taxa separate from a common ancestor), all the way up to the common ancestor between the two taxa in question, and then down the nodes all the way to the other taxon – the summed lengths of all the branches in that trajectory (the lengths between the nodes) is the phylogenetic distance between those two taxa.
Phylogenetic diversity (PD) is the sum of the branch lengths of a subset of taxa. Suppose you have two plots with 10 species each, and that species in plot “A” are very closely related and species in plot “B” are distantly related. If we were to calculate the PD of each plot, we would find a larger number for plot “B” than for plot “A”, because since the species in plot “B” are more distantly related than those in plot “A”, their branch lengths would be longer than those in plot “A”. You can find out more about PD in Faith’s paper.
We only need one package to calculate PD, and that is the picante
package.
library(picante) # Does many phylogenetic things
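Before we get to the real data, here is a toy sketch of what a PD calculation looks like; the tree is random and the two plots are made up, so the numbers mean nothing biologically.
#________________________________
# Toy PD example: pd() sums the branch lengths connecting the species in each plot
set.seed(1)
toy_tree <- rtree(5) # Random 5-taxon tree; rtree() is from ape (loaded with picante)
toy_plots <- matrix(c(1, 1, 1, 0, 0,
                      0, 0, 1, 1, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("plot_A", "plot_B"), toy_tree$tip.label))
pd(toy_plots, toy_tree) # Returns PD and species richness (SR) for each plot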
We will be using the same Staphylinidae CTmax and elevation data set we used in Section 2.1. This data set has an accompanying phylogeny which Alex created based on COI gene fragments from each individual, which were then used to assign an operational taxonomic unit name (OTU-01, OTU-02, etc.), which can be used (for differentiation purposes) in place of species.
#______________________________________________________________________________
# Importing the data frame and phylogeny
#________________________________
# Data frame
staph_ctmax_elevation_data <- read.csv( # Imports the data from selected path
  here("data",
       "raw",
       "OTU_named_190220.csv")
)
#________________________________
# Phylogeny
phy <- read.tree( # From the `ape` package which is part of `picante`
  here("data",
       "phylo_trees",
       "OTU_named_190220.nwk")
)
#______________________________________________________________________________
# Displaying things
#________________________________
# Display the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just to make this output look nicer
# A tibble: 57 × 3
bin elevation ctmax
<chr> <int> <dbl>
1 OTU-01 1300 37.5
2 OTU-01 1300 36.5
3 OTU-01 1300 37
4 OTU-01 1300 36.5
5 OTU-01 1300 36.5
6 OTU-01 1300 34
7 OTU-01 1200 36
8 OTU-02 1200 36
9 OTU-03 1200 36
10 OTU-04 1300 39.5
# ℹ 47 more rows
#________________________________
# Display phylogeny
plot(phy)
Before we proceed, we need to check if our phylogeny has multichotomies in it (nodes that split into three or more branches). Doing so visually is not always reliable (please see Liam Revell’s blog post about this), so instead we will use the is.binary()
function from the ape
package, which is included in the picante
package. If the result of the function comes out as TRUE
, we’re good to go. If it comes out as FALSE
, then that means multichotomies are present. We can fix them by running the multi2di()
function from the ape
package, which resolves the multichotomies by adding branches to the phylogeny of length 0 wherever the multichotomies exist.
So now let’s see if our phylogeny is fully bifurcated (no multichotomies).
is.binary(phy)
[1] TRUE
It looks like we have no multichotomies.
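Had is.binary() returned FALSE instead, the fix described above is a one-liner; it is shown here only as a sketch, since our tree does not need it.
#________________________________
# Only needed if is.binary(phy) returns FALSE
# phy <- multi2di(phy) # Resolves multichotomies with zero-length branches
# is.binary(phy)       # Should now return TRUE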
Now that our phylogeny is in order, let’s prepare our data. Our data frame will need to be in an occurrence matrix format (same one used in Section 3.1.2.2).
#______________________________________________________________________________
# Making the matrix
staph_elevation_matrix <- staph_ctmax_elevation_data %>%
  group_by(elevation) %>%
  count(bin,
        name = "abundance") %>%
  ungroup() %>% # Need this for later
  pivot_wider(names_from = bin, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  select(elevation, # Puts elevation as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Display the matrix
staph_elevation_matrix
# A tibble: 7 × 29
elevation `OTU-01` `OTU-02` `OTU-03` `OTU-04` `OTU-05` `OTU-06` `OTU-07`
<int> <int> <int> <int> <int> <int> <int> <int>
1 10 0 0 0 0 0 0 0
2 300 0 0 0 0 0 0 0
3 700 0 0 0 0 0 5 0
4 1000 0 0 0 1 0 0 1
5 1200 1 1 1 1 1 1 0
6 1300 6 0 0 1 0 0 0
7 1500 0 0 0 0 0 0 0
# ℹ 21 more variables: `OTU-08` <int>, `OTU-09` <int>, `OTU-10` <int>,
# `OTU-11` <int>, `OTU-12` <int>, `OTU-13` <int>, `OTU-14` <int>,
# `OTU-15` <int>, `OTU-16` <int>, `OTU-17` <int>, `OTU-18` <int>,
# `OTU-19` <int>, `OTU-20` <int>, `OTU-21` <int>, `OTU-22` <int>,
# `OTU-23` <int>, `OTU-24` <int>, `OTU-25` <int>, `OTU-26` <int>,
# `OTU-27` <int>, `OTU-28` <int>
After we get our matrix into that format, we can proceed with the PD calculation. Note that because Alex made the phylogeny from scratch using only our taxa, we don’t need to do anything else to it. However, had we gotten a phylogeny (say, from a paper) that included taxa that we don’t have, we would need to trim that phylogeny so that it would contain only our taxa. That can be done with the prune.sample()
function from the picante
package, but we don’t need to do that here.
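For reference, a minimal sketch of that trimming step (not run here, since our phylogeny already matches our taxa exactly) would look something like this:
#________________________________
# Only needed if the phylogeny contains taxa that are absent from your matrix
# pruned_phy <- prune.sample(
#   staph_elevation_matrix %>%
#     select(! 1), # Community matrix with taxa as columns (elevation removed)
#   phy            # Phylogeny to be trimmed down to the taxa in the matrix
# )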
We will calculate PD using the pd()
function from the picante
package.
#______________________________________________________________________________
# Creating data frame with PD results
phy_diversity <- pd(
  staph_elevation_matrix %>% # Selects the data we will use
    select(! 1), # Removes the elevation column from calculation
  phy, # The phylogeny we are using
  include.root = FALSE) %>% # Tree root not generally our taxa
  mutate(elevation_m = staph_elevation_matrix$elevation) # Add elevation column

#______________________________________________________________________________
# Displaying PD results
phy_diversity %>%
  as_tibble()
# A tibble: 7 × 3
PD SR elevation_m
<dbl> <dbl> <int>
1 1.11 5 10
2 0.994 5 300
3 1.54 6 700
4 0.501 2 1000
5 1.38 6 1200
6 0.762 4 1300
7 1.19 6 1500
We can see that not only do we get our PD results, but we also get species richness counts for each of the elevations! You can now analyse these values with GAMs (see Section 2.1), or whatever analyses you have in mind.
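As a teaser, here is a minimal sketch (not the full model-fitting and model-checking workflow from Section 2.1) of what modeling PD against elevation with a GAM could look like; the basis dimension k = 5 is an arbitrary choice for illustration.
#________________________________
# Sketch of a GAM of PD as a function of elevation
library(mgcv) # For GAMs (see Section 2.1)
pd_gam <- gam(PD ~ s(elevation_m, k = 5), # Smooth term for elevation
              family = gaussian(),
              data = phy_diversity,
              method = "REML")
summary(pd_gam)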
We will begin by thinking back to the problem of independence we learned about in Section 2.1.2 and revisiting one of the examples I gave there. It goes as follows: say you wanted to model the relationship between brain size and body mass in a community of mammals, so you draw little dots on a graph (your data points) and then draw a line of best fit, which shows a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size. Now suppose you later realize that the mammals in the upper half of the graph (the heaviest and brainiest ones) all belong to the same genus, and that the mammals in the bottom half all belong to a different genus. Suddenly you can’t tell whether the pattern shown by your line of best fit reflects the relationship between brain size and body mass, or whether members of one genus simply have big brains paired with big bodies while members of the other have small brains paired with small bodies, with that pairing having nothing to do with the relationship between brain and body size (which in this case means that if you were to add data from other genera to your graph, the relationship would disappear). Thus, you may be seeing a pattern that does not reflect the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper.
The phylogenetic signal is a number we can calculate (along with a test statistic) that tells us whether this assumption of independence in the trait data is being violated because species with similar traits are too closely related. There are two methods to calculate phylogenetic signals that we will learn here: Blomberg’s \(K\) and Pagel’s \(\lambda\). Blomberg’s \(K\) generally goes from 0 to 1, where 0 means no phylogenetic signal and 1 means a strong phylogenetic signal. It can, however, go over 1 because of the way it is calculated, and indeed you may see this from time to time in empirical studies. Pagel’s \(\lambda\) also goes from 0 to 1, where 0 means no phylogenetic signal at all and 1 means a strong phylogenetic signal. It too can exceed 1, though it is much less likely to do so than Blomberg’s \(K\). The two metrics are calculated differently, so to play it safe, it is a good idea to calculate both, and if either one shows a phylogenetic signal, you should then do a phylogenetic correction of your trait data (see Section 4.3). If you want to learn more details about these metrics, check out Münkemüller’s paper!
We will need the phytools
package to complete our procedures. We will also use the picante
package in the phylogeny and data preparation section.
library(phytools)
library(picante)
We will be using the Staphylinidae CTmax and elevation data and phylogeny we loaded in Section 4.1.2.2 and doing the same phylogeny check for multichotomies we did there. Please see Section 4.1.2.2 for instructions on how to do this step.
After we load our data, we need to create a data frame with average trait values (CTmax and elevation in our case) for each taxon, so that in the end, each taxon only has one CTmax value and one elevation value.
#______________________________________________________________________________
# Making the data frame
average_trait_df <- staph_ctmax_elevation_data %>%
  group_by(bin) %>%
  mutate(mean_ctmax = mean(ctmax),
         mean_elevation = mean(elevation)) %>%
  distinct(mean_ctmax,
           mean_elevation)

#______________________________________________________________________________
# Displaying the data frame
average_trait_df
# A tibble: 28 × 3
# Groups: bin [28]
bin mean_ctmax mean_elevation
<chr> <dbl> <dbl>
1 OTU-01 36.3 1286.
2 OTU-02 36 1200
3 OTU-03 36 1200
4 OTU-04 39 1167.
5 OTU-05 36 1200
6 OTU-06 37.7 783.
7 OTU-07 39.5 1000
8 OTU-08 32.5 700
9 OTU-09 40.5 700
10 OTU-10 39.5 700
# ℹ 18 more rows
Then, we must match the order of the taxa names in our average_trait_df frame to the order in which the taxa are listed in the phylogeny we are using, otherwise the phylogenetic signal calculations will not work. We can do this by using the match.phylo.data()
function from the picante
package. It returns new phylogeny and data frame objects with matched taxa names saved in a list. We must do a separate one of these for each trait where we will be looking for a phylogenetic signal, so we will have one for CTmax and one for elevation.
#______________________________________________________________________________
# CTmax matched phylogeny and data
phy_sig_ctmax <- average_trait_df %>%
  select(bin,
         mean_ctmax) %>%
  deframe() %>% # Turns it into a named vector
  match.phylo.data(phy, .)

#______________________________________________________________________________
# Elevation matched phylogeny and data
phy_sig_elev <- average_trait_df %>%
  select(bin,
         mean_elevation) %>%
  deframe() %>% # Turns it into a named vector
  match.phylo.data(phy, .)
If we open one of the lists we just created (we’ll open the CTmax one as an example), you will see that the OTU names in the “data” object are listed in the same order as the “tip.label” entries in the “phy” object.
#______________________________________________________________________________
# Extracting the tip.label column of the phy object in the ctmax list
phy_sig_ctmax$phy$tip.label %>%
  head(10) # Show only first 10 objects
[1] "OTU-08" "OTU-10" "OTU-22" "OTU-02" "OTU-19" "OTU-23" "OTU-01" "OTU-20"
[9] "OTU-04" "OTU-12"
#______________________________________________________________________________
# Extracting the data object in the list
phy_sig_ctmax$data %>%
  head(10) # Show only first 10 objects
OTU-08 OTU-10 OTU-22 OTU-02 OTU-19 OTU-23 OTU-01 OTU-20
32.50000 39.50000 38.50000 36.00000 39.50000 42.50000 36.28571 40.00000
OTU-04 OTU-12
39.00000 38.50000
We are now ready to calculate Blomberg’s \(K\) and Pagel’s \(\lambda\).
To calculate Blomberg’s \(K\) and Pagel’s \(\lambda\), we will use the phylosig()
function from the phytools
package.
#______________________________________________________________________________
# Phylogenetic signal for CTmax
set.seed(2023) # For reproducible permutational results
#________________________________
# Blomberg's K
kctmax <- phylosig(
  phy_sig_ctmax$phy, # The phylogeny we will use
  phy_sig_ctmax$data, # The trait data we will use
  method = "K", # Choose Blomberg's K as the method
  test = TRUE, # Does a hypothesis test that signal is absent
  nsim = 999 # Number of permutations
)
#________________________________
# Pagel's Lambda
lambdactmax <- phylosig(
  phy_sig_ctmax$phy, # The phylogeny we will use
  phy_sig_ctmax$data, # The trait data we will use
  method = "lambda", # Choose Pagel's lambda as the method
  test = TRUE, # Does a hypothesis test of lambda = 0 (no phylogenetic signal)
  nsim = 999 # Number of permutations
)
#______________________________________________________________________________
# Display results
kctmax
Phylogenetic signal K : 1.20574
P-value (based on 999 randomizations) : 0.102102
lambdactmax
Phylogenetic signal lambda : 0.745975
logL(lambda) : -75.8545
LR(lambda=0) : 10.4023
P-value (based on LR test) : 0.00125856
We see a \(K\) value of 1.2, which suggests a strong phylogenetic signal. The p-value, however, comes from a randomization test: the trait values are shuffled across the tips of the tree (999 times here) and the observed \(K\) is compared to the values produced by those shuffles, so the null hypothesis being tested is that there is no phylogenetic signal at all. With p = 0.10, we cannot quite reject that null at the usual 0.05 level, despite the high point estimate of \(K\).
We see a \(\lambda\) value of 0.75, which also indicates a phylogenetic signal, and here the likelihood ratio test against \(\lambda = 0\) (no signal) is clearly significant (p = 0.0013).
Now let’s do the same procedure for elevation. Elevation can be considered a trait, since it is a value associated with our specimens (a trait doesn’t have to be a physiological measurement; it can also be where the specimens are found, their behaviours, etc.).
#______________________________________________________________________________
# Phylogenetic signal for elevation
set.seed(2023) # For reproducible permutational results
#________________________________
# Blomberg's K
kelev <- phylosig(
  phy_sig_elev$phy, # The phylogeny we will use
  phy_sig_elev$data, # The trait data we will use
  method = "K", # Choose Blomberg's K as the method
  test = TRUE, # Does a hypothesis test that signal is absent
  nsim = 999 # Number of permutations
)
#________________________________
# Pagel's Lambda
lambdaelev <- phylosig(
  phy_sig_elev$phy, # The phylogeny we will use
  phy_sig_elev$data, # The trait data we will use
  method = "lambda", # Choose Pagel's lambda as the method
  test = TRUE, # Does a hypothesis test of lambda = 0 (no phylogenetic signal)
  nsim = 999 # Number of permutations
)
#______________________________________________________________________________
# Display results
kelev
Phylogenetic signal K : 0.44045
P-value (based on 999 randomizations) : 0.827828
lambdaelev
Phylogenetic signal lambda : 7.33137e-05
logL(lambda) : -214.842
LR(lambda=0) : -0.000871588
P-value (based on LR test) : 1
We see a \(K\) value of 0.44, which indicates no phylogenetic signal (p = 0.83).
We see a \(\lambda\) value of 0.000073, which also indicates no phylogenetic signal (p = 1).
We now have enough evidence suggesting that our CTmax data shows a phylogenetic signal, even though we saw no evidence of such a signal for elevation. We therefore need to correct for that signal so we can see whether there is an effect of elevation on CTmax regardless of how closely related the species are.
This is what’s colloquially called a phylogenetic correction, and you only need to do it if one of your traits showed evidence of a phylogenetic signal (see Section 4.2). In short, suppose you have height measures for species “A” (say 10 cm) and “B” (say 6 cm), and you want to compute the difference in the values, so you go ahead and do \(10 - 6 = 4\). The value you just found (4) is what we are here calling a contrast. Now suppose you detected a phylogenetic signal in your height measures for these two species. What the phylogenetically independent contrast (PIC) does is take that difference you just calculated and divide it by the square root of the phylogenetic distance between species “A” and “B”. The more distantly related the two species are, the smaller your standardized contrast will be, and the more closely related they are, the larger it will be. Intuitively, this means the following: suppose you are comparing heights of species along an elevation gradient and want to see how elevation affects height. One of your species is a beetle, the other an elephant. The raw height contrast can’t tell you whether the large difference between the two species is due to elevation or just to the fact that one is an elephant and one is a beetle, so the phylogenetic correction will shrink that contrast according to how distantly related elephants and beetles are. On the other hand, if you have two beetle species on this elevation gradient, with one being 5 cm tall and the other 10 m tall, then because the two beetles are closely related, this contrast is more likely to be explained by elevation, and the phylogenetic correction will increase its value because the phylogenetic distance between the two beetle species is small (also, if you see a beetle that is 10 m tall, RUN!!!). You can read more about the math and logic behind it in Felsenstein’s paper.
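Here is a minimal toy sketch of that standardization (the two-taxon tree and the height values are made up purely for illustration): heights of 10 and 6 differ by 4, the phylogenetic distance between the taxa is \(4 + 4 = 8\), and the standardized contrast is therefore \(4 / \sqrt{8} \approx 1.41\).
#________________________________
# Toy example of a phylogenetically corrected (standardized) contrast
library(ape) # pic() lives here (ape is loaded with picante)
toy_tree <- read.tree(text = "(A:4,B:4);") # Two taxa separated by 4 + 4 = 8
heights <- c(A = 10, B = 6) # Made-up heights (cm)
pic(heights, toy_tree) # (10 - 6) / sqrt(4 + 4), roughly 1.41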
We will need the ape
package, which is already included in the picante
package.
We will use the data frame we created in Section 4.2.2.2 with average trait values for each BIN in our Staphylinidae CTmax and elevation data, which we called average_trait_df. We need an additional preparatory step for our phylogeny in this example because, for some reason, the function we will use to calculate the PIC values thinks our tree is not rooted and has multichotomies, even though we already checked for that in Section 4.1.2.2. We will use the multi2di()
function to fix that.
# CTmax listed phylogeny
phy_sig_ctmax$phy <- multi2di(phy_sig_ctmax$phy)

# Elevation listed phylogeny
phy_sig_elev$phy <- multi2di(phy_sig_elev$phy)
To calculate phylogenetically corrected trait value contrasts, we will use the pic()
function from the ape
package, which is included in the picante
package. We will also store the PIC values in a data frame so we can model them later with a GAM (see Section 2.1).
# Creating data frame with phylogenetically corrected values for CTmax in one
# column and elevation in another column
pic_data <- tibble(pic_ctmax = pic(phy_sig_ctmax$data,
                                   phy_sig_ctmax$phy),
                   pic_elev = pic(phy_sig_elev$data,
                                  phy_sig_elev$phy))
We can now use these values in modeling procedures such as GAMs (Section 2.1), GLMs, and so on. We will make a plot just to see what the phylogenetically corrected values for CTmax and elevation look like in relation to one another.
ggplot(pic_data, # Data frame that has our data
aes(x = pic_elev, # Our x variable
y = pic_ctmax)) + # Our y variable
geom_point() + # Choosing to plot our data in the form of points
geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
formula = y ~ s(x, # Specifies formula of our model
k = 5), # Number of basis functions
se = TRUE, # Whether we want the gray 95% CI or not (we do!)
method.args = list(family = "gaussian")) + # Error distribution
labs(x = "PIC of Elevation (m)", # Customizing x and y axis labels
y = "PIC of CT~max~") + # Text between ~~ will be subscript (markdown syntax)
theme_classic() + # Removes gray grid in background and adds lines to axes
theme(axis.title = element_text(size = 12), # Font size for all axis titles
axis.title.y = element_markdown(), # Allows markdown in y axis label
axis.text = element_text(size = 10)) # Font size for all axis text
We did not actually do the full model fitting procedure here (see Section 2.1 for how to do it), but it certainly looks like there is evidence of a strong, approximately linear, negative relationship between CTmax and elevation (Figure 6)!
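One extra note if you do fit a model to contrasts: regressions of independent contrasts are conventionally forced through the origin. Here is a minimal sketch with a simple linear model, shown just to illustrate the no-intercept term, not as a replacement for the Section 2.1 workflow.
#________________________________
# Regression of contrasts through the origin ("- 1" removes the intercept)
pic_lm <- lm(pic_ctmax ~ pic_elev - 1, data = pic_data)
summary(pic_lm)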
Sometimes, you will need to combine two phylogenies into one to make your phylogenetic analyses stronger. For instance, suppose you have COI barcode data for ant species in your collections. Since COI is not good at inferring deeper phylogenetic relations, you may want to take an ant phylogeny from the literature (or make your own) made from gene fragments that better estimate the deeper nodes of the phylogeny, and combine it with your COI phylogeny, which is good at estimating the relationships at the tips.
First, we will load the packages required to complete the entire procedure, as well as some to visualize the phylogeny. This procedure uses a lot of base R, but it can also be done using the tidyverse (if you take the time to figure out how to do it that way).
library(phytools) # For phylogeny manipulations
library(picante) # For other phylogeny functions and packages such as "ape"
library(ggtree) # To visualize the trees. This needs to be installed outside
# of CRAN, through BioConductor. Google "install ggtree" and it should come up.
NOTE: Before we begin, it is mandatory that the tip names on all your phylogenies are formatted in the same way. Here, tips will follow the “Species_name” format, where the genus name is capitalized and spaces are replaced with underscores. If any of your phylogenies are not formatted in this way, I recommend changing them to the format above so that the code below runs smoothly. You can do this by hand (using Notepad or whatever way you think is easiest), or you can do it in R using functions from the stringr
package (good for manipulating strings), or any other way you can think of. Since the phylogenies in this example are already formatted appropriately, we don’t need to worry about this step here.
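For example, a minimal stringr sketch (the tip label format and the your_tree object here are hypothetical; adapt them to whatever your labels actually look like) could be:
#________________________________
# Hypothetical example: tips currently look like "araneus diadematus"
# your_tree$tip.label <- your_tree$tip.label %>%
#   str_to_sentence() %>%     # Capitalizes only the first word
#   str_replace_all(" ", "_") # Replaces spaces with underscores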
In this example, we will be grafting 12 spider barcode-based phylogenies, for the genera that had more than one species in Chris Ho’s Master’s project, onto Wheeler et al.’s tree, which is really good at estimating the deep phylogenetic relationships of spiders. There is an important bit of logic to understand here: if a genus in your collection only has one species, then it makes no difference whether you represent it in the phylogeny by its full species name or by its genus name, because it is the only representative of that branch. This is what allows us to use phylogenies from the literature that have our genera but not necessarily our species in them.
We will now load our phylogeny files into R. As I said above, they are already formatted consistently. Also, we will use specific nomenclature when referring to each of them to make the process a bit more intuitive. The phylogeny that is good at representing the deeper relationships of the tree will serve as the “backbone”. The phylogenies good at representing relationships at the tips (such as barcode phylogenies) will be named after the nodes at which they fuse with the “backbone” phylogeny, so in this case, their genera.
It is important that we save all the phylogenies to be grafted onto the backbone in a single object created with c() (which becomes a list when it holds more than one phylogeny). This is a necessary step for the grafting code below, even if you only have one phylogeny to graft.
#______________________________________________________________________________
# Backbone phylogeny from Wheeler et al. (2017)
backbone <- read.tree(
  here("data",
       "phylo_trees",
       "ML_pGH16_by_gene_boot1K_named_algonquin.nwk")
)

#______________________________________________________________________________
# Phylogenies to be grafted onto the backbone tree
subtrees <- c(
  Agelenopsis = read.tree(here("data", "phylo_trees", "Agelenopsis.nwk")),
  Araneus = read.tree(here("data", "phylo_trees", "Araneus.nwk")),
  Bathyphantes = read.tree(here("data", "phylo_trees", "Bathyphantes.nwk")),
  Cicurina = read.tree(here("data", "phylo_trees", "Cicurina.nwk")),
  Clubiona = read.tree(here("data", "phylo_trees", "Clubiona.nwk")),
  Neoantistea = read.tree(here("data", "phylo_trees", "Neoantistea.nwk")),
  Pelegrina = read.tree(here("data", "phylo_trees", "Pelegrina.nwk")),
  Philodromus = read.tree(here("data", "phylo_trees", "Philodromus.nwk")),
  Tetragnatha = read.tree(here("data", "phylo_trees", "Tetragnatha.nwk")),
  Theridion = read.tree(here("data", "phylo_trees", "Theridion.nwk")),
  Walckenaeria = read.tree(here("data", "phylo_trees", "Walckenaeria.nwk")),
  Xysticus = read.tree(here("data", "phylo_trees", "Xysticus.nwk"))
)
At this stage, you should format your backbone tree so that only the genus name is showing at each tip. In this example, Alex made those changes by hand in Notepad for our genera of interest, but if you want to do it in R (again, provided your tips are in the “Species_name” format) this is the code you would use.
If we run this code, every tip label in the “backbone” phylogeny gets trimmed to the text before the first underscore, which may be a genus, a family, or whatever classification you are after, and only one branch per classification is kept. Doing this is fine so long as you are not dropping tips that you will need for your analysis.
Note that this backbone tree is a large spider tree of life where their names are in the “Family_Genus_species” format, so when we run the code below, most tips will be named after their families, but the ones Alex changed by hand to be genus only will show up as genera.
#________________________________
# Save current tip labels and genera names in separate objects
tips <- backbone$tip.label
genera <- unique(sapply(strsplit(tips, "_"), function(x) x[1]))
#________________________________
# Creates named number vector where the numbers represent the position of the
# first occurrence of each genus within the phylogeny
ii <- sapply(genera, function(x, y) grep(x, y)[1], y = tips)
#________________________________
# Drops tips that are not in the ii object
backbone <- drop.tip(backbone, setdiff(backbone$tip.label, tips[ii]))
#________________________________
# Rename tips to be genus name only
backbone$tip.label <- sapply(strsplit(backbone$tip.label, "_"), function(x) x[1])
Now let’s see what this phylogeny looks like (you will have to zoom in to read the tip labels because there are so many taxa!).
ggtree(backbone,
layout = "circular",
open.angle = 10) +
geom_tippoint(size = 1,
show.legend = FALSE) +
geom_tiplab(colour = "black",
size = 1.5,
nudge_x = 0.1) + # Adds family names away from phylogeny
geom_rootpoint()
To graft the species phylogenies onto our backbone phylogeny, we will run the for loop below, which was modified from this online discussion.
# Note that "i" represents any one object in the "subtrees" object we created
for(i in 1:length(subtrees)){

  # Creates object with only genus name from subtrees by separating at
  # the underscores
  genus <- strsplit(subtrees[[i]]$tip.label[[1]], "_")[[1]][1]

  # Creates object with maximum branch length of each subtree
  h <- max(nodeHeights(subtrees[[i]]))

  # Object that identifies which of the backbone tree tips matches the
  # genus written in the genus object above, writes its position
  tip <- which(backbone$tip.label == genus)

  # Binds backbone tree to subtrees at the genera stored in tip object above
  backbone <- bind.tree(backbone, subtrees[[i]], where = tip)
}
That’s it! We have successfully grafted barcode-based phylogenies onto a tree that was built with different gene fragments.
Note that how the branch lengths were generated may differ between trees, which means the grafted branches might be scaled differently than the branches of the backbone tree (sort of like having different units). You must therefore consider what this implies for whatever you are using the grafted phylogeny for. If you are using the tree to check for phylogenetic signals and do phylogenetic corrections, I believe you should be fine, since you will still have a rough estimate of the level of relatedness between the species in the grafted phylogeny (as opposed to no estimate at all, which is what you get if you use no phylogenetic hypothesis). If you are thinking about using the grafted phylogeny as a definitive statement of the level of relatedness between your species, you likely won’t be able to, since the branch lengths across the tree will not be on a consistent scale. There are likely other scenarios to consider, but that will depend on your project.
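If the mixed scales are a problem for your use case, one option sometimes taken (a sketch only, and a deliberate trade-off, not something done in this document) is to discard the estimated branch lengths entirely and recompute arbitrary ones from the topology alone, for example with Grafen’s method; you lose the real distance information but gain a single consistent scale.
#________________________________
# Optional: replace all branch lengths with arbitrary, topology-based ones
# backbone_rescaled <- compute.brlen(backbone, method = "Grafen") # From `ape`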
Now we will visualize the grafted tree.
ggtree(backbone,
layout = "circular",
open.angle = 10) +
geom_tippoint(size = 1,
show.legend = FALSE) +
geom_tiplab(colour = "black",
size = 1.5,
nudge_x = 0.1) + # Adds family names away from phylogeny
geom_rootpoint()
If you compare Figure 8 and Figure 9 (zoom in to read the names!), you can see that where the genera used to be listed in Figure 8, we instead see the grafted phylogenies with the BOLD specimens in Figure 9.
Below are the R version and package versions I used at the time of writing this document.
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggtree_3.8.2 phytools_1.9-16 maps_3.4.1 picante_1.8.2
[5] ape_5.7-1 betapart_1.6 vegan_2.6-4 lattice_0.21-8
[9] permute_0.9-7 gratia_0.8.1 mgcv_1.8-42 nlme_3.1-162
[13] readxl_1.4.3 here_1.0.1 ggtext_0.1.2 lubridate_1.9.3
[17] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[21] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[25] tidyverse_2.0.0 schtools_0.4.1 glue_1.6.2