Smith Lab common stats using R

Author
Alexandre Loureiro

Affiliation
PhD candidate | Smith Lab, University of Guelph

Last updated

November 8, 2023

NOTE: This document is a work in progress. If you see any errors, please reach out to either Alex or me so we can address them.

1 Introduction

This is a collection of some of the most commonly used statistical procedures and visualization techniques in the Smith Lab. Generally speaking, this document is structured from the perspective of someone who wants to analyse community ecology data (and is completely biased towards the procedures I performed during my PhD), but that does not mean these procedures are only useful in that context, so even if you are not doing a community ecology project, you might find this tome useful.

Here, you will find brief descriptions of what each procedure is, links to useful resources, and step-by-step instructions on how to perform them in R. Depending on what year you are reading this, some of the R functions we use here may have changed in how they work, but these instructions should give you at least a starting point. You can find the R and package versions used during the writing of this document in Section 5.

I have broken down every procedure into its steps, and displayed the code and outputs for you to follow along. I have annotated every code chunk with brief explanations of what most things in the code do in case you are not familiar with a specific command. Once I have explained a command in a comment, I generally do not explain it again when that code reappears later in the document.

If you are not comfortable reading R code, here is a quick guide on how to do so.

  • Variables get their values assigned by this operator: <-. This means that whatever is to the right of that arrow gets stored in the name we write on the left, and every time you call that name, all of the values assigned to it show up. For example, if you write x <- 3, every time you tell R to run x, the number 3 will appear.
#________________________________
# Assign the number 3 to x
x <- 3

#________________________________
# Display whatever is stored in x
x
[1] 3
  • Functions are operations you perform on the data, and you can identify them by the parentheses that always follow them. For instance, the function that calculates means in R is mean(). The things that go inside the parentheses are the parameters of the function (such as your data file, any tweaking you want to do to how the function does its calculations, etc.). Their values are assigned with an = sign and are separated by commas, so every function will look something like function(x = something, y = something_else).
  • Piping is a concept in which you chain one function after another so that when you run a line of code, it runs all of the chained procedures at once. This, perhaps most importantly, makes the code more readable to us (the computer doesn’t really care if functions are chained). The piping operator I use throughout this document is %>%, which comes from the magrittr package. Base R has its own operator (|>), which behaves almost identically (and, since it is implemented as a simple syntax transformation, is if anything slightly faster), but any difference in speed is far too small to matter in practice, so use whichever one you like! Piping operators can be read as “then” (e.g. do this, then do that, then do that, etc.). For example, data_frame %>% select(variable_x) %>% distinct() can be read as “take my data frame called ‘data_frame’, then select variable ‘x’ from it, then show me the distinct values of ‘x’”. There is a short sketch after this list showing a function call and a pipe chain in action.
  • Finally, I tend to break up my code into a new line at most instances of a comma and at every instance of a piping operator, because this stops the code from running all the way to the side of the screen, which improves readability (in my opinion). This, however, is a stylistic choice, and you don’t have to do it if you don’t want to. I find it also helps me when I’m editing the code.
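To make those last points concrete, here is a minimal sketch (using the starwars data set that ships with dplyr, not lab data) of a function call with a named parameter and a short pipe chain:

# Load dplyr, which provides the starwars data set, the verbs used below,
# and the %>% pipe (re-exported from magrittr)
library(dplyr)

# A function call with one named parameter
mean(x = c(171, 167, 96))

# A pipe chain: take starwars, then keep the species column, then show
# each distinct species once
starwars %>%
  select(species) %>%
  distinct()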

Now we are ready to start. First, let’s load some packages that we will be using throughout this document.

# This is an alternative way of calling your R packages to your session instead
# of writing "library()" for every package that you have. Either one is fine,
# but I added this here so you know it exists.
lapply(
  c(
    "tidyverse", # For data manipulation and graphics (also loads magrittr pipe)
    "ggtext", # For easier axis title writing when we need sub or superscripts
    "here", # For easier file pathing in the project directory
    "readxl" # For importing Excel files
  ),
  library, # Function to be applied over values in vector above
  character.only = TRUE) # Tells library() the names above are character strings

2 Alpha diversity analyses

Alpha diversity is a measure of the local diversity of species in any given place, often talked about in terms of number of species but sometimes taking into account their relative abundances as well. We will cover a couple of ways that you are likely to analyse alpha diversity data.
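As a quick, made-up illustration of those two flavours of alpha diversity, here is how you could compute them with the vegan package (which we load properly in Section 3); the numbers are toy values, not lab data:

library(vegan) # For the diversity functions below

# One site, three species, with very uneven abundances
site_counts <- c(species_a = 10, species_b = 10, species_c = 1)

specnumber(site_counts) # Species richness: simply counts the species present
diversity(site_counts, index = "shannon") # Weighs species by relative abundance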

2.1 Generalized additive model (GAM)

2.1.1 What it is

These models basically do magic, but since this magic can be complex I will only give you a brief and simplified explanation of how they work. If, however, you are interested in learning more about them, I highly recommend Noam Ross’s GAMs in R course (which at the time of this writing is completely free!), Gavin Simpson’s YouTube workshop on GAMs, and if you really want to know more about the math behind them, Simon Wood’s book.

In short, GAMs are models that can learn from the data what the best shape to represent them is. I will try to show you why I called this magic with an example.

Most models assume the behaviour of the data beforehand, and the math behind them (calculation of parameters), along with the subsequent test statistics and p-values that evaluate the model as a description of the data, is all based on that assumed shape. Take linear models for instance – by definition, they assume a linear relationship between the variables in question. If the variables you are working with do not have a linear relationship, you are bound to miss perhaps important behaviours of the system you are trying to represent with your model. For example, suppose you have a positive hump-shaped relationship between two variables (let’s say species diversity and elevation), and then while doing your statistical analyses you fit a linear model. By doing so, you will capture the positive nature of the relationship (species diversity increases with elevation), but you will not capture its humped nature (in this hypothetical, there would be a dramatic increase in diversity at a certain elevation which would then stop happening, and you would completely miss that, since your linear model would tell you that diversity changes in a linear, one-step-at-a-time fashion). On the other hand, if your data behave linearly, a linear model would capture the nature of the relationship between the two variables very well, since it would assume the correct behaviour of the data. Remember, however, how I said at the beginning of this that GAMs can learn from the data what the best shape to represent them is? This means that if the data behave linearly, our GAM will come out looking and behaving like a linear model, and if our data behave “humped-shapedly”, our GAM will take that form instead!

Briefly, this is done through things called basis functions, which when put together make up a GAM. Imagine that you take your entire elevation gradient and you split it, for instance, into 10 equidistant sections. Now suppose you fit, for instance, a cubic regression model to the diversity and elevation data for each of those 10 sections, and that you could then combine the results of each of those sections to ultimately describe the diversity-elevation gradient in one grand model. Well, that is more or less how GAMs work! The basis functions are the functions used throughout the 10 sections of the gradient, and the GAM is the overall model that combines the results from each of the sections!
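If you would like to see what basis functions actually look like, here is a small sketch (a toy example of mine, not part of the analyses below) that uses the smoothCon() helper from mgcv to build and plot a cubic regression spline basis over a made-up elevation gradient. Each curve is one basis function, and a fitted GAM smooth is a weighted sum of them:

library(mgcv) # We load mgcv properly in Section 2.1.3.1

# A made-up elevation gradient just for plotting
toy_data <- data.frame(elevation = seq(0, 2000, length.out = 200))

# Build a cubic regression spline basis with 10 basis functions
elevation_basis <- smoothCon(s(elevation, k = 10, bs = "cr"),
                             data = toy_data)[[1]]

# Each column of elevation_basis$X is one basis function along the gradient
matplot(toy_data$elevation, elevation_basis$X,
        type = "l", lty = 1,
        xlab = "Elevation (m)", ylab = "Basis function value")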

2.1.2 Assumptions

The assumptions of GAMs are the same as those for parametric linear models, which you might already have been exposed to in undergraduate statistics courses. There are three main assumptions:

  1. Error distribution chosen is correct
  2. Independence of observations
  3. Constant variance of observations

Assumption number 1 might have been presented to you in the past as the assumption of “normality”, and you might have also heard things like “make sure your data are normal!”, or “are your data normally distributed?”, or “you should transform your data to make them normal”, or any number of sentences like that. This stems from the fact that statistics courses tend to emphasize parametric tests, which tend to assume a Gaussian distribution (known as the normal distribution) of the error terms (if the data are normally distributed, this gets reflected in the error terms, which is what matters for modeling; for more information on this, see Montgomery’s book, or any other basic statistics book). In GAMs, we can choose the error distribution we think is most appropriate to our data based on our knowledge of its nature, so we are not bound by the normal distribution and generally do not need to transform our data. For example, count data are made up of integers (whole numbers, not decimals) and cannot have negative values (it is impossible for you to count -10 oranges – what would that even mean?). The Gaussian distribution goes from -∞ to +∞ and includes decimal numbers, so any model using the normal distribution to predict count data would be predicting nonsensical things (such as counting -10.5 oranges!). Therefore, it is crucial that you choose the correct error distribution for your model, because otherwise your results just will not make any sense (you will get outputs and p-values and everything else, but they will be very poor representations of the system you are trying to describe!).
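To make the count-data point concrete, here is a hedged sketch (with simulated data of my own, not the lab data we use below) of telling gam() to use a Poisson error distribution, a common choice for counts:

library(mgcv)

#______________________________________________________________________________
# Simulate some toy count data along a made-up gradient
set.seed(1)
toy_counts <- data.frame(elevation = runif(100, 0, 2000))
toy_counts$abundance <- rpois(100, lambda = exp(2 - toy_counts$elevation / 1000))

#______________________________________________________________________________
# family = poisson() keeps the model on a scale that makes sense for counts
# (predictions cannot go negative), unlike the default gaussian family
toy_count_gam <- gam(abundance ~ s(elevation),
                     family = poisson(link = "log"),
                     data = toy_counts,
                     method = "REML")
summary(toy_count_gam)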

Assumption 2 refers to the relationship between each of your data points, and whether they are influenced by (or related to) each other. In technical terms, any two points must not be drawn from different populations if the assumption of independence is to be upheld (they must be from the same population). Of course, this is a borderline impossible thing to control (especially in field studies!), since for all intents and purposes, things are connected to each other with varying degrees of intensity, so nothing is truly independent. However, statistically speaking, what you are trying to see is whether the behaviour of each data point is independent from the others, and you can minimize dependence through your experimental design or by accounting for it in your model. For instance, suppose we want to measure the average concentration of red blood cells in a population of 50 students. If, when you collect your data, you take measurements from only 25 students, but measure each of them twice, suddenly your data points are not independent, because every two observations will be closely related to each other, whereas if you had sampled each of the 50 students once, it is less likely that the data would be related. The consequence of violating this assumption is that you are at an increased risk of making a Type I error, which is finding a pattern that suggests a relationship between two variables that doesn’t really exist.

I’ll illustrate what that means with another example. Say you wanted to model the relationship between brain size and body mass in a community of mammals, and you go ahead and draw little dots on a graph (your data points) and then you draw a line of best fit, which then shows you a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size. Now suppose you later realized that the upper half of the mammals on the graph (the heaviest and brainiest ones) all belong to the same genus, and that the bottom half all belong to a different genus. Suddenly you can’t tell if the pattern you are seeing from your line of best fit is due to the relationship between brain size and body mass, or if it’s just because members of each of those genera happen to pair big brains with big bodies and small brains with small bodies, with that pairing having nothing to do with the relationship between brain and body size (meaning that if you were to add data from other genera to your graph, you would stop seeing a relationship between brain and body size). Thus, you may be seeing a pattern that doesn’t really reflect the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper, which we will revisit in Section 4.3.

Assumption 3 has to do with variance, and it is at the root of statistical thinking. Suppose you want to see how variable “A” is affected by variable “B” (say blueberry yield and the application of a specific herbicide concentration). The most basic way to think about this is in a signal vs. noise setting. If there is a relationship between “A” and “B” (i.e. a signal exists between blueberry yield and that concentration of that herbicide), it is generally obscured by noise (variance caused by many other things such as rainfall, efficiency of application of the herbicide, blueberry plant health, etc.), and we use statistics to quantify the noise and the signal so we can see how big the signal actually is (i.e. how strong the relationship between “A” and “B” is). Now suppose you want to see how variable “A” is affected across different levels of variable “B” (i.e. how blueberry yield is affected along a gradient of concentrations of this herbicide). Here, there may be differing levels of noise along the gradient (e.g. suppose lower concentrations have smaller variance because at those levels the herbicide doesn’t really do anything to the plants, but that variance increases as concentrations increase because different plants have different tolerances to the herbicide), so you need to account for this change in variance either in your experimental design or in your model, otherwise you are at a higher risk of making a Type II error, which is finding no evidence of a relationship between “A” and “B” when there actually is one (i.e. too much noise for you to find the signal).

If your violation of constant variance comes from unequal sample sizes along the factor levels you have chosen, the consequences may be a bit different. Suppose you found low variance in the lower concentrations of herbicide and high variance in high concentrations – there are two scenarios that can arise here. 1) If you have fewer samples for the low herbicide concentrations (low variance associated with fewer samples) and many more samples of the high herbicide concentrations (high variance associated with a higher number of samples), you are at risk of a Type II error; 2) if you have fewer samples for the high herbicide concentrations (high variance associated with fewer samples) and more samples of the low herbicide concentrations (low variance associated with a higher number of samples), you are now at risk of making a Type I error.

There are ways to check how our data meet (or violate) these assumptions when fitting our GAMs, and knowing whether or not you violate them is important because we can then address them methodologically, or at the very least, we can discuss how our conclusions may be affected by these violations (if they exist).

2.1.3 How to do it in R

2.1.3.1 Load packages

First, we will load the packages required to complete the entire procedure.

library(mgcv) # For fitting GAMs
library(gratia) # For cleaner-looking assumption plots

2.1.3.2 Prepare your data

Your data for GAMs should be laid out in the “each row is an observation and each column is a variable” format. For instance, suppose you wanted to get information about the inhabitants of a galaxy far far away such as their name, birth year (before the battle of Yavin), weight, height, etc. An intuitive way to lay out those data would be to make every row represent a specific inhabitant and every column a variable about that inhabitant, so that in the end you would have something like this:

# This data set is part of the dplyr package
starwars # Displays data about all Star Wars characters
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Your own data need to be laid out in the same way, where each row is an observation (e.g. individual organisms, or collection sites) and each column is a variable about that observation (e.g. elevation, temperature, humidity, etc.).

We will be using the following Staphylinidae elevation and heat tolerance data from Costa Rica to learn how to run GAMs in R.

#______________________________________________________________________________
# Importing the data frame
staph_ctmax_elevation_data <- read.csv(
  here("data",
       "raw",
       "OTU_named_190220.csv")) # Imports the data from working directory

#______________________________________________________________________________
# Displaying the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just makes this output look nicer
# A tibble: 57 × 3
   bin    elevation ctmax
   <chr>      <int> <dbl>
 1 OTU-01      1300  37.5
 2 OTU-01      1300  36.5
 3 OTU-01      1300  37  
 4 OTU-01      1300  36.5
 5 OTU-01      1300  36.5
 6 OTU-01      1300  34  
 7 OTU-01      1200  36  
 8 OTU-02      1200  36  
 9 OTU-03      1200  36  
10 OTU-04      1300  39.5
# ℹ 47 more rows

2.1.3.3 Running a GAM

We will be using the gam() function from the mgcv package to fit our model and the gam.check() function from the mgcv package to check whether our error terms meet or violate the model assumptions. If you want, you can use the appraise() function from the gratia package to get ggplot2 versions of the graphs you get from gam.check(). For this example, we can expect a Gaussian distribution of our error terms because CTmax is a continuous measure (not restricted to integers) and can in principle be both positive and negative (theoretically, you could have a species with a CTmax of -10°C, who knows!).

#______________________________________________________________________________
# Model fitting
staph_ctmax_gam <- gam(
  formula = ctmax ~ s(elevation, # s() specifies variable as smooth (wiggly)
                      k = 5), # Specifies number of basis functions
  data = staph_ctmax_elevation_data, # Data frame with our variables of interest
  family = "gaussian", # Specifies the expected error distribution
  method = "REML" # Algorithm that penalizes over-fitting of the data
)

#______________________________________________________________________________
# Checking assumptions
gam.check(staph_ctmax_gam)


Method: REML   Optimizer: outer newton
full convergence after 5 iterations.
Gradient range [-3.56664e-10,1.45528e-11]
(score 116.2551 & scale 3.236838).
Hessian positive definite, eigenvalue range [0.7384314,27.53047].
Model rank =  5 / 5 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

               k'  edf k-index p-value
s(elevation) 4.00 2.81    0.95    0.32

In this output, we see 4 plots and some stuff printed onto the console. The plots will help us check our assumptions, and the console log will tell us:

  1. If there was convergence in our model fitting (if not, we can’t trust the results)
  2. The behaviour of the ctmax-elevation relationship through the edf (effective degrees of freedom) (e.g. edf = 1 means a linear relationship, edf = 2 quadratic, edf = 3 cubic, and so on)
  3. A test statistic (the k-index) with a p-value that indicates whether there is leftover pattern in the residuals, with a small p-value (< 0.05) suggesting that the basis dimension (k) may be too low to capture the trend in the data (a violation).

You can often fix a low k-index by increasing the number of basis functions through the k argument of the gam() function. Strictly speaking, this check looks for pattern left over in the residuals that a larger basis could absorb, rather than directly testing constant variance across the whole gradient, which is why it can look fine even when the residual plots show changing variance (as you will see below).
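For instance, if the k-index check had flagged a problem here, a refit with a more generous basis (the value of k below is arbitrary, just for illustration) would look like this:

#______________________________________________________________________________
# Refit the same model with more basis functions and re-check
staph_ctmax_gam_k10 <- gam(
  formula = ctmax ~ s(elevation,
                      k = 10), # More basis functions than before
  data = staph_ctmax_elevation_data,
  family = "gaussian",
  method = "REML"
)

gam.check(staph_ctmax_gam_k10)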

The plots came out looking a bit squished here because they are printed as individual plots in RStudio, but here they got combined into the space of one and I didn’t try to fix it (sorry!). At this point, however, is where I would use the appraise() function from gratia to have better-looking plots.

# Prints model diagnostic plots
appraise(staph_ctmax_gam)

Figure 1: Model diagnostic plots of the Staphylinidae CTmax and elevation (m.a.s.l.) GAM.

We see in Figure 1 that our residuals mostly fall very close to the line in the “QQ plot of residuals”, which indicates that our error terms mostly behave in a Gaussian way. We can see in the “Histogram of residuals” plot that our residuals seem to be left skewed, which accounts for why not all residuals fell within the line in the “QQ plot”. Some statisticians may recommend we transform our data to address this skew, but I will leave it up to you whether to do it or not. I tend to think that if the vast majority of residuals behave according to the distribution we chose (and the ones that don’t are not absurd-looking outliers), we can go ahead with our model and then in the discussion we can talk about the outliers that did not fall perfectly within the expected distribution.

We see in the “residuals vs linear predictor” plot that there is a sideways cone shape to our residuals (it looks like a >). This indicates that variance is not constant along the values of y: where CTmaxs are higher (roughly above 40°C), there tends to be less variation than where CTmaxs are lower (below 40°C).

If we see any pattern with the mean of the residuals vs fits plot (the middle line doesn’t cut the dots in half horizontally), the assumption of independence may be violated. This does not seem to be the case with our data, but we will explore this a bit more in Section 4.2.

So, we know that our model is going to yield results that make sense (we chose the correct error distribution), but that we may be at risk of some type of error because we don’t have constant variance. This does NOT mean that our data are bad and that modeling them is therefore a fruitless venture. For one, just by trying to get a model going, we have learned that CTmax does not vary equally along its range: where we find things with CTmaxs above 40°C, we tend to find only things with high CTmax, but where we find things with CTmaxs below 40°C, we tend to find all kinds of CTmaxs, which biologically speaking is an extremely interesting find! All we have to be cautious of is overstating whatever results we find about the relationship and the shape it takes. Because the check run by gam.check() did not flag a problem (likely because it looks for leftover pattern the basis could absorb rather than testing variance across the entire gradient), we should not worry too much about it, though we should keep it in mind!

Now, we will extract the results from our fitted model.

summary(staph_ctmax_gam) # function from base R that extracts model results

Family: gaussian 
Link function: identity 

Formula:
ctmax ~ s(elevation, k = 5)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.9211     0.2383   163.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(elevation) 2.806  3.342 37.49  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.689   Deviance explained = 70.5%
-REML = 116.26  Scale est. = 3.2368    n = 57

This output is broken up into three main parts:

  1. From the “Family” section to the “Formula” section. Here, we see some descriptors about the model we just fitted.
  2. From the “Parametric coefficients” section to the “Approximate significance of smooth terms” section. Here we see information about each of the parameters of our model. For the parametric coefficients, we only have an intercept (every model has an intercept – it’s the value of y when x is 0). We see its estimated value (a CTmax of about 38.9°C when elevation is 0 m.a.s.l.), the standard error associated with that estimate, the test statistic that tested whether the estimate is different from 0, and the associated p-value. If we had categorical variables in our model, this is where we would see their results. For the smooth terms (our continuous variables of interest, which in this case is only elevation), we see the effective degrees of freedom (which indicate that the relationship between elevation and CTmax is behaving almost cubically), the Ref.df and F values, which are test statistics used in an ANOVA-like test that evaluates whether we see evidence of a significant effect of elevation, and the corresponding p-value.
  3. From the “R-sq.” line to the end of the output. Here, we see values that describe how well our model explained the variance in the data, as well as some other values that we don’t have to worry about. If the relationship we are modeling behaves linearly (edf = 1), we should use the “R-sq. (adjusted)” value to see how much variation is explained by our model, whereas if it behaves in any other, more complex way, we should look at the “deviance explained” value (this is what Wood says in his book I linked above – I’m not totally confident that I understand the difference, so we should do what he recommends). We see that our model explained 70.5% of the variance in the data, which is pretty good!

Now let’s visualize the model using the base R graphical output (it gives us some information that we can’t really get with the gratia package, as far as I know).

plot(staph_ctmax_gam,
     residuals = TRUE, # Shows the residuals 
     pch = 1, # Selects the shape of the points 
     shift = coef(staph_ctmax_gam)[1], # Shifts the scale of y to include the intercept 
     seWithMean = TRUE, # Adds the error associated with the mean to the se
     shade = TRUE, # Makes 95% confidence interval into shaded area
     xlab = "Elevation (m)") # Specifies label of x axis

Figure 2: Base R plot of the Staphylinidae CTmax and elevation (m.a.s.l.) GAM.

We see in Figure 2 evidence that CTmax decreases with elevation, but that the variance between the CTmax values increases with elevation. This is extremely interesting biologically speaking, because we see evidence that CTmax is important for Staphylinidae at lower elevations, but that it may not be the case at higher elevations given that we tend to find a larger range of values there!

Now let’s make the above plot with the ggplot2 package, which allows for great customization while being intuitive in how it works. We have already loaded ggplot2 when we loaded the tidyverse package in Section 1, so we are good to go. Note that geom_smooth() fits its own GAM internally (via mgcv) rather than re-using the model object we fitted above, so you may find some minor discrepancies between the base R plot of our mgcv model (Figure 2) and the ggplot2 plot (Figure 3).

ggplot(staph_ctmax_elevation_data, # Data frame that has our data
       aes(x = elevation, # Our x variable
           y = ctmax)) + # Our y variable
  geom_point() + # Choosing to plot our data in the form of points
  geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
              formula = y ~ s(x, # Specifies formula of our model
                              k = 5), # Number of basis functions
              se = TRUE, # Whether we want the gray 95% CI or not (we do!)
              method.args = list(family = "gaussian")) + # Error distribution
  labs(x = "Elevation (m)", # Customizing x and y axis labels
       y = "CT~max~ (°C)") + # Text between ~~ will be subscript (markdown syntax)
  theme_classic() + # Removes gray grid in background and adds lines to axes
  theme(axis.title = element_text(size = 12), # Font size for all axis titles
        axis.title.y = element_markdown(), # Allows markdown in y axis label
        axis.text = element_text(size = 10)) # Font size for all axis text

Figure 3: ggplot2 plot of the Staphylinidae CTmax (°C) and elevation (m.a.s.l.) GAM.

So in conclusion, we found evidence of a significant effect of elevation on Staphylinidae CTmax, but because of the lack of constant variance, we should be cautious when claiming certainties about the specifics of the model (e.g. the shape of the relationship). This lack of constant variance could be due to a violation in the independence assumption (here, I’m speaking in biological terms), which we will address in Section 4.

3 Beta diversity analyses

Beta diversity is a measure of the differences between communities (numbers of species and what species are there) in different places (e.g. traps, habitats, field sites, etc.). We will cover a visualization technique that allows us to see those differences in community composition and statistical tests that go along with it.
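As a tiny, made-up illustration of what “differences between communities” means numerically, here is a Bray-Curtis dissimilarity (the metric we will use later in this section) computed between two toy sites with the vegan package:

library(vegan) # Loaded properly in Section 3.1.2.1

#______________________________________________________________________________
# Two made-up sites sharing one of three species
two_sites <- rbind(site_1 = c(10, 5, 0),
                   site_2 = c(0, 5, 10))

vegdist(two_sites, method = "bray") # 0 = identical communities, 1 = nothing shared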

3.1 NMDS

3.1.1 What it is

This is a visualization technique that allows us to see the differences in community composition (what kinds of species you find) between your treatments (levels of your factor of interest). It is not a statistical test, so it should always be accompanied by a test such as PERMANOVA or ANOSIM so that we can say whether the differences we see in the NMDS are statistically significant or not.

3.1.2 How to do it in R

3.1.2.1 Load packages

First, we will load the packages required to complete the entire procedure.

library(vegan) # For community analyses

Let’s also pre-assign a colourblind palette so we can use it for the graphs we will make.

#______________________________________________________________________________
# Colour palette
cbpalette <- c(
  "#999999",
  "#E69F00",
  "#56B4E9",
  "#009E73",
  "#F0E442",
  "#0072B2",
  "#D55E00",
  "#CC79A7"
)

3.1.2.2 Prepare your data

It is possible that you will have one big excel file with all your data in it, which is great. This will have to be transformed into at least two separate datasheets, but fear not: this can all be done in R and you don’t have to worry about having several Excel files to keep track of. However, it is also possible you have the data frames already set up for your analyses somewhere, in which case you can skip this step.

Here we will import an example datasheet from Kembel and Cahill Jr (2011) that might look like yours.

#______________________________________________________________________________
# Importing the data frame
community_df <- read_excel(
  here("data",
       "raw",
       "community_df.xlsx")) # Imports the data from working directory

#______________________________________________________________________________
# Displaying the data frame
community_df # Prints the top 10 rows of the data frame if table is massive
# A tibble: 331 × 16
   plot_number species abundance   sla leaf_area leaf_thickness leaf_tissue_dens
         <dbl> <chr>       <dbl> <dbl>     <dbl>          <dbl>            <dbl>
 1           9 Viola_…        10  839.     1.05           0.2            0.00596
 2           7 Planta…        20  548.     0.256          0.223          0.00817
 3          11 Potent…        40  471.     4.28           0.343          0.00619
 4           3 Andros…        20  450.     0.255          0.36           0.00618
 5          14 Campan…        10  402.     0.998          0.247          0.0101 
 6           7 Poa_sa…        50  385.     0.347          0.203          0.0128 
 7          17 Orthoc…        60  371.     0.653          0.204          0.0132 
 8          10 Astrag…        10  359.     2.03           0.237          0.0118 
 9           9 Draba_…        10  348.     0.600          0.307          0.00938
10           9 Descur…        10  347.     0.886          0.29           0.00993
# ℹ 321 more rows
# ℹ 9 more variables: srl <dbl>, root_tissue_dens <dbl>, root_diam <dbl>,
#   habitat <chr>, site <chr>, slope <dbl>, aspect <dbl>, slope_position <dbl>,
#   rel_moisture <dbl>

As a quick side note, notice how each row represents the abundance of one species caught in a specific plot. Yours might be numbers of a species by trap, or individual specimen data, which can easily be turned into a data frame like ours if the metadata are available.
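For example, if your raw data have one row per specimen, a sketch of that conversion could use dplyr’s count() function (the data frame and column names here are hypothetical, standing in for whatever your own data look like):

# Collapse one-row-per-specimen data into one row per plot-species combination,
# with a new "abundance" column holding the counts
specimen_df %>% # Hypothetical data frame with plot_number and species columns
  count(plot_number, species, name = "abundance")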

Now that we have the data frame loaded in R, we can start preparing the building blocks required to run the NMDS. First, we will create two data frames: 1) the community matrix, which will contain the abundance of each species collected in each of our collection plots (or traps, or GPS coordinate grids – it really depends on your project), and 2) the metadata, containing information about each of the collection plots.

The species matrix can be generated as follows, and should look like this:

matrix <- community_df %>%
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  select(plot_number, # Puts plot_number as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Display the matrix
matrix
# A tibble: 17 × 77
   plot_number Achillea_millefolium Allium_textile Amelanchier_alnifolia
         <dbl>                <dbl>          <dbl>                 <dbl>
 1           1                    0              0                     0
 2           2                   10             10                     0
 3           3                   30              0                     0
 4           5                    0              0                     0
 5           6                    0              0                     0
 6           7                    0              0                     0
 7           8                    0              0                     0
 8           9                   50              0                    10
 9          10                   40              0                     0
10          11                   40             30                     0
11          12                    0             20                     0
12          13                   40             10                    10
13          14                    0             10                    10
14          15                   40             10                     0
15          16                   30              0                     0
16          17                   10              0                     0
17          18                   40              0                     0
# ℹ 73 more variables: Androsace_occidentalis <dbl>, Anemone_patens <dbl>,
#   Antennaria_neglecta <dbl>, Antennaria_parvifolia <dbl>,
#   Arnica_fulgens <dbl>, Arnica_sororia <dbl>, Artemisia_cana <dbl>,
#   Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
#   Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
#   Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
#   Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>, …

The community metadata can be generated as follows, assuming each plot row has a unique value for each of the variables. It’s possible you have several measurements of some abiotic variable, in which case you’d need to average them so that you only have one value per plot number. Please see the printed table below, because yours should look like it.

metadata <- community_df %>%  
  group_by(plot_number) %>% # Applies whatever comes next to each level of plot_number
  distinct(habitat, # Selects one of each distinct value of the below variables
           site,
           slope,
           aspect,
           slope_position,
           rel_moisture) %>% 
  arrange(plot_number) %>% # Orders by plot_number in ascending fashion
  ungroup() # Makes whole table independent from group_by variable, important later

#______________________________________________________________________________
# Display the meta data sheet
metadata # Prints the entire metadata table
# A tibble: 17 × 7
   plot_number habitat    site     slope aspect slope_position rel_moisture
         <dbl> <chr>      <chr>    <dbl>  <dbl>          <dbl>        <dbl>
 1           1 Mixedgrass Onefour      0    270            3              1
 2           2 Mixedgrass Onefour     20    130            1.5            2
 3           3 Mixedgrass Onefour      5     90            1              2
 4           5 Mixedgrass Onefour      5    130            2              1
 5           6 Mixedgrass Onefour      1     90            3              1
 6           7 Mixedgrass Onefour      8      0            2              2
 7           8 Fescue     Kinsella     5    100            3              2
 8           9 Fescue     Kinsella     5    160            1.5            3
 9          10 Fescue     Kinsella    18    155            2.5            2
10          11 Fescue     Kinsella    12    180            2              1
11          12 Fescue     Kinsella     5    164            3              3
12          13 Fescue     Kinsella    15    220            2.5            3
13          14 Fescue     Kinsella     3    160            3              3
14          15 Fescue     Kinsella     5    160            3              3
15          16 Fescue     Kinsella     6    150            1.5            3
16          17 Fescue     Kinsella     4    140            2.5            2
17          18 <NA>       <NA>        NA     NA           NA             NA

Now we are ready to run the NMDS.

3.1.2.3 Running the NMDS

To run the NMDS, we will use the metaMDS() function from the vegan package.

#______________________________________________________________________________
# Setting seed to ensure reproducible results since nmds is a permutational 
# procedure. This makes it so that the random numbers picked to generate these 
# results will always be the same. Number you pick can be anything.
set.seed(2023)

#______________________________________________________________________________
# NMDS
community_nmds <- metaMDS(matrix %>%
                            select(! 1)) # Omits column 1 (plot_number)
Square root transformation
Wisconsin double standardization
Run 0 stress 0.06758156 
Run 1 stress 0.07097971 
Run 2 stress 0.06482428 
... New best solution
... Procrustes: rmse 0.07340193  max resid 0.2350523 
Run 3 stress 0.0675814 
Run 4 stress 0.06467223 
... New best solution
... Procrustes: rmse 0.01147297  max resid 0.0343962 
Run 5 stress 0.067413 
Run 6 stress 0.0674132 
Run 7 stress 0.06482413 
... Procrustes: rmse 0.01147154  max resid 0.03440339 
Run 8 stress 0.0653533 
Run 9 stress 0.06787984 
Run 10 stress 0.06756966 
Run 11 stress 0.06787951 
Run 12 stress 0.06535327 
Run 13 stress 0.06741359 
Run 14 stress 0.06758146 
Run 15 stress 0.06467203 
... New best solution
... Procrustes: rmse 0.0003099958  max resid 0.0009114333 
... Similar to previous best
Run 16 stress 0.06788061 
Run 17 stress 0.06467214 
... Procrustes: rmse 0.0002691596  max resid 0.0007917589 
... Similar to previous best
Run 18 stress 0.07051754 
Run 19 stress 0.06482403 
... Procrustes: rmse 0.01154534  max resid 0.0353847 
Run 20 stress 0.06600201 
*** Best solution repeated 2 times

This function iteratively attempts to condense the multidimensional data frame (many columns with many rows to be compared) into a two-dimensional set of points (essentially x and y coordinates) while trying to stay true to the original differences between each point of the data frame. This is what all these “runs” are in the output above. You can imagine that the more you have to force this multidimensional set of differences into two dimensions (a regular graph), the more stress you create. In fact, this is a measure you can get! Stress is a measure of how well the NMDS plot is representing the differences in our original data set. We can plot what is called a stress plot of our NMDS to see how much stress our NMDS had when trying to condense our differences.

stressplot(community_nmds) # Generates the stress plot

Figure 4: Stress plot for NMDS done on fescue and mixed-seed grass community data

Visually, what you want to see is this “staircase” style graph (Figure 4). We can also directly get the stress value by extracting it from our metaMDS() results with community_nmds$stress, which in our case is 0.065. According to Kruskal, you want the stress value to be less than 0.2 (low stress). Often when the stress is too high, our metaMDS() function will fail to converge, so at the end of the output above where you see all of the iterations it will say something like “failed to converge”. Conversely, when the stress is too low (too close to 0), it will also fail to converge. When that is the case, you can play around with reducing or increasing the size of your plots (pooling or splitting them apart tends to work). Since our example converged nicely, we can then plot our results.
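If you want to report that stress value on its own, you can pull it straight out of the metaMDS() result object:

community_nmds$stress # Extracts the stress of the best NMDS solution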

3.1.2.4 Plotting the NMDS with ggplot2

The first thing we need to do is create a data frame that combines the results from our metaMDS() function with our metadata data frame.

data_scores <- as.data.frame(community_nmds$points) %>% # Extracts NMDS plotting points
  mutate(plot_number = metadata$plot_number, # Attaches columns from metadata table
         habitat = metadata$habitat)

Now let’s view the table.

data_scores #  Prints the table
         MDS1       MDS2 plot_number    habitat
1  -1.0309795  0.2629909           1 Mixedgrass
2  -0.7136177 -0.3885305           2 Mixedgrass
3  -0.6773968 -0.5183459           3 Mixedgrass
4  -1.6868373  0.3290805           5 Mixedgrass
5  -1.5790633 -0.2702064           6 Mixedgrass
6  -1.2286524 -0.0570536           7 Mixedgrass
7  -1.0491372  0.1059923           8     Fescue
8   0.9419983  0.4102748           9     Fescue
9   0.7731372  0.1122020          10     Fescue
10  0.2347052 -0.1745242          11     Fescue
11  0.1459202  0.3860476          12     Fescue
12  0.8159381  0.3035745          13     Fescue
13  0.6712101  0.1708772          14     Fescue
14  1.0346944 -0.3985301          15     Fescue
15  1.2593688 -0.3332723          16     Fescue
16  0.6240574  0.3493586          17     Fescue
17  1.4646544 -0.2899354          18       <NA>

This next step is optional, but I like it, so I’m including it here. In it, we generate a data frame that contains the coordinates of the centroids for each group, so when we make the plot, we can make the centroids stand out from the other data.

centroid <- data_scores %>%
  group_by(habitat) %>%
  summarize(MDS1 = mean(MDS1),
            MDS2 = mean(MDS2)) %>%
  drop_na(habitat)

At this point, we are ready to make a ggplot2 plot of our NMDS with our main variable of interest. In this case, as you might have guessed from the last table we created, we are interested in how the community of species differs between two habitats: mixed grass and fescue. So let’s see what that looks like!

ggplot(data = data_scores %>%
         drop_na(habitat), # Removes rows with missing metadata
       aes(x = MDS1,
           y = MDS2,
           colour = habitat)) +
  stat_ellipse(level = 0.95, # Specifies the confidence level of the ellipsoids
               show.legend = FALSE) + # Hides the ellipse entry from the legend
  # coord_fixed() + # Optional. Makes x and y units change at same rate
  geom_point(size = 2) +
  geom_point(data = centroid,
             aes(fill = habitat),
             size = 5,
             shape = 21,
             colour = "black",
             show.legend = FALSE) +
  labs(colour = "Habitat") +
  scale_colour_manual(values = cbpalette) +
  scale_fill_manual(values = cbpalette) +
  theme_classic() + # Gets rid of grey background grid and adds lines to axes
  theme(text = element_text(size = 12), 
        axis.text.x = element_text(size = 10),
        legend.position = "right")

Figure 5: NMDS of species present in mixed grass and fescue communities.

There we have it, Figure 5 is a nice-looking ggplot2 NMDS plot!

Visually, it looks like the two communities are different since there is little overlap between the ellipsoids, but we cannot tell whether that difference is statistically significant just by looking (well… sometimes you can, like here!), so we need a statistical test that will give us a p-value for whether they are different or not.

I learned how to make NMDS plots in ggplot2 through Pat Schloss’s amazing Riffomonas YouTube channel, which I HIGHLY recommend you check out!

An additional resource that you might find useful is the tutorial Steven Kembel made on how he analyzed the data we used from his paper.

3.2 Statistical tests for community differences (ANOSIM and PERMANOVA)

3.2.1 What they are

They are multivariate (many y variables) statistical tests that tell us whether there is evidence that two or more groups are significantly different from one another. In the Smith Lab, very often our groups will be different habitats, and the many y variables will be the species found in each habitat. The ANOSIM (analysis of similarities) can only have one grouping variable, while the PERMANOVA (permutational analysis of variance) can have many explanatory variables. Both are ANOVA-like analyses, but non-parametric (they work with any error distribution, so the residuals from your data don’t need to have a normal distribution). Please see Anderson and Walsh’s paper if you are interested in a deep dive into their differences and what each of them does (plus another test called the Mantel test), and the GUSTA ME (guide to statistical analysis in microbial ecology) website and paper for more guides about these analyses.

3.2.2 Assumptions

3.2.2.1 ANOSIM

First, a bit of background. Like the NMDS, ANOSIM uses ranks as its measures rather than what is called a Euclidean distance (the length of the straight line between the two points in question). For instance, suppose you have three buildings, and you want to measure the distance between each of them. Suppose then that you find the distance between buildings A and B is 15 m, between A and C is 10 m, and between B and C is 5 m. These are, by definition, the Euclidean distances between those buildings. Now, when you write those distances in a column and sort it from farthest to closest, you get something like this:

# A tibble: 3 × 2
  `Building combination` `Distance (m)`
  <chr>                           <dbl>
1 A to B                             15
2 A to C                             10
3 B to C                              5

Suppose now that we add a new column to this table representing the rank of these distances from farthest to closest (just like in a race – whoever finishes in the shortest time is ranked first, and so on). Our table will now look like this.

# A tibble: 3 × 3
  `Building combination` `Distance (m)`  Rank
  <chr>                           <dbl> <dbl>
1 A to B                             15     1
2 A to C                             10     2
3 B to C                              5     3

The ANOSIM will do its calculations based on the rank column rather than the distance column, which is why it’s a non-parametric test: parametric tests assume a distribution for your error terms, which means assuming population mean and variance values, which in turn requires Euclidean-distance-like numbers to calculate. Ranks, on the other hand, don’t have a distribution, only a range, so any analysis that uses ranks instead of Euclidean distances does not assume a distribution, which is the case for the ANOSIM. Note that the other assumptions that tend to come along with parametric tests (constant variance and independence) don’t apply here either, because both of them are about the behaviour of the variance in your residuals as it pertains to some distribution, and ranks don’t have one.
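Here is that distance-to-rank conversion in R, using the toy numbers from the building example above:

# Euclidean distances between the three buildings, in metres
building_distances <- c("A to B" = 15, "A to C" = 10, "B to C" = 5)

rank(-building_distances) # rank() ranks smallest first, so negate to rank farthest first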

The only assumption the ANOSIM has is that the range of ranks between the groups being compared is equal or at least very similar. In other words, the sample size between the groups needs to be equal or very similar.
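For our example data, checking this is just a matter of counting plots per group; from the metadata table we built in Section 3.1.2.2, this gives 6 mixed grass plots and 10 fescue plots (not perfectly equal, so worth keeping in mind):

table(metadata$habitat) # Counts plots per habitat (the NA row is dropped by default)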

3.2.2.2 PERMANOVA

I need to give you another bit of background before I dive into the assumptions. This test is called permutational analysis of variance because it runs an ANOVA-like test on our data, then it shuffles our data around, runs the test again, and then repeats this process many times (we can set how many times it does this, usually it’s 999). This, therefore, explains its only assumption – that each datum (singular form of data by the way!) is exchangeable with another datum under the null hypothesis.

What this means intuitively is the following: suppose I have two populations, one that belongs to species “A” and one that belongs to species “B”. Now suppose I wanted to measure the average height of each population. These two species need to be treated separately from one another (or their relation accounted for in the model) when we do our analysis, because they are two different populations (species “A” and species “B”), and individuals from species “A” might have extremely different properties than those of species “B” (just like we talked about in Section 2.1.2). Suppose then that I had a bucket with individuals of species “A”, from which I randomly picked 100 individuals, measured their height, tossed them back into the bucket, and that I did this many times. Which individuals appear in that sequence of 100 should not matter to my overall results of average height, no matter how many rounds of 100 draws I do. We can therefore say that members of species “A” are exchangeable with one another. Now what if I added a bunch of individuals from species “B” into that bucket? Suddenly, which individuals are drawn matters, because if in round one 60 draws are from species “B”, and in round two only 10 are, your average values of height for each set of 100 draws will be completely different. In the same way, the PERMANOVA analysis shuffles our data around, tests, shuffles again, tests again, and so on, and this is why the assumption of data objects being exchangeable with one another is made in the first place.

Unlike the ANOSIM, the PERMANOVA works on the dissimilarities themselves (e.g. Bray-Curtis values) rather than on their ranks (see Section 3.2.2.1 for what that distinction means), but because its p-values come from permutations of the data rather than from a theoretical distribution, it likewise makes no assumption about the distribution of your data.

3.2.3 How to do them in R

3.2.3.1 Load packages

We will use the same set of packages we used in Section 3.1.

3.2.3.2 Prepare your data

The data needs to be set up in the exact same way we did for the NMDS example, so if you’ve already done that, great! If you haven’t, please refer to Section 3.1.2.2 to see how to do it. The one thing we must do here is preemptively remove the rows with NA from our “metadata” data frame, because both ANOSIM and PERMANOVA will simply not run if there are rows with NAs.

metadata_nona <- metadata %>% 
  drop_na(habitat)

After we do this, we must then also remove plot_number 18 (the one with NA metadata) from our “matrix” data frame because now “metadata” has 16 rows while “matrix” has 17.

matrix_nona <- matrix %>% # nona stands for "no NA", and it also means grandma in Italian!
  filter(plot_number != 18) # This is the only row with NA metadata

3.2.3.3 Running an ANOSIM

To run the ANOSIM, we will use the anosim() function from the vegan package. The distance argument specifies which distance calculation will be used. There are many kinds, but we will use the Bray-Curtis Dissimilarity Index, which generates difference metrics while accounting for species abundance, which we have in this data set.

#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)

#______________________________________________________________________________
# ANOSIM
anosim_results <- anosim(matrix_nona %>%
                           select(! 1), # Drops the plot_number column
       grouping = metadata_nona$habitat,
       distance = "bray") # Specifies the distance calculation method

#______________________________________________________________________________
# Display ANOSIM results
anosim_results

Call:
anosim(x = matrix_nona %>% select(!1), grouping = metadata_nona$habitat,      distance = "bray") 
Dissimilarity: bray 

ANOSIM statistic R: 0.7222 
      Significance: 0.002 

Permutation: free
Number of permutations: 999

We get two values of interest from this output: the ANOSIM statistic R and the significance value. The first ranges from 0 to 1: the closer it is to 1, the more different the groups are, and the closer it is to 0, the more similar they are. The second is essentially a p-value. We can see that we have a significance value of 0.002 and an ANOSIM statistic R of 0.72, so we conclude that we see strong evidence that the fescue and mixed grass communities are significantly different from one another, just like the NMDS plot suggested!

3.2.3.4 Running a PERMANOVA

The PERMANOVA runs very much like an ANOVA, in that we need to specify a model equation with a y variable (here it’ll be our matrix instead of a single variable) and all of the x variables that we want to add to the analysis. This means that we are able to measure the effects of multiple x variables on our community data, including the interactions between them!

To run a PERMANOVA, we will use the adonis2() function from the vegan package.

#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)

#______________________________________________________________________________
# PERMANOVA
permanova_results <- adonis2(matrix_nona %>% select(! 1) ~ habitat,
        data = metadata_nona, #  Specifies where the explanatory variables come from
        method = "bray")

#______________________________________________________________________________
# Display PERMANOVA results
permanova_results
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = matrix_nona %>% select(!1) ~ habitat, data = metadata_nona, method = "bray")
         Df SumOfSqs      R2      F Pr(>F)   
habitat   1   1.7186 0.42686 10.427  0.002 **
Residual 14   2.3076 0.57314                 
Total    15   4.0262 1.00000                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you are familiar with ANOVA tables, you might have noticed that the PERMANOVA output looks very much like the output from a regular ANOVA. We see that p = 0.002, so the PERMANOVA agrees with the ANOSIM, showing us evidence that the communities between the fescue and mixed grass habitats are significantly different from each other!

We will now run the PERMANOVA again, only this time we will add the other meta-data variables to our model to showcase the main difference between using a PERMANOVA and an ANOSIM.

#______________________________________________________________________________
# Setting the seed for reproducible permutational results
set.seed(2023)

#______________________________________________________________________________
# PERMANOVA
interactions_permanova_results <- adonis2(matrix_nona %>% select(! 1) ~ # The ~ separates the response (left) from the predictors (right)
          habitat*slope*slope_position*rel_moisture, # Looking for interaction effects
          data = metadata_nona, # Specifies where the explanatory variables come from
          method = "bray")

#______________________________________________________________________________
# Display PERMANOVA results
interactions_permanova_results
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999

adonis2(formula = matrix_nona %>% select(!1) ~ habitat * slope * slope_position * rel_moisture, data = metadata_nona, method = "bray")
                                  Df SumOfSqs      R2      F Pr(>F)  
habitat                            1   1.7186 0.42686 7.3350  0.014 *
slope                              1   0.1193 0.02964 0.5092  0.762  
slope_position                     1   0.2223 0.05521 0.9487  0.441  
rel_moisture                       1   0.2178 0.05410 0.9296  0.472  
habitat:slope                      1   0.1522 0.03779 0.6495  0.638  
habitat:slope_position             1   0.1550 0.03849 0.6615  0.611  
slope:slope_position               1   0.1617 0.04017 0.6903  0.615  
habitat:rel_moisture               1   0.0876 0.02175 0.3737  0.866  
slope:rel_moisture                 1   0.1137 0.02823 0.4851  0.746  
slope_position:rel_moisture        1   0.1499 0.03722 0.6396  0.636  
habitat:slope:slope_position       1   0.1868 0.04640 0.7974  0.515  
habitat:slope:rel_moisture         1   0.0773 0.01920 0.3300  0.894  
slope:slope_position:rel_moisture  1   0.1954 0.04854 0.8341  0.513  
Residual                           2   0.4686 0.11639                
Total                             15   4.0262 1.00000                
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see here that despite all the other variables and interactions, “habitat” is the only variable that showed evidence of a significant effect on the two communities (p = 0.014). Therefore, in this case, using either ANOSIM or PERMANOVA would’ve made no difference to what we found, but if one of the other variables or interactions showed evidence of a significant effect, we would not have been able to capture that with the ANOSIM!
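One caveat worth knowing: adonis2() adds terms sequentially by default (first to last), so the p-values of later terms can depend on the order in which you write the model. If you want each term tested against a model that already contains all of the others, vegan lets you pass by = "margin". Here is a minimal sketch of that call (using main effects only, and not run here):

#______________________________________________________________________________
# Hypothetical: order-independent (marginal) tests of each main effect
set.seed(2023)

marginal_permanova_results <- adonis2(matrix_nona %>% select(! 1) ~
          habitat + slope + slope_position + rel_moisture,
          data = metadata_nona,
          method = "bray",
          by = "margin") # Tests each term given all of the others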

3.3 Beta partition (nestedness and turnover)

3.3.1 What it is

In short, nestedness is a measure of how much of one community is just a subset of another, while turnover is a measure of how much one community is completely different from another. If you think back to our NMDS graphs (see Section 3.1.2.4), you may have an easier time visualizing what I’m saying. Suppose we have two communities, and each is represented by an ellipsoid (just like in the NMDS graph). Now suppose one community is simply a subset of the other: what you would see is a little ellipsoid inside a bigger ellipsoid, which would indicate that the little one is nested within the larger one. If the two communities are completely different, you would see two ellipsoids that don’t overlap at all. There are different ways to measure the nestedness and turnover components of communities, but we will only learn one of those ways here. If you want to learn more about beta diversity partitioning into nestedness and turnover, check out Baselga’s theoretical paper on the topic, and Baselga and Orme’s paper about the package we will be using in this section!
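To make this concrete, here is a minimal sketch with two made-up two-site presence/absence matrices (the object names are hypothetical), one showing pure nestedness and one showing pure turnover:

#______________________________________________________________________________
# Hypothetical: site 2 holds a strict subset of site 1's species, so all of
# the beta diversity is nestedness (beta.SIM should be 0)
library(betapart)

nested_matrix <- matrix(c(1, 1, 1, 1,
                          1, 1, 0, 0),
                        nrow = 2,
                        byrow = TRUE,
                        dimnames = list(c("site_1", "site_2"),
                                        c("sp_a", "sp_b", "sp_c", "sp_d")))
beta.multi(nested_matrix)

#______________________________________________________________________________
# Hypothetical: the two sites share no species at all, so all of the beta
# diversity is turnover (beta.SNE should be 0)
turnover_matrix <- matrix(c(1, 1, 0, 0,
                            0, 0, 1, 1),
                          nrow = 2,
                          byrow = TRUE,
                          dimnames = list(c("site_1", "site_2"),
                                          c("sp_a", "sp_b", "sp_c", "sp_d")))
beta.multi(turnover_matrix)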

3.3.2 How to do it in R

3.3.2.1 Load packages

First, let’s load the packages we will need to complete the entire procedure.

library(betapart)
Warning: package 'betapart' was built under R version 4.3.2

3.3.2.2 Prepare your data

We will be using the same data we used in Section 3.1.2.2, but here we will create a presence/absence matrix. I will print the data here so you see what it looks like.

community_df
# A tibble: 331 × 16
   plot_number species abundance   sla leaf_area leaf_thickness leaf_tissue_dens
         <dbl> <chr>       <dbl> <dbl>     <dbl>          <dbl>            <dbl>
 1           9 Viola_…        10  839.     1.05           0.2            0.00596
 2           7 Planta…        20  548.     0.256          0.223          0.00817
 3          11 Potent…        40  471.     4.28           0.343          0.00619
 4           3 Andros…        20  450.     0.255          0.36           0.00618
 5          14 Campan…        10  402.     0.998          0.247          0.0101 
 6           7 Poa_sa…        50  385.     0.347          0.203          0.0128 
 7          17 Orthoc…        60  371.     0.653          0.204          0.0132 
 8          10 Astrag…        10  359.     2.03           0.237          0.0118 
 9           9 Draba_…        10  348.     0.600          0.307          0.00938
10           9 Descur…        10  347.     0.886          0.29           0.00993
# ℹ 321 more rows
# ℹ 9 more variables: srl <dbl>, root_tissue_dens <dbl>, root_diam <dbl>,
#   habitat <chr>, site <chr>, slope <dbl>, aspect <dbl>, slope_position <dbl>,
#   rel_moisture <dbl>

Now let’s create our presence/absence matrix.

#______________________________________________________________________________
# Making a presence absence community matrix
presence_matrix <- community_df %>%
  mutate(abundance = replace(abundance, # Column to be changed
                             abundance > 0, # Condition of change
                             1)) %>% # Value of change (turns > 0 values into 1)
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  column_to_rownames("plot_number") %>% # Stores plot_number as row names
  select(sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Displaying the data frames
presence_matrix %>%
  as_tibble()
# A tibble: 17 × 76
   Achillea_millefolium Allium_textile Amelanchier_alnifolia
                  <dbl>          <dbl>                 <dbl>
 1                    0              0                     0
 2                    1              1                     0
 3                    1              0                     0
 4                    0              0                     0
 5                    0              0                     0
 6                    0              0                     0
 7                    0              0                     0
 8                    1              0                     1
 9                    1              0                     0
10                    1              1                     0
11                    0              1                     0
12                    1              1                     1
13                    0              1                     1
14                    1              1                     0
15                    1              0                     0
16                    1              0                     0
17                    1              0                     0
# ℹ 73 more variables: Androsace_occidentalis <dbl>, Anemone_patens <dbl>,
#   Antennaria_neglecta <dbl>, Antennaria_parvifolia <dbl>,
#   Arnica_fulgens <dbl>, Arnica_sororia <dbl>, Artemisia_cana <dbl>,
#   Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
#   Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
#   Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
#   Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>, …

We are now ready to run our beta partition calculations.

3.3.2.3 Running the beta partition calculation

We will calculate the beta diversity partition for our fescue and mixed grass communities using the beta.multi() function from the betapart package. This calculation is done across all of our plots at once, which is why we run a single analysis on the full data set; the conclusions we draw about the habitats are inferred afterwards, and can be coupled with NMDS visualizations.

#______________________________________________________________________________
# Overall community beta diversity partition
betapartition <- beta.multi(presence_matrix)

#______________________________________________________________________________
# Displaying the results
betapartition
$beta.SIM
[1] 0.8182466

$beta.SNE
[1] 0.04983409

$beta.SOR
[1] 0.8680807

As you can see from the output, we get three values: beta.SIM (the turnover value), beta.SNE (the nestedness value), and beta.SOR (the overall beta diversity, which is the sum of the other two values). These values clearly indicate that the differences between our communities are mostly due to turnover, since the beta.SIM value (0.8182466) is much higher than the beta.SNE value (0.0498341). This perfectly matches what we saw with our NMDS plot in Section 3.1.2.4!

4 Phylogenetic analyses

Phylogenetic analyses tell us about the level of relatedness between the species in the communities we are studying and how that relatedness affects any trait data we have collected about that community. Phylogenies are hypotheses about how these relationships exist and can be constructed in a variety of ways. If you want to learn more about how they are built and how they work, check out Hall’s “phylogenetic trees made easy” book. For the analyses we are about to learn, all you need to know is that we will have to choose a phylogeny, and from it we will use branch lengths to measure phylogenetic distance, which is the shortest path along the tree between two taxa. To visualize this, imagine you put your finger on the tip of the phylogeny that represents one of your taxa, then you go up its nodes (the points at which taxa separate from a common ancestor), all the way up to the common ancestor of the two taxa in question, and then down the nodes all the way to the other taxon – the summed lengths of all the branches in that trajectory (the lengths between the nodes) is the phylogenetic distance between those two taxa.
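If you want to see such distances for yourself, here is a minimal sketch using a made-up three-taxon tree (the cophenetic() function, which we will also meet in Section 4.4, returns exactly these summed path lengths):

#______________________________________________________________________________
# A made-up tree in which A and B are close relatives and C is more distant
library(ape)

toy_tree <- read.tree(text = "((A:1,B:1):1,C:2);")

#______________________________________________________________________________
# Pairwise phylogenetic distances (summed branch lengths along each path)
cophenetic(toy_tree) # A to B = 1 + 1 = 2; A to C = 1 + 1 + 2 = 4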

4.1 Phylogenetic diversity

4.1.1 What it is

Phylogenetic diversity (PD) is the sum of the branch lengths of a subset of taxa. Suppose you have two plots with 10 species each, and that species in plot “A” are very closely related while species in plot “B” are distantly related. If we were to calculate the PD of each plot, we would find a larger number for plot “B” than for plot “A”: since the species in plot “B” are more distantly related than those in plot “A”, the branch lengths connecting them are longer. You can find out more about PD in Faith’s paper.
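As a minimal sketch of this idea (reusing the made-up tree from above, so the objects here are hypothetical):

#______________________________________________________________________________
# Two made-up plots: one holds the close relatives, one holds a distant pair
library(picante) # Also loads ape, which provides read.tree()

toy_tree <- read.tree(text = "((A:1,B:1):1,C:2);")

toy_plots <- matrix(c(1, 1, 0, # Plot with A and B only
                      1, 0, 1), # Plot with A and C only
                    nrow = 2,
                    byrow = TRUE,
                    dimnames = list(c("plot_close", "plot_distant"),
                                    c("A", "B", "C")))

#______________________________________________________________________________
# The plot with the distantly related pair should return the larger PD
pd(toy_plots, toy_tree, include.root = FALSE) # PD of 2 versus PD of 4 here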

4.1.2 How to do it in R

4.1.2.1 Load packages

We only need one package to calculate PD, and that is the picante package.

library(picante) # Does many phylogenetic things

4.1.2.2 Prepare your data

We will be using the same Staphylinidae CTmax and elevation data set we used in Section 2.1. This data set has an accompanying phylogeny that Alex created based on COI gene fragments from each individual. Those fragments were also used to assign each individual an operational taxonomic unit name (OTU-01, OTU-02, etc.), which we can use (for differentiation purposes) in place of species names.

#______________________________________________________________________________
# Importing the data frame and phylogeny

#________________________________
# Data frame
staph_ctmax_elevation_data <- read.csv( # Imports the data from selected path
  here("data",
       "raw",
       "OTU_named_190220.csv")
)

#________________________________
# Phylogeny
phy <- read.tree( # From the `ape` package which is part of `picante`
  here("data",
       "phylo_trees",
       "OTU_named_190220.nwk")
)

#______________________________________________________________________________
# Displaying things

#________________________________
# Display the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just to make this output look nicer
# A tibble: 57 × 3
   bin    elevation ctmax
   <chr>      <int> <dbl>
 1 OTU-01      1300  37.5
 2 OTU-01      1300  36.5
 3 OTU-01      1300  37  
 4 OTU-01      1300  36.5
 5 OTU-01      1300  36.5
 6 OTU-01      1300  34  
 7 OTU-01      1200  36  
 8 OTU-02      1200  36  
 9 OTU-03      1200  36  
10 OTU-04      1300  39.5
# ℹ 47 more rows
#________________________________
# Display phylogeny
plot(phy)

Before we proceed, we need to check if our phylogeny has multichotomies in it (nodes that split into three or more parts). Doing so visually is not always reliable (please see Liam Revell’s blog post about this), so instead we will use the is.binary() function from the ape package, which is included in the picante package. If the result of the function comes out as TRUE, we’re good to go. If it comes out as FALSE, then that means multichotomies are present. We can fix them by running the multi2di() function from the ape package, which resolves the multichotomies by adding branches of length 0 to the phylogeny wherever the multichotomies exist.

So now let’s see if our phylogeny is fully bifurcated (no multichotomies).

is.binary(phy)
[1] TRUE

It looks like we have no multichotomies.
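Had the check returned FALSE, the fix would be a single line (a minimal sketch; only run it if multichotomies are actually present):

#______________________________________________________________________________
# Resolves multichotomies by adding branches of length 0
phy <- multi2di(phy)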

Now that our phylogeny is in order, let’s prepare our data. Our data frame will need to be in an occurrence matrix format (same one used in Section 3.1.2.2).

#______________________________________________________________________________
# Making the matrix
staph_elevation_matrix <- staph_ctmax_elevation_data %>%
  group_by(elevation) %>%
  count(bin,
        name = "abundance") %>%
  ungroup() %>% # Need this for later
  pivot_wider(names_from = bin, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  select(elevation, # Puts elevation as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Display the matrix
staph_elevation_matrix
# A tibble: 7 × 29
  elevation `OTU-01` `OTU-02` `OTU-03` `OTU-04` `OTU-05` `OTU-06` `OTU-07`
      <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>
1        10        0        0        0        0        0        0        0
2       300        0        0        0        0        0        0        0
3       700        0        0        0        0        0        5        0
4      1000        0        0        0        1        0        0        1
5      1200        1        1        1        1        1        1        0
6      1300        6        0        0        1        0        0        0
7      1500        0        0        0        0        0        0        0
# ℹ 21 more variables: `OTU-08` <int>, `OTU-09` <int>, `OTU-10` <int>,
#   `OTU-11` <int>, `OTU-12` <int>, `OTU-13` <int>, `OTU-14` <int>,
#   `OTU-15` <int>, `OTU-16` <int>, `OTU-17` <int>, `OTU-18` <int>,
#   `OTU-19` <int>, `OTU-20` <int>, `OTU-21` <int>, `OTU-22` <int>,
#   `OTU-23` <int>, `OTU-24` <int>, `OTU-25` <int>, `OTU-26` <int>,
#   `OTU-27` <int>, `OTU-28` <int>

After we get our matrix into that format, we can proceed with the PD calculation. Note that because Alex made the phylogeny from scratch using only our taxa, we don’t need to do anything else to it. However, had we gotten a phylogeny (say, from a paper) that included taxa that we don’t have, we would need to trim that phylogeny so that it would contain only our taxa. That can be done with the prune.sample() function from the picante package, but we don’t need to do that here.
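For reference, that trimming step would look something like this (a minimal sketch, assuming a hypothetical larger phylogeny called big_phy):

#______________________________________________________________________________
# Hypothetical: trims big_phy down to only the taxa present in our matrix
trimmed_phy <- prune.sample(
  samp = staph_elevation_matrix %>%
    column_to_rownames("elevation"), # Taxa need to be the column names
  phylo = big_phy)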

4.1.2.3 Running the PD calculation

We will calculate PD using the pd() function from the picante package.

#______________________________________________________________________________
# Creating data frame with PD results
phy_diversity <- pd(
  staph_elevation_matrix %>% # Selects the data we will use
    select(! 1), # Removes the elevation column from calculation
  phy, # The phylogeny we are using
  include.root = FALSE) %>% # Tree root not generally our taxa
  mutate(elevation_m = staph_elevation_matrix$elevation) # Add elevation column

#______________________________________________________________________________
# Displaying PD results
phy_diversity %>%
  as_tibble()
# A tibble: 7 × 3
     PD    SR elevation_m
  <dbl> <dbl>       <int>
1 1.11      5          10
2 0.994     5         300
3 1.54      6         700
4 0.501     2        1000
5 1.38      6        1200
6 0.762     4        1300
7 1.19      6        1500

We can see that not only do we get our PD results, but we also get species richness counts for each of the elevations! You can now analyse these values with GAMs (see Section 2.1), or whatever analyses you have in mind.

4.2 Phylogenetic signal

4.2.1 What it is

We will begin by thinking back to the problem of independence we learned about in Section 2.1.2 and revisiting one of the examples I gave there. It goes as follows: say you wanted to model the relationship between brain size and body mass in a community of mammals, and you go ahead and draw little dots on a graph (your data points) and then you draw a line of best fit, which then shows you a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size. Now suppose you later realized that the upper half of the mammals on the graph (the heaviest and brainiest ones) are all part of the same genus, and that the mammals in the bottom half all belong to a different genus. Well, suddenly you can’t tell if the pattern you are seeing from your line of best fit is due to the relationship between brain size and body mass or if it’s just because members of each of those genera have big brains paired with big bodies and small brains paired with small bodies, and that pairing has nothing to do with the relationship between brain and body size (which in this case means that if you were to add data from different genera to your graph you would stop seeing a relationship between brain and body size). Thus, you may be seeing a pattern that isn’t really reflecting the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper.

The phylogenetic signal is a number we can get (along with a test statistic) that tells us whether this assumption of independence in the trait data is being violated by species with similar traits being too closely related. There are two different methods to calculate phylogenetic signals that we will learn here: Blomberg’s \(K\) and Pagel’s \(\lambda\). Blomberg’s \(K\) generally goes from 0 to 1, where 0 means no phylogenetic signal and 1 means a strong phylogenetic signal. It can, however, go over 1 because of the way it’s calculated, and indeed you may see this from time to time in empirical studies. Pagel’s \(\lambda\) also goes from 0 to 1, where 0 means no phylogenetic signal at all and 1 means a strong phylogenetic signal. Though it can also go over 1, it’s much less likely to do so than Blomberg’s \(K\). The two metrics are calculated differently, so to be on the safe side, it’s a good idea to do both, and if either of them shows a phylogenetic signal, you should then do a phylogenetic correction of your trait data (see Section 4.3). If you want to learn more details about these metrics, check out Münkemüller’s paper!

4.2.2 How to do it in R

4.2.2.1 Load packages

We will need the phytools package to complete our procedures. We will also use the picante package in the phylogeny and data preparation section.

library(phytools)
library(picante)

4.2.2.2 Prepare your data

We will be using the Staphylinidae CTmax and elevation data and phylogeny we loaded in Section 4.1.2.2 and doing the same phylogeny check for multichotomies we did there. Please see Section 4.1.2.2 for instructions on how to do this step.

After we load our data, we need to create a data frame with average trait values (CTmax and elevation in our case) for each taxon, so that in the end, each taxon only has one CTmax value and one elevation value.

#______________________________________________________________________________
# Making the data frame
average_trait_df <- staph_ctmax_elevation_data %>%
  group_by(bin) %>%
  mutate(mean_ctmax = mean(ctmax),
         mean_elevation = mean(elevation)) %>%
  distinct(mean_ctmax,
           mean_elevation)

#______________________________________________________________________________
# Displaying the data frame
average_trait_df
# A tibble: 28 × 3
# Groups:   bin [28]
   bin    mean_ctmax mean_elevation
   <chr>       <dbl>          <dbl>
 1 OTU-01       36.3          1286.
 2 OTU-02       36            1200 
 3 OTU-03       36            1200 
 4 OTU-04       39            1167.
 5 OTU-05       36            1200 
 6 OTU-06       37.7           783.
 7 OTU-07       39.5          1000 
 8 OTU-08       32.5           700 
 9 OTU-09       40.5           700 
10 OTU-10       39.5           700 
# ℹ 18 more rows

Then, we must match the order of the taxa names in our average_trait_df frame to the order in which the taxa are listed in the phylogeny we are using, otherwise the phylogenetic signal calculations will not work. We can do this by using the match.phylo.data() function from the picante package. It returns new phylogeny and data frame objects with matched taxa names saved in a list. We must do a separate one of these for each trait where we will be looking for a phylogenetic signal, so we will have one for CTmax and one for elevation.

#______________________________________________________________________________
# CTmax matched phylogeny and data
phy_sig_ctmax <- average_trait_df %>%
  select(bin,
         mean_ctmax) %>%
  deframe() %>% # Turns it into a named vector
  match.phylo.data(phy, .)

#______________________________________________________________________________
# Elevation matched phylogeny and data
phy_sig_elev <- average_trait_df %>%
  select(bin,
         mean_elevation) %>%
  deframe() %>% # Turns it into a named vector
  match.phylo.data(phy, .)

If we open one of the lists we just created (we’ll open the CTmax one as an example), you will see that the OTUs in the “bin” column from the “data” object are listed in the same order as the “tip.label” column in the “phy” object.

#______________________________________________________________________________
# Extracting the tip.label column of the phy object in the ctmax list
phy_sig_ctmax$phy$tip.label %>%
  head(10) # Show only first 10 objects
 [1] "OTU-08" "OTU-10" "OTU-22" "OTU-02" "OTU-19" "OTU-23" "OTU-01" "OTU-20"
 [9] "OTU-04" "OTU-12"
#______________________________________________________________________________
# Extracting the data object in the list
phy_sig_ctmax$data %>%
  head(10) # Show only first 10 objects
  OTU-08   OTU-10   OTU-22   OTU-02   OTU-19   OTU-23   OTU-01   OTU-20 
32.50000 39.50000 38.50000 36.00000 39.50000 42.50000 36.28571 40.00000 
  OTU-04   OTU-12 
39.00000 38.50000 

We are now ready to calculate Blomberg’s \(K\) and Pagel’s \(\lambda\).

4.2.2.3 Running the phylogenetic signal calculations

To calculate Blomberg’s \(K\) and Pagel’s \(\lambda\), we will use the phylosig() function from the phytools package.

#______________________________________________________________________________
# Phylogenetic signal for CTmax
set.seed(2023) # For reproducible permutational results

#________________________________
# Blomberg's K
kctmax <- phylosig(
  phy_sig_ctmax$phy, # The phylogeny we will use
  phy_sig_ctmax$data, # The trait data we will use
  method = "K", # Choose Blomberg's K as the method
  test = TRUE, # Does a hypothesis test that signal is absent
  nsim = 999 # Number of permutations
  )

#________________________________
# Pagel's Lambda 
lambdactmax <- phylosig(
  phy_sig_ctmax$phy, # The phylogeny we will use
  phy_sig_ctmax$data, # The trait data we will use
  method = "lambda", # Choose Pagel's lambda as the method
  test = TRUE, # Does a hypothesis test of lambda = 0 (no phylogenetic signal)
  nsim = 999 # Number of permutations
  )

#______________________________________________________________________________
# Display results
kctmax

Phylogenetic signal K : 1.20574 
P-value (based on 999 randomizations) : 0.102102 
lambdactmax

Phylogenetic signal lambda : 0.745975 
logL(lambda) : -75.8545 
LR(lambda=0) : 10.4023 
P-value (based on LR test) : 0.00125856 

We see a \(K\) value of 1.2, which indicates a strong phylogenetic signal. The p-value here comes from a randomization test: phylosig() shuffles the trait values across the tips of the phylogeny many times (999 in our case) and asks whether the observed \(K\) is larger than the \(K\) values from those shuffles, so the null hypothesis is that there is no phylogenetic signal. Our p-value (p = 0.10) is high, which means that, despite the large \(K\) value, this test alone does not let us rule out that the apparent signal arose by chance.

We see a \(\lambda\) value of 0.75, which also indicates a phylogenetic signal, and here the likelihood ratio test against \(\lambda = 0\) (no signal) is significant (p = 0.0013).

Now let’s do the same procedure for elevation. Elevation can be considered a trait, since it is a value associated with our specimens (as in, a trait doesn’t have to be a physiological thing, it can also be where they are found, behaviours, etc.).

#______________________________________________________________________________
# Phylogenetic signal for elevation
set.seed(2023) # For reproducible permutational results

#________________________________
# Blomberg's K
kelev <- phylosig(
  phy_sig_elev$phy, # The phylogeny we will use
  phy_sig_elev$data, # The trait data we will use
  method = "K", # Choose Blomberg's K as the method
  test = TRUE, # Does a hypothesis test that signal is absent
  nsim = 999 # Number of permutations
  )

#________________________________
# Pagel's Lambda 
lambdaelev <- phylosig(
  phy_sig_elev$phy, # The phylogeny we will use
  phy_sig_elev$data, # The trait data we will use
  method = "lambda", # Choose Pagel's lambda as the method
  test = TRUE, # Does a hypothesis test of lambda = 0 (no phylogenetic signal)
  nsim = 999 # Number of permutations
  )

#______________________________________________________________________________
# Display results
kelev

Phylogenetic signal K : 0.44045 
P-value (based on 999 randomizations) : 0.827828 
lambdaelev

Phylogenetic signal lambda : 7.33137e-05 
logL(lambda) : -214.842 
LR(lambda=0) : -0.000871588 
P-value (based on LR test) : 1 

We see a \(K\) value of 0.44, which indicates no phylogenetic signal (p = 0.83).

We see a \(\lambda\) value of 0.000073, which also indicates no phylogenetic signal (p = 1).

We now have enough evidence suggesting that our CTmax data are showing a phylogenetic signal, despite the fact that we saw no evidence of such a signal for elevation. We therefore need to correct for that so we can see whether there is an effect of elevation on CTmax regardless of how closely related the species are.

4.3 Phylogenetic independent contrast (PIC)

4.3.1 What it is

This is what’s colloquially called a phylogenetic correction, and you only need to do it if one of your traits showed evidence of a phylogenetic signal (see Section 4.2). In short, suppose you have height measures for species “A” (say 10 cm) and “B” (say 6 cm), and you want to compute the difference in the values, so you go ahead and do \(10 - 6 = 4\). The value you just found (4) is what we are here calling a contrast. Now suppose you detected a phylogenetic signal in your height measures for these two species. What the phylogenetic independent contrast (PIC) does is take that difference you just calculated and divide it by the square root of the phylogenetic distance between species “A” and “B”. The more distantly related they are, the smaller your contrast value will be, while the more closely related they are, the larger your contrast will be. Intuitively, this means the following: suppose you are comparing heights of species along an elevation gradient, and that you want to see how elevation affects height. One of your species is a beetle, the other an elephant. Your height metric won’t be able to tell us if the high contrast in height between the two species is due to elevation or just to the fact that one is an elephant and one is a beetle, so the phylogenetic correction will decrease the value of that contrast according to how distantly related elephants and beetles are. On the other hand, if you have two beetle species on this elevation gradient, with one being 5 cm tall and the other 10 m tall, then because beetles are closely related, this contrast you found is more likely to be explained by elevation and the phylogenetic correction will increase the value of that contrast in proportion to how small the phylogenetic distance between those beetle species is (also, if you see a beetle that is 10 m tall, RUN!!!). You can read more about the math and logic behind it in Felsenstein’s paper.
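To put numbers on that, the standardized contrast between species “A” and “B” is \(\frac{x_A - x_B}{\sqrt{d_{AB}}}\), where \(d_{AB}\) is the phylogenetic distance between them. With our raw height contrast of 4, a distantly related pair with \(d_{AB} = 16\) gives \(4/\sqrt{16} = 1\), while a closely related pair with \(d_{AB} = 4\) gives \(4/\sqrt{4} = 2\); the same raw difference counts for more when it occurs between close relatives.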

4.3.2 How to do it in R

4.3.2.1 Loading packages

We will need the ape package, which is already included in the picante package.

4.3.2.2 Prepare your data

We will use the data frame we created in Section 4.2.2.2 with average trait values for each BIN in our Staphylinidae CTmax and elevation data, which we called average_trait_df. We need an additional preparatory step with our phylogeny in this example because, for some reason, the function we will use to calculate the PIC values thinks our tree is not rooted and has multichotomies, even though we already checked that in Section 4.1.2.2. We will use the multi2di() function to fix that.

# CTmax listed phylogeny
phy_sig_ctmax$phy <- multi2di(phy_sig_ctmax$phy)

# Elevation listed phylogeny
phy_sig_elev$phy <- multi2di(phy_sig_elev$phy)

4.3.2.3 Running PICs

To calculate phylogenetically corrected trait value contrasts, we will use the pic() function from the ape package, which is included in the picante package. We will also store the PIC values in a data frame so we can model them later with a GAM (see Section 2.1).

# Creating data frame with phylogenetically corrected values for CTmax in one
# column and elevation in another column
pic_data <- tibble(pic_ctmax = pic(phy_sig_ctmax$data,
                                   phy_sig_ctmax$phy),
                   pic_elev = pic(phy_sig_elev$data,
                                  phy_sig_elev$phy))

We can now use these values in modeling procedures such as GAMs (Section 2.1), GLMs, and so on. We will make a plot just to see what the phylogenetically corrected values for CTmax and elevation look like in relation to one another.

ggplot(pic_data, # Data frame that has our data
       aes(x = pic_elev, # Our x variable
           y = pic_ctmax)) + # Our y variable
  geom_point() + # Choosing to plot our data in the form of points
  geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
              formula = y ~ s(x, # Specifies formula of our model
                              k = 5), # Number of basis functions
              se = TRUE, # Whether we want the gray 95% CI or not (we do!)
              method.args = list(family = "gaussian")) + # Error distribution
  labs(x = "PIC of Elevation (m)", # Customizing x and y axis labels
       y = "PIC of CT~max~") + # Text between ~~ will be subscript (markdown syntax)
  theme_classic() + # Removes gray grid in background and adds lines to axes
  theme(axis.title = element_text(size = 12), # Font size for all axis titles
        axis.title.y = element_markdown(), # Allows markdown in y axis label
        axis.text = element_text(size = 10)) # Font size for all axis text

Figure 6: Generalized additive model of phylogenetic independent contrasts of Staphylinidae CTmax (°C) and elevation (m)

We did not actually do the full model fitting procedure here (see Section 2.1 for how to do it), but it certainly looks like there is evidence of a strong, approximately linear, negative relationship between CTmax and elevation (Figure 6)!

4.4 Nearest taxon index (NTI) and net relatedness index (NRI)

4.4.1 What they are

They are a way of telling the degree to which taxa are closely related to each other in any given area (habitat, plot, trap, etc.) at the tip levels of the phylogeny (NTI) or at the deeper parts of the phylogeny (NRI). If they are very closely related, we say that the species in that area are phylogenetically clustered, and if they are very distantly related, we say the species in the area show phylogenetic evenness. Both indices have the same scale, where positive values (with p-values < 0.05) indicate clustering and negative values (with p-values > 0.95) indicate evenness. These indices are calculated by taking the standardized effect sizes of the mean pairwise distance between taxa in the area (MPD) and of the mean nearest taxon distance (MNTD), and multiplying those values by -1 (which gives you NRI and NTI, respectively). If you want to learn more about these indices, including how they are calculated, check out the picante package vignette.
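In equation form, \(\text{NRI} = -1 \times \frac{\text{MPD}_{obs} - \overline{\text{MPD}}_{null}}{\text{SD}(\text{MPD}_{null})}\), and NTI is built the same way with MNTD in place of MPD. The “mpd.obs.z” and “mntd.obs.z” columns we will extract from the outputs below are exactly these standardized values before the multiplication by -1.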

IMPORTANT NOTE: If your phylogeny was built using DNA barcode data (COI gene fragments), this means that you should NOT use NRI as a metric, because COI is good at resolving the relationships between taxa at the tips of the phylogeny (so using NTI makes sense) and not at its deeper parts (so using NRI does not make sense). If you have phylogenies made with fragments of genes that are better at resolving the deeper nodes of the phylogeny but not the tips, the opposite of what I just said is true: use NRI but don’t use NTI. If you have a phylogeny that is made up of genes that resolve both the deep parts and the tips, you can use both metrics. In this example, we will be using a COI based phylogeny, so technically I should not do NRI calculations, but since this is an example to teach you how to do it, I’ll do it anyways.

4.4.2 How to do them in R

4.4.2.1 Load packages

We will use the picante package, which you will have likely already loaded from previous sections.

4.4.2.2 Prepare your data

For this example, we will be using the Staphylinidae CTmax and elevation data and phylogeny in a similar way to how we used them in Section 4.1.2.2. The only substantial difference is that we need to turn the matrix into a presence/absence matrix (we will replace all abundance values greater than 0 with 1).

#______________________________________________________________________________
# Making the matrix
staph_elevation_presence_matrix <- staph_ctmax_elevation_data %>%
  group_by(elevation) %>%
  count(bin,
        name = "abundance") %>%
  mutate(abundance = replace(abundance, # Says what to replace
                             abundance > 0, # Condition of replacement
                             1)) %>% # Value of replacement
  ungroup() %>% # Need this for later
  pivot_wider(names_from = bin, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  select(elevation,
         sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Display the matrix
staph_elevation_presence_matrix %>%
  as_tibble()
# A tibble: 7 × 29
  elevation `OTU-01` `OTU-02` `OTU-03` `OTU-04` `OTU-05` `OTU-06` `OTU-07`
      <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1        10        0        0        0        0        0        0        0
2       300        0        0        0        0        0        0        0
3       700        0        0        0        0        0        1        0
4      1000        0        0        0        1        0        0        1
5      1200        1        1        1        1        1        1        0
6      1300        1        0        0        1        0        0        0
7      1500        0        0        0        0        0        0        0
# ℹ 21 more variables: `OTU-08` <dbl>, `OTU-09` <dbl>, `OTU-10` <dbl>,
#   `OTU-11` <dbl>, `OTU-12` <dbl>, `OTU-13` <dbl>, `OTU-14` <dbl>,
#   `OTU-15` <dbl>, `OTU-16` <dbl>, `OTU-17` <dbl>, `OTU-18` <dbl>,
#   `OTU-19` <dbl>, `OTU-20` <dbl>, `OTU-21` <dbl>, `OTU-22` <dbl>,
#   `OTU-23` <dbl>, `OTU-24` <dbl>, `OTU-25` <dbl>, `OTU-26` <dbl>,
#   `OTU-27` <dbl>, `OTU-28` <dbl>

After we get our species occurrence matrix and phylogeny, we need to match the order of the taxa in our matrix to the order in which they appear in the phylogeny. We can do this with the match.phylo.comm() function from the picante package.

#______________________________________________________________________________
# Matching order of taxa in matrix and phylogeny tips 
phy_match <- match.phylo.comm(
  phy,
  staph_elevation_presence_matrix %>%
    column_to_rownames("elevation")) # Stores elevation as row names (needed)

#______________________________________________________________________________
# Show matching taxa order from matrix and tip.label from phylogeny

# Tip labels
phy_match$phy$tip.label %>%
  head(10) # Shows only the first 10 values of the vector
 [1] "OTU-08" "OTU-10" "OTU-22" "OTU-02" "OTU-19" "OTU-23" "OTU-01" "OTU-20"
 [9] "OTU-04" "OTU-12"
# BIN column order from matrix
colnames(phy_match$comm) %>%
  head(10)
 [1] "OTU-08" "OTU-10" "OTU-22" "OTU-02" "OTU-19" "OTU-23" "OTU-01" "OTU-20"
 [9] "OTU-04" "OTU-12"

We can see that the taxa names in the “comm” object are now in the same order as the “tip.label” column in the “phy” object.

Now we need to turn our phylogeny into a distance matrix of its own. We can do this with the cophenetic() function from the stats package from base R.

#______________________________________________________________________________
# Turn matched phylogeny to distance matrix
phydist <- cophenetic(phy_match$phy)

#______________________________________________________________________________
# Display the tree as a matrix
phydist %>%
  as_tibble()
# A tibble: 28 × 28
   `OTU-08` `OTU-10` `OTU-22` `OTU-02` `OTU-19` `OTU-23` `OTU-01` `OTU-20`
      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1    0        0.506    0.346    0.400    0.350    0.414    0.499    0.487
 2    0.506    0        0.775    0.829    0.779    0.843    0.928    0.916
 3    0.346    0.775    0        0.415    0.365    0.428    0.514    0.501
 4    0.400    0.829    0.415    0        0.272    0.442    0.528    0.515
 5    0.350    0.779    0.365    0.272    0        0.392    0.478    0.465
 6    0.414    0.843    0.428    0.442    0.392    0        0.492    0.480
 7    0.499    0.928    0.514    0.528    0.478    0.492    0        0.367
 8    0.487    0.916    0.501    0.515    0.465    0.480    0.367    0    
 9    0.545    0.974    0.559    0.573    0.523    0.538    0.434    0.421
10    0.527    0.956    0.541    0.555    0.505    0.520    0.416    0.403
# ℹ 18 more rows
# ℹ 20 more variables: `OTU-04` <dbl>, `OTU-12` <dbl>, `OTU-13` <dbl>,
#   `OTU-06` <dbl>, `OTU-28` <dbl>, `OTU-24` <dbl>, `OTU-11` <dbl>,
#   `OTU-17` <dbl>, `OTU-14` <dbl>, `OTU-18` <dbl>, `OTU-05` <dbl>,
#   `OTU-15` <dbl>, `OTU-03` <dbl>, `OTU-07` <dbl>, `OTU-26` <dbl>,
#   `OTU-16` <dbl>, `OTU-27` <dbl>, `OTU-09` <dbl>, `OTU-21` <dbl>,
#   `OTU-25` <dbl>

We are now ready to run the NTI and NRI calculations.

4.4.2.3 Running NTI and NRI

First, we need to calculate the mean pairwise distance (MPD) and the mean nearest taxon distance (MNTD) using the ses.mpd() and the ses.mntd() functions from the picante package, respectively.

#______________________________________________________________________________
# Setting seed for reproducible permutational results
set.seed(2023)

#______________________________________________________________________________
# Mean pairwise distance
mpd <- ses.mpd(
  phy_match$comm, # Our matched community matrix
  phydist, # Our matched phylogeny in distance matrix format
  null.model = "taxa.labels", # Null model to which our data are compared to
  abundance.weighted = FALSE, # No abundance data, so set this to false
  runs = 1000 # Number of permutations
)

#______________________________________________________________________________
#  Mean nearest taxon distance
mntd <- ses.mntd(
  phy_match$comm, # Our matched community matrix
  phydist, # Our matched phylogeny in distance matrix format
  null.model = "taxa.labels", # Null model to which our data are compared to
  abundance.weighted = FALSE, # No abundance data, so set this to false
  runs = 1000 # Number of permutations
)

#______________________________________________________________________________
# Display results
mpd
     ntaxa   mpd.obs mpd.rand.mean mpd.rand.sd mpd.obs.rank  mpd.obs.z
10       5 0.4520776     0.4642795  0.07788384        577.0 -0.1566678
300      5 0.4105522     0.4636686  0.07811292        215.0 -0.6799955
700      6 0.5878765     0.4603934  0.06549538        921.0  1.9464441
1000     2 0.5014025     0.4726616  0.14693210        714.5  0.1956071
1200     6 0.4959122     0.4630733  0.06790422        781.0  0.4836071
1300     4 0.3912718     0.4623588  0.08783865        152.0 -0.8092903
1500     6 0.4115303     0.4587272  0.06506319        218.0 -0.7254019
     mpd.obs.p runs
10   0.5764236 1000
300  0.2147852 1000
700  0.9200799 1000
1000 0.7137862 1000
1200 0.7802198 1000
1300 0.1518482 1000
1500 0.2177822 1000
mntd
     ntaxa  mntd.obs mntd.rand.mean mntd.rand.sd mntd.obs.rank mntd.obs.z
10       5 0.3907101      0.3702727   0.05731118           660  0.3566033
300      5 0.3379609      0.3726900   0.05678562           280 -0.6115830
700      6 0.4002026      0.3621377   0.04953784           793  0.7684017
1000     2 0.5014025      0.4628358   0.13787611           788  0.2797203
1200     6 0.4050516      0.3605841   0.05171782           808  0.8598114
1300     4 0.3440324      0.3916380   0.06688031           256 -0.7118033
1500     6 0.3488892      0.3612234   0.05020143           447 -0.2456930
     mntd.obs.p runs
10    0.6593407 1000
300   0.2797203 1000
700   0.7922078 1000
1000  0.7872128 1000
1200  0.8071928 1000
1300  0.2557443 1000
1500  0.4465534 1000

We can see we get many values from each output. The values of most interest to us from each output respectively are “mpd.obs.z” and “mntd.obs.z”, which we will multiply by -1 to get NRI and NTI values, and “mpd.obs.p” and “mntd.obs.p”, which are the p-values for each measure.

We will now create a data frame with these values and add new columns to it where we will store our NRI and NTI values as well as the elevations where each value was calculated from.

#______________________________________________________________________________
# Making the data frame
clus_elev <- data.frame(nri = mpd$mpd.obs.z * -1,
                        nti = mntd$mntd.obs.z * -1,
                        nri_p = mpd$mpd.obs.p,
                        nti_p = mntd$mntd.obs.p,
                        elev = staph_elevation_presence_matrix$elevation) %>%
  pivot_longer(c(nti, nri), # Combining the NTI and NRI columns
               names_to = "index", # Name of new column with the names
               values_to = "index_value") %>% # Name of new column with values
  mutate(p_value = if_else(index == "nri", # Combining the p-value columns
                           nri_p, # If index is nri, write nri_p
                           nti_p), # else, write nti_p
         evidence_of = case_when(p_value > 0.95 ~ "Evenness",
                                 p_value <= 0.95
                                 & p_value >= 0.05 ~ "Nothing",
                                 p_value < 0.05 ~ "Clustering")) %>%
  select(!c(nti_p, # Removing superfluous columns (because we merged them above)
            nri_p))

#______________________________________________________________________________
# Displaying the data frame
clus_elev %>%
  as_tibble()
# A tibble: 14 × 5
    elev index index_value p_value evidence_of
   <int> <chr>       <dbl>   <dbl> <chr>      
 1    10 nti        -0.357   0.659 Nothing    
 2    10 nri         0.157   0.576 Nothing    
 3   300 nti         0.612   0.280 Nothing    
 4   300 nri         0.680   0.215 Nothing    
 5   700 nti        -0.768   0.792 Nothing    
 6   700 nri        -1.95    0.920 Nothing    
 7  1000 nti        -0.280   0.787 Nothing    
 8  1000 nri        -0.196   0.714 Nothing    
 9  1200 nti        -0.860   0.807 Nothing    
10  1200 nri        -0.484   0.780 Nothing    
11  1300 nti         0.712   0.256 Nothing    
12  1300 nri         0.809   0.152 Nothing    
13  1500 nti         0.246   0.447 Nothing    
14  1500 nri         0.725   0.218 Nothing    

We can see that we did not find any significant (strong) NTI and NRI values, which means that our Staphylinidae CTmax and elevation data do not show evidence that the staphylinid community is clustered or even at any point along the elevation gradient. These values would likely change with larger data sets, and in fact such an analysis has been done by Sarah Dolson, a former MSc student from the Smith Lab, and you can check out her (and coauthors’) paper here to see what they found!

We can take this a step further and plot the NRI and NTI values along the elevation gradient.

#______________________________________________________________________________
# Adding a letter column so that we can differentiate the two plots by letter
clus_elev <- clus_elev %>% 
  mutate(letter = if_else(index == "nri",
                          "A",
                          "B")) #  For labeling facet_wrap
#______________________________________________________________________________
# Plot
ggplot(
  clus_elev %>%
    drop_na(evidence_of),
  aes(x = elev,
      y = index_value)) +
  geom_point(aes(fill = evidence_of),
             shape = 21) + # Empty circle point
  geom_smooth(method = "gam", 
              formula = y ~ s(x, k = 5), 
              se = T, 
              method.args = list(family = "gaussian")) +
  labs(x = "Elevation (m)",
       y = NULL,
       fill = "Evidence of") +
  scale_fill_manual(values = c("orange", # Dots coloured according to p-value
                               "gray88", 
                               "green"),
                    breaks = c("Clustering",
                               "Nothing",
                               "Evenness")) +
  facet_wrap(~ index,
             labeller = labeller(index = c(nri = "NRI", # Capitalizing indices
                                           nti = "NTI")),
             strip.position = "left") + # Puts strips left. Looks like y label
  geom_text(aes(label = letter), # Adds A and B label to plots
            x = -Inf, # This and below: position adjustments, size, and bold
            y = Inf,
            hjust = -0.5, 
            vjust = 1.5,
            fontface = "bold",
            size = 7) +
  theme_classic() +
  theme(text = element_text(size = 12), 
        axis.text.x = element_text(size = 10),
        legend.title = element_text(size = 10),
        legend.key.size = unit(0.25, "cm"),
        legend.margin = margin(t = -0, r = 3, b = 3, l = 3),
        legend.position = c(0.88, 0.12),
        legend.background = element_rect(fill = "white",
                                         colour = "black"),
        strip.background = element_blank(),
        strip.placement = "outside",
        strip.text = element_text(size = 12))

Figure 7: NTI and NRI of Staphylinidae communities along an elevation gradient coloured by significance of p-value.

4.5 Grafting phylogenies

4.5.1 What it is

Sometimes, you will need to combine two phylogenies into one to make your phylogenetic analyses stronger. For instance, suppose you have COI barcode data for ant species in your collections. Since COI is not good at inferring deeper phylogenetic relations, you may want to take an ant phylogeny from the literature (or make your own) made from gene fragments that better estimate the deeper nodes of the phylogeny, and combine it with your COI phylogeny, which is good at estimating the relationships at the tips.

4.5.2 How to do it in R

4.5.2.1 Load packages

First, we will load the packages required to complete the entire procedure, as well as some to visualize the phylogeny. This procedure uses a lot of base R, but it can also be done using the tidyverse (if you take the time to figure out how to do it that way).

library(phytools) # For phylogeny manipulations
library(picante) # For other phylogeny functions and packages such as "ape"
library(ggtree) # To visualize the trees. This needs to be installed outside
# of CRAN, through BioConductor. Google "install ggtree" and it should come up.

4.5.2.2 Prepare your data

NOTE: Before we begin, it is mandatory that the tip names on all your phylogenies are formatted in the same way. Here, tips will have the “Species_name” format, where the genus name is capitalized and spaces are replaced with underscores. If any of your phylogenies are not formatted in this way, I recommend changing them to the format above so that the code below runs smoothly. You can do this by hand (using notepad or whatever way you think is easiest), or you can do it in R using the functions from the stringr package (good for manipulating strings), or any other way you can think of. Since the phylogenies in this example are already formatted appropriately, we don’t need to worry about this step here.
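For reference, that reformatting could look something like this (a minimal sketch, assuming a hypothetical phylogeny called my_tree whose tips look like “genus species”):

#______________________________________________________________________________
# Hypothetical: capitalize the genus and swap spaces for underscores
library(stringr)

my_tree$tip.label <- my_tree$tip.label %>%
  str_to_sentence() %>% # Capitalizes the first letter, lowercases the rest
  str_replace_all(" ", "_") # Replaces spaces with underscores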

In this example, we will be grafting 12 spider barcode-based phylogenies for genera that had more than one species in Chris Ho’s Master’s project onto Wheeler et al.’s tree, which is really good at estimating deep phylogenetic relations of spiders. There is an important bit of logic to understand here: if you have genera in your collection that only have one species, then it won’t make a difference whether you represent them in the phylogeny as their full species name or only their genus name, because they are the only representative of that branch. This allows us to use phylogenies from the literature that have our genus but not necessarily our species in them.

We will now load our phylogeny files into R. As I said above, they are already formatted consistently. Also, we will use specific nomenclature when referring to each of them to make the process a bit more intuitive. The phylogeny that is good at representing the deeper relationships of the tree will serve as the “backbone”. The phylogenies good at representing relationships at the tips (such as barcode phylogenies) will be named after the nodes at which they fuse with the “backbone” phylogeny, so in this case, their genera.

It is important that we save all phylogenies to be grafted onto the backbone in a vector object (which turns into a list if you have more than one phylogeny to graft); this is a necessary step for the grafting code even if you only have one phylogeny to graft.

#______________________________________________________________________________
# Backbone phylogeny from Wheeler et al. (2017)
backbone <- read.tree(
  here("data",
       "phylo_trees",
       "ML_pGH16_by_gene_boot1K_named_algonquin.nwk")
)

#______________________________________________________________________________
# Phylogenies to be grafted onto the backbone tree
subtrees <- c(
  Agelenopsis = read.tree(here("data", "phylo_trees", "Agelenopsis.nwk")),
  Araneus = read.tree(here("data", "phylo_trees", "Araneus.nwk")),
  Bathyphantes = read.tree(here("data", "phylo_trees", "Bathyphantes.nwk")),
  Cicurina = read.tree(here("data", "phylo_trees", "Cicurina.nwk")),
  Clubiona = read.tree(here("data", "phylo_trees", "Clubiona.nwk")),
  Neoantistea = read.tree(here("data", "phylo_trees", "Neoantistea.nwk")),
  Pelegrina = read.tree(here("data", "phylo_trees", "Pelegrina.nwk")),
  Philodromus = read.tree(here("data", "phylo_trees", "Philodromus.nwk")),
  Tetragnatha = read.tree(here("data", "phylo_trees", "Tetragnatha.nwk")),
  Theridion = read.tree(here("data", "phylo_trees", "Theridion.nwk")),
  Walckenaeria = read.tree(here("data", "phylo_trees", "Walckenaeria.nwk")),
  Xysticus = read.tree(here("data", "phylo_trees", "Xysticus.nwk"))
)

At this stage, you should format your backbone tree so that only the genus name is showing at each tip. In this example, Alex made those changes by hand in Notepad for our genera of interest, but if you want to do it in R (again, provided your tips are in the “Species_name” format) this is the code you would use.

If we run this code, the entire “backbone” phylogeny gets trimmed to the first word in the tip label before the first underscore, which may be genus, family, or whatever classification you are after, and it will keep only one branch per classification (genus, etc.). Doing this is fine so long as you are not dropping tips that you will use for your analysis.

Note that this backbone tree is a large spider tree of life where their names are in the “Family_Genus_species” format, so when we run the code below, most tips will be named after their families, but the ones Alex changed by hand to be genus only will show up as genera.

#________________________________
# Save current tip labels and genera names in separate objects 
tips <- backbone$tip.label
genera <- unique(sapply(strsplit(tips, "_"), function(x) x[1]))
#________________________________
# Creates named number vector where the numbers represent the position of the 
# first occurrence of each genus within the phylogeny
ii <- sapply(genera, function(x, y) grep(x, y)[1], y = tips)
#________________________________
# Drops tips that are not in the ii object
backbone <- drop.tip(backbone, setdiff(backbone$tip.label, tips[ii]))
#________________________________
# Rename tips to be genus name only
backbone$tip.label <- sapply(strsplit(backbone$tip.label, "_"), function(x) x[1])

Now let’s see what this phylogeny looks like (you will have to zoom in to read the tip labels because there are so many taxa!).

ggtree(backbone,
       layout = "circular",
       open.angle = 10) +
  geom_tippoint(size = 1,
                show.legend = FALSE) +
  geom_tiplab(colour = "black",
              size = 1.5,
              nudge_x = 0.1) + # Adds family names away from phylogeny
  geom_rootpoint()

Figure 8: Spider phylogeny condensed to family in most cases and to genus for the genera Alex identified by hand. We modified the genus nodes for the genera we had barcode-based phylogenies for.

4.5.2.3 Running the grafting procedure

To graft the species-level phylogenies onto our backbone phylogeny, we will run the for loop below, which was modified from this online discussion.

# Note that "i" represents any one object in the "subtrees" object we created
for(i in 1:length(subtrees)){
  
  # Creates object with only genus name from subtrees by separating at 
  # the underscores
  genus <- strsplit(subtrees[[i]]$tip.label[[1]], "_")[[1]][1]
  
  # Creates object with maximum branch length of each subtree
  h <- max(nodeHeights(subtrees[[i]]))
  
  # Object that identifies which of the backbone tree tips matches the 
  # genus written in the genus object above, writes its position
  tip <- which(backbone$tip.label == genus)
  
  # Binds backbone tree to subtrees at the genera stored in tip object above
  backbone <- bind.tree(backbone, subtrees[[i]], where = tip)
}

That’s it! We have successfully grafted barcode-based phylogenies onto a tree that was built with different gene fragments.

Note that how the branch lengths were generated may differ between trees, which means the grafted branches might be scaled differently than the branches from the backbone tree (sort of like having different units). Therefore, you must consider the implications of this in relation to what you are using the grafted phylogeny for. If you are using the tree to check for phylogenetic signals and do phylogenetic corrections, I believe you should be good, since you’ll still have a rough estimate of the level of relatedness between the species represented in the grafted phylogeny (as opposed to no estimate at all, which is what happens if you don’t use any phylogenetic hypothesis). If you are thinking about using the grafted phylogeny as a definitive statement of the level of relatedness between your species, you’ll likely be unable to do so with this, since the branch lengths will not be on a consistent scale across the tree. There are likely other scenarios to consider, but it’ll depend on what your project is.

Now we will visualize the grafted tree.

ggtree(backbone,
       layout = "circular",
       open.angle = 10) +
  geom_tippoint(size = 1,
                show.legend = FALSE) +
  geom_tiplab(colour = "black",
              size = 1.5,
              nudge_x = 0.1) + # Adds family names away from phylogeny
  geom_rootpoint()

Figure 9: Spider phylogeny with grafted species-level phylogenies Alex made using DNA barcodes for certain spider genera. We modified the genus nodes for the genera we had barcode-based phylogenies for.

If you compare Figure 8 and Figure 9 (zoom in to read the names!), you can see that where the genera used to be listed in Figure 8, we instead see the grafted phylogenies with the BOLD specimens in Figure 9.

5 Appendix

Below are the R version and package versions I used at the time of writing this document.

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggtree_3.8.2    phytools_1.9-16 maps_3.4.1      picante_1.8.2  
 [5] ape_5.7-1       betapart_1.6    vegan_2.6-4     lattice_0.21-8 
 [9] permute_0.9-7   gratia_0.8.1    mgcv_1.8-42     nlme_3.1-162   
[13] readxl_1.4.3    here_1.0.1      ggtext_0.1.2    lubridate_1.9.3
[17] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
[21] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.4  
[25] tidyverse_2.0.0 schtools_0.4.1  glue_1.6.2