Smith Lab common stats using R

Author: Alexandre Loureiro
Affiliation: PhD candidate | Smith Lab, University of Guelph
Last updated: November 8, 2023

NOTE: This document is a work in progress. If you see any errors, please reach out to either Alex or me so we can address them.

1 Introduction

This is a collection of some of the most commonly used statistical procedures and visualization techniques in the Smith Lab. Generally speaking, this document is structured from the perspective of someone who wants to analyse community ecology data (and is completely biased towards the procedures I performed during my PhD), but that does not mean these procedures are only useful in that context. Even if you are not doing a community ecology project, you might find this tome useful.

Here, you will find brief descriptions of what each procedure is, links to useful resources, and step-by-step instructions on how to perform them in R. Depending on what year you are reading this, some of the R functions we use here may have changed in how they work, but these instructions should give you at least a starting point. You can find the R and package versions used during the writing of this document in Section 5.

I have broken down every procedure into its steps, and displayed the code and outputs for you to follow along. I have annotated every code chunk with brief explanations of what most things in the code do in case you are not familiar with a specific command. If a command has already been explained in a comment, I generally do not explain it again when it reappears later in the document.

If you are not comfortable reading R code, here is a quick guide on how to do so.

  • Variables get their values assigned by this operator: <-. This means that whatever is to the right of that arrow gets stored in the name we write on the left, and every time you call that name, all of the values assigned to it show up. For example, if you write x <- 3, every time you tell R to run x, the number 3 will appear.
#________________________________
# Assign the number 3 to x
x <- 3

#________________________________
# Display whatever is stored in x
x
[1] 3
  • Functions are operations you perform on the data, and you can identify them by the parentheses that always follow them. For instance, the function that calculates means in R is mean(). The things that go inside the parentheses are the parameters of the function (such as your data file, any tweaking you want to do to how the function does its calculations, etc.). Their values are assigned with an = sign and are separated by commas, so every function will look something like function(x = something, y = something_else).
  • Piping is a concept in which you chain one function after another so that when you run a line of code, it runs all of the chained procedures at once. This, perhaps most importantly, makes the code more readable to us (the computer doesn’t really care if functions are chained). The piping operator I use throughout this document is %>%, which comes from the magrittr package. Base R has its own operator (|>), which does the same job for everyday use (any differences in speed or edge-case behaviour are small enough that you are unlikely to notice them, so use whichever one you like!). Piping operators can be read as “then” (e.g. do this, then do that, then do that, etc.). For example, data_frame %>% select(variable_x) %>% distinct() can be read as “take my data frame called ‘data_frame’, then select variable ‘x’ from it, then show me the distinct values of ‘x’” (there is a tiny runnable example of this style right after this list).
  • Finally, I tend to break up my code into a new line at most instances of a comma and at every instance of a piping operator, because this stops the code from running all the way to the side of the screen, which improves readability (in my opinion). This, however, is a stylistic choice, and you don’t have to do it if you don’t want to. I find it also helps me when I’m editing the code.
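To make the piping and line-breaking style concrete, here is a tiny runnable example using the starwars data frame that ships with dplyr (the same data we peek at later in Section 2.1.3.2). It does nothing useful; it just shows the syntax.

library(dplyr) # Part of the tidyverse we load below; called here just for this example

#________________________________
# Take the starwars data frame, then keep only the species column, then show
# one row per distinct species
starwars %>%
  select(species) %>%
  distinct()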

Now we are ready to start. First, let’s load some packages that we will be using throughout this document.

# This is an alternative way of calling your R packages to your session instead
# of writing "library()" for every package that you have. Either one is fine,
# but I added this here so you know it exists.
lapply(
  c(
    "tidyverse", # For data manipulation and graphics (also loads magrittr pipe)
    "ggtext", # For easier axis title writing when we need sub or superscripts
    "here", # For easier file pathing in the project directory
    "readxl" # For importing Excel files
  ),
  library, # Function to be applied over values in vector above
  character.only = TRUE) # Tells library() to treat the names above as character strings

2 Alpha diversity analyses

Alpha diversity is a measure of the local diversity of species in any given place, often talked about in terms of number of species but sometimes taking into account their relative abundances as well. We will cover a couple of ways that you are likely to analyse alpha diversity data.

2.1 Generalized additive model (GAM)

2.1.1 What it is

These models basically do magic, but since this magic can be complex I will only give you a brief and simplified explanation of how they work. If, however, you are interested in learning more about them, I highly recommend Noam Ross’s GAMs in R course (which at the time of this writing is completely free!), Gavin Simpson’s YouTube workshop on GAMs, and if you really want to know more about the math behind them, Simon Wood’s book.

In short, GAMs are models that can learn from the data what the best shape to represent them is. I will try to show you why I called this magic with an example.

Most models assume beforehand the behaviour of the data in question, and the math behind them (the calculation of parameters), along with the subsequent test statistics and p-values that evaluate the model as a description of the data, are all based on that assumed shape. Take linear models for instance – by definition, they assume a linear relationship between the variables in question. If the variables you are working with do not have a linear relationship, you are bound to miss perhaps important behaviours of the system you are trying to represent with your model. For example, suppose you have a positive hump-shaped relationship between two variables (let’s say species diversity and elevation), and then while doing your statistical analyses you fit a linear model. By doing so, you will be able to capture the positive nature of the relationship (species diversity increases with elevation), but you will not capture the humped nature of that relationship (in the diversity and elevation hypothetical here, there would be a dramatic increase of diversity at a certain point which would then stop happening, and you would completely miss that since your linear model would tell you that diversity is changing in a linear, one-step-at-a-time fashion). On the other hand, if your data behave linearly, a linear model would capture the nature of the relationship between the two variables very well, since it would assume the correct behaviour of the data. Remember, however, how I said at the beginning of this that GAMs can learn from the data what the best shape to represent them is? This means that if the data behave linearly, our GAM will come out looking and behaving like a linear model, and if our data behave “humped-shapedly”, our GAM will take that form instead!

Briefly, this is done through things called basis functions, which when put together make up a GAM. Imagine that you take your entire elevation gradient and you split it, for instance, into 10 equidistant sections. Now suppose you fit, for instance, a cubic regression model to the diversity and elevation data for each of those 10 sections, and that you could then combine the results of each of those sections to ultimately describe the diversity-elevation gradient in one grand model. Well, that is more or less how GAMs work! The basis functions are the functions used throughout the 10 sections of the gradient, and the GAM is the overall model that combines the results from each of the sections!
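If you want to see this shape-learning in action, here is a toy sketch you can paste into R. It uses made-up numbers (not any of our lab data) to simulate a hump-shaped diversity-elevation relationship, then fits both a linear model and a GAM so you can compare how well each describes the hump.

#______________________________________________________________________________
# Toy example with simulated (fake) data: a hump-shaped diversity-elevation
# relationship
library(mgcv) # For fitting GAMs (we load it again properly in Section 2.1.3.1)

set.seed(1) # Makes the simulated data reproducible

elevation <- seq(0, 2000, length.out = 100) # Fake elevation gradient
diversity <- 20 + 15000 * dnorm(elevation, mean = 1000, sd = 400) + # Hump-shaped trend
  rnorm(100, sd = 2) # Plus some random noise

linear_fit <- lm(diversity ~ elevation) # Forces a straight line onto the hump
gam_fit <- gam(diversity ~ s(elevation), method = "REML") # Lets the data pick the shape

AIC(linear_fit, gam_fit) # The GAM should describe the humped data far better (lower AIC)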

2.1.2 Assumptions

The assumptions of GAMs are the same as those for parametric linear models, which you might already have been exposed to in undergraduate statistics courses. There are three main assumptions:

  1. Error distribution chosen is correct
  2. Independence of observations
  3. Constant variance of observations

Assumption number 1 might have been presented to you in the past as the assumption of “normality”, and you might have also heard things like “make sure your data are normal!”, or “are your data normally distributed?”, or “you should transform your data to make them normal.”, or any number of sentences like that. This stems from the fact that statistics courses tend to emphasize parametric tests, which tend to assume a Gaussian distribution (known as the normal distribution) of the error terms (if the data are normally distributed, this gets reflected in the error terms, which is what matters for modeling. For more information on this, see Montgomery’s book, or any other basic statistics book). In GAMs, we can choose the error distribution we think is most appropriate to our data based on our knowledge of their nature, so we are not bound by the normal distribution and generally do not need to do any transformations to our data. For example, count data are made up of integers (whole numbers, not decimals) and cannot have negative values (it is impossible for you to count -10 oranges – what would that even mean?). The Gaussian distribution goes from -∞ to +∞ and includes decimal numbers, so any model using the normal distribution to predict count data would be predicting nonsensical things (such as counting -10.5 oranges!). Therefore, it is crucial that you choose the correct error distribution for your model, because otherwise your results just will not make any sense (you will get outputs and p-values and everything else, but they will be very poor representations of the system you are trying to describe!).
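To make that concrete, here is a minimal, hypothetical sketch of what choosing a different error distribution looks like in code (the gam() function itself is covered properly in Section 2.1.3; the data frame and variable names here are made up).

#______________________________________________________________________________
# Hypothetical count-data model: the family argument is where you tell gam()
# what error distribution to expect
count_gam <- gam(
  species_count ~ s(elevation, k = 5), # Made-up count response and smooth term
  data = my_count_data, # Made-up data frame
  family = poisson(link = "log"), # A distribution for non-negative integers
  method = "REML"
)
# If your counts are overdispersed, mgcv also offers nb() (negative binomial)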

Assumption 2 refers to the relationship between each of your data points, and whether they are influenced by (or related to) each other. In technical terms, the two points must not be drawn from different populations if the assumption of independence is to be upheld (they must be from the same population). Of course, this is a border-line impossible thing to control (especially in field studies!), since for all intents and purposes, things are connected to each other with varying degrees of intensity, so nothing is truly independent. However, statistically speaking, what you are trying to see is whether the behaviour of each data point is independent from others or not, and you can minimize this through your experimental design or by accounting for that dependence in your model. For instance, suppose we want to measure the average concentration of red blood cells in a population of 50 students. If, when you collect your data, you take measurements from only 25 students, but measure each of them twice, suddenly your data points are not independent, because every two observations will be closely related to each other, whereas if you had sampled each of the 50 students once, it is less likely that the data would be related. The consequence of violating this assumption is that you are at an increased risk of making a Type I error, which is finding a pattern that suggests a relationship between two variables that doesn’t really exist. I’ll illustrate what that means with another example. Say you wanted to model the relationship between brain size and body mass in a community of mammals, and you go ahead and draw little dots on a graph (your data points) and then you draw a line of best fit, which then shows you a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size. Now suppose you later realized that the upper half of the mammals on the graph (the heaviest and brainiest ones) are all part of the same genus, and that the bottom half mammals all belong to a different genus. Well, suddenly you can’t tell if the pattern you are seeing from your line of best fit is due to the relationship between brain size and body mass or if it’s just because members of each of those genera have big brains paired with big bodies and small brains paired with small bodies, and that pairing has nothing to do with the relationship between brain and body size (which in this case means that if you were to add data from different genera to your graph you would stop seeing a relationship between brain and body size). Thus, you may be seeing a pattern that isn’t really reflecting the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper, which we will revisit in Section 4.3.

Assumption 3 has to do with variance, and it is at the root of statistical thinking. Suppose you want to see how variable “A” is affected by variable “B” (say blueberry yield and the application of a specific herbicide concentration). The most basic way to think about this is in a signal vs. noise setting. If there is a relationship between “A” and “B” (i.e. a signal exists between blueberry yield and that concentration of that herbicide), it is generally obscured by noise (variance caused by many other things such as rainfall, efficiency of application of the herbicide, blueberry plant health, etc.), and we use statistics to quantify the noise and the signal so we can see how big the signal actually is (i.e. how strong the relationship between “A” and “B” is). Now suppose what you want is to see how variable “A” is affected along differing levels by variable “B” (i.e. how blueberry yield is affected along a gradient of concentrations of this herbicide). Here, there may be differing levels of noise along the gradient (e.g. suppose lower concentrations have smaller variance because at those levels the herbicide doesn’t really do anything to the plants, but that variance increases as concentrations increase because different plants have different tolerances to the herbicide), so you need to ensure you either account for this change in variance in your experimental design or in your model, otherwise you are at a higher risk of making a Type II error, which is finding no evidence of a relationship between “A” and “B” when there actually is one (i.e. too much noise for you to find the signal). If your violation of constant variance comes from unequal sample sizes along the factor levels you have chosen, the consequences may be a bit different. Suppose you found low variance in the lower concentrations of herbicide and high variance in high concentrations – there are two scenarios that can arise here. 1) If you have fewer samples for the low herbicide concentrations (low variance associated with fewer samples) and many more samples of the high herbicide concentrations (high variance associated with higher number of samples), you are at risk of Type II error; 2) If you have fewer samples for the high herbicide concentrations (higher variance associated with fewer samples) and more samples of the low herbicide concentrations (low variance associated with higher number of samples), you are now at risk of making a Type I error.

There are ways to check how our data meet (or violate) these assumptions when fitting our GAMs, and knowing whether or not you violate them is important because we can then address them methodologically, or at the very least, we can discuss how our conclusions may be affected by these violations (if they exist).

2.1.3 How to do it in R

2.1.3.1 Load packages

First, we will load the packages required to complete the entire procedure.

library(mgcv) # For fitting GAMs
library(gratia) # For cleaner-looking assumption plots

2.1.3.2 Prepare your data

Your data for GAMs should be laid out in the “each row is an observation and each column is a variable” format. For instance, suppose you wanted to get information about the inhabitants of a galaxy far far away such as their name, birth year (before the battle of Yavin), weight, height, etc. An intuitive way to lay out those data would be to make every row represent a specific inhabitant and every column a variable about that inhabitant, so that in the end you would have something like this:

# This data set is part of the dplyr package
starwars # Displays data about all Star Wars characters
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Your own data need to be laid out in the same way, where each row is an observation (e.g. individual organisms, or collection sites) and each column is a variable about that observation (e.g. elevation, temperature, humidity, etc.).

We will be using the following Staphylinidae elevation and heat tolerance data from Costa Rica to learn how to run GAMs in R.

#______________________________________________________________________________
# Importing the data frame
staph_ctmax_elevation_data <- read.csv(
  here("data",
       "raw",
       "OTU_named_190220.csv")) # Imports the data from working directory

#______________________________________________________________________________
# Displaying the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just makes this output look nicer
# A tibble: 57 × 3
   bin    elevation ctmax
   <chr>      <int> <dbl>
 1 OTU-01      1300  37.5
 2 OTU-01      1300  36.5
 3 OTU-01      1300  37  
 4 OTU-01      1300  36.5
 5 OTU-01      1300  36.5
 6 OTU-01      1300  34  
 7 OTU-01      1200  36  
 8 OTU-02      1200  36  
 9 OTU-03      1200  36  
10 OTU-04      1300  39.5
# ℹ 47 more rows

2.1.3.3 Running a GAM

We will be using the gam() function from the mgcv package to fit our model and the gam.check() function from the mgcv package to check whether our error terms meet or violate the model assumptions. If you want, you can use the appraise() function from the gratia package to have ggplot2 versions of the graphs you get from gam.check(). For this example, we can expect a Gaussian distribution of our error terms because CTmax is not an integer and can be both positive and negative (theoretically, you could have a species with a CTmax of -10°C, who knows!).

#______________________________________________________________________________
# Model fitting
staph_ctmax_gam <- gam(
  formula = ctmax ~ s(elevation, # s() specifies variable as smooth (wiggly)
                      k = 5), # Specifies number of basis functions
  data = staph_ctmax_elevation_data, # Data frame with our variables of interest
  family = "gaussian", # Specifies the expected error distribution
  method = "REML" # Algorithm that penalizes over-fitting of the data
)

#______________________________________________________________________________
# Checking assumptions
gam.check(staph_ctmax_gam)


Method: REML   Optimizer: outer newton
full convergence after 5 iterations.
Gradient range [-3.56664e-10,1.45528e-11]
(score 116.2551 & scale 3.236838).
Hessian positive definite, eigenvalue range [0.7384314,27.53047].
Model rank =  5 / 5 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

               k'  edf k-index p-value
s(elevation) 4.00 2.81    0.95    0.32

In this output, we see 4 plots and some stuff printed onto the console. The plots will help us check our assumptions, and the console log will tell us:

  1. If there was convergence in our model fitting (if not, we can’t trust the results)
  2. The behaviour of the ctmax elevation relationship through the edf (effective degrees of freedom) (e.g. edf = 1 means a linear relationship, edf = 2 quadratic, edf = 3 cubic, and so on)
  3. A test statistic with a p-value that indicates whether the residuals are randomly distributed or not (constant variance assumption), with a small p-value (< 0.05) indicating non-random distribution of the residuals (violation).

You can often fix a violation of constant variance by increasing the number of basis functions in the k argument in the gam() function (I don’t know for sure, but it looks like this test is for whether the residuals behave randomly within each basis function rather than among all functions. I say this because of what you will see below).
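Here is roughly what that refit looks like for our Staphylinidae model (a sketch I have not run in this document; note that k can only be increased as far as the number of unique elevation values in the data allows).

#______________________________________________________________________________
# Refit with more basis functions and re-check the diagnostics
staph_ctmax_gam_k7 <- gam(
  formula = ctmax ~ s(elevation,
                      k = 7), # More basis functions than the original k = 5
  data = staph_ctmax_elevation_data,
  family = "gaussian",
  method = "REML"
)

gam.check(staph_ctmax_gam_k7) # Compare the new k-index and p-value to the fit above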

The plots came out looking a bit squished here because they are printed as individual plots in RStudio, but here they got combined into the space of one and I didn’t try to fix it (sorry!). At this point, however, is where I would use the appraise() function from gratia to have better-looking plots.

# Prints model diagnostic plots
appraise(staph_ctmax_gam)

Figure 1: Model diagnostic plots of the Staphylinidae CTmax and elevation (m.a.s.l.) GAM.

We see in Figure 1 that our residuals mostly fall very close to the line in the “QQ plot of residuals”, which indicates that our error terms mostly behave in a Gaussian way. We can see in the “Histogram of residuals” plot that our residuals seem to be left skewed, which accounts for why not all residuals fell within the line in the “QQ plot”. Some statisticians may recommend we transform our data to address this skew, but I will leave it up to you whether to do it or not. I tend to think that if the vast majority of residuals behave according to the distribution we chose (and the ones that don’t are not absurd-looking outliers), we can go ahead with our model and then in the discussion we can talk about the outliers that did not fall perfectly within the expected distribution.

We see in the “residuals vs linear predictors” plot that there is a sideways cone shape to our residuals (looks like >). This indicates that the variance is not constant along the values of y, and that where CTmaxs are higher (looks like above 40°C), there tends to be less variation than where CTmaxs are lower (below 40°C).

If we see any pattern with the mean of the residuals vs fits plot (the middle line doesn’t cut the dots in half horizontally), the assumption of independence may be violated. This does not seem to be the case with our data, but we will explore this a bit more in Section 4.2.

So, we know that our model is going to yield results that make sense (we chose the correct error distribution), but that we may be at risk of seeing some type of error because we don’t have constant variance. This being said, it does NOT mean that our data are bad and that therefore modeling them is a fruitless venture. For one, just by trying to get a model going, we have learned that CTmax does not vary equally along its range: where we find things with CTmaxs above 40°C, we tend to find only things with high CTmax, but where we find things with CTmaxs below 40°C, we tend to find all kinds of CTmaxs, which biologically speaking is an extremely interesting find! All we have to be cautious of is overstating whatever results we find about the relationship and what shape it shows. Because our test statistic in the gam.check() function did not indicate that we violated constant variance, we should not worry too much about it (could be because it tests variance within each basis function instead of across the entire gradient, I don’t know), though we should keep it in mind!

Now, we will extract the results from our fitted model.

summary(staph_ctmax_gam) # function from base R that extracts model results

Family: gaussian 
Link function: identity 

Formula:
ctmax ~ s(elevation, k = 5)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.9211     0.2383   163.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(elevation) 2.806  3.342 37.49  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.689   Deviance explained = 70.5%
-REML = 116.26  Scale est. = 3.2368    n = 57

This output is broken up into three main parts:

  1. From the “Family” section to the “Formula” section. Here, we see some descriptors about the model we just fitted.
  2. From the “Parametric coefficients” section to the “Approximate significance of smooth terms” section. Here we see information about each of the parameters of our model. For the parametric coefficients, we only have an intercept (every model has an intercept – it’s the value of y when x is 0). We see its estimated value (CTmax of about 38.9°C when elevation is 0 m.a.s.l.), the standard error associated with that estimate, the test statistic that tested whether the estimate is different from 0, and the associated p-value. If we had other categorical variables in our model, this is where we would see their results. For the smooth terms (our continuous variables of interest, which in this case is only elevation), we see its effective degrees of freedom (which indicate the relationship between elevation and CTmax is behaving almost cubically), the Ref.df and F values, which are test statistic values used in an ANOVA that evaluates whether we see evidence of a significant effect of elevation or not, and the corresponding p-value.
  3. From the R-sq. line to the end of the output. Here, we see values that describe how well our model explained the variance in the data, as well as some other values that we don’t have to worry about. If the relationship we are modeling behaved linearly (edf = 1), we should use the “R-sq. (adjusted)” value to see how much variation is explained by our model, whereas if it is behaving in any other, more complex way, we should look at the “deviance explained” value (this is what Wood says in his book I linked above – I’m not totally confident that I understand the difference, so we should do what he recommended). We see that our model explained 70.5% of the deviance in the data, which is pretty good! (If you want to pull these numbers out in code rather than reading them off the printout, see the short snippet right after this list.)
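The element names below are what mgcv’s summary.gam() stores; if in doubt, run str(summary(staph_ctmax_gam)) to see everything that is in the summary object.

#______________________________________________________________________________
# Extracting pieces of the model summary programmatically
staph_gam_summary <- summary(staph_ctmax_gam)

staph_gam_summary$r.sq # Adjusted R-squared
staph_gam_summary$dev.expl # Proportion of deviance explained
staph_gam_summary$s.table # edf, Ref.df, F, and p-value for the smooth term(s)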

Now let’s visualize the model using the base R graphical output (it gives us some information that we can’t really get with the gratia package, as far as I know).

plot(staph_ctmax_gam,
     residuals = TRUE, # Shows the residuals 
     pch = 1, # Selects the plotting symbol for the dots (1 = open circles)
     shift = coef(staph_ctmax_gam)[1], # Shifts the scale of y to include the intercept 
     seWithMean = TRUE, # Adds the error associated with the mean to the se
     shade = TRUE, # Makes 95% confidence interval into shaded area
     xlab = "Elevation (m)") # Specifies label of x axis

Figure 2: Base R plot of the Staphylinidae CTmax and elevation (m.a.s.l.) GAM.

We see in Figure 2 evidence that CTmax decreases with elevation, but that the variance between the CTmax values increases with elevation. This is extremely interesting biologically speaking, because we see evidence that CTmax is important for Staphylinidae at lower elevations, but that it may not be the case at higher elevations given that we tend to find a larger range of values there!

Now let’s make the above plot with the ggplot2 package, which allows for great customization while being intuitive in how it works. We have already loaded ggplot2 when we loaded the tidyverse package in Section 1, so we are good to go. Note that geom_smooth() fits its own GAM behind the scenes (it calls mgcv, but with its own default settings unless you pass them through method.args), so you may find some minor discrepancies between the base R plot of our mgcv model (Figure 2) and the ggplot2 plot (Figure 3).

ggplot(staph_ctmax_elevation_data, # Data frame that has our data
       aes(x = elevation, # Our x variable
           y = ctmax)) + # Our y variable
  geom_point() + # Choosing to plot our data in the form of points
  geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
              formula = y ~ s(x, # Specifies formula of our model
                              k = 5), # Number of basis functions
              se = TRUE, # Whether we want the gray 95% CI or not (we do!)
              method.args = list(family = "gaussian")) + # Error distribution
  labs(x = "Elevation (m)", # Customizing x and y axis labels
       y = "CT~max~ (°C)") + # Text between ~~ will be subscript (markdown syntax)
  theme_classic() + # Removes gray grid in background and adds lines to axes
  theme(axis.title = element_text(size = 12), # Font size for all axis titles
        axis.title.y = element_markdown(), # Allows markdown in y axis label
        axis.text = element_text(size = 10)) # Font size for all axis text

Figure 3: ggplot2 plot of the Staphylinidae CTmax (°C) and elevation (m.a.s.l.) GAM.

So in conclusion, we found evidence of a significant effect of elevation on Staphylinidae CTmax, but because of the lack of constant variance, we should be cautious when claiming certainties about the specifics of the model (e.g. the shape of the relationship). This lack of constant variance could be due to a violation in the independence assumption (here, I’m speaking in biological terms), which we will address in Section 4.

3 Beta diversity analyses

Beta diversity is a measure of the differences between communities (numbers of species and what species are there) in different places (e.g. traps, habitats, field sites, etc.). We will cover a visualization technique that allows us to see those differences in community composition and statistical tests that go along with it.

3.1 NMDS

3.1.1 What it is

This is a visualization technique that allows us to see the differences in community composition (what kinds of species you find) between your treatments (levels of your factor of interest). It is not a statistical test, so it should always be accompanied by a test such as PERMANOVA or ANOSIM so that we can say whether the differences we see in the NMDS are statistically significant or not.
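For reference, here is a minimal sketch of what that accompanying test could look like using vegan’s adonis2() function (a PERMANOVA). The object names species_matrix and plot_metadata are placeholders standing in for a community matrix and its matching metadata (one row per plot, in the same order, with no missing group labels) like the ones we build below; this only shows the shape of the call, not a full recipe.

#______________________________________________________________________________
# PERMANOVA sketch: does community composition differ between habitats?
library(vegan) # Also loaded in the next subsection

adonis2(species_matrix ~ habitat, # Placeholder community matrix ~ grouping variable
        data = plot_metadata, # Placeholder metadata data frame
        method = "bray", # Bray-Curtis dissimilarity, a common choice
        permutations = 999) # Permutations used to build the null distribution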

3.1.2 How to do it in R

3.1.2.1 Load packages

First, we will load the packages required to complete the entire procedure.

library(vegan) # For community analyses

Let’s also pre-assign a colourblind palette so we can use it for the graphs we will make.

#______________________________________________________________________________
# Colour palette
cbpalette <- c(
  "#999999",
  "#E69F00",
  "#56B4E9",
  "#009E73",
  "#F0E442",
  "#0072B2",
  "#D55E00",
  "#CC79A7"
)

3.1.2.2 Prepare your data

It is possible that you will have one big Excel file with all your data in it, which is great. This will have to be transformed into at least two separate datasheets, but fear not: this can all be done in R and you don’t have to worry about having several Excel files to keep track of. However, it is also possible you have the data frames already set up for your analyses somewhere, in which case you can skip this step.

Here we will import an example datasheet, from Kembel and Cahill Jr (2011), that might look like yours.

#______________________________________________________________________________
# Importing the data frame
community_df <- read_excel(
  here("data",
       "raw",
       "community_df.xlsx")) # Imports the data from working directory

#______________________________________________________________________________
# Displaying the data frame
community_df # Prints the top 10 rows of the data frame if table is massive
# A tibble: 331 × 16
   plot_number species abundance   sla leaf_area leaf_thickness leaf_tissue_dens
         <dbl> <chr>       <dbl> <dbl>     <dbl>          <dbl>            <dbl>
 1           9 Viola_…        10  839.     1.05           0.2            0.00596
 2           7 Planta…        20  548.     0.256          0.223          0.00817
 3          11 Potent…        40  471.     4.28           0.343          0.00619
 4           3 Andros…        20  450.     0.255          0.36           0.00618
 5          14 Campan…        10  402.     0.998          0.247          0.0101 
 6           7 Poa_sa…        50  385.     0.347          0.203          0.0128 
 7          17 Orthoc…        60  371.     0.653          0.204          0.0132 
 8          10 Astrag…        10  359.     2.03           0.237          0.0118 
 9           9 Draba_…        10  348.     0.600          0.307          0.00938
10           9 Descur…        10  347.     0.886          0.29           0.00993
# ℹ 321 more rows
# ℹ 9 more variables: srl <dbl>, root_tissue_dens <dbl>, root_diam <dbl>,
#   habitat <chr>, site <chr>, slope <dbl>, aspect <dbl>, slope_position <dbl>,
#   rel_moisture <dbl>

As a quick side note, notice how each row represents the abundance of one species caught in a specific plot. Yours might be numbers of a species by trap, or individual specimen data, which can easily be turned into a data frame like ours if the metadata are available (there is a small sketch of that conversion right below).
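This sketch assumes a made-up data frame called specimen_df with one row per individual specimen and columns plot_number and species; your own column names will differ.

#______________________________________________________________________________
# Collapse individual specimen records into abundances per plot and species
specimen_abundances <- specimen_df %>%
  count(plot_number, # One count per combination of plot and species
        species,
        name = "abundance") # Name of the new column holding the counts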

Now that we have the data frame loaded in R, we can start preparing the building blocks required to run the NMDS. First, we will create two data frames: 1) the community matrix, which will contain the abundance of each species collected in each of our collection plots (or traps, or GPS coordinate grids, really depends on your project), and 2) the metadata, containing information about each of the collection plots.

The species matrix can be generated as follows, and should look like this:

matrix <- community_df %>%
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  select(plot_number, # Puts plot_number as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically

#______________________________________________________________________________
# Display the matrix
matrix
# A tibble: 17 × 77
   plot_number Achillea_millefolium Allium_textile Amelanchier_alnifolia
         <dbl>                <dbl>          <dbl>                 <dbl>
 1           1                    0              0                     0
 2           2                   10             10                     0
 3           3                   30              0                     0
 4           5                    0              0                     0
 5           6                    0              0                     0
 6           7                    0              0                     0
 7           8                    0              0                     0
 8           9                   50              0                    10
 9          10                   40              0                     0
10          11                   40             30                     0
11          12                    0             20                     0
12          13                   40             10                    10
13          14                    0             10                    10
14          15                   40             10                     0
15          16                   30              0                     0
16          17                   10              0                     0
17          18                   40              0                     0
# ℹ 73 more variables: Androsace_occidentalis <dbl>, Anemone_patens <dbl>,
#   Antennaria_neglecta <dbl>, Antennaria_parvifolia <dbl>,
#   Arnica_fulgens <dbl>, Arnica_sororia <dbl>, Artemisia_cana <dbl>,
#   Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
#   Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
#   Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
#   Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>, …

The community metadata can be generated as follows, assuming each plot row has a unique value for each of the variables. It’s possible you have several measurements of some abiotic variable, so in that case you’d need to average it so that you only have one value per plot number. Please see the printed table at the end, because yours should look like it.

metadata <- community_df %>%  
  group_by(plot_number) %>% # Applies whatever comes next to each level of plot_number
  distinct(habitat, # Selects one of each distinct value of the variables below
           site,
           slope,
           aspect,
           slope_position,
           rel_moisture) %>% 
  arrange(plot_number) %>% # Orders by plot_number in ascending fashion
  ungroup() # Makes whole table independent from group_by variable, important later

#______________________________________________________________________________
# Display the meta data sheet
metadata # Prints the entire metadata table
# A tibble: 17 × 7
   plot_number habitat    site     slope aspect slope_position rel_moisture
         <dbl> <chr>      <chr>    <dbl>  <dbl>          <dbl>        <dbl>
 1           1 Mixedgrass Onefour      0    270            3              1
 2           2 Mixedgrass Onefour     20    130            1.5            2
 3           3 Mixedgrass Onefour      5     90            1              2
 4           5 Mixedgrass Onefour      5    130            2              1
 5           6 Mixedgrass Onefour      1     90            3              1
 6           7 Mixedgrass Onefour      8      0            2              2
 7           8 Fescue     Kinsella     5    100            3              2
 8           9 Fescue     Kinsella     5    160            1.5            3
 9          10 Fescue     Kinsella    18    155            2.5            2
10          11 Fescue     Kinsella    12    180            2              1
11          12 Fescue     Kinsella     5    164            3              3
12          13 Fescue     Kinsella    15    220            2.5            3
13          14 Fescue     Kinsella     3    160            3              3
14          15 Fescue     Kinsella     5    160            3              3
15          16 Fescue     Kinsella     6    150            1.5            3
16          17 Fescue     Kinsella     4    140            2.5            2
17          18 <NA>       <NA>        NA     NA           NA             NA

Now we are ready to run the NMDS.

3.1.2.3 Running the NMDS

To run the NMDS, we will use the metaMDS() function from the vegan package.

#______________________________________________________________________________
# Setting seed to ensure reproducible results since nmds is a permutational 
# procedure. This makes it so that the random numbers picked to generate these 
# results will always be the same. Number you pick can be anything.
set.seed(2023)

#______________________________________________________________________________
# NMDS
community_nmds <- metaMDS(matrix %>%
                            select(! 1)) # Omits column 1 (plot_number)
Square root transformation
Wisconsin double standardization
Run 0 stress 0.06758156 
Run 1 stress 0.07097971 
Run 2 stress 0.06482428 
... New best solution
... Procrustes: rmse 0.07340193  max resid 0.2350523 
Run 3 stress 0.0675814 
Run 4 stress 0.06467223 
... New best solution
... Procrustes: rmse 0.01147297  max resid 0.0343962 
Run 5 stress 0.067413 
Run 6 stress 0.0674132 
Run 7 stress 0.06482413 
... Procrustes: rmse 0.01147154  max resid 0.03440339 
Run 8 stress 0.0653533 
Run 9 stress 0.06787984 
Run 10 stress 0.06756966 
Run 11 stress 0.06787951 
Run 12 stress 0.06535327 
Run 13 stress 0.06741359 
Run 14 stress 0.06758146 
Run 15 stress 0.06467203 
... New best solution
... Procrustes: rmse 0.0003099958  max resid 0.0009114333 
... Similar to previous best
Run 16 stress 0.06788061 
Run 17 stress 0.06467214 
... Procrustes: rmse 0.0002691596  max resid 0.0007917589 
... Similar to previous best
Run 18 stress 0.07051754 
Run 19 stress 0.06482403 
... Procrustes: rmse 0.01154534  max resid 0.0353847 
Run 20 stress 0.06600201 
*** Best solution repeated 2 times

This function iteratively attempts to condense the multidimensional data frame (many columns with many rows to be compared) into a two-dimensional set of points (essentially x and y coordinates) while trying to stay true to the original differences between each point of the data frame. This is what all these “runs” are in the output above. You can imagine that forcing this multidimensional set of differences into two dimensions (a regular graph) creates a lot of stress. In fact, this is a measure you can get! Stress is a measure of how well the NMDS plot is representing the differences in our original data set. We can plot what is called a stress plot of our NMDS to see how much stress our NMDS had when trying to condense our differences.

stressplot(community_nmds) # Generates the stress plot

Figure 4: Stress plot for NMDS done on fescue and mixed-grass community data.

Visually, what you want to see is this “staircase” style graph (Figure 4). We can also directly get the stress value by extracting it from our metaMDS() results with community_nmds$stress, which in our case is 0.065. According to Kruskal, you want the stress value to be less than 0.2 (low stress). Often when the stress is too high, our metaMDS() function will fail to converge, so at the end of the output above where you see all of the iterations it will say something like “failed to converge”. Conversely, when the stress is too low (too close to 0), it will also fail to converge. When that is the case, you can play around with reducing or increasing the size of your plots (pooling or splitting them apart tends to work), or give the function more attempts, as sketched below. Since our example converged nicely, we can then plot our results.
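As an aside, if your own run refuses to converge even after rejigging your plots, one knob worth knowing about is metaMDS()’s trymax argument, which simply allows the function more random starts than its default of 20. This is a sketch, not something we needed for our example.

#______________________________________________________________________________
# Same call as before, but with more random starts allowed
community_nmds_more_tries <- metaMDS(matrix %>%
                                       select(! 1), # Omits column 1 (plot_number)
                                     trymax = 100) # Up to 100 random starts instead of 20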

3.1.2.4 Plotting the NMDS with ggplot2

The first thing we need to do is create a data frame that combines the results from our metaMDS() function with our metadata data frame.

data_scores <- as.data.frame(community_nmds$points) %>% # Extracts NMDS plotting points
  mutate(plot_number = metadata$plot_number, # Attaches columns from metadata table
         habitat = metadata$habitat)

Now let’s view the table.

data_scores #  Prints the table
         MDS1       MDS2 plot_number    habitat
1  -1.0309795  0.2629909           1 Mixedgrass
2  -0.7136177 -0.3885305           2 Mixedgrass
3  -0.6773968 -0.5183459           3 Mixedgrass
4  -1.6868373  0.3290805           5 Mixedgrass
5  -1.5790633 -0.2702064           6 Mixedgrass
6  -1.2286524 -0.0570536           7 Mixedgrass
7  -1.0491372  0.1059923           8     Fescue
8   0.9419983  0.4102748           9     Fescue
9   0.7731372  0.1122020          10     Fescue
10  0.2347052 -0.1745242          11     Fescue
11  0.1459202  0.3860476          12     Fescue
12  0.8159381  0.3035745          13     Fescue
13  0.6712101  0.1708772          14     Fescue
14  1.0346944 -0.3985301          15     Fescue
15  1.2593688 -0.3332723          16     Fescue
16  0.6240574  0.3493586          17     Fescue
17  1.4646544 -0.2899354          18       <NA>

This next step is optional, but I like it, so I’m including it here. In it, we generate a data frame that contains the coordinates of the centroids for each group, so when we make the plot, we can make the centroids stand out from the other data.

centroid <- data_scores %>%
  group_by(habitat) %>%
  summarize(MDS1 = mean(MDS1),
            MDS2 = mean(MDS2)) %>%
  drop_na(habitat)

At this point, we are ready to make a ggplot2 plot of our NMDS with our main variable of interest. In this case, as you might have guessed from the last table we created, we are interested in how the community of species differs between the two habitats: mixed grass and fescue. So let’s see what that looks like!

ggplot(data = data_scores %>%
         drop_na(habitat), # Removes rows with missing metadata
       aes(x = MDS1,
           y = MDS2,
           colour = habitat)) +
  stat_ellipse(level = 0.95, # Specifies the confidence level of the ellipses
               show.legend = FALSE) + # Keeps the ellipses out of the legend
  # coord_fixed() + # Optional. Makes x and y units change at same rate
  geom_point(size = 2) +
  geom_point(data = centroid,
             aes(fill = habitat),
             size = 5,
             shape = 21,
             colour = "black",
             show.legend = FALSE) +
  labs(colour = "Habitat") +
  scale_colour_manual(values = cbpalette) +
  scale_fill_manual(values = cbpalette) +
  theme_classic() + # Gets rid of grey background grid and adds lines to axes
  theme(text = element_text(size = 12), 
        axis.text.x = element_text(size = 10),
        legend.position = "right")

Figure 5: NMDS of species present in mixed grass and fescue communities.

There we have it,