
**NOTE**: This document is a work in progress. If you see any errors, please reach out to either Alex or myself so we can address them.

This is a collection of some of the most used statistical procedures and visualization techniques in the Smith Lab. Generally speaking, this document is structured from the perspective of someone who wants to analyse community ecology data (and is completely biased towards the procedures I performed during my PhD), but that does not mean these procedures are only useful in that context, so even if you are not doing a community ecology project, you might find this tome useful.

Here, you will find brief descriptions of what each procedure is, links to useful resources, and step-by-step instructions on how to perform them in R. Depending on what year you are reading this, some of the R functions we use here may have changed in how they work, but these instructions should give you at least a starting point. You can find the R and package versions used during the writing of this document in Section 5.

I have broken down every procedure into its steps, and displayed the code and outputs for you to follow along. I have annotated every code chunk with brief explanations of what most things in the code do in case you are not familiar with a specific command. If I wrote a comment explaining what a command does, it is unlikely I did it again if the code reappears down the line.

If you are not comfortable reading R code, here is a quick guide on how to do so.

- Variables get their values assigned by this operator: `<-`. This means that whatever is to the right of that arrow gets stored in the name we write on the left, and every time you call that name, all of the values assigned to it show up. For example, if you write `x <- 3`, every time you tell R to run `x`, the number `3` will appear.

```
#________________________________
# Assign the number 3 to x
x <- 3
#________________________________
# Display whatever is stored in x
x
```

`[1] 3`

- Functions are operations you perform on the data, and you can identify them by the parentheses that always follow them. For instance, the function that calculates means in R is `mean()`. The things that go inside the parentheses are the parameters of the function (such as your data file, any tweaking you want to do to how the function does its calculations, etc.). Their values are assigned with an `=` sign and are separated by commas, so every function will look something like `function(x = something, y = something_else)`.
- Piping is a concept in which you chain one function after another so that when you run a line of code, it runs all of the chained procedures at once. This, perhaps most importantly, makes the code more readable to us (the computer doesn’t really care if functions are chained). The piping operator I use throughout this document is `%>%`, which comes from the `magrittr` package. Base R has its own operator (`|>`), but it’s slower than `%>%` (as of the writing of this document), though you’d only notice the difference in speed if you are running procedures that take a long time to finish (so use whichever one you like!). Piping operators can be read as “then” (e.g. do this, then do that, then do that, etc.). For example, `data_frame %>% select(variable_x) %>% distinct()` can be read as “take my data frame called ‘data_frame’, then select variable ‘x’ from it, then show me the distinct values of ‘x’”.
- Finally, I tend to break up my code into a new line at most instances of a comma and at every instance of a piping operator, because this stops the code from running all the way to the side of the screen, which improves readability (in my opinion). This, however, is a stylistic choice, and you don’t have to do it if you don’t want to. I find it also helps me when I’m editing the code.
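To make that concrete, here is a tiny, self-contained example combining a function call with named parameters and a short pipe chain (the numbers are made up purely for illustration):

```r
library(magrittr) # Provides the %>% pipe (also loaded later via tidyverse)

# A function call with parameters named with "=" and separated by commas;
# na.rm = TRUE tells mean() to drop the missing value before averaging
mean(x = c(1, 2, NA), na.rm = TRUE) # [1] 1.5

# The same calculation as a pipe: "take the vector, then average it"
c(1, 2, NA) %>%
  mean(na.rm = TRUE) # [1] 1.5
```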

Now we are ready to start. First, let’s load some packages that we will be using throughout this document.

```
# This is an alternative way of calling your R packages to your session instead
# of writing "library()" for every package that you have. Either one is fine,
# but I added this here so you know it exists.
lapply(
  c(
    "tidyverse", # For data manipulation and graphics (also loads magrittr pipe)
    "ggtext", # For easier axis title writing when we need sub or superscripts
    "here", # For easier file pathing in the project directory
    "readxl" # For importing Excel files
  ), # Vector of package names for lapply to loop over
  library, character.only = TRUE) # Tells library() the names are character strings
```

Alpha diversity is a measure of the local diversity of species in any given place, often talked about in terms of number of species but sometimes taking into account their relative abundances as well. We will cover a couple of ways that you are likely to analyse alpha diversity data.

These models basically do magic, but since this magic can be complex I will only give you a brief and simplified explanation of how they work. If, however, you are interested in learning more about them, I highly recommend Noam Ross’s GAMs in R course (which at the time of this writing is completely free!), Gavin Simpson’s YouTube workshop on GAMs, and if you really want to know more about the math behind them, Simon Wood’s book.

In short, GAMs are models that can learn from the data what the best shape to represent them is. I will try to impart on you why I called this magic with an example.

Most models assume beforehand the behaviour of the data in question, and the math behind them (calculation of parameters) and the subsequent test statistics and p-values that evaluate the model as a description of the data are all based on that shape. Take linear models, for instance – by definition, they assume a linear relationship between the variables in question. If the variables you are working with do not have a linear relationship, you are bound to miss perhaps important behaviours of the system you are trying to represent with your model. For example, suppose you have a positive hump-shaped relationship between two variables (let’s say species diversity and elevation), and then while doing your statistical analyses you fit a linear model. By doing so, you will be able to capture the positive nature of the relationship (species diversity increases with elevation), but you will not capture the humped nature of that relationship (in the diversity and elevation hypothetical here, there would be a dramatic increase of diversity at a certain point which would then stop happening, and you would completely miss that since your linear model would tell you that diversity is changing in a linear, one-step-at-a-time fashion). On the other hand, if your data behave linearly, a linear model would capture the nature of the relationship between the two variables very well, since it would assume the correct behaviour of the data.

Remember, however, how I said at the beginning of this that GAMs can learn from the data what the best shape to represent them is? This means that if the data behave linearly, our GAM will come out looking, and behaving, like a linear model, and if our data behave “humped-shapedly”, our GAM will take that form instead!

Briefly, this is done through things called basis functions, which when put together make up a GAM. Imagine that you take your entire elevation gradient and you split it, for instance, into 10 equidistant sections. Now suppose you fit, for instance, a cubic regression model to the diversity and elevation data for each of those 10 sections, and that you could then combine the results of each of those sections to ultimately describe the diversity-elevation gradient in one grand model. Well, that is more or less how GAMs work! The basis functions are the functions used throughout the 10 sections of the gradient, and the GAM is the overall model that combines the results from each of the sections!
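If you want to peek under the hood, `mgcv` lets you construct the basis functions for a smooth without fitting anything. This is purely a rough illustration (the variable `x`, the number of basis functions, and the choice of a cubic regression spline are all arbitrary), not a step you need in a real analysis:

```r
library(mgcv) # For smoothCon(), which constructs smooths

d <- data.frame(x = seq(0, 10, length.out = 100)) # Arbitrary predictor values

# Construct (but do not fit) a cubic regression spline smooth with k = 5
sm <- smoothCon(s(x, k = 5, bs = "cr"), data = d)[[1]]

dim(sm$X) # 100 rows (observations) by 5 columns (one per basis function)
```

The fitted smooth in a GAM is just a weighted sum of those columns, with the weights estimated from the data.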

The assumptions of GAMs are the same as those for parametric linear models, which you might already have been exposed to in undergraduate statistics courses. There are three main assumptions:

1. **Error distribution chosen is correct**
2. **Independence of observations**
3. **Constant variance of observations**

Assumption number 1 might have been presented to you in the past as the assumption of “normality”, and you might have also heard things like “make sure your data are normal!”, or “are your data normally distributed?”, or “you should transform your data to make them normal.”, or any number of sentences like that. This stems from the fact that statistics courses tend to emphasize parametric tests, which tend to assume a Gaussian distribution (known as the normal distribution) of the error terms (if the data are normally distributed, this gets reflected in the error terms, which is what matters for modeling. For more information on this, see Montgomery’s book, or any other basic statistics book). In GAMs, we can choose the error distribution we think is most appropriate to our data based on our knowledge of its nature, so we are not bound by the normal distribution and generally do not need to do any transformations to our data. For example, count data is made up of integers (whole numbers, not decimals) and cannot have negative values (it is impossible for you to count -10 oranges – what would that even mean?). The Gaussian distribution goes from -\(\infty\) to +\(\infty\) and includes decimal numbers, so any model using the normal distribution to predict count data would be predicting nonsensical things (such as counting -10.5 oranges!). Therefore, it is crucial that you choose the correct error distribution for your model, because otherwise your results just will not make any sense (you will get outputs and p-values and everything else, but they will be very poor representations of the system you are trying to describe!).
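To see why the choice of error distribution matters in practice, here is a minimal sketch with simulated count data (all numbers invented for illustration). Because the Poisson family uses a log link, its predictions can never be negative, while nothing stops a Gaussian model from predicting impossible negative counts:

```r
set.seed(42) # Makes the simulation reproducible

elev <- 1:50 # A made-up gradient
counts <- rpois(50, lambda = exp(0.03 * elev)) # Simulated counts along it

m_pois <- glm(counts ~ elev, family = poisson)  # Count-appropriate distribution
m_norm <- glm(counts ~ elev, family = gaussian) # "Normality" assumption instead

# The log link guarantees strictly positive predictions from the Poisson model
all(predict(m_pois, type = "response") > 0) # [1] TRUE
```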

Assumption 2 refers to the relationship between each of your data points, and whether they are influenced by (or related to) each other. In technical terms, any two points must not be drawn from different populations if the assumption of independence is to be upheld (they must be from the same population). Of course, this is a borderline impossible thing to control (especially in field studies!), since for all intents and purposes, things are connected to each other with varying degrees of intensity, so nothing is truly independent. However, statistically speaking, what you are trying to see is whether the behaviour of each data point is independent from others or not, and you can minimize this through your experimental design or by accounting for that dependence in your model. For instance, suppose we want to measure the average concentration of red blood cells in a population of 50 students. If, when you collect your data, you take measurements from only 25 students, but measure each of them twice, suddenly your data points are not independent, because every two observations will be closely related to each other, whereas if you had sampled each of the 50 students once, it is less likely that the data would be related. The consequence of violating this assumption is that you are at an increased risk of making a **Type I error**, which is finding a pattern that suggests a relationship between two variables that doesn’t really exist. I’ll illustrate what that means with another example. Say you wanted to model the relationship between brain size and body mass in a community of mammals, and you go ahead and draw little dots on a graph (your data points) and then you draw a line of best fit, which then shows you a positive trend. That line is telling you that for an increase in body mass, you should expect to see an increase in brain size.
Now suppose you later realized that the upper half of the mammals on the graph (the heaviest and brainiest ones) are all part of the same genus, and that the bottom half mammals all belong to a different genus. Well, suddenly you can’t tell if the pattern you are seeing from your line of best fit is due to the relationship between brain size and body mass or if it’s just because members of each of those genera have big brains paired with big bodies and small brains paired with small bodies, and that pairing has nothing to do with the relationship between brain and body size (which in this case means that if you were to add data from different genera to your graph you would stop seeing a relationship between brain and body size). Thus, you may be seeing a pattern that isn’t really reflecting the true relationship between the variables in question because you are violating the assumption of independence. This example is from Felsenstein’s paper, which we will revisit in Section 4.3.

Assumption 3 has to do with variance, and it is at the root of statistical thinking. Suppose you want to see how variable “A” is affected by variable “B” (say blueberry yield and the application of a specific herbicide concentration). The most basic way to think about this is in a signal vs. noise setting. If there is a relationship between “A” and “B” (i.e. a signal exists between blueberry yield and that concentration of that herbicide), it is generally obscured by noise (variance caused by many other things such as rainfall, efficiency of application of the herbicide, blueberry plant health, etc.), and we use statistics to quantify the noise and the signal so we can see how big the signal actually is (i.e. how strong the relationship between “A” and “B” is). Now suppose what you want is to see how variable “A” is affected along differing levels by variable “B” (i.e. how blueberry yield is affected along a gradient of concentrations of this herbicide). Here, there may be differing levels of noise along the gradient (e.g. suppose lower concentrations have smaller variance because at those levels the herbicide doesn’t really do anything to the plants, but that variance increases as concentrations increase because different plants have different tolerances to the herbicide), so you need to ensure you either account for this change in variance in your experimental design or in your model, otherwise you are at a higher risk of making a **Type II error**, which is finding no evidence of a relationship between “A” and “B” when there actually is one (i.e. too much noise for you to find the signal). If your violation of constant variance comes from unequal sample sizes along the factor levels you have chosen, the consequences may be a bit different. Suppose you found low variance in the lower concentrations of herbicide and high variance in high concentrations – there are two scenarios that can arise here. 
1. If you have fewer samples for the low herbicide concentrations (low variance associated with fewer samples) and many more samples of the high herbicide concentrations (high variance associated with a higher number of samples), you are at risk of a **Type II error**.
2. If you have fewer samples for the high herbicide concentrations (higher variance associated with fewer samples) and more samples of the low herbicide concentrations (low variance associated with a higher number of samples), you are now at risk of making a **Type I error**.

There are ways to check how our data meet (or violate) these assumptions when fitting our GAMs, and knowing whether or not you violate them is important because we can then address them methodologically, or at the very least, we can discuss how our conclusions may be affected by these violations (if they exist).

First, we will load the packages required to complete the entire procedure.

```
library(mgcv) # For fitting GAMs
library(gratia) # For cleaner-looking assumption plots
```

Your data for GAMs should be laid out in the “each row is an observation and each column is a variable” format. For instance, suppose you wanted to get information about the inhabitants of a galaxy far far away such as their name, birth year (before the battle of Yavin), weight, height, etc. An intuitive way to lay out those data would be to make every row represent a specific inhabitant and every column a variable about that inhabitant, so that in the end you would have something like this:

```
# This data set is part of the dplyr package
# Displays data about all Star Wars characters
starwars
```

```
# A tibble: 87 × 14
name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
7 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
8 R5-D4 97 32 <NA> white,… red NA none mascu… Tatooi…
9 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
10 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
# vehicles <list>, starships <list>, and abbreviated variable names
# ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
```

Your own data need to be laid out in the same way, where each row is an observation (e.g. individual organisms, or collection sites) and each column is a variable about that observation (e.g. elevation, temperature, humidity, etc.).

We will be using the following Staphylinidae elevation and heat tolerance data from Costa Rica to learn how to run GAMs in R.

```
#______________________________________________________________________________
# Importing the data frame
staph_ctmax_elevation_data <- read.csv(
  here("data",
       "raw",
       "OTU_named_190220.csv")) # Imports the data from working directory
#______________________________________________________________________________
# Displaying the top 10 rows of the data frame
staph_ctmax_elevation_data %>%
  as_tibble() # Just makes this output look nicer
```

```
# A tibble: 57 × 3
bin elevation ctmax
<chr> <int> <dbl>
1 OTU-01 1300 37.5
2 OTU-01 1300 36.5
3 OTU-01 1300 37
4 OTU-01 1300 36.5
5 OTU-01 1300 36.5
6 OTU-01 1300 34
7 OTU-01 1200 36
8 OTU-02 1200 36
9 OTU-03 1200 36
10 OTU-04 1300 39.5
# … with 47 more rows
```

We will be using the `gam()` function from the `mgcv` package to fit our model and the `gam.check()` function from the `mgcv` package to check whether our error terms meet or violate the model assumptions. If you want, you can use the `appraise()` function from the `gratia` package to have `ggplot2` versions of the graphs you get from `gam.check()`. For this example, we can expect a Gaussian distribution of our error terms because CT_{max} is not an integer and can be both positive and negative (theoretically, you could have a species with a CT_{max} of -10°C, who knows!).

```
#______________________________________________________________________________
# Model fitting
staph_ctmax_gam <- gam(
  formula = ctmax ~ s(elevation, # s() specifies variable as smooth (wiggly)
                      k = 5), # Specifies number of basis functions
  data = staph_ctmax_elevation_data, # Data frame with our variables of interest
  family = "gaussian", # Specifies the expected error distribution
  method = "REML" # Algorithm that penalizes over-fitting of the data
)
#______________________________________________________________________________
# Checking assumptions
gam.check(staph_ctmax_gam)
```

```
Method: REML Optimizer: outer newton
full convergence after 5 iterations.
Gradient range [-3.56625e-10,1.450484e-11]
(score 116.2551 & scale 3.236838).
Hessian positive definite, eigenvalue range [0.7384314,27.53047].
Model rank = 5 / 5
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(elevation) 4.00 2.81 0.95 0.28
```

In this output, we see 4 plots and some stuff printed onto the console. The plots will help us check our assumptions, and the console log will tell us:

- If there was convergence in our model fitting (if not, we can’t trust the results)
- The behaviour of the ctmax elevation relationship through the edf (effective degrees of freedom) (e.g. edf = 1 means a linear relationship, edf = 2 quadratic, edf = 3 cubic, and so on)
- A test statistic with a p-value that indicates whether the residuals are randomly distributed or not (constant variance assumption), with a small p-value (< 0.05) indicating non-random distribution of the residuals (violation).

You can often fix a violation of constant variance by increasing the number of basis functions in the `k` argument in the `gam()` function (I don’t know for sure, but it looks like this test is for whether the residuals behave randomly within each basis function rather than among all functions. I say this because of what you will see below).
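As a sketch of what increasing `k` looks like in code (using data simulated by `mgcv`'s own `gamSim()` helper rather than our beetle data, so everything here is self-contained):

```r
library(mgcv) # For gam() and the gamSim() simulation helper

set.seed(1)
sim <- gamSim(eg = 1, n = 200, verbose = FALSE) # Built-in simulated data set

# The same smooth fitted with few vs. many basis functions
gam_k3  <- gam(y ~ s(x2, k = 3),  data = sim, method = "REML")
gam_k20 <- gam(y ~ s(x2, k = 20), data = sim, method = "REML")
```

Running `gam.check()` on each fit shows how k' and the edf change: `k` only sets an upper limit on wiggliness, so the larger value lets the smooth flex more if (and only if) the data ask for it.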

The plots came out looking a bit squished here because they are printed as individual plots in RStudio, but here they got combined into the space of one and I didn’t try to fix it (sorry!). At this point, however, is where I would use the `appraise()` function from `gratia` to have better-looking plots.

```
# Prints model diagnostic plots
appraise(staph_ctmax_gam)
```

We see in Figure 1 that our residuals mostly fall very close to the line in the “QQ plot of residuals”, which indicates that our error terms mostly behave in a Gaussian way. We can see in the “Histogram of residuals” plot that our residuals seem to be left skewed, which accounts for why not all residuals fell within the line in the “QQ plot”. Some statisticians may recommend we transform our data to address this skew, but I will leave it up to you whether to do it or not. I tend to think that if the vast majority of residuals behave according to the distribution we chose (and the ones that don’t are not absurd-looking outliers), we can go ahead with our model and then in the discussion we can talk about the outliers that did not fall perfectly within the expected distribution.

We see in the “residuals vs linear predictors” plot that there is a sideways cone shape to our residuals (looks like >). This indicates that there is no constant variance along the values of y, and that where CT_{max}s are higher (looks like above 40°C), there tends to be less variation than where CT_{max}s are lower (below 40°C).

If we see any pattern with the mean of the residuals vs fits plot (the middle line doesn’t cut the dots in half horizontally), the assumption of independence may be violated. This does not seem to be the case with our data, but we will explore this a bit more in Section 4.2.

So, we know that our model is going to yield results that make sense (we chose the correct error distribution), but that we may be at risk of some type of error because we don’t have constant variance. This being said, it does NOT mean that our data are bad and that modeling them is therefore a fruitless venture. For one, just by trying to get a model going, we have learned that CT_{max} does not vary equally along its range: where we find things with CT_{max}s above 40°C, we tend to find only things with high CT_{max}, but where we find things with CT_{max}s below 40°C, we tend to find all kinds of CT_{max}s, which biologically speaking is an extremely interesting find! All we have to be cautious of is overstating whatever results we find about the relationship and what shape it shows. Because our test statistic in the `gam.check()` function did not indicate that we violated constant variance, we should not worry too much about it (could be because it tests variance within each basis function instead of across the entire gradient, I don’t know), though we should keep it in mind!

Now, we will extract the results from our fitted model.

`summary(staph_ctmax_gam) # function from base R that extracts model results`

```
Family: gaussian
Link function: identity
Formula:
ctmax ~ s(elevation, k = 5)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.9211 0.2383 163.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(elevation) 2.806 3.342 37.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.689 Deviance explained = 70.5%
-REML = 116.26 Scale est. = 3.2368 n = 57
```

This output is broken up into three main parts:

- From the “Family” section to the “Formula” section. Here, we see some descriptors about the model we just fitted.
- From the “Parametric coefficients” section to the “Approximate significance of smooth terms” section. Here we see information about each of the parameters of our model. For the parametric coefficients, we only have an intercept (every model has an intercept – it’s the value of y when x is 0). We see its estimated value (CT_{max} of about 38.9°C when elevation is 0 m.a.s.l.), the standard error associated with that estimate, the test statistic that tested whether the estimate is different from 0, and the associated p-value. If we had other categorical variables in our model, this is where we would see their results. For the smooth terms (our continuous variables of interest, which in this case is only elevation), we see the effective degrees of freedom (which indicate the relationship between elevation and CT_{max} is behaving almost cubically), the Ref.df and F values, which are test statistic values used in an ANOVA that evaluates whether we see evidence of a significant effect of elevation or not, and the corresponding p-value.
- From the “R-sq.” line to the end of the output. Here, we see values that describe how well our model explained the variance in the data, as well as some other values that we don’t have to worry about. If the relationship we are modeling behaved linearly (edf = 1), we should use the “R-sq. (adjusted)” value to see how much variation is explained by our model, whereas if it is behaving in any other, more complex way, we should look at the “deviance explained” value (this is what Wood says in his book I linked above – I’m not totally confident that I understand the difference, so we should do what he recommended). We see that our model explained 70.5% of the variance in the data, which is pretty good!

Now let’s visualize the model using the base R graphical output (it gives us some information that we can’t really get with the `gratia` package, as far as I know).

```
plot(staph_ctmax_gam,
residuals = TRUE, # Shows the residuals
     pch = 1, # Selects the type of point (open circles)
shift = coef(staph_ctmax_gam)[1], # Shifts the scale of y to include the intercept
seWithMean = TRUE, # Adds the error associated with the mean to the se
shade = TRUE, # Makes 95% confidence interval into shaded area
xlab = "Elevation (m)") # Specifies label of x axis
```

We see in Figure 2 evidence that CT_{max} decreases with elevation, but that the variance between the CT_{max} values increases with elevation. This is extremely interesting biologically speaking, because we see evidence that CT_{max} is important for Staphylinidae at lower elevations, but that it may not be the case at higher elevations given that we tend to find a larger range of values there!

Now let’s make the above plot with the `ggplot2` package, which allows for great customization while being intuitive in how it works. We have already loaded `ggplot2` when we loaded the `tidyverse` package in Section 1, so we are good to go. Note that `geom_smooth()` fits its own model behind the scenes (it uses `mgcv`, but not necessarily with the same settings we chose), so you may find some minor discrepancies between the base R plot of our `mgcv` model (Figure 2) and the `ggplot2` plot (Figure 3).

```
ggplot(staph_ctmax_elevation_data, # Data frame that has our data
aes(x = elevation, # Our x variable
y = ctmax)) + # Our y variable
geom_point() + # Choosing to plot our data in the form of points
geom_smooth(method = "gam", # geom_smooth allows for GAM modeling in plot
formula = y ~ s(x, # Specifies formula of our model
k = 5), # Number of basis functions
se = TRUE, # Whether we want the gray 95% CI or not (we do!)
method.args = list(family = "gaussian")) + # Error distribution
labs(x = "Elevation (m)", # Customizing x and y axis labels
y = "CT~max~ (°C)") + # Text between ~~ will be subscript (markdown syntax)
theme_classic() + # Removes gray grid in background and adds lines to axes
theme(axis.title = element_text(size = 12), # Font size for all axis titles
axis.title.y = element_markdown(), # Allows markdown in y axis label
axis.text = element_text(size = 10)) # Font size for all axis text
```

So in conclusion, we found evidence of a significant effect of elevation on Staphylinidae CT_{max}, but because of the lack of constant variance, we should be cautious when claiming certainties about the specifics of the model (e.g. the shape of the relationship). This lack of constant variance could be due to a violation in the independence assumption (here, I’m speaking in biological terms), which we will address in Section 4.

Beta diversity is a measure of the differences between communities (numbers of species and what species are there) in different places (e.g. traps, habitats, field sites, etc.). We will cover a visualization technique that allows us to see those differences in community composition and statistical tests that go along with it.

This is a visualization technique that allows us to see the differences in community composition (what kinds of species you find) between your treatments (levels of your factor of interest). It is not a statistical test, so it should always be accompanied by a test such as PERMANOVA or ANOSIM so that we can say whether the differences we see in the NMDS are statistically significant or not.

First, we will load the packages required to complete the entire procedure.

`library(vegan) # For community analyses`

Let’s also pre-assign a colourblind palette so we can use it for the graphs we will make.

```
#______________________________________________________________________________
# Colour palette
cbpalette <- c(
  "#999999",
  "#E69F00",
  "#56B4E9",
  "#009E73",
  "#F0E442",
  "#0072B2",
  "#D55E00",
  "#CC79A7"
)
```

It is possible that you will have one big Excel file with all your data in it, which is great. This will have to be transformed into at least two separate datasheets, but fear not: this can all be done in R and you don’t have to worry about having several Excel files to keep track of. However, it is also possible you have the data frames already set up for your analyses somewhere, in which case you can skip this step.

Here we will import an example datasheet that might look like yours.

```
#______________________________________________________________________________
# Importing the data frame
community_df <- read_excel(
  here("data",
       "raw",
       "community_df.xlsx")) # Imports the data from working directory
#______________________________________________________________________________
# Displaying the data frame
community_df # Prints the top 10 rows of the data frame if table is massive
```

```
# A tibble: 331 × 16
plot_nu…¹ species abund…² sla leaf_…³ leaf_…⁴ leaf_…⁵ srl root_…⁶ root_…⁷
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 9 Viola_… 10 839. 1.05 0.2 0.00596 133. 0.306 0.175
2 7 Planta… 20 548. 0.256 0.223 0.00817 54.9 0.767 0.173
3 11 Potent… 40 471. 4.28 0.343 0.00619 192. 0.257 0.165
4 3 Andros… 20 450. 0.255 0.36 0.00618 201. 0.442 0.122
5 14 Campan… 10 402. 0.998 0.247 0.0101 197. 0.212 0.176
6 7 Poa_sa… 50 385. 0.347 0.203 0.0128 240. 0.183 0.170
7 17 Orthoc… 60 371. 0.653 0.204 0.0132 84.4 0.333 0.206
8 10 Astrag… 10 359. 2.03 0.237 0.0118 21.0 0.546 0.335
9 9 Draba_… 10 348. 0.600 0.307 0.00938 58.0 0.45 0.227
10 9 Descur… 10 347. 0.886 0.29 0.00993 169. 0.475 0.126
# … with 321 more rows, 6 more variables: habitat <chr>, site <chr>,
# slope <dbl>, aspect <dbl>, slope_position <dbl>, rel_moisture <dbl>, and
# abbreviated variable names ¹plot_number, ²abundance, ³leaf_area,
# ⁴leaf_thickness, ⁵leaf_tissue_dens, ⁶root_tissue_dens, ⁷root_diam
```

As a quick side note, notice how each row represents the abundance of one species caught in a specific plot. Yours might be counts of a species by trap, or individual specimen data, which can easily be turned into a data frame like ours if the metadata is available.
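If you do have individual specimen records, one way to collapse them into per-plot abundances is to count rows; this sketch uses `dplyr` with made-up column names:

```r
library(dplyr)

# Hypothetical specimen-level records: one row per individual caught
specimens <- tibble::tibble(
  plot_number = c(1, 1, 1, 2, 2),
  species     = c("sp_a", "sp_a", "sp_b", "sp_a", "sp_c")
)

# Count individuals of each species in each plot to get abundances
abundances <- specimens %>%
  count(plot_number, species, name = "abundance")
```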

Now that we have the data frame loaded in R, we can start preparing the building blocks required to run the NMDS. First, we will create two data frames: 1) the community matrix, which will contain the number of species collected in each of our collection plots (or traps, or GPS coordinate grids, really depends on your project), and 2) the metadata, containing information about each of the collection plots.

The species matrix can be generated as follows, and should look like this:

```
matrix <- community_df %>%
  select(plot_number, # Selects the relevant columns from our data frame
         species,
         abundance) %>%
  pivot_wider(names_from = species, # Transposes species names to columns
              values_from = abundance,
              values_fill = 0) %>%
  arrange(plot_number) %>% # Orders the plot numbers in ascending fashion
  select(plot_number, # Puts plot_number as the first column
         sort(colnames(.))) # Orders remaining columns alphabetically
#______________________________________________________________________________
# Display the matrix
matrix
```

```
# A tibble: 17 × 77
plot_number Achille…¹ Alliu…² Amela…³ Andro…⁴ Anemo…⁵ Anten…⁶ Anten…⁷ Arnic…⁸
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 0 0 0 0 10 0
2 2 10 10 0 0 0 0 0 0
3 3 30 0 0 20 0 0 20 30
4 5 0 0 0 0 0 0 0 0
5 6 0 0 0 0 0 0 0 0
6 7 0 0 0 0 0 0 0 0
7 8 0 0 0 0 0 0 0 10
8 9 50 0 10 70 0 0 0 0
9 10 40 0 0 60 10 0 0 0
10 11 40 30 0 20 20 0 70 0
11 12 0 20 0 10 20 0 0 0
12 13 40 10 10 10 10 20 0 0
13 14 0 10 10 40 20 0 0 0
14 15 40 10 0 0 0 0 0 0
15 16 30 0 0 0 0 0 0 0
16 17 10 0 0 60 20 30 0 0
17 18 40 0 0 0 0 0 0 0
# … with 68 more variables: Arnica_sororia <dbl>, Artemisia_cana <dbl>,
# Artemisia_frigida <dbl>, Artemisia_ludoviciana <dbl>,
# Aster_ciliolatus <dbl>, Aster_ericoides <dbl>, Aster_falcatus <dbl>,
# Aster_laevis <dbl>, Astragalus_aboriginum <dbl>,
# Astragalus_bisulcatus <dbl>, Astragalus_dasyglottis <dbl>,
# Bouteloua_gracilis <dbl>, Bromus_inermis <dbl>,
# Campanula_rotundifolia <dbl>, Carex_filifolia <dbl>, …
```

The community metadata can be generated as follows, assuming each plot row has a unique value for each of the variables. It’s possible you have several measurements of some abiotic variable, in which case you’d need to average them so that you only have one value per plot number. Your result should look like the table printed below.

```
metadata <- community_df %>%
  group_by(plot_number) %>% # Applies whatever comes next to each level of plot_number
  distinct(habitat, # Selects one of each distinct value of the below variables
           site,
           slope,
           aspect,
           slope_position,
           rel_moisture) %>%
  arrange(plot_number) %>% # Orders by plot_number in ascending fashion
  ungroup() # Makes whole table independent from group_by variable, important later
#______________________________________________________________________________
# Display the metadata sheet
metadata # Prints the entire metadata table
```

```
# A tibble: 17 × 7
plot_number habitat site slope aspect slope_position rel_moisture
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 Mixedgrass Onefour 0 270 3 1
2 2 Mixedgrass Onefour 20 130 1.5 2
3 3 Mixedgrass Onefour 5 90 1 2
4 5 Mixedgrass Onefour 5 130 2 1
5 6 Mixedgrass Onefour 1 90 3 1
6 7 Mixedgrass Onefour 8 0 2 2
7 8 Fescue Kinsella 5 100 3 2
8 9 Fescue Kinsella 5 160 1.5 3
9 10 Fescue Kinsella 18 155 2.5 2
10 11 Fescue Kinsella 12 180 2 1
11 12 Fescue Kinsella 5 164 3 3
12 13 Fescue Kinsella 15 220 2.5 3
13 14 Fescue Kinsella 3 160 3 3
14 15 Fescue Kinsella 5 160 3 3
15 16 Fescue Kinsella 6 150 1.5 3
16 17 Fescue Kinsella 4 140 2.5 2
17 18 <NA> <NA> NA NA NA NA
```
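If you do have repeated measurements of an abiotic variable, a sketch of averaging them down to one value per plot (using a hypothetical table with two moisture readings per plot) might look like:

```r
library(dplyr)

# Hypothetical repeated readings: two moisture measurements per plot
readings <- tibble::tibble(
  plot_number  = c(1, 1, 2, 2),
  rel_moisture = c(1, 2, 3, 3)
)

# Average down to a single value per plot
plot_moisture <- readings %>%
  group_by(plot_number) %>%
  summarise(rel_moisture = mean(rel_moisture))
```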

Now we are ready to run the NMDS.

To run the NMDS, we will use the `metaMDS()` function from the `vegan` package.

```
#______________________________________________________________________________
# Setting seed to ensure reproducible results since nmds is a permutational
# procedure. This makes it so that the random numbers picked to generate these
# results will always be the same. Number you pick can be anything.
set.seed(2023)
#______________________________________________________________________________
# NMDS
community_nmds <- metaMDS(matrix %>%
                            select(!1)) # Omits column 1 (plot_number)
```

```
Square root transformation
Wisconsin double standardization
Run 0 stress 0.06758118
Run 1 stress 0.07097971
Run 2 stress 0.06482428
... New best solution
... Procrustes: rmse 0.07324737 max resid 0.2347399
Run 3 stress 0.0675814
Run 4 stress 0.06467223
... New best solution
... Procrustes: rmse 0.01147297 max resid 0.0343962
Run 5 stress 0.067413
Run 6 stress 0.0674132
Run 7 stress 0.06482413
... Procrustes: rmse 0.01147154 max resid 0.03440339
Run 8 stress 0.0653533
Run 9 stress 0.06787984
Run 10 stress 0.06756966
Run 11 stress 0.06787951
Run 12 stress 0.06535327
Run 13 stress 0.06741359
Run 14 stress 0.06758146
Run 15 stress 0.06467203
... New best solution
... Procrustes: rmse 0.0003099958 max resid 0.0009114333
... Similar to previous best
Run 16 stress 0.06788061
Run 17 stress 0.06467214
... Procrustes: rmse 0.0002691596 max resid 0.0007917589
... Similar to previous best
Run 18 stress 0.07051754
Run 19 stress 0.06482403
... Procrustes: rmse 0.01154534 max resid 0.0353847
Run 20 stress 0.06600201
*** Best solution repeated 2 times
```

This function iteratively attempts to condense the multidimensional data frame (many columns with many rows to be compared) into a two-dimensional set of points (essentially x and y coordinates) while staying as true as possible to the original differences between each point of the data frame. That is what all these “runs” in the output above are. You can imagine that the more you force a multidimensional set of differences into two dimensions (a regular graph), the more stress you create. In fact, this is a measure you can get! Stress measures how well the NMDS plot represents the differences in our original data set. We can plot what is called a stress plot of our NMDS to see how much stress our NMDS had when condensing those differences.

`stressplot(community_nmds) # Generates the stress plot`

Visually, what you want to see is this “staircase” style graph (Figure 4). We can also get the stress value directly by extracting it from our `metaMDS()` results with `community_nmds$stress`, which in our case is 0.065. According to Kruskal, you want the stress value to be less than 0.2 (low stress). Often when the stress is too high, `metaMDS()` will fail to converge, and at the end of the iteration output above you will see something like “failed to converge”. Conversely, when the stress is too low (too close to 0), it will also fail to converge. When that is the case, you can play around with reducing or increasing the size of your plots (pooling them or splitting them apart tends to work). Since our example converged nicely, we can plot our results.
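Another lever worth knowing about (not used in the text above) is `metaMDS()`'s own arguments: `k` sets the number of dimensions and `trymax` the number of random starts. Here is a sketch using `vegan`'s built-in `dune` dataset so it runs on its own:

```r
library(vegan) # For community analyses

data(dune) # Example community matrix shipped with vegan

set.seed(2023)
# If a 2-D solution struggles to converge, allow more random starts
# (trymax) or an extra dimension (k); the values here are illustrative
nmds_3d <- metaMDS(dune, k = 3, trymax = 50)
nmds_3d$stress # Stress generally drops as k increases
```

Keep in mind that a 3-D solution is harder to plot and interpret, which is why two dimensions remain the default.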

The first thing we need to do before plotting with `ggplot2` is create a data frame that combines the results from our `metaMDS()` function with our metadata data frame.

```
data_scores <- as.data.frame(community_nmds$points) %>% # Extracts NMDS plotting points
  mutate(plot_number = metadata$plot_number, # Attaches columns from metadata table
         habitat = metadata$habitat)
```

Now let’s view the table.

`data_scores # Prints the table`

```
MDS1 MDS2 plot_number habitat
1 -1.0309795 0.2629909 1 Mixedgrass
2 -0.7136177 -0.3885305 2 Mixedgrass
3 -0.6773968 -0.5183459 3 Mixedgrass
4 -1.6868373 0.3290805 5 Mixedgrass
5 -1.5790633 -0.2702064 6 Mixedgrass
6 -1.2286524 -0.0570536 7 Mixedgrass
7 -1.0491372 0.1059923 8 Fescue
8 0.9419983 0.4102748 9 Fescue
9 0.7731372 0.1122020 10 Fescue
10 0.2347052 -0.1745242 11 Fescue
11 0.1459202 0.3860476 12 Fescue
12 0.8159381 0.3035745 13 Fescue
13 0.6712101 0.1708772 14 Fescue
14 1.0346944 -0.3985301 15 Fescue
15 1.2593688 -0.3332723 16 Fescue
16 0.6240574 0.3493586 17 Fescue
17 1.4646544 -0.2899354 18 <NA>
```

This next step is optional, but I like it, so I’m including it here. In it, we generate a data frame that contains the coordinates of the centroids for each group, so when we make the plot, we can make the centroids stand out from the other data.

```
centroid <- data_scores %>%
  group_by(habitat) %>%
  summarize(MDS1 = mean(MDS1),
            MDS2 = mean(MDS2)) %>%
  drop_na(habitat)
```

At this point, we are ready to make a `ggplot2` plot of our NMDS with our main variable of interest. As you might have guessed from the last table we created, we are interested in how the community of species differs between two habitats: mixed grass and fescue. So let’s see what that looks like!

```
ggplot(data = data_scores %>%
         drop_na(habitat), # Removes rows with missing metadata
       aes(x = MDS1,
           y = MDS2,
           colour = habitat)) +
  stat_ellipse(level = 0.95, # Specifies the confidence interval of the ellipsoids
               show.legend = FALSE) +
  # coord_fixed() + # Optional. Makes x and y units change at same rate
  geom_point(size = 2) +
  geom_point(data = centroid,
             aes(fill = habitat),
             size = 5,
             shape = 21,
             colour = "black",
             show.legend = FALSE) +
  labs(colour = "Habitat") +
  scale_colour_manual(values = cbpalette) +
  scale_fill_manual(values = cbpalette) +
  theme_classic() + # Gets rid of grey background grid and adds lines to axes
  theme(text = element_text(size = 12),
        axis.text.x = element_text(size = 10),
        legend.position = "right")
```
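If you want to keep the figure, `ggplot2`’s `ggsave()` saves a plot to disk. The filename and dimensions below are just examples, and the small `mtcars` plot only stands in for the NMDS figure so the sketch runs on its own:

```r
library(ggplot2)

# Stand-in plot; in practice this would be the NMDS figure above
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Saves the plot; adjust path, size, and resolution to taste
ggsave("nmds_example.png", plot = p,
       width = 16, height = 10, units = "cm", dpi = 300)
```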