Ordination in R

 

This page covers the R functions to perform multivariate ordination, including some methods beyond the default biplots for displaying your results graphically.  Some of these methods will use functions in the vegan package, which you should load and install (see here if you haven’t loaded packages before).  The following four major ordination methods are covered:

1.    Principal Components Analysis (PCA)

2.    Correspondence Analysis (CA) and Detrended Correspondence Analysis (DCA)

3.    Principal Coordinates Analysis (PCoA)

4.    Non-Metric Multidimensional Scaling (NMDS).

 

Principal Components Analysis (PCA)

 

This function is part of the basic R installation, so you won’t need the vegan package yet.  Before you run the function, you will need to have a data frame containing only numerical data (there can be row names).  The default arrangement is to have the samples (or sites) in rows and the measured variables (or counts) in columns.  You can transpose a data frame (or matrix) – swap the rows to columns and vice versa – using the transpose function t(…):

 

transposed.frame<-t(dataframe) #transposes data frame so rows become columns and vice versa

 

The default setting for the PCA prcomp(…) function performs ordination using the covariance matrix, for example:

 

PCA.res<-prcomp(data.frame) #where data.frame is the name of the data frame variable

 

If you want to use the correlation matrix (for example, when the variables are measured in different units or when differences in their variances are not important for interpretation), specify:

 

PCA.res<-prcomp(data.frame, scale.=T)

 

You should store the results as a variable, an object of class “prcomp” (don’t worry about what “class” means) that contains specific items.  Most of the time you will want to plot the results manually using the values provided, to customize the sample colors and other parameters, but to get a quick plot you can use the biplot(…) command:

 

biplot(PCA.res)

 

 

The first two (most important) components are plotted, samples are shown in black and the loadings of each variable are shown by the red arrows.  There are two sets of axis scales; the ones on the right and top correspond to the axis scores for samples, and the bottom and left axes correspond to the loadings of the variables.  Note that it uses the sample labels for plot points, which is usually unwieldy.  If you have a lot of variables, you will also get a giant mess of loadings like in the plot above.  Because of that, it is often better to produce a custom plot.  That will allow you to customize other characteristics – grouping samples into categories by color of the points, for example.

 

The PCA results object contains the following variables:

 

names(PCA.res)

[1] "sdev"     "rotation" "center"   "scale"    "x"  

 

“Center” and “scale” simply keep track of your choices when running the function, so “sdev,” “rotation,” and “x” are the important data.  The “sdev” variable contains the standard deviation for each principal component axis.  Because it is more traditional to talk about variance explained (rather than standard deviation), you can easily convert sdev to variance (variance is standard deviation squared) or proportion of variance.

 

You can also assess the proportion of variance explained by asking for the summary of the PCA results:

 

summary(PCA.res)

Importance of components:

                         PC1   PC2   PC3    PC4    PC5

Standard deviation     0.175 0.142 0.115 0.0924 0.0845

Proportion of Variance 0.312 0.205 0.135 0.0871 0.0728

Cumulative Proportion  0.312 0.517 0.652 0.7395 0.8123

 

The output is truncated to only include the first five components.

 

You can also use “sdev” to produce a scree plot to help decide how many components to interpret:

 

 

The variable somewhat cryptically named “x” actually contains the axis scores for each sample on each principal component.  The variable named “rotation” contains the values for the loadings, which can be used to add arrows for loadings on the biplot.

 

Even though PCA is not an appropriate ordination method for this ecological data set, here is an example of the type of biplot you can make by custom plotting.  For more information on adding arrows, text labels, and legends to plots, and information on how to vary colors, see the graphing page.  I also only plotted the ten largest loadings (the ten most important variables for these two components) for clarity.

 

 

Correspondence and Detrended Correspondence Analysis (DCA)

 

This function is in the vegan package and is called “decorana” and it requires a matrix-like object (data frame) with samples in rows and species in columns (DCA is typically performed on taxonomic count data, but could be applied to other types of data).  The basic function call is simple, but you should keep in mind the default settings:

 

DCA.res<-decorana(dataframe) #performs DCA

 

The default settings are to 1) perform the detrending/rescaling procedure (ira=0), 2) not downweight rare taxa (iweigh=0), 3) perform four rescaling cycles (iresc=4), and 4) use 26 segments for detrending/rescaling (mk=26).  You generally won’t want to change the rescaling cycles or segment number, but the first two parameters can be adjusted.  The function only returns the first four axes.

 

DCA.res<-decorana(dataframe,ira=1) #performs CA (actually reciprocal averaging)

 

DCA.res<-decorana(dataframe,iweigh=1) #performs DCA and downweights rare taxa

 

 

The result variable is an object containing eigenvalues, sample and species scores, etc.  There are some things to beware, however!

 

The eigenvalues are reported in the summary (summary(DCA.res)) and can be obtained from DCA.res$evals.  In CA, these eigenvalues represent the inertia of each axis and are conceptually analogous to the “variance explained” in PCA.  Total inertia is not given, but it is simply based on the chi-squared statistic for the whole data frame:

 

total.inertia<-chisq.test(dataframe/sum(dataframe))$statistic

 

You can easily calculate the proportion of inertia for each axis.  In DCA, however, the returned eigenvalues do not sum to total inertia and don’t actually have a lot of meaning because of the detrending/rescaling process.

 

 

Likewise, the variables DCA.res$rproj and DCA.res$cproj seem to contain sample and species axis scores (which you could use for plotting), but this is only true for CA (ira=1).  In DCA, the detrending/rescaling procedure distorts the scores, so you need to use a different method.

 

scores(DCA.res,display="sites") #extracts sample/site axis scores

 

scores(DCA.res,display="species") #extracts variable/species axis scores

 

scores(DCA.res,display="sites",choice=1:2) #extracts sample/site scores for axes 1 and 2 only

 

scores(DCA.res,display="species",choice=2:3) #extracts sample/site scores for axes 2 and 3 only

 

 

To plot the results, you can just use the default plot(…) command on the results (I added the titles), which plots the sites as open circles and the variables/species as red crosses, but without labels, as shown below.

 

 

 

For more customization, you can extract the scores (as explained above) and plot manually as done in the PCA section.  You can also use the display=c("sites","species") option combined with points(…) or text(…) to add points or labels.  For example:

 

plot(CA.res,type="n",main="Correspondence Analysis") #creates empty plot(type=”n”)

 

text(CA.res,display="species",col="red",cex=0.6) #adds red labels for species

 

points(CA.res,display="sites",pch=16) #adds filled circles for sites

 

 

Note that the taxon names plot on top of one another, resulting in a jumbled mess.  Producing a publication-quality plot here would take some more detailed work.

 

 

Principal Coordinates Analysis (PCoA)

 

PCoA is a distance-based ordination method that can be performed via the capscale(…) function in vegan.  You can also use cmdscale(…) in the base installation, but you will need to produce a distance matrix from the original data (see the cluster analysis page for more information on distances).  PCoA is also called classical multidimensional scaling or metric multidimensional scaling.  The major benefit of PCoA is the ability to choose a different distance measure – when Euclidean distance is used, PCoA is the same as PCA.

 

The capscale(…) function is designed for another purpose, so the syntax is a bit different than the other ordination methods, but it can be used to perform PCoA:

 

PCoA.res<-capscale(dataframe~1,distance="bray") #must specify dataframe~1 (where dataframe is the sample/variable data matrix) to perform PCoA

#must specify a distance from distances provided in vegdist(…)

 

The vegdist(…) function has more distances, including some more applicable to (paleo)ecological data:

 

Distances available in vegdist(…) are: "manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup" , "binomial" or "chao" and the default is “bray” or Bray-Curtis.

 

You can see the eigenvalues (including proportion contribution to distance explained) and the species and sample scores using:

 

summary(PCoA.res)

 

Scores can be extracted in the same fashion as DCA:

 

scores(PCoA.res,display=”sites”) #or display=”species”

 

The default plot(PCoA.res) doesn’t include labels so you can either add points(…) and text(…) as outlined above, or extract the scores and plot manually:

 

 

Notice that the PCoA plot looks similar to the PCA plot (although rotated), because both enforce a linear relationship that is not appropriate for ecological data like this.  I’m not sure why the species scores are so tightly clustered around the origin.

 

 

Non-Metric Multidimensional Scaling (NMDS)

 

The default syntax to perform NMDS in the vegan package is simple – the function is metaMDS(…), but there are some defaults that may or may not be what you want.  Make sure to store the results as a variable.

 

1.    The default distance is Bray-Curtis, but any distance available in vegdist(…) can be used.

 

For example, metaMDS(dataframe,distance=”kulczynski”)

 

2.    If some counts are large, the default is to perform a square-root transformation.  A Wisconsin-style double transformation (normalizing taxa to percent abundance and then normalizing abundances to the maximum for each species) is also the default.  To turn off these defaults, use:

 

metaMDS(dataframe,autotransform=F)

 

3.    Axes are meaningless in NMDS (because it is not an eigenanalysis ordination), so the final result is rotated using PCA to make axis 1 (on the plot) correspond to the greatest variance among the NMDS sample points.  This doesn’t change interpretation, cannot be modified, and is a good idea, but you should be aware of it.

 

4.    Choice of dimensionality is important in NMDS (and must be user-specified) because NMDS at lower dimensions is not a projection of the higher-dimension solution.  The default is to ordinate the points in two dimensions (k=2), but you may want to look at three dimensions or more.

 

metaMDS(dataframe,autotransform=F,k=3)

 

A plot of stress (a measure of goodness-of-fit) vs. dimensionality can be used to assess the proper choice of dimensions, in much the same way as you could use a scree plot in PCA.

 

Here is a function that produces a stress vs. dimensionality plot:

 

NMDS.scree<-function(x) { #where x is the name of the data frame variable

plot(rep(1,10),replicate(10,metaMDS(x,autotransform=F,k=1)$stress/100),xlim=c(1,nrow(x)),ylim=c(0,0.5),xlab="# of Dimensions",ylab="Stress",main=”NMDS stress plot”)

 

for (i in 1:(nrow(x)-2)) {

points(rep(i+1,10),replicate(10,metaMDS(x,autotransform=F,k=i+1)$stress/100))

}

}

 

Ideally, there will be an inflection point (an “elbow”) that you can use to choose the number of dimensions (usually k+1 if the inflection is at point k).  In practice, as in this case, it is often less clear – should you choose three dimensions or five?  The stress values themselves (the metaMDS results are stress*100) can be used as an indicator.  Stress values >0.2 are generally poor and potentially uninterpretable, whereas values <0.1 are good and <0.05 are excellent, leaving little danger of misinterpretation.  Stress values between 0.1 and 0.2 are useable but some of the distances will be misleading.

 

 

The metaMDS ordination results contain sample/site axis scores (NMDS.res$points) and variable/taxon scores (NMDS.res$species) that you can use to make a custom plot (as with previous ordinations, the default plot has species marked by red crosses and samples by open circles, but both are unlabeled).