Multivariate statistics

Principal components analysis

Typical application: Reduction and interpretation of large multivariate data sets with some underlying linear structure
Assumptions: Debated
Data needed: Two or more rows of measured data with three or more variables

Principal components analysis (PCA) is a procedure for finding hypothetical variables (components) which account for as much of the variance in your multidimensional data as possible (Davis 1986, Harper 1999). These new variables are linear combinations of the original variables. PCA has several applications; two of them are:

• Simple reduction of the data set to only two variables (the two most important components), for plotting and clustering purposes.
• More interestingly, you might try to hypothesize that the most important components are correlated with some other underlying variables. For morphometric data, this might be simply age, while for associations it might be a physical or chemical gradient (e.g. latitude or position across the shelf).

The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix. Choose var-covar if all your variables are measured in the same units (e.g. centimetres). Choose correlation (normalized var-covar) if your variables are measured in different units; this implies normalizing all variables using division by their standard deviations. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (components), are given for all components. The percentages of variance accounted for by these components are also given. If most of the variance is accounted for by the first one or two components, you have scored a success, but if the variance is spread more or less evenly among the components, the PCA has in a sense not been very successful.
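The relation between the two matrices can be illustrated with a minimal sketch in plain Python. The function names and the example data are hypothetical and this is not PAST's own code; it only shows that the correlation matrix is the variance-covariance matrix of variables divided by their standard deviations, and how percentages of variance follow from a list of eigenvalues.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample variance-covariance matrix (n-1 denominator); rows x variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def standardize(rows):
    """Centre each column and divide it by its sample standard deviation."""
    n, p = len(rows), len(rows[0])
    cov = covariance_matrix(rows)
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [sqrt(cov[j][j]) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(rows):
    """Correlation matrix = covariance matrix of the standardized variables."""
    return covariance_matrix(standardize(rows))

def percent_variance(eigenvalues):
    """Percentage of the total variance accounted for by each component."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

# Hypothetical data in different units: length (cm) and weight (g).
data = [[1.0, 200.0], [2.0, 410.0], [3.0, 590.0], [4.0, 800.0]]
corr = correlation_matrix(data)
print(corr)                                  # diagonal is approximately 1
print(percent_variance([3.0, 0.75, 0.25]))   # eigenvalues are made up
```

With the var-covar matrix, the weight column would dominate simply because its numbers are larger; after standardization both variables contribute on the same scale.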

The Jolliffe cut-off value gives an informal indication of how many principal components should be considered significant (Jolliffe, 1986). Components with eigenvalues smaller than the Jolliffe cut-off may be considered insignificant, but too much weight should not be put on this criterion.

Row-wise bootstrapping is carried out if a non-zero number of bootstrap replicates (e.g. 1000) is given in the 'Boot N' box. The bootstrapped components are re-ordered and reversed according to Peres-Neto et al. (2003) to ensure correspondence with the original axes. 95% bootstrapped confidence intervals are given for the eigenvalues.

The 'Scree plot' (simple plot of eigenvalues) can also be used to indicate the number of significant components. After this curve starts to flatten out, the corresponding components may be regarded as insignificant. 95% confidence intervals are shown if bootstrapping has been carried out. The eigenvalues expected under a random model (Broken Stick) are optionally plotted - eigenvalues under this curve represent non-significant components (Jackson 1993).
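As a rough illustration of the Broken Stick comparison, the sketch below uses the usual broken-stick expectation for component k out of p, (1/p) times the sum of 1/i for i = k..p, expressed as a percentage. The function names are hypothetical, the eigenvalues are made up, and stopping at the first component that falls under the curve is an assumption of this sketch rather than a documented PAST rule.

```python
def broken_stick(p):
    """Expected percentage of variance for each of p components under the
    broken-stick random model: b_k = (1/p) * sum_{i=k}^{p} 1/i."""
    return [100.0 / p * sum(1.0 / i for i in range(k, p + 1))
            for k in range(1, p + 1)]

def components_above_broken_stick(eigenvalues):
    """Count leading components whose percentage of variance exceeds the
    broken-stick expectation (assumption: stop at the first that does not)."""
    total = sum(eigenvalues)
    count = 0
    for ev, expected in zip(eigenvalues, broken_stick(len(eigenvalues))):
        if 100.0 * ev / total <= expected:
            break
        count += 1
    return count

eigenvalues = [6.0, 2.5, 1.0, 0.5]   # hypothetical, sorted in descending order
print(broken_stick(len(eigenvalues)))
print(components_above_broken_stick(eigenvalues))
```

Here the first component accounts for 60% of the variance, which is above its broken-stick expectation, while the second (25%) falls under the curve.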

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the two most important components. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours. You can also plot the Minimal Spanning Tree, which is the shortest possible set of connected lines connecting all points. This may be used as a visual aid in grouping close points. The MST is based on a Euclidean distance measure of the original data points, so it is most meaningful when all your variables use the same unit. The 'Biplot' option will show a projection of the original axes (variables) onto the scattergram. This is another visualisation of the PCA loadings (coefficients) - see below.
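A Minimal Spanning Tree on Euclidean distances can be sketched with Prim's algorithm. This is a generic illustration, not necessarily the algorithm PAST uses internally; the function names and the example points are hypothetical.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def minimal_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns one (i, j) index pair per tree edge."""
    n = len(points)
    if n == 0:
        return []
    # best[j] = (distance from point j to the tree, nearest tree point)
    best = {j: (dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda idx: best[idx][0])  # closest outside point
        _, i = best.pop(j)
        edges.append((i, j))
        for m in best:  # point j may now be the nearest tree point for m
            d_new = dist(points[j], points[m])
            if d_new < best[m][0]:
                best[m] = (d_new, j)
    return edges

def tree_length(points, edges):
    """Total length of the lines in the tree."""
    return sum(dist(points[i], points[j]) for i, j in edges)

points = [(0, 0), (1, 0), (2, 0), (2, 1)]   # hypothetical data points
edges = minimal_spanning_tree(points)
print(edges, tree_length(points, edges))
```

Because the distances are computed on the original variables, a variable measured in much larger units would dominate the tree, which is why equal units are recommended above.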

If the "Eigenval scale" is ticked, the data points will be scaled by 1/sqrt(dk), and the biplot eigenvectors by sqrt(dk) - this is the correlation biplot of Legendre & Legendre (1998). If not ticked, the data points are not scaled, while the biplot eigenvectors are normalized to equal length (but not to unity, for graphical reasons) - this is the distance biplot.

The 'View loadings' option shows to what degree your different original variables (given in the original order along the x axis) enter into the different components (as chosen in the radio button panel). These component loadings are important when you try to interpret the 'meaning' of the components. The 'Coefficients' option gives the PC coefficients, while 'Correlation' gives the correlation between a variable and the PC scores. Do not use the latter if you are doing PCA on the correlation matrix. If bootstrapping has been carried out, 95% confidence intervals are shown (only for the Coefficients option).

The 'SVD' option will enforce use of the superior Singular Value Decomposition algorithm instead of "classical" eigenanalysis. The two algorithms will normally give almost identical results, but axes may be flipped.

For the 'Shape PCA' and 'Shape deform' options, see the section on Geometrical Analysis.

Bruton & Owen (1988) describe a typical morphometrical application of PCA.

Missing data is supported by column average substitution.
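Column average substitution can be sketched as follows. This is a minimal illustration that assumes missing values are represented by None in Python (the spreadsheet coding is different); the function name is hypothetical.

```python
def substitute_column_means(rows):
    """Replace each missing entry (None) with the mean of the observed
    values in the same column. Assumes every column has at least one
    observed value."""
    p = len(rows[0])
    means = []
    for j in range(p):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(p)]
            for r in rows]

data = [[1.0, None], [3.0, 4.0], [None, 8.0]]   # hypothetical data
print(substitute_column_means(data))
```

Note that substituted values lie at the column mean, so they add nothing to the variance of that column; many missing values will therefore pull the ordination towards the origin for the affected variables.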

Principal coordinates

Typical application: Reduction and interpretation of large multivariate data sets with some underlying linear structure
Assumptions: Unknown
Data needed: Two or more rows of measured or counted data with three or more variables, or a symmetric similarity or distance matrix

Principal coordinates analysis (PCO) is another ordination method, somewhat similar to PCA. It is also known as Metric Multidimensional Scaling (different from Non-metric Multidimensional Scaling!). The algorithm is taken from Davis (1986).

The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances or similarities between all data points. The Gower measure will normally be used instead of Euclidean distance; the Euclidean distance gives results similar to PCA. An additional eleven distance measures are available - these are explained under Cluster Analysis. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (coordinates), are given for the first four most important coordinates (or fewer if there are fewer than four data points). The percentages of variance accounted for by these coordinates are also given.

The similarity/distance values are raised to the power of c (the "Transformation exponent") before eigenanalysis. The standard value is c=2. Higher values (4 or 6) may decrease the "horseshoe" effect (Podani & Miklos 2002).

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the PCO. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours. The "Eigenvalue scaling" option scales each axis using the square root of the eigenvalue (recommended). The minimal spanning tree option is based on the selected similarity or distance index in the original space.

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho or user-defined indices).

Non-metric multidimensional scaling

Typical application: Reduction and interpretation of large multivariate data sets
Assumptions: None
Data needed: Two or more rows of measured, counted or presence/absence data with two or more variables, or a symmetric similarity or distance matrix.

Non-metric multidimensional scaling is based on a distance matrix computed with any of 13 supported distance measures, as explained under Cluster Analysis below. The algorithm then attempts to place the data points in a two- or three-dimensional coordinate system such that the ranked distances are preserved. For example, if the original distance between points 4 and 7 is the ninth largest of all distances between any two points, points 4 and 7 will ideally be placed such that their Euclidean distance in the 2D plane or 3D space is still the ninth largest. Non-metric multidimensional scaling intentionally does not take absolute distances into account.

The program may converge on a different solution in each run, depending upon the random initial conditions. Each run is actually a sequence of 11 trials, from which the one with smallest stress is chosen. One of these trials uses PCO as the initial condition, but this rarely gives the best solution. The solution is automatically rotated to the major axes (2D and 3D).

The algorithm implemented in PAST, which seems to work very well, is based on a new approach developed by Taguchi & Oono (in press).

The minimal spanning tree option is based on the selected similarity or distance index in the original space.

Shepard plot: This plot of obtained versus observed (target) ranks indicates the quality of the result. Ideally, all points should be placed on a straight ascending line (x=y).
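The rank comparison behind the Shepard plot can be sketched as below. This is only an illustration of the idea, with hypothetical function names and data; it assumes that tied distances share their average rank, which may differ from how PAST treats ties.

```python
from math import dist
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances for pairs (0,1), (0,2), ..., in that order."""
    return [dist(a, b) for a, b in combinations(points, 2)]

def ranks(values):
    """Rank from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result

def shepard_pairs(original_distances, ordination_points):
    """(target rank, obtained rank) for every pair of points."""
    target = ranks(original_distances)
    obtained = ranks(pairwise_distances(ordination_points))
    return list(zip(target, obtained))

# Hypothetical original distances for pairs (0,1), (0,2), (1,2)
# and a hypothetical 2D ordination of the same three points.
original = [1.0, 5.0, 3.0]
ordination = [(0, 0), (1, 0), (10, 0)]
print(shepard_pairs(original, ordination))
```

In this example the absolute distances differ between the original space and the ordination, but every pair has the same rank in both, so all points would fall on the line x=y.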

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Correspondence analysis

Typical application: Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients
Assumptions: Unknown
Data needed: Two or more rows of counted data in three or more compartments

Correspondence analysis (CA) is yet another ordination method, somewhat similar to PCA but for counted data. For comparing associations (columns) containing counts of taxa, or counted taxa (rows) across associations, CA is the more appropriate algorithm. Also, CA is more suitable if you expect that species have unimodal responses to the underlying parameters, that is, they favour a certain range of the parameter, becoming rare for lower and higher values (this is in contrast to PCA, which assumes a linear response).

The CA routine finds the eigenvalues and eigenvectors of a matrix containing the Chi-squared distances between all data points. The eigenvalue, giving a measure of the similarity accounted for by the corresponding eigenvector, is given for each eigenvector. The percentages of similarity accounted for by these components are also given.
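For orientation, the Chi-squared distance between two rows of a table of counts can be sketched as below, using the common textbook definition in which row profiles (counts divided by row totals) are compared and each column is weighted by the inverse of its share of the grand total. Whether PAST computes it in exactly this form is an assumption; the function name and data are hypothetical.

```python
from math import sqrt

def chi_squared_distance(table, i, j):
    """Chi-squared distance between rows i and j of a table of counts.
    Assumes no all-zero rows or columns."""
    grand_total = sum(sum(row) for row in table)
    total_i, total_j = sum(table[i]), sum(table[j])
    distance_sq = 0.0
    for k in range(len(table[0])):
        column_mass = sum(row[k] for row in table) / grand_total
        diff = table[i][k] / total_i - table[j][k] / total_j
        distance_sq += diff * diff / column_mass
    return sqrt(distance_sq)

counts = [[1, 2, 3], [2, 4, 6], [3, 0, 0]]   # hypothetical counts
print(chi_squared_distance(counts, 0, 1))    # proportional rows
print(chi_squared_distance(counts, 0, 2))
```

Rows 0 and 1 have the same relative composition, so their distance is zero even though the absolute counts differ; this is why the measure suits count data with unequal sample sizes.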

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the CA. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours.

In addition, the variables (columns, associations) can be plotted in the same coordinate system (Q mode), optionally including the column labels. If your data are 'well behaved', taxa typical for an association should plot in the vicinity of that association.

PAST presently uses a symmetric scaling ("Benzecri scaling").

If you have more than two columns in your data set, you can choose to view a scatter plot on the second and third axes.

Relay plot: This is a composite diagram with one plot per column. The plots are ordered according to CA column scores. Each data point is plotted with CA first-axis row scores on the vertical axis, and the original data point value (abundance) in the given column on the horizontal axis. This may be most useful when samples are in rows and taxa in columns. The relay plot will then show the taxa ordered according to their positions along the gradients, and for each taxon the corresponding plot should ideally show a unimodal peak, partly overlapping with the peak of the next taxon along the gradient (see Hennebert & Lees 1991 for an example from sedimentology).

Missing data is supported by column average substitution.

Detrended correspondence analysis

Typical application: Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients
Assumptions: Unknown
Data needed: Two or more rows of counted data in three or more compartments

The Detrended Correspondence Analysis (DCA) module uses the same algorithm as Decorana (Hill & Gauch 1980), with modifications according to Oksanen & Minchin (1997). It is specialized for use on 'ecological' data sets with abundance data; samples in rows, taxa in columns (vice versa prior to v. 1.79). When the 'Detrending' option is switched off, a basic Reciprocal Averaging will be carried out. The result should be similar to Correspondence Analysis (see above) plotted on the first and second axes.

Eigenvalues for the first three ordination axes are given as in CA, indicating their relative importance in explaining the spread in the data.

Detrending is a sort of normalization procedure in two steps. The first step involves an attempt to 'straighten out' points lying in an arch, which is a common occurrence. The second step involves 'spreading out' the points to avoid clustering of the points at the edges of the plot. Detrending may seem an arbitrary procedure, but can be a useful aid in interpretation.

Missing data is supported by column average substitution.

Canonical correspondence analysis

Typical application: Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients
Assumptions: Unknown
Data needed: Two or more rows of sites, with taxa (species) in columns. The first columns contain environmental variables.

Canonical Correspondence Analysis (Legendre & Legendre 1998) is correspondence analysis of a site/species matrix where each site has given values for one or more environmental variables (temperature, depth, grain size etc.). The ordination axes are linear combinations of the environmental variables. CCA is thus an example of direct gradient analysis, where the gradient in environmental variables is known a priori and the species abundances (or presence/absences) are considered to be a response to this gradient.

The implementation in PAST follows the eigenanalysis algorithm given in Legendre & Legendre (1998). The ordinations are given as site scores - fitted site scores are presently not available. Environmental variables are plotted as correlations with site scores. Both scalings (type 1 and 2) of Legendre & Legendre (1998) are available. Scaling 2 emphasizes relationships between species.

Two-block Partial Least Squares (PLS)

Typical application: Studying the structure of covariation between two sets of variates on the same rows
Assumptions: None
Data needed: Two or more rows of multivariate continuous data. The columns should be first all variates of first block, then all variates of second block.

Two-block Partial Least Squares can be seen as an ordination method that can be compared with PCA, but with the objective of maximizing covariance between two sets of variates on the same rows (specimens, sites). For example, morphometric and environmental data collected on the same specimens can be ordinated in order to study covariation between the two.

The program will ask for the number of columns belonging to the first block. The remaining columns will be assigned to the second block. There are options for plotting PLS scores both within and across blocks, and PLS loadings.

The algorithm follows Rohlf & Corti (2000). Permutation tests and biplots are not yet implemented.

Cluster analysis

Typical application: Finding hierarchical groupings in multivariate data sets
Assumptions: None
Data needed: Two or more rows of counted, measured or presence/absence data in one or more variables or categories, or a symmetric similarity or distance matrix.

The hierarchical clustering routine produces a 'dendrogram' showing how data points (rows) can be clustered. For 'R' mode clustering, putting weight on groupings of taxa, taxa should go in rows. It is also possible to find groupings of variables or associations (Q mode), by entering taxa in columns. Switching between the two is done by transposing the matrix (in the Edit menu).

Three different algorithms are available:

• Unweighted pair-group average (UPGMA). Clusters are joined based on the average distance between all members in the two groups.
• Single linkage (nearest neighbour). Clusters are joined based on the smallest distance between the two groups.
• Ward's method. Clusters are joined such that the increase in within-group variance is minimized.
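The difference between the first two joining criteria can be sketched as below (Ward's method is not shown). The function names, the cluster representation as lists of row indices, and the example data are hypothetical; this is an illustration of the criteria, not PAST's implementation.

```python
def average_linkage(dist_matrix, group_a, group_b):
    """UPGMA: mean distance over all pairs with one member in each group."""
    pairs = [(i, j) for i in group_a for j in group_b]
    return sum(dist_matrix[i][j] for i, j in pairs) / len(pairs)

def single_linkage(dist_matrix, group_a, group_b):
    """Nearest neighbour: smallest distance between the two groups."""
    return min(dist_matrix[i][j] for i in group_a for j in group_b)

def next_join(dist_matrix, clusters, linkage):
    """Indices of the two clusters that would be joined next."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = linkage(dist_matrix, clusters[a], clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    return best[1], best[2]

# Hypothetical one-variable data; points 0 and 1 are already in one cluster.
x = [0.0, 4.0, 5.0, 7.5]
dist_matrix = [[abs(a - b) for b in x] for a in x]
clusters = [[0, 1], [2], [3]]
print(next_join(dist_matrix, clusters, single_linkage))
print(next_join(dist_matrix, clusters, average_linkage))
```

With these data, single linkage joins the first cluster with point 2 because of the close neighbour at 4.0, whereas UPGMA joins points 2 and 3, since the average distance from point 2 to the first cluster is larger.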

One method is not necessarily better than the other, though single linkage is not recommended by some. It can be useful to compare the dendrograms given by the different algorithms in order to informally assess the robustness of the groupings. If a grouping is changed when trying another algorithm, that grouping should perhaps not be trusted.

For Ward's method, a Euclidean distance measure is inherent to the algorithm. For UPGMA and single linkage, the distance matrix can be computed using 13 different indices:

• The Euclidean distance (between rows) is a robust and widely applicable measure. Distance is converted to similarity by changing the sign.

• Correlation (of the variables along rows) using Pearson's r. A little meaningless if you have only two variables.

• Correlation using Spearman's rho (basically the r value of the ranks). Will often give the same result as correlation using r.

• Dice (Sorensen) coefficient for absence-presence data (coded as 0 or positive numbers). Puts more weight on joint occurrences than on mismatches.

When comparing two columns (associations), a match is counted for all taxa with presences in both columns. Using 'M' for the number of matches and 'N' for the total number of taxa with presences in just one column, we have

Dice similarity = 2M / (2M+N)

• Jaccard similarity for absence-presence data: M / (M+N)

• The Simpson index is defined as M / Nmin, where Nmin is the smaller of the numbers of presences in the two associations. This index treats two associations as identical if one is a subset of the other, making it useful for fragmentary data.

• Kulczynski similarity for presence-absence data: [M/(M+N1)+M/(M+N2)]/2

• Ochiai similarity for presence-absence data (binary form of the cosine): sqrt([M/(M+N1)][M/(M+N2)])

• Bray-Curtis measure for abundance data.

• Cosine similarity for abundance data - the inner product of abundances each normalised to unit norm.

• Chord distance for abundance data (converted to similarity by changing the sign). Recommended!

• Morisita's index for abundance data. Recommended!

• Raup-Crick index for absence-presence data. Recommended! This index (Raup & Crick 1979) uses a randomization ("Monte Carlo") procedure, comparing the observed number of species occurring in both associations with the distribution of co-occurrences from 200 random replicates.

See Harper (1999) or Davis (1986) for details.

• Horn's overlap index for abundance data (Horn 1966).

• Hamming distance for categorical data as coded with integers (or sequence data coded as CAGT). The Hamming distance is the number of differences (mismatches), so that the distance between (3,5,1,2) and (3,7,0,2) equals 2. In PAST, this is normalised to the range (0,1), which is known to geneticists as "p-distance".

• Jukes-Cantor distance for genetic sequence data (CAGT). Similar to Hamming distance, but takes into account probability of reversals.

• Kimura distance for genetic sequence data (CAGT). Similar to Jukes-Cantor distance, but takes into account different probabilities of nucleotide transitions vs. transversions.

• Tajima-Nei distance for genetic sequence data (CAGT). Similar to Jukes-Cantor distance, but does not assume equal nucleotide frequencies.

• Manhattan distance: The sum of differences in each variable (converted to similarity by changing the sign).

• User-defined similarity: Expects a symmetric similarity matrix rather than original data. No error checking!

• User-defined distance: Expects a symmetric distance matrix rather than original data. No error checking!

• Mixed: This option requires that data types have been assigned to columns (see Entering and manipulating data). A pop-up window will ask for the similarity/distance measure to use for each datatype. These will be combined using an average weighted by the number of variates of each type. The default choices correspond to those suggested by Gower, but other combinations may well work better. The "Gower" option is a range-normalised Manhattan distance.
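The presence/absence indices and the normalised Hamming distance from the list above can be sketched as follows. The function names are hypothetical. N1 and N2 are taken here to mean the numbers of taxa present only in the first and only in the second association (so that N = N1 + N2); this reading is an assumption, as the symbols are not defined above.

```python
from math import sqrt

def presence_counts(a, b):
    """Return (M, N1, N2): joint presences, presences only in a, only in b.
    Any positive number counts as a presence."""
    m = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    n1 = sum(1 for x, y in zip(a, b) if x > 0 and not y > 0)
    n2 = sum(1 for x, y in zip(a, b) if y > 0 and not x > 0)
    return m, n1, n2

def dice(a, b):
    m, n1, n2 = presence_counts(a, b)
    return 2 * m / (2 * m + n1 + n2)

def jaccard(a, b):
    m, n1, n2 = presence_counts(a, b)
    return m / (m + n1 + n2)

def simpson(a, b):
    m, n1, n2 = presence_counts(a, b)
    return m / min(m + n1, m + n2)   # Nmin: the smaller number of presences

def kulczynski(a, b):
    m, n1, n2 = presence_counts(a, b)
    return (m / (m + n1) + m / (m + n2)) / 2

def ochiai(a, b):
    m, n1, n2 = presence_counts(a, b)
    return sqrt((m / (m + n1)) * (m / (m + n2)))

def p_distance(a, b):
    """Hamming distance (number of mismatches) divided by the length."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

a = [1, 1, 1, 0, 0, 1]   # hypothetical associations
b = [1, 1, 0, 0, 1, 0]
print(dice(a, b), jaccard(a, b), simpson(a, b), kulczynski(a, b), ochiai(a, b))
print(simpson([1, 1, 0, 0], [1, 1, 1, 0]))     # one is a subset of the other
print(p_distance((3, 5, 1, 2), (3, 7, 0, 2)))  # two mismatches out of four
```

These functions divide by zero for all-zero rows; a caller would need to handle that case separately.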

Missing data: The cluster analysis algorithm can handle missing data, coded as -1 or question mark (?). This is done using pairwise deletion, meaning that when distance is calculated between two points, any variables that are missing are ignored in the calculation. For Raup-Crick, missing values are treated as absence. Missing data are not supported for Ward's method, nor for the Rho similarity measure.
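Pairwise deletion can be sketched for the Euclidean distance as below. The sketch represents missing values as None and simply skips any variable that is missing in either row; whether PAST additionally rescales the result for the number of variables actually used is not stated here, and no rescaling is done in this sketch. The function name is hypothetical.

```python
from math import sqrt

def euclidean_pairwise_deletion(a, b):
    """Euclidean distance using only variables observed in both rows.
    Missing values are represented by None (an assumption of this sketch)."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        raise ValueError("no variables observed in both rows")
    return sqrt(sum((x - y) ** 2 for x, y in shared))

print(euclidean_pairwise_deletion([0.0, 0.0, None], [3.0, 4.0, 7.0]))
print(euclidean_pairwise_deletion([0.0, None, 3.0], [4.0, 1.0, None]))
```

Unlike column average substitution, nothing is imputed: each pair of rows is compared on its own set of shared variables, so different pairs may be based on different variables.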

Two-way clustering: The two-way option allows simultaneous clustering in R-mode and Q-mode.

Stratigraphically constrained clustering: This option will allow only adjacent rows or groups of rows to be joined during the agglomerative clustering procedure. May produce strange-looking (but correct) dendrograms.

Bootstrapping: If a number of bootstrap replicates is given (e.g. 100), the columns are subjected to resampling. The percentage of replicates where each node is still supported is given on the dendrogram.

All-zeros rows: Some similarity measures (Dice, Jaccard, Simpson etc.) are undefined when comparing two all-zero rows. To avoid errors, especially when bootstrapping sparse data sets, the similarity is set to zero in such cases.

Neighbour joining cluster analysis

Typical application: Finding hierarchical groupings in multivariate data sets
Assumptions: None
Data needed: Two or more rows of counted, measured or presence/absence data in one or more variables or categories, or a symmetric similarity or distance matrix.

Neighbour joining clustering (Saitou & Nei 1987) is an alternative method for hierarchical cluster analysis. The method was originally developed for phylogenetic analysis, but may be superior to UPGMA also for ecological data. In contrast with UPGMA, two branches from the same internal node do not need to have equal branch lengths. A phylogram (unrooted dendrogram with proportional branch lengths) is given.

Distance indices and bootstrapping are as for other cluster analysis (above).

Negative branch lengths are forced to zero, and transferred to the adjacent branch according to Kuhner & Felsenstein (1994).

The tree is by default rooted on the last branch added during tree construction (this is not midpoint rooting). Optionally, the tree can be rooted on the first row in the data matrix (outgroup).

K-means clustering

Typical application: Non-hierarchical clustering of multivariate data into a specified number of groups
Assumptions: None
Data needed: Two or more rows of counted or measured data in one or more variables

K-means clustering (e.g. Bow 1984) is a non-hierarchical clustering method. The number of clusters to use is specified by the user, usually according to some hypothesis such as there being two sexes, four geographical regions or three species in the data set.

The cluster assignments are initially random. In an iterative procedure, items are then moved to the cluster which has the closest cluster mean, and the cluster means are updated accordingly. This continues until items are no longer "jumping" to other clusters. The result of the clustering is to some extent dependent upon the initial, random ordering, and cluster assignments may therefore differ from run to run. This is not a bug, but normal behaviour in k-means clustering.
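The iterative procedure described above can be sketched as follows. This is a generic k-means illustration rather than PAST's code: the function name, the optional `init_labels` and `seed` parameters (added so that a run can be reproduced), and the way emptied clusters are handled are all assumptions of the sketch.

```python
import random

def kmeans(points, k, init_labels=None, seed=None, max_iter=100):
    """Move items to the cluster with the closest mean until nothing moves.
    Returns one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    n, p = len(points), len(points[0])
    if init_labels is None:
        # Random initial assignment; dealing out a shuffled order
        # round-robin guarantees that no cluster starts empty.
        order = list(range(n))
        rng.shuffle(order)
        labels = [0] * n
        for pos, idx in enumerate(order):
            labels[idx] = pos % k
    else:
        labels = list(init_labels)
    if set(labels) != set(range(k)):
        raise ValueError("every cluster needs at least one initial member")
    means = [None] * k
    for _ in range(max_iter):
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous mean
                means[c] = [sum(m[j] for m in members) / len(members)
                            for j in range(p)]
        new_labels = []
        for pt in points:
            d2 = [sum((pt[j] - mu[j]) ** 2 for j in range(p)) for mu in means]
            new_labels.append(d2.index(min(d2)))
        if new_labels == labels:  # no item "jumped": converged
            break
        labels = new_labels
    return labels

# Hypothetical data: two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans(points, 2, init_labels=[0, 1, 0, 1, 0, 1, 0, 1]))
print(kmeans(points, 2, seed=1))   # random start; labels may be swapped
```

Calling the function without a seed gives a different random start on each run, which is the source of the run-to-run differences mentioned above.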

The cluster assignments may be copied and pasted back into the main spreadsheet, and corresponding colors (symbols) assigned to the items using the 'Numbers to colors' option in the Edit menu.

Missing data is supported by column average substitution.

Seriation

Typical application: Stratigraphical or environmental ordering of taxa and localities
Assumptions: None
Data needed: Presence/absence (0/1) matrix with taxa in rows

Seriation of an absence-presence matrix using the algorithm described by Brower & Kyle (1988). This method is typically applied to an association matrix with taxa (species) in the rows and populations in the columns. For constrained seriation (see below), columns should be ordered according to some criterion, normally stratigraphic level or position along a presumed faunal gradient.

The seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. There are two algorithms: Constrained and unconstrained optimization. In constrained optimization, only the rows (taxa) are free to move. Given an ordering of the columns, this procedure finds the 'optimal' biozonation, that is, the ordering of taxa which gives the prettiest range plot. Also, in the constrained mode, the program runs a 'Monte Carlo' simulation, generating and seriating 30 random matrices with the same number of occurrences within each taxon, and compares these to the original matrix to see if it is more informative than a random one (this procedure is time-consuming for large data sets).

In the unconstrained mode, both rows and columns are free to move.

Discriminant analysis and Hotelling's T2

Typical application: Testing for separation and equal means of two multivariate data sets
Assumptions: Multivariate normality. Hotelling's test assumes equal covariances.
Data needed: Two multivariate data sets of measured data, marked with different colors

Given two sets of multivariate data, an axis is constructed which maximizes the difference between the sets. The two sets are then plotted along this axis using a histogram.

This module expects the rows in the two data sets to be grouped into two sets by coloring the rows, e.g. with black (dots) and red (crosses).

Equality of the means of the two groups is tested by a multivariate analogue to the t test, called Hotelling's T-squared, and a p value for this test is given. Normal distribution of the variables is required, and also that the number of cases is at least two more than the number of variables.

Number of constraints: For correct calculation of the Hotelling's p value, the number of dependent variables (constraints) must be specified. It should normally be left at 0, but for Procrustes fitted landmark data use 4 (for 2D) or 6 (for 3D).

Discriminant analysis can be used for visually confirming or rejecting the hypothesis that two species are morphologically distinct. Using a cutoff point at zero (the midpoint between the means of the discriminant scores of the two groups), a classification into two groups is shown in the "View numbers" option. The percentage of correctly classified items is also given.

Discriminant function: New specimens can be classified according to the discriminant function. Take the inner product between the measurements on the new specimen and the given discriminant function factors, and then subtract the given offset value.
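Classifying a new specimen in this way can be sketched as below. The discriminant function factors, the offset and the measurements are made-up numbers, the function names are hypothetical, and which of the two groups lies on the positive side of the zero cutoff is an assumption that must be checked against the scores of specimens with known group membership.

```python
def discriminant_score(measurements, factors, offset):
    """Inner product of the measurements and the discriminant function
    factors, minus the offset."""
    return sum(x * w for x, w in zip(measurements, factors)) - offset

def classify(measurements, factors, offset):
    """Assign a specimen to one of two groups using a cutoff at zero.
    The mapping of sign to group is an assumption of this sketch."""
    score = discriminant_score(measurements, factors, offset)
    return "group A" if score < 0 else "group B"

factors = [0.5, -1.0]   # hypothetical discriminant function factors
offset = 2.0            # hypothetical offset value
print(discriminant_score([10.0, 2.0], factors, offset), classify([10.0, 2.0], factors, offset))
print(discriminant_score([4.0, 1.0], factors, offset), classify([4.0, 1.0], factors, offset))
```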

Leave one out (cross-evaluation): An option is available for leaving out one row (specimen) at a time, re-computing the discriminant analysis with the remaining specimens, and classifying the left-out row accordingly (as given by the Score value).

Beware: The combination of discriminant analysis and Hotelling's T2 test is sometimes misused. One should not be surprised to find a statistically significant difference between two samples which have been chosen with the objective of maximizing distance in the first place! The division into two groups should ideally be based on independent evidence.

See Davis (1986) for details.

Missing data is supported by column average substitution.

Paired Hotelling's T2

Typical application: Testing for equal means of a paired multivariate data set
Assumptions: Multivariate normality.
Data needed: A multivariate data set of paired measured data, marked with different colors

The paired Hotelling's test expects two groups of multivariate data, marked with different colours. Rows within each group must be consecutive. The first row of the first group is paired with the first row of the second group, the second row is paired with the second, etc.

Missing data is supported by column average substitution.

Permutation test for two multivariate groups

Typical application: Testing for equal means of two multivariate data sets
Assumptions: The two groups have similar distributions (variances)
Data needed: Two multivariate data sets of measured data, marked with different colors

This module expects the rows in the two data sets to be grouped into two sets by coloring the rows, e.g. with black (dots) and red (crosses).

Equality of the means of the two groups is tested using permutation with 2000 replicates (can be changed by the user), and the Mahalanobis squared distance measure. The permutation test is an alternative to Hotelling's test when the assumptions of multivariate normal distributions and equal covariance matrices do not hold.

Missing data is supported by column average substitution.

Multivariate normality test

Typical application: Testing for multivariate normality
Assumptions: Departures from multivariate normality detectable as departure from multivariate skewness or kurtosis
Data needed: One multivariate sample of measured data, with variables in columns

Multivariate normality is assumed by a number of multivariate tests. PAST computes Mardia's multivariate skewness and kurtosis, with tests based on chi-squared (skewness) and normal (kurtosis) distributions. A powerful omnibus (overall) test due to Doornik & Hansen (1994) is also given. If at least one of these tests shows departure from normality (small p value), the distribution is significantly non-normal. Sample size should be reasonably large (>50), although a small-sample correction is also attempted for the skewness test.

Box's M test

Typical application: Testing for equivalence of the covariance matrices for two data samples
Assumptions: Multivariate normality
Data needed: Two multivariate samples of measured data, or two (square) variance-covariance matrices, marked with different colors.

This test is rather specialized, testing for the equivalence of the covariance matrices for two multivariate samples. You can use either two original multivariate samples from which the covariance matrices are automatically computed, or two specified variance-covariance matrices. In the latter case, you must also specify the sizes (number of individuals) of the two samples.

The Box's M statistic is given, together with a significance value based on a chi-square approximation. Note that this test is supposedly very sensitive. This means that a high p value will be a good, although informal, indicator of equality, while a highly significant result (low p value) may in practical terms be a somewhat too sensitive indicator of inequality.

One-way MANOVA and Canonical Variates Analysis

Typical application: Testing for equality of the means of several multivariate samples, and ordination based on maximal separation (multigroup discriminant analysis)
Assumptions: Multivariate normal distribution, similar variances-covariances
Data needed: Two or more samples of multivariate measured data, marked with different colors. The number of cases must exceed the number of variables.

One-way MANOVA (Multivariate ANalysis Of VAriance) is the multivariate version of the univariate ANOVA, testing whether several samples have the same mean. If you have only two samples, you would perhaps rather use the two-sample Hotelling's T2 test.

Two statistics are provided: Wilks' lambda with its associated Rao's F and the Pillai trace with its approximated F. Wilks' lambda is probably more commonly used, but the Pillai trace may be more robust.

Number of constraints: For correct calculation of the p values, the number of dependent variables (constraints) must be specified. It should normally be left at 0, but for Procrustes fitted landmark data use 4 (for 2D) or 6 (for 3D).

Pairwise comparisons (post-hoc): If the MANOVA shows significant overall difference between groups, the analysis can proceed by pairwise comparisons. In PAST, the post-hoc analysis is quite simple, by pairwise Hotelling's tests. In the post-hoc table, groups are named according to the row label of the first item in the group. Hotelling's p values are given above the diagonal, while Bonferroni corrected values (multiplied by the number of pairwise comparisons) are given below the diagonal. This Bonferroni corrected test has very little power.
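The layout of the post-hoc table can be sketched as below, starting from already computed pairwise p values (the Hotelling's tests themselves are not shown). The function name and the p values are hypothetical, and capping corrected values at 1 is an assumption of this sketch.

```python
from itertools import combinations

def post_hoc_table(p_values, n_groups):
    """p_values maps (i, j) with i < j to an uncorrected p value.
    Returns a square table with uncorrected values above the diagonal
    and Bonferroni-corrected values (capped at 1) below it."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    table = [[None] * n_groups for _ in range(n_groups)]
    for i, j in combinations(range(n_groups), 2):
        p = p_values[(i, j)]
        table[i][j] = p                                # above the diagonal
        table[j][i] = min(1.0, p * n_comparisons)      # below the diagonal
    return table

# Hypothetical pairwise p values for three groups (three comparisons).
p_values = {(0, 1): 0.01, (0, 2): 0.2, (1, 2): 0.5}
for row in post_hoc_table(p_values, 3):
    print(row)
```

With three groups every p value is multiplied by three, which illustrates why the corrected test loses power quickly as the number of groups grows.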

Canonical Variates Analysis

An option under MANOVA, CVA produces a scatter plot of specimens along the first two canonical axes, which give maximal and second-to-maximal separation between all groups (multigroup discriminant analysis). The axes are linear combinations of the original variables as in PCA, and the eigenvalues indicate the amount of variation explained by these axes.

Missing data is supported by column average substitution.
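Column average substitution can be illustrated as follows. This is a sketch under stated assumptions: missing values are represented by `None`, and the function name `fill_column_means` is hypothetical.

```python
def fill_column_means(rows):
    """Replace each missing value (None) by the mean of the values present in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [row[c] for row in rows if row[c] is not None]
        means.append(sum(present) / len(present))
    return [[means[c] if row[c] is None else row[c] for c in range(n_cols)]
            for row in rows]


# Hypothetical data: three specimens, two variables, two missing entries
data = [[1.0, 4.0], [None, 6.0], [3.0, None]]
filled = fill_column_means(data)
print(filled)
```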

One-way ANOSIM

Typical application: Testing for difference between two or more multivariate groups, based on any distance measure
Assumptions: Ranked dissimilarities within groups have equal median and range.
Data needed: Two or more groups of multivariate data, marked with different colors, or a symmetric similarity or distance matrix with similar groups.

ANOSIM (ANalysis Of Similarities) is a non-parametric test of significant difference between two or more groups, based on any distance measure (Clarke 1993). The distances are converted to ranks. ANOSIM is normally used for ecological taxa-in-samples data, where groups of samples are to be compared.

In a rough analogy with ANOVA, the test is based on comparing distances between groups with distances within groups. Let rb be the mean rank of all distances between groups, and rw the mean rank of all distances within groups. The test statistic R is then defined as

R = (rb - rw) / (N(N-1)/4),

where N is the total number of samples.

Large positive R (up to 1) signifies dissimilarity between groups. The significance is computed by permutation of group membership, with 10,000 replicates (can be changed by the user).
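The calculation of R and its permutation test can be sketched as follows. This is an illustration under stated assumptions, not PAST's implementation: tied distances receive their mean rank, the p value is taken as the proportion of permuted R values at least as large as the observed one (counting the observed arrangement), and all function names are hypothetical.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """R = (rb - rw) / (N(N-1)/4) for a distance matrix and group labels."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    rb = sum(between) / len(between)
    rw = sum(within) / len(within)
    return (rb - rw) / (n * (n - 1) / 4)


def anosim(dist, groups, replicates=9999, seed=0):
    """Observed R and a one-sided permutation p value (assumed convention)."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)  # permute group membership
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


# Hypothetical example: six samples on a line, forming two tight clusters
x = [0, 1, 2, 10, 11, 12]
groups = [0, 0, 0, 1, 1, 1]
dist = [[abs(a - b) for b in x] for a in x]
R, p = anosim(dist, groups, replicates=999)
print("R =", R, "p =", p)
```

In this example every between-group distance exceeds every within-group distance, so R reaches its maximum of 1.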

Pairwise ANOSIMs between all pairs of groups are provided as a post-hoc test. The Bonferroni corrected p values are very conservative.

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Two-way ANOSIM

Typical application: Testing for difference between multivariate groups, based on any distance measure. The groups are organized into two factors of at least two levels each.
Assumptions: Ranked dissimilarities within groups have equal median and range.
Data needed: First two columns: Levels of the two factors, coded with integers. Remaining columns: Multivariate data, or a symmetric similarity or distance matrix.

The two-way ANOSIM in PAST uses the crossed design (Clarke 1993). For more information see one-way ANOSIM, but note that groups (levels) are not coded with colors but with integer numbers in the first two columns.

One-way NPMANOVA

Typical application: Testing for difference between two or more multivariate groups, based on any distance measure
Assumptions: The groups have similar distributions (similar variances)
Data needed: Two or more groups of multivariate data, marked with different colors, or a symmetric similarity or distance matrix with similar groups.

NPMANOVA (Non-Parametric MANOVA) is a non-parametric test of significant difference between two or more groups, based on any distance measure (Anderson 2001). NPMANOVA is normally used for ecological taxa-in-samples data, where groups of samples are to be compared, but may also be used as a general non-parametric MANOVA.

NPMANOVA calculates an F value in analogy with ANOVA. In fact, for univariate data sets and the Euclidean distance measure, NPMANOVA is equivalent to ANOVA and gives the same F value.

The significance is computed by permutation of group membership, with 10,000 replicates (can be changed by the user).
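The distance-based F value and its permutation test can be sketched as follows, using the sums of squared distances of Anderson (2001). This is an illustration rather than PAST's code: the function names are hypothetical, and the p value convention (proportion of permuted F values at least as large as the observed one, counting the observed arrangement) is an assumption.

```python
import random
from itertools import combinations


def npmanova_f(dist, groups):
    """Pseudo-F from a distance matrix: among-group vs. within-group sums of squares."""
    n = len(groups)
    levels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for level in levels:
        members = [i for i in range(n) if groups[i] == level]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(levels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def npmanova(dist, groups, replicates=9999, seed=0):
    """Observed F and a permutation p value (assumed convention)."""
    rng = random.Random(seed)
    observed = npmanova_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(replicates):
        rng.shuffle(shuffled)  # permute group membership
        if npmanova_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


def anova_f(values, groups):
    """Classical one-way ANOVA F, for comparison on univariate data."""
    n = len(values)
    levels = sorted(set(groups))
    grand = sum(values) / n
    ss_between = ss_within = 0.0
    for level in levels:
        member_values = [v for v, g in zip(values, groups) if g == level]
        mean = sum(member_values) / len(member_values)
        ss_between += len(member_values) * (mean - grand) ** 2
        ss_within += sum((v - mean) ** 2 for v in member_values)
    return (ss_between / (len(levels) - 1)) / (ss_within / (n - len(levels)))


# Hypothetical univariate data with Euclidean (absolute difference) distances
x = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
groups = [0, 0, 0, 1, 1, 1]
dist = [[abs(a - b) for b in x] for a in x]
f_dist, p = npmanova(dist, groups, replicates=999)
f_anova = anova_f(x, groups)
print("NPMANOVA F =", f_dist, "ANOVA F =", f_anova, "p =", p)
```

For this univariate example the two F values agree, illustrating the equivalence with ANOVA noted above.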

Pairwise NPMANOVAs between all pairs of groups are provided as a post-hoc test. The Bonferroni corrected p values are very conservative.

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Mantel test

Typical application: Testing for correlation between two distance matrices, typically geographical or stratigraphic distance and e.g. distance between species compositions of samples (is there a spatial structure in a multivariate data set?)
Assumptions: None
Data needed: Two groups of multivariate data, marked with different colors, or two symmetric distance or similarity matrices.

The Mantel test is a permutation test for correlation between two distance or similarity matrices. In PAST, these matrices can also be computed automatically from two sets of original data. The first matrix must be above the second matrix in the spreadsheet, and the rows must be marked with two different colors. The two matrices must have the same number of rows. If they are distance or similarity matrices, they must also have the same number of columns.
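The principle of the test can be sketched as follows. Several choices here are assumptions made for illustration and are not stated in this manual: the correlation is taken as Pearson's r over the upper triangles, the test is one-tailed, the rows and columns of the second matrix are permuted together, and the number of replicates is arbitrary. The function names and data are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(a, b, replicates=9999, seed=0):
    """Correlation between two symmetric matrices and a permutation p value."""
    n = len(a)
    pairs = list(combinations(range(n), 2))  # upper triangle only
    x = [a[i][j] for i, j in pairs]
    observed = pearson(x, [b[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(replicates):
        rng.shuffle(order)  # relabel the samples of the second matrix
        y = [b[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


# Hypothetical example: sample positions along a transect and a faunal score
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = [0.0, 1.2, 1.9, 3.3, 4.1]
geo = [[abs(p - q) for q in positions] for p in positions]
fauna = [[abs(s - t) for t in scores] for s in scores]
r, p = mantel(geo, fauna, replicates=999)
print("r =", r, "p =", p)
```

Permuting whole samples, rather than individual matrix entries, preserves the dependence between the distances that share a sample.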

SIMPER

Typical application: Identifying taxa primarily responsible for differences between two or more groups of ecological samples (abundances)
Assumptions: Independent samples
Data needed: Two or more groups of multivariate abundance samples (taxa in columns), marked with different colors.

SIMPER (Similarity Percentage) is a simple method for assessing which taxa are primarily responsible for an observed difference between groups of samples (Clarke 1993). The overall significance of the difference is often assessed by ANOSIM. The Bray-Curtis similarity measure is implicit in SIMPER.

If more than two groups are selected, you can either compare two groups (pairwise) by choosing from the lists of groups, or you can pool all samples to perform one overall multi-group SIMPER.
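For a pairwise comparison, the per-taxon breakdown of the average Bray-Curtis dissimilarity can be sketched as follows. This is an illustration under stated assumptions, not PAST's code: the function name `simper` and the abundance data are hypothetical, and no sample pair is allowed to be empty of all taxa.

```python
def simper(group_a, group_b):
    """Average between-group Bray-Curtis dissimilarity, split by taxon.

    Each group is a list of samples; each sample is a list of abundances.
    Returns the overall average dissimilarity and the per-taxon contributions.
    """
    n_taxa = len(group_a[0])
    contrib = [0.0] * n_taxa
    n_pairs = 0
    for s in group_a:
        for t in group_b:
            total = sum(s) + sum(t)  # Bray-Curtis denominator for this pair
            for i in range(n_taxa):
                contrib[i] += abs(s[i] - t[i]) / total
            n_pairs += 1
    contrib = [c / n_pairs for c in contrib]
    return sum(contrib), contrib


# Hypothetical abundances: two samples per group, three taxa
group_a = [[10, 0, 5], [8, 1, 5]]
group_b = [[0, 10, 5], [1, 9, 5]]
overall, contrib = simper(group_a, group_b)

# List taxa by decreasing contribution, with percentage of the overall dissimilarity
for i in sorted(range(len(contrib)), key=lambda k: -contrib[k]):
    print("taxon", i, "contribution", contrib[i],
          "percent", 100 * contrib[i] / overall)
```

The third taxon has the same abundance in every sample and therefore contributes nothing to the difference between the groups.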

CABFAC factor analysis

Typical application: Factor analysis of abundance data, optionally with associated environmental data
Assumptions: None
Data needed: Counted data with samples in rows, taxa in columns. The first column can optionally contain associated environmental data (e.g. temperature)

This module implements the classical Imbrie & Kipp (1971) method of factor analysis and environmental regression (CABFAC and REGRESS).

The program asks whether the first column contains environmental data. If not, a simple factor analysis with Varimax rotation will be computed on row-normalized data. If environmental data are included, the factors will be regressed onto the environmental variable using the second-order (parabolic) method of Imbrie & Kipp, with cross terms. PAST then reports the RMA regression of original environmental values against values reconstructed from the transfer function. You can also save the transfer function as a text file that can later be used for reconstruction of palaeoenvironment (see below). This file contains:

• Number of taxa
• Number of factors
• Factor scores for each taxon
• Number of regression coefficients
• Regression coefficients (second- and first-order terms, and intercept)

Calibration from CABFAC

Typical application: Reconstructing environmental parameters (e.g. temperature) from a CABFAC transfer function and fossil species abundances.
Assumptions: Faunal responses to the environment constant through time and space
Data needed: Samples in rows, taxa in columns. The program will also ask for a previously constructed transfer function file

This module will reconstruct a (single) environmental parameter from taxa-in-samples abundance data. The program will also ask for a CABFAC transfer function file, as previously made by CABFAC factor analysis. The set of taxa (columns) must be identical in the spreadsheet and the transfer function file. (There seems to be a numerical instability that adds some noise to the solution; this is being looked into.)

Calibration from optima

Typical application: Reconstructing environmental parameters (e.g. temperature) from species abundance and species optimum data.
Assumptions: None
Data needed: Samples in rows, taxa in columns. First three rows: Optima, tolerances, peak abundances. Subsequent rows: Abundance counts.

The first three rows can be generated from known (Recent) abundance and environmental data by the "Species packing" option in the Model menu. The third row (peak abundance) is not used, and the second row (tolerance) is used only when the "Equal tolerances" box is not ticked.

The algorithm is weighted averaging, optionally with tolerance weighting, according to ter Braak & van Dam (1989).
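Weighted averaging can be sketched as follows, after the standard formulas of ter Braak & van Dam (1989): the estimate is the abundance-weighted mean of the species optima, and with tolerance weighting each abundance is first divided by the squared tolerance. The function name `wa_reconstruct` and the numbers are hypothetical.

```python
def wa_reconstruct(abundances, optima, tolerances=None):
    """Weighted-averaging estimate of an environmental value for one sample.

    With tolerances=None, all taxa are treated as having equal tolerances.
    Otherwise each abundance is down-weighted by its squared tolerance.
    """
    if tolerances is None:
        weights = list(abundances)
    else:
        weights = [y / t ** 2 for y, t in zip(abundances, tolerances)]
    return sum(w * u for w, u in zip(weights, optima)) / sum(weights)


# Hypothetical optima (e.g. temperature), tolerances and one fossil sample
optima = [10.0, 15.0, 20.0]
tolerances = [1.0, 2.0, 4.0]
sample = [5, 3, 2]

equal_tol = wa_reconstruct(sample, optima)
tol_weighted = wa_reconstruct(sample, optima, tolerances)
print("equal tolerances:", equal_tol)
print("tolerance weighted:", tol_weighted)
```

With tolerance weighting, the narrow-tolerance taxon with optimum 10 dominates, pulling the estimate towards its optimum.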

Modern Analog Technique

Typical application: Reconstructing paleoenvironmental parameters (e.g. temperature) from fossil counts and modern data.
Assumptions: Fossil associations are ecologically comparable to modern ones
Data needed: Samples in rows, taxa in columns. First column contains environmental data - downcore samples have question marks in this column. All modern samples in first rows, then all downcore samples.

The Modern Analog Technique works by finding modern samples with faunal associations close to those in the downcore samples.

Parameters to set:

• Distance measure: Several distance measures commonly used in MAT are available. "Squared chord" has become the standard choice in the literature.
• Weighting: When several modern analogs are linked to one downcore sample, their environmental values can be weighted equally, inversely proportional to faunal distance, or inversely proportional to ranked faunal distance.
• Distance threshold: Only modern analogs closer than this threshold are used. A default value is given, which is the tenth percentile of distances between all sample pairs in the modern data. The "Dissimilarity distribution" histogram may be useful when selecting this threshold.
• N analogs: This is the maximum number of modern analogs used for each downcore sample.
• Jump method (on/off): For each downcore sample, modern samples are sorted by ascending distance. When the distance increases by more than the selected percentage, the subsequent modern analogs are discarded.

Note that one or more of these options can be disabled by entering a large value. For example, a very large distance threshold will never apply, so the number of analogs is decided only by the "N analogs" value and optionally the jump method.
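The way these parameters interact can be sketched for a single downcore sample as follows. This is an illustration under stated assumptions, not PAST's implementation: the squared chord distance is computed on proportions, the jump rule compares each distance with the previous one in the sorted list, only equal and inverse-distance weighting are shown, and all names and data are hypothetical.

```python
from math import sqrt


def squared_chord(p, q):
    """Squared chord distance between two samples, computed on proportions."""
    sp, sq = sum(p), sum(q)
    return sum((sqrt(a / sp) - sqrt(b / sq)) ** 2 for a, b in zip(p, q))


def mat_estimate(fossil, modern, env, threshold, n_analogs,
                 weighting="equal", jump_percent=None):
    """Estimate the environmental value for one downcore sample."""
    # Sort modern samples by ascending distance to the downcore sample
    ranked = sorted((squared_chord(fossil, m), e) for m, e in zip(modern, env))
    # Apply the distance threshold, then the maximum number of analogs
    chosen = [(d, e) for d, e in ranked if d < threshold][:n_analogs]
    # Jump method (one reading): cut where distance rises by more than jump_percent
    if jump_percent is not None:
        for k in range(1, len(chosen)):
            if chosen[k][0] > chosen[k - 1][0] * (1 + jump_percent / 100):
                chosen = chosen[:k]
                break
    if not chosen:
        return None  # no modern analog is close enough
    if weighting == "equal":
        weights = [1.0] * len(chosen)
    elif weighting == "inverse":
        weights = [1.0 / max(d, 1e-12) for d, _ in chosen]  # guard against d = 0
    else:
        raise ValueError("unknown weighting: " + weighting)
    return sum(w * e for w, (_, e) in zip(weights, chosen)) / sum(weights)


# Hypothetical modern counts (three taxa) with known temperatures
modern = [[50, 30, 20], [45, 35, 20], [10, 20, 70], [5, 25, 70]]
env = [12.0, 13.0, 24.0, 25.0]
fossil = [48, 32, 20]

est_equal = mat_estimate(fossil, modern, env, threshold=0.05, n_analogs=3)
est_inverse = mat_estimate(fossil, modern, env, threshold=0.05, n_analogs=3,
                           weighting="inverse")
est_jump = mat_estimate(fossil, modern, env, threshold=0.05, n_analogs=3,
                        jump_percent=50)
print(est_equal, est_inverse, est_jump)
```

Here only the first two modern samples fall below the threshold, so "N analogs" never comes into play. With the jump method enabled, the second analog is discarded as well, because its distance is more than 50 percent larger than that of the first.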

Cross validation

The scatter plot and R² value show the results of a leave-one-out (jackknifing) cross-validation within the modern data. The y=x line is shown in red. This only partly reflects the "quality" of the method, as it gives little information about the accuracy of downcore estimation.