Multivariate statistics

Principal components analysis

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate data sets with some underlying linear structure | Debated | Two or more rows of measured data with three or more variables |

Principal components analysis (PCA) is a procedure for finding hypothetical variables (components) which account for as much of the variance in your multidimensional data as possible (Davis 1986, Harper 1999). These new variables are linear combinations of the original variables. PCA has several applications, two of them are:

The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix. Choose var-covar if all your variables are measured in the same units (e.g. centimetres). Choose correlation (normalized var-covar) if your variables are measured in different units; this implies normalizing all variables using division by their standard deviations. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (components), are given for all components. The percentages of variance accounted for by these components are also given. If most of the variance is accounted for by the first one or two components, you have scored a success, but if the variance is spread more or less evenly among the components, the PCA has in a sense not been very successful.
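The eigenanalysis described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the code used by PAST: the function name `pca`, the `use_correlation` flag and the small data set are all hypothetical.

```python
import numpy as np

def pca(data, use_correlation=False):
    """Eigenanalysis of the var-covar or correlation matrix (rows = items)."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)                  # centre each variable
    if use_correlation:
        x = x / x.std(axis=0, ddof=1)       # divide by the standard deviations
    matrix = np.cov(x, rowvar=False)        # var-covar (or correlation) matrix
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    order = np.argsort(eigenvalues)[::-1]   # most important component first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    scores = x @ eigenvectors               # row coordinates on the components
    return eigenvalues, eigenvectors, percent, scores

# Hypothetical measurements: 5 specimens, 3 variables in the same unit
data = [[2.0, 4.1, 1.0],
        [3.0, 6.2, 1.2],
        [4.0, 7.9, 0.9],
        [5.0, 10.1, 1.1],
        [6.0, 12.0, 1.0]]
eigenvalues, eigenvectors, percent, scores = pca(data)
print(percent)
```

In this made-up example the first two variables are almost perfectly correlated, so nearly all of the variance falls on the first component. Passing `use_correlation=True` standardizes the variables first, in which case the eigenvalues sum to the number of variables.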

The Jolliffe cut-off value gives an informal indication of how many principal components should be considered significant (Jolliffe, 1986). Components with eigenvalues smaller than the Jolliffe cut-off may be considered insignificant, but too much weight should not be put on this criterion.

Row-wise bootstrapping is carried out if a non-zero number of bootstrap replicates (e.g. 1000) is given in the 'Boot N' box. The bootstrapped components are re-ordered and reversed according to Peres-Neto et al. (2003) to ensure correspondence with the original axes. 95% bootstrapped confidence intervals are given for the eigenvalues.

The 'Scree plot' (simple plot of eigenvalues) can also be used to indicate the number of significant components. After this curve starts to flatten out, the corresponding components may be regarded as insignificant. 95% confidence intervals are shown if bootstrapping has been carried out. The eigenvalues expected under a random model (Broken Stick) are optionally plotted - eigenvalues under this curve represent non-significant components (Jackson 1993).
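As a rough illustration of the Broken Stick comparison, the sketch below uses the usual broken-stick formula, in which the expected proportion of variance for component k out of p is (1/p) times the sum of 1/i for i from k to p. That PAST uses exactly this formula is an assumption, and the function names and eigenvalues are hypothetical.

```python
def broken_stick(p):
    """Expected proportions of variance for p components (broken-stick model)."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

def components_above_broken_stick(eigenvalues):
    """Count the leading components whose eigenvalue exceeds the expectation."""
    total = sum(eigenvalues)
    count = 0
    for observed, proportion in zip(eigenvalues, broken_stick(len(eigenvalues))):
        if observed <= proportion * total:
            break                           # first component under the curve
        count += 1
    return count

# Hypothetical eigenvalues, sorted from largest to smallest
eigenvalues = [3.1, 0.5, 0.3, 0.1]
print(broken_stick(4))
print(components_above_broken_stick(eigenvalues))
```

Stopping at the first component that falls under the curve is one possible reading of the criterion; other conventions exist.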

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the two most important components. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours. You can also plot the Minimal Spanning Tree, which is the shortest possible set of connected lines connecting all points. This may be used as a visual aid in grouping close points. The MST is based on a Euclidean distance measure of the original data points, so it is most meaningful when all your variables use the same unit. The 'Biplot' option will show a projection of the original axes (variables) onto the scattergram. This is another visualisation of the PCA loadings (coefficients) - see below.

If the "Eigenval scale" is ticked, the data points will be scaled by
1/sqrt(d_{k}), and the biplot eigenvectors by sqrt(d_{k}) -
this is the correlation biplot of Legendre & Legendre (1998). If not
ticked, the data points are not scaled, while the biplot eigenvectors
are normalized to equal length (but not to unity, for graphical reasons) - this
is the distance biplot.

The 'View loadings' option shows to what degree your different original variables (given in the original order along the x axis) enter into the different components (as chosen in the radio button panel). These component loadings are important when you try to interpret the 'meaning' of the components. The 'Coefficients' option gives the PC coefficients, while 'Correlation' gives the correlation between a variable and the PC scores. Do not use the latter if you are doing PCA on the correlation matrix. If bootstrapping has been carried out, 95% confidence intervals are shown (only for the Coefficients option).

The 'SVD' option will enforce use of the superior Singular Value Decomposition algorithm instead of "classical" eigenanalysis. The two algorithms will normally give almost identical results, but axes may be flipped.

For the 'Shape PCA' and 'Shape deform' options, see the section on Geometrical Analysis.

Bruton & Owen (1988) describe a typical morphometrical application of PCA.

Missing data is supported by column average substitution.

Principal coordinates

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate data sets with some underlying linear structure | Unknown | Two or more rows of measured or counted data with three or more variables, or a symmetric similarity or distance matrix |

Principal coordinates analysis (PCO) is another ordination method, somewhat similar to PCA. It is also known as Metric Multidimensional Scaling (different from Non-metric Multidimensional Scaling!). The algorithm is taken from Davis (1986).

The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances or similarities between all data points. The Gower measure will normally be used instead of Euclidean distance; the latter gives results similar to PCA. An additional eleven distance measures are available - these are explained under Cluster Analysis. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (coordinates), are given for the first four most important coordinates (or fewer if there are fewer than four data points). The percentages of variance accounted for by these components are also given.

The similarity/distance values are raised to the power of *c*
(the "Transformation exponent") before eigenanalysis. The standard
value is *c*=2. Higher values (4 or 6) may decrease the
"horseshoe" effect (Podani & Miklos 2002).

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the PCO. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours. The "Eigenvalue scaling" option scales each axis using the square root of the eigenvalue (recommended). The minimal spanning tree option is based on the selected similarity or distance index in the original space.

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho or user-defined indices).

Non-metric multidimensional scaling

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate data sets | None | Two or more rows of measured, counted or presence/absence data with two or more variables, or a symmetric similarity or distance matrix. |

Non-metric multidimensional scaling is based on a distance matrix
computed with any of 13 supported distance measures, as explained under
Cluster Analysis below. The algorithm then attempts to place the data
points in a two- or three-dimensional coordinate system such that the *ranked
differences* are preserved. For example, if the original distance between
points 4 and 7 is the ninth largest of all distances between any
two points, points 4 and 7 will ideally be placed such that their
euclidean distance in the 2D plane or 3D space is still the ninth largest.
Non-metric multidimensional scaling intentionally does not take absolute distances
into account.
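The idea of preserving ranked distances can be illustrated with a small Python sketch. It does not perform the ordination itself; it only ranks the pairwise Euclidean distances in two configurations and checks whether the rank order agrees. The function name and both point sets are hypothetical, and ranks here count from the smallest distance, which is equivalent to counting from the largest.

```python
from itertools import combinations
from math import dist

def distance_ranks(points):
    """Rank all pairwise Euclidean distances (1 = smallest distance)."""
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]
    order = sorted(range(len(pairs)), key=lambda k: distances[k])
    ranks = {}
    for rank, k in enumerate(order, start=1):
        ranks[pairs[k]] = rank
    return ranks

# Hypothetical original data (3 variables) and a candidate 2D configuration
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (5.0, 5.0, 5.0)]
ordination = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.5), (2.0, 2.0)]

same_ranks = distance_ranks(original) == distance_ranks(ordination)
print(same_ranks)
```

The absolute distances in the two configurations differ greatly, but the ordering of the six pairwise distances is the same, which is all that this kind of ordination tries to achieve.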

The program may converge on a different solution in each run, depending upon the random initial conditions. Each run is actually a sequence of 11 trials, from which the one with smallest stress is chosen. One of these trials uses PCO as the initial condition, but this rarely gives the best solution. The solution is automatically rotated to the major axes (2D and 3D).

The algorithm implemented in PAST, which seems to work very well, is based on a new approach developed by Taguchi & Oono (in press).

The minimal spanning tree option is based on the selected similarity or distance index in the original space.

*Shepard plot*: This plot of obtained versus observed (target) ranks
indicates the quality of the result. Ideally, all points should be placed
on a straight ascending line (x=y).

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Correspondence analysis

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients | Unknown | Two or more rows of counted data in three or more compartments |

Correspondence analysis (CA) is yet another ordination method, somewhat
similar to PCA but for *counted data*. For comparing associations
(columns) containing counts of taxa, or counted taxa (rows) across
associations, CA is the more appropriate algorithm. Also, CA is
more suitable if you expect that species have unimodal responses
to the underlying parameters, that is they favour a certain range of
the parameter, becoming rare for lower and higher values (this is in
contrast to PCA, which assumes a linear response).

The CA routine finds the eigenvalues and eigenvectors of a matrix containing the Chi-squared distances between all data points. The eigenvalue, giving a measure of the similarity accounted for by the corresponding eigenvector, is given for each eigenvector. The percentages of similarity accounted for by these components are also given.

The 'View scatter' option allows you to see all your data points (rows) plotted in the coordinate system given by the CA. If you have colored (grouped) rows, the different groups will be shown using different symbols and colours.

In addition, the variables (columns, associations) can be plotted in the same coordinate system (Q mode), optionally including the column labels. If your data are 'well behaved', taxa typical for an association should plot in the vicinity of that association.

PAST presently uses a symmetric scaling ("Benzecri scaling").

If you have more than two columns in your data set, you can choose to view a scatter plot on the second and third axes.

*Relay plot*: This is a composite diagram with one plot per
column. The plots are ordered according to CA column scores. Each data point
is plotted with CA first-axis row scores on the vertical axis, and the original
data point value (abundance) in the given column on the horizontal axis.
This may be most useful when samples are in rows and taxa in columns.
The relay plot will then show the taxa ordered according to
their positions along the gradients, and for each taxon the corresponding
plot should ideally show a unimodal peak, partly overlapping with the
peak of the next taxon along the gradient (see Hennebert
& Lees 1991 for an example from sedimentology).

Missing data is supported by column average substitution.

Detrended correspondence analysis

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients | Unknown | Two or more rows of counted data in three or more compartments |

The Detrended Correspondence (DCA) module uses the same algorithm as Decorana (Hill & Gauch 1980), with modifications according to Oksanen & Minchin (1997). It is specialized for use on 'ecological' data sets with abundance data; samples in rows, taxa in columns (vice versa prior to v. 1.79). When the 'Detrending' option is switched off, a basic Reciprocal Averaging will be carried out. The result should be similar to Correspondence Analysis (see above) plotted on the first and second axes.

Eigenvalues for the first three ordination axes are given as in CA, indicating their relative importance in explaining the spread in the data.

Detrending is a sort of normalization procedure in two steps. The first step involves an attempt to 'straighten out' points lying in an arch, which is a common occurrence. The second step involves 'spreading out' the points to avoid clustering of the points at the edges of the plot. Detrending may seem an arbitrary procedure, but can be a useful aid in interpretation.

Missing data is supported by column average substitution.

Canonical correspondence analysis

Typical application | Assumptions | Data needed |

Reduction and interpretation of large multivariate ecological data sets with environmental or other gradients | Unknown | Two or more rows of sites, with taxa (species) in columns. The first columns contain environmental variables. |

Canonical Correspondence Analysis (Legendre & Legendre
1998)
is correspondence analysis of a site/species matrix where each site has
given values for one or more environmental variables (temperature, depth,
grain size etc.). The ordination axes are linear combinations of the
environmental variables. CCA is thus an example of direct gradient analysis,
where the gradient in environmental variables is known *a priori*
and the species abundances (or presence/absences) are considered to be a
response to this gradient.

The implementation in PAST follows the eigenanalysis algorithm given in Legendre & Legendre (1998). The ordinations are given as site scores - fitted site scores are presently not available. Environmental variables are plotted as correlations with site scores. Both scalings (type 1 and 2) of Legendre & Legendre (1998) are available. Scaling 2 emphasizes relationships between species.

Two-block Partial Least Squares (PLS)

Typical application | Assumptions | Data needed |

Studying the structure of covariation between two sets of variates on the same rows | None | Two or more rows of multivariate continuous data. The columns should be first all variates of first block, then all variates of second block. |

Two-block Partial Least Squares can be seen as an ordination method that can be compared with PCA, but with the objective of maximizing covariance between two sets of variates on the same rows (specimens, sites). For example, morphometric and environmental data collected on the same specimens can be ordinated in order to study covariation between the two.

The program will ask for the number of columns belonging to the first block. The remaining columns will be assigned to the second block. There are options for plotting PLS scores both within and across blocks, and PLS loadings.

The algorithm follows Rohlf & Corti (2000). Permutation tests and biplots are not yet implemented.

Cluster analysis

Typical application | Assumptions | Data needed |

Finding hierarchical groupings in multivariate data sets | None | Two or more rows of counted, measured or presence/absence data in one or more variables or categories, or a symmetric similarity or distance matrix. |

The hierarchical clustering routine produces a 'dendrogram' showing how data points (rows) can be clustered. For 'R' mode clustering, putting weight on groupings of taxa, taxa should go in rows. It is also possible to find groupings of variables or associations (Q mode), by entering taxa in columns. Switching between the two is done by transposing the matrix (in the Edit menu).

Three different algorithms are available:

One method is not necessarily better than the other, though single linkage is not recommended by some. It can be useful to compare the dendrograms given by the different algorithms in order to informally assess the robustness of the groupings. If a grouping is changed when trying another algorithm, that grouping should perhaps not be trusted.

For Ward's method, a Euclidean distance measure is inherent to the algorithm. For UPGMA and single linkage, the distance matrix can be computed using 13 different indices:

When comparing two columns (associations), a match is counted for all taxa with presences in both columns. Using 'M' for the number of matches and 'N' for the total number of taxa with presences in just one column, we have

Dice similarity = 2M / (2M+N)

See Harper (1999) or Davis (1986) for details.
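The Dice formula above translates directly into code. The following Python sketch is an illustration only; the function name and the two example columns are hypothetical. Returning zero for two all-zero columns follows the convention this manual describes for all-zeros rows.

```python
def dice_similarity(column_a, column_b):
    """Dice similarity between two presence/absence columns: 2M / (2M + N)."""
    pairs = list(zip(column_a, column_b))
    matches = sum(1 for a, b in pairs if a and b)                  # M
    mismatches = sum(1 for a, b in pairs if bool(a) != bool(b))    # N
    if matches == 0 and mismatches == 0:
        return 0.0      # undefined for two all-zero columns; set to zero
    return 2 * matches / (2 * matches + mismatches)

# Hypothetical associations with five taxa each
assoc_1 = [1, 1, 0, 1, 0]
assoc_2 = [1, 0, 0, 1, 1]
print(dice_similarity(assoc_1, assoc_2))
```

Here two taxa are present in both columns (M = 2) and two are present in just one (N = 2), giving 4/6.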

*Missing data:* The cluster analysis algorithm can handle missing
data, coded as -1 or question mark (?). This is done using pairwise
deletion, meaning that when distance is calculated between two points, any
variables that are missing are ignored in the calculation. For Raup-Crick,
missing values are treated as absence. Missing data
are not supported for Ward's method, nor for the Rho similarity measure.
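Pairwise deletion can be sketched as follows for the Euclidean distance. The function name and data are hypothetical, and the sketch simply skips the missing variables; whether the program additionally rescales the distance for the number of variables actually used is not stated here and is not assumed.

```python
from math import sqrt

MISSING = {-1, "?"}     # codes for missing values, as described above

def euclidean_pairwise_deletion(row_a, row_b):
    """Euclidean distance using only the variables present in both rows."""
    squared = [(a - b) ** 2 for a, b in zip(row_a, row_b)
               if a not in MISSING and b not in MISSING]
    if not squared:
        raise ValueError("no variables observed in both rows")
    return sqrt(sum(squared))

# Hypothetical rows: the second and third variables are dropped for this pair
row_a = [3.0, -1, 4.0, 1.0]
row_b = [0.0, 2.0, "?", 5.0]
print(euclidean_pairwise_deletion(row_a, row_b))
```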

*Two-way clustering:* The two-way option allows simultaneous
clustering in R-mode and Q-mode.

*Stratigraphically constrained clustering:* This option
will allow only adjacent rows or groups of rows to be joined
during the agglomerative clustering procedure. May produce
strange-looking (but correct) dendrograms.

*Bootstrapping:* If a number of bootstrap replicates is given (e.g.
100), the columns are subjected to resampling. The percentage of
replicates where each node is still supported is given on the dendrogram.

*All-zeros rows:* Some similarity measures (Dice, Jaccard, Simpson
etc.) are undefined when comparing two all-zero rows. To avoid errors,
especially when bootstrapping sparse data sets, the similarity is set to
zero in such cases.

Neighbour joining cluster analysis

Typical application | Assumptions | Data needed |

Finding hierarchical groupings in multivariate data sets | None | Two or more rows of counted, measured or presence/absence data in one or more variables or categories, or a symmetric similarity or distance matrix. |

Neighbour joining clustering (Saitou & Nei 1987) is an alternative method for hierarchical cluster analysis. The method was originally developed for phylogenetic analysis, but may be superior to UPGMA also for ecological data. In contrast with UPGMA, two branches from the same internal node do not need to have equal branch lengths. A phylogram (unrooted dendrogram with proportional branch lengths) is given.

Distance indices and bootstrapping are as for other cluster analysis (above).

Negative branch lengths are forced to zero, and transferred to the adjacent branch according to Kuhner & Felsenstein (1994).

The tree is by default rooted on the last branch added during tree construction (this is not midpoint rooting). Optionally, the tree can be rooted on the first row in the data matrix (outgroup).

K-means clustering

Typical application | Assumptions | Data needed |

Non-hierarchical clustering of multivariate data into a specified number of groups | None | Two or more rows of counted or measured data in one or more variables |

K-means clustering (e.g. Bow 1984) is a non-hierarchical clustering method. The number of clusters to use is specified by the user, usually according to some hypothesis such as there being two sexes, four geographical regions or three species in the data set.

The cluster assignments are initially random. In an iterative procedure, items are then moved to the cluster which has the closest cluster mean, and the cluster means are updated accordingly. This continues until items are no longer "jumping" to other clusters. The result of the clustering is to some extent dependent upon the initial, random ordering, and cluster assignments may therefore differ from run to run. This is not a bug, but normal behaviour in k-means clustering.
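The iterative procedure can be sketched in plain Python as below. This is a minimal illustration, not the PAST implementation: the function name, the `seed` and `max_iter` parameters, the squared Euclidean distance and the handling of empty clusters (reseeding with a randomly chosen row) are all assumptions.

```python
import random

def k_means(rows, k, seed=None, max_iter=100):
    """Basic k-means: random initial assignment, then move to the nearest mean."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    assign = [rng.randrange(k) for _ in rows]       # random initial clusters
    for _ in range(max_iter):
        means = []
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if not members:                         # empty cluster: reseed it
                members = [rng.choice(rows)]
            means.append([sum(m[v] for m in members) / len(members)
                          for v in range(n_vars)])
        new_assign = [
            min(range(k),
                key=lambda c: sum((r[v] - means[c][v]) ** 2 for v in range(n_vars)))
            for r in rows
        ]
        if new_assign == assign:                    # no item "jumped": stop
            break
        assign = new_assign
    return assign

# Hypothetical data: two well-separated groups of three and four items
rows = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (10.0, 10.1), (9.9, 10.2), (10.2, 9.8), (10.1, 10.0)]
labels = k_means(rows, 2, seed=1)
print(labels)
```

The cluster numbers themselves are arbitrary and may swap between runs with different seeds; only the grouping is meaningful.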

The cluster assignments may be copied and pasted back into the main spreadsheet, and corresponding colors (symbols) assigned to the items using the 'Numbers to colors' option in the Edit menu.

Missing data is supported by column average substitution.

Seriation

Typical application | Assumptions | Data needed |

Stratigraphical or environmental ordering of taxa and localities | None | Presence/absence (0/1) matrix with taxa in rows |

Seriation of an absence-presence matrix using the algorithm described by Brower & Kyle (1988). This method is typically applied to an association matrix with taxa (species) in the rows and populations in the columns. For constrained seriation (see below), columns should be ordered according to some criterion, normally stratigraphic level or position along a presumed faunal gradient.

The seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. There are two algorithms: Constrained and unconstrained optimization. In constrained optimization, only the rows (taxa) are free to move. Given an ordering of the columns, this procedure finds the 'optimal' biozonation, that is, the ordering of taxa which gives the prettiest range plot. Also, in the constrained mode, the program runs a 'Monte Carlo' simulation, generating and seriating 30 random matrices with the same number of occurrences within each taxon, and compares these to the original matrix to see if it is more informative than a random one (this procedure is time-consuming for large data sets).

In the unconstrained mode, both rows and columns are free to move.

Discriminant analysis and Hotelling's T^{2}

Typical application | Assumptions | Data needed |

Testing for separation and equal means of two multivariate data sets | Multivariate normality. Hotelling's test assumes equal covariances. | Two multivariate data sets of measured data, marked with different colors |

Given two sets of multivariate data, an axis is constructed which maximizes the difference between the sets. The two sets are then plotted along this axis using a histogram.

This module expects the rows in the two data sets to be grouped into two sets by coloring the rows, e.g. with black (dots) and red (crosses).

Equality of the means of the two groups is tested by a multivariate analogue to the
*t* test, called *Hotelling's T-squared*, and a *p* value
for this test is given. Normal distribution of the variables is required,
and also that the number of cases is at least two more than the number of
variables.

*Number of constraints:* For correct calculation of the
Hotelling's *p* value, the number of dependent variables (constraints)
must be specified. It should normally be left at 0, but for Procrustes
fitted landmark data use 4 (for 2D) or 6 (for 3D).

Discriminant analysis can be used for visually confirming or rejecting the hypothesis that two species are morphologically distinct. Using a cutoff point at zero (the midpoint between the means of the discriminant scores of the two groups), a classification into two groups is shown in the "View numbers" option. The percentage of correctly classified items is also given.

*Discriminant function:* New specimens can be classified
according to the discriminant function. Take the inner product between
the measurements on the new specimen and the given discriminant function
factors, and then subtract the given offset value.
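The calculation for a new specimen amounts to a single inner product, as in this Python sketch. The factor values, the offset and the measurements are made up for illustration; in practice the factors and the offset are read from the program output. Which of the two groups lies on the positive side of the zero cutoff depends on the analysis and is not assumed here.

```python
def discriminant_score(measurements, factors, offset):
    """Inner product of measurements and discriminant factors, minus the offset."""
    return sum(m * f for m, f in zip(measurements, factors)) - offset

# Hypothetical discriminant function factors and offset
factors = [0.8, -1.5, 2.0]
offset = 3.2

new_specimen = [4.0, 1.0, 1.5]
score = discriminant_score(new_specimen, factors, offset)
side = "positive" if score > 0 else "negative"   # compare with the cutoff at zero
print(score, side)
```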

*Leave one out (cross-evaluation):* An option is available
for leaving out one row (specimen) at a time, re-computing the
discriminant analysis with the remaining specimens, and classifying the
left-out row accordingly (as given by the Score value).

Beware: The combination of
discriminant analysis and Hotelling's T^{2} test is sometimes
misused. One should not be surprised to find a statistically significant
difference between two samples which have been chosen with the objective
of maximizing distance in the first place! The division into two groups
should ideally be based on independent evidence.

See Davis (1986) for details.

Missing data is supported by column average substitution.

Paired Hotelling's T^{2}

Typical application | Assumptions | Data needed |

Testing for equal means of a paired multivariate data set | Multivariate normality. | A multivariate data set of paired measured data, marked with different colors |

The paired Hotelling's test expects two groups of multivariate data, marked with different colours. Rows within each group must be consecutive. The first row of the first group is paired with the first row of the second group, the second row is paired with the second, etc.

Missing data is supported by column average substitution.

Permutation test for two multivariate groups

Typical application | Assumptions | Data needed |

Testing for equal means of two multivariate data sets | The two groups have similar distributions (variances) | Two multivariate data sets of measured data, marked with different colors |

This module expects the rows in the two data sets to be grouped into two sets by coloring the rows, e.g. with black (dots) and red (crosses).

Equality of the means of the two groups is tested using permutation with 2000 replicates (can be changed by the user), and the Mahalanobis squared distance measure. The permutation test is an alternative to Hotelling's test when the assumptions of multivariate normal distributions and equal covariance matrices do not hold.

Missing data is supported by column average substitution.

Multivariate normality test

Typical application | Assumptions | Data needed |

Testing for multivariate normality | Departures from multivariate normality detectable as departure from multivariate skewness or kurtosis | One multivariate sample of measured data, with variables in columns |

Multivariate normality is assumed by a number of multivariate tests. PAST computes Mardia's multivariate skewness and kurtosis, with tests based on chi-squared (skewness) and normal (kurtosis) distributions. A powerful omnibus (overall) test due to Doornik & Hansen (1994) is also given. If at least one of these tests shows departure from normality (small p value), the distribution is significantly non-normal. Sample size should be reasonably large (>50), although a small-sample correction is also attempted for the skewness test.

Box's M test

Typical application | Assumptions | Data needed |

Testing for equivalence of the covariance matrices for two data samples | Multivariate normality | Two multivariate samples of measured data, or two (square) variance-covariance matrices, marked with different colors. |

This test is rather specialized, testing for the equivalence of the covariance matrices for two multivariate samples. You can use either two original multivariate samples from which the covariance matrices are automatically computed, or two specified variance-covariance matrices. In the latter case, you must also specify the sizes (number of individuals) of the two samples.

The Box's M statistic is given, together with a significance value based
on a chi-square approximation. Note that this test is supposedly
very sensitive. This means that a high *p* value will be a good,
although informal, indicator of equality, while a highly significant
result (low *p* value) may in practical terms be a somewhat too
sensitive indicator of inequality.

One-way MANOVA and Canonical Variates Analysis

Typical application | Assumptions | Data needed |

Testing for equality of the means of several multivariate samples, and ordination based on maximal separation (multigroup discriminant analysis) | Multivariate normal distribution, similar variances-covariances | Two or more samples of multivariate measured data, marked with different colors. The number of cases must exceed the number of variables. |

One-way MANOVA (Multivariate ANalysis Of VAriance) is the multivariate version
of the univariate ANOVA, testing whether several samples have the same mean.
If you have only two samples, you would perhaps rather use the two-sample
Hotelling's T^{2} test.

Two statistics are provided: Wilks' lambda with its associated Rao's F and the Pillai trace with its approximated F. Wilks' lambda is probably more commonly used, but the Pillai trace may be more robust.

*Number of constraints:* For correct calculation of the
*p* values, the number of dependent variables (constraints)
must be specified. It should normally be left at 0, but for Procrustes
fitted landmark data use 4 (for 2D) or 6 (for 3D).

*Pairwise comparisons (post-hoc):* If the MANOVA shows significant
overall difference between groups, the analysis can proceed by pairwise
comparisons. In PAST, the post-hoc analysis is quite
simple, by pairwise Hotelling's tests. In the post-hoc table, groups are
named according to the row label of the first item in the group. Hotelling's
p values are given above the diagonal, while Bonferroni corrected values
(multiplied by the number of pairwise comparisons) are given below the
diagonal. This Bonferroni corrected test has very little power.
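The Bonferroni correction described above is a simple multiplication, sketched here in Python. The function name and the p values are hypothetical, and capping the corrected values at 1 is an assumption made for the illustration.

```python
def bonferroni(p_values, n_groups):
    """Multiply each pairwise p value by the number of pairwise comparisons."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    return [min(1.0, p * n_comparisons) for p in p_values]   # cap at 1 (assumed)

# Hypothetical uncorrected p values from three of the six pairs among 4 groups
adjusted = bonferroni([0.004, 0.03, 0.2], n_groups=4)
print(adjusted)
```

With four groups there are six pairwise comparisons, so an uncorrected p value must be below about 0.008 to stay under 0.05 after correction, which illustrates the loss of power.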

*Canonical Variates Analysis*

An option under MANOVA, CVA produces a scatter plot of specimens along the two first canonical axes, producing maximal and second to maximal separation between all groups (multigroup discriminant analysis). The axes are linear combinations of the original variables as in PCA, and eigenvalues indicate amount of variation explained by these axes.

Missing data is supported by column average substitution.

One-way ANOSIM

Typical application | Assumptions | Data needed |

Testing for difference between two or more multivariate groups, based on any distance measure | Ranked dissimilarities within groups have equal median and range. | Two or more groups of multivariate data, marked with different colors, or a symmetric similarity or distance matrix with similar groups. |

ANOSIM (ANalysis Of Similarities) is a non-parametric test of significant difference between two or more groups, based on any distance measure (Clarke 1993). The distances are converted to ranks. ANOSIM is normally used for ecological taxa-in-samples data, where groups of samples are to be compared.

In a rough analogy with ANOVA, the test is based on comparing distances
between groups with distances within groups. Let *rb* be the mean rank of
all distances between groups, and *rw* the mean rank of all distances
within groups. The test statistic *R* is then defined as

*R* = (*rb* - *rw*)/(*N*(*N*-1)/4), where *N* is the total number of samples.

Large positive *R* (up to 1) signifies dissimilarity between groups.
The significance is computed by permutation of group membership, with
10,000 replicates (can be changed by the user).
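The test statistic can be computed from a distance matrix and a list of group labels as in the Python sketch below. The function names and the small matrix are hypothetical, tied distances are given their mean rank (an assumption), and the permutation step is left out.

```python
from itertools import combinations

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def anosim_r(distance, groups):
    """R = (rb - rw) / (N(N-1)/4) for a square distance matrix."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([distance[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    rb = sum(between) / len(between)    # mean rank of between-group distances
    rw = sum(within) / len(within)      # mean rank of within-group distances
    return (rb - rw) / (n * (n - 1) / 4)

# Hypothetical distances: every within-group distance is smaller than
# every between-group distance, so R reaches its maximum
distance = [[0, 1, 5, 6],
            [1, 0, 7, 8],
            [5, 7, 0, 2],
            [6, 8, 2, 0]]
groups = [0, 0, 1, 1]
print(anosim_r(distance, groups))   # 1.0
```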

Pairwise ANOSIMs between all pairs of groups are provided as a post-hoc test. The Bonferroni corrected p values are very conservative.

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Two-way ANOSIM

Typical application | Assumptions | Data needed |

Testing for difference between multivariate groups, based on any distance measure. The groups are organized into two factors of at least two levels each. | Ranked dissimilarities within groups have equal median and range. | First two columns: Levels of the two factors, coded with integers. Consecutive columns: Multivariate data, or a symmetric similarity or distance matrix. |

The two-way ANOSIM in PAST uses the crossed design (Clarke 1993). For more information see one-way ANOSIM, but note that groups (levels) are not coded with colors but with integer numbers in the first two columns.

One-way NPMANOVA

Typical application | Assumptions | Data needed |

Testing for difference between two or more multivariate groups, based on any distance measure | The groups have similar distributions (similar variances) | Two or more groups of multivariate data, marked with different colors, or a symmetric similarity or distance matrix with similar groups. |

NPMANOVA (Non-Parametric MANOVA) is a non-parametric test of significant difference between two or more groups, based on any distance measure (Anderson 2001). NPMANOVA is normally used for ecological taxa-in-samples data, where groups of samples are to be compared, but may also be used as a general non-parametric MANOVA.

NPMANOVA calculates an *F* value in analogy with ANOVA. In fact, for
univariate data sets and the Euclidean distance measure, NPMANOVA is
equivalent to ANOVA and gives the same *F* value.
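The equivalence with ANOVA can be checked with a small Python sketch. It assumes the sums-of-squares formulation of Anderson (2001), in which the total and within-group sums of squares are obtained from squared pairwise distances; the function name and the data are hypothetical, and the permutation step is omitted.

```python
from itertools import combinations

def npmanova_f(distance, groups):
    """Pseudo-F from a square distance matrix and a list of group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N
    ss_total = sum(distance[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: the same within each group, divided by its size
    ss_within = 0.0
    for g in labels:
        members = [i for i in range(n) if groups[i] == g]
        ss_within += sum(distance[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

# Hypothetical univariate data with Euclidean (absolute) distances
values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
groups = [0, 0, 0, 1, 1, 1]
distance = [[abs(x - y) for y in values] for x in values]
print(npmanova_f(distance, groups))
```

For these data a classical one-way ANOVA gives a between-groups sum of squares of 24 on 1 degree of freedom and a within-groups sum of squares of 4 on 4 degrees of freedom, so F = 24, which the sketch reproduces.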

The significance is computed by permutation of group membership, with 10,000 replicates (can be changed by the user).

Pairwise NPMANOVAs between all pairs of groups are provided as a post-hoc test. The Bonferroni corrected p values are very conservative.
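To see why the correction is conservative, consider the usual form of the Bonferroni adjustment, in which each raw p value is multiplied by the number of comparisons and capped at 1. The sketch below assumes that form; the helper name `bonferroni_pairwise` and the raw p values are invented for illustration and are not PAST output.

```python
from itertools import combinations

def bonferroni_pairwise(p_values):
    """Multiply each raw p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

# Three groups give three pairwise comparisons (hypothetical raw p values).
group_names = ["A", "B", "C"]
raw = dict(zip(combinations(group_names, 2), [0.004, 0.03, 0.4]))
corrected = bonferroni_pairwise(raw)
```

With k groups there are k(k-1)/2 pairs, so the multiplier grows quickly: with 10 groups every raw p value is multiplied by 45.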

Missing data is supported by pairwise deletion (not for the Raup-Crick, Rho and user-defined indices).

Mantel test

Typical application | Assumptions | Data needed |

Testing for correlation between two distance matrices, typically geographical or stratigraphic distance and e.g. distance between species compositions of samples (is there a spatial structure in a multivariate data set?) | None | Two groups of multivariate data, marked with different colors, or two symmetric distance or similarity matrices. |

The Mantel test is a permutation test for correlation between two distance or similarity matrices. In PAST, these matrices can also be computed automatically from two sets of original data. The first matrix must be above the second matrix in the spreadsheet, and the rows of the two matrices must be marked with two different colors. The two matrices must have the same number of rows. If they are distance or similarity matrices, they must also have the same number of columns.
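The idea behind the test can be sketched in a few lines of Python. This is a generic one-tailed Mantel test, not PAST's implementation: the function name `mantel`, the use of the Pearson correlation between the upper triangles, and the `n_perm` and `seed` parameters are assumptions made for this example.

```python
import numpy as np

def mantel(d1, d2, n_perm=9999, seed=0):
    """Correlation between two square matrices, with a permutation p value."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n = d1.shape[0]
    upper = np.triu_indices(n, k=1)  # each pair of objects counted once
    r_obs = np.corrcoef(d1[upper], d2[upper])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Permute rows and columns of one matrix together, so that it
        # stays a valid distance matrix for the relabelled objects.
        perm = rng.permutation(n)
        shuffled = d1[np.ix_(perm, perm)]
        if np.corrcoef(shuffled[upper], d2[upper])[0, 1] >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Five samples along a transect; the second matrix is a scaled copy.
position = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
geo = np.abs(position[:, None] - position[None, :])
r_value, p_value = mantel(geo, 2.0 * geo, n_perm=999)
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations; shuffling individual cells would destroy the matrix structure and give a misleading null distribution.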

SIMPER

Typical application | Assumptions | Data needed |

Identifying taxa primarily responsible for differences between two or more groups of ecological samples (abundances) | Independent samples | Two or more groups of multivariate abundance samples (taxa in columns), marked with different colors. |

SIMPER (Similarity Percentage) is a simple method for assessing which taxa are primarily responsible for an observed difference between groups of samples (Clarke 1993). The overall significance of the difference is often assessed by ANOSIM. The Bray-Curtis similarity measure is implicit to SIMPER.

If more than two groups are selected, you can either compare two groups (pairwise) by choosing from the lists of groups, or you can pool all samples to perform one overall multi-group SIMPER.
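For a pairwise comparison, the calculation can be sketched as follows: the Bray-Curtis dissimilarity between every pair of samples taken from the two groups is split into per-taxon terms, and these terms are averaged over all pairs (Clarke 1993). The function name `simper` and the example counts are hypothetical, and the sketch does not guard against pairs of empty samples.

```python
import numpy as np

def simper(group1, group2):
    """Average per-taxon contribution to Bray-Curtis dissimilarity."""
    g1 = np.asarray(group1, dtype=float)  # samples in rows, taxa in columns
    g2 = np.asarray(group2, dtype=float)
    # All between-group sample pairs: arrays of shape (n1, n2, taxa).
    diff = np.abs(g1[:, None, :] - g2[None, :, :])
    total = (g1[:, None, :] + g2[None, :, :]).sum(axis=2, keepdims=True)
    contrib = (diff / total).mean(axis=(0, 1))
    percent = 100.0 * contrib / contrib.sum()
    order = np.argsort(contrib)[::-1]  # taxa by decreasing contribution
    return order, contrib, percent

group_a = [[10, 0, 5], [8, 1, 5]]
group_b = [[0, 10, 5], [1, 9, 5]]
order, contrib, percent = simper(group_a, group_b)
```

The entries of `contrib` sum to the mean between-group Bray-Curtis dissimilarity, so `percent` expresses each taxon's share of the overall difference. In this example the third taxon has the same count in every sample and therefore contributes nothing.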

CABFAC factor analysis

Typical application | Assumptions | Data needed |

Factor analysis of abundance data, optionally with associated environmental data | None | Counted data with samples in rows, taxa in columns. The first column can optionally contain associated environmental data (e.g. temperature) |

This module implements the classical Imbrie & Kipp (1971) method of factor analysis and environmental regression (CABFAC and REGRESS).

The program asks whether the first column contains environmental data. If not, a simple factor analysis with Varimax rotation will be computed on row-normalized data. If environmental data are included, the factors will be regressed onto the environmental variable using the second-order (parabolic) method of Imbrie & Kipp, with cross terms. PAST then reports the RMA regression of original environmental values against values reconstructed from the transfer function. You can also save the transfer function as a text file that can later be used for reconstruction of palaeoenvironment (see below). This file contains:

- Number of taxa
- Number of factors
- Factor scores for each taxon
- Number of regression coefficients
- Regression coefficients (second- and first-order terms, and intercept)

Calibration from CABFAC

Typical application | Assumptions | Data needed |

Reconstructing environmental parameters (e.g. temperature) from a CABFAC transfer function and fossil species abundances. | Faunal responses to the environment constant through time and space | Samples in rows, taxa in columns. The program will also ask for a previously constructed transfer function file |

This module will reconstruct a (single) environmental parameter from taxa-in-samples abundance data. The program will also ask for a CABFAC transfer function file, as previously made by CABFAC factor analysis. The set of taxa (columns) must be identical in the spreadsheet and the transfer function file. Note that there seems to be a numerical instability that adds some noise to the solution; this is being looked into.

Calibration from optima

Typical application | Assumptions | Data needed |

Reconstructing environmental parameters (e.g. temperature) from species abundance and species optimum data. | None | Samples in rows, taxa in columns. First three rows: Optima, tolerances, peak abundances. Subsequent rows: Abundance counts. |

The first three rows can be generated from known (Recent) abundance and environmental data by the "Species packing" option in the Model menu. The third row (peak abundance) is not used, and the second row (tolerance) is used only when the "Equal tolerances" box is not ticked.

The algorithm is weighted averaging, optionally with tolerance weighting, according to ter Braak & van Dam (1989).
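A minimal sketch of weighted averaging on a table laid out as described above may help. The function name `calibrate_from_optima` and the example numbers are hypothetical, and the tolerance weighting uses the common inverse squared tolerance form; whether PAST applies any further step (such as deshrinking) is not stated here and is not attempted.

```python
import numpy as np

def calibrate_from_optima(table, equal_tolerances=True):
    """Weighted-averaging estimate of the environment for each sample row."""
    table = np.asarray(table, dtype=float)
    optima = table[0]       # row 1: optimum of each taxon
    tolerances = table[1]   # row 2: tolerance of each taxon
    counts = table[3:]      # row 3 (peak abundance) is not used
    if equal_tolerances:
        weights = counts
    else:
        # Taxa with narrow tolerances get more weight.
        weights = counts / tolerances ** 2
    return (weights * optima).sum(axis=1) / weights.sum(axis=1)

table = [
    [10.0, 20.0],  # optima
    [1.0, 2.0],    # tolerances
    [50.0, 50.0],  # peak abundances (ignored)
    [1.0, 1.0],    # sample 1 counts
    [3.0, 1.0],    # sample 2 counts
]
plain = calibrate_from_optima(table)
weighted = calibrate_from_optima(table, equal_tolerances=False)
```

Here `plain` is 15 and 12.5 for the two samples. With tolerance weighting the first sample moves to 12, because the taxon with the optimum at 10 has the narrower tolerance and so pulls the estimate towards its own optimum.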

Modern Analog Technique

Typical application | Assumptions | Data needed |

Reconstructing paleoenvironmental parameters (e.g. temperature) from fossil counts and modern data. | Fossil associations are ecologically comparable to modern ones | Samples in rows, taxa in columns. First column contains environmental data - downcore samples have question marks in this column. All modern samples in first rows, then all downcore samples. |

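The Modern Analog Technique works by finding modern samples with faunal associations close to those of the downcore samples. The environmental values of these modern analogs are then used to estimate the environment for each downcore sample.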

**Parameters to set:**

- Distance measure: Several distance measures commonly used in MAT are available. "Squared chord" has become the standard choice in the literature.
- Weighting: When several modern analogs are linked to one downcore sample, their environmental values can be weighted equally, inversely proportional to faunal distance, or inversely proportional to ranked faunal distance.
- Distance threshold: Only modern analogs closer than this threshold are used. A default value is given, which is the tenth percentile of distances between all sample pairs in the modern data. The "Dissimilarity distribution" histogram may be useful when selecting this threshold.
- N analogs: This is the maximum number of modern analogs used for each downcore sample.
- Jump method (on/off): For each downcore sample, modern samples are sorted by ascending distance. When the distance increases by more than the selected percentage, the subsequent modern analogs are discarded.
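How some of these parameters interact can be shown with a simplified sketch for a single downcore sample. It uses the squared chord distance on proportions, an optional distance threshold, a maximum number of analogs, and equal or inverse-distance weighting. Ranked-distance weighting and the jump method are left out, and the names `squared_chord` and `mat_estimate` and their arguments are assumptions for this example rather than anything exposed by PAST.

```python
import numpy as np

def squared_chord(sample, modern):
    """Squared chord distance from one sample to each modern sample."""
    return ((np.sqrt(sample) - np.sqrt(modern)) ** 2).sum(axis=-1)

def mat_estimate(modern_counts, modern_env, sample_counts,
                 n_analogs=3, threshold=None, inverse_distance=False):
    """Estimate the environment of one downcore sample from modern analogs."""
    modern = np.asarray(modern_counts, dtype=float)
    modern = modern / modern.sum(axis=1, keepdims=True)  # to proportions
    sample = np.asarray(sample_counts, dtype=float)
    sample = sample / sample.sum()
    env = np.asarray(modern_env, dtype=float)
    dist = squared_chord(sample, modern)
    nearest = np.argsort(dist)[:n_analogs]   # closest analogs first
    if threshold is not None:
        nearest = nearest[dist[nearest] < threshold]
    if nearest.size == 0:
        return float("nan")                   # no acceptable analog
    if inverse_distance:
        weights = 1.0 / np.maximum(dist[nearest], 1e-12)
    else:
        weights = np.ones(nearest.size)
    return float((weights * env[nearest]).sum() / weights.sum())

modern_counts = [[10, 0], [0, 10], [5, 5]]
modern_env = [5.0, 25.0, 15.0]
downcore = [9, 1]
two_analogs = mat_estimate(modern_counts, modern_env, downcore, n_analogs=2)
strict = mat_estimate(modern_counts, modern_env, downcore, threshold=0.15)
```

With two analogs and equal weights the estimate is the mean of the two closest modern values (5 and 15). Tightening the threshold to 0.15 leaves only the closest analog, so the estimate falls back to its value of 5.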

**Cross validation**

The scatter plot and *R*² value show the results of
a leave-one-out (jackknifing) cross-validation within the
modern data. The *y=x* line is shown in red. This only partly
reflects the "quality" of the method, as it gives little information
about the accuracy of downcore estimation.
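The leave-one-out procedure can be sketched as follows: each modern sample is removed in turn, its environmental value is estimated from the remaining modern samples, and the estimates are compared with the known values. This sketch assumes the squared chord distance with equally weighted analogs, and takes *R*² to be the squared Pearson correlation between observed and predicted values; the function name `loo_cross_validate` is hypothetical.

```python
import numpy as np

def loo_cross_validate(modern_counts, modern_env, n_analogs=2):
    """Leave-one-out predictions and R^2 within the modern data set."""
    props = np.asarray(modern_counts, dtype=float)
    props = props / props.sum(axis=1, keepdims=True)
    env = np.asarray(modern_env, dtype=float)
    root = np.sqrt(props)
    predicted = np.empty(len(env))
    for i in range(len(env)):
        # Squared chord distance from sample i to every modern sample.
        dist = ((root - root[i]) ** 2).sum(axis=1)
        dist[i] = np.inf  # leave sample i out of its own prediction
        nearest = np.argsort(dist)[:n_analogs]
        predicted[i] = env[nearest].mean()
    r = np.corrcoef(env, predicted)[0, 1]
    return predicted, r ** 2

modern_counts = [[10, 0], [8, 2], [6, 4], [4, 6], [2, 8], [0, 10]]
modern_env = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
predicted, r_squared = loo_cross_validate(modern_counts, modern_env)
```

Samples at the ends of the gradient can only find analogs on one side, so their predictions are pulled towards the middle. This is one reason a good cross-validation result within the modern data does not guarantee accurate downcore estimates.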
