The Page test is a non-parametric test for monotonically ordered differences in ranks. It can be used to assess the statistical evidence for an increase in the ordinal ranks between k ‘treatments’ (conditions or generations), based on N independent replications for each treatment. The ordering of the treatments 1…k along which we expect the monotonic effect has to be specified a priori.
Let mi be the mean ordinal rank of the measure of interest obtained for treatment i; the null hypothesis of the test (shared with many other tests, e.g. Friedman’s) is then m1 = m2 = … = mk, i.e. there is no difference between the expected ranks of the k conditions. For some reason, the original formulation of the alternative hypothesis given in Page (1963) is m1 > m2 > … > mk.
Later papers and textbook entries correctly point out that the alternative hypothesis is actually m1 ≤ m2 ≤ … ≤ mk, where at least one of the inequalities has to be strict (Siegel and Castellan 1988; Hollander and Wolfe 1999, 284; Van De Wiel and Di Bucchianico 2001, 143). This means that strong evidence for just a single step-wise change in the mean ranks, e.g. m1 < m2 = … = mk, can be sufficient for the test to reject the null hypothesis.
As the alternative hypothesis shows, the Page test is not a ‘trend test’ in any meaningful way, since it does not test for successive or cumulative changes in ranks. (Note how the original paper speaks of k treatments rather than generations, i.e. the Page test was not designed for what are essentially dependent measures.)
It also cannot show whether differences between individual conditions/generations are significant, since, as a non-parametric test, it only considers ranks, not absolute changes in the underlying measure. Both points will be demonstrated below using semi-randomly generated data sets.
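For concreteness, the statistic behind the test is Page’s L: within each replication the k conditions are ranked, and every rank is weighted by its predicted position before summing. A minimal sketch in base R (pageL is a hypothetical helper name for illustration; cultevo’s page.test computes this, along with p values, for you):

```r
# Page's L statistic: each row of `ranks` holds the within-replication
# ranks of the k conditions, in the a priori predicted order. L weights
# every rank by its predicted position, so it is maximal when ranks
# increase monotonically along that order.
pageL <- function(ranks) {
  k <- ncol(ranks)
  sum(sweep(ranks, 2, 1:k, "*"))
}

# perfectly monotonic data: N = 4 replications of the ranking 1..10
pageL(matrix(rep(1:10, 4), nrow = 4, byrow = TRUE))
```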
To test the sensitivity of the test to a single step-wise difference across conditions we can take a typical sample set of N = 4 replications with k = 10 levels each and fix the very first position to always be ranked first, with all successive ranks being randomly shuffled. This is equivalent to the first generation doing badly at a task, with all successive generations outperforming the first one, but no cumulative improvement between them.
# make results reproducible
set.seed(1000)
# shuffle each block of ranks passed as an argument, keeping the blocks
# themselves in the given order
pseudorandomranks <- function(...)
  unlist(lapply(list(...), function(p) if (length(p) > 1) sample(p) else p))
# generation 1 always ranked lowest, generations 2-10 in random order
lowestthenrandom <- function()
  pseudorandomranks(1, 2:10)
# example ordering
t(replicate(4, lowestthenrandom()))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 5 4 7 9 6 8 2 3 10
## [2,] 1 7 6 10 2 8 9 3 4 5
## [3,] 1 8 3 6 7 4 5 9 2 10
## [4,] 1 10 3 8 5 2 9 4 7 6
Given this semi-random data generation function, we can now create a large number of data sets, compute the empirical distribution of significance levels according to the Page test, and see how it varies with the number of replications N.
library(cultevo)
sampleLs <- function(datafun, nrepetitions) {
  # tally how many of the generated datasets reach each significance level
  ps <- list("0.001" = 0, "0.01" = 0, "0.05" = 0, "NS" = 0)
  for (i in seq(nrepetitions)) {
    p <- page.test(datafun(), verbose=FALSE)$p.value
    if (p <= 0.001) {
      p <- "0.001"
    } else if (p <= 0.01) {
      p <- "0.01"
    } else if (p <= 0.05) {
      p <- "0.05"
    } else {
      p <- "NS"
    }
    ps[[p]] <- ps[[p]] + 1
  }
  # return the proportion of datasets per significance level
  unlist(ps) / nrepetitions
}
# generate `nrepetitions` datasets of N replications each and tabulate
# the resulting distribution of p values
sampleps <- function(testfun, N, datafun, nrepetitions=1000)
  testfun(function() t(replicate(N, datafun())), nrepetitions)
Generating 1000 datasets like the one above, which really only exhibit a single point change in the distribution of mean ranks, we get a significant result about half of the time:
sampleps(sampleLs, N=4, lowestthenrandom)
## 0.001 0.01 0.05 NS
## 0.028 0.156 0.287 0.529
Increasing the number of replications to N = 10, still only assuming that the first generation performs differently from all later ones, shifts the distribution considerably:
sampleps(sampleLs, N=10, lowestthenrandom)
## 0.001 0.01 0.05 NS
## 0.279 0.372 0.228 0.121
The effect is even stronger when the single change point occurs closer to the middle of the ordered conditions. Based on 1000 randomly generated datasets in which ranks 1 and 2 are shuffled across the first two positions and ranks 3–10 are shuffled across the remaining positions, again with just N = 4 replications, we obtain the following distribution of p values:
lowesttworandom <- function() pseudorandomranks(1:2, 3:10)
sampleps(sampleLs, N=4, lowesttworandom)
## 0.001 0.01 0.05 NS
## 0.442 0.401 0.148 0.009
The test is so sensitive to any evidence, even from just a single change point, for a shift in the a priori suspected direction that it is largely unaffected by evidence for a consistent trend in the opposite direction. This can be seen in data sets where more than half of the pairwise differences between ranks indicate a downward trend:
# abrupt upwards jump after the first three observations, but the
# remaining seven observations exhibit a consistent downwards trend
upwardsjumpdownwardstrend <- function()
  pseudorandomranks(1:3, 10, 9, 8, 7, 6, 5, 4)
sampleps(sampleLs, N=4, upwardsjumpdownwardstrend)
## 0.001 0.01 0.05 NS
## 0.000 0.000 0.999 0.001
# start around the middle, rise slightly, then a sudden drop
# followed by an extreme upwards jump
updownup <- function() pseudorandomranks(3:4, 5:6, 1:2, 7:8)
sampleps(sampleLs, N=4, updownup)
## 0.001 0.01 0.05 NS
## 0 0 1 0
It’s not 1963 anymore, so everybody has a computer, and probably some more concrete expectations about the development of their (presumably continuous) measure of interest. Will its value rise indefinitely across conditions/generations, or is there a ceiling where it will level out? Do you have an idea of the value at which it will level out? Will it rise linearly between conditions until it hits its maximum? Logarithmically? Exponentially? All of these are specific hypotheses corresponding to specific models that can be fit and then compared based on your data (Winter and Wieling 2016).
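As an illustration of this model-comparison approach (my own sketch on simulated data; the variable names `generation` and `measure` are made up and do not come from the package), one can fit competing growth curves to a continuous measure and compare them directly:

```r
# simulate a continuous measure across 10 generations with 4 replications
# each, rising logarithmically plus noise (purely illustrative data)
set.seed(1)
generation <- rep(1:10, each = 4)
measure <- 2 * log(generation) + rnorm(length(generation), sd = 0.5)

# two competing hypotheses about the shape of the trend
linear <- lm(measure ~ generation)
logarithmic <- lm(measure ~ log(generation))

# lower AIC = better trade-off between fit and model complexity
AIC(linear, logarithmic)
```

Unlike the Page test, a comparison like this speaks to the actual shape and size of the change, not just the ordering of ranks.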
If you are simply looking for other non-parametric tests for sequential (or otherwise temporally dependent) data, the seasonal Kendall test (Hirsch, Slack, and Smith 1982; Gilbert 1987; Gibbons, Bhaumik, and Aryal 2009) takes seasonal effects on environmental measurements into account by computing the Mann-Kendall test on each of k seasons/months separately and then combining the individual test results. Since the order of the individual seasons is not actually taken into account (it only is in a later version of the test, Hirsch and Slack (1984)), the test is essentially a within-subject version that combines the results of k independent Mann-Kendall tests into one to increase statistical power (Gibbons, Bhaumik, and Aryal 2009, 211). The test has in fact also been used to test for trends across different geographic sample locations rather than seasons (Helsel and Frans 2006). The seasonal Kendall test’s alternative hypothesis is “a monotone trend in one or more seasons” (Hirsch and Slack 1984, 728).
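A minimal sketch of that combination step (my own illustration in base R, assuming no tied values and omitting the continuity correction; mannkendallS and seasonalkendall are hypothetical helper names, not functions from any package):

```r
# Mann-Kendall S statistic of one temporally ordered series: the number
# of later-minus-earlier differences that are positive, minus the number
# that are negative
mannkendallS <- function(x) {
  d <- outer(x, x, "-")  # d[i, j] = x[i] - x[j]
  sum(sign(d[lower.tri(d)]))
}

# seasonal Kendall: compute S separately per season, then combine the
# statistics and their (no-ties) variances into a single z score
seasonalkendall <- function(x, season) {
  ss <- tapply(x, season, mannkendallS)
  ns <- tapply(x, season, length)
  vars <- ns * (ns - 1) * (2 * ns + 5) / 18
  z <- sum(ss) / sqrt(sum(vars))
  2 * pnorm(-abs(z))  # two-sided p value
}

# four 'seasons', each with a perfectly increasing series of 10 measurements
seasonalkendall(rep(1:10, 4), rep(1:4, each = 10))
```

Because the per-season statistics are simply summed, a monotone trend in any one season can drive the combined z score, which is exactly why the alternative hypothesis is phrased as a trend “in one or more seasons”.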
This tutorial can be cited as:
@techreport{Stadler2017,
  author = {Stadler, Kevin},
  title = {{The Page test is not a trend test}},
  url = {https://kevinstadler.github.io/cultevo/articles/page.test.html},
  year = {2017}
}