4.6 Randomization tests

Randomization tests were introduced by R.A. Fisher in 1935. The goal of a randomization test is to help discovering some regularity (e.g. a non random property or pattern) in a complicated data set. A classic example is to take a pack of poker play-cards and check whether they were well shuffled by our poker opponent. According to the hypothesis testing terminology, randomization tests make the null hypothesis of randomness and test this hypothesis against data. In order to test the randomness hypothesis, several random transformation of data are generated.

Suppose we are interested in some property which is related to the order of data. Let the original data set $D_ N={ x_1,\dots,x_ N} $ and $t(D_ N)$ some statistic which is a function of the order in the data $D_ N$. We want to test if the value of $t(D_N)$ is due only to randomness.

  • An empirical distribution is generated by scrambling (or shuffling) $R$ times the $N$ elements at random. For example the $j$th, $j=1,\dots ,R$ scrambled data set could be $D^{(j)}_ N={ x_{23},x_4,x_{343},\dots } $

  • For each of the $j$th scrambled sets we compute a statistic $t^{(i)}$. The resulting distribution is called the resampling distribution.

  • Suppose that the value of $t(D_ N)$ is only exceeded by $k$ of the $R$ values of the resampling distribution.

  • The probability of observing $t(D_ N)$ under the null hypothesis (i.e. randomness) is only $p_ t=k/R$. The null hypothesis can be accepted/rejected on the basis of $p_ t$.

The quantity $p_ t$ plays the role of nonparametric p-value (Section 3.11.3) and it can be used, like its parametric counterpart, both to assess the evidence of the null hypothesis and to perform a decision test (e.g. refuse to play if we think cards were not sufficiently shuffled).

A bioinformatics example

Suppose we have a DNA sequence and we think that the number of repeated sequences (e.g. AGTAGTAGT) in the sample is greater than expected by chance. Let $t=17$ be the number of repetitions. How to test this hypothesis? Let us formulate the null hypothesis that the base order is random. We can construct an empirical distribution under the null hypothesis by taking the original sample and randomly scrambling the bases $R=1000$ times. This creates a sample with the same base frequencies as the original sample but where the order of bases is assigned at random. Suppose that only $5$ of the $1000$ randomized samples has a number of repetition higher or equal than $17$. The p-value (i.e. the probability of seeing $t=17$ under the null hypothesis) which is returned by the randomization test amounts to $0.005$. You can run the randomization test by using the following R script.

R code
seq<-"ATGGTCAAATTAACTTCAATCGCCGCTGGTGTCGCTGCCATCGCTGCTACTGCTTCT
GCAACCACCACTCTAGCTCAATCTGACGAAAGAGTCAACTTGGTGGAATTGGGTGTCTACGTCT
CTGATATCAGAGCTCACTTAGCCCAATACTACATGTTCCAAGCCGCCCACCCAACTGAAACCT
ACCCAGTCGAAGTTGCTGAAGCCGTTTTCAACTACGGTGACTTCACCACCATGTTGACCGGTATT
GCTCCAGACCAAGTGACCAGAATGATCACCGGTGTTCCAAGTGGTACTCCAGCAGATTAAAGCCAG
CCATCTCCAGTGCTCTAAGTCCAAGGACGGTATCTACACTATCGCAAACTAAG"


count.motif<-function(seq,mot){
  N<-nchar(seq)
  M<-nchar(mot)
  cnt<-0
  for (i in 1:N){
    cnt<-cnt+length(grep(mot,substr(seq,i,i+M-1)))
  }
  cnt
}

R<-2000
alpha<-0.05
l<-unlist(strsplit(seq,split=""))
mot<-"AAG" ## test also the motif AA
count<-count.motif(seq,mot)
print(paste ("Motif ", mot, " encountered ", count," times"))
count.perm<-array(0,c(R,1))
for (r in 1:R){
  permut.l<-sample(l)
  count.perm[r]<-count.motif(paste(permut.l,collapse=""),mot)
}

p<- sum(count.perm>=count)/R
hist(count.perm,main="Histogram of counting in randomization")
if (p < alpha){
  print(paste ("Motif ", mot, " significantly represented"))
}else{
    print(paste("Motif ", mot, " NOT significantly represented"))
}

$\bullet $