Applied Statistics for Bioinformatics using R
Wim P. Krijnen
November 10, 2009
Statistical hypothesis testing consists of hypotheses, distributional assump-
tions, and decisions (conclusions). The hypotheses pertain to the outcome
of a biological experiment and are always formulated in terms of population
values of parameters. Statistically, the outcomes of experiments are seen as
realizations of random variables. The latter are assumed to have a certain
suitable distribution which is seen as a statistical model for outcomes of an
experiment. Then a statistic is formulated (e.g. a t-value) which is treated
both as a function of the random variables and as a function of the data
values. By comparing the distribution of the statistic with the value of the
statistic, the p-value is computed and compared to the level of significance.
A large p-value indicates that the model fits the data well and that the as-
sumptions as well as the null-hypothesis are correct with large probability.
However, a low p-value indicates, under the validity of the distributional as-
sumptions, that the outcome of the experiment is so unlikely that this causes
a sufficient amount of doubt to the researcher to reject the null hypothesis.
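As a small illustration of this last step (with made-up numbers, not taken from any example in this chapter), a two-sided p-value for an observed t-value can be computed directly from the t distribution:

```r
# Illustrative sketch with hypothetical values:
# two-sided p-value for an observed t-value.
t.value <- 2.1                        # observed statistic (made up)
df <- 10                              # degrees of freedom (made up)
p.value <- 2 * pt(-abs(t.value), df)  # probability mass in the two tails
p.value
```

Here the p-value exceeds 0.05 (for 10 degrees of freedom the two-sided critical value is about 2.228), so at that significance level the null hypothesis would not be rejected.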
> dat <- matrix(c(5,5,5,5),2,byrow=TRUE)
> chisq.test(dat)
Pearson’s Chi-squared test with Yates’ continuity correction
data: dat
X-squared = 0.2, df = 1, p-value = 0.6547
Since the p-value is larger than the significance level, the null hypothesis of
independence is not rejected.
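The reported statistic can be verified by hand. With all marginal totals equal to 10, each expected cell count under independence is 5, and the Yates-corrected statistic is the sum of (|O - E| - 0.5)^2 / E over the four cells:

```r
O <- matrix(c(5, 5, 5, 5), 2, byrow = TRUE)   # observed counts
E <- outer(rowSums(O), colSums(O)) / sum(O)   # expected counts: all 5
X2 <- sum((abs(O - E) - 0.5)^2 / E)           # Yates-corrected statistic
X2                       # 0.2, as reported by chisq.test
1 - pchisq(X2, df = 1)   # p-value 0.6547
```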
Suppose that for another cutoff value we obtain 8 true positives (tp), 2
false positives (fp), 8 true negatives (tn), and 2 false negatives (fn). Then
testing independence yields the following.
> dat <- matrix(c(8,2,2,8),2,byrow=TRUE)
> chisq.test(dat)
Pearson’s Chi-squared test with Yates’ continuity correction
data: dat
X-squared = 5, df = 1, p-value = 0.02535
Since the p-value is smaller than the significance level, the null hypothesis of
independence is rejected.
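For this table it is instructive to compare the statistic with and without Yates' continuity correction; the correction shrinks each |O - E| by 0.5 and thereby makes the test more conservative:

```r
dat <- matrix(c(8, 2, 2, 8), 2, byrow = TRUE)
E <- outer(rowSums(dat), colSums(dat)) / sum(dat)  # expected counts: all 5
sum((dat - E)^2 / E)              # uncorrected statistic: 7.2
sum((abs(dat - E) - 0.5)^2 / E)   # Yates-corrected statistic: 5
```

The corrected value 5 is the X-squared reported above; `chisq.test(dat, correct=FALSE)` would report the larger uncorrected value.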
Example 2. In 1866 Mendel observed, in a large number of experiments,
frequencies of characteristics of different kinds of seed and their offspring.
In particular, this yielded the frequencies 5474 and 1850 for the seed shape
of ornamental sweet peas. A crossing of B and b yields offspring BB, Bb and
bb with probabilities 0.25, 0.50, and 0.25. Since Mendel could not distinguish Bb
from BB, his observations theoretically occur with probabilities 0.75 (BB and
Bb) and 0.25 (bb). To test the null hypothesis H0 : (π1 , π2 ) = (0.75, 0.25)
against H1 : (π1 , π2 ) ≠ (0.75, 0.25), we use the chi-squared test, as follows.
> pi <- c(0.75,0.25)
> x <-c(5474, 1850)
> chisq.test(x, p=pi)
Chi-squared test for given probabilities
data: x
X-squared = 0.2629, df = 1, p-value = 0.6081
From the p-value 0.6081, we do not reject the null hypothesis.
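The statistic can also be computed directly from its definition, the sum of (x_i - E_i)^2 / E_i with expected counts E_i = n * pi_i (the name `p0` is used below instead of `pi` to avoid masking R's built-in constant):

```r
x  <- c(5474, 1850)
p0 <- c(0.75, 0.25)        # hypothesized probabilities
E  <- sum(x) * p0          # expected counts: 5493 and 1831
X2 <- sum((x - E)^2 / E)   # 0.2629, as reported
1 - pchisq(X2, df = 1)     # p-value 0.6081
```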
The null hypothesis of the Fisher test is that the odds ratio equals 1; the
alternative hypothesis is that it differs from 1. Suppose that the frequency
of significant oncogenes on Chromosome 1 equals f11 = 300 out of a total of
f12 = 500, and for the genome f21 = 3000 out of a total of f22 = 6000. The
hypothesis that the odds ratio equals one can now be tested as follows.
> dat <- matrix(c(300,500,3000,6000),2,byrow=TRUE)
> fisher.test(dat)
Fisher’s Exact Test for Count Data
data: dat
p-value = 0.01912
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.029519 1.396922
sample estimates:
odds ratio
1.199960
Since the p-value is smaller than the significance level, the null hypothesis
that the odds ratio equals one is rejected. Chromosome 1 thus contains
relatively more significant oncogenes than the genome as a whole.
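The sample (cross-product) odds ratio can be checked by hand. Note that `fisher.test` reports the conditional maximum-likelihood estimate (1.199960 above), which differs slightly from this unconditional estimate:

```r
dat <- matrix(c(300, 500, 3000, 6000), 2, byrow = TRUE)
or.crude <- (dat[1, 1] * dat[2, 2]) / (dat[1, 2] * dat[2, 1])
or.crude   # 1.2
```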