n0b3l1a: Intro to Biostatistics

Friday, March 16, 2012

Intro to Biostatistics

http://cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf

Applied Statistics for Bioinformatics using R
Wim P. Krijnen
November 10, 2009

Statistical hypothesis testing consists of hypotheses, distributional
assumptions, and decisions (conclusions). The hypotheses pertain to the
outcome of a biological experiment and are always formulated in terms of
population values of parameters. Statistically, the outcomes of experiments
are seen as realizations of random variables. The latter are assumed to
follow a suitable distribution, which serves as a statistical model for the
outcomes of an experiment. Then a statistic is formulated (e.g. a t-value),
which is treated both as a function of the random variables and as a function
of the data values. By comparing the observed value of the statistic with its
distribution under the null hypothesis, the p-value is computed and compared
to the level of significance. A large p-value indicates that the data are
compatible with the model and the null hypothesis; it does not show that
either is correct. A low p-value indicates, provided the distributional
assumptions hold, that an outcome at least this extreme would be so unlikely
under the null hypothesis that the researcher has sufficient grounds to
reject it.
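
The steps described above (statistic, reference distribution, p-value, decision) can be sketched for a one-sample t-test. This is a minimal illustration, not part of the original text: the data vector `x`, the null value `mu0`, and the significance level `alpha` are all assumed for the example.

```r
# Hypothetical measurements; the values are illustrative only.
x <- c(0.8, 1.3, -0.2, 1.1, 0.6, 1.7, 0.4, 0.9)
mu0 <- 0        # assumed population mean under H0
alpha <- 0.05   # assumed significance level

# The t-value as a function of the data values.
n <- length(x)
t_value <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Under H0 (and normality) the statistic follows a t distribution
# with n - 1 degrees of freedom.
df <- n - 1

# Two-sided p-value: probability under H0 of a t-value at least this extreme.
p_value <- 2 * pt(-abs(t_value), df)

# Decision: reject H0 when the p-value falls below the significance level.
reject_h0 <- p_value < alpha
```

The same numbers should come out of `t.test(x, mu = mu0)`, which wraps these steps in a single call.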

> dat <- matrix(c(5,5,5,5),2,byrow=TRUE)
> chisq.test(dat)

Pearson’s Chi-squared test with Yates’ continuity correction

data: dat

X-squared = 0.2, df = 1, p-value = 0.6547

Since the p-value is larger than the significance level, the null hypothesis of
independence is not rejected.

Suppose that for another cutoff value we obtain 8 true positives (tp), 2
false positives (fp), 8 true negatives (tn), and 2 false negatives (fn). Then
testing independence yields the following.

> dat <- matrix(c(8,2,2,8),2,byrow=TRUE)

> chisq.test(dat)

Pearson’s Chi-squared test with Yates’ continuity correction

data: dat

X-squared = 5, df = 1, p-value = 0.02535

Since the p-value is smaller than the significance level, the null hypothesis of
independence is rejected.
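
To see where X-squared = 5 comes from, the statistic can be rebuilt by hand from the expected counts under independence. The sketch below assumes the simple form of Yates' correction (subtracting 0.5 from every absolute deviation), which is valid here because every deviation exceeds 0.5; the variable names are illustrative.

```r
dat <- matrix(c(8, 2, 2, 8), 2, byrow = TRUE)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(dat), colSums(dat)) / sum(dat)

# Yates' correction: subtract 0.5 from each |observed - expected| before squaring.
x_squared <- sum((abs(dat - expected) - 0.5)^2 / expected)

# A 2x2 table has one degree of freedom; the p-value is the upper tail area.
p_value <- pchisq(x_squared, df = 1, lower.tail = FALSE)
```

Here every expected count is 5 and every deviation is 3, so each cell contributes (3 - 0.5)^2 / 5 = 1.25 and the four cells sum to 5.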

Example 2. In the year 1866 Mendel observed, in a large number of
experiments, the frequencies of characteristics of different kinds of seed and
their offspring. In particular, this yielded the frequencies 5474 and 1850 for
the seed shape of garden peas. A crossing of B and b yields offspring BB, Bb
and bb with probabilities 0.25, 0.50 and 0.25. Since Mendel could not
distinguish Bb from BB, his observations theoretically occur with probability
0.75 (BB and Bb) and 0.25 (bb). To test the null hypothesis
H0 : (π1, π2) = (0.75, 0.25) against H1 : (π1, π2) ≠ (0.75, 0.25), we use the
chi-squared test, as follows.

> pi <- c(0.75,0.25)

> x <-c(5474, 1850)

> chisq.test(x, p=pi)

Chi-squared test for given probabilities

data: x

X-squared = 0.2629, df = 1, p-value = 0.6081

From the p-value 0.6081, we do not reject the null hypothesis.
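
The reported statistic can be checked by computing the goodness-of-fit sum directly. In this sketch the hypothesized probabilities are stored in `p0` (a name chosen here so that R's built-in constant `pi` is not masked); the other names are illustrative as well.

```r
x <- c(5474, 1850)    # observed frequencies from the example
p0 <- c(0.75, 0.25)   # probabilities under H0

# Expected frequencies under H0.
expected <- sum(x) * p0

# Pearson statistic: sum of (observed - expected)^2 / expected.
x_squared <- sum((x - expected)^2 / expected)

# Degrees of freedom: number of categories minus one.
df <- length(x) - 1
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
```

The expected frequencies are 5493 and 1831, so both observed counts deviate by only 19, which is why the statistic is small and the p-value large.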

The null hypothesis of the Fisher test is that the odds ratio equals 1, and
the alternative hypothesis is that it differs from 1. Suppose that the
frequency of significant oncogenes for Chromosome 1 equals f11 = 300 out of a
total of f12 = 500, and for the genome f21 = 3000 out of f22 = 6000. The
hypothesis that the odds ratio equals one can now be tested as follows.

> dat <- matrix(c(300,500,3000,6000),2,byrow=TRUE)

> fisher.test(dat)

Fisher’s Exact Test for Count Data

data: dat

p-value = 0.01912

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:

1.029519 1.396922

sample estimates:

odds ratio

1.199960

Since the p-value is smaller than the significance level, the null hypothesis
of an odds ratio equal to one is rejected. The odds of a significant oncogene
are higher for Chromosome 1 than for the genome.
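
The estimate 1.199960 is close to, but not exactly, the cross-product ratio of the table. The sketch below compares the two; the variable names are illustrative. `fisher.test()` reports the conditional maximum likelihood estimate of the odds ratio rather than the sample cross-product ratio.

```r
dat <- matrix(c(300, 500, 3000, 6000), 2, byrow = TRUE)

# Sample (cross-product) odds ratio: (f11 * f22) / (f12 * f21).
or_sample <- (dat[1, 1] * dat[2, 2]) / (dat[1, 2] * dat[2, 1])

# Conditional maximum likelihood estimate, as reported by fisher.test().
or_conditional <- unname(fisher.test(dat)$estimate)
```

For this table the cross-product ratio is (300 * 6000) / (500 * 3000) = 1.2.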
