Microbial Informatics

Lecture 10

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

Announcements

Homework 3 is out
For the next two Fridays, the first hour of class will be a lecture
Read Introduction to Statistics with R (Chapter 5: T- and Wilcoxon tests)

Review

Human are poor sources of random numbers!
Can perform simulations using random number generators
Binomial distribution is a discrete distribution
Several xbinom functions in R

Learning objectives

Describe the different types of hypotheses
How to test associations between tabular data
How to generate tabular data

Testing a hypothesis

What is a hypothesis?
- Testable
- Falsifiable
- Leads to predictions
- Cannot prove a hypothesis is true!
What's are null and alternative hypotheses?
- Null hypothesis is the hypothesis you are trying to reject
- Cannot prove a null hypothesis true, can only reject

Remember: everything is probabilistic

Errors
- Type I: Falsely rejecting a null hypothesis
- Type II: Falsely supporting a null hypothesis
P-value
- The probability that you would see something as extreme or more so if the null hypothesis is true (Type I)
- Generally willing to accept a Type I error of 0.05
Power
- The ability to reject a null hypothesis if false (Type II)
- Generally desire a power of 0.80

Two types of tests

Single tailed
- Probability that the difference is greater than expected
- Example: Are there more males than females?
Two-tailed:
- Probability that the difference is greater or less than expected
- Example: Is there a difference in the number of males and females?

Testing

How would we know whether we had a even sampling of M & F in the previous example?
- How to test?
- Binomial distribution

pbinom(2, 10, 0.5)
2 * pbinom(2, 10, 0.5)
binom.test(2, 10, 0.5)
binom.test(2, 10, 0.5, alternative = "less")

Let's get some data to play with...

metadata <- read.table(file = "wild.metadata.txt", header = T)
rownames(metadata) <- metadata$Group
metadata <- metadata[, -1]

An aside: the table commands

Will convert your observations (individuals by variables) into tables (varaibles by variables)

table(metadata$Sex)
table(metadata$Sex, metadata$SP)
table(metadata$Sex, metadata$SP, metadata$Repro)

Generating marginal sums

sex.sp <- table(metadata$Sex, metadata$SP)
margin.table(sex.sp)
margin.table(sex.sp, 1)
margin.table(sex.sp, 2)

Generating relative abundance tables

prop.table(sex.sp)
prop.table(sex.sp, 1)
prop.table(sex.sp, 2)

Is there a difference in the sex-species ratio?

##    
##     PL PMG
##   F 24  36
##   M 34  17

What would you expect if there was no difference in the ratio?

Is there a species bias?

species.sums <- margin.table(sex.sp, 2)
binom.test(species.sums)

## 
##  Exact binomial test
## 
## data:  species.sums
## number of successes = 58, number of trials = 111, p-value = 0.7044
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4256 0.6182
## sample estimates:
## probability of success 
##                 0.5225

Is there a sex bias?

sex.sums <- margin.table(sex.sp, 1)
binom.test(sex.sums)

## 
##  Exact binomial test
## 
## data:  sex.sums
## number of successes = 60, number of trials = 111, p-value = 0.4478
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4433 0.6355
## sample estimates:
## probability of success 
##                 0.5405

Is the sex/species distribution the same?

What would be expected?

frac.male <- sex.sums["M"]/sum(sex.sums)
frac.female <- 1 - frac.male
frac.sex <- c(F = frac.female, M = frac.male)

frac.pl <- species.sums["PL"]/sum(species.sums)
frac.pmg <- 1 - frac.pl
frac.species <- c(PL = frac.pl, PMG = frac.pmg)

expected <- frac.sex %*% t(frac.species)
expected <- expected * sum(sex.sp)
expected

##      PL.PL PMG.PL
## [1,] 31.35  28.65
## [2,] 26.65  24.35

What is the difference between what we expected and what we observed?

\[\chi^2 =\sum \frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

Test against a ChiSquared distribution with: \[df=(n_{row}-1)(n_{col}-1)\]

ChiSquared distribution

The distribution of a sum of the squares of k indepdentently sampled normal random variables, where k is the degrees of freedom
Procedure to create an empirical distribution
1. Draw k random variables from a normal distribution with mean 0.0 and standard deviation of 1.0
2. Square each of them
3. Sum them
4. Repeat many times and keep track of how many times you see each value
Interested in the proportion of the distribution larger than our test statistic

What is the difference between what we expected and what we observed?

chi.sq <- sum((expected - sex.sp)^2/expected)
df <- (nrow(sex.sp) - 1) * (ncol(sex.sp) - 1)
plot(seq(0, 20, 0.05), dchisq(seq(0, 20, 0.05), df = df), type = "l", xlab = "ChiSquared Statistic", 
    ylab = "Probability with 1 degree of freedom")
arrows(x0 = chi.sq, x1 = chi.sq, y0 = 0.4, y1 = 0.05, lwd = 2, col = "red")

plot of chunk unnamed-chunk-10

What is the difference between what we expected and what we observed?

chisq.test(sex.sp)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  sex.sp
## X-squared = 6.825, df = 1, p-value = 0.00899

Fisher exact test

Imagine a 2x2 table and the cells are a, b, c, d going from left to right and top-down
The probability of seeing a specific table format...

\[P=\frac{(a+b)!(c+d)!(a+c)!(b+d)!}{a!b!c!d!}\]

Need to then find probability of more extreme tables via permutation
Can be computationally intensive

Fisher exact test in R

fisher.test(sex.sp)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  sex.sp
## p-value = 0.007411
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1418 0.7765
## sample estimates:
## odds ratio 
##     0.3368

Test of proportions

Null: the fraction of males is the same in both species

males <- sex.sp["M", ]
prop.test(males, species.sums)

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  males out of species.sums
## X-squared = 6.825, df = 1, p-value = 0.00899
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.06891 0.46199
## sample estimates:
## prop 1 prop 2 
## 0.5862 0.3208

Conclusions

Generally need at least 5 observations per cell of your table (rule of thumb)
Test possible with more that two levels per factor, but may be necessary use simulations to build distributions
Fisher exact test is best approach, but can be computationally demanding

Microbial Informatics

Lecture 10

Announcements

Review

Learning objectives

Testing a hypothesis

Remember: everything is probabilistic

Two types of tests

Testing

Let's get some data to play with...

An aside: the table commands

Generating marginal sums

Generating relative abundance tables

Is there a difference in the sex-species ratio?

Is there a species bias?

Is there a sex bias?

Is the sex/species distribution the same?

What is the difference between what we expected and what we observed?

ChiSquared distribution

What is the difference between what we expected and what we observed?

What is the difference between what we expected and what we observed?

Fisher exact test

Fisher exact test in R

Test of proportions

Conclusions

Questions?