# Microbial Informatics

## Lecture 09

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• Homework 3 is out
• For the next two Fridays, the first hour of class will be a lecture
• Read Introduction to Statistics with R (Chapter 8: Tabular data)

## Review

• Human are poor sources of random numbers!

## Learning objectives

• Random variables

## Randomness and probabilities

• In science, much of what we do assumes that samples are observed randomly
• We live in a probabilistic world where everything has a probability, even if it is very small

## Randomness

• Pick a number between 1 and 10 and write it down
• What would you expect a random sampling of the numbers from 1 to 10 to look like?
• What would it depend on?
• Mean?
• Min?
• Max?
• What type of plot would we use?

## Let's get R to do it for us

• Use the `sample` command to randomly draw a number from a range of integers

``````r<-sample(1:10, 10, replace=T)
r
``````
• What should these look like?

``````hist(r)
stripchart(r, method="jitter")
plot(r)
summary(r)
``````

## How do we get a more reliable shape to our distribution?

• What happens if we run the following repeatedly?

``````r<-sample(1:10, 1000, replace=T)
hist(r, xlim=c(0,10), breaks=seq(0,10, 1))
``````
• We can "fix" the distribution:

``````set.seed(1)
r<-sample(1:10, 1000, replace=T)
hist(r, xlim=c(0,10), breaks=seq(0,10, 1))
``````
• Being able to set the random seed is an importat feature for reproducible reserach

## There are discrete and continuous variables

• Discrete: Hits in baseball, number of infected mice, number of people

``````r<-sample(1:10, 1000, replace=T)
plot(r, ylim=c(0,10))
``````
• Continuous: Weight, temperature, concentrations

``````r<-runif(1000, min=1, max=10)
plot(r, ylim=c(0,10))
``````
• Notice a difference?

## Binomial distribution

• Flipping a fair coin...

``````sample(c("H", "T"), 100, replace=T)
rbinom(10, size=1, prob=0.5)
``````
• Flipping a cooked coin...

``````heads <- rbinom(10, size=1, prob=0.8)
``````
• Hall of fame hitter...

``````hits <- rbinom(5, size=1, prob=0.3)
sum(hits)
``````

## Other distribution functions

• `rbinom`: random samples
• Draw random values from a binomial distribution
• Example: Have the computer flip a coin for you
• `dbinom`: distribution function
• Probability of drawing a certain number of something from a binomial distribution
• Example: Probability of getting 1 head out of 10 coin flips

## Other distribution functions

• `pbinom`: cumulative distribution function
• Probability of drawing a certain number of something or fewere froma binomial distribution
• Example: Probability of geting 1 or fewer heads out of 10 coin flips
• `qbinom`: inverse cumulative distribution function
• Given a probability, return the number whose cumulative distribution matches the probability
• Example: Number of heads (and fewer) you should expect to get 25% of the time when you make 10 flips

## Example

• You have a mouse breeding colony and you are disapointed because the mom you were counting on to give you an even mix of males and females has given you 2 males and 8 females.
• Do you think something is wrong with her?
• You could rebread her a bunch and see, but that will get expensive and take a long time.
• Is there a faster way?
``````breedings <- 1000
npups <- 10
p.males <- 0.5
obs.males <- 2
``````

## Number of males you get if you rebread her 1000 times and she has 10 pups per litter

``````r <- rbinom(breedings, npups, p.males)
r.hist <- hist(r, plot = FALSE, breaks = seq(-0.5, 10.5, 1))
par(mar = c(5, 5, 0.5, 0.5))
plot(r.hist\$density ~ r.hist\$mids, type = "h", lwd = 2, xlab = "Number of male mice",
ylab = "Density", xlim = c(0, 10))
arrows(x0 = 2, x1 = 2, y0 = 0.15, y1 = 0.08, lwd = 2, col = "red")
``````

## Fraction of those 1000 times where you get two males (empirical)

``````n.two <- sum(r == obs.males)
p.two.empirical <- n.two / breedings
p.two.empirical
``````
``````## [1] 0.046
``````

## Built in R function: `dbinom`

``````p.two.R <- dbinom(2, 10, 0.5)
p.two.R
``````
``````## [1] 0.04395
``````
``````p.two.empirical - p.two.R
``````
``````## [1] 0.002055
``````
• How would you reduce the difference?

## Number of males you get if you rebread her 1000 times and she has 10 pups per litter

``````plot(r.hist\$density ~ r.hist\$mids, type = "h", lwd = 2, xlab = "Number of male mice",
ylab = "Density", xlim = c(0, 10))
points(x = 0:10, dbinom(0:10, 10, 0.5), col = "red", lwd = 3, type = "l", lty = 1)
``````

## Fraction of those 1000 times where you get two or fewer males (empirical and `pbinom`)

``````n.two.or.fewer <- sum(r <= obs.males)
p.two.or.fewer.empirical <- n.two.or.fewer / breedings
p.two.or.fewer.empirical
``````
``````## [1] 0.054
``````
``````p.two.or.fewer.R <- pbinom(2, 10, 0.5)
p.two.or.fewer.R
``````
``````## [1] 0.05469
``````
``````p.two.or.fewer.empirical-p.two.or.fewer.R
``````
``````## [1] -0.0006875
``````

## How many females should we expect to have 90% of the time?

``````inv.cdf <- qbinom(0.9, 10, 0.5)
``````
• In a litter of 10 mice, we should expect to have as many as 7 females in 90% of our litters

## Things to think about: Baseball...

• What is the probability of a 0.300 hiter going 0 for 4?
• What is the probability of a 0.300 hiter going 4 for 4?
• What's the probability of replicating Joe DiMaggio's 56 game hitting streak?
• What's the probability of another person matching the streak next season?

## Things to think about: DNA sequencing

• If sequencing errors are random and 1% of base calls are errors, then...
• fraction of 250 bp sequences would have more than one error?
• you had 1 million sequences, how many would have the exact same error?

## Other important distributions

• `dunif`, `punif`, `qunif`, `runif`
• `dnorm`, `pnorm`, `qnorm`, `rnorm`
• `dbinom`, `pbinom`, `qbinom`, `rbinom`
• many others... see pg 332, Appendix C in ISwR