Microbial Informatics

Lecture 06

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology


  • Start thinking about your project for the first half of the semester
    • Emphasis on data analysis
    • Due 10/24/2014 (Friday)
    • Feel free to come to office hours to discuss project ideas
    • I have some ideas for microbial ecology analysis projects
  • When you upload your assignments, upload the README.Rmd and README.md files generated by RStudio/knitr


  • Everything in R is some form of a vector
  • Vectors
  • Lists
  • Matrices
  • Tables
  • Data tables
  • Factors
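
As a rough sketch, each of these structures can be built in base R as shown here. All object names and values are invented for illustration, and "data table" is assumed to mean what R calls a data frame.

```r
# Hypothetical values, for illustration only
v  <- c(4.2, 5.1, 3.9)                    # numeric vector
l  <- list(name = "organism", sizes = v)  # list: elements may differ in type and length
m  <- matrix(1:6, nrow = 2)               # matrix: 2 rows x 3 columns, a single data type
f  <- factor(c("M", "F", "F"))            # factor: a categorical variable
tb <- table(f)                            # table: counts of each factor level
df <- data.frame(sex = f, weight = v)     # data frame: columns of equal length

# Even a single number is a vector of length one
is.vector(3)
length(3)
```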

Learning objectives

  • Appreciate the diversity of different data structures
  • Understand how to calculate descriptive statistics
  • Begin to develop data visualization skills in R


  • Defining categorical variables
  • In a genome we might think of the forward/reverse orientation, reading frame, DNA/protein sequence designation, or annotation category as categorical variables.
  • Create factors
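metadata <- read.table(file="wild.metadata.txt", header=TRUE)   # first line holds the column names
rownames(metadata) <- metadata$Group                            # use the Group column as row names
metadata <- metadata[,-1]                                       # drop the first column

metadata$ET <- factor(metadata$ET)                              # treat ET as a categorical variable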
  • What other variables here would be a factor?
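
Here is a small self-contained sketch of what converting a column to a factor does. The `toy` data frame is a hypothetical stand-in for the metadata table: the column names follow the slides, but the values are invented.

```r
# Hypothetical stand-in for the metadata table (invented values)
toy <- data.frame(ET = c(10, 12, 10, 15),
                  Sex = c("M", "F", "F", "M"),
                  Weight = c(21.5, 18.0, 19.2, 23.1),
                  stringsAsFactors = FALSE)

toy$ET  <- factor(toy$ET)    # numeric codes become categories
toy$Sex <- factor(toy$Sex)   # character labels become categories

levels(toy$ET)               # the distinct categories, stored as character labels
table(toy$Sex)               # number of observations in each category
is.factor(toy$Weight)        # weight is left as a continuous variable
```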


  • Lists are similar to data frames, but they are not necessarily read across rows and their elements need not all have the same length
  • Could hold a genome's data within a list:
    • name: Character with organism name
    • genome.size: Number with number of bases
    • start.pos: Vector of start positions for each gene
    • end.pos: Vector of end positions for each gene
    • gene.name: Name of each gene
    • hydrolases: Names of genes that are hydrolases
  • Allow one to create complex data structures
  • We'll use these only in passing
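
The genome example above might look something like this as an R list. Every value is made up for illustration, and the gene length calculation assumes positions are inclusive.

```r
# Hypothetical genome; all values are invented
genome <- list(
  name        = "Example organism",
  genome.size = 5000,
  start.pos   = c(100, 1200, 3000),
  end.pos     = c(900, 2500, 4100),
  gene.name   = c("geneA", "geneB", "geneC"),
  hydrolases  = c("geneB")
)

genome$name                  # access an element by name
genome[["genome.size"]]      # equivalent double-bracket access

# Length of each gene in bases, assuming inclusive start and end positions
gene.length <- genome$end.pos - genome$start.pos + 1

sapply(genome, length)       # the elements do not all have the same length
```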

An example of where we'll use lists...

  • Let's get the mean weight for each sex of mouse
aggregate(metadata$Weight, by=metadata$Sex, mean)         # fails: 'by' must be a list
aggregate(metadata$Weight, by=list(metadata$Sex), mean)
sex.weight <- aggregate(metadata$Weight, by=list(metadata$Sex), mean)
  • Let's get the mean weight for each sex and species of mouse
aggregate(metadata$Weight, by=list(metadata$Sex, metadata$SP), mean)
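
To try this without the metadata file, here is a sketch on a hypothetical `toy` data frame (invented values; `sp1` and `sp2` are placeholder species labels). It also shows that naming the elements of the `by` list gives readable column names in the output.

```r
# Hypothetical data, for illustration only
toy <- data.frame(Sex = c("M", "F", "F", "M", "F", "M"),
                  SP = c("sp1", "sp1", "sp2", "sp2", "sp1", "sp1"),
                  Weight = c(20, 18, 16, 22, 20, 24))

# One grouping variable: mean weight for each sex
sex.weight <- aggregate(toy$Weight, by = list(Sex = toy$Sex), FUN = mean)

# Two grouping variables: mean weight for each sex/species combination
sex.sp.weight <- aggregate(toy$Weight, by = list(Sex = toy$Sex, SP = toy$SP), FUN = mean)

sex.weight
sex.sp.weight
```

The grouping columns take their names from the list, while the aggregated values land in a column called `x`.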

Descriptive data analysis

  • With large datasets it's easy to get lost and not know what the data look like
  • Need to develop strategies to get a handle on the data
  • Three approaches:
    • Visualize data tables
    • Summary statistics
    • Visualize data

What types of variables do we have in the table?

  • How do we get the column/variable names?
  • How do we know the data type for each column?
  • Which columns in metadata should be
    • Continuous?
    • Categorical?
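
One possible way to answer these questions, sketched on a hypothetical `toy` data frame. The `Age` column and all values are invented and are not part of the course dataset.

```r
# Hypothetical data, for illustration only
toy <- data.frame(Sex = factor(c("M", "F", "F")),
                  Weight = c(20.1, 18.3, 19.0),
                  Age = c(3L, 5L, 4L))

colnames(toy)                      # column/variable names
class(toy$Weight)                  # data type of a single column
col.types <- sapply(toy, class)    # data type of every column
col.types
str(toy)                           # compact overview: dimensions, types, first values
```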

Visualizing raw data

  • Homework asked you to print out the first 5 lines of a data table
  • There are several ways to do this...
datatable[c(1,2,3,4,5),]    # explicit method
datatable[1:5,]             # bracket method
head(datatable, n=5)        # first N lines
tail(datatable, n=5)        # last N lines
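
A self-contained sketch of the same calls, using a hypothetical eight-row `datatable` so the results are easy to check by eye:

```r
# Hypothetical table with 8 rows (invented values)
datatable <- data.frame(id = 1:8, value = (1:8)^2)

first5 <- head(datatable, n = 5)         # rows 1 to 5
last5  <- tail(datatable, n = 5)         # rows 4 to 8
all.but3 <- head(datatable, n = -3)      # a negative n drops the last 3 rows

first5
last5
```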

Summary statistics

What types of data structures do these output?

summary(metadata)           # how do we get a summary of the overall metadata table?
summary(metadata$Weight)    # how do we get a summary of the weights?
median(metadata$Weight)     # median - what is a median?
w.summary <- summary(metadata$Weight)

sd(metadata$Weight)         # standard deviation
var(metadata$Weight)        # variance
min(metadata$Weight)        # minimum weight
max(metadata$Weight)        # maximum weight
quantile(metadata$Weight)   # weight distribution by quartiles (the default)
quantile(metadata$Weight, probs=seq(0,1,0.1))   # weight distribution by deciles
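
As a rough illustration of what these functions return, the sketch below uses a hypothetical `weight` vector (invented values). Both `summary()` and `quantile()` return named results, so individual statistics can be pulled out by name.

```r
# Hypothetical weights, for illustration only
weight <- c(16, 18, 20, 22, 24)

w.summary <- summary(weight)
names(w.summary)             # labels of the statistics that summary() reports
w.summary[["Median"]]        # extract one statistic by name

q <- quantile(weight)        # named numeric vector of quartiles
q[["25%"]]                   # first quartile

# The variance is the square of the standard deviation; compare with
# all.equal() rather than == because of floating point rounding
all.equal(sd(weight)^2, var(weight))
```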

Let's cross reference columns

aggregate(metadata$Weight, by=list(metadata$Sex), mean)
aggregate(Weight~Sex, data=metadata, mean)  # notice the difference in output

aggregate(metadata$Weight, by=list(metadata$Sex, metadata$SP), mean)
aggregate(Weight~Sex+SP, data=metadata, mean)

aggregate(Weight~Sex+SP, data=metadata, summary)

w.sex_sp <- aggregate(Weight~Sex+SP, data=metadata, mean)   # what type of variable is w.sex_sp?
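
To see the difference between the two calling styles without the metadata file, here is a sketch on a hypothetical `toy` data frame (invented values; `sp1` and `sp2` are placeholder species labels).

```r
# Hypothetical data, for illustration only
toy <- data.frame(Sex = c("M", "F", "F", "M", "F", "M"),
                  SP = c("sp1", "sp1", "sp2", "sp2", "sp1", "sp1"),
                  Weight = c(20, 18, 16, 22, 20, 24))

by.out      <- aggregate(toy$Weight, by = list(toy$Sex, toy$SP), FUN = mean)
formula.out <- aggregate(Weight ~ Sex + SP, data = toy, FUN = mean)

names(by.out)         # generic names: Group.1, Group.2, x
names(formula.out)    # names taken from the formula: Sex, SP, Weight
class(formula.out)    # the result is a data frame

# With FUN = summary, each group returns several values, so the Weight
# column of the result is a matrix rather than a plain vector
summary.out <- aggregate(Weight ~ Sex + SP, data = toy, FUN = summary)
is.matrix(summary.out$Weight)
```

Because the result is a data frame, it can be indexed like any other table, for example `formula.out[formula.out$Sex == "F", ]`.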

Data visualization

  • This is a huge area of exploration
  • R is tremendously powerful for generating plots and data visualization tools
  • Can generally tell someone used MS Excel by how bad the plots look
  • Can certainly generate crap in R, but the upside is greater
  • Numerous plotting packages are available, including those listed below, but we will focus on base graphics until the end of the semester:
    • lattice
    • ggplot2
    • rgl
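
As a first taste of base graphics, the sketch below draws a histogram and a boxplot from a hypothetical `toy` data frame (invented values). The plot titles and axis labels are arbitrary choices.

```r
# Hypothetical data, for illustration only
toy <- data.frame(Sex = c("M", "F", "F", "M", "F", "M"),
                  Weight = c(20, 18, 16, 22, 20, 24))

# Distribution of a continuous variable
hist(toy$Weight, main = "Distribution of weights", xlab = "Weight")

# A continuous variable split by a categorical variable; boxplot() also
# returns, invisibly, the statistics it used to draw each box
bp <- boxplot(Weight ~ Sex, data = toy, xlab = "Sex", ylab = "Weight")
bp$names    # the groups that were plotted
```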

For Thursday

  • Start working on new assignment that will be posted this weekend
  • Assignment due Friday
  • Read Introductory Statistics with R (Chapters 3 and 4)