Microbial Informatics

Lecture 05

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology


  • No class on Thursday (9/18) or Friday (9/19).
  • Start thinking about your project for the first half of the semester
    • Emphasis on data analysis
    • Due 10/24/2014 (Friday)
    • Feel free to come to office hours to discuss project ideas


  • Vectors are one-dimensional sets of values of the same type
  • Everything in R is some form of a vector
  • Can access values of a vector with square brackets ("[]")
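
As a quick refresher on square-bracket indexing, here is a minimal sketch. The vector name `weights` and its values are made up for illustration.

```r
# A numeric vector: every element has the same type
weights <- c(21.5, 18.2, 24.0, 19.7)

weights[2]             # second element (R counts from 1)
weights[c(1, 3)]       # first and third elements
weights[-1]            # everything except the first element
weights[weights > 20]  # only the elements that satisfy a condition

# Even a single number is a vector of length 1
length(5)
```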

Types of containers

  • Vectors
  • Lists
  • Matrices
  • Tables
  • Data frames
  • Factors
  • We will go through these in more detail throughout the course, especially in the second half

Learning objectives

  • Be able to differentiate between matrices and data frames
  • Make categorical variables using factors
  • Develop complex data structures using lists


Matrices

  • Multidimensional data structure of the same data type
  • We'll see a lot of overlap with tables and data.frames

Create and access a matrix...

m <- matrix(1:96, nrow=8, ncol=12) # create an 8 x 12 matrix, filled column by column
rownames(m)<-c("A", "B", "C", "D", "E", "F", "G", "H")
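
Accessing values in a matrix works like indexing a vector, but with a row index and a column index separated by a comma. The following sketch rebuilds the same matrix so that it runs on its own:

```r
m <- matrix(1:96, nrow=8, ncol=12)   # filled column by column
rownames(m) <- LETTERS[1:8]          # same as c("A", "B", ..., "H")

m[1, 1]     # row 1, column 1
m["B", 3]   # row named "B", column 3
m[, 2]      # the entire second column
m["H", ]    # the entire row named "H"
dim(m)      # number of rows and columns
```

Leaving one side of the comma empty means "everything" in that dimension.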


Numerous operations that can be performed on a matrix

t(m)             # transpose the matrix
1/m              # take each value of m and find its reciprocal
m * m            # calculate the square of each value in m
m %*% t(m)       # performs matrix multiplication
crossprod(m,m)   # performs the cross product, i.e. t(m) %*% m
rowSums(m)       # calculate the sum for each row
colSums(m)       # calculate the sum for each column
lower.tri(m)     # logical matrix that is TRUE below the diagonal
m[lower.tri(m)]  # give the lower triangle of m
diag(m)          # the values on the diagonal of m
det(m[1:8,1:8])  # the determinant of the first 8 columns of m (must be square)
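
The difference between element-wise and matrix multiplication is easier to see on a small matrix. This sketch uses a made-up 2 x 3 matrix `a`:

```r
a <- matrix(1:6, nrow=2, ncol=3)   # 2 x 3, filled by column

a * a          # element-wise: each value squared, still 2 x 3
a %*% t(a)     # matrix product: (2 x 3) times (3 x 2) gives 2 x 2
crossprod(a)   # shorthand for t(a) %*% a, a 3 x 3 matrix

# check that crossprod() matches the explicit form
all.equal(crossprod(a), t(a) %*% a)
```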

Apply functions to matrices

apply(m, 1, sum)    # get the sum for each row - same as rowSums(m)
apply(m, 2, sum)    # get the sum for each column - same as colSums(m)
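
`apply` is not limited to `sum`. Any function that takes a vector can be used, including one written on the spot. A minimal sketch (the variable names are illustrative):

```r
m <- matrix(1:96, nrow=8, ncol=12)

row.max  <- apply(m, 1, max)    # 1 = apply the function to each row
col.mean <- apply(m, 2, mean)   # 2 = apply the function to each column

# an anonymous function: the spread (max - min) of each row
row.spread <- apply(m, 1, function(x) max(x) - min(x))
```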

Data frames

  • multidimensional data structure that allows for multiple data types across columns
  • think of gene statistics in a genome annotation
    gene   start   end   strand   length   annotation
    rbcA   num     num   logic    num      character
  • important point is that the data is linked across the rows
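
A small data frame along these lines could be built by hand as follows. The gene names, coordinates, and annotations are invented for illustration only.

```r
# hypothetical gene table: one row per gene, one type per column
genes <- data.frame(
  gene       = c("geneA", "geneB", "geneC"),
  start      = c(100, 950, 2100),
  end        = c(900, 2000, 2700),
  strand     = c(TRUE, FALSE, TRUE),   # TRUE = forward strand
  annotation = c("hydrolase", "transporter", "hypothetical protein"),
  stringsAsFactors = FALSE
)

# columns can be computed from other columns, row by row
genes$length <- genes$end - genes$start + 1

str(genes)    # shows the type of each column
genes[2, ]    # all of the linked values for the second gene
```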

Let's get some data

  • Want to work with metadata from a study looking at the gut microbiota of wild populations of Peromyscus leucopus and P. maniculatus
  • Download data
  • Take a look at the data
  • Save folder to your Desktop
  • Set your working directory to the Desktop

Working with data frames

  • Be sure to set correct working directory in RStudio

    metadata <- read.table(file="wild.metadata.txt", header=TRUE)
    head(metadata)      # look at the first lines of table
    rownames(metadata)  # notice a problem here?
    summary(metadata)   # output a summary of each column in table
  • Check out the Data section of the Environment tab of RStudio

  • What problems can you see with this output?

Accessing values from data frames

metadata$Age            # output column named "Age"
metadata[,"Age"]        # output column named "Age"
metadata[,7]            # output the 7th column
metadata[,-7]           # output everything but the 7th column

metadata["23", ]        # output row with Group 6_16m33
metadata[23, ]          # output 23rd row (aka Group 6_16m33)
metadata[-23,]          # output everything but the 23rd row
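
The same access patterns can be tried without the downloaded file. The data frame `toy` below is a made-up stand-in, not the real metadata.

```r
# hypothetical three-sample table
toy <- data.frame(
  Group  = c("s1", "s2", "s3"),
  Sex    = c("M", "F", "M"),
  Weight = c(20, 18, 23),
  stringsAsFactors = FALSE
)

toy$Weight         # column by name
toy[, "Weight"]    # same column, bracket notation
toy[, 3]           # same column, by position
toy[, -3]          # every column except the third

toy[2, ]           # second row
toy[-2, ]          # every row except the second
```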

Let's use these functions to clean up the data

  • We'd like to use the "Group" column as the rowname
    • Group names must be unique
    • Case sensitive - "group" will not work
rownames(metadata) <- metadata$Group
metadata <- metadata[,-1]
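
Because row names must be unique, it can help to check for duplicates before assigning them. A sketch on a made-up table (`toy` is hypothetical):

```r
toy <- data.frame(
  Group  = c("s1", "s2", "s3"),
  Weight = c(20, 18, 23),
  stringsAsFactors = FALSE
)

anyDuplicated(toy$Group)       # 0 means every name is unique

rownames(toy) <- toy$Group     # use the names as row labels
toy <- toy[, -1, drop=FALSE]   # drop=FALSE keeps a data frame even when
                               # only one column is left

toy["s2", "Weight"]            # rows can now be looked up by name
```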

More complicated stuff

  • What do these commands do?

  • What's the difference between these commands?

    metadata <- metadata[-23,]
  • Can make new columns

    metadata[,"sequences"] <- rep(NA, nrow(metadata))

Incorporating logic

  • Define criteria to set rows you want to keep
  • Let's get all of the P. leucopus samples
  • Let's get all of the P. leucopus samples from males
metadata[metadata$SP=="PL" & metadata$Sex=="M",]
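
Row selection like this rests on logical vectors: each comparison gives one TRUE/FALSE per row, and `&` combines them. A sketch with made-up data, assuming the same "PL" and "M" codes used above ("PM" is a hypothetical second species code):

```r
toy <- data.frame(
  SP     = c("PL", "PM", "PL", "PL"),
  Sex    = c("M", "M", "F", "M"),
  Weight = c(20, 18, 23, 21),
  stringsAsFactors = FALSE
)

is.pl   <- toy$SP == "PL"    # TRUE for each P. leucopus row
is.male <- toy$Sex == "M"    # TRUE for each male

toy[is.pl, ]                 # all P. leucopus samples
toy[is.pl & is.male, ]       # P. leucopus AND male
toy[is.pl | is.male, ]       # P. leucopus OR male
sum(is.pl & is.male)         # TRUE counts as 1, so this counts matching rows
```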


Factors

  • Defining categorical variables
  • In a genome we might think of the forward/reverse orientation, reading frame, DNA/protein sequence designation, or annotation category as categorical variables.
  • Create factors
metadata$ET <- factor(metadata$ET)
  • What other variables here would be a factor?
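
To see what a factor stores, here is a minimal sketch using a made-up strand orientation variable:

```r
strand <- c("forward", "reverse", "forward", "forward")
strand.f <- factor(strand)

levels(strand.f)       # the distinct categories, sorted alphabetically
table(strand.f)        # number of observations in each category
as.integer(strand.f)   # internally, each value is an index into the levels

# levels and labels can also be set explicitly
frame.f <- factor(c(1, 3, 2, 1), levels=c(1, 2, 3),
                  labels=c("frame1", "frame2", "frame3"))
```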


Lists

  • Similar to data frames, but the values are not necessarily linked across rows and the variables need not have the same length
  • Could hold a genome's data within a list:
    • name: Character with organism name
    • genome.size: Number with number of bases
    • start.pos: Vector of start positions for each gene
    • end.pos: Vector of end positions for each gene
    • gene.name: Name of each gene
    • hydrolases: Names of genes that are hydrolases
  • Allow one to create complex data structures
  • We'll use these only in passing
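
The genome example above might look like this as an R list. All of the values are invented for illustration.

```r
# hypothetical genome record: elements differ in type and length
genome <- list(
  name        = "Hypothetical organism",
  genome.size = 5000,
  start.pos   = c(100, 950, 2100),
  end.pos     = c(900, 2000, 2700),
  gene.name   = c("geneA", "geneB", "geneC"),
  hydrolases  = c("geneA")
)

genome$name               # access an element by name
genome[["genome.size"]]   # same idea with double brackets
gene.length <- genome$end.pos - genome$start.pos + 1

sapply(genome, length)    # the elements do not share a length
```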

An example of where we'll use lists...

  • Let's get the mean weight for each sex of mouse
aggregate(metadata$Weight, by=metadata$Sex, mean)        # fails: 'by' must be a list
aggregate(metadata$Weight, by=list(metadata$Sex), mean)
sex.weight <- aggregate(metadata$Weight, by=list(metadata$Sex), mean)
  • Let's get the mean weight for each sex and species of mouse
aggregate(metadata$Weight, by=list(metadata$Sex, metadata$SP), mean)
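
The same calls can be tried on a small made-up table. Naming the elements of the `by` list gives readable column headers, and the formula interface of `aggregate` is an alternative way to write the two-variable case. The data frame `toy` and the "PM" code are hypothetical.

```r
toy <- data.frame(
  SP     = c("PL", "PL", "PM", "PM"),
  Sex    = c("F", "M", "F", "M"),
  Weight = c(20, 22, 16, 18),
  stringsAsFactors = FALSE
)

# naming the list element labels the grouping column "Sex"
by.sex <- aggregate(toy$Weight, by=list(Sex=toy$Sex), FUN=mean)
by.sex

# formula interface: Weight summarized by every Sex/SP combination
by.both <- aggregate(Weight ~ Sex + SP, data=toy, FUN=mean)
by.both
```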

For next Tuesday

  • Start working on new assignment that will be posted this weekend
  • Read Introductory Statistics with R (Chapters 1 and 2)