# Microbial Informatics

## Lecture 05

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• No class on Thursday (9/18) or Friday (9/19).
• Start thinking about your project for the first half of the semester
• Emphasis on data analysis
• Due 10/24/2104 (friday)
• Feel free to come to office hours to discuss project ideas

## Review

• Vectors are one-dimensional sets of values of the same type
• Everything in R is some form of a vector
• Can access values of a vector with square brackets ("[]")

## Types of containers

• Vectors
• List
• Matrix
• Table
• Data table
• Factors
• We will go through these more in detail throughout the course and especially in second half of the course

## Learning objectives

• Be able to differentiate between matrices and data frames
• Make categorical variables using factors
• Develop complex data structures using lists

## Matrices

• Multidimensional data structure of the same data type
• We'll see a lot of overlap with tables and data.frames

## Create and access a matrix...

``````m <- matrix(seq(1:96), nrow=8, ncol=12) #create a 8 x 12 matrix
colnames(m)<-1:12
rownames(m)<-c("A", "B", "C", "D", "E", "F", "G", "H")

dim(m)
nrow(m)
ncol(m)
m[1:5,1:5]
m[1:5,]
m[,1:5]
``````

## Numerous operations that can be performed on a matrix

``````t(m)             # transpose the matrix
1/m              # take each value of m and find it's reciprocal
m * m            # calculate the square of each value in m
m %*% t(m)       # performs matrix multiplication
crossprod(m,m)   # performs the cross product
rowSums(m)       # calculate the sum for each row
colSums(m)       # calculate the sum for each column
lower.tri(m)     # find the indices that are below the diagonal
m[lower.tri(m)]  # give the lower triangle of m
diag(m)          # the values on the diagonal of m
det(m[1:8,1:8])  # the determinent of m
``````

## Apply functions to matrices

``````apply(m, 1, sum)    # get the sum for each row - same as rowSums(m)
apply(m, 2, sum)    # get the sum for each column - same as colSums(m)
``````

## Data frames

• multidimensional data structure that allows for multiple data types across columns
• think of gene statistics in a genome annotation
gene start end strand length annotation
rbcA num num logic num character
rbcB
rbcC
rbcD
etc.
• important point is that the data is linked across the rows

## Let's get some data

• Want to work with metadata from a study looking at the gut microbiota of wild populations of Peromyscus leucopis and P. maniculatis
• Take a look at the data
• Save folder to your Desktop
• Set your working directory to the Desktop

## Working with data frames

• Be sure to set correct working directory in RStudio

``````metadata <- read.table(file="wild.metadata.txt", header=T)
rownames(metadata)  # notice a problem here?
summary(metadata)   # output a summary of each column in table
``````
• Check out the Data section of the Environment tab of RStudio

• What problems can you see with this output?

## Accessing values from data frames

``````metadata\$Age            # output column named "Age"
metadata[,"Age"]        # output column named "Age"
metadata[,7]            # output 4th column ("end")
metadata[,-7]           # output everything but the 4th column ("end")

metadata["23", ]        # output row with Group 6_16m33
metadata[23, ]          # output 23rd row (aka Group 6_16m33)
metadata[-23,]          # output everything but the 23rd row
``````

## Let's use these functions to clean up the data

• We'd like to use the "Group"" column as the rowname
• Group names must be unique
• Case sensitive - "group" will not work
``````rownames(metadata) <- metadata\$Group
``````

## More complicated stuff

• What do these commands do?

``````metadata\$Weight[1:5]
``````
• What's the difference between these commands?

``````metadata[-23,]
``````
• Can make new columns

``````metadata[,"sequences"] <- rep(NA, nrow(metadata))
``````

## Incorporating logic

• Define criteria to set rows you want to keep
• Let's get all of the P. leucopis samples
``````metadata[metadata\$SP=="PL",]
``````
• Let's get all of the P. leucopis samples from males
``````metadata[metadata\$SP=="PL" & metadata\$Sex=="M",]
``````

## Factors

• Defining categorical variables
• In a genome we might think of the forward/reverse orientation, reading frame, dna/protein sequence designation, or annotation category as categorical variables.
• Create factors
``````factor(metadata\$ET)
``````
• What other variables here would be a factor?

## Lists

• Similar to data fames, but not necessarily read across rows and not all variables have the same length
• Could hold a genome's data within a list:
• name: Character with organism name
• genome.size: Number with number of bases
• start.pos: Vector of start positions for each gene
• end.pos: Vector of end positions for each gene
• gene.name: Name of each gene
• hydrolases: Names of genes that are hydrolases
• Allow one to create complex data structures
• We'll use these only in passing

## An example of where we'll use lists...

• Let's get the mean weight for each sex of mouse
``````aggregate(metadata\$Weight, by=metadata\$Sex, mean)
``````aggregate(metadata\$Weight, by=list(metadata\$Sex, metadata\$SP), mean)