Microbial Informatics

Lecture 05

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

Announcements

No class on Thursday (9/18) or Friday (9/19).
Start thinking about your project for the first half of the semester
- Emphasis on data analysis
- Due 10/24/2104 (friday)
- Feel free to come to office hours to discuss project ideas

Review

Vectors are one-dimensional sets of values of the same type
Everything in R is some form of a vector
Can access values of a vector with square brackets ("[]")

Types of containers

Vectors
List
Matrix
Table
Data table
Factors
We will go through these more in detail throughout the course and especially in second half of the course

Learning objectives

Be able to differentiate between matrices and data frames
Make categorical variables using factors
Develop complex data structures using lists

Matrices

Multidimensional data structure of the same data type
We'll see a lot of overlap with tables and data.frames

Create and access a matrix...

m <- matrix(seq(1:96), nrow=8, ncol=12) #create a 8 x 12 matrix
colnames(m)<-1:12
rownames(m)<-c("A", "B", "C", "D", "E", "F", "G", "H")

dim(m)
nrow(m)
ncol(m)
m[1:5,1:5]
m[1:5,]
m[,1:5]

Numerous operations that can be performed on a matrix

t(m)             # transpose the matrix
1/m              # take each value of m and find it's reciprocal
m * m            # calculate the square of each value in m
m %*% t(m)       # performs matrix multiplication
crossprod(m,m)   # performs the cross product
rowSums(m)       # calculate the sum for each row
colSums(m)       # calculate the sum for each column
lower.tri(m)     # find the indices that are below the diagonal
m[lower.tri(m)]  # give the lower triangle of m
diag(m)          # the values on the diagonal of m
det(m[1:8,1:8])  # the determinent of m

Apply functions to matrices

apply(m, 1, sum)    # get the sum for each row - same as rowSums(m)
apply(m, 2, sum)    # get the sum for each column - same as colSums(m)

Data frames

multidimensional data structure that allows for multiple data types across columns
think of gene statistics in a genome annotation

gene	start	end	strand	length	annotation
rbcA	num	num	logic	num	character
rbcB
rbcC
rbcD
etc.

important point is that the data is linked across the rows

Let's get some data

Want to work with metadata from a study looking at the gut microbiota of wild populations of Peromyscus leucopis and P. maniculatis
Download data
Take a look at the data
Save folder to your Desktop
Set your working directory to the Desktop

Working with data frames

Be sure to set correct working directory in RStudio

metadata <- read.table(file="wild.metadata.txt", header=T)
head(metadata)      # look at the first lines of table
dim(metadata)
nrow(metadata)
ncol(metadata)
colnames(metadata)
rownames(metadata)  # notice a problem here?
summary(metadata)   # output a summary of each column in table

Check out the Data section of the Environment tab of RStudio
What problems can you see with this output?

Accessing values from data frames

metadata$Age            # output column named "Age"
metadata[,"Age"]        # output column named "Age"
metadata[,7]            # output 4th column ("end")
metadata[,-7]           # output everything but the 4th column ("end")

metadata["23", ]        # output row with Group 6_16m33
metadata[23, ]          # output 23rd row (aka Group 6_16m33)
metadata[-23,]          # output everything but the 23rd row

Let's use these functions to clean up the data

We'd like to use the "Group"" column as the rowname
- Group names must be unique
- Case sensitive - "group" will not work

rownames(metadata) <- metadata$Group
metadata <- metadata[,-1]
head(metadata)

More complicated stuff

What do these commands do?

metadata$Weight[1:5]
metadata[1:5,"Weight"]
metadata["5_31m2",]

What's the difference between these commands?
```
metadata[-23,]
metadata <- metadata[-23,]
```

Can make new columns

metadata[,"sequences"] <- rep(NA, nrow(metadata))

Incorporating logic

Define criteria to set rows you want to keep
Let's get all of the P. leucopis samples

metadata[metadata$SP=="PL",]

Let's get all of the P. leucopis samples from males

metadata[metadata$SP=="PL" & metadata$Sex=="M",]

Factors

Defining categorical variables
In a genome we might think of the forward/reverse orientation, reading frame, dna/protein sequence designation, or annotation category as categorical variables.
Create factors

factor(metadata$ET)
metadata$ET <-factor(metadata$ET)
summary(metadata)
levels(metadata$polymer)

What other variables here would be a factor?

Lists

Similar to data fames, but not necessarily read across rows and not all variables have the same length
Could hold a genome's data within a list:
- name: Character with organism name
- genome.size: Number with number of bases
- start.pos: Vector of start positions for each gene
- end.pos: Vector of end positions for each gene
- gene.name: Name of each gene
- hydrolases: Names of genes that are hydrolases
Allow one to create complex data structures
We'll use these only in passing

An example of where we'll use lists...

Let's get the mean weight for each sex of mouse

aggregate(metadata$Weight, by=metadata$Sex, mean)
aggregate(metadata$Weight, by=list(metadata$Sex), mean)
sex.weight <- aggregate(metadata$Weight, by=list(metadata$Sex), mean)
sex.weight$x

Let's get the mean weight for each sex and species of mouse

aggregate(metadata$Weight, by=list(metadata$Sex, metadata$SP), mean)

For next Tuesday

Start working on new assignment that will be posted this weekend
Read Introduction to Statistics with R (Chapters 1 and 2)

Microbial Informatics

Lecture 05

Announcements

Review

Types of containers

Learning objectives

Matrices

Create and access a matrix...

Numerous operations that can be performed on a matrix

Apply functions to matrices

Data frames

Let's get some data

Working with data frames

Accessing values from data frames

Let's use these functions to clean up the data

More complicated stuff

Incorporating logic

Factors

Lists

An example of where we'll use lists...

For next Tuesday

Questions?