# Microbial Informatics

## Lecture 06

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• Start thinking about your project for the first half of the semester
• Emphasis on data analysis
• Due 10/24/2014 (Friday)
• Feel free to come to office hours to discuss project ideas
• I have some ideas for microbial ecology analysis projects

## Review

• Everything in R is some form of a vector
• Vectors
• List
• Matrix
• Table
• Data table
• Factors

## Learning objectives

• Appreciate the diversity of different data structures
• Understand how to calculate descriptive statistics
• Begin to develop data visualization skills in R

## Factors

• Defining categorical variables
• In a genome we might think of the forward/reverse orientation, reading frame, DNA/protein sequence designation, or annotation category as categorical variables.
• Create factors
```r
metadata <- read.table(file="wild.metadata.txt", header=TRUE)   # read the metadata table; first row holds column names
```
• What other variables here would be a factor?
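
As a minimal sketch of creating a factor, the block below uses a small made-up data frame in place of `wild.metadata.txt`. The `toy` name and all values are hypothetical; only the `Sex` and `Weight` column names come from these slides. Since R 4.0, `read.table()` no longer converts character columns to factors by default, so an explicit conversion like this is often needed.

```r
# Toy stand-in for the metadata table; all values are invented for illustration
toy <- data.frame(
  Sex    = c("M", "F", "F", "M", "F"),
  Weight = c(21.5, 18.0, 19.2, 23.1, 17.6),
  stringsAsFactors = FALSE
)

class(toy$Sex)               # "character": not yet a factor

toy$Sex <- factor(toy$Sex)   # convert the column to a categorical variable
levels(toy$Sex)              # the distinct categories, sorted alphabetically
table(toy$Sex)               # number of observations in each category
as.integer(toy$Sex)          # factors are stored as integer codes that index the levels
```

The same `factor()` call should work on a column of the real `metadata` table, assuming that column holds a limited set of repeated labels.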

## Lists

• Similar to data frames, but the elements are not necessarily read across rows and do not all need to have the same length
• Could hold a genome's data within a list:
• name: Character with organism name
• genome.size: Number with number of bases
• start.pos: Vector of start positions for each gene
• end.pos: Vector of end positions for each gene
• gene.name: Name of each gene
• hydrolases: Names of genes that are hydrolases
• Allow one to create complex data structures
• We'll use these only in passing
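
The genome example above could be sketched as a list like this. The element names follow the bullets; the organism, positions, and gene names are all invented for illustration and do not describe a real genome.

```r
# Hypothetical genome; every value below is invented for illustration
genome <- list(
  name        = "Example organism",
  genome.size = 5000,
  start.pos   = c(100, 1200, 3000),
  end.pos     = c(1000, 2500, 4200),
  gene.name   = c("geneA", "geneB", "geneC"),
  hydrolases  = c("geneB")
)

genome$name                             # access an element by name with $
genome[["start.pos"]]                   # or with double brackets
length(genome$gene.name)                # elements can have different lengths
length(genome$hydrolases)
genome$end.pos - genome$start.pos + 1   # gene lengths from the two position vectors
str(genome)                             # compact overview of the whole structure
```

Unlike a data frame, nothing forces `hydrolases` to have one entry per gene, which is why a list fits this kind of mixed data.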

## An example of where we'll use lists...

• Let's get the mean weight for each sex of mouse
```r
sex.weight <- aggregate(metadata$Weight, by=list(metadata$Sex), mean)   # 'by' must be a list, even with one grouping variable
sex.weight$x                                                            # the mean weight for each sex
```
• Let's get the mean weight for each sex and species of mouse
```r
aggregate(metadata$Weight, by=list(metadata$Sex, metadata$SP), mean)
```

## Descriptive data analysis

• With large datasets it's easy to get lost and not know what the data look like
• Need to develop strategies to get a handle on the data
• Three approaches:
• Visualize data tables
• Summary statistics
• Visualize data

## What types of variables do we have in the table?

• How do we get the column/variable names?
• How do we know the data type for each column?
• Which columns in `metadata` should be
• Continuous?
• Categorical?
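
One way to answer these questions, sketched on a made-up table rather than the real `metadata`. The `toy` data frame, its values, and the `sp1`/`sp2` species codes are hypothetical; only the column names come from these slides.

```r
# Toy stand-in for the metadata table; all values are invented for illustration
toy <- data.frame(
  Sex    = factor(c("M", "F", "F", "M")),
  SP     = factor(c("sp1", "sp1", "sp2", "sp2")),
  Weight = c(21.5, 18.0, 19.2, 23.1)
)

names(toy)                             # column/variable names (colnames(toy) is equivalent)
str(toy)                               # type of each column plus a preview of its values

col.types <- sapply(toy, class)        # named character vector: one class per column
col.types

names(toy)[sapply(toy, is.numeric)]    # candidates for continuous variables
names(toy)[sapply(toy, is.factor)]     # categorical variables
```

A numeric column is not automatically continuous (an ID number, for example), so the last two lines only give a first guess.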

## Visualizing raw data

• Homework asked you to print out the first 5 lines of a data table
• There are several ways to do this...
```r
datatable[c(1,2,3,4,5),]    # explicit method
datatable[1:5,]             # bracket method
head(datatable, n=5)        # first N lines
tail(datatable, n=5)        # last N lines
```

## Summary statistics

What types of data structures do these output?

```r
summary(metadata)                                # how do we get a summary of the overall metadata table?
w.summary <- summary(metadata$Weight)            # how do we get a summary of the weights?
w.summary
median(metadata$Weight)                          # median - what is a median?
w.summary["Median"]                              # the same value, pulled out of the summary by name

quantile(metadata$Weight)                        # weight distribution by quartiles (the default)
quantile(metadata$Weight, probs=seq(0,1,0.1))    # weight distribution by deciles
```
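
To explore the question of what these functions return, here is a small sketch on a made-up weight vector. The `weights` and `toy` objects and their values are hypothetical stand-ins for `metadata$Weight` and `metadata`.

```r
# Invented weights standing in for metadata$Weight
weights <- c(17.6, 18.0, 19.2, 21.5, 23.1)
toy     <- data.frame(Weight = weights, Sex = factor(c("F", "F", "F", "M", "M")))

class(summary(toy))      # summary of a data frame comes back as a "table"

w.summary <- summary(weights)
names(w.summary)         # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
w.summary["Median"]      # elements can be extracted by name, as with a named vector

w.quart <- quantile(weights)
names(w.quart)           # "0%" "25%" "50%" "75%" "100%"
w.quart["50%"]           # the 50% quantile is the median

class(median(weights))   # a plain numeric value
```

Both `summary()` on a numeric vector and `quantile()` return named values, which is why indexing by name works.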

## Let's cross reference columns

```r
aggregate(metadata$Weight, by=list(metadata$Sex), mean)
aggregate(Weight~Sex, data=metadata, mean)                  # notice the difference in output

w.sex_sp <- aggregate(Weight~Sex+SP, data=metadata, mean)   # what type of variable is w.sex_sp?
```
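
The difference between the two `aggregate()` interfaces can be seen on a made-up table. The `toy` data frame, its values, and the `sp1`/`sp2` species codes are hypothetical; the column names come from these slides.

```r
# Toy stand-in for the metadata table; all values are invented for illustration
toy <- data.frame(
  Sex    = c("M", "F", "F", "M", "F", "M"),
  SP     = c("sp1", "sp1", "sp2", "sp2", "sp1", "sp2"),
  Weight = c(22, 18, 16, 20, 20, 24)
)

by.list    <- aggregate(toy$Weight, by=list(toy$Sex), mean)   # list interface
by.formula <- aggregate(Weight ~ Sex, data=toy, mean)         # formula interface

names(by.list)       # generic column names: "Group.1" "x"
names(by.formula)    # column names taken from the formula: "Sex" "Weight"

# Naming the list element gives the grouping column a readable name
by.named <- aggregate(toy$Weight, by=list(Sex=toy$Sex), mean)
names(by.named)      # "Sex" "x"

w.sex_sp <- aggregate(Weight ~ Sex + SP, data=toy, mean)
class(w.sex_sp)      # "data.frame", one row per Sex/SP combination
```

The computed means are the same either way; the formula interface mainly saves the step of renaming columns afterwards.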

## Data visualization

• This is a huge area of exploration
• R is tremendously powerful for generating plots and data visualization tools
• Can generally tell someone used MS Excel by how bad the plots look
• Can certainly generate crap in R, but upside is greater
• Numerous plotting packages are available (listed below), but we will focus on base graphics until the end of the semester:
• lattice
• ggplot2
• rgl
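
As a first taste of base graphics, here is a minimal sketch using only functions that ship with R. The `toy` data frame and its values are made up; with the real table, `metadata` would take its place.

```r
# Toy stand-in for the metadata table; all values are invented for illustration
toy <- data.frame(
  Sex    = factor(c("M", "F", "F", "M", "F", "M")),
  Weight = c(22, 18, 16, 20, 20, 24)
)

# Histogram of a continuous variable; hist() also returns the bin counts it drew
h <- hist(toy$Weight, xlab="Weight", main="Distribution of weights")

# Box plot of a continuous variable split by a categorical one
b <- boxplot(Weight ~ Sex, data=toy, xlab="Sex", ylab="Weight")

# Strip chart showing every individual observation, jittered to reduce overlap
stripchart(Weight ~ Sex, data=toy, vertical=TRUE, method="jitter", pch=19,
           xlab="Sex", ylab="Weight")
```

With so few points, the strip chart is arguably the most honest of the three, since it shows the raw data rather than a summary.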

## For Thursday

• Start working on new assignment that will be posted this weekend
• Assignment due Friday
• Read Introduction to Statistics with R (Chapters 3 and 4)