# Microbial Informatics

## Lecture 20

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• A new homework has been posted and is due on November 22nd
• work with a partner
• no more than one explicit loop
• Will have lab period on Friday

## Review

• Loops are slow in R beacuase variables are copied, destroyed, and recreated each time a vector is modified
• the apply and replicate suite of commands allow you to minimize your use of explicit loops

## Here's a problem...

• We have a table of relative abundances...
relabund <- matrix(rep(runif(20000)), ncol=20, nrow=100)
relabund[5,] <- c(runif(10,0,0.4), runif(10, 0.3,0.7))
colnames(relabund) <- c(paste0("Lean", 1:10), paste0("Obese", 1:10))
rownames(relabund) <- paste0("Species", 1:100)
treatments <- c(rep("lean", 10), rep("obese", 10))


##              Lean1      Lean2      Lean3      Lean4     Lean5     Lean6
## Species1 0.7320719 0.61367991 0.44710369 0.93737694 0.1545328 0.7581348
## Species2 0.1587366 0.29431957 0.18656555 0.40873243 0.1751768 0.9519493
## Species3 0.7459371 0.14643946 0.37502507 0.83650445 0.2839269 0.3636480
## Species4 0.2116673 0.03895666 0.66441798 0.09118599 0.2525140 0.8759450
## Species5 0.2533841 0.09087869 0.04734376 0.01431301 0.2551368 0.2670180
## Species6 0.7963445 0.57078249 0.90997548 0.38780057 0.8780423 0.2803217
##               Lean7      Lean8      Lean9     Lean10     Obese1    Obese2
## Species1 0.48390810 0.09841650 0.37743814 0.84738340 0.68320363 0.3305247
## Species2 0.10043277 0.05339326 0.64165944 0.39090620 0.61785717 0.2844650
## Species3 0.02560588 0.63525609 0.02907226 0.08486640 0.03923239 0.8935477
## Species4 0.69394501 0.74943690 0.42591778 0.53587751 0.87731287 0.1407579
## Species5 0.12074160 0.12838108 0.08421323 0.08451048 0.67639246 0.3258745
## Species6 0.72716501 0.82467974 0.69858696 0.77484355 0.47065886 0.5433217
##             Obese3     Obese4     Obese5    Obese6    Obese7    Obese8
## Species1 0.7718421 0.02277831 0.89896404 0.2127805 0.1474924 0.8126880
## Species2 0.6259197 0.43031531 0.88595164 0.4238480 0.3491638 0.7818655
## Species3 0.7366177 0.16466986 0.38247831 0.3801840 0.9368757 0.5069368
## Species4 0.9165949 0.99351859 0.25953107 0.4532728 0.9596510 0.4002479
## Species5 0.6271704 0.34291623 0.57269919 0.5209073 0.6038838 0.6461251
## Species6 0.8557418 0.04982476 0.02552185 0.8431182 0.5089379 0.3121548
##             Obese9    Obese10
## Species1 0.5373861 0.02844547
## Species2 0.3988145 0.45445103
## Species3 0.4414108 0.49137888
## Species4 0.1843113 0.76326338
## Species5 0.3180390 0.52756064
## Species6 0.8503138 0.22655002


## Here's a problem...

• Perform a wilcoxon test on each Species differentiating between lean and obese individuals

• Write R code, without the use of for loops that produces the following output:

• Species5 was significantly different between the two groups
• In this statement, the "Species5" should be produced by r code
• Be sure to correct for multiple comparisons!

## Here's a problem...

• Perform a wilcoxon test on each Species differentiating between lean and obese individuals
test <- function(x, design){
p <- wilcox.test(x~design)\$p.value
return(p)
}

p.values <- apply(relabund, 1, test, design=treatments)


## Here's a problem...

• Write R code, without the use of for loops that produces the following output:
• Species5 was significantly different between the two groups
• rownames(relabund)[padj.values<0.05] was significantly different between the two groups
• In this statement, the "Species5" should be produced by r code
• In this statement, the "rownames(relabund)[padj.values<0.05]" should be produced by r code
• Be sure to correct for multiple comparisons!

## Learning objectives

• Understand how to get data in and out of R

## Integrating with other software

• Problem: We typically have large datasets and need a way to get that information into the program. Similarly, we generally have large amounts of output that we can't just send to the screen.

• Solution: We have seen a number of functions already for reading in data and in the very first homework you had a function to write out a table. R has a number of solutions to do both

## readline: read a single line from the screen

    > readline()
ab de fg
[1] "ab de fg"

Pat Schloss

Pat Schloss
> name
[1] "Pat Schloss"


## scan: reads in a vector from a file or keyboard

• Imagine a file that contains:
    123
2 4 5
53

scan("file1.txt")

## [1] 123   2   4   5  53

• Note that it reads it in as a numeric vector

## scan: reads in a vector from a file or keyboard

• Now imagine a file that contains:
    123
abcd
2 4 5
54

scan("file2.txt")

## Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got 'abcd'


## Try again...

scan("file2.txt", what="")

## [1] "123"  "abcd" "2"    "4"    "5"    "54"

• the default is to assume everything is a number

## Delimeters

• scan reads things into a vector assuming that any white space separates the vector values
scan("file2.txt", what="", sep="\n")

## [1] "123"   "abcd"  "2 4 5" "54"


## Getting data from the keyboard

    > scan("")  #will read from the keyboard

>scan("")
1:  12 5 2 3
5:  23 24
6:
[1] 12  5  2  3 23 24

• Note that to stop reading data in you give it a blank line

## read.table

• Reading in a table, every row has to have the same number of columns
• Imagine a file that contains:
name    age sex
pat     36  M
john    22  M
susan   55  F

read.table("table.txt", header=TRUE)

##    name age sex
## 1   pat  36   M
## 2  john  22   M
## 3 susan  55   F


## read.table

• We'd like to get row names for our table
read.table("table.txt", header=TRUE, row.names=1)

##       age sex
## pat    36   M
## john   22   M
## susan  55   F

• note that these commands would not have worked with the scan function since it is a mixture of text and numbers
• numerous other options for read.table including skipping rows, different delimeters, etc.

## You can read from the internet

data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data",

##   V1 V2 V3 V4    V5     V6    V7 V8    V9   V10  V11 V12 V13
## 1 11  0 71  0 0.260      9 4.600 14     1     1 name   1   0
## 2 19  0 72  0 0.380      6 4.100 14 1.700 0.588 name   1   0
## 3 16  0 55  0 0.260      4 3.420 14     1     1 name   1   0
## 4 57  0 60  0 0.253 12.062 4.603 16 1.450 0.788 name   1   0
## 5 19  1 57  0 0.160     22 5.750 18 2.250 0.571 name   1   0
## 6 26  0 68  0 0.260      5 4.310 12     1 0.857 name   1   0

• Note that read.csv is a special version of read.table for reading data talbe sthat are separated by commas

## Reading in lines of files

c <- file("file2.txt", open="r")

## [1] "123"
## [1] "abcd"
## [1] "2 4 5"
## [1] "54"
## character(0)


## Be sure to close the file when you're done!

close(c)


## print

• Writes to the screen
x <- 1:3
print(x^2)

## [1] 1 4 9

• This is useful for outputting data from within functions if you are trying to debug your code

## cat

• Write to the screen, must provide your own newline character
cat(x, "abc", x^2)

## 1 2 3 abc 1 4 9

cat(x, "abc", x^2, "\n")

## 1 2 3 abc 1 4 9

• Not all implementations of R include a newline character with the cat function, so including the \n is useful to just to be safe

## write.table: structured writing to a file

head(data)

##   V1 V2 V3 V4    V5     V6    V7 V8    V9   V10  V11 V12 V13
## 1 11  0 71  0 0.260      9 4.600 14     1     1 name   1   0
## 2 19  0 72  0 0.380      6 4.100 14 1.700 0.588 name   1   0
## 3 16  0 55  0 0.260      4 3.420 14     1     1 name   1   0
## 4 57  0 60  0 0.253 12.062 4.603 16 1.450 0.788 name   1   0
## 5 19  1 57  0 0.160     22 5.750 18 2.250 0.571 name   1   0
## 6 26  0 68  0 0.260      5 4.310 12     1 0.857 name   1   0

colnames(data) <- LETTERS[1:ncol(data)]
rownames(data) <- 1:nrow(data)
write.table(file="ecg.data", data)


## File output:

head ecg.data

## "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
## "1" "11" "0" "71" 0 "0.260" "9" "4.600" "14" "1" "1" "name" "1" "0"
## "2" "19" "0" "72" 0 "0.380" "6" "4.100" "14" "1.700" "0.588" "name" "1" "0"
## "3" "16" "0" "55" 0 "0.260" "4" "3.420" "14" "1" "1" "name" "1" "0"
## "4" "57" "0" "60" 0 "0.253" "12.062" "4.603" "16" "1.450" "0.788" "name" "1" "0"
## "5" "19" "1" "57" 0 "0.160" "22" "5.750" "18" "2.250" "0.571" "name" "1" "0"
## "6" "26" "0" "68" 0 "0.260" "5" "4.310" "12" "1" "0.857" "name" "1" "0"
## "7" "13" "0" "62" 0 "0.230" "31" "5.430" "22.5" "1.875" "0.857" "name" "1" "0"
## "8" "50" "0" "60" 0 "0.330" "8" "5.250" "14" "1" "1" "name" "1" "0"
## "9" "19" "0" "46" 0 "0.340" "0" "5.090" "16" "1.140" "1.003" "name" "1" "0"

• Again, many options for altering how the output looks

## Unstructured writing to a file

• cat:
    cat("abc\n", file="u")
cat("def\n", file="u", append=T)

• write:
    write(x, file="filename", sep= " ")
write(x, file="filename", append=T, sep= "\t")
write(x, "")

## 1 2 3

• Note that the append=T prevents the file from being written over

## Basic file operations

• file.info("data"): tells file size, creation time, etc
• dir(): gives the directory contents for the specified dir
• list.files(): same as dir()
• file.exists(): tells whether a file exists, duh.
• getwd(): tells you where you are in the file tree
• setwd(): allows you to change the current directory