# Microbial Informatics

## Lecture 21

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• A new homework has been posted and is due on November 22nd
• work with a partner
• no more than one explicit loop
• We will have a lab period on Friday
• Read Chapter 11 in TAoRP for background material on what is discussed today and Tuesday

## Review

• There are high-level functions for getting structured data in and out of R - write.table, read.table
• There are low-level functions for input and output that don't require a defined data structure - write, scan, readLines
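
As a quick refresher, here is a minimal sketch contrasting the two levels. The data frame and the temporary file are made up for illustration; they are not from the homework.

```r
# A small, hypothetical data frame to write out
df <- data.frame(id = 1:3, name = c("pat", "sarah", "john"))
path <- tempfile(fileext = ".tsv")

# High level: write and read a whole table in one call
write.table(df, file = path, sep = "\t", quote = FALSE, row.names = FALSE)
df.in <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Low level: read the same file as unstructured lines of text
lines.in <- readLines(path)
lines.in[1]   # the header line, "id\tname"
```

read.table hands back a data frame, while readLines returns one string per line and leaves the parsing to you.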

## Learning objectives

• Understand how to work with and manipulate character variables

## Remember...

Strings are atomic variables made up of characters, numbers, punctuation, etc. You can form a string by putting information between " and ".

name <- "pat"
name
## [1] "pat"


## What would happen if we did...

name[1]
## [1] "pat"
• Note that name is a vector of length one, so name[1] returns the first element of that vector (the whole string), not the first character of the string.
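
One way to see the difference between the vector and the string it holds is to compare length with nchar. This is only a small sketch:

```r
name <- "pat"
n.elements <- length(name)   # 1: the vector holds a single string
n.chars <- nchar(name)       # 3: that string holds three characters
```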

## But how do we get the first character?

• Get substrings with the substr command
substr(x, start, stop)
• x is the string of interest
• start is the position within the string where you want the substring to start
• stop is the position within the string where you want the substring to end

## For example...

substr(name, 3, 3)
## [1] "t"
substr(name, 2, 3)
## [1] "at"
substr(name, 1, 1)
## [1] "p"
substr(name, 2, 4)
## [1] "at"

## What if we have a vector of names?

names <- c("pat", "sarah", "john", "emily", "mary", "susan")
substr(names, 1,2)
## [1] "pa" "sa" "jo" "em" "ma" "su"

## What do we need to know to return the last two characters of each person's name?

• To get the length of each person's name use the nchar command:
name.length <- nchar(names)
substr(names, name.length-2, name.length)
## [1] "pat" "rah" "ohn" "ily" "ary" "san"
• Oops! What happened? How do we fix it?

## What do we need to know to return the last two characters of each person's name?

• To get the length of each person's name use the nchar command:
name.length <- nchar(names)
substr(names, name.length-1, name.length)
## [1] "at" "ah" "hn" "ly" "ry" "an"
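• This is what is called a "fence post" (off-by-one) error: positions name.length-2 through name.length span three characters, so the start has to be name.length-1 to get two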
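
To guard against the fence post error, the arithmetic can be wrapped in a small helper. The function name last.n below is hypothetical, not something built into R:

```r
# Hypothetical helper: return the last n characters of each string
last.n <- function(x, n) {
  len <- nchar(x)
  # Start at len - n + 1 so that exactly n characters are included
  substr(x, len - n + 1, len)
}

names <- c("pat", "sarah", "john", "emily", "mary", "susan")
last.two <- last.n(names, 2)     # "at" "ah" "hn" "ly" "ry" "an"
last.three <- last.n(names, 3)   # "pat" "rah" "ohn" "ily" "ary" "san"
```

The "+ 1" is the fence post: n characters need a start that is only n - 1 positions before the end.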

## Say we have the following names...

names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young", "Kathy Spindler", "Harry Mobley", "Oveta Fuller", "Adam Lauring")
• I'd like to generate a vector of first and last names so I can rewrite the names in alphabetical "Last, First" format.

## We need to split the names using the strsplit function

strsplit(x, split)
• x is the string
• split is the delimiter to split on
• The output is a list

## Let's try it out

split.names <- strsplit(names, " ")
split.names
## [[1]]
## [1] "Pat"     "Schloss"
##
## [[2]]
## [1] "Mary"      "O'Riordan"
##
## [[3]]
## [1] "Vince" "Young"
##
## [[4]]
## [1] "Kathy"    "Spindler"
##
## [[5]]
## [1] "Harry"  "Mobley"
##
## [[6]]
## [1] "Oveta"  "Fuller"
##
## [[7]]
## [1] "Adam"    "Lauring"
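
Because the output is a list, pulling out the pieces takes list indexing. Here is a sketch on a shortened version of the names vector. Using sapply with "[" is one common approach, and it assumes every name has exactly two parts:

```r
names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young")
split.names <- strsplit(names, " ")

second.person <- split.names[[2]]       # "Mary" "O'Riordan"
second.last <- split.names[[2]][2]      # "O'Riordan"

# Pull the same position out of every element of the list
first <- sapply(split.names, "[", 1)    # "Pat" "Mary" "Vince"
last <- sapply(split.names, "[", 2)     # "Schloss" "O'Riordan" "Young"
```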

## Where else could we use strsplit?

• Dates
strsplit("11/8/2012", split="/")
## [[1]]
## [1] "11"   "8"    "2012"
• DNA sequences
strsplit("ATGCATCTGA", split="")
## [[1]]
##  [1] "A" "T" "G" "C" "A" "T" "C" "T" "G" "A"
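
Once a sequence is split into single bases, the usual vector tools apply. As an illustrative sketch (the GC calculation is an added example, not part of the original slides):

```r
dna <- "ATGCATCTGA"
bases <- unlist(strsplit(dna, split = ""))

base.counts <- table(bases)                  # A = 3, C = 2, G = 2, T = 3
gc.content <- mean(bases %in% c("G", "C"))   # 0.4
```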

## Say we want to reformat our date to be separated by -'s

• We can use the paste function to stitch the vector together
• paste(x, y, sep=" ", collapse=NULL)
• x and y are two vectors - need only supply one
• sep is the string placed between the corresponding elements of x and y when they are pasted to each other
• collapse is the string used to merge the elements of the resulting vector into a single string
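
A small sketch with two vectors may help keep sep and collapse straight. The first and last vectors here are made up for illustration:

```r
first <- c("Pat", "Mary")
last <- c("Schloss", "O'Riordan")

# sep goes between the paired elements; the result is still a vector
full <- paste(first, last)                   # "Pat Schloss" "Mary O'Riordan"
reversed <- paste(last, first, sep = ", ")   # "Schloss, Pat" "O'Riordan, Mary"

# collapse then merges that vector into a single string
one.string <- paste(first, last, sep = " ", collapse = "; ")
# "Pat Schloss; Mary O'Riordan"
```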

## Try it out with dates

date <- unlist(strsplit("11/8/2012", split="/"))
date
## [1] "11"   "8"    "2012"
paste(date, collapse="-")
## [1] "11-8-2012"
paste("Today is", date, sep=":", collapse="-")
## [1] "Today is:11-Today is:8-Today is:2012"
paste("Today is", paste(date, collapse="-"), sep=": ")
## [1] "Today is: 11-8-2012"

## Perhaps you want to paste strings together without using a separator

paste("Today is", paste(date, collapse="-"), sep=": ")
## [1] "Today is: 11-8-2012"
paste("Today is: ", paste(date, collapse="-"), sep="")
## [1] "Today is: 11-8-2012"
paste0("Today is: ", paste(date, collapse="-"))
## [1] "Today is: 11-8-2012"

## Let's return to the list of names

• Given a vector of names in "First Last" format, can you convert them to "Last, First" format and then alphabetize them?
• Your input should be the names vector
• Your output should look like:
##      Oveta Fuller      Adam Lauring      Harry Mobley    Mary O'Riordan
##   "Fuller, Oveta"   "Lauring, Adam"   "Mobley, Harry" "O'Riordan, Mary"
##       Pat Schloss    Kathy Spindler       Vince Young
##    "Schloss, Pat" "Spindler, Kathy"    "Young, Vince"

## Let's return to the list of names

last.first <- function(name){
  split.names <- unlist(strsplit(name, " "))
  l.f <- paste(split.names[2], split.names[1], sep=", ")
  return(l.f)
}
convert.names <- sapply(names, last.first)
sort(convert.names)
##      Oveta Fuller      Adam Lauring      Harry Mobley    Mary O'Riordan
##   "Fuller, Oveta"   "Lauring, Adam"   "Mobley, Harry" "O'Riordan, Mary"
##       Pat Schloss    Kathy Spindler       Vince Young
##    "Schloss, Pat" "Spindler, Kathy"    "Young, Vince"
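
Another possible approach skips the helper function and lets paste do the work on whole vectors. This is just a sketch: the variable name last.first.vec is made up here, and it assumes every name is exactly "First Last" with a single space.

```r
names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young", "Kathy Spindler",
           "Harry Mobley", "Oveta Fuller", "Adam Lauring")

split.names <- strsplit(names, " ")
first <- sapply(split.names, "[", 1)   # first element of each split name
last <- sapply(split.names, "[", 2)    # second element of each split name

# paste is vectorized, so no per-name function is needed
last.first.vec <- paste(last, first, sep = ", ")
sorted.names <- sort(last.first.vec)
```

Unlike the sapply(names, last.first) version, the result carries no element names, so it prints as a plain character vector.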

## Formatting text output with sprintf

i <- 8
sprintf("the square of %d is %d", i, i^2)
## [1] "the square of 8 is 64"
sprintf("the square root of %d is %6.2f", i, sqrt(i))
## [1] "the square root of 8 is   2.83"
sprintf("%d times 1e6 is %.3e", i, i * 1e6)
## [1] "8 times 1e6 is 8.000e+06"

## Things to notice

• %s reserves the place for a string
• %d reserves the place for an integer
• %f reserves the place for a decimal number
• %e reserves the place for a number in scientific notation
• For %f and %e the full form is %m.nf or %m.ne: n indicates the number of digits to the right of the decimal point and m indicates the minimum total width (in characters) to allot to the number
• The output is a string
• Of course, you can do all of this in the text block of a knitr document
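
A few more cases, sketched here for illustration, show %s in action and show that sprintf works on whole vectors:

```r
names <- c("pat", "sarah", "john")

# sprintf is vectorized: one output string per element
msg <- sprintf("%s has %d characters", names, nchar(names))
# "pat has 3 characters" "sarah has 5 characters" "john has 4 characters"

padded <- sprintf("%8.3f", pi)       # "   3.142" (8 wide, 3 decimals)
sci <- sprintf("%.2e", 12345.678)    # "1.23e+04"
```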

## Another useful way to format output to text

format(x, trim = FALSE, digits = NULL, nsmall = 0L,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "",   big.interval = 3L,
       small.mark = "", small.interval = 5L,
       decimal.mark = ".", zero.print = NULL,
       drop0trailing = FALSE, ...)
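• x is the value to format, such as a number or a vector of numbers
• trim controls padding: if FALSE (the default), numbers are padded so they are right justified to a common width; if TRUE, the padding is left off
• digits is the number of significant digits to use; NULL falls back on getOption("digits")
• nsmall is the minimum number of digits to the right of the decimal point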
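
To make these arguments concrete, here is a short sketch. The input values are arbitrary:

```r
sig <- format(3.14159, digits = 3)               # "3.14"
fixed <- format(2, nsmall = 2)                   # "2.00"

# By default, numbers are padded to a common width
padded <- format(c(1, 10, 100))                  # "  1" " 10" "100"
trimmed <- format(c(1, 10, 100), trim = TRUE)    # "1" "10" "100"
```

As with sprintf, the output is a character vector, not a number.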