Microbial Informatics

Lecture 21

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

Announcements

A new homework has been posted and is due on November 22nd
- work with a partner
- no more than one explicit loop
Will have lab period on Friday
Read Chapter 11 in TAoRP for background material on what is discussed today and Tuesday

Review

There are high level functions for getting structured data in and out of R - write.table, read.table
There are low level functions for input and output that don't require defined data structure - write, scan, readLines

Learning objectives

Understand how to work with and manipulate character variables

Remember...

Strings are atomic variables made up of characters, numbers, punctuation, etc. You can form a string by putting information between " and ".

name <- "pat"
name

## [1] "pat"

What would happen if we did...

name[1]

## [1] "pat"

Note that name is a vector of length one, so name[1] returns the first element of that vector (the whole string), not the first character of the string.
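A quick way to see the difference between the number of elements and the number of characters is to compare length and nchar. This is a minimal sketch; the variable names n.elements, n.characters, and second.element are made up for illustration.

```r
name <- "pat"

n.elements <- length(name)    # 1: the vector holds a single string
n.characters <- nchar(name)   # 3: that string holds three characters
second.element <- name[2]     # NA: there is no second element in the vector

n.elements
n.characters
second.element
```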

But how do we get the first character?

Get substrings with the substr command

substr(x, start, stop)

x is the string of interest
start is the position within the string where you want the substring to start
stop is the position within the string where you want the substring to end

For example...

substr(name, 3, 3)

## [1] "t"

substr(name, 2, 3)

## [1] "at"

substr(name, 1, 1)

## [1] "p"

substr(name, 2, 4)

## [1] "at"
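The last call asks for positions 2 through 4 of a three-character string and quietly returns only what exists. The sketch below pokes at a few more edge cases to show that substr does not raise an error when the positions fall outside the string; the variable names are made up for illustration.

```r
name <- "pat"

past.end <- substr(name, 2, 10)    # "at": stop beyond the end is truncated
beyond <- substr(name, 5, 6)       # "": start beyond the end gives an empty string
backwards <- substr(name, 3, 2)    # "": start after stop gives an empty string

past.end
beyond
backwards
```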

What if we have a vector of names?

names <- c("pat", "sarah", "john", "emily", "mary", "susan")
substr(names, 1,2)

## [1] "pa" "sa" "jo" "em" "ma" "su"

What do we need to know to return the last two characters of each person's name?

To get the length of each person's name use the nchar command:

name.length <- nchar(names)
substr(names, name.length-2, name.length)

## [1] "pat" "rah" "ohn" "ily" "ary" "san"

Oops! What happened? How do we fix it?

name.length <- nchar(names)
substr(names, name.length-1, name.length)

## [1] "at" "ah" "hn" "ly" "ry" "an"

This is what is called a "fence post" (off-by-one) error: the positions from name.length-2 through name.length span three characters, not two.
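One way to avoid the off-by-one mistake is to write the arithmetic once in a small helper. The function last.chars below is hypothetical (it is not part of base R); it is a sketch that assumes n is no larger than the shortest string.

```r
# Return the last n characters of each string in x.
# The substring runs from position (length - n + 1) to length,
# which covers exactly n characters.
last.chars <- function(x, n) {
    len <- nchar(x)
    substr(x, len - n + 1, len)
}

names <- c("pat", "sarah", "john", "emily", "mary", "susan")
last.two <- last.chars(names, 2)
last.three <- last.chars(names, 3)

last.two     # "at" "ah" "hn" "ly" "ry" "an"
last.three   # "pat" "rah" "ohn" "ily" "ary" "san"
```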

Say we have the following names...

names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young", "Kathy Spindler", "Harry Mobley", "Oveta Fuller", "Adam Lauring")

I'd like to generate a vector of first names and a vector of last names so I can rewrite the names in alphabetical "Last, First" format.

We need to split the names using the `strsplit` function

strsplit(x, split)

x is the string
split is the delimiter to split on
The output is a list

Let's try it out

split.names <- strsplit(names, " ")
split.names

## [[1]]
## [1] "Pat"     "Schloss"
## 
## [[2]]
## [1] "Mary"      "O'Riordan"
## 
## [[3]]
## [1] "Vince" "Young"
## 
## [[4]]
## [1] "Kathy"    "Spindler"
## 
## [[5]]
## [1] "Harry"  "Mobley"
## 
## [[6]]
## [1] "Oveta"  "Fuller"
## 
## [[7]]
## [1] "Adam"    "Lauring"
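Because the output is a list, the pieces have to be pulled out element by element. One possible approach, sketched here, is to apply the "[" indexing function to each list element with sapply. It assumes every name has exactly two parts separated by a single space; the vectors first and last are names chosen for this example.

```r
names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young", "Kathy Spindler",
           "Harry Mobley", "Oveta Fuller", "Adam Lauring")
split.names <- strsplit(names, " ")

# "[" is the indexing function; sapply calls it on each list element
first <- sapply(split.names, "[", 1)   # first piece of each name
last <- sapply(split.names, "[", 2)    # second piece of each name

first
last
```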

Where else could we use `strsplit`?

Dates

strsplit("11/8/2012", split="/")

## [[1]]
## [1] "11"   "8"    "2012"

DNA sequences

strsplit("ATGCATCTGA", split="")

## [[1]]
##  [1] "A" "T" "G" "C" "A" "T" "C" "T" "G" "A"
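Once a sequence is split into single characters, ordinary vector tools apply. As a rough sketch, the bases can be tallied with table and the GC content computed from the split vector; the names dna, bases, base.counts, and gc.content are made up for this example.

```r
dna <- "ATGCATCTGA"

# strsplit returns a list, so unlist it to get a plain character vector
bases <- unlist(strsplit(dna, split=""))

base.counts <- table(bases)                                  # count of each base
gc.content <- sum(bases %in% c("G", "C")) / length(bases)   # fraction of G or C

base.counts
gc.content   # 0.4
```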

Say we want to reformat our date to be separated by `-`'s

We can use the paste function to stitch the vector together

paste(x, y, sep=" ", collapse=NULL)

x and y are the vectors to paste together - only one is required
sep is the string placed between the corresponding elements of x and y
collapse is the string used to merge the elements of the resulting vector into a single string
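The difference between sep and collapse is easiest to see side by side. In this sketch, the vectors first and last are made up for illustration: sep works across the arguments, element by element, while collapse works along the resulting vector.

```r
first <- c("Pat", "Mary")
last <- c("Schloss", "O'Riordan")

by.sep <- paste(first, last, sep="_")                  # one string per pair
by.both <- paste(first, last, sep="_", collapse="; ")  # pairs merged into one string
one.vector <- paste(first, collapse="+")               # a single vector collapsed

by.sep       # "Pat_Schloss" "Mary_O'Riordan"
by.both      # "Pat_Schloss; Mary_O'Riordan"
one.vector   # "Pat+Mary"
```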

Try it out with dates

date <- unlist(strsplit("11/8/2012", split="/"))
date

## [1] "11"   "8"    "2012"

paste(date, collapse="-")

## [1] "11-8-2012"

paste("Today is", date, sep=":", collapse="-")

## [1] "Today is:11-Today is:8-Today is:2012"

paste("Today is", paste(date, collapse="-"), sep=": ")

## [1] "Today is: 11-8-2012"

Perhaps you want to paste strings together without using a separator

paste("Today is", paste(date, collapse="-"), sep=": ")

## [1] "Today is: 11-8-2012"

paste("Today is: ", paste(date, collapse="-"), sep="")

## [1] "Today is: 11-8-2012"

paste0("Today is: ", paste(date, collapse="-"))

## [1] "Today is: 11-8-2012"

Let's return to the list of names

Given a list of names in First Last format, can you convert them to Last, First format and then alphabetize them?
Your input should be the names vector
Your output should look like:

##      Oveta Fuller      Adam Lauring      Harry Mobley    Mary O'Riordan 
##   "Fuller, Oveta"   "Lauring, Adam"   "Mobley, Harry" "O'Riordan, Mary" 
##       Pat Schloss    Kathy Spindler       Vince Young 
##    "Schloss, Pat" "Spindler, Kathy"    "Young, Vince"

Let's return to the list of names

last.first <- function(name){
    split.names <- unlist(strsplit(name, " "))
    l.f <- paste(split.names[2], split.names[1], sep=", ")
    return(l.f)
}
convert.names <- sapply(names, last.first)
sort(convert.names)

##      Oveta Fuller      Adam Lauring      Harry Mobley    Mary O'Riordan 
##   "Fuller, Oveta"   "Lauring, Adam"   "Mobley, Harry" "O'Riordan, Mary" 
##       Pat Schloss    Kathy Spindler       Vince Young 
##    "Schloss, Pat" "Spindler, Kathy"    "Young, Vince"
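The original "First Last" strings appear above each result because sapply, when given a character vector, uses its values as names for the output. If those labels are not wanted, one option is the USE.NAMES argument of sapply. This is a sketch of that variation, not the only way to do it; plain.names is a name chosen for this example.

```r
names <- c("Pat Schloss", "Mary O'Riordan", "Vince Young", "Kathy Spindler",
           "Harry Mobley", "Oveta Fuller", "Adam Lauring")

last.first <- function(name){
    split.names <- unlist(strsplit(name, " "))
    paste(split.names[2], split.names[1], sep=", ")
}

# USE.NAMES=FALSE returns an unnamed character vector
plain.names <- sort(sapply(names, last.first, USE.NAMES=FALSE))
plain.names
```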

Formatting text output with `sprintf`

i <- 8
sprintf("the square of %d is %d", i, i^2)

## [1] "the square of 8 is 64"

sprintf("the square root of %d is %6.2f", i, sqrt(i))

## [1] "the square root of 8 is   2.83"

sprintf("%d times 1e6 is %.3e", i, i * 1e6)

## [1] "8 times 1e6 is 8.000e+06"

Things to notice

%s reserves the place for a string
%d reserves the place for an integer
%f reserves the place for a decimal number
%e reserves the place for a number in scientific notation
For %f and %e the format is %m.nf or %m.ne. n indicates the number of digits to include to the right of the decimal point and m indicates the minimum total width to allot to the formatted number, padded with spaces on the left
The output is a string
Of course, you can do all of this in the text block of a knitr document
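A few more cases may help fix the idea. The sketch below shows %s, the effect of the width m, zero padding with %03d, and the fact that sprintf is vectorized over its arguments; the "sample_" labels are made up for illustration.

```r
with.string <- sprintf("%s has %d characters", "pat", nchar("pat"))
padded <- sprintf("%6.2f", pi)     # width 6: two leading spaces
unpadded <- sprintf("%.2f", pi)    # no width given: no padding
labels <- sprintf("sample_%03d", 1:3)   # one string per element of 1:3

with.string   # "pat has 3 characters"
padded        # "  3.14"
unpadded      # "3.14"
labels        # "sample_001" "sample_002" "sample_003"
```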

Another useful way to format output to text

format(x, trim = FALSE, digits = NULL, nsmall = 0L,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "",   big.interval = 3L,
       small.mark = "", small.interval = 5L,
       decimal.mark = ".", zero.print = NULL,
       drop0trailing = FALSE, ...)

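x is the object to format, typically a number or a vector of numbers
trim controls padding: with FALSE (the default) numbers are right-justified to a common width, with TRUE the padding is dropped
digits is the number of significant digits to show
nsmall is the minimum number of digits to the right of the decimal point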
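As a rough illustration of these four arguments, the calls below format a few made-up values. Like sprintf, format returns character strings; the variable names are chosen for this example.

```r
three.sig <- format(3.14159, digits=3)          # three significant digits
two.decimals <- format(2, nsmall=2)             # at least two decimal places
common.width <- format(c(1, 10, 100))           # padded to a common width
trimmed <- format(c(1, 10, 100), trim=TRUE)     # padding suppressed

three.sig      # "3.14"
two.decimals   # "2.00"
common.width   # "  1" " 10" "100"
trimmed        # "1" "10" "100"
```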

Questions?