Microbial Informatics

Lecture 23

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

Announcements

  • Homework is due on November 22nd
    • work with a partner
    • no more than one explicit loop
  • Will have lab period on Friday
  • Read Chapter 11 in TAoRP for background on the material discussed today

Review

  • String manipulation
  • Understand how to work with and manipulate character variables
  • Exercise from Tuesday...

Let's revisit the metadata file

metadata <- read.table(file="wild.metadata.txt", header=T)
head(metadata)
##     Group Date ET Station  SP Sex Age Repro Weight Ear
## 1  5_25m3 5_25  3    BB18  PL   M   J   ABD    7.5  13
## 2  5_25m4 5_25  4     K19  PL   M   A   SCR   16.0  15
## 3  5_26m1 5_26  1     A12  PL   F   A    NE   19.5  14
## 4  5_26m9 5_26  9      M9  PL   F   A    NE   25.0  13
## 5 5_31m11 5_31 11      F2 PMG   F   J    NT   16.0  18
## 6  5_31m2 5_31  2     CC4  PL   M  SA   ABD   15.0  14
  • The Date column is the date that the mice were captured in M_DD format. Can you convert this column into "Month Day, Year" format? Assume the year was 2011.

How to do it...

metadata <- read.table(file="wild.metadata.txt", header=T)

fixDate <- function(m_d, year=2011){
    m.d <- unlist(strsplit(x=m_d, split="_"))
    m.d <- as.numeric(m.d)

    month <- month.name[m.d[1]]
    day <- m.d[2]    # the day is the second element; m.d[1] is the month
    format.date <- paste0(month, " ", day, ", ", year)
    return(format.date)
}

date <- as.character(metadata$Date)
nice.dates <- sapply(date, fixDate)
names(nice.dates) <- NULL
nice.dates
##   [1] "May 25, 2011"  "May 25, 2011"  "May 26, 2011"  "May 26, 2011" 
##   [5] "May 31, 2011"  "May 31, 2011"  "May 31, 2011"  "May 31, 2011" 
##   [9] "June 14, 2011" "June 14, 2011" "June 15, 2011" "June 15, 2011"
##  [13] "June 15, 2011" "June 15, 2011" "June 15, 2011" "June 15, 2011"
##  [17] "June 15, 2011" "June 15, 2011" "June 15, 2011" "June 15, 2011"
##  [21] "June 16, 2011" "June 16, 2011" "June 16, 2011" "June 16, 2011"
##  [25] "June 16, 2011" "June 16, 2011" "June 17, 2011" "June 17, 2011"
##  [29] "June 17, 2011" "June 1, 2011"  "June 1, 2011"  "June 1, 2011" 
##  [33] "June 29, 2011" "June 29, 2011" "June 29, 2011" "June 29, 2011"
##  [37] "June 29, 2011" "June 29, 2011" "June 29, 2011" "June 2, 2011" 
##  [41] "June 2, 2011"  "June 2, 2011"  "June 2, 2011"  "June 30, 2011"
##  [45] "June 30, 2011" "June 30, 2011" "June 30, 2011" "June 30, 2011"
##  [49] "June 30, 2011" "June 5, 2011"  "June 5, 2011"  "June 5, 2011" 
##  [53] "June 5, 2011"  "June 5, 2011"  "July 13, 2011" "July 13, 2011"
##  [57] "July 13, 2011" "July 13, 2011" "July 13, 2011" "July 13, 2011"
##  [61] "July 13, 2011" "July 13, 2011" "July 13, 2011" "July 13, 2011"
##  [65] "July 13, 2011" "July 13, 2011" "July 14, 2011" "July 14, 2011"
##  [69] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [73] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [77] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [81] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [85] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [89] "July 14, 2011" "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
##  [93] "July 2, 2011"  "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
##  [97] "July 2, 2011"  "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
## [101] "July 2, 2011"  "July 3, 2011"  "July 3, 2011"  "July 3, 2011" 
## [105] "July 3, 2011"  "July 3, 2011"  "July 3, 2011"  "July 3, 2011" 
## [109] "July 3, 2011"  "July 3, 2011"  "July 3, 2011"

Learning objectives

  • Making "generic" regular expressions
  • "Generic" find and replace

Motivational questions

  • How would you...

    • find a motif in an amino acid sequence?
    • find a gene?
    • parse a file name to see what type of file it is?
    • list a set of files whose exact names you don't know, but which follow a similar naming format?
  • Regular expressions!

Repeated elements

  • + - Matches the preceding character 1 or more times
grep("a+", c("baa", "woof"))
## [1] 1
  • ? - Matches the preceding character 0 or 1 time
grep("colou?r", c("color", "colour"))
## [1] 1 2
  • * - Matches the preceding character 0 or more times
grep("ab*c", c("ac", "abc", "abbc", "abbbc"))
## [1] 1 2 3 4
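
grep() only reports which elements matched. To compare the three quantifiers side by side, and to see the text that was matched, something like the following sketch may help. The vector name `words` and its contents are made up for illustration.

```r
# Hypothetical test strings, chosen only for illustration
words <- c("ac", "abc", "abbc", "abbbc")

# grepl() returns TRUE/FALSE per element instead of the matching indices
plus     <- grepl("ab+c", words)   # "+" needs at least one "b"
question <- grepl("ab?c", words)   # "?" allows zero or one "b"
star     <- grepl("ab*c", words)   # "*" allows any number of "b", including none

print(rbind(plus, question, star))

# regmatches() with regexpr() extracts the text matched in each element;
# elements without a match ("ac" here) are dropped from the result
matched <- regmatches(words, regexpr("b+", words))
print(matched)
```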

You can define the repeat length

  • {n} - Matches the preceding character exactly n times
grep("ab{2}c", c("ac", "abc", "abbc", "abbbc"))
## [1] 3
  • {n,m} - Matches the preceding character between n and m times (inclusive)
grep("ab{1,2}c", c("ac", "abc", "abbc", "abbbc"))
## [1] 2 3
grep("ab{0,2}c", c("ac", "abc", "abbc", "abbbc"))    # state the lower bound explicitly; ?regex documents {n}, {n,} and {n,m}
## [1] 1 2 3
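
The open-ended form {n,} ("n or more times") is also documented in ?regex. A small sketch, using made-up strings, of how it might be used to flag a homopolymer run:

```r
# Hypothetical test strings, chosen only for illustration
words <- c("ac", "abc", "abbc", "abbbc")

# {2,} means "two or more" of the preceding character
at.least.two <- grep("ab{2,}c", words)
print(at.least.two)

# Flag sequences that contain a run of four or more A's
seqs <- c("ATGAAAAC", "ATGAAC")
poly.a <- grep("A{4,}", seqs)
print(poly.a)
```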

Metacharacters

  • A character with a special meaning that should not be interpreted literally
  • Memorize these...
    • . - Any character
    • \\d - Any digit (0-9)
    • \\w - Any alphanumeric character or underscore
    • \\s - Any whitespace character (<space>, \\t, \\n)
    • \\D - Any non-digit character
    • \\W - Any non-alphanumeric character
    • \\S - Any non-whitespace character

. - Any character

grep("A.G", c("ACG", "ATG", "ATTG"))
## [1] 1 2
grep("A.+G", c("ACG", "ATG", "ATTG"))
## [1] 1 2 3

\\d - Any digit

grep("\\d", c("ATG", "123"))
## [1] 2

\\w - Any alphanumeric character

grep("\\w", c("ATG", "123"))
## [1] 1 2
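
One detail that is easy to miss: \\w also matches the underscore, but not other punctuation or whitespace. A quick check with made-up strings:

```r
# Hypothetical test strings: letters, digits, underscore, hyphen, space
x <- c("ATG", "123", "_", "-", " ")

# \\w covers letters, digits and the underscore
word.char <- grepl("\\w", x)
print(word.char)
```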

\\s - Whitespace characters

grep("\\s", c("A G", "ATG"))
## [1] 1

Opposites

  • \\D - Any non-digit character
grep("\\D", c("ATG", "123"))
## [1] 1
  • \\W - Any non-alphanumeric character
grep("\\W", c("ATG", "123"))
## integer(0)
  • \\S - Any non-whitespace character
grep("\\S", c("A G", "ATG"))    #why does this come up as 1,2?
## [1] 1 2
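
A hint for the question in the last example: grep() reports an element as soon as the pattern matches anywhere in it, and "A G" still contains the non-space characters A and G. To require that every character fits the pattern, one option is to anchor it with ^ (start of string) and $ (end of string). This is a sketch; the $ anchor is not covered in these slides.

```r
seqs <- c("A G", "ATG")

# Matches if the element contains at least one non-space character
any.nonspace <- grepl("\\S", seqs)
print(any.nonspace)

# Matches only if the whole element is made of non-space characters
all.nonspace <- grepl("^\\S+$", seqs)
print(all.nonspace)
```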

How to search for a quantifier?

  • \\ - When placed before a quantifier or metacharacter, it makes the pattern match that character literally
grep("\\+", c("2+2", "2-2", "2.2"))
## [1] 1
grep("\\.", c("2+2", "2-2", "2.2"))
## [1] 3
grep("\\(\\d{3}\\)\\d{3}-\\d{4}", "(734)867-5301")
## [1] 1
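
Backslash escaping is not the only way to match a special character literally. Two alternatives that should give the same result for a single character are putting it in brackets or passing fixed = TRUE to grep(). A sketch with the same strings as above:

```r
exprs <- c("2+2", "2-2", "2.2")

unescaped <- grep(".", exprs)                # "." as a metacharacter matches everything
escaped   <- grep("\\.", exprs)              # escaped with a backslash
bracketed <- grep("[.]", exprs)              # inside brackets "." is literal
fixed     <- grep(".", exprs, fixed = TRUE)  # treat the whole pattern as plain text

print(list(unescaped = unescaped, escaped = escaped,
           bracketed = bracketed, fixed = fixed))
```

fixed = TRUE turns off all regular expression features, so it only suits patterns that are entirely literal.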

Define your own metacharacters!

  • [] - Match any of the characters in the brackets
grep("[ATGCU]", c("ATG", "123"))
## [1] 1
grep("[AG2]", c("ATG", "123"))
## [1] 1 2

Define your own metacharacters

  • [-] - Match any character in the range between the two endpoints, including the endpoints themselves
grep("[a-z]",  c("ATG", "123"))
## integer(0)
grep("[a-zA-Z]",  c("ATG", "123"))
## [1] 1
grep("[a-zA-Z0-9]",  c("ATG", "123"))
## [1] 1 2

Be exclusive...

  • [^] - Match any character that is NOT listed in the brackets
grep("[^AGTC]", c("ATG", "123"))
## [1] 2
grep("[^NU]", c("ATG", "AUG", "ANN"))
## [1] 1 2 3
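
The last example matches all three strings because each one contains at least one character (the A) that is neither N nor U. A negated class is therefore most useful for asking "does anything unexpected appear?". A minimal sketch with made-up sequences:

```r
# Hypothetical sequences, chosen only for illustration
seqs <- c("ATGCCA", "ATGNNA", "AUGCCA")

# A sequence is clean DNA if it contains nothing outside A, C, G, T
is.dna <- !grepl("[^ACGT]", seqs)
print(is.dna)

# Delete everything that is not G or C, then count what is left
gc.count <- nchar(gsub("[^GC]", "", seqs))
print(gc.count)
```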

Replacements with sub/gsub

  • Within the pattern you can use parentheses to capture sub-patterns, then refer to them in the replacement as \\1, \\2, ... (numbered in the order the parentheses open)
gsub("ATG(CAG)", "AAA\\1", "ATGCAG")
## [1] "AAACAG"
gsub("(ATG)(CAG)", "\\1AAA\\2", "ATGCAG")
## [1] "ATGAAACAG"
gsub("(A.G)(C.G)", "\\1AAA\\2", c("ATGCAG","AAGCTG"))
## [1] "ATGAAACAG" "AAGAAACTG"
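
All of the examples above use gsub(), which replaces every match. sub() takes the same arguments but replaces only the first match in each element. A short sketch of the difference, plus a capture-group swap on a made-up M_DD value:

```r
dna <- "ATGATGATG"

first.only <- sub("ATG", "atg", dna)    # only the first ATG is replaced
every.one  <- gsub("ATG", "atg", dna)   # all three are replaced
print(c(first.only, every.one))

# Captured groups can be reordered in the replacement
swapped <- sub("(\\d+)_(\\d+)", "\\2_\\1", "5_25")
print(swapped)
```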

Let's go back to that example from Tuesday...

metadata <- read.table(file="wild.metadata.txt", header=T)

fixDate <- function(m_d, year=2011){
    m.d <- unlist(strsplit(x=m_d, split="_"))
    m.d <- as.numeric(m.d)

    month <- month.name[m.d[1]]
    day <- m.d[2]    # the day is the second element; m.d[1] is the month
    format.date <- paste0(month, " ", day, ", ", year)
    return(format.date)
}

date <- as.character(metadata$Date)
nice.dates <- sapply(date, fixDate)
names(nice.dates) <- NULL
  • What could we do differently now?

New and improved date converter

month <- as.numeric(gsub("^(\\d+)_\\d+", "\\1", metadata$Date))
day <- gsub("^\\d+_(\\d+)", "\\1", metadata$Date)
year <- "2011"
paste0(month.name[month], " ", day, ", ", year)
##   [1] "May 25, 2011"  "May 25, 2011"  "May 26, 2011"  "May 26, 2011" 
##   [5] "May 31, 2011"  "May 31, 2011"  "May 31, 2011"  "May 31, 2011" 
##   [9] "June 14, 2011" "June 14, 2011" "June 15, 2011" "June 15, 2011"
##  [13] "June 15, 2011" "June 15, 2011" "June 15, 2011" "June 15, 2011"
##  [17] "June 15, 2011" "June 15, 2011" "June 15, 2011" "June 15, 2011"
##  [21] "June 16, 2011" "June 16, 2011" "June 16, 2011" "June 16, 2011"
##  [25] "June 16, 2011" "June 16, 2011" "June 17, 2011" "June 17, 2011"
##  [29] "June 17, 2011" "June 1, 2011"  "June 1, 2011"  "June 1, 2011" 
##  [33] "June 29, 2011" "June 29, 2011" "June 29, 2011" "June 29, 2011"
##  [37] "June 29, 2011" "June 29, 2011" "June 29, 2011" "June 2, 2011" 
##  [41] "June 2, 2011"  "June 2, 2011"  "June 2, 2011"  "June 30, 2011"
##  [45] "June 30, 2011" "June 30, 2011" "June 30, 2011" "June 30, 2011"
##  [49] "June 30, 2011" "June 5, 2011"  "June 5, 2011"  "June 5, 2011" 
##  [53] "June 5, 2011"  "June 5, 2011"  "July 13, 2011" "July 13, 2011"
##  [57] "July 13, 2011" "July 13, 2011" "July 13, 2011" "July 13, 2011"
##  [61] "July 13, 2011" "July 13, 2011" "July 13, 2011" "July 13, 2011"
##  [65] "July 13, 2011" "July 13, 2011" "July 14, 2011" "July 14, 2011"
##  [69] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [73] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [77] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [81] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [85] "July 14, 2011" "July 14, 2011" "July 14, 2011" "July 14, 2011"
##  [89] "July 14, 2011" "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
##  [93] "July 2, 2011"  "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
##  [97] "July 2, 2011"  "July 2, 2011"  "July 2, 2011"  "July 2, 2011" 
## [101] "July 2, 2011"  "July 3, 2011"  "July 3, 2011"  "July 3, 2011" 
## [105] "July 3, 2011"  "July 3, 2011"  "July 3, 2011"  "July 3, 2011" 
## [109] "July 3, 2011"  "July 3, 2011"  "July 3, 2011"
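
A variation on the same idea, sketched on a small made-up vector so that it runs without wild.metadata.txt: a single pattern with two capture groups can serve both extractions, and anchoring it at both ends makes sure the whole value is consumed. The vector `dates` and the function name `convertDate` are hypothetical.

```r
# Hypothetical M_DD values standing in for metadata$Date
dates <- c("5_25", "6_1", "7_14")

convertDate <- function(m_d, year = 2011){
    pattern <- "^(\\d+)_(\\d+)$"                    # group 1 = month, group 2 = day
    month <- as.numeric(sub(pattern, "\\1", m_d))   # numeric index into month.name
    day <- sub(pattern, "\\2", m_d)
    paste0(month.name[month], " ", day, ", ", year)
}

nice.dates <- convertDate(dates)
print(nice.dates)
```

Because sub() and paste0() are vectorized, the function handles the whole column at once and needs no sapply().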

How would you write a pattern to...

  • find a motif in an amino acid sequence?
  • find a gene?
  • parse a file name to see what type of file it is?
  • list a set of files whose exact names you don't know, but which follow a similar naming format?
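
Possible starting points for three of these questions, offered as sketches rather than definitive answers. The protein sequence, file names, and temporary directory are made up for illustration. The motif is the N-glycosylation sequon (N, then anything but P, then S or T, then anything but P). Gene finding is left out because it needs more than the pattern features covered here.

```r
# 1. Find a motif in an amino acid sequence
protein <- "MKNGTLLNPSAANASV"    # hypothetical sequence
sequon <- "N[^P][ST][^P]"
motif.start <- as.integer(gregexpr(sequon, protein)[[1]])      # start positions
motif.text <- regmatches(protein, gregexpr(sequon, protein))[[1]]
print(motif.start)
print(motif.text)

# 2. Parse a file name to see what type of file it is:
#    capture the word characters after the last "."
files <- c("reads.trim.fasta", "wild.metadata.txt")
extension <- sub("^.*\\.(\\w+)$", "\\1", files)
print(extension)

# 3. List files that follow a similar format; list.files() takes a
#    regular expression through its pattern argument
demo.dir <- tempfile("regex_demo")    # throwaway directory for the demo
dir.create(demo.dir)
invisible(file.create(file.path(demo.dir, c("a.fasta", "b.fasta", "notes.txt"))))
fasta.files <- list.files(demo.dir, pattern = "\\.fasta$")
print(fasta.files)
```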

Questions?