# Microbial Informatics

## Lecture 28

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• Final project (due 12/16/2014)
• Should be a program that others can use to do something useful (I have ideas if you need one, but really...)
• Would be smart to include a test file
• Create a public repository with documentation in README file and license
• Will have class on Friday, but not next Tuesday

## Review

• We've talked a lot about the R programming language and how we can do it to do useful things and help with our analyses
• The tools you have now will enable you to do many many things
• TDD is a software development process that results in a rapid development cycle

## Learning objectives

• Continue development of TDD skills
• Variable scoping

## TDD is a software development process where you...

1. Create a failing test that describes one a realistic problem you might face
2. Make sure test fails / see which tests fail
3. Write enough code to make sure test passes
4. Run all tests to make sure you haven't broken other test
5. Simplify the code
6. Repeat

## Introducing: testthat

• Problems with what we've been doing
• This process can get tedious
• It's not automated
• testthat is an R package for doing testing
• Put test code into a separate file (test-????.R)
• Code as normal in your R script file (????.R)

## test-pschloss.R

words <- readPaper("../../assignment04/mothur.txt")
expect_that(words, is_a("list"))
expect_that(length(words[[1]]), equals(2056))
expect_that(sum(grepl("\\W", words[[1]])), equals(0))
expect_that(wordCount("mothur", words), equals(25))
expect_that(wordCount("the", words), equals(133))
expect_that(wordCount(c("mothur", "the"), words), equals(c(25, 133)))


## pschloss.R

readPaper <- function(file){
words <- scan(file, what="")
words <- gsub("\\W", "", words)
words <- tolower(words)
return(list(words))
}

wordCount <- function(word, wordList){
word <- tolower(word)
word.count <- numeric()
for(w in word){
word.count[w] <- sum(wordList[[1]]==w)
}
names(word.count) <- NULL
return(word.count)
}


## How to run...

library(testthat)
source("pschloss.R")
test_dir("./")

## ......


## expect_that command options...

• is_true: truth
• is_false: falsehood
• is_a: inheritance
• equals: equality with numerical tolerance
• is_identical_to: exact identity
• is_equivalent_to: equality ignoring attributes
• matches: string matching
• prints_text: output matching
• throws_error: error matching
• gives_warning: warning matching
• shows_message: message matching
• takes_less_than: performance

## Other elements of testthat

• Test file must have a test- prefix
• Can get fancy by defining your own expectation functions
• Can establish specific contexts with environmental settings, etc.
• Can automate testing so that it runs the tests whenever you update the source code

## Exercise

• Here is a toy DNA sequence:  CTACATGATCCTACCGCTCAACTACCAATCGTAACC 
• Create a function that will return a vector containing the start and end positions of the start and stop codons
• Do this in a TDD approach

## Variable scoping

• To this point we've largely ignored the issue of where our variables live and where they're "allowed to go"
• This has to do with a concept of variable scoping and the various environments that are used within R

## Consider this example...

dna <- "ATGCCTGACCTTTGCATACAA"

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
return(comp.rev.sequence)
}

• Where can dna be used?
• Where can getRevComp be used?
• Where can rev.sequence be used?

## What happens if...

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
print(dna, "")  <----
return(comp.rev.sequence)
}
getRevComp(dna)


## What happens if...

rev.sequence

## Error in eval(expr, envir, enclos): object 'rev.sequence' not found


## What happens if...

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
dna <- comp.rev.sequence
return(comp.rev.sequence)
}

dna
getRevComp(dna)
dna

## [1] "ATGCCTGACCTTTGCATACAA"
## [1] "TTGTATGCAAAGGTCAGGCAT"
## [1] "ATGCCTGACCTTTGCATACAA"


## What's happening locally?

ls()

## [1] "dna"        "encoding"   "getRevComp" "inputFile"  "readPaper"
## [6] "wordCount"


## Quick summary

• At the time getRevComp is created, there are the objects rev.sequence and comp.rev.sequence created within getRevComp, plus those objects from the environment getRevComp is sitting in, namely dna
• But it is important to note that the reverse is not true. The outermost environment is not affected by what goes on inside getRevComp (e.g. dna was n ot changed). This means that functions have no side effects
• So you can have name conflicts between the objects within and outside your functions, but this is generally not a good idea. Sometimes people will use l_ as a prefix on all variables within a function.
• Upshot is that objects exist within a heirarchy

## How do we write up the heirarchy?

• As we've seen we can only read variables from up the heirarchy. We can't write variables up the heirarchy
• Unless we use the superassignment (<<-) operator
getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
dna <<- comp.rev.sequence
return(comp.rev.sequence)
}
dna
getRevComp(dna)
dna

## [1] "ATGCCTGACCTTTGCATACAA"
## [1] "TTGTATGCAAAGGTCAGGCAT"
## [1] "TTGTATGCAAAGGTCAGGCAT"


## Should you use the superassignment operator?

• This creates global variables, which are controversial
• Problems caused by potential side effects and difficulty debugging code
• Benefits are that they can make the code easier to read/write
• Be careful