# Microbial Informatics

## Lecture 29

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

## Announcements

• Final project (due 12/16/2014)
• Should be a program that others can use to do something useful (I have ideas if you need one, but really...)
• Would be smart to include a test file
• Create a public repository with documentation in README file and license
• Will have class on Friday, but not next Tuesday

## Review

• We've talked a lot about the R programming language and how we can do it to do useful things and help with our analyses
• The tools you have now will enable you to do many many things
• TDD is a software development process that results in a rapid development cycle

## Learning objectives

• Variable scoping
• Software licensing

## Variable scoping

• To this point we've largely ignored the issue of where our variables live and where they're "allowed to go"
• This has to do with a concept of variable scoping and the various environments that are used within R

## Consider this example...

dna <- "ATGCCTGACCTTTGCATACAA"

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
return(comp.rev.sequence)
}

• Where can dna be used?
• Where can getRevComp be used?
• Where can rev.sequence be used?

## What happens if...

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
print(dna, "")  <----
return(comp.rev.sequence)
}
getRevComp(dna)


## What happens if...

rev.sequence

## Error in eval(expr, envir, enclos): object 'rev.sequence' not found


## What happens if...

getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
dna <- comp.rev.sequence
return(comp.rev.sequence)
}

dna
getRevComp(dna)
dna

## [1] "ATGCCTGACCTTTGCATACAA"
## [1] "TTGTATGCAAAGGTCAGGCAT"
## [1] "ATGCCTGACCTTTGCATACAA"


## What's happening locally?

ls()

## [1] "dna"        "encoding"   "getRevComp" "inputFile"


## Quick summary

• At the time getRevComp is created, there are the objects rev.sequence and comp.rev.sequence created within getRevComp, plus those objects from the environment getRevComp is sitting in, namely dna
• But it is important to note that the reverse is not true. The outermost environment is not affected by what goes on inside getRevComp (e.g. dna was n ot changed). This means that functions have no side effects
• So you can have name conflicts between the objects within and outside your functions, but this is generally not a good idea. Sometimes people will use l_ as a prefix on all variables within a function.
• Upshot is that objects exist within a heirarchy

## How do we write up the heirarchy?

• As we've seen we can only read variables from up the heirarchy. We can't write variables up the heirarchy
• Unless we use the superassignment (<<-) operator
getRevComp <- function(sequence){
rev.sequence <- paste(rev(unlist(strsplit(sequence, ""))), collapse="")
comp.rev.sequence <- chartr("ATGC", "TACG", rev.sequence)
dna <<- comp.rev.sequence
return(comp.rev.sequence)
}
dna
getRevComp(dna)
dna

## [1] "ATGCCTGACCTTTGCATACAA"
## [1] "TTGTATGCAAAGGTCAGGCAT"
## [1] "TTGTATGCAAAGGTCAGGCAT"


## Should you use the superassignment operator?

• This creates global variables, which are controversial
• Problems caused by potential side effects and difficulty debugging code
• Benefits are that they can make the code easier to read/write
• Be careful

## Licensing

• A legally-binding agreement which governs the use and redistribution of software
• Can range from being proprietary (e.g. M$Windows/OS X) to open source (e.g. Linux/R) • You need a license ## Cost of code • Can be$ or free (as in speech) or free (as in beer)
• Beer
• No cost
• No expectations of how used
• Source code not necessarily open (e.g. Java)
• Speech
• Free to use as you want
• Free to see how it works
• Free to distribute how you'd like
• Free to improve

## Why do you need a license on your code?

• "Unlicensed code is closed code, so any open license is better than none"
• If you want others to see and use your code (which is why you're doing it), then you need a license
• Once you select a license, include it with your code - GitHub will provide you with a license
• You generally want to use a Free and Open Source Software (FOSS) license
• You do not necessarily lose copywright protecton

## Use a GNU Public License (GPL)-compatible license

• Guarantees the freedom of users to use, copy, and modify code
• Copyrights are maintained
• May charge for software or (re)distribute for free
• May only be combined with other code that uses GPL

## Use a permissive, BSD/MIT-style license

• BSD/MIT licenses are compatible with a GPL license
• Copyrights are maintained
• May be combined with code using any other license
• Easier for others (including commercial companies) to incorporate your work
• Minimal difference between BSD and MIT licenses

## Conclusion

• Reproducible research is critical to doing good science
• Making data analysis scripts and other code open is critical to reproduciblity
• R is a great tool for doing your analysis

## Going forward

• Learn another language (Python)
• Datbases (SQL)
• Evangelize to your labmates and PI
• Use collaborative features within GitHub
• Develop and distribute your code
• Work on another groups prooject