- Project 1 is due on the 24th
- The first hour of tomorrow's class will be a lecture
- No class on Tuesday (stud section)
- Read
(Chapters 2 and 3)*The Art of R Programming*

Patrick D. Schloss, PhD (microbialinformatics.github.io)

Department of Microbiology & Immunology

- Project 1 is due on the 24th
- The first hour of tomorrow's class will be a lecture
- No class on Tuesday (stud section)
- Read
(Chapters 2 and 3)*The Art of R Programming*

- Imagine you need to calculate dilutions for a 96-well plate so that you have 10 ng / ul DNA solutions.
- Is it best to do each calculation on a calculator?
- What might go wrong?
- Most of you would likely use an Excel spreadsheet with a lot of copying and pasting of equations
- Now what if one of your labmates would like to do the same calculations?

- What are the tedious, error prone activities you do in the lab?

- Generating a graph - automation
- Processing data from 96 well plates
- ELISAs
- growth curves

- Primer design
- Assessing DNA sequence quality
- Non-standard statistical analysis
- Simulations
- Simple tasks but with a ton of data (determine the GC content of a genome)

- Calculations can be complicated/difficult and easy to screw up
- Program encapsulate functionality that can be easily repeated and transferred
- With proper documentation it is easy for others to see what format the data need to be in and understand what is going on. Think of the t.test function...

- The remainder of the semester will address more advanced concepts that are necessary for programming within R and that can be easily transferred to other programming languages.
- Each of these concepts has parallels in other languages.

- R is a high level programming language
- To sum a vector of numbers you can just use the "sum" command
- In other languages you'd have to use a for loop
- This is good and bad. Good because it's easier to program, bad because it can make the same code slower in R than the other languages

- The two most significant tools you can make use of, regardless of the programming language, are pseudocode and comments.

- informal description of how a program or component of a program will function. It is designed for human to read, not the computer
- It is a good practice to first write the pseudocode and then the R syntax
- Think of it as you would an outline for a manuscript you are writing

- In R, a comment is indicated with a # sign
- Never too many comments, difficult to remember what you meant in your code a month later or for someone else to read your mind
- Helpful to indicate what is going on at each step, indicate inputs, outputs, defaults
- Can put #'s at the beginning of each line in your pseudocode and then fill in the actual code

Imagine you have a bacterial genome and you want to find all of the open reading frames (ORFs). What might the pseudocode look like?

- Get a list of all of the start and stop codons in all frames
- Determine which frame each codon belongs to
- For every start codon determine the length of the putative gene
- Apply some minimum length criteria
- Select the putative ORFs that are long and non-overlapping
- Output: ORF coordinates, DNA sequence, translated protein sequence

- You can then take each of these steps and make the pseudocode more specific (indenting is helpful)...
- Get a list of all of the start and stop codons in all six frames.

```
for position in the genome
codon = that base plus the next two bases
if codon == "ATG" then save that position as a start codon
if codon == "TAG", "TAA", or "TGA" then save that position as a stop codon
return list of start and stop codon positions
```

- For example, you could find all the start/stop codons in the +1 frame, +2, etc. instead of finding the frame after you find the start stop codons
- Choice is a matter of...
- style
- complexity
- speed
- memory requirement

- Alternate start codons?
- Minimum ORF length
- Allow overlap in ORFs
- Sequence complexity
- etc...

- The code you've written so far has been to put everythign in one file in a stream of consciousness format
- Problems with this approach...
- Repeated code
- Bad organization
- Difficult to debug

- Functions allow us to encapsulate specific tasks.
- Ideally, each function only does one thing (e.g. sum, mean, etc)
- Advantage being that if a new algorithm came up to calcualte the mean then you would just change that section of code instead of that code throughout the manuscript

```
myFunction <- function(arguements){
# insert special sauce
}
myFunction()
```

- The function returns the last command entered
- Usually want to use
`return`

function to be safe

```
myFunction <- function(arguements){
# insert special sauce
return(output)
}
myFunction()
```

```
x <- 1:10
sum.squared <- function(x){
squared <- x^2
sum.sq <- sum(squared)
return(sum.sq)
}
sum.squared(x)
```

```
## [1] 385
```

```
sum.squared <- function(x){
sum.sq <- sum(x^2)
return(sum.sq)
}
sum.squared(x)
```

```
## [1] 385
```

```
sum.squared <- function(x){
return(sum(x^2))
}
sum.squared(x)
```

```
## [1] 385
```

```
sum.squared <- function(x){
sum(x^2)
}
sum.squared(x)
```

```
## [1] 385
```

```
sum.squared <- function(x) sum(x^2)
sum.squared(x)
```

```
## [1] 385
```

```
sum(x^2)
```

```
## [1] 385
```

- Consider where you want to be on this continuum from a three line function to no function at all.
- Readability
- Maintainability
- Ability to expand features

- Either the result of the return command or the last variable that is created
- Think about... this example only returns a single value. How would you get back more complex data like the summary command does?

```
sum.squared <- function(x, mu=0){
sum((x-mu)^2)
}
sum.squared(x)
```

```
## [1] 385
```

```
sum.squared(x, 3)
```

```
## [1] 145
```

- Variables only exist within the braces

```
sum.squared <- function(x){
sum.sq <- sum(x^2)
}
sum.squared(c(1,2,3))
```

- Where is the output???

```
sum.sq
```

```
## Error: object 'sum.sq' not found
```

- The distribution of a sum of the squares of k indepdentently sampled normal random variables, where k is the degrees of freedom
- Procedure to create an empirical distribution
- Draw k random variables from a normal distribution with mean 0.0 and standard deviation of 1.0
- Square each of them
- Sum them
- Repeat many times and keep track of how many times you see each value

- Interested in the proportion of the distribution larger than our test statistic