Microbial Informatics

Lecture 02

Patrick D. Schloss, PhD (microbialinformatics.github.io)
Department of Microbiology & Immunology

Learning objectives

  • Gain familiarity with RStudio
  • Describe how to generate a document in markdown and the R flavor of markdown
  • Understand the basics of version control software using git

What is Markdown (*.md)?

  • "Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML)." - John Gruber
  • The advantage is that the source reads sensibly as a plain text document, and conversion software can generate other file formats from it, including html, pdf, and docx
  • Can be rendered using:
    • A Perl script from Gruber
    • RStudio

What's special about R markdown (*.Rmd)?

  • "R Markdown is an authoring format that enables easy creation of dynamic documents, presentations, and reports from R. It combines the core syntax of markdown (an easy-to-write plain text format) with embedded R code chunks that are run so their output can be included in the final document. R Markdown documents are fully reproducible (they can be automatically regenerated whenever underlying R code or data changes)." - RStudio website
  • Key features:
    • Ability to format text
    • Embed R code and output as chunks or inline

Chunks

[Figure: plot generated by an R code chunk (unnamed-chunk-1)]

Inline

  • Let me pick a random number between 1 and 10.
  • Hmmm, I pick 2
  • 2 squared is 4
  • All of the numbers after "1 and 10" (the pick and its square) were generated within R
  • The dynamic component is that you can use R packages to let a user set the min and max values that bound the pick

knitr

  • The chunks and inline approaches are implemented using an R package called knitr
  • We will talk about knitr later when we start digging into R's syntax
  • Documentation and a book (written in knitr) can be found on Yihui Xie's website

Syntax

Version control software (VCS)

  • Example scenarios:
    • You are working by yourself on the data analysis for your thesis. Your advisor wants to see the bleeding edge code for what you are doing.
    • You are working with collaborators on the data analysis for a paper. Since each has a different piece of the story to take care of, everyone has their own workflows, versions of the raw data, etc.
    • You discover a bug in your favorite R package and have found a fix for it
  • Each of these problems can be solved using VCS

Version control defined

"A tool for managing changes to a set of files. Each set of changes creates a new revision of the files; the version control system allows users to recover old revisions reliably, and helps manage conflicting changes made by different users." -- Software Carpentry Workshop

Important points

  • Once changes are committed, it's virtually impossible to lose old versions of text
  • Can be used like the "undo/redo" features in Word, but with branches
  • Keeps a record of who made changes and when
  • Difficult to ignore other people's contributions, since the software notifies you of conflicts
  • git is one type of VCS, but it's the most popular
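
As a rough illustration of the first and third points, the sketch below makes two commits in a throwaway repository, lists who changed what and when, and then recovers the older revision of a file. The repository location, file name, author name, and email are all hypothetical placeholders, not part of the course materials.

```shell
# Hypothetical throwaway repository; every name here is illustrative.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# First revision of the file.
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "First version of notes"

# Second revision overwrites the first in the working directory.
echo "version 2" > notes.txt
git add notes.txt
git commit -q -m "Second version of notes"

# Record of who made each change and when.
git log --format='%an | %ad | %s'

# Recover the previous revision of notes.txt from the commit before HEAD.
git checkout -q HEAD~1 -- notes.txt
cat notes.txt
```

The final `git checkout` only restores the file contents; both commits remain in the history, so the newer revision is not lost either.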

Theory (From ProGit)

Basic workflow...

  • You modify the files in your working directory, possibly adding or deleting files.
  • You stage the files, adding snapshots of them to your staging area. (git add)
  • You commit, which takes the files as they are in the staging area and stores that snapshot in your Git directory. (git commit)
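
The three steps above can be sketched on the command line as follows. This is a minimal example under stated assumptions: the temporary repository, the README.md file, the commit message, and the author details are hypothetical and chosen only for illustration.

```shell
# Hypothetical throwaway repository; every name here is illustrative.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Student"
git config user.email "student@example.com"

# 1. Modify files in the working directory (here, create a new one).
echo "# Thesis analysis" > README.md

# 2. Stage the file, adding a snapshot of it to the staging area.
git add README.md

# 3. Commit: store the staged snapshot in the Git directory.
git commit -q -m "Add README"

# Inspect the result: a clean working directory and one commit in the log.
git status --short
git log --oneline
```

Running `git status` between steps 1 and 2, and again between steps 2 and 3, shows the file moving from untracked to staged to committed.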

For Friday

Questions?