RNA-seq data analysis bootcamp
This workshop is directed toward life scientists with little to no experience with statistical computing or bioinformatics. This interactive workshop will introduce both the Linux/UNIX operating system and the R statistical computing environment, with a focus on a biological application: analyzing RNA-seq data for differentially expressed genes. The morning session will introduce basic operation in a UNIX environment, and will cover the first steps in an RNA-seq analysis including QC, alignment, and quantitation. The afternoon will introduce the R statistical computing environment, and will cover differential gene expression analysis using Bioconductor.
By the end of the workshop, participants will:
- Be familiar with the UNIX shell, including navigating the filesystem, creating/examining/removing files, getting help, and batch operations.
- Know how to align reads and quantitate gene expression with RNA-seq data.
- Become familiar with the R statistical computing environment, including data types, variables, array manipulation, functions, data frames, data import/export, visualization, and using packages.
- Know what packages to use and what steps to take to analyze RNA-seq data for differentially expressed genes.
Participants will also be exposed to operating in a virtual environment and/or provisioning their own cloud computing resources. This course is sponsored by the Claude Moore Health Sciences Library, and borrows some materials from the Software Carpentry and Data Carpentry projects.
Pre-requisites: See the setup requirements below. Set aside an hour to create the necessary accounts and install the software prior to the workshop. We will not have time to do this during the workshop.
Registration: Registration opens Friday, February 20, 2015 at 9:00am. See below for registration instructions. The workshop is free [1], but requires a $10 registration fee that is refunded after attending the course and submitting a course evaluation.
Instructor / Technical contact: Stephen Turner (s...@virginia.edu)
Logistics / registration contact: Bart Ragon (b...@virginia.edu)
Agenda
The boot camp is a two-part series.
Part I: Monday, March 23, 2015, 8:30am - 12:30pm
Part II: Thursday, March 26, 2015, 1:00pm - 5:00pm
Location: Carter classroom, first floor Health Sciences Library
Instruction will start promptly at 8:30am on the first day. If you have any trouble with setup, please contact Stephen Turner prior to the course. Dr. Turner will also be available at 8:00am that morning, 30 minutes prior to the course for hands-on troubleshooting, but please try to solve any setup problems prior to this time if possible.
Part I:
- 0800-0830: (Optional) Help with setup
- 0830-0900: Using AWS EC2
- 0900-1045: Introduction to Linux
- 1100-1230: QC, alignment and expression quantitation
Part II:
- 1300-1445: Introduction to R
- 1500-1700: QC, differential expression, and visualization with R/Bioconductor
Course Material
- Part I:
- Part II:
Setup
Please bring a laptop with the software below installed (everything is free). You'll also need to create an Amazon Web Services account. I can't overstate how important it is to do this prior to the course - we will not have time during the workshop to troubleshoot installation issues. Please email me (sd...@virginia.edu) if you have any trouble.
Setup checklist:
- Register & activate an AWS account
- Get a free AWS voucher from Dr. Turner
- Download PuTTY (Windows users only)
- Install Cyberduck
- Download and extract course repository zip file
- Install R
- Install RStudio
- Install DESeq2 R package
Software setup, part I: AWS and a terminal
Most bioinformatics is done on a computer running a Linux/UNIX operating system. In the first part of this workshop we will be doing data analysis on a remote Linux server rather than on our own laptops. To do that we need: (1) a remote computer set up with all the software we'll need, (2) a way to connect to that computer, and (3) a way to transfer files to and from that computer.
Since most of us don't have our own Linux server running somewhere, we'll rent a server from Amazon for the duration of this course.
Create AWS account
First, create an Amazon Web Services account: http://aws.amazon.com/. Make sure to register for a Basic (Free) account. You will be required to enter a credit card and billing information -- don't worry, I have free use vouchers for you so you will not be charged. You will need to verify a phone number before you can start using AWS. Note that your Amazon.com account is not connected to your Amazon Web Services account. They are two separate entities with different login and billing information.
Once you have your AWS account set up and can successfully log in to console.aws.amazon.com, email Stephen Turner (sd...@virginia.edu) to obtain a voucher to be able to use AWS for free during and after our course. Use the subject line "RNA-SEQ COURSE AWS VOUCHER" in your email to me. Once you have your voucher, return to the AWS console, click your name at the top right, click "Billing & Cost Management", then on the left, click "Credits". Redeem the promo code I sent you -- this credit will buy enough compute time to complete this workshop and for several future RNA-seq analyses.
If you're interested in trying out EC2 prior to the workshop, watch this short video to learn how to launch your first instance (and make sure to stop the instance after you're done). Whether you do this or not, be sure to stop or terminate any running EC2 instances when you are done with them. After the course, you may deactivate your AWS account if you wish under the "My Account" settings in the AWS console.
Download a terminal emulator (Windows only)
Skip this step if you're using a Mac -- you already have a terminal that you can access by typing "Terminal" into Spotlight, or navigating to Applications -> Utilities -> Terminal.
If you're using Windows you'll need a terminal emulator. Download the latest version of PuTTY here. Ensure that you can run this .exe file (no installation necessary).
Download a file transfer program
Download and install Cyberduck (free and open-source): https://cyberduck.io. We will use this to transfer files back and forth between our local laptops and our remote Linux server running on Amazon.
Note: If using a Mac, download from the website above (free), not from the Mac App Store (paid).
Software setup, part II: R and RStudio
Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.
- Download data. Download the gapminder.csv and malebmi.csv files from bioconnector.org/data. Save them somewhere easy to find. Optionally, open them up in Excel and look around.
- Install R. You'll need R version 3.1.2 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your appropriate version of OS X).
- Install RStudio. Download and install the latest stable version of RStudio Desktop. Alternatively, download the RStudio Desktop v0.99 preview release (the 0.99 preview version has many nice new features that are especially useful for this particular workshop).
- Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel, by default). Note that these commands are case-sensitive. At any point (especially if you've used R/Bioconductor in the past), R may ask whether you want to update old packages with the prompt "Update all/some/none? [a/s/n]:". If you see this, type a at the prompt and hit Enter to update any old packages. If you're using a Windows machine you might get some errors about not having permission to modify the existing libraries -- don't worry about this message. You can avoid this error altogether by running RStudio as an administrator.
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("knitr")
install.packages("rmarkdown")
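If some of these packages are already on your machine, a small helper can install only the ones that are missing. This is a minimal sketch rather than part of the official setup: the names missing_packages and install_missing are hypothetical helpers introduced here for illustration, and the package list is simply copied from the commands above.

```r
# CRAN packages listed in the setup commands above
cran_packages <- c("dplyr", "ggplot2", "tidyr", "knitr", "rmarkdown")

# Hypothetical helper: return the names in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Hypothetical helper: install only the missing packages (needs internet access)
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # Return, invisibly, the names that were passed to install.packages()
  invisible(to_install)
}
```

After pasting the block into the Console, running install_missing(cran_packages) should download only what is not already installed; running install.packages() on each package as shown above works just as well.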
You can check that you've installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(rmarkdown)
These commands may produce some notes or other output, but as long as they work without an error message, you're good to go. If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly. Please do not hesitate to email me prior to the course if you are still having difficulty.
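The same check can be done in one pass, together with the R version requirement from the setup list. The sketch below is an illustration only: check_setup is a hypothetical helper, and the minimum version "3.1.2" is taken from the installation instructions above. It uses requireNamespace(), which loads a package's namespace without attaching it and returns FALSE instead of raising an error when the package is absent.

```r
# Packages the workshop asks you to install from CRAN
required <- c("dplyr", "ggplot2", "tidyr", "knitr", "rmarkdown")

# Hypothetical helper: report whether R is new enough and which packages fail to load
check_setup <- function(pkgs, min_r = "3.1.2") {
  r_ok <- getRversion() >= min_r
  # TRUE for each package whose namespace can be loaded, FALSE otherwise
  loadable <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  list(r_version_ok = r_ok, missing = pkgs[!loadable])
}

status <- check_setup(required)
if (status$r_version_ok && length(status$missing) == 0) {
  message("Setup looks complete.")
} else {
  message("R version OK: ", status$r_version_ok)
  message("Packages still missing: ", paste(status$missing, collapse = ", "))
}
```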
Additionally, you'll need to install a few Bioconductor packages. These packages are installed differently than "regular" R packages from CRAN. Copy and paste these lines of code into your R console.
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("DESeq2")
You can check that you've installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:
library(DESeq2)
If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly. Please do not hesitate to email me prior to the course if you are still having difficulty.
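A note for anyone following these instructions on a newer R installation: the biocLite commands above match the R 3.1.x versions this workshop was written for, while more recent Bioconductor releases are installed through the BiocManager package from CRAN. The sketch below is an assumption-laden illustration, not part of the original course material: bioc_installer and install_deseq2 are hypothetical helpers, and the 3.5.0 cutoff reflects the R version that BiocManager requires, to the best of my knowledge. It does not check whether the legacy biocLite script is still being served.

```r
# Pick the Bioconductor installer that matches the running R version.
# Assumption: BiocManager on R >= 3.5.0, the legacy biocLite script otherwise.
bioc_installer <- function(r_version = getRversion()) {
  if (r_version >= "3.5.0") "BiocManager" else "biocLite"
}

# Hypothetical helper: install DESeq2 with whichever installer applies
install_deseq2 <- function() {
  if (bioc_installer() == "BiocManager") {
    # BiocManager itself is a regular CRAN package
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    BiocManager::install("DESeq2")
  } else {
    # Same commands as in the instructions above
    source("http://bioconductor.org/biocLite.R")
    biocLite("DESeq2")
  }
}
```

Calling install_deseq2() at the Console runs the installation; library(DESeq2) remains the way to confirm that it worked.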
Registration
Register here or use the form below. The workshop is free [1], but requires a $10 registration fee that is refunded after attending the course and submitting a course evaluation.
Registration opens Friday, February 20, 2015 at 9:00am.
[1] The workshop is free, but requires a $10 registration fee that is refunded after attending the course and submitting a course evaluation. We do this to protect you, the person who's truly interested in taking this workshop, from those who would otherwise sign up to hold their spot. You will receive a refund for your registration after attending all parts of the workshop and after you submit an evaluation.