Introduction

Today we’re going to look at a data set from my work. This data comes from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys (a closely related species to rhesus macaques). MSCs are known to have immune modulatory effects, and the main goal of the project is to test whether these stem cells, when activated with interferon gamma (IFNg), can help suppress the body’s immune response against an organ transplant.

Known effects of MSC which may benefit organ transplants

The purpose of this specific data set is simply to determine which genes’ expression is affected when MSCs are treated with IFNg. As such, we have both untreated (Control) samples and IFNg-treated samples from 3 different passages of the cell cultures. So, we’ll be using limma to test each gene for differential expression between Control and IFNg samples.

Preliminary setup

First, let’s install all the packages we’ll need.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
already_installed_packages <- rownames(installed.packages())
needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
need_to_install <- setdiff(needed_packages, already_installed_packages)
if (length(need_to_install) > 0) {
    ## http://bioconductor.org/install/
    BiocManager::install(need_to_install)
}

Then we’ll load those packages.

library(SummarizedExperiment)
library(edgeR)
library(limma)
library(ggplot2)
library(dplyr)

Initial data loading and exploration

Let’s load the data file. This uses the readRDS function, which reads a single R object from a file and returns it. We assign the result to a variable.

sexp <- readRDS(gzcon(url("https://darwinawardwinner.github.io/resume/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS")))
print(sexp)
class: RangedSummarizedExperiment 
dim: 20416 54 
metadata(0):
assays(1): counts
rownames(20416): ENSP00000346839 ENSP00000260356 ... ENSP00000420502 ENSP00000420636
rowData names(8): symbol ensembl_gene_id ... entrezgene source_species
colnames(54): LabA.R104.6C4_65M.P4.Control LabA.R104.6C4_65M.P4.IFNg ...
  LabB.R95.CN8351.P6.Control LabB.R95.CN8351.P6.IFNg
colData names(6): Sample.ID Animal.ID ... Lab Treatment

This object is called a SummarizedExperiment, and its purpose is to hold the experimental data, sample metadata, gene metadata, and any other relevant information for the experiment all in one place. This makes it an excellent starting point for any analysis. In our case, the experimental data consists of a matrix of counts for each gene in each sample. This count matrix was generated by aligning all the RNA sequence reads in each sample to the cyno genome and then counting the number of uniquely-mapped reads that overlap each gene. I have done the aligning and counting process for you and provided the count matrix, since the alignment and counting would take many hours to commplete.

Look at the output of the print statement above. The RNA-Seq read counts are contained in the assays slot. The colData slot contains information on the samples, while the rowRanges slot contains information on the genes. The rows have “ranges” instead of just “data” because in addition to things like the gene symbol and description, the rowRanges slot also contains the genomic coordinates of each gene, that is, the chromosome, start/end positions of each exon, and the strand. This is the same information that was used to count the overlapping reads for each gene, but we won’t be needing it today.

You can read more about SummarizedExperiment objects here.

Structure of a SummarizedExperiment object

Extracting the data and metadata

Ok, let’s pull out the information that we want from this SummarizedExperiment.

First, the sample information:

sample_table <- as.data.frame(colData(sexp))

Next, the gene information:

## Gene metadata
gene_info <- as.data.frame(mcols(sexp))

And finally, the count matrix:

count_matrix <- assay(sexp)

Examine the sample_table, gene_info, and count_matrix objects. Try dim, class, nrow, ncol, and similar. Try viewing the first few rows or columns of each one.

  • What kind of object is each one?
  • What kind of data do they contain?
  • Which dimensions do they have in common? How do these dimensions match up to the dimension of the SummarizedExperiment?

Learning about the experimental design from the sample table

Before we get to the analysis, we need to have a closer look at the sample table, to gain a better understanding of the experimental design.

names(sample_table)
[1] "Sample.ID" "Animal.ID" "Passage"   "Run"       "Lab"       "Treatment"
head(sample_table)
  • How many Labs/Runs/Animals/Passages/Treatments are there in this experiment?
  • How many samples are in each one?

(You may find the unique and table functions useful for answering these questions.)

  • How many Animals are there in each Lab?
  • Are there Animals with samples from both Labs?
table(sample_table$Animal.ID, sample_table$Lab)
         
          A B
  6C4_65M 6 0
  6C63    6 0
  6C7     6 0
  6C84    6 0
  7C37    6 0
  8C8     6 0
  CN7314  0 6
  CN7875  0 6
  CN8351  0 6
  • How many sequencing Runs are there in each Lab?
  • How many Animals are there in each sequencing Run?
  • Are there Animals with samples from multiple sequencing Runs?
  • Is it better to put all of each animal’s samples in the same run, or is it better to distribute each animal’s samples across multiple runs?

Performing a basic limma analysis

Filtering non-expressed genes

Our count matrix contains information on every known gene in the cyno genome, but we only want to look at the genes that are expressed in MSCs. Many genes will not be expressed at all and will have zero or very few counts in all samples, and the statistical method breaks down when it is fed all zeros. So we need to decide on a threshold of gene detection and filter out all the genes whose average expression does not reach this threshold. Let’s see how a threshold of logCPM = 1 looks on a histogram and a QQ plot against normal distribution quantiles.

mean_log_cpm <- aveLogCPM(count_matrix)
filter_threshold <- 1
ggplot() + aes(x=mean_log_cpm) +
    geom_histogram(binwidth=0.2) +
    geom_vline(xintercept=filter_threshold) +
    ggtitle("Histogram of mean expression values")

qqnorm(mean_log_cpm); abline(h=filter_threshold)

  • What do the shapes of these two plots indicate?
  • How can we justify our choice of threshold using these plots?
  • Would this same detection threshold be appropriate for other experiments?
  • What biological or experimental factors might influence the choice of threshold?

Having chosen our threshold, let’s pick the subset of genes whose average expression passes that threshold.

keep_genes <- mean_log_cpm >= 1
filtered_count_matrix <- count_matrix[keep_genes,]
filtered_gene_info <- gene_info[keep_genes,]
nrow(filtered_count_matrix)
[1] 11907
  • What fraction of the genes in the genome are considered expressed according to our threshold?
  • Is this number sensitive to our choice of threshold? How much of a difference does it make if we use a threshold of 0.5 or 1.5 instead?

Normalization

Ok, now that we have some understanding of the experimental design, let’s find some differentially expressed genes! We’ll start by computing the total counts in each sample and then normalizing these total counts using the Trimmed Mean of M-values (TMM) method, provided by the calcNormFactors function.

total_counts <- colSums(count_matrix)
nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
print(nf)
LabA.R104.6C4_65M.P4.Control    LabA.R104.6C4_65M.P4.IFNg LabA.R104.6C4_65M.P5.Control 
                   0.8800745                    0.9898919                    0.8203199 
   LabA.R104.6C4_65M.P5.IFNg LabA.R104.6C4_65M.P6.Control    LabA.R104.6C4_65M.P6.IFNg 
                   0.9952668                    0.8838364                    1.0259290 
   LabA.R107.6C63.P4.Control       LabA.R107.6C63.P4.IFNg    LabA.R107.6C63.P5.Control 
                   1.1900295                    1.3016013                    1.1399379 
      LabA.R107.6C63.P5.IFNg    LabA.R107.6C63.P6.Control       LabA.R107.6C63.P6.IFNg 
                   1.1169711                    1.0104253                    1.2073377 
    LabA.R107.6C7.P4.Control        LabA.R107.6C7.P4.IFNg     LabA.R107.6C7.P5.Control 
                   1.0517275                    1.0998364                    1.0280885 
       LabA.R107.6C7.P5.IFNg     LabA.R107.6C7.P6.Control        LabA.R107.6C7.P6.IFNg 
                   1.0526623                    0.8782156                    1.0749254 
   LabA.R104.6C84.P4.Control       LabA.R104.6C84.P4.IFNg    LabA.R104.6C84.P5.Control 
                   0.6720892                    0.9333467                    0.6241975 
      LabA.R104.6C84.P5.IFNg    LabA.R104.6C84.P6.Control       LabA.R104.6C84.P6.IFNg 
                   0.8950369                    0.6706472                    0.9438313 
   LabA.R107.7C37.P4.Control       LabA.R107.7C37.P4.IFNg    LabA.R107.7C37.P5.Control 
                   1.0769607                    1.1242476                    1.0746693 
      LabA.R107.7C37.P5.IFNg    LabA.R107.7C37.P6.Control       LabA.R107.7C37.P6.IFNg 
                   1.0230317                    1.1620610                    1.0660020 
    LabA.R104.8C8.P4.Control        LabA.R104.8C8.P4.IFNg     LabA.R104.8C8.P5.Control 
                   0.8419412                    1.0470961                    0.7138150 
       LabA.R104.8C8.P5.IFNg     LabA.R104.8C8.P6.Control        LabA.R104.8C8.P6.IFNg 
                   1.0934070                    0.8403635                    1.1033499 
  LabB.R95.CN7314.P4.Control      LabB.R95.CN7314.P4.IFNg   LabB.R95.CN7314.P5.Control 
                   0.9509877                    1.0742606                    0.9817360 
     LabB.R95.CN7314.P5.IFNg   LabB.R95.CN7314.P6.Control      LabB.R95.CN7314.P6.IFNg 
                   1.0921636                    1.2666923                    1.0758749 
  LabB.R95.CN7875.P4.Control      LabB.R95.CN7875.P4.IFNg   LabB.R95.CN7875.P5.Control 
                   0.9013256                    1.0833068                    1.0313982 
     LabB.R95.CN7875.P5.IFNg   LabB.R95.CN7875.P6.Control      LabB.R95.CN7875.P6.IFNg 
                   1.1667053                    1.1918567                    1.3519535 
  LabB.R95.CN8351.P4.Control      LabB.R95.CN8351.P4.IFNg   LabB.R95.CN8351.P5.Control 
                   0.9827153                    0.9560459                    1.0029934 
     LabB.R95.CN8351.P5.IFNg   LabB.R95.CN8351.P6.Control      LabB.R95.CN8351.P6.IFNg 
                   0.9652303                    0.9766554                    0.9307331 
normalized_total_counts <- total_counts * nf
  • What is the range of total_counts/normalization factors? (Try the summary function)
  • What is the average total count per sample? Does this seem high or low? How might we expect this to affect our results?

Fitting the linear models

Now that we have computed our normalization and filtered out non-expressed genes, it’s time to fit our linear models. The basic model fitting function provided by limma is lmFit, which is essentially a shortcut for running lm once for each gene, using the same model formula each time.

First, we will construct our design matrix by selecting an appropriate model formula. This step is performed automatically when you run lm, but for lmFit we must do it manually. For our first model, we’ll keep it simple and use Treatment as the only covariate.

design <- model.matrix(~ Treatment, data=sample_table)
head(design)
                             (Intercept) TreatmentIFNg
LabA.R104.6C4_65M.P4.Control           1             0
LabA.R104.6C4_65M.P4.IFNg              1             1
LabA.R104.6C4_65M.P5.Control           1             0
LabA.R104.6C4_65M.P5.IFNg              1             1
LabA.R104.6C4_65M.P6.Control           1             0
LabA.R104.6C4_65M.P6.IFNg              1             1

Note that the design has two columns, representing the two coefficients in our model. The first one, named (Intercept), represents the expression level of the Control samples, while the second coefficient represents the difference between Control and IFNg treatments. This is the coefficient that we will test for differential expression once we have fit our model. This model will be equivalent to a simple two-sample t-test.

Before we fit our model, we have to run voom to compensate for the heteroskedasticity of the counts. (While we’re at it, we also insert the gene metadata into the resulting object. This will allow limma to include the gene metadata its result tables automatically.)

v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)

v$genes <- filtered_gene_info

Running voom with plot=TRUE produces a diagnostic plot showing the empirical relationship between log count and variance. The fitted curve is a running average of the cloud of points, and this curve is what voom uses to assign a weight to each observed count.

  • According to the plot, which count would have the highest weight (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the log2 scale)
  • Which would have the lowest weight?
  • Is the relationship between log count and variance monotonic? Why might it not be monotonic, if higher counts are supposed to be more precise?
  • Why does voom need to know our experimental design?

The object returned by voom is an EList, which is a complex object similar in spirit to a SummarizedExperiment. You don’t need to know anything about its internals. Just know that it contains both the matrix of log2(CPM) values and the corresponding matrix of weights. It also contains the gene info, which we added manually. Together with the design matrix, this is everything we need to fit our linear models.

fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)

Recall that the eBayes function is responsible for squeezing each gene’s sample variance toward the overall mean variance of the whole data set, in the process trading a bit of bias for stability.

Getting our results

Anyway, now we can perform a “moderated t-test”. “Moderated” means that we are substituting the empirical Bayes squeezed variance for the sample variance in the formula for the t statistic (and also adjusting the degrees of freedom term accordingly). To get our results, we call topTable, telling it which coefficient we wish to test, along with which multiple testing correction to use on the p-values. The n=Inf tells it to give us the results for all the genes in the data set.

results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
head(results)
  • What is the estimated FDR for the most significant gene (the FDR is stored in the adj.P.Val column)?
  • How many genes are significantly differentially expressed at a threshold of 10% FDR?
  • Do you see any genes that make biological sense in the top few results?

Now let’s inspect the p-value histogram. For reference, we’ll add in a horizontal line indicating what a uniform distribution would look like.

ggplot(results) +
    aes(x=P.Value) +
    geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
    geom_hline(yintercept=1) +
    ggtitle("P-value distribution for Control vs Treatment")

  • What can we conclude from this histogram?

Lastly, we can generate an MA plot showing the log2 fold change vs the log2 CPM for each gene:

ggplot(arrange(results, desc(P.Value))) +
    aes(x=AveExpr, y=logFC,
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("MA Plot, IFNg vs Control")

Another common plot is the volcano plot, which plots significance vs log fold change:

ggplot(arrange(results, desc(P.Value))) +
    aes(x=logFC, y=-log10(P.Value),
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("Volcano Plot, IFNg vs Control")

  • Based on these plots, are the changes balanced between up and down, or is there a bias toward a certain direction of change? Does this make sense for the biology of interferon treatment?

Improving the analysis by exploring the data

We got pretty good results above, but how do we know that we analyzed the data correctly? Are there any important covariates that we should have included in our model? How can we figure out which covariates are important? One way is to do a PCA plot. Limma actually provides something called an MDS or PCoA plot, which is slightly different from a PCA plot but serves the same purpose.

mds <- data.frame(plotMDS(v)[c("x", "y")])

Limma’s plotMDS function creates an MDS plot using the sample labels, but this results in a very crowded plot. Luckily, it also returns the x and y coordinates of the plot, which we can use to make our own. Because we want to see how each covariate relates to the principal coordinates, let’s make several versions of our MDS plot, colored by each covariate.

mds <- cbind(mds, sample_table)
p <- ggplot(mds) +
    aes(x=x, y=y) +
    xlab("PC1") + ylab("PC2") +
    geom_point(size=3) +
    coord_fixed(ratio=1) +
    ggtitle("Sample MDS Plot")
for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
    print(p + aes_string(color=i) +
          ggtitle(paste("Sample MDS Plot Colored by", i)))
}

Investigating an outlier

You might have noticed that a few of the IFNg-treated samples cluster with the Control samples. These look like possible outlier samples, and we should investigate them, since they could interfere with our model fit. Let’s try coloring by Animal ID and using a different shape for the IFNg samples.

p + aes(color=Animal.ID, shape=Treatment)

  • Can you identify the misbehaving Animal?

Let’s remove the samples for this animal from the data set and repeat the whole limma analysis.

bad_animal <- "CN8351"
selected_samples <- !(sample_table$Animal.ID %in% bad_animal)

good_sample_table <- droplevels(sample_table[selected_samples,])
good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
good_normalized_total_counts <- normalized_total_counts[selected_samples]

design <- model.matrix(~ Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)

FALSE  TRUE 
 6788  5119 
  • How did removing the outlier animal affect the number of differentially expressed genes?

Hopefully this demonstrates the importance of properly exploring your data before deciding on a model. You can always fit any model to the data, but the model will give you some kind of answer, sometimes even a plausible-looking one, whether or not that model is a good fit for the data. By eliminating an outlier, we significantly increased our ability to detect differential expression. Similarly, the focus of the homework will be to explore the consequences of adding additional covariates into the model formula.

Homework: Basic Model Selection

Try fitting models with Animal.ID, Passage, or both in addition to Treatment. In other words, try all four of the following model formulae:

~ Treatment # (This is the model we just fit in class)
~ Animal.ID + Treatment
~ Passage + Treatment
~ Animal.ID + Passage + Treatment

Use the filtered dataset with the outlier animal samples removed. In addition to testing Treatment for differential expression in each model, also test the other covariates for differential expression by passing a different value for the coef argument to topTable. You will need to pass a vector of all the columns of the design matrix relevant to the covariate. Below, I have included an example of how to fit the model and test for differential expression for the formula ~Passage + Treatment. (Hint: use colnames(design) to help you figure out which coefficients to test.)

design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
table(passage.results$adj.P.Val <= 0.1)

Based on your results for all four models, decide which model best fits the data. Justify your choice with MDS plots and p-value distribution plots that show why you are including or excluding each of Animal.ID and Passage from your model. How do your results for Treatment change when you include your chosen covariates? Does your model give more differentially expressed genes than the one we fit in class? If so, how many more? How many genes are differentially expressed with respect to your chosen covariates? Do these results still hold when using a different signifcance threshold, such as 5% or 1% FDR instead of 10%?

For help fitting models with multiple covariates, see the Limma User’s Guide section 9.4, “Additive Models and Blocking”.

Homework shortcut: selectModel()

designList <- list(
  TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
  TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
  TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
  TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
sm <- selectModel(v, designlist = designList, criterion = "aic")
table(sm$pref)

      TrtOnly       TrtPass     TrtAnimal TrtPassAnimal 
          418            65          7888          3536 
---
title: RNA-seq study of cynomolgus monkey mesenchymal stem cells treated with interferon
  gamma
author: "Ryan C. Thompson"
date: "May 8, 2018"
output:
  html_notebook:
    fig_caption: yes
  pdf_document:
    fig_caption: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, retina=2, cache=TRUE, autodep=TRUE)
```

# Introduction

Today we're going to look at a data set from my work. This data comes
from menchymnal stem cells (MSCs) cultured from 9 cynomolgus monkeys
(a closely related species to rhesus macaques). MSCs are known to have
immune modulatory effects, and the main goal of the project is to test
whether these stem cells, when activated with interferon gamma (IFNg),
can help suppress the body's immune response against an organ
transplant.

![Known effects of MSC which may benefit organ transplants](images/msc1-cropped.png)

The purpose of this specific data set is simply to determine which
genes' expression is affected when MSCs are treated with IFNg. As
such, we have both untreated (Control) samples and IFNg-treated
samples from 3 different passages of the cell cultures. So, we'll be
using limma to test each gene for differential expression between
Control and IFNg samples.

# Preliminary setup

First, let's install all the packages we'll need.

```{r install_pkgs, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
already_installed_packages <- rownames(installed.packages())
needed_packages <- c("edgeR", "limma", "SummarizedExperiment", "ggplot2", "locfit", "dplyr", "statmod")
need_to_install <- setdiff(needed_packages, already_installed_packages)
if (length(need_to_install) > 0) {
    ## http://bioconductor.org/install/
    BiocManager::install(need_to_install)
}
```

Then we'll load those packages.

```{r load_pkgs, results="hide", message=FALSE}
library(SummarizedExperiment)
library(edgeR)
library(limma)
library(ggplot2)
library(dplyr)
```

# Initial data loading and exploration

Let's load the data file. This uses the `readRDS` function, which
reads a single R object from a file and returns it. We assign the
result to a variable.

```{r loading_data}
sexp <- readRDS(gzcon(url("https://darwinawardwinner.github.io/resume/examples/Salomon/Teaching/Cyno-RNASeq-SummarizedExperiment.RDS")))
print(sexp)
```

This object is called a SummarizedExperiment, and its purpose is to
hold the experimental data, sample metadata, gene metadata, and any
other relevant information for the experiment all in one place. This
makes it an excellent starting point for any analysis. In our case,
the experimental data consists of a matrix of counts for each gene in
each sample. This count matrix was generated by aligning all the RNA
sequence reads in each sample to the
[cyno genome](http://www.ncbi.nlm.nih.gov/genome/776) and then
counting the number of uniquely-mapped reads that overlap each gene. I
have done the aligning and counting process for you and provided the
count matrix, since the alignment and counting would take many hours
to commplete.

Look at the output of the print statement above. The RNA-Seq read
counts are contained in the `assays` slot. The `colData` slot contains
information on the samples, while the `rowRanges` slot contains
information on the genes. The rows have "ranges" instead of just
"data" because in addition to things like the gene symbol and
description, the `rowRanges` slot also contains the genomic
coordinates of each gene, that is, the chromosome, start/end positions
of each exon, and the strand. This is the same information that was
used to count the overlapping reads for each gene, but we won't be
needing it today.

You can read more about SummarizedExperiment objects
[here](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html).

![Structure of a SummarizedExperiment object](images/SEXP.svg)

* How many samples does this SummarizedExperiment object contain? How
  many genes does it have? Where did you find this information?
* What happens when you try to select a subset of rows or columns?

## Extracting the data and metadata

Ok, let's pull out the information that we want from this
SummarizedExperiment.

First, the sample information:

```{r extract_sampletable}
sample_table <- as.data.frame(colData(sexp))
```

Next, the gene information:

```{r extract_geneinfo}
## Gene metadata
gene_info <- as.data.frame(mcols(sexp))
```

And finally, the count matrix:

```{r extract_counts}
count_matrix <- assay(sexp)
```

Examine the `sample_table`, `gene_info`, and `count_matrix` objects.
Try `dim`, `class`, `nrow`, `ncol`, and similar. Try viewing the first
few rows or columns of each one.

* What kind of object is each one?
* What kind of data do they contain?
* Which dimensions do they have in common? How do these dimensions
  match up to the dimension of the SummarizedExperiment?

```{r example_sexp_components, include=FALSE}
dim(sexp)
dim(sample_table)
dim(gene_info)
dim(count_matrix)

head(sample_table)
head(gene_info)
## Find the IFNG gene
gene_info[which(gene_info$symbol == "IFNG"),]
count_matrix[1:10,1:6]
```

## Learning about the experimental design from the sample table

Before we get to the analysis, we need to have a closer look at the
sample table, to gain a better understanding of the experimental
design.

```{r head_sample_table}
names(sample_table)
head(sample_table)
```

* How many Labs/Runs/Animals/Passages/Treatments are there in this
  experiment?
* How many samples are in each one?

(You may find the `unique` and `table` functions useful for answering
these questions.)

```{r tabulate_samples, include=FALSE}
lapply(sample_table[-1], unique)
lapply(sample_table[-1], table)
```
* How many Animals are there in each Lab?
* Are there Animals with samples from both Labs?

```{r tabulate_animals_in_labs}
table(sample_table$Animal.ID, sample_table$Lab)
```

* How many sequencing Runs are there in each Lab?
* How many Animals are there in each sequencing Run?
* Are there Animals with samples from multiple sequencing Runs?
* Is it better to put all of each animal's samples in the same run, or
  is it better to distribute each animal's samples across multiple
  runs?

```{r tabulate_more_things, include=FALSE}
table(sample_table$Animal.ID, sample_table$Run)
table(sample_table$Run, sample_table$Lab)
```

# Performing a basic limma analysis #

## Filtering non-expressed genes ##

Our count matrix contains information on every known gene in the cyno
genome, but we only want to look at the genes that are expressed in
MSCs. Many genes will not be expressed at all and will have zero or
very few counts in all samples, and the statistical method breaks down
when it is fed all zeros. So we need to decide on a threshold of gene
detection and filter out all the genes whose average expression does
not reach this threshold. Let's see how a threshold of logCPM = 1
looks on a histogram and a QQ plot against normal distribution
quantiles.

```{r plot_cpm_threshold, fig.cap="logCPM histogram with threshold line"}
mean_log_cpm <- aveLogCPM(count_matrix)
filter_threshold <- 1
ggplot() + aes(x=mean_log_cpm) +
    geom_histogram(binwidth=0.2) +
    geom_vline(xintercept=filter_threshold) +
    ggtitle("Histogram of mean expression values")
```

```{r plot_cpm_qq, fig.cap="logCPM QQ plot with threshold line"}
qqnorm(mean_log_cpm); abline(h=filter_threshold)
```

* What do the shapes of these two plots indicate?
* How can we justify our choice of threshold using these plots?
* Would this same detection threshold be appropriate for other
  experiments?
* What biological or experimental factors might influence the choice
  of threshold?

Having chosen our threshold, let's pick the subset of genes whose
average expression passes that threshold.

```{r filter_genes}
keep_genes <- mean_log_cpm >= 1
filtered_count_matrix <- count_matrix[keep_genes,]
filtered_gene_info <- gene_info[keep_genes,]
nrow(filtered_count_matrix)
```

* What fraction of the genes in the genome are considered expressed
  according to our threshold?
* Is this number sensitive to our choice of threshold? How much of a
  difference does it make if we use a threshold of 0.5 or 1.5 instead?

```{r threshold_sensitivity, include=FALSE}
mean(keep_genes)
mean(mean_log_cpm >= 1.5)
mean(mean_log_cpm >= 0.5)
```

## Normalization ##

Ok, now that we have some understanding of the experimental design,
let's find some differentially expressed genes! We'll start by
computing the total counts in each sample and then normalizing these
total counts using the Trimmed Mean of M-values (TMM) method, provided
by the `calcNormFactors` function.

```{r calcNormFactors}
total_counts <- colSums(count_matrix)
nf <- calcNormFactors(count_matrix, lib.size=total_counts, method="TMM")
print(nf)
normalized_total_counts <- total_counts * nf
```

* What is the range of total_counts/normalization factors? (Try the
  `summary` function)
* What is the average total count per sample? Does this seem high or
  low? How might we expect this to affect our results?

```{r inspect_nf, include=FALSE}
summary(total_counts)
summary(nf)
```

## Fitting the linear models ##

Now that we have computed our normalization and filtered out
non-expressed genes, it's time to fit our linear models. The basic
model fitting function provided by limma is `lmFit`, which is
essentially a shortcut for running `lm` once for each gene, using the
same model formula each time.

First, we will construct our design matrix by selecting an appropriate
model formula. This step is performed automatically when you run `lm`,
but for `lmFit` we must do it manually. For our first model, we'll
keep it simple and use Treatment as the only covariate.

```{r design}
design <- model.matrix(~ Treatment, data=sample_table)
head(design)
```

Note that the design has two columns, representing the two
coefficients in our model. The first one, named `(Intercept)`,
represents the expression level of the Control samples, while the
second coefficient represents the difference between Control and IFNg
treatments. This is the coefficient that we will test for differential
expression once we have fit our model. This model will be equivalent
to a simple two-sample t-test.

Before we fit our model, we have to run `voom` to compensate for the
heteroskedasticity of the counts. (While we're at it, we also insert
the gene metadata into the resulting object. This will allow limma to
include the gene metadata its result tables automatically.)

```{r voom, fig.cap="Voom diagnostic plot"}
v <- voom(filtered_count_matrix, design, lib.size=normalized_total_counts, plot=TRUE)
v$genes <- filtered_gene_info
```

Running `voom` with `plot=TRUE` produces a diagnostic plot showing the
empirical relationship between log count and variance. The fitted
curve is a running average of the cloud of points, and this curve is
what voom uses to assign a weight to each observed count.

* According to the plot, which count would have the highest weight
  (i.e. the lowest expected variance): 4, 30, or 30000? (Remember the
  log2 scale)
* Which would have the lowest weight?
* Is the relationship between log count and variance monotonic? Why
  might it not be monotonic, if higher counts are supposed to be more
  precise?
* Why does `voom` need to know our experimental design?

The object returned by `voom` is an `EList`, which is a complex object
similar in spirit to a SummarizedExperiment. You don't need to know
anything about its internals. Just know that it contains both the
matrix of log2(CPM) values and the corresponding matrix of weights. It
also contains the gene info, which we added manually. Together with
the design matrix, this is everything we need to fit our linear
models.

```{r lmfit}
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
```

Recall that the `eBayes` function is responsible for squeezing each
gene's sample variance toward the overall mean variance of the whole
data set, in the process trading a bit of bias for stability.

## Getting our results ##

Anyway, now we can perform a "moderated t-test". "Moderated" means
that we are substituting the empirical Bayes squeezed variance for the
sample variance in the formula for the t statistic (and also adjusting
the degrees of freedom term accordingly). To get our results, we call
`topTable`, telling it which coefficient we wish to test, along with
which multiple testing correction to use on the p-values. The `n=Inf`
tells it to give us the results for all the genes in the data set.

```{r results}
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
head(results)
```

* What is the estimated FDR for the most significant gene (the FDR is
  stored in the `adj.P.Val` column)?
* How many genes are significantly differentially expressed at a
  threshold of 10% FDR?
* Do you see any genes that make biological sense in the top few
  results?

```{r num_sig, include=FALSE}
results$adj.P.Val[1]
table(results$adj.P.Val <= 0.1)
```

Now let's inspect the p-value histogram. For reference, we'll add in a
horizontal line indicating what a uniform distribution would look
like.

```{r pval_hist, fig.cap="P-value histogram for the first analysis"}
ggplot(results) +
    aes(x=P.Value) +
    geom_histogram(aes(y=..density..), binwidth=0.025, boundary=0) +
    geom_hline(yintercept=1) +
    ggtitle("P-value distribution for Control vs Treatment")
```

* What can we conclude from this histogram?

Lastly, we can generate an MA plot showing the log2 fold change vs the
log2 CPM for each gene:

```{r maplot}
ggplot(arrange(results, desc(P.Value))) +
    aes(x=AveExpr, y=logFC,
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("MA Plot, IFNg vs Control")
```

Another common plot is the volcano plot, which plots significance vs
log fold change:

```{r volcanoplot}
ggplot(arrange(results, desc(P.Value))) +
    aes(x=logFC, y=-log10(P.Value),
         color=ifelse(adj.P.Val <= 0.1, "FDR <= 10%", "FDR > 10%")) +
    geom_point(size=0.1) +
    scale_color_hue(name="Significance") +
    theme(legend.justification=c(1,1), legend.position=c(1,1)) +
    ggtitle("Volcano Plot, IFNg vs Control")
```

* Based on these plots, are the changes balanced between up and down,
  or is there a bias toward a certain direction of change? Does this
  make sense for the biology of interferon treatment?


# Improving the analysis by exploring the data

We got pretty good results above, but how do we know that we analyzed
the data correctly? Are there any important covariates that we should
have included in our model? How can we figure out which covariates are
important? One way is to do a PCA plot. Limma actually provides
something called an MDS or PCoA plot, which is slightly different from
a PCA plot but serves the same purpose.

```{r mds_init, fig.cap="Basic MDS plot from limma"}
mds <- data.frame(plotMDS(v)[c("x", "y")])
```

Limma's `plotMDS` function creates an MDS plot using the sample
labels, but this results in a very crowded plot. Luckily, it also
returns the x and y coordinates of the plot, which we can use to make
our own. Because we want to see how each covariate relates to the
principal coordinates, let's make several versions of our MDS plot,
colored by each covariate.

```{r mds, results="hide"}
mds <- cbind(mds, sample_table)
p <- ggplot(mds) +
    aes(x=x, y=y) +
    xlab("PC1") + ylab("PC2") +
    geom_point(size=3) +
    coord_fixed(ratio=1) +
    ggtitle("Sample MDS Plot")
for (i in c("Lab", "Run", "Animal.ID", "Passage", "Treatment")) {
    print(p + aes_string(color=i) +
          ggtitle(paste("Sample MDS Plot Colored by", i)))
}
```

* Based on the MDS plots, which variables seem to be important, and
  which do not?

## Investigating an outlier

You might have noticed that a few of the IFNg-treated samples cluster
with the Control samples. These look like possible outlier samples,
and we should investigate them, since they could interfere with our
model fit. Let's try coloring by Animal ID and using a different shape
for the IFNg samples.

```{r id_outlier, fig.cap="MDS plot for identifying the outlier animal"}
p + aes(color=Animal.ID, shape=Treatment)
```

* Can you identify the misbehaving Animal?

Let's remove the samples for this animal from the data set and repeat
the whole limma analysis.

```{r filter_outlier_animal}
bad_animal <- "CN8351"
selected_samples <- !(sample_table$Animal.ID %in% bad_animal)

good_sample_table <- droplevels(sample_table[selected_samples,])
good_filtered_count_matrix <- count_matrix[keep_genes,selected_samples]
good_normalized_total_counts <- normalized_total_counts[selected_samples]

design <- model.matrix(~ Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
```

* How did removing the outlier animal affect the number of
  differentially expressed genes?

Hopefully this demonstrates the importance of properly exploring your
data before deciding on a model. You can always fit any model to the
data, but the model will give you some kind of answer, sometimes even
a plausible-looking one, whether or not that model is a good fit for
the data. By eliminating an outlier, we significantly increased our
ability to detect differential expression. Similarly, the focus of the
homework will be to explore the consequences of adding additional
covariates into the model formula.

# Homework: Basic Model Selection

Try fitting models with Animal.ID, Passage, or both in addition to
Treatment. In other words, try all four of the following model
formulae:

```{r model_options, eval=FALSE}
~ Treatment # (This is the model we just fit in class)
~ Animal.ID + Treatment
~ Passage + Treatment
~ Animal.ID + Passage + Treatment
```

Use the filtered dataset with the outlier animal samples removed. In
addition to testing Treatment for differential expression in each
model, also test the other covariates for differential expression by
passing a different value for the `coef` argument to `topTable`. You
will need to pass a vector of all the columns of the design matrix
relevant to the covariate. Below, I have included an example of how to
fit the model and test for differential expression for the formula
`~Passage + Treatment`. (Hint: use `colnames(design)` to help you
figure out which coefficients to test.)

```{r homework_example, eval=FALSE}
design <- model.matrix(~ Passage + Treatment, data=good_sample_table)
v <- voom(good_filtered_count_matrix, design, lib.size=good_normalized_total_counts)
v$genes <- filtered_gene_info
fit <- lmFit(v, design)
fit <- eBayes(fit, robust=TRUE)
results <- topTable(fit, coef="TreatmentIFNg", adjust.method="BH", n=Inf)
table(results$adj.P.Val <= 0.1)
passage.results <- topTable(fit, coef=c("PassageP5", "PassageP6"), n=Inf)
table(passage.results$adj.P.Val <= 0.1)
```

Based on your results for all four models, decide which model best
fits the data. Justify your choice with MDS plots and p-value
distribution plots that show why you are including or excluding each
of Animal.ID and Passage from your model. How do your results for
Treatment change when you include your chosen covariates? Does your
model give more differentially expressed genes than the one we fit in
class? If so, how many more? How many genes are differentially
expressed with respect to your chosen covariates? Do these results
still hold when using a different signifcance threshold, such as 5% or
1% FDR instead of 10%?

For help fitting models with multiple covariates, see the
[Limma User's Guide](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf)
section 9.4, "Additive Models and Blocking".

# Homework shortcut: `selectModel()`

```{r homework_shortcut}
designList <- list(
  TrtOnly=model.matrix(~ Treatment, data=good_sample_table),
  TrtPass=model.matrix(~ Treatment + Passage, data=good_sample_table),
  TrtAnimal=model.matrix(~ Treatment + Animal.ID, data=good_sample_table),
  TrtPassAnimal=model.matrix(~ Treatment + Animal.ID + Passage, data=good_sample_table))
v <- voom(good_filtered_count_matrix, designList$TrtPassAnimal, lib.size=good_normalized_total_counts)
sm <- selectModel(v, designlist = designList, criterion = "aic")
table(sm$pref)
```
