A Review of MoR and TMM Normalization
Set Up
Let y_{i,g} denote the observed count in a sample i \in \mathcal{I} (with n_i = |\mathcal{I}|) for a gene g \in \mathcal{G} (with n_g = |\mathcal{G}|) realized from a particular negative binomial distribution:
\begin{aligned} Y_{i,g} &\sim \text{NegBinom}(\mu_{i,g}, \phi_{g}) \\ \mu_{i,g} &= \eta_{i,g} \cdot \rho_{i} \end{aligned}
Where \eta_{i,g} denotes the true magnitude of expression and \rho_{i} denotes the “size factor” that scales this expression from its ground-truth.
DESeq2’s Median of Ratios
We first focus on a particular gene g and compare the observed count for a sample i to the geometric average of counts across samples to get a ratio R_{i,g}:
R_{i,g} = \frac{ y_{i,g} }{( \prod_{i \in \mathcal{I}} y_{i,g} )^{ \frac{1}{n_i} } }
We then take the median of these ratios to get our size factor estimates!
\hat{s}_{i} = \underset{g \in \mathcal{G}}{\text{median }} R_{i,g}
edgeR’s Trimmed Mean of M Values
We first “normalize” the observed count profile for a sample i by the total number of counts N_i = \sum_{g \in \mathcal{G}} y_{i,g} in order to get proportions:
y'_{i,g} = \frac{y_{i,g}}{N_i}
We next select a reference sample i^{\dagger} \in \mathcal{I} and compute both the ratio of log-transformed proportions and their weights:
\begin{aligned} R_{i,g} &= \frac{\log_2 y'_{i,g}}{\log_2 y'_{i^{\dagger}, g}} \\ w_{i,g} &= \frac{ N_i - y_{i,g} }{ N_i y_{i,g} } + \frac{ N_{i^{\dagger}} - y_{i^{\dagger},g} }{ N_{i^{\dagger}} y_{i^{\dagger},g} } \end{aligned}
We then filter the genes to a subset \mathcal{G}'_i \subset \mathcal{G} by symmetrically “trimming” away the smallest and largest ratios for a sample i to XX% of the original number (defaults to 70%).
We finally compute the “normalization” factor (distinct from a size factor) by taking the weighted average of these ratios and raising it to the second power:
\log_2 \hat{f}_i = \frac{ \sum_{g \in \mathcal{G}'_i} w_{i,g} R_{i,g} }{ \sum_{g \in \mathcal{G}'_i} w_{i,g} }
The size factor is then found by taking the product with the total number of observed counts:
\hat{s}_i = N_i \hat{f}_i
Note: This is referred to as the “effective library size” by the edgeR authors.
Benchmarking Size Factor Estimation
We’re interested in how well DESeq2, edgeR
and disize estimate the true size factors underlying a
given dataset. The following simulation framework specifies:
- A fixed experimental design.
- The amount of genes measured in the experiment.
- The fraction of genes who are not affected by a given measured covariate.
- The allowed magnitude of the covariate effect (conditional on the gene being affected).
For all experimental designs we construct various scenarios by varying the above parameters and then simulate a dataset corresponding to a given scenario:
# Simulate a single dataset given its settings
simulate_dataset <- function(
design_formula,
metadata,
n_genes,
sparsity,
mgt,
avg) {
# Simulate inverse overdispersion factors
iodisps <- rlnorm(n_genes, log(10), 0.5)
# Split design formula
design <- split_formula(design_formula)
# Construct model matrices...
# Simulate size factors
size_factors <- runif(length(levels(metadata$batch_id)), 0.1, 1.0) |>
setNames(levels(metadata$batch_id))
batch_effect <- size_factors[metadata$batch_id]
# Simulate counts
counts <- sapply(1:n_genes, function(g_i) {
# Simulate fixed-effects
sparse <- runif(ncol(fe_design)) < sparsity
fe_coefs <- ifelse(sparse, 0.0, rnorm(ncol(fe_design), 0, mgt))
# Simulate random-effects
re_coefs <- lapply(lapply(remm$Ztlist, nrow), function(n_coefs) {
sparse <- rep(runif(1) < sparsity, n_coefs)
ifelse(sparse, 0.0, rnorm(n_coefs, sd = mgt))
}) |>
unlist()
# Compute realized magnitude
log_mu <- as.vector(
rnorm(1, log(avg)) +
fe_design %*% fe_coefs +
re_design %*% re_coefs +
log(batch_effect)
)
# Draw counts from negative binomial
rnbinom(nrow(metadata), mu = exp(log_mu), size = iodisps[g_i])
})
# Cast to integers
mode(counts) <- "integer"
list(
counts = counts,
metadata = metadata,
size_factors = log(
size_factors / sum(size_factors) * length(size_factors)
)
)
}Impact on Downstream Differential Expression Analysis
Normalization is a small component of the transcriptomic workflow. The impact on downstream analyses however is the main outcome of interest, as size factors are only used to account for batch-effects.
To evaluate the impact of each normalization method on a typical downstream analysis, we specify a given experimental design and simulate a dataset similar to the benchmarks above:
# Simulate a single dataset given its settings
simulate_dataset <- function(
design_formula,
metadata,
n_genes,
sparsity,
mgt,
avg) {
# Simulate inverse overdispersion factors
iodisps <- rlnorm(n_genes, log(10), 0.5)
# Simulate size factors
size_factors <- runif(length(levels(metadata$batch_id)), 0.1, 1.0) |>
setNames(levels(metadata$batch_id))
batch_effect <- size_factors[metadata$batch_id]
# Construct sparsity mask
mask <- rep(1.0, n_genes)
mask[1:floor(sparsity * n_genes)] <- 0.0
# Construct design matrix
design <- model.matrix(design_formula, metadata)
# Simulate counts
counts <- sapply(1:n_genes, function(g_i) {
# Simulate effects
effects <- rnorm(ncol(design), sd = mgt)
# Compute realized magnitude
log_mu <- as.vector(
rnorm(1, log(avg)) +
design %*% effects * mask[g_i] +
log(batch_effect)
)
# Draw counts from negative binomial
rnbinom(nrow(design), mu = exp(log_mu), size = iodisps[g_i])
})
# Cast to integers
mode(counts) <- "integer"
list(
counts = counts,
metadata = metadata,
size_factors = log(
size_factors / sum(size_factors) * length(size_factors)
),
null = mask == 0.0
)
}The only difference being we now keep track of which genes are truly null to benchmark the type 1 & type 2 error rates.




