disize revolves around its only export:
disize::disize! It has the following required
arguments:
design_formula: An R formula that specifies the experimental design. This is essentially describing the model that you would’ve liked to fit in the absence of batch-effects; it is the same R formula you would pass to tools likeDESeq2oredgeRfor differential expression analysis (including predictors likecondition,sex, etc measured in your study). All terms used in this formula should be present inmetadata.counts: A (observation x features) matrix containing the transcript counts. This can be dense or sparse; an internal coercion to a dense matrix will be done after subsetting relevant features and observations.metadata: A dataframe containing the observation-level information(with observations as rows).batch_name: The name of the column inmetadatacontaining the batch identifier.
If the row indices of
countsandmetadatado not correspond to the same observation but are named, you can specifyobs_namewhich will automatically re-ordercountsto match the indices inmetadata.
Note:
disizeencourages users to set then_threadsargument in order to take advantage of parallelization. Experiment with different values ofn_threadsto find the best performance for your dataset.
An example with DESeq2
To see disize in action, we will be analyzing the
following dataset from Li et al., 2025 with DESeq2.
This dataset consists of a purified macrophage subtype(“Mac2”, induced in an ‘activated’ state) that has been partitioned into four groups exposed to different conditions. The authors offer this information on how the samples were processed:
The sorted Mac2 cells were divided into four groups and stimulated at the 3-h time point with the same concentrations as previously described. Then, the RNA was extracted using the RNeasy Plus Micro Kit as per manufacturer instructions. Poly(A)mRNA was isolated using mRNA Capture Beads 2.0 (Yeasen Cat.12629ES, CHN) with two rounds of purification, followed by RNA fragmentation with magnesium ions at 94°C (Yeasen Cat.12340ES97, CHN). RNA sequencing library preparation was performed using the TruSeq RNA Library Prep Kit v2 (Illumina). Sequencing was carried out as paired-end 2×150 bp (PE150) on an Illumina Novaseq™ X Plus (LC-Bio Technologies).
The TruSeq RNA Library Prep Kit involves “tagging” transcripts with barcodes that identify distinct samples, allowing all prepared cDNA libraries to be pooled together before sequencing. Since batch-effects are usually attributed to separate sequencing runs, then we expect very small batch-effects to be present in this dataset if we define a “batch” as the unit subjected to RNA extraction (and all further processing).
Dependencies
suppressPackageStartupMessages({
library(disize)
library(curl)
library(R.utils)
library(data.table)
})
# Set number of threads to use
n_threads <- parallel::detectCores() / 4Downloading the data
# Download counts and construct metadata
counts_path <- curl::curl_download(
url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE273924&format=file&file=GSE273924%5Fraw%5Fcounts%2Etsv%2Egz",
destfile = paste0(tempdir(), "/counts.tsv.gz")
)
counts <- data.table::fread(counts_path)
metadata <- data.frame(
"sample_id" = c(colnames(counts)[-1]),
"condition" = factor(rep(c("control", "lps", "nelps", "ne"), each = 3))
)
# Coerce to formatted matrix
gene_names <- counts$gene_id
counts <- t(as.matrix(counts[, -1]))
colnames(counts) <- gene_namesRunning disize
The metadata contains the information for the
experimental design:
print(metadata)
#> sample_id condition
#> 1 Ctrl1 control
#> 2 Ctrl2 control
#> 3 Ctrl3 control
#> 4 LPS1 lps
#> 5 LPS2 lps
#> 6 LPS3 lps
#> 7 NELPS1 nelps
#> 8 NELPS2 nelps
#> 9 NELPS3 nelps
#> 10 NE1 ne
#> 11 NE2 ne
#> 12 NE3 neFor this dataset, the study was primarily interested in the effect of
condition on expression, thus the formula we would input
into disize is:
design_formula <- ~conditionWe can finally run disize to get the estimated size
factors:
size_factors_1 <- disize::disize(
design_formula,
counts,
metadata,
batch_name = "sample_id",
n_threads = n_threads
)
#> Formatting data...
#> Optimizing over initialization...
#> Estimating size factors... (Max ETA: ~96.6s)
#> Warning in disize::disize(design_formula, counts, metadata, batch_name =
#> "sample_id", : Model did not converge, retrying fit with different
#> initialization... (if you see this multiple times try increasing 'n_iters')
#> Finised in 12.6s!
print(size_factors_1)
#> Ctrl1 Ctrl2 Ctrl3 LPS1 LPS2 LPS3
#> 0.08671084 0.01092470 -0.07757363 0.08014565 0.00838777 0.09461482
#> NE1 NE2 NE3 NELPS1 NELPS2 NELPS3
#> -0.06518790 0.03728017 0.10672543 -0.16491650 -0.13681531 -0.02596185Evidently the batch-effect is indeed small across most samples! The samples “NELPS1” and “NELPS2” seem to have been processed slightly worse, but otherwise the estimated size factors are approximately the same (within ~0.1).
We can confirm these estimates by rerunning disize with
a larger n_feats:
size_factors_2 <- disize::disize(
design_formula,
counts,
metadata,
batch_name = "sample_id",
n_feats = 15000,
n_threads = n_threads
)
#> Formatting data...
#> Optimizing over initialization...
#> Estimating size factors... (Max ETA: ~127.3s)
#> Finised in 21.7s!
print(size_factors_2)
#> Ctrl1 Ctrl2 Ctrl3 LPS1 LPS2 LPS3
#> 0.094264803 0.013007556 -0.070802495 0.069177490 -0.005186325 0.081820682
#> NE1 NE2 NE3 NELPS1 NELPS2 NELPS3
#> -0.043597883 0.050039834 0.113596360 -0.173229669 -0.143793687 -0.031765509Indeed the estimates remain largely the same.
Running DESeq
Once the size factors are inserted into the DESeqDataSet
object, the analysis then proceeds normally.
# Constructing DESeqDataSet object
dds <- DESeq2::DESeqDataSetFromMatrix(
countData = t(counts),
colData = metadata,
design = design_formula
)
DESeq2::sizeFactors(dds) <- exp(size_factors_2) # Insert size factors
# Run analysis
dds <- DESeq2::DESeq(dds)
#> using pre-existing size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
# Print results
print(DESeq2::results(dds))
#> log2 fold change (MLE): condition nelps vs control
#> Wald test p-value: condition nelps vs control
#> DataFrame with 57132 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat pvalue
#> <numeric> <numeric> <numeric> <numeric> <numeric>
#> ENSMUSG00000028180 1.94153e+03 -0.7871856 0.0925977 -8.50113655 1.87743e-17
#> ENSMUSG00000028182 2.81686e+01 -0.8862789 0.5028259 -1.76259586 7.79687e-02
#> ENSMUSG00000028185 3.35067e-01 -0.0402408 4.4073563 -0.00913037 9.92715e-01
#> ENSMUSG00000028184 1.03779e+04 0.7660372 0.0651605 11.75615900 6.56537e-32
#> ENSMUSG00000028187 9.79170e+02 -0.1786581 0.1105460 -1.61614186 1.06064e-01
#> ... ... ... ... ... ...
#> ENSMUSG00000106108 10.697656 6.7530326 1.4302304 4.721640 2.33951e-06
#> ENSMUSG00000042675 1581.164232 -0.0365512 0.0763697 -0.478609 6.32217e-01
#> ENSMUSG00000036959 1429.611345 0.5472428 0.0835723 6.548139 5.82583e-11
#> ENSMUSG00000036958 0.658051 -3.8966751 4.2776587 -0.910936 3.62329e-01
#> ENSMUSG00000042678 5.578370 -5.5971668 1.7015848 -3.289385 1.00407e-03
#> padj
#> <numeric>
#> ENSMUSG00000028180 1.75104e-16
#> ENSMUSG00000028182 1.51915e-01
#> ENSMUSG00000028185 NA
#> ENSMUSG00000028184 1.03493e-30
#> ENSMUSG00000028187 1.97551e-01
#> ... ...
#> ENSMUSG00000106108 9.56172e-06
#> ENSMUSG00000042675 7.67225e-01
#> ENSMUSG00000036959 3.62241e-10
#> ENSMUSG00000036958 NA
#> ENSMUSG00000042678 2.93008e-03