Using some subset of the signatures listed in TBsignatures
and
specified scoring algorithms, this function runs gene signature profiling
on an input gene expression dataset. It allows for scores to be computed for
these signatures which can be compared using various visualization tools also
provided in the TBSignatureProfiler package.
Usage
runTBsigProfiler(
input,
useAssay = NULL,
signatures = NULL,
algorithm = c("GSVA", "ssGSEA", "ASSIGN", "PLAGE", "Zscore", "singscore"),
combineSigAndAlgorithm = FALSE,
assignDir = NULL,
outputFormat = NULL,
parallel.sz = 0,
ASSIGNiter = 1e+05,
ASSIGNburnin = 50000,
ssgsea_norm = TRUE,
update_genes = TRUE
)
Source
Profiling for the Z-Score, PLAGE, GSVA, ssGSEA algorithms are all
conducted with the Bioconductor GSVA
package. Profiling for the
singscore algorithm is conducted with the Bioconductor singscore
package.
Arguments
- input
an input data object of the class
SummarizedExperiment
,data.frame
, ormatrix
containing gene expression data. Required.- useAssay
a character string specifying the assay to use for signature profiling when
input
is aSummarizedExperiment
. Required only for input data of the classSummarizedExperiment
. If null, the assay used will be"counts"
. The default isNULL.
- signatures
a
list
of signatures to run with their associated genes. This list should be in the same format asTBsignatures
, included in the TBSignatureProfiler package. Ifsignatures = NULL
, the default set of signaturesTBsignatures
list is used. For details, run?TBsignatures
. If <2 genes in a signature are present in the sample, that signature will not be evaluated and will not be present in the resulting SE object. The default isNULL
.- algorithm
a vector of algorithms to run, or character string if only one is desired. The default is
c("GSVA", "ssGSEA", "ASSIGN", "PLAGE", "Zscore", "singscore")
.- combineSigAndAlgorithm
logical, if
TRUE
, output row names will be of the form_ . It must be set to TRUE
if theouputFormat
will be a SummarizedExperiment andlength(algorithm) > 1
. It will always beFALSE
if only one algorithm is selected. IfFALSE
, there will be a column named algorithm' that lists which algorithm is used, and a column named 'pathway' that lists the signature profiled. IfNULL
, and one algorithm was used, the algorithm will not be listed. The default isFALSE
.- assignDir
a character string naming a directory to save intermediate ASSIGN results if
algorithm
specifies"ASSIGN"
. The default isNULL
, in which case intermediate results will not be saved.- outputFormat
a character string specifying the output data format. Possible values are
"SummarizedExperiment"
,"matrix"
, or"data.frame"
. The default is to return the same type as theinput
object.- parallel.sz
an integer identifying the number of processors to use when running the calculations in parallel for the GSVA and ssGSEA algorithms. If
parallel.sz = 0
, all cores are used. The default is0
.- ASSIGNiter
an integer indicating the number of iterations to use in the MCMC for the ASSIGN algorithm. The default is
100,000
.- ASSIGNburnin
an integer indicating the number of burn-in iterations to use in the MCMC for the ASSIGN algorithm. These iterations are discarded when computing the posterior means of the model parameters. The default is
50,000
.- ssgsea_norm
logical, passed to
GSVA::gsva()
. When parameteralgorithm = "ssgsea"
,the profiler runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. Whenssgsea.norm = FALSE
, this last normalization step is skipped. The default isTRUE
.- update_genes
logical, denotes whether gene names from
signatures
and the rownames ofinput
should be checked for accuracy usingHGNChelper::checkGeneSymbols()
. The mapping assumes genes are from humans and will keep unmapped genes as the original input gene name. Default isTRUE
.
Value
A SummarizedExperiment
object, data.frame
, or
matrix
of signature profiling results. The returned object will be
of the format specified in outputFormat
.
If input
is a SummarizedExperiment
and
outputFormat = "SummarizedExperiment"
, then the output will retain
any input information stored in the input colData. In general, if
outputFormat = "SummarizedExperiment"
then columns in the colData
will include the scores for each desired signature with samples on the rows.
If input
is a data.frame
or matrix
, then the returned
object will have signatures on the rows and samples on the columns.
References
Barbie, D.A., Tamayo, P., Boehm, J.S., Kim, S.Y., Moody, S.E., Dunn, I.F., Schinzel, A.C., Sandy, P., Meylan, E., Scholl, C., et al. (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 462, 108-112. doi: 10.1038/nature08460.
Foroutan, M. et al. (2018). Single sample scoring of molecular phenotypes. BMC Bioinformatics, 19. doi: 10.1186/s12859-018-2435-4.
Lee, E. et al. (2008). Inferring pathway activity toward precise disease classification. PLoS Comp Biol, 4(11):e1000217. doi: 10.1371/journal.pcbi.1000217
Shen, Y. et al. (2015). ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways. Bioinformatics, 31, 1745-1753. doi: 10.1093/bioinformatics/btv031.
Subramanian, A. et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102, 15545-15550. doi: 10.1073/pnas.0506580102.
Tomfohr, J. et al. (2005). Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics, 6:225. doi: 10.1186/1471-2105-6-225
Examples
## Using a data.frame input/output
# Create some toy data to test Zak_RISK_16 signature, using 5 samples with low
# expression & five samples with high expression of the signatures genes.
df_testdata <- as.data.frame(rbind(matrix(c(rnorm(80), rnorm(80) + 5), 16, 10,
dimnames = list(TBsignatures$Zak_RISK_16,
paste0("sample", seq_len(10)))),
matrix(rnorm(1000), 100, 10,
dimnames = list(paste0("gene", seq_len(100)),
paste0("sample", seq_len(10))))))
res <- runTBsigProfiler(input = df_testdata,
signatures = TBsignatures["Zak_RISK_16"],
algorithm = c("GSVA", "ssGSEA"),
combineSigAndAlgorithm = FALSE,
parallel.sz = 1)
#> Parameter update_genes is TRUE. Gene names will be updated.
#> Running GSVA
#> Running ssGSEA
#> [1] "Calculating ranks..."
#> [1] "Calculating absolute values from ranks..."
#> [1] "Normalizing..."
subset(res, res$pathway == "Zak_RISK_16")
#> pathway algorithm sample1 sample2
#> 1 Zak_RISK_16 GSVA -0.534666666666668 -0.600632318501172
#> 2 Zak_RISK_16 ssGSEA 0.192383825351229 0.069521160142643
#> sample3 sample4 sample5 sample6
#> 1 -0.736618705035972 -0.631807228915664 -0.52753894080997 0.496305732484076
#> 2 -0.166684761621847 0.0714328935283431 0.114638189730784 0.833315238378153
#> sample7 sample8 sample9 sample10
#> 1 0.681304347826087 0.499178082191781 0.586710239651416 0.540224719101124
#> 2 0.833315238378153 0.832308838099057 0.833315238378153 0.833315238378153
## Using a SummarizedExperiment input/output
# The TB_indian SummarizedExperiment data is included in the package.
GSVA_res <- runTBsigProfiler(input = TB_indian,
useAssay = "logcounts",
signatures = TBsignatures["Zak_RISK_16"],
algorithm = c("GSVA"),
combineSigAndAlgorithm = FALSE,
parallel.sz = 1)
#> Parameter update_genes is TRUE. Gene names will be updated.
#> Running GSVA
#> Warning: 1 genes with constant values throughout the samples.
#> Warning: Genes with constant values are discarded.
GSVA_res$Zak_RISK_16
#> [1] 0.6255317 0.6709665 0.4502341 0.7240338 0.5423896 0.5076658
#> [7] 0.6268138 -0.3785315 0.5837650 0.3315927 0.5688302 0.5667150
#> [13] 0.6943893 -0.6998614 0.3244909 0.6782273 -0.6715210 -0.6758541
#> [19] -0.5738776 -0.7008174 -0.5409373 -0.5289045 -0.5812383 -0.6514118
#> [25] -0.6726977 -0.2904185 -0.5783098 -0.6327992 -0.6770838 -0.4487168
#> [31] -0.6444862 -0.1045107 0.6871138 0.1003890 -0.4337305 -0.5566492
#> [37] 0.3533240 -0.3596061 0.6574704 -0.3224089 0.2424746 0.2679882
#> [43] 0.1508951 0.3123478