This vignette demonstrates the use of SignIT to decipher mutation signatures in individual tumours, explore signature bleed interactions, and perform temporal dissection of mutation signatures. SignIT is available for download and installation from https://www.github.com/eyzhao/SignIT. See installation instructions in the README.
Cancer is a disease characterized by the formation of tumours composed of cells which divide aggressively due to the loss of specialized functions and growth inhibition. Cancers arise due to the stepwise accumulation of spontaneous genetic mutations specific to the tumour, also known as somatic mutations. These mutations arise from many causes, including chemical mutagens, intracellular mutational processes, and deficient DNA repair genes.
While some mutations may contribute directly to tumour invasiveness, the majority are thought to be “passengers,” with minimal impact on the evolutionary fitness of cancer cells. Nevertheless, it is possible to glean insights about a cancer’s molecular history from these passengers. They provide evidence of the mutational processes which drove the acquisition of mutations in cancer.
Recent research has compiled 30 characteristic signatures of mutation deciphered from cancer whole genome and exome sequences. These are available from http://cancer.sanger.ac.uk/cosmic/signatures.
A mutation catalog is a Data Frame or Tibble containing two columns named mutation_type and count. The mutation_type column is a vector of unique strings each denoting a mutation type. It is possible to define custom mutation types, so long as they match the mutation types defined in a reference mutation signature table.
An example mutation catalog is provided in the package data.
data(example_catalog)
pander(head(example_catalog))
mutation_type | count |
---|---|
A[C>A]A | 86 |
A[C>A]C | 97 |
A[C>A]G | 8 |
A[C>A]T | 54 |
C[C>A]A | 74 |
C[C>A]C | 69 |
SignIT also provides tools to determine SNV mutation catalogs from a table of variants. Here are the first ten rows of a set of example variants. You can load this data from SignIT to see the full set.
data(example_variants)
pander(head(example_variants, 10))
chr | pos | ref | alt | A | C | G | T | total_depth | alt_depth | cnv_start | cnv_end | tumour_copy | tumour_content | normal_copy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 137744 | G | C | 0 | 9 | 77 | 0 | 86 | 9 | 10022 | 582429 | 2 | 0.63 | 2 |
1 | 1479032 | T | A | 8 | 3 | 0 | 104 | 115 | 8 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 2038218 | C | T | 0 | 57 | 1 | 7 | 65 | 7 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 2615086 | C | T | 0 | 41 | 0 | 5 | 46 | 5 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 2615087 | T | C | 0 | 5 | 0 | 41 | 46 | 5 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 2629042 | C | A | 24 | 191 | 0 | 0 | 215 | 24 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 2945876 | G | A | 6 | 0 | 33 | 0 | 39 | 6 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 3097886 | A | G | 87 | 0 | 7 | 0 | 94 | 7 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 4033630 | C | T | 0 | 84 | 0 | 7 | 91 | 7 | 582430 | 8975823 | 1 | 0.63 | 2 |
1 | 4033631 | A | G | 81 | 0 | 10 | 0 | 91 | 10 | 582430 | 8975823 | 1 | 0.63 | 2 |
Use mutations_to_catalog
to convert to catalog form.
example_catalog_2 <- with(
example_variants,
mutations_to_catalog(chr, pos, ref, alt)
)
## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colMeans,
## colnames, colSums, do.call, duplicated, eval, evalq, Filter,
## Find, get, grep, grepl, intersect, is.unsorted, lapply,
## lengths, Map, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
## rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## The following object is masked from 'package:purrr':
##
## reduce
##
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
##
## compact
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
pander(head(example_catalog_2))
mutation_type | count |
---|---|
A[C>A]A | 180 |
A[C>A]C | 170 |
A[C>A]G | 27 |
A[C>A]T | 125 |
A[C>G]A | 122 |
A[C>G]C | 61 |
SignIT also provides example data of a variant table.
Once a mutation catalog is available, obtaining the contributions of each mutation signature (also known as the exposures) is straightforward. The following line of code initializes the Bayesian model which analyzes mutation signatures and stores the output into a list structure.
patient_signature_exposures <- get_exposures(
mutation_catalog = example_catalog,
n_cores = 1
)
## Sampling
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 9.77 seconds (Warm-up)
## 4.31 seconds (Sampling)
## 14.08 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 10.67 seconds (Warm-up)
## 4.71 seconds (Sampling)
## 15.38 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 10.13 seconds (Warm-up)
## 3.44 seconds (Sampling)
## 13.57 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 9.54 seconds (Warm-up)
## 3.63 seconds (Sampling)
## 13.17 seconds (Total)
A few functions are available to explore the solutions obtained. The easiest way is to plot the posterior distributions for each mutation signature, as follows.
patient_exposure_posterior_plot <- plot_exposure_posteriors(patient_signature_exposures)
print(patient_exposure_posterior_plot)
Note that all exposure plotting functions automatically shorten signature names by removing the word “Signature”. For comparison, we also provide a convenience function which computes a naive non-negative least squares (NNLS) best fit solution.
patient_exposure_nnls_plot <- plot_nnls_solution(patient_signature_exposures)
print(patient_exposure_nnls_plot)
Note how NNLS provides point estimates which mirror SignIT’s solutions, but that uncertainties vary greatly between signatures. Additionally, nnls manifests mutation bleed as inappropriate elevation of signatures 16, 22, and 25.
Signal bleed can often occur between signatures with similar mutational spectra. The result is that these signatures exhibit anticorrelated posterior sampling, suggesting that credible solutions can fit either signature. To visualize this anticorrelation, we can use the following command:
plot_two_signature_hexplot(patient_signature_exposures, 'Signature 3', 'Signature 8')
We can also simultaneously quantify the signature bleed across all signatures and plot it as a heatmap. Signatures with more negative Spearman Rho (dark blue) exhibit more signature bleed.
plot_signature_pairwise_bleed(patient_signature_exposures)
Lastly, SignIT provides a convenient visualization which overlays mutation signature bleed information on top of signature posteriors. The min_bleed
parameter takes on values from 0 to 1. The higher this threshold is set, the cleaner the graph will look, but the more information is lost. We typically recommend values around 0.2 to 0.3.
plot_exposure_posteriors_bleed(patient_signature_exposures, min_bleed = 0.25)
For a quantitative summary of signature exposures, use the following command
summary_table <- get_exposure_summary_table(patient_signature_exposures)
pander(summary_table)
signature | mode_exposure | mean_exposure | median_exposure | 0% | 100% | 2.5% | 25% | 75% | 97.5% |
---|---|---|---|---|---|---|---|---|---|
Signature 1 | 1155 | 1141 | 1146 | 928 | 1315 | 1009 | 1099 | 1184 | 1260 |
Signature 2 | 192 | 199 | 198 | 106 | 307 | 143 | 180 | 220 | 255 |
Signature 3 | 848 | 840 | 846 | 531 | 1143 | 604 | 773 | 911 | 1045 |
Signature 4 | 30 | 62.2 | 55.7 | 0.143 | 270 | 2.1 | 26.5 | 89.3 | 173 |
Signature 5 | 20.9 | 67.5 | 47 | 0.0907 | 349 | 2.6 | 19.9 | 94.5 | 242 |
Signature 6 | 10.2 | 34.2 | 26.7 | 0.0535 | 200 | 1.25 | 11.3 | 48.9 | 105 |
Signature 7 | 16.6 | 28.5 | 24.2 | 0.195 | 120 | 1.43 | 11.6 | 41 | 80.4 |
Signature 8 | 1242 | 1229 | 1226 | 922 | 1566 | 1014 | 1151 | 1301 | 1452 |
Signature 9 | 367 | 360 | 362 | 167 | 560 | 222 | 319 | 406 | 491 |
Signature 10 | 4.99 | 15.6 | 12.4 | 0.0311 | 79.3 | 0.523 | 5.33 | 22.2 | 50.3 |
Signature 11 | 15.7 | 32.2 | 27.7 | 0.236 | 127 | 1.35 | 14 | 45.8 | 87.9 |
Signature 12 | 7.45 | 26.3 | 19.2 | 0.0167 | 164 | 0.5 | 8.08 | 37.2 | 84.4 |
Signature 13 | 255 | 247 | 246 | 145 | 341 | 199 | 229 | 264 | 298 |
Signature 14 | 4.92 | 15.5 | 11.2 | 0.0317 | 97.8 | 0.456 | 4.9 | 22.3 | 52.8 |
Signature 15 | 5.68 | 20 | 15.8 | 0.0102 | 100 | 0.448 | 6.53 | 28.7 | 62.3 |
Signature 16 | 48.6 | 83.2 | 71.1 | 0.23 | 294 | 3.15 | 37.8 | 121 | 218 |
Signature 17 | 25 | 34.3 | 31 | 0.375 | 125 | 2.45 | 17.7 | 48.4 | 79.5 |
Signature 18 | 4.26 | 14.5 | 10 | 0.0271 | 80 | 0.282 | 4.24 | 20.7 | 53.4 |
Signature 19 | 5.1 | 17.3 | 13.1 | 0.0833 | 93 | 0.655 | 5.71 | 23.5 | 58.9 |
Signature 20 | 5.18 | 18.1 | 12.1 | 0.0249 | 121 | 0.358 | 4.97 | 25.3 | 67 |
Signature 21 | 6.5 | 16.7 | 12 | 0.00671 | 95.3 | 0.408 | 5.96 | 24.2 | 53.9 |
Signature 22 | 58.8 | 58.1 | 57.2 | 0.228 | 153 | 6.12 | 36.3 | 77.6 | 117 |
Signature 23 | 3.07 | 10.5 | 7.42 | 0.0473 | 71.9 | 0.335 | 3.08 | 14.8 | 38.1 |
Signature 24 | 9.4 | 25.5 | 20.6 | 0.0391 | 126 | 1.16 | 9.66 | 35.5 | 77.5 |
Signature 25 | 33.6 | 91.5 | 80.2 | 0.464 | 347 | 5.35 | 40 | 126 | 246 |
Signature 26 | 8.66 | 26.3 | 19.7 | 0.024 | 134 | 0.594 | 8.78 | 38.5 | 83.8 |
Signature 27 | 16.9 | 21.1 | 18.9 | 0.124 | 80.9 | 0.978 | 10.5 | 29.4 | 52 |
Signature 28 | 22.9 | 34 | 30.1 | 0.0891 | 138 | 1.76 | 16.5 | 48.6 | 85.4 |
Signature 29 | 7.05 | 25.3 | 19.9 | 0.0673 | 147 | 0.683 | 7.85 | 35.6 | 86.3 |
Signature 30 | 9.85 | 31.6 | 24.3 | 0.0235 | 157 | 0.772 | 10.2 | 45.5 | 109 |
Mutation types are not fixed. As long as you provide a reference mutation signature matrix which matches the mutation types of your mutation catalog, you can decipher mutation signatures. We will briefly demonstrate an example using structural variant signatures.
The structural variant reference signatures from Nik-Zainal et al. (2016) can be accessed as follows
sv_reference <- get_reference_signatures(signature_set = 'nikzainal_sv')
pander(head(sv_reference))
mutation_type | Rearrangement Signature 1 | Rearrangement Signature 2 | Rearrangement Signature 3 | Rearrangement Signature 4 | Rearrangement Signature 5 | Rearrangement Signature 6 |
---|---|---|---|---|---|---|
clustered|DEL|>10Mb | 0 | 0 | 0 | 0.01 | 0 | 0.07 |
clustered|DEL|1-10kb | 0 | 0 | 0 | 0.01 | 0 | 0.01 |
clustered|DEL|10-100kb | 0 | 0 | 0 | 0.01 | 0 | 0.01 |
clustered|DEL|100kb-1Mb | 0 | 0 | 0 | 0.02 | 0 | 0.03 |
clustered|DEL|1Mb-10Mb | 0 | 0 | 0 | 0.03 | 0 | 0.07 |
clustered|DUP|>10Mb | 0 | 0 | 0 | 0.01 | 0 | 0.07 |
We then use a mutation catalog with matching mutation types
data(example_sv_catalog)
pander(example_sv_catalog)
mutation_type | count |
---|---|
clustered|DEL|1-10kb | 0 |
clustered|DEL|10-100kb | 0 |
clustered|DEL|100kb-1Mb | 0 |
clustered|DEL|1Mb-10Mb | 3 |
clustered|DEL|>10Mb | 1 |
clustered|DUP|1-10kb | 0 |
clustered|DUP|10-100kb | 1 |
clustered|DUP|100kb-1Mb | 0 |
clustered|DUP|1Mb-10Mb | 2 |
clustered|DUP|>10Mb | 1 |
clustered|INV|1-10kb | 0 |
clustered|INV|10-100kb | 2 |
clustered|INV|100kb-1Mb | 1 |
clustered|INV|1Mb-10Mb | 2 |
clustered|INV|>10Mb | 3 |
clustered|TRA| | 8 |
non-clustered|DEL|1-10kb | 14 |
non-clustered|DEL|10-100kb | 14 |
non-clustered|DEL|100kb-1Mb | 1 |
non-clustered|DEL|1Mb-10Mb | 6 |
non-clustered|DEL|>10Mb | 8 |
non-clustered|DUP|1-10kb | 1 |
non-clustered|DUP|10-100kb | 4 |
non-clustered|DUP|100kb-1Mb | 9 |
non-clustered|DUP|1Mb-10Mb | 4 |
non-clustered|DUP|>10Mb | 1 |
non-clustered|INV|1-10kb | 1 |
non-clustered|INV|10-100kb | 4 |
non-clustered|INV|100kb-1Mb | 1 |
non-clustered|INV|1Mb-10Mb | 13 |
non-clustered|INV|>10Mb | 16 |
non-clustered|TRA| | 25 |
Next, we perform SignIT analysis as before.
sv_exposures <- get_exposures(example_sv_catalog, reference_signatures = sv_reference, n_cores = 1)
## Sampling
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 0.44 seconds (Warm-up)
## 0.33 seconds (Sampling)
## 0.77 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 0.54 seconds (Warm-up)
## 0.58 seconds (Sampling)
## 1.12 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
##
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 0.41 seconds (Warm-up)
## 0.33 seconds (Sampling)
## 0.74 seconds (Total)
##
##
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
##
## Gradient evaluation took 0.01 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 100 seconds.
## Adjust your expectations accordingly!
##
##
## Iteration: 1 / 400 [ 0%] (Warmup)
## Iteration: 40 / 400 [ 10%] (Warmup)
## Iteration: 80 / 400 [ 20%] (Warmup)
## Iteration: 120 / 400 [ 30%] (Warmup)
## Iteration: 160 / 400 [ 40%] (Warmup)
## Iteration: 200 / 400 [ 50%] (Warmup)
## Iteration: 201 / 400 [ 50%] (Sampling)
## Iteration: 240 / 400 [ 60%] (Sampling)
## Iteration: 280 / 400 [ 70%] (Sampling)
## Iteration: 320 / 400 [ 80%] (Sampling)
## Iteration: 360 / 400 [ 90%] (Sampling)
## Iteration: 400 / 400 [100%] (Sampling)
##
## Elapsed Time: 0.42 seconds (Warm-up)
## 0.32 seconds (Sampling)
## 0.74 seconds (Total)
And we can use the same plotting functions,
plot_exposure_posteriors_bleed(sv_exposures, signature_trim = 'Rearrangement Signature')
And the same summarization function
pander(get_exposure_summary_table(sv_exposures))
signature | mode_exposure | mean_exposure | median_exposure | 0% | 100% | 2.5% | 25% | 75% | 97.5% |
---|---|---|---|---|---|---|---|---|---|
Rearrangement Signature 1 | 12.3 | 13.8 | 13.6 | 4.64 | 29.1 | 6.8 | 11 | 16.3 | 22.5 |
Rearrangement Signature 2 | 60.7 | 63.1 | 63 | 39.9 | 88.4 | 46.8 | 57.6 | 68.3 | 77.4 |
Rearrangement Signature 3 | 0.69 | 2.34 | 1.92 | 0.0026 | 14.2 | 0.0945 | 0.853 | 3.26 | 7.12 |
Rearrangement Signature 4 | 12.7 | 14.2 | 13.8 | 3.9 | 34.8 | 6.35 | 10.9 | 17.2 | 23.5 |
Rearrangement Signature 5 | 35.2 | 35.4 | 35.1 | 15.6 | 54.1 | 23.6 | 31 | 39.7 | 48.4 |
Rearrangement Signature 6 | 16.5 | 17.2 | 16.9 | 4.42 | 33.3 | 7.63 | 13.4 | 20.7 | 28.6 |
We next explore how to perform temporal dissection of mutation signatures. The input data here is a data.frame or tibble of mutations, with some additional metadata. The following glimpse
shows the necessary formatting of the input. Feel free to load data(example_variants)
to see the complete input data for yourself.
glimpse(example_variants)
## Observations: 9,506
## Variables: 15
## $ chr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pos <dbl> 137744, 1479032, 2038218, 2615086, 2615087, 262...
## $ ref <chr> "G", "T", "C", "C", "T", "C", "G", "A", "C", "A...
## $ alt <chr> "C", "A", "T", "T", "C", "A", "A", "G", "T", "G...
## $ A <int> 0, 8, 0, 0, 0, 24, 6, 87, 0, 81, 0, 4, 0, 25, 0...
## $ C <int> 9, 3, 57, 41, 5, 191, 0, 0, 84, 0, 38, 15, 48, ...
## $ G <int> 77, 0, 1, 0, 0, 0, 33, 7, 0, 10, 0, 0, 34, 0, 2...
## $ T <int> 0, 104, 7, 5, 41, 0, 0, 0, 7, 0, 30, 0, 0, 27, ...
## $ total_depth <int> 86, 115, 65, 46, 46, 215, 39, 94, 91, 91, 68, 1...
## $ alt_depth <int> 9, 8, 7, 5, 5, 24, 6, 7, 7, 10, 38, 4, 48, 25, ...
## $ cnv_start <int> 10022, 582430, 582430, 582430, 582430, 582430, ...
## $ cnv_end <int> 582429, 8975823, 8975823, 8975823, 8975823, 897...
## $ tumour_copy <int> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ tumour_content <dbl> 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63,...
## $ normal_copy <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
Signature timing is performed using get_population_signatures
. Note that this process can take quite some time depending on your machine. We are using method = ‘vb’, which employs the variational Bayes solution and is much more computationally economical.
Note that if the parameter n_populations
is not provided, an automatic process of model selection is performed to attempt to estimate the optimal number of signatures.
A preferred approach (if computational resources are available) is to perform complete population signature inference for five models: 1-5 populations. Then, compute the WAIC of each model with compute_population_signatures_waic(population_signatures)
and choose the model which minimizes it. WAIC is a preferred model selection metric for Bayesian inference.
population_signatures <- get_population_signatures(
example_variants,
n_populations = 2,
method = 'vb'
)
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
## Extracting parameters from population model with 2 populations
## Subsetting reference signatures on request
## Sampling
## Running model with provided reference signatures: 13 signatures, 9506 mutations, and 2 clusters
## Sampling
## ------------------------------------------------------------
## EXPERIMENTAL ALGORITHM:
## This procedure has not been thoroughly tested and may be unstable
## or buggy. The interface is subject to change.
## ------------------------------------------------------------
##
##
##
## Gradient evaluation took 2.35 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 23500 seconds.
## Adjust your expectations accordingly!
##
##
## Begin eta adaptation.
## Iteration: 1 / 250 [ 0%] (Adaptation)
## Iteration: 50 / 250 [ 20%] (Adaptation)
## Iteration: 100 / 250 [ 40%] (Adaptation)
## Iteration: 150 / 250 [ 60%] (Adaptation)
## Iteration: 200 / 250 [ 80%] (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
##
## Begin stochastic gradient ascent.
## iter ELBO delta_ELBO_mean delta_ELBO_med notes
## 100 -8e+04 1.000 1.000
## 200 -8e+04 0.500 1.000
## 300 -8e+04 0.334 0.001 MEDIAN ELBO CONVERGED
##
## Drawing a sample of size 1000 from the approximate posterior...
## COMPLETED.
Now, let’s visualize the results.
plot_population_signatures(population_signatures)
## Warning in function_list[[k]](value): NAs introduced by coercion
## Warning in function_list[[k]](value): NAs introduced by coercion
This analysis is revealing two tumour subpopulations. The red higher-prevalence subpopulation is inferred to be “earlier”. The Proportion refers to the fraction of mutations contributed from each subpopulation.
We see here that Signature 3 exhibits early bias while signatures 16, 17, and 30 have late bias.
For quantitative analysis workflows, use the following to provide a summary of these results:
pander(summarise_population_signatures(population_signatures))
## Warning in function_list[[k]](value): NAs introduced by coercion
## Warning in function_list[[k]](value): NAs introduced by coercion
population | signature | mean | median | mode | sd | population_proportion | prevalence_mean |
---|---|---|---|---|---|---|---|
Population 1 | Signature 1 | 0.0633 | 0.0639 | 0.0653 | 0.0051 | 0.801 | 1 |
Population 1 | Signature 2 | 0.0281 | 0.0282 | 0.0282 | 0.00318 | 0.801 | 1 |
Population 1 | Signature 3 | 0.364 | 0.368 | 0.378 | 0.0164 | 0.801 | 1 |
Population 1 | Signature 4 | 0.0343 | 0.0332 | 0.0304 | 0.0079 | 0.801 | 1 |
Population 1 | Signature 5 | 0.0424 | 0.0397 | 0.0335 | 0.0141 | 0.801 | 1 |
Population 1 | Signature 8 | 0.266 | 0.268 | 0.27 | 0.0153 | 0.801 | 1 |
Population 1 | Signature 9 | 0.0703 | 0.0705 | 0.0715 | 0.00893 | 0.801 | 1 |
Population 1 | Signature 11 | 0.03 | 0.0298 | 0.0298 | 0.00491 | 0.801 | 1 |
Population 1 | Signature 13 | 0.0324 | 0.0322 | 0.0321 | 0.0047 | 0.801 | 1 |
Population 1 | Signature 16 | 0.0178 | 0.0163 | 0.013 | 0.00686 | 0.801 | 1 |
Population 1 | Signature 17 | 0.02 | 0.0192 | 0.0186 | 0.00515 | 0.801 | 1 |
Population 1 | Signature 24 | 0.017 | 0.0165 | 0.0164 | 0.00494 | 0.801 | 1 |
Population 1 | Signature 30 | 0.0149 | 0.0145 | 0.0125 | 0.00453 | 0.801 | 1 |
Population 2 | Signature 1 | 0.0314 | 0.0292 | 0.0251 | 0.00262 | 0.199 | 0.309 |
Population 2 | Signature 2 | 0.0124 | 0.0114 | 0.00867 | 0.00122 | 0.199 | 0.309 |
Population 2 | Signature 3 | 0.0268 | 0.0252 | 0.0221 | 0.00229 | 0.199 | 0.309 |
Population 2 | Signature 4 | 0.00538 | 0.00287 | 0.0014 | 0.00169 | 0.199 | 0.309 |
Population 2 | Signature 5 | 0.0422 | 0.039 | 0.0331 | 0.00403 | 0.199 | 0.309 |
Population 2 | Signature 8 | 0.237 | 0.243 | 0.252 | 0.00432 | 0.199 | 0.309 |
Population 2 | Signature 9 | 0.0902 | 0.0901 | 0.0922 | 0.00478 | 0.199 | 0.309 |
Population 2 | Signature 11 | 0.00791 | 0.00588 | 0.00343 | 0.00155 | 0.199 | 0.309 |
Population 2 | Signature 13 | 0.00919 | 0.00835 | 0.00706 | 0.000931 | 0.199 | 0.309 |
Population 2 | Signature 16 | 0.297 | 0.304 | 0.316 | 0.00563 | 0.199 | 0.309 |
Population 2 | Signature 17 | 0.171 | 0.175 | 0.174 | 0.00342 | 0.199 | 0.309 |
Population 2 | Signature 24 | 0.0029 | 0.00183 | 0.001 | 0.000782 | 0.199 | 0.309 |
Population 2 | Signature 30 | 0.0654 | 0.0643 | 0.0643 | 0.00388 | 0.199 | 0.309 |
The below output specifies the packages used in this analysis, and their version numbers.
print(sessionInfo())
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.7 (Final)
##
## Matrix products: default
## BLAS: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRblas.so
## LAPACK: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.2
## [3] rtracklayer_1.36.6 Biostrings_2.46.0
## [5] XVector_0.18.0 GenomicRanges_1.30.0
## [7] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [9] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [11] bindrcpp_0.2 pander_0.6.1
## [13] ggraph_1.0.1 hexbin_1.27.1
## [15] cowplot_0.9.2 rstan_2.17.2
## [17] StanHeaders_2.17.2 dplyr_0.7.4
## [19] purrr_0.2.4 readr_1.1.1
## [21] tidyr_0.8.0 tibble_1.3.4
## [23] ggplot2_2.2.1 tidyverse_1.1.1
## [25] nnls_1.4 signit_1.0.1
## [27] Rcpp_0.12.13
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-131 bitops_1.0-6
## [3] matrixStats_0.52.2 lubridate_1.6.0
## [5] RColorBrewer_1.1-2 doParallel_1.0.11
## [7] httr_1.3.1 rprojroot_1.3-1
## [9] tools_3.4.1 backports_1.1.2
## [11] R6_2.2.2 lazyeval_0.2.1
## [13] colorspace_1.3-2 tidyselect_0.2.3
## [15] gridExtra_2.3 mnormt_1.5-5
## [17] compiler_3.4.1 rvest_0.3.2
## [19] Biobase_2.38.0 xml2_1.1.1
## [21] DelayedArray_0.4.1 labeling_0.3
## [23] scales_0.5.0 psych_1.7.3.21
## [25] stringr_1.2.0 digest_0.6.13
## [27] Rsamtools_1.30.0 foreign_0.8-67
## [29] rmarkdown_1.8.5 pkgconfig_2.0.1
## [31] htmltools_0.3.6 readxl_1.0.0
## [33] rlang_0.1.4 bindr_0.1
## [35] jsonlite_1.5 BiocParallel_1.12.0
## [37] inline_0.3.14 RCurl_1.95-4.8
## [39] magrittr_1.5 GenomeInfoDbData_0.99.1
## [41] Matrix_1.2-7.1 munsell_0.4.3
## [43] viridis_0.5.0 stringi_1.1.6
## [45] yaml_2.1.16 MASS_7.3-47
## [47] SummarizedExperiment_1.8.0 zlibbioc_1.24.0
## [49] plyr_1.8.4 grid_3.4.1
## [51] ggrepel_0.7.0 forcats_0.2.0
## [53] udunits2_0.13 lattice_0.20-34
## [55] haven_1.0.0 hms_0.4.1
## [57] knitr_1.17 igraph_1.1.2
## [59] reshape2_1.4.3 codetools_0.2-15
## [61] rmutil_1.1.0 XML_3.98-1.6
## [63] glue_1.2.0 evaluate_0.10.1
## [65] modelr_0.1.0 tweenr_0.1.5
## [67] foreach_1.4.4 cellranger_1.1.0
## [69] gtable_0.2.0 assertthat_0.2.0
## [71] ggforce_0.1.1 broom_0.4.2
## [73] viridisLite_0.3.0 iterators_1.0.9
## [75] GenomicAlignments_1.12.2 units_0.4-4