SignIT: Mutation Signature Analysis in Individual Tumours with MCMC

This vignette demonstrates the use of SignIT to decipher mutation signatures in individual tumours, explore signature bleed interactions, and perform temporal dissection of mutation signatures. SignIT is available for download and installation from https://www.github.com/eyzhao/SignIT. See installation instructions in the README.

Background

Cancer is a disease characterized by the formation of tumours composed of cells which divide aggressively due to the loss of specialized functions and growth inhibition. Cancers arise due to the stepwise accumulation of spontaneous genetic mutations specific to the tumour, also known as somatic mutations. These mutations arise from many causes, including chemical mutagens, intracellular mutational processes, and deficient DNA repair genes.

While some mutations may contribute directly to tumour invasiveness, the majority are thought to be “passengers,” with minimal impact on the evolutionary fitness of cancer cells. Nevertheless, it is possible to glean insights about a cancer’s molecular history from these passengers. They provide evidence of the mutational processes which drove the acquisition of mutations in cancer.

Recent research has compiled 30 characteristic signatures of mutation deciphered from cancer whole genome and exome sequences. These are available from http://cancer.sanger.ac.uk/cosmic/signatures.

SignIT on a Patient’s Mutation Catalog

The Mutation Catalog

A mutation catalog is a Data Frame or Tibble containing two columns named mutation_type and count. The mutation_type column is a vector of unique strings each denoting a mutation type. It is possible to define custom mutation types, so long as they match the mutation types defined in a reference mutation signature table.

An example mutation catalog is provided in the package data.

data(example_catalog)
pander(head(example_catalog))
mutation_type count
A[C>A]A 86
A[C>A]C 97
A[C>A]G 8
A[C>A]T 54
C[C>A]A 74
C[C>A]C 69

SignIT also provides tools to determine SNV mutation catalogs from a table of variants. Here are the first ten rows of a set of example variants. You can load this data from SignIT to see the full set.

data(example_variants)
pander(head(example_variants, 10))
chr pos ref alt A C G T total_depth alt_depth cnv_start cnv_end tumour_copy tumour_content normal_copy
1 137744 G C 0 9 77 0 86 9 10022 582429 2 0.63 2
1 1479032 T A 8 3 0 104 115 8 582430 8975823 1 0.63 2
1 2038218 C T 0 57 1 7 65 7 582430 8975823 1 0.63 2
1 2615086 C T 0 41 0 5 46 5 582430 8975823 1 0.63 2
1 2615087 T C 0 5 0 41 46 5 582430 8975823 1 0.63 2
1 2629042 C A 24 191 0 0 215 24 582430 8975823 1 0.63 2
1 2945876 G A 6 0 33 0 39 6 582430 8975823 1 0.63 2
1 3097886 A G 87 0 7 0 94 7 582430 8975823 1 0.63 2
1 4033630 C T 0 84 0 7 91 7 582430 8975823 1 0.63 2
1 4033631 A G 81 0 10 0 91 10 582430 8975823 1 0.63 2

Use mutations_to_catalog to convert to catalog form.

example_catalog_2 <- with(
  example_variants,
  mutations_to_catalog(chr, pos, ref, alt)
)
## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:tidyr':
## 
##     expand
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:purrr':
## 
##     reduce
## 
## Attaching package: 'XVector'
## The following object is masked from 'package:purrr':
## 
##     compact
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
pander(head(example_catalog_2))
mutation_type count
A[C>A]A 180
A[C>A]C 170
A[C>A]G 27
A[C>A]T 125
A[C>G]A 122
A[C>G]C 61

SignIT also provides example data of a variant table.

Running MCMC Analysis and Exploring Posterior Distributions

Once a mutation catalog is available, obtaining the contributions of each mutation signature (also known as the exposures) is straightforward. The following line of code initializes the Bayesian model which analyzes mutation signatures and stores the output into a list structure.

patient_signature_exposures <- get_exposures(
  mutation_catalog = example_catalog,
  n_cores = 1
)
## Sampling
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 9.77 seconds (Warm-up)
##                4.31 seconds (Sampling)
##                14.08 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 10.67 seconds (Warm-up)
##                4.71 seconds (Sampling)
##                15.38 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 10.13 seconds (Warm-up)
##                3.44 seconds (Sampling)
##                13.57 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 9.54 seconds (Warm-up)
##                3.63 seconds (Sampling)
##                13.17 seconds (Total)

A few functions are available to explore the solutions obtained. The easiest way is to plot the posterior distributions for each mutation signature, as follows.

patient_exposure_posterior_plot <- plot_exposure_posteriors(patient_signature_exposures)
print(patient_exposure_posterior_plot)

Note that all exposure plotting functions automatically shorten signature names by removing the word “Signature”. For comparison, we also provide a convenience function which computes a naive non-negative least squares (NNLS) best fit solution.

patient_exposure_nnls_plot <- plot_nnls_solution(patient_signature_exposures)
print(patient_exposure_nnls_plot)

Note how NNLS provides point estimates which mirror SignIT’s solutions, but that uncertainties vary greatly between signatures. Additionally, nnls manifests mutation bleed as inappropriate elevation of signatures 16, 22, and 25.

Visualizing Mutation Signature Bleed

Signal bleed can often occur between signatures with similar mutational spectra. The result is that these signatures exhibit anticorrelated posterior sampling, suggesting that credible solutions can fit either signature. To visualize this anticorrelation, we can use the following command:

plot_two_signature_hexplot(patient_signature_exposures, 'Signature 3', 'Signature 8')

We can also simultaneously quantify the signature bleed across all signatures and plot it as a heatmap. Signatures with more negative Spearman Rho (dark blue) exhibit more signature bleed.

plot_signature_pairwise_bleed(patient_signature_exposures)

Lastly, SignIT provides a convenient visualization which overlays mutation signature bleed information on top of signature posteriors. The min_bleed parameter takes on values from 0 to 1. The higher this threshold is set, the cleaner the graph will look, but the more information is lost. We typically recommend values around 0.2 to 0.3.

plot_exposure_posteriors_bleed(patient_signature_exposures, min_bleed = 0.25)

For a quantitative summary of signature exposures, use the following command

summary_table <- get_exposure_summary_table(patient_signature_exposures)
pander(summary_table)
signature mode_exposure mean_exposure median_exposure 0% 100% 2.5% 25% 75% 97.5%
Signature 1 1155 1141 1146 928 1315 1009 1099 1184 1260
Signature 2 192 199 198 106 307 143 180 220 255
Signature 3 848 840 846 531 1143 604 773 911 1045
Signature 4 30 62.2 55.7 0.143 270 2.1 26.5 89.3 173
Signature 5 20.9 67.5 47 0.0907 349 2.6 19.9 94.5 242
Signature 6 10.2 34.2 26.7 0.0535 200 1.25 11.3 48.9 105
Signature 7 16.6 28.5 24.2 0.195 120 1.43 11.6 41 80.4
Signature 8 1242 1229 1226 922 1566 1014 1151 1301 1452
Signature 9 367 360 362 167 560 222 319 406 491
Signature 10 4.99 15.6 12.4 0.0311 79.3 0.523 5.33 22.2 50.3
Signature 11 15.7 32.2 27.7 0.236 127 1.35 14 45.8 87.9
Signature 12 7.45 26.3 19.2 0.0167 164 0.5 8.08 37.2 84.4
Signature 13 255 247 246 145 341 199 229 264 298
Signature 14 4.92 15.5 11.2 0.0317 97.8 0.456 4.9 22.3 52.8
Signature 15 5.68 20 15.8 0.0102 100 0.448 6.53 28.7 62.3
Signature 16 48.6 83.2 71.1 0.23 294 3.15 37.8 121 218
Signature 17 25 34.3 31 0.375 125 2.45 17.7 48.4 79.5
Signature 18 4.26 14.5 10 0.0271 80 0.282 4.24 20.7 53.4
Signature 19 5.1 17.3 13.1 0.0833 93 0.655 5.71 23.5 58.9
Signature 20 5.18 18.1 12.1 0.0249 121 0.358 4.97 25.3 67
Signature 21 6.5 16.7 12 0.00671 95.3 0.408 5.96 24.2 53.9
Signature 22 58.8 58.1 57.2 0.228 153 6.12 36.3 77.6 117
Signature 23 3.07 10.5 7.42 0.0473 71.9 0.335 3.08 14.8 38.1
Signature 24 9.4 25.5 20.6 0.0391 126 1.16 9.66 35.5 77.5
Signature 25 33.6 91.5 80.2 0.464 347 5.35 40 126 246
Signature 26 8.66 26.3 19.7 0.024 134 0.594 8.78 38.5 83.8
Signature 27 16.9 21.1 18.9 0.124 80.9 0.978 10.5 29.4 52
Signature 28 22.9 34 30.1 0.0891 138 1.76 16.5 48.6 85.4
Signature 29 7.05 25.3 19.9 0.0673 147 0.683 7.85 35.6 86.3
Signature 30 9.85 31.6 24.3 0.0235 157 0.772 10.2 45.5 109

Structural Variant Signatures

Mutation types are not fixed. As long as you provide a reference mutation signature matrix which matches the mutation types of your mutation catalog, you can decipher mutation signatures. We will briefly demonstrate an example using structural variant signatures.

The structural variant reference signatures from Nik-Zainal et al. (2016) can be accessed as follows

sv_reference <- get_reference_signatures(signature_set = 'nikzainal_sv')
pander(head(sv_reference))
mutation_type Rearrangement Signature 1 Rearrangement Signature 2 Rearrangement Signature 3 Rearrangement Signature 4 Rearrangement Signature 5 Rearrangement Signature 6
clustered|DEL|>10Mb 0 0 0 0.01 0 0.07
clustered|DEL|1-10kb 0 0 0 0.01 0 0.01
clustered|DEL|10-100kb 0 0 0 0.01 0 0.01
clustered|DEL|100kb-1Mb 0 0 0 0.02 0 0.03
clustered|DEL|1Mb-10Mb 0 0 0 0.03 0 0.07
clustered|DUP|>10Mb 0 0 0 0.01 0 0.07

We then use a mutation catalog with matching mutation types

data(example_sv_catalog)
pander(example_sv_catalog)
mutation_type count
clustered|DEL|1-10kb 0
clustered|DEL|10-100kb 0
clustered|DEL|100kb-1Mb 0
clustered|DEL|1Mb-10Mb 3
clustered|DEL|>10Mb 1
clustered|DUP|1-10kb 0
clustered|DUP|10-100kb 1
clustered|DUP|100kb-1Mb 0
clustered|DUP|1Mb-10Mb 2
clustered|DUP|>10Mb 1
clustered|INV|1-10kb 0
clustered|INV|10-100kb 2
clustered|INV|100kb-1Mb 1
clustered|INV|1Mb-10Mb 2
clustered|INV|>10Mb 3
clustered|TRA| 8
non-clustered|DEL|1-10kb 14
non-clustered|DEL|10-100kb 14
non-clustered|DEL|100kb-1Mb 1
non-clustered|DEL|1Mb-10Mb 6
non-clustered|DEL|>10Mb 8
non-clustered|DUP|1-10kb 1
non-clustered|DUP|10-100kb 4
non-clustered|DUP|100kb-1Mb 9
non-clustered|DUP|1Mb-10Mb 4
non-clustered|DUP|>10Mb 1
non-clustered|INV|1-10kb 1
non-clustered|INV|10-100kb 4
non-clustered|INV|100kb-1Mb 1
non-clustered|INV|1Mb-10Mb 13
non-clustered|INV|>10Mb 16
non-clustered|TRA| 25

Next, we perform SignIT analysis as before.

sv_exposures <- get_exposures(example_sv_catalog, reference_signatures = sv_reference, n_cores = 1)
## Sampling
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.44 seconds (Warm-up)
##                0.33 seconds (Sampling)
##                0.77 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.54 seconds (Warm-up)
##                0.58 seconds (Sampling)
##                1.12 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.41 seconds (Warm-up)
##                0.33 seconds (Sampling)
##                0.74 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
## 
## Gradient evaluation took 0.01 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 100 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.42 seconds (Warm-up)
##                0.32 seconds (Sampling)
##                0.74 seconds (Total)

And we can use the same plotting functions,

plot_exposure_posteriors_bleed(sv_exposures, signature_trim = 'Rearrangement Signature')

And the same summarization function

pander(get_exposure_summary_table(sv_exposures))
signature mode_exposure mean_exposure median_exposure 0% 100% 2.5% 25% 75% 97.5%
Rearrangement Signature 1 12.3 13.8 13.6 4.64 29.1 6.8 11 16.3 22.5
Rearrangement Signature 2 60.7 63.1 63 39.9 88.4 46.8 57.6 68.3 77.4
Rearrangement Signature 3 0.69 2.34 1.92 0.0026 14.2 0.0945 0.853 3.26 7.12
Rearrangement Signature 4 12.7 14.2 13.8 3.9 34.8 6.35 10.9 17.2 23.5
Rearrangement Signature 5 35.2 35.4 35.1 15.6 54.1 23.6 31 39.7 48.4
Rearrangement Signature 6 16.5 17.2 16.9 4.42 33.3 7.63 13.4 20.7 28.6

Temporally Dissecting Mutation Signatures

We next explore how to perform temporal dissection of mutation signatures. The input data here is a data.frame or tibble of mutations, with some additional metadata. The following glimpse shows the necessary formatting of the input. Feel free to load data(example_variants) to see the complete input data for yourself.

glimpse(example_variants)
## Observations: 9,506
## Variables: 15
## $ chr            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pos            <dbl> 137744, 1479032, 2038218, 2615086, 2615087, 262...
## $ ref            <chr> "G", "T", "C", "C", "T", "C", "G", "A", "C", "A...
## $ alt            <chr> "C", "A", "T", "T", "C", "A", "A", "G", "T", "G...
## $ A              <int> 0, 8, 0, 0, 0, 24, 6, 87, 0, 81, 0, 4, 0, 25, 0...
## $ C              <int> 9, 3, 57, 41, 5, 191, 0, 0, 84, 0, 38, 15, 48, ...
## $ G              <int> 77, 0, 1, 0, 0, 0, 33, 7, 0, 10, 0, 0, 34, 0, 2...
## $ T              <int> 0, 104, 7, 5, 41, 0, 0, 0, 7, 0, 30, 0, 0, 27, ...
## $ total_depth    <int> 86, 115, 65, 46, 46, 215, 39, 94, 91, 91, 68, 1...
## $ alt_depth      <int> 9, 8, 7, 5, 5, 24, 6, 7, 7, 10, 38, 4, 48, 25, ...
## $ cnv_start      <int> 10022, 582430, 582430, 582430, 582430, 582430, ...
## $ cnv_end        <int> 582429, 8975823, 8975823, 8975823, 8975823, 897...
## $ tumour_copy    <int> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ tumour_content <dbl> 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63,...
## $ normal_copy    <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...

Signature timing is performed using get_population_signatures. Note that this process can take quite some time depending on your machine. We are using method = ‘vb’, which employs the variational Bayes solution and is much more computationally economical.

Note that if the parameter n_populations is not provided, an automatic process of model selection is performed to attempt to estimate the optimal number of signatures.

A preferred approach (if computational resources are available) is to perform complete population signature inference for five models: 1-5 populations. Then, compute the WAIC of each model with compute_population_signatures_waic(population_signatures) and choose the model which minimizes it. WAIC is a preferred model selection metric for Bayesian inference.

population_signatures <- get_population_signatures(
  example_variants,
  n_populations = 2,
  method = 'vb'
)
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19
## Chromosome levels match after prepending "chr". Proceeding...
## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels
## Extracting parameters from population model with 2 populations
## Subsetting reference signatures on request
## Sampling
## Running model with provided reference signatures: 13 signatures, 9506 mutations, and 2 clusters
## Sampling
## ------------------------------------------------------------
## EXPERIMENTAL ALGORITHM:
##   This procedure has not been thoroughly tested and may be unstable
##   or buggy. The interface is subject to change.
## ------------------------------------------------------------
## 
## 
## 
## Gradient evaluation took 2.35 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 23500 seconds.
## Adjust your expectations accordingly!
## 
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration:  50 / 250 [ 20%]  (Adaptation)
## Iteration: 100 / 250 [ 40%]  (Adaptation)
## Iteration: 150 / 250 [ 60%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -8e+04             1.000            1.000
##    200     -8e+04             0.500            1.000
##    300     -8e+04             0.334            0.001   MEDIAN ELBO CONVERGED
## 
## Drawing a sample of size 1000 from the approximate posterior... 
## COMPLETED.

Now, let’s visualize the results.

plot_population_signatures(population_signatures)
## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

This analysis is revealing two tumour subpopulations. The red higher-prevalence subpopulation is inferred to be “earlier”. The Proportion refers to the fraction of mutations contributed from each subpopulation.

We see here that Signature 3 exhibits early bias while signatures 16, 17, and 30 have late bias.

For quantitative analysis workflows, use the following to provide a summary of these results:

pander(summarise_population_signatures(population_signatures))
## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion
population signature mean median mode sd population_proportion prevalence_mean
Population 1 Signature 1 0.0633 0.0639 0.0653 0.0051 0.801 1
Population 1 Signature 2 0.0281 0.0282 0.0282 0.00318 0.801 1
Population 1 Signature 3 0.364 0.368 0.378 0.0164 0.801 1
Population 1 Signature 4 0.0343 0.0332 0.0304 0.0079 0.801 1
Population 1 Signature 5 0.0424 0.0397 0.0335 0.0141 0.801 1
Population 1 Signature 8 0.266 0.268 0.27 0.0153 0.801 1
Population 1 Signature 9 0.0703 0.0705 0.0715 0.00893 0.801 1
Population 1 Signature 11 0.03 0.0298 0.0298 0.00491 0.801 1
Population 1 Signature 13 0.0324 0.0322 0.0321 0.0047 0.801 1
Population 1 Signature 16 0.0178 0.0163 0.013 0.00686 0.801 1
Population 1 Signature 17 0.02 0.0192 0.0186 0.00515 0.801 1
Population 1 Signature 24 0.017 0.0165 0.0164 0.00494 0.801 1
Population 1 Signature 30 0.0149 0.0145 0.0125 0.00453 0.801 1
Population 2 Signature 1 0.0314 0.0292 0.0251 0.00262 0.199 0.309
Population 2 Signature 2 0.0124 0.0114 0.00867 0.00122 0.199 0.309
Population 2 Signature 3 0.0268 0.0252 0.0221 0.00229 0.199 0.309
Population 2 Signature 4 0.00538 0.00287 0.0014 0.00169 0.199 0.309
Population 2 Signature 5 0.0422 0.039 0.0331 0.00403 0.199 0.309
Population 2 Signature 8 0.237 0.243 0.252 0.00432 0.199 0.309
Population 2 Signature 9 0.0902 0.0901 0.0922 0.00478 0.199 0.309
Population 2 Signature 11 0.00791 0.00588 0.00343 0.00155 0.199 0.309
Population 2 Signature 13 0.00919 0.00835 0.00706 0.000931 0.199 0.309
Population 2 Signature 16 0.297 0.304 0.316 0.00563 0.199 0.309
Population 2 Signature 17 0.171 0.175 0.174 0.00342 0.199 0.309
Population 2 Signature 24 0.0029 0.00183 0.001 0.000782 0.199 0.309
Population 2 Signature 30 0.0654 0.0643 0.0643 0.00388 0.199 0.309

Session Info

The below output specifies the packages used in this analysis, and their version numbers.

print(sessionInfo())
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.7 (Final)
## 
## Matrix products: default
## BLAS: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRblas.so
## LAPACK: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.2                  
##  [3] rtracklayer_1.36.6                Biostrings_2.46.0                
##  [5] XVector_0.18.0                    GenomicRanges_1.30.0             
##  [7] GenomeInfoDb_1.14.0               IRanges_2.12.0                   
##  [9] S4Vectors_0.16.0                  BiocGenerics_0.24.0              
## [11] bindrcpp_0.2                      pander_0.6.1                     
## [13] ggraph_1.0.1                      hexbin_1.27.1                    
## [15] cowplot_0.9.2                     rstan_2.17.2                     
## [17] StanHeaders_2.17.2                dplyr_0.7.4                      
## [19] purrr_0.2.4                       readr_1.1.1                      
## [21] tidyr_0.8.0                       tibble_1.3.4                     
## [23] ggplot2_2.2.1                     tidyverse_1.1.1                  
## [25] nnls_1.4                          signit_1.0.1                     
## [27] Rcpp_0.12.13                     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131               bitops_1.0-6              
##  [3] matrixStats_0.52.2         lubridate_1.6.0           
##  [5] RColorBrewer_1.1-2         doParallel_1.0.11         
##  [7] httr_1.3.1                 rprojroot_1.3-1           
##  [9] tools_3.4.1                backports_1.1.2           
## [11] R6_2.2.2                   lazyeval_0.2.1            
## [13] colorspace_1.3-2           tidyselect_0.2.3          
## [15] gridExtra_2.3              mnormt_1.5-5              
## [17] compiler_3.4.1             rvest_0.3.2               
## [19] Biobase_2.38.0             xml2_1.1.1                
## [21] DelayedArray_0.4.1         labeling_0.3              
## [23] scales_0.5.0               psych_1.7.3.21            
## [25] stringr_1.2.0              digest_0.6.13             
## [27] Rsamtools_1.30.0           foreign_0.8-67            
## [29] rmarkdown_1.8.5            pkgconfig_2.0.1           
## [31] htmltools_0.3.6            readxl_1.0.0              
## [33] rlang_0.1.4                bindr_0.1                 
## [35] jsonlite_1.5               BiocParallel_1.12.0       
## [37] inline_0.3.14              RCurl_1.95-4.8            
## [39] magrittr_1.5               GenomeInfoDbData_0.99.1   
## [41] Matrix_1.2-7.1             munsell_0.4.3             
## [43] viridis_0.5.0              stringi_1.1.6             
## [45] yaml_2.1.16                MASS_7.3-47               
## [47] SummarizedExperiment_1.8.0 zlibbioc_1.24.0           
## [49] plyr_1.8.4                 grid_3.4.1                
## [51] ggrepel_0.7.0              forcats_0.2.0             
## [53] udunits2_0.13              lattice_0.20-34           
## [55] haven_1.0.0                hms_0.4.1                 
## [57] knitr_1.17                 igraph_1.1.2              
## [59] reshape2_1.4.3             codetools_0.2-15          
## [61] rmutil_1.1.0               XML_3.98-1.6              
## [63] glue_1.2.0                 evaluate_0.10.1           
## [65] modelr_0.1.0               tweenr_0.1.5              
## [67] foreach_1.4.4              cellranger_1.1.0          
## [69] gtable_0.2.0               assertthat_0.2.0          
## [71] ggforce_0.1.1              broom_0.4.2               
## [73] viridisLite_0.3.0          iterators_1.0.9           
## [75] GenomicAlignments_1.12.2   units_0.4-4