SignIT: Mutation Signature Analysis in Individual Tumours with MCMC

This vignette demonstrates the use of SignIT to decipher mutation signatures in individual tumours, explore signature bleed interactions, and perform temporal dissection of mutation signatures. SignIT is available for download and installation from https://www.github.com/eyzhao/SignIT. See installation instructions in the README.

Background

Cancer is a disease characterized by the formation of tumours composed of cells which divide aggressively due to the loss of specialized functions and growth inhibition. Cancers arise due to the stepwise accumulation of spontaneous genetic mutations specific to the tumour, also known as somatic mutations. These mutations arise from many causes, including chemical mutagens, intracellular mutational processes, and deficient DNA repair genes.

While some mutations may contribute directly to tumour invasiveness, the majority are thought to be “passengers,” with minimal impact on the evolutionary fitness of cancer cells. Nevertheless, it is possible to glean insights about a cancer’s molecular history from these passengers. They provide evidence of the mutational processes which drove the acquisition of mutations in cancer.

Recent research has compiled 30 characteristic signatures of mutation deciphered from cancer whole genome and exome sequences. These are available from http://cancer.sanger.ac.uk/cosmic/signatures.

SignIT on a Patient’s Mutation Catalog

The Mutation Catalog

A mutation catalog is a Data Frame or Tibble containing two columns named mutation_type and count. The mutation_type column is a vector of unique strings each denoting a mutation type. It is possible to define custom mutation types, so long as they match the mutation types defined in a reference mutation signature table.

An example mutation catalog is provided in the package data.

data(example_catalog)
pander(head(example_catalog))

mutation_type	count
A[C>A]A	86
A[C>A]C	97
A[C>A]G	8
A[C>A]T	54
C[C>A]A	74
C[C>A]C	69

SignIT also provides tools to determine SNV mutation catalogs from a table of variants. Here are the first ten rows of a set of example variants. You can load this data from SignIT to see the full set.

data(example_variants)
pander(head(example_variants, 10))

chr	pos	ref	alt	A	C	G	T	total_depth	alt_depth	cnv_start	cnv_end	tumour_copy	tumour_content	normal_copy
1	137744	G	C	0	9	77	0	86	9	10022	582429	2	0.63	2
1	1479032	T	A	8	3	0	104	115	8	582430	8975823	1	0.63	2
1	2038218	C	T	0	57	1	7	65	7	582430	8975823	1	0.63	2
1	2615086	C	T	0	41	0	5	46	5	582430	8975823	1	0.63	2
1	2615087	T	C	0	5	0	41	46	5	582430	8975823	1	0.63	2
1	2629042	C	A	24	191	0	0	215	24	582430	8975823	1	0.63	2
1	2945876	G	A	6	0	33	0	39	6	582430	8975823	1	0.63	2
1	3097886	A	G	87	0	7	0	94	7	582430	8975823	1	0.63	2
1	4033630	C	T	0	84	0	7	91	7	582430	8975823	1	0.63	2
1	4033631	A	G	81	0	10	0	91	10	582430	8975823	1	0.63	2

Use mutations_to_catalog to convert to catalog form.

example_catalog_2 <- with(
  example_variants,
  mutations_to_catalog(chr, pos, ref, alt)
)

## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colMeans,
##     colnames, colSums, do.call, duplicated, eval, evalq, Filter,
##     Find, get, grep, grepl, intersect, is.unsorted, lapply,
##     lengths, Map, mapply, match, mget, order, paste, pmax,
##     pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce,
##     rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:dplyr':
## 
##     first, rename

## The following object is masked from 'package:tidyr':
## 
##     expand

## The following object is masked from 'package:base':
## 
##     expand.grid

## 
## Attaching package: 'IRanges'

## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice

## The following object is masked from 'package:purrr':
## 
##     reduce

## 
## Attaching package: 'XVector'

## The following object is masked from 'package:purrr':
## 
##     compact

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:base':
## 
##     strsplit

## Chromosome levels match after prepending "chr". Proceeding...

## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels

pander(head(example_catalog_2))

mutation_type	count
A[C>A]A	180
A[C>A]C	170
A[C>A]G	27
A[C>A]T	125
A[C>G]A	122
A[C>G]C	61

SignIT also provides example data of a variant table.

Running MCMC Analysis and Exploring Posterior Distributions

Once a mutation catalog is available, obtaining the contributions of each mutation signature (also known as the exposures) is straightforward. The following line of code initializes the Bayesian model which analyzes mutation signatures and stores the output into a list structure.

patient_signature_exposures <- get_exposures(
  mutation_catalog = example_catalog,
  n_cores = 1
)

## Sampling

## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 9.77 seconds (Warm-up)
##                4.31 seconds (Sampling)
##                14.08 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 10.67 seconds (Warm-up)
##                4.71 seconds (Sampling)
##                15.38 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 10.13 seconds (Warm-up)
##                3.44 seconds (Sampling)
##                13.57 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 9.54 seconds (Warm-up)
##                3.63 seconds (Sampling)
##                13.17 seconds (Total)

A few functions are available to explore the solutions obtained. The easiest way is to plot the posterior distributions for each mutation signature, as follows.

patient_exposure_posterior_plot <- plot_exposure_posteriors(patient_signature_exposures)
print(patient_exposure_posterior_plot)

Note that all exposure plotting functions automatically shorten signature names by removing the word “Signature”. For comparison, we also provide a convenience function which computes a naive non-negative least squares (NNLS) best fit solution.

patient_exposure_nnls_plot <- plot_nnls_solution(patient_signature_exposures)
print(patient_exposure_nnls_plot)

Note how NNLS provides point estimates which mirror SignIT’s solutions, but that uncertainties vary greatly between signatures. Additionally, nnls manifests mutation bleed as inappropriate elevation of signatures 16, 22, and 25.

Visualizing Mutation Signature Bleed

Signal bleed can often occur between signatures with similar mutational spectra. The result is that these signatures exhibit anticorrelated posterior sampling, suggesting that credible solutions can fit either signature. To visualize this anticorrelation, we can use the following command:

plot_two_signature_hexplot(patient_signature_exposures, 'Signature 3', 'Signature 8')

We can also simultaneously quantify the signature bleed across all signatures and plot it as a heatmap. Signatures with more negative Spearman Rho (dark blue) exhibit more signature bleed.

plot_signature_pairwise_bleed(patient_signature_exposures)

Lastly, SignIT provides a convenient visualization which overlays mutation signature bleed information on top of signature posteriors. The min_bleed parameter takes on values from 0 to 1. The higher this threshold is set, the cleaner the graph will look, but the more information is lost. We typically recommend values around 0.2 to 0.3.

plot_exposure_posteriors_bleed(patient_signature_exposures, min_bleed = 0.25)

For a quantitative summary of signature exposures, use the following command

summary_table <- get_exposure_summary_table(patient_signature_exposures)
pander(summary_table)

signature	mode_exposure	mean_exposure	median_exposure	0%	100%	2.5%	25%	75%	97.5%
Signature 1	1155	1141	1146	928	1315	1009	1099	1184	1260
Signature 2	192	199	198	106	307	143	180	220	255
Signature 3	848	840	846	531	1143	604	773	911	1045
Signature 4	30	62.2	55.7	0.143	270	2.1	26.5	89.3	173
Signature 5	20.9	67.5	47	0.0907	349	2.6	19.9	94.5	242
Signature 6	10.2	34.2	26.7	0.0535	200	1.25	11.3	48.9	105
Signature 7	16.6	28.5	24.2	0.195	120	1.43	11.6	41	80.4
Signature 8	1242	1229	1226	922	1566	1014	1151	1301	1452
Signature 9	367	360	362	167	560	222	319	406	491
Signature 10	4.99	15.6	12.4	0.0311	79.3	0.523	5.33	22.2	50.3
Signature 11	15.7	32.2	27.7	0.236	127	1.35	14	45.8	87.9
Signature 12	7.45	26.3	19.2	0.0167	164	0.5	8.08	37.2	84.4
Signature 13	255	247	246	145	341	199	229	264	298
Signature 14	4.92	15.5	11.2	0.0317	97.8	0.456	4.9	22.3	52.8
Signature 15	5.68	20	15.8	0.0102	100	0.448	6.53	28.7	62.3
Signature 16	48.6	83.2	71.1	0.23	294	3.15	37.8	121	218
Signature 17	25	34.3	31	0.375	125	2.45	17.7	48.4	79.5
Signature 18	4.26	14.5	10	0.0271	80	0.282	4.24	20.7	53.4
Signature 19	5.1	17.3	13.1	0.0833	93	0.655	5.71	23.5	58.9
Signature 20	5.18	18.1	12.1	0.0249	121	0.358	4.97	25.3	67
Signature 21	6.5	16.7	12	0.00671	95.3	0.408	5.96	24.2	53.9
Signature 22	58.8	58.1	57.2	0.228	153	6.12	36.3	77.6	117
Signature 23	3.07	10.5	7.42	0.0473	71.9	0.335	3.08	14.8	38.1
Signature 24	9.4	25.5	20.6	0.0391	126	1.16	9.66	35.5	77.5
Signature 25	33.6	91.5	80.2	0.464	347	5.35	40	126	246
Signature 26	8.66	26.3	19.7	0.024	134	0.594	8.78	38.5	83.8
Signature 27	16.9	21.1	18.9	0.124	80.9	0.978	10.5	29.4	52
Signature 28	22.9	34	30.1	0.0891	138	1.76	16.5	48.6	85.4
Signature 29	7.05	25.3	19.9	0.0673	147	0.683	7.85	35.6	86.3
Signature 30	9.85	31.6	24.3	0.0235	157	0.772	10.2	45.5	109

Structural Variant Signatures

Mutation types are not fixed. As long as you provide a reference mutation signature matrix which matches the mutation types of your mutation catalog, you can decipher mutation signatures. We will briefly demonstrate an example using structural variant signatures.

The structural variant reference signatures from Nik-Zainal et al. (2016) can be accessed as follows

sv_reference <- get_reference_signatures(signature_set = 'nikzainal_sv')
pander(head(sv_reference))

mutation_type	Rearrangement Signature 4	Rearrangement Signature 6
clustered\|DEL\|>10Mb	0.01	0.07
clustered\|DEL\|1-10kb	0.01	0.01
clustered\|DEL\|10-100kb	0.01	0.01
clustered\|DEL\|100kb-1Mb	0.02	0.03
clustered\|DEL\|1Mb-10Mb	0.03	0.07
clustered\|DUP\|>10Mb	0.01	0.07

We then use a mutation catalog with matching mutation types

data(example_sv_catalog)
pander(example_sv_catalog)

mutation_type	count
clustered\|DEL\|1-10kb	0
clustered\|DEL\|10-100kb	0
clustered\|DEL\|100kb-1Mb	0
clustered\|DEL\|1Mb-10Mb	3
clustered\|DEL\|>10Mb	1
clustered\|DUP\|1-10kb	0
clustered\|DUP\|10-100kb	1
clustered\|DUP\|100kb-1Mb	0
clustered\|DUP\|1Mb-10Mb	2
clustered\|DUP\|>10Mb	1
clustered\|INV\|1-10kb	0
clustered\|INV\|10-100kb	2
clustered\|INV\|100kb-1Mb	1
clustered\|INV\|1Mb-10Mb	2
clustered\|INV\|>10Mb	3
clustered\|TRA\|	8
non-clustered\|DEL\|1-10kb	14
non-clustered\|DEL\|10-100kb	14
non-clustered\|DEL\|100kb-1Mb	1
non-clustered\|DEL\|1Mb-10Mb	6
non-clustered\|DEL\|>10Mb	8
non-clustered\|DUP\|1-10kb	1
non-clustered\|DUP\|10-100kb	4
non-clustered\|DUP\|100kb-1Mb	9
non-clustered\|DUP\|1Mb-10Mb	4
non-clustered\|DUP\|>10Mb	1
non-clustered\|INV\|1-10kb	1
non-clustered\|INV\|10-100kb	4
non-clustered\|INV\|100kb-1Mb	1
non-clustered\|INV\|1Mb-10Mb	13
non-clustered\|INV\|>10Mb	16
non-clustered\|TRA\|	25

Next, we perform SignIT analysis as before.

sv_exposures <- get_exposures(example_sv_catalog, reference_signatures = sv_reference, n_cores = 1)

## Sampling

## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 1).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.44 seconds (Warm-up)
##                0.33 seconds (Sampling)
##                0.77 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 2).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.54 seconds (Warm-up)
##                0.58 seconds (Sampling)
##                1.12 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 3).
## 
## Gradient evaluation took 0 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.41 seconds (Warm-up)
##                0.33 seconds (Sampling)
##                0.74 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'signature_model' NOW (CHAIN 4).
## 
## Gradient evaluation took 0.01 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 100 seconds.
## Adjust your expectations accordingly!
## 
## 
## Iteration:   1 / 400 [  0%]  (Warmup)
## Iteration:  40 / 400 [ 10%]  (Warmup)
## Iteration:  80 / 400 [ 20%]  (Warmup)
## Iteration: 120 / 400 [ 30%]  (Warmup)
## Iteration: 160 / 400 [ 40%]  (Warmup)
## Iteration: 200 / 400 [ 50%]  (Warmup)
## Iteration: 201 / 400 [ 50%]  (Sampling)
## Iteration: 240 / 400 [ 60%]  (Sampling)
## Iteration: 280 / 400 [ 70%]  (Sampling)
## Iteration: 320 / 400 [ 80%]  (Sampling)
## Iteration: 360 / 400 [ 90%]  (Sampling)
## Iteration: 400 / 400 [100%]  (Sampling)
## 
##  Elapsed Time: 0.42 seconds (Warm-up)
##                0.32 seconds (Sampling)
##                0.74 seconds (Total)

And we can use the same plotting functions,

plot_exposure_posteriors_bleed(sv_exposures, signature_trim = 'Rearrangement Signature')

And the same summarization function

pander(get_exposure_summary_table(sv_exposures))

signature	mode_exposure	mean_exposure	median_exposure	0%	100%	2.5%	25%	75%	97.5%
Rearrangement Signature 1	12.3	13.8	13.6	4.64	29.1	6.8	11	16.3	22.5
Rearrangement Signature 2	60.7	63.1	63	39.9	88.4	46.8	57.6	68.3	77.4
Rearrangement Signature 3	0.69	2.34	1.92	0.0026	14.2	0.0945	0.853	3.26	7.12
Rearrangement Signature 4	12.7	14.2	13.8	3.9	34.8	6.35	10.9	17.2	23.5
Rearrangement Signature 5	35.2	35.4	35.1	15.6	54.1	23.6	31	39.7	48.4
Rearrangement Signature 6	16.5	17.2	16.9	4.42	33.3	7.63	13.4	20.7	28.6

Temporally Dissecting Mutation Signatures

We next explore how to perform temporal dissection of mutation signatures. The input data here is a data.frame or tibble of mutations, with some additional metadata. The following glimpse shows the necessary formatting of the input. Feel free to load data(example_variants) to see the complete input data for yourself.

glimpse(example_variants)

## Observations: 9,506
## Variables: 15
## $ chr            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pos            <dbl> 137744, 1479032, 2038218, 2615086, 2615087, 262...
## $ ref            <chr> "G", "T", "C", "C", "T", "C", "G", "A", "C", "A...
## $ alt            <chr> "C", "A", "T", "T", "C", "A", "A", "G", "T", "G...
## $ A              <int> 0, 8, 0, 0, 0, 24, 6, 87, 0, 81, 0, 4, 0, 25, 0...
## $ C              <int> 9, 3, 57, 41, 5, 191, 0, 0, 84, 0, 38, 15, 48, ...
## $ G              <int> 77, 0, 1, 0, 0, 0, 33, 7, 0, 10, 0, 0, 34, 0, 2...
## $ T              <int> 0, 104, 7, 5, 41, 0, 0, 0, 7, 0, 30, 0, 0, 27, ...
## $ total_depth    <int> 86, 115, 65, 46, 46, 215, 39, 94, 91, 91, 68, 1...
## $ alt_depth      <int> 9, 8, 7, 5, 5, 24, 6, 7, 7, 10, 38, 4, 48, 25, ...
## $ cnv_start      <int> 10022, 582430, 582430, 582430, 582430, 582430, ...
## $ cnv_end        <int> 582429, 8975823, 8975823, 8975823, 8975823, 897...
## $ tumour_copy    <int> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ tumour_content <dbl> 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63, 0.63,...
## $ normal_copy    <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...

Signature timing is performed using get_population_signatures. Note that this process can take quite some time depending on your machine. We are using method = ‘vb’, which employs the variational Bayes solution and is much more computationally economical.

Note that if the parameter n_populations is not provided, an automatic process of model selection is performed to attempt to estimate the optimal number of signatures.

A preferred approach (if computational resources are available) is to perform complete population signature inference for five models: 1-5 populations. Then, compute the WAIC of each model with compute_population_signatures_waic(population_signatures) and choose the model which minimizes it. WAIC is a preferred model selection metric for Bayesian inference.

population_signatures <- get_population_signatures(
  example_variants,
  n_populations = 2,
  method = 'vb'
)

## Chromosome levels match after prepending "chr". Proceeding...

## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels

## No BSgenome object provided. Using default: BSgenome.Hsapiens.UCSC.hg19
## Chromosome levels match after prepending "chr". Proceeding...

## Warning in get_trinucleotide(chr, pos, ref, genome = genome): "chr" was
## automatically prepended to all chromosome names to match genome seqlevels

## Extracting parameters from population model with 2 populations

## Subsetting reference signatures on request

## Sampling

## Running model with provided reference signatures: 13 signatures, 9506 mutations, and 2 clusters

## Sampling

## ------------------------------------------------------------
## EXPERIMENTAL ALGORITHM:
##   This procedure has not been thoroughly tested and may be unstable
##   or buggy. The interface is subject to change.
## ------------------------------------------------------------
## 
## 
## 
## Gradient evaluation took 2.35 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 23500 seconds.
## Adjust your expectations accordingly!
## 
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration:  50 / 250 [ 20%]  (Adaptation)
## Iteration: 100 / 250 [ 40%]  (Adaptation)
## Iteration: 150 / 250 [ 60%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -8e+04             1.000            1.000
##    200     -8e+04             0.500            1.000
##    300     -8e+04             0.334            0.001   MEDIAN ELBO CONVERGED
## 
## Drawing a sample of size 1000 from the approximate posterior... 
## COMPLETED.

Now, let’s visualize the results.

plot_population_signatures(population_signatures)

## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

This analysis is revealing two tumour subpopulations. The red higher-prevalence subpopulation is inferred to be “earlier”. The Proportion refers to the fraction of mutations contributed from each subpopulation.

We see here that Signature 3 exhibits early bias while signatures 16, 17, and 30 have late bias.

For quantitative analysis workflows, use the following to provide a summary of these results:

pander(summarise_population_signatures(population_signatures))

## Warning in function_list[[k]](value): NAs introduced by coercion

## Warning in function_list[[k]](value): NAs introduced by coercion

population	signature	mean	median	mode	sd	population_proportion	prevalence_mean
Population 1	Signature 1	0.0633	0.0639	0.0653	0.0051	0.801	1
Population 1	Signature 2	0.0281	0.0282	0.0282	0.00318	0.801	1
Population 1	Signature 3	0.364	0.368	0.378	0.0164	0.801	1
Population 1	Signature 4	0.0343	0.0332	0.0304	0.0079	0.801	1
Population 1	Signature 5	0.0424	0.0397	0.0335	0.0141	0.801	1
Population 1	Signature 8	0.266	0.268	0.27	0.0153	0.801	1
Population 1	Signature 9	0.0703	0.0705	0.0715	0.00893	0.801	1
Population 1	Signature 11	0.03	0.0298	0.0298	0.00491	0.801	1
Population 1	Signature 13	0.0324	0.0322	0.0321	0.0047	0.801	1
Population 1	Signature 16	0.0178	0.0163	0.013	0.00686	0.801	1
Population 1	Signature 17	0.02	0.0192	0.0186	0.00515	0.801	1
Population 1	Signature 24	0.017	0.0165	0.0164	0.00494	0.801	1
Population 1	Signature 30	0.0149	0.0145	0.0125	0.00453	0.801	1
Population 2	Signature 1	0.0314	0.0292	0.0251	0.00262	0.199	0.309
Population 2	Signature 2	0.0124	0.0114	0.00867	0.00122	0.199	0.309
Population 2	Signature 3	0.0268	0.0252	0.0221	0.00229	0.199	0.309
Population 2	Signature 4	0.00538	0.00287	0.0014	0.00169	0.199	0.309
Population 2	Signature 5	0.0422	0.039	0.0331	0.00403	0.199	0.309
Population 2	Signature 8	0.237	0.243	0.252	0.00432	0.199	0.309
Population 2	Signature 9	0.0902	0.0901	0.0922	0.00478	0.199	0.309
Population 2	Signature 11	0.00791	0.00588	0.00343	0.00155	0.199	0.309
Population 2	Signature 13	0.00919	0.00835	0.00706	0.000931	0.199	0.309
Population 2	Signature 16	0.297	0.304	0.316	0.00563	0.199	0.309
Population 2	Signature 17	0.171	0.175	0.174	0.00342	0.199	0.309
Population 2	Signature 24	0.0029	0.00183	0.001	0.000782	0.199	0.309
Population 2	Signature 30	0.0654	0.0643	0.0643	0.00388	0.199	0.309

Session Info

The below output specifies the packages used in this analysis, and their version numbers.

print(sessionInfo())

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.7 (Final)
## 
## Matrix products: default
## BLAS: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRblas.so
## LAPACK: /projects/ezhao_prj/dependencies/miniconda3/envs/dependencies/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.44.2                  
##  [3] rtracklayer_1.36.6                Biostrings_2.46.0                
##  [5] XVector_0.18.0                    GenomicRanges_1.30.0             
##  [7] GenomeInfoDb_1.14.0               IRanges_2.12.0                   
##  [9] S4Vectors_0.16.0                  BiocGenerics_0.24.0              
## [11] bindrcpp_0.2                      pander_0.6.1                     
## [13] ggraph_1.0.1                      hexbin_1.27.1                    
## [15] cowplot_0.9.2                     rstan_2.17.2                     
## [17] StanHeaders_2.17.2                dplyr_0.7.4                      
## [19] purrr_0.2.4                       readr_1.1.1                      
## [21] tidyr_0.8.0                       tibble_1.3.4                     
## [23] ggplot2_2.2.1                     tidyverse_1.1.1                  
## [25] nnls_1.4                          signit_1.0.1                     
## [27] Rcpp_0.12.13                     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131               bitops_1.0-6              
##  [3] matrixStats_0.52.2         lubridate_1.6.0           
##  [5] RColorBrewer_1.1-2         doParallel_1.0.11         
##  [7] httr_1.3.1                 rprojroot_1.3-1           
##  [9] tools_3.4.1                backports_1.1.2           
## [11] R6_2.2.2                   lazyeval_0.2.1            
## [13] colorspace_1.3-2           tidyselect_0.2.3          
## [15] gridExtra_2.3              mnormt_1.5-5              
## [17] compiler_3.4.1             rvest_0.3.2               
## [19] Biobase_2.38.0             xml2_1.1.1                
## [21] DelayedArray_0.4.1         labeling_0.3              
## [23] scales_0.5.0               psych_1.7.3.21            
## [25] stringr_1.2.0              digest_0.6.13             
## [27] Rsamtools_1.30.0           foreign_0.8-67            
## [29] rmarkdown_1.8.5            pkgconfig_2.0.1           
## [31] htmltools_0.3.6            readxl_1.0.0              
## [33] rlang_0.1.4                bindr_0.1                 
## [35] jsonlite_1.5               BiocParallel_1.12.0       
## [37] inline_0.3.14              RCurl_1.95-4.8            
## [39] magrittr_1.5               GenomeInfoDbData_0.99.1   
## [41] Matrix_1.2-7.1             munsell_0.4.3             
## [43] viridis_0.5.0              stringi_1.1.6             
## [45] yaml_2.1.16                MASS_7.3-47               
## [47] SummarizedExperiment_1.8.0 zlibbioc_1.24.0           
## [49] plyr_1.8.4                 grid_3.4.1                
## [51] ggrepel_0.7.0              forcats_0.2.0             
## [53] udunits2_0.13              lattice_0.20-34           
## [55] haven_1.0.0                hms_0.4.1                 
## [57] knitr_1.17                 igraph_1.1.2              
## [59] reshape2_1.4.3             codetools_0.2-15          
## [61] rmutil_1.1.0               XML_3.98-1.6              
## [63] glue_1.2.0                 evaluate_0.10.1           
## [65] modelr_0.1.0               tweenr_0.1.5              
## [67] foreach_1.4.4              cellranger_1.1.0          
## [69] gtable_0.2.0               assertthat_0.2.0          
## [71] ggforce_0.1.1              broom_0.4.2               
## [73] viridisLite_0.3.0          iterators_1.0.9           
## [75] GenomicAlignments_1.12.2   units_0.4-4

mutation_type	count
clustered\|DEL\|1-10kb	0
clustered\|DEL\|10-100kb	0
clustered\|DEL\|100kb-1Mb	0
clustered\|DEL\|1Mb-10Mb	3
clustered\|DEL\|>10Mb	1
clustered\|DUP\|1-10kb	0
clustered\|DUP\|10-100kb	1
clustered\|DUP\|100kb-1Mb	0
clustered\|DUP\|1Mb-10Mb	2
clustered\|DUP\|>10Mb	1
clustered\|INV\|1-10kb	0
clustered\|INV\|10-100kb	2
clustered\|INV\|100kb-1Mb	1
clustered\|INV\|1Mb-10Mb	2
clustered\|INV\|>10Mb	3
clustered\|TRA\|	8
non-clustered\|DEL\|1-10kb	14
non-clustered\|DEL\|10-100kb	14
non-clustered\|DEL\|100kb-1Mb	1
non-clustered\|DEL\|1Mb-10Mb	6
non-clustered\|DEL\|>10Mb	8
non-clustered\|DUP\|1-10kb	1
non-clustered\|DUP\|10-100kb	4
non-clustered\|DUP\|100kb-1Mb	9
non-clustered\|DUP\|1Mb-10Mb	4
non-clustered\|DUP\|>10Mb	1
non-clustered\|INV\|1-10kb	1
non-clustered\|INV\|10-100kb	4
non-clustered\|INV\|100kb-1Mb	1
non-clustered\|INV\|1Mb-10Mb	13
non-clustered\|INV\|>10Mb	16
non-clustered\|TRA\|	25

SignIT Vignette

Eric Zhao

Feb 9, 2017