%\VignetteIndexEntry{User manual} \documentclass[a4paper]{article} %\usepackage{hyperref} \usepackage[backend=bibtex]{biblatex} \bibliography{ClusteredMutationsVignettes.bib} \usepackage{graphicx} \usepackage{mathtools} \begin{document} \SweaveOpts{concordance=TRUE} \graphicspath{ {/home/david/Documents/ClusteredMutations/inst/doc/} } \title{\textbf{\emph{ClusteredMutations}: Looking for a (Mutation) Shower. \\}} \author{\textbf{David Lora}\\ \small{Clinical Research Unit (imas12)}\\ \small{Hospital Universitario 12 de Octubre}\\ \small{CIBER de Epidemiolog\'ia y Salud P\'ublica (CIBERESP)}\\ \small{Madrid, Spain}\\ \small{E-mail: \texttt{david@h12o.es}} } \maketitle \abstract{} This vignette shows the steps to identify the hyper-mutated zones, i.e., groups of closely spaced mutations, with a data set of somatic substitution mutations from a primary breast cancer whole genome with a germline mutation in BRCA1 using \emph{ClusteredMutations}. \\ \\ \textbf{keywords:} \emph{cancer genome, somatic mutation, mutation showers, clustered mutations, kataegis, anti-Robinson matrix} \begin{figure}[b] \begin{center} \includegraphics{figura1} \end{center} \end{figure} \newpage \section{Example and applications.} In the following example, a data set (PD4107a) of somatic substitution mutations from a primary breast cancer whole genome with a germline mutation in BRCA1 \cite{alexandrov_signatures_2013, nik-zainal_mutational_2012} is used to locate the hyper-mutated zones using \emph{ClusteredMutations}. \\ First, \emph{ClusteredMutations} package and PD4107a data set are loaded. <<>>= library(ClusteredMutations) data(PD4107a) @ \emph{showers()} is called with a change in the default setting to identify the complex mutations. Complex mutations are those regions with two or more mutations with each mutation separated by less than 10 bp from their nearest neighbor; therefore, min=2 and max=10 are used. Because complex mutations likely originate from trans-lesion synthesis (TLS) past a single DNA lesion\cite{harfe_dna_2000}, Roberts et al.\cite{roberts_clustered_2012, roberts_apobec_2013} proposed treating them as a single event. In this example, all somatic substitution mutations are used, including complex mutations. <<>>= data.showers<-showers(data=PD4107a, chr=Chr, position=Position, min=2, max=10) head(data.showers, n=10) @ The classic graph (Figure 1) to localize the regional clustering of mutations is the rainfall plot\cite{nik-zainal_mutational_2012}. \emph{imd()} permits the generation of a data set with the inter-mutational distance (IMD), the distance between each somatic substitution and the substitution immediately prior\cite{nik-zainal_mutational_2012}, and extra information, for example: base substitutions. \begin{figure} \begin{center} <>= extra <- factor(c(),levels=c("T>C","T>G","T>A","C>T","C>G","C>A")) extra[PD4107a$Ref_base=="A" & PD4107a$Mutant_base=="G"]<-"T>C" extra[PD4107a$Ref_base=="T" & PD4107a$Mutant_base=="C"]<-"T>C" extra[PD4107a$Ref_base=="A" & PD4107a$Mutant_base=="C"]<-"T>G" extra[PD4107a$Ref_base=="T" & PD4107a$Mutant_base=="G"]<-"T>G" extra[PD4107a$Ref_base=="A" & PD4107a$Mutant_base=="T"]<-"T>A" extra[PD4107a$Ref_base=="T" & PD4107a$Mutant_base=="A"]<-"T>A" extra[PD4107a$Ref_base=="G" & PD4107a$Mutant_base=="A"]<-"C>T" extra[PD4107a$Ref_base=="C" & PD4107a$Mutant_base=="T"]<-"C>T" extra[PD4107a$Ref_base=="G" & PD4107a$Mutant_base=="C"]<-"C>G" extra[PD4107a$Ref_base=="C" & PD4107a$Mutant_base=="G"]<-"C>G" extra[PD4107a$Ref_base=="G" & PD4107a$Mutant_base=="T"]<-"C>A" extra[PD4107a$Ref_base=="C" & PD4107a$Mutant_base=="A"]<-"C>A" PD4107a$extra<-extra rainfall<-imd(data=PD4107a,chr=Chr,position=Position,extra=extra) plot(rainfall$number, rainfall$log10distance, col=c("yellow", "green", "pink", "red", "black", "blue")[rainfall$extra], pch=20, ylab="Intermutation distance (bp)", xlab="PD4107a", yaxt="n") axis(2, at=c(0, 1, 2, 3, 4, 6), labels=c("1", "10", "100", "1000", "10000", "1000000"), las=2, cex.axis=0.6) legend("topleft", legend = levels(rainfall$extra), col=c("yellow", "green", "pink", "red", "black", "blue"), pch=20, horiz=TRUE, text.font=4, bg='lightblue') @ \end{center} \caption{Rainfall plot of somatic substitution mutations from a patient with breast cancer (PD4107a).} \label{fig:1} \end{figure} \newpage Figure 1 is the rainfall plot of PD4107a. The horizontal axis presents the number of consecutive mutations, the vertical axis indicates the distance between each somatic substitution and the substitution immediately prior (inter-mutational distance). Four features were observed in the rainfall plot of PD4107a (Figure 2): \\ 1) Many of the somatic mutations have IMD greater than 10000 bp; thus, two consecutive mutations are sufficiently remote to be candidates to belong to the regional clustering of substitution mutations. \\ 2) There were mutations with distances less than 10 bp, i.e., complex mutations\cite{roberts_clustered_2012, roberts_apobec_2013}. \\ 3) There were mutation zones with IMD less than 1000 bp on chromosomes 6 and 12. These zones can be hyper-mutated regions. \\ 4) The preponderance of C>T and C>G substitutions is present in the candidate zones. \\ The rainfall plot identifies candidate hyper-mutated zones. Visual assessment can be erroneous (Figure 2). \emph{showers()} is called with the default setting. There are no clustered mutations. \begin{figure} \begin{center} <>= set.seed(42) position<-c( c((runif(1001,min=1,max=10000001))), c(c(10110001,10110011,10110021,10110031,10110041,10120000, 10120001,10120011,10120021,10120031,10130000, 10130001,10130011,10130021,10130031,10140000, 10140001,10140011,10140021,10140031,10150000, 10150001,10150011,10150021,10150031,10160000, 10160001,10160011,10160021,10160031,10170000, 10170001,10170011,10170021,10170031,10180000, 10180001,10180011,10180021,10180031,10190000, 10210001,10210011,10210021,10210031,10220000, 10220001,10220011,10220021,10220031,10230000, 10230001,10230011,10230021,10230031,10240000, 10240001,10240011,10240021,10240031,10250000, 10250001,10250011,10250021,10250031,10260000, 10260001,10260011,10260021,10260031,10270000, 10270001,10270011,10270021,10270031,10280000, 10280001,10280011,10280021,10280031,10290000) + round(runif(81,min=0,max=500))), c(round(runif(991,min=10296000,max=20200000)))) rainfall<-imd(position=position) #Rainfall plot for PD4107a cancer sample; plot(rainfall$number, rainfall$log10distance, pch=20, ylab="Intermutation distance (bp)", xlab="Example", yaxt="n") axis(2, at=c(0, 1, 2, 3, 4, 6), labels=c("1", "10", "100", "1000", "10000", "1000000"), las=2, cex.axis=0.6) theta <- seq(0, 2 * pi, length = 200) lines(x = 100 * cos(theta) + 1050, y = sin(theta) + 1.5, col="red") @ \end{center} \caption{Rainfall plot of an example data set of somatic substitution mutations.} \label{fig:2} \end{figure} \newpage \emph{showers()} function in the example data set finds 0 hyper-mutated zones. <<>>= showers(position=position) @ \emph{showers()} function in the PD4107a data set finds 21 hyper-mutated zones with 674 mutations, \emph{features()} shows the mutation positions in the chromosome with additional information. <<>>= showers(data=PD4107a,chr=Chr,position=Position) @ \emph{showers()} function uses the anti-Robinson properties. For example, for a sample of DNA sequence of cancer cell with 14 somatic substitution mutations (Figure 3) and if the hyper-mutated zone contains those segments with >= 5 consecutive mutations with a distance of <= 100 bp, then: \\ 1) the distance between the mutations 3 and 7 is \[d_{37} = 106\] 2) The distance is increased when moving away from the main diagonal, by row \[d_{37} = 106 <= d_{38} = 116\] and by column \[d_{37} = 106 <= d_{27} = 206\] 3) The length of the hyper-mutated zone is determined by row or by column based on anti-Robinson properties: j - i + 1 = 10 - 4 + 1 = 7. \\ 4) The calculated distance to the identified hyper-mutated zones is equal to: n - min + (number of the hyper-mutated zones in the sample) + 1 = 14 - 5 + 1 + 1 = 11. \newpage <<>>= example1<-c(1,101,201,299,301,306,307,317,318,320,418,518,528,628) 10**(dissmutmatrix(position=example1,upper=TRUE)) @ \begin{figure}[h] \begin{center} \includegraphics{figura3} \end{center} \caption{Symmetric dissimilarity mutation matrix for a sample of DNA sequence of cancer cell with 14 somatic substitution mutations.} \label{fig:3} \end{figure} \newpage \emph{dissmutmatrix()} obtains the symmetric dissimilarity mutations matrix. This function computes and returns the distance matrix computed using the Euclidean distance measure to compute the distances between all pairs of positions of somatic mutations. This matrix can be plotted using the \emph{dissplot()} function of the \emph{seriation} R package\cite{hahsler_getting_2008}. Plotting the distance matrix helps to visualize and identify mutation clusters in addition to locating the micro-clustered mutated regions within the macro-clustered mutated zones that occur during the oncogenic process. The plot is applied to chromosome 6 of the PD4107 data set (Figure 4). The distances, in logarithm of base 10, are colored according to the existing color palette. Observed clusters of mutations, or candidates for hyper-mutated zones (less than 5000 bp between distant mutations), are shown by orange and red squares (20 regions). \begin{figure}[h] \begin{center} <>= mut.matrix <- dissmutmatrix(data=PD4107a, chr=Chr, position=Position, subset=6) dissplot(mut.matrix, method=NA, options=list( col = c("black", "navy", "blue", "cyan", "green", "yellow", "orange", "red", "darkred", "darkred", "white"))) @ \end{center} \caption{Graphical representation of the anti-Robinson matrix, i.e, the dissimilarity plot, generated from the mutation positions on chromosome 6.} \label{fig:4} \end{figure} \section{Conclusion.} The anti-Robinson matrix properties can be used to identify and view all small highly mutated zones within the oncogenic process. The properties of this matrix are implemented in an R package: \emph{ClusteredMutations}. \newpage \printbibliography \end{document}