--- title: "MicroSEC: Sequence artifact filtering pipeline for FFPE samples" author: "Masachika Ikegami" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{my-vignette} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: inline --- ## Introduction This pipeline is designed for filtering mutations found in formalin-fixed and paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical analysis, and the results for mutations with less than 10 supporting reads are not reliable. Four files are necessary for the analysis: mutation information file, BAM file, and mutation supporting read ID information file. # File 1: mutation information file (mandatory) This tsv or tsv.gz file should contain at least these columns: Sample Mut_type Chr Pos Ref Alt SimpleRepeat_TRF Neighborhood_sequence sample_name 1-snv chr1 108130741 C T N CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or not ("Y" or "N"). Neighborhood_sequence: [5'-20 bases] + [Alt sequence] + [3'-20 bases] Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly. If you do not know the SimpleRepeat_TRF, Mut_type, or Neighborhood_sequence, enter "-". Automatically detected. # File 2: BAM file (mandatory) # File 3: sample information tsv file (mandatory, if multiple samples are processed in a batch) Seven to ten columns are necessary (without column names). Optional columns can be deleted if they are not applicable. [sample name] [mutation information tsv file] [BAM file] [read length] [adapter sequence read 1] [optional: adapter sequence read 2] [sample type: Human or Mouse] [panel name] [optional: reference genome fastq file] [optional: simple repeat region bed file] PC9 ./source/CCLE.tsv ./source/Cell_line/PC9.bam 127 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP A375 ./source/CCLE.tsv.gz ./source/Cell_line/A375.bam 127 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa ./reference/simpleRepeat.bed.gz # File 4: Reference genome: Human (hg38 or hg19) or Mouse (mm10) (optional, but mandatory with cram files) # File 5: simple repeat region bed file (optional, but mandatory to detect simple repeat derived artifacts) This pipeline contains 8 filtering processes. Filter 1 : Shorter-supporting lengths distribute too unevenly to occur (1-1 and 1-2). Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)). Filter 1-2: The shorter-supporting lengths distributed over less than 75% of the read length. Filter 2 : Hairpin-structure induced error detection (2-1 and 2-2). Filter 2-1: Palindromic sequences exist within 150 bases. Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary sequence of the opposite strand consisting >= 15 bases. Filter 3 : 3’-/5’-supporting lengths are too unevenly distributed to occur (3-1 and 3-3). Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)). Filter 3-2: The distributions of 3’-/5’-supporting lengths are within 75% of the read length. Filter 4 : >=15% mutations were called by chimeric reads comprising two distant regions. Filter 5 : >=50% mutations were called by soft-clipped reads. Filter 6 : Mutations locating at simple repeat sequences. Filter 7 : Mutations locating at a >=15 homopolymer. Filter 8 : >=10% low quality bases (Quality score <18) in the mutation supporting reads. Supporting lengths are adjusted considering small repeat sequences around the mutations. ## How to use Rscript MicroSEC.R [working/output directory] [sample information tsv file] [progress bar Y/N] # Example ``` $ Rscript MicroSEC.R ~ \ ~/source/Sample_list_test.txt N $ Rscript MicroSEC.R ~ \ ~/source/sample_info_test.tsv.gz Y ``` If you want to know the progress visually, [progress bar Y/N] should be Y. Results are saved in a tsv file. github url: https://github.com/MANO-B/MicroSEC ## Setting ```{r setting} wd <- "~" knitr::opts_chunk$set(collapse = TRUE, fig.width = 12, fig.height = 8, echo = TRUE, warning = FALSE, message = TRUE, comment = "#>") options(rmarkdown.html_vignette.check_title = FALSE, show.error.messages = FALSE, warn = -1) progress_bar <- "N" ``` ## Necessary packages ```{r packages} library(MicroSEC) ``` ## Analysis ```{r analysis} # initialize msec <- NULL homology_search <- NULL mut_depth <- NULL # test data sample_name <- "sample" read_length <- 150 adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" organism <- "hg38" # load mutation information df_mutation <- fun_load_mutation( system.file("extdata", "mutation_list.tsv", package = "MicroSEC"), "sample", BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38, 24) df_bam <- fun_load_bam( system.file("extdata", "sample.bam", package = "MicroSEC")) # another example data # data(exampleMutation) # data(exampleBam) # df_mutation <- exampleMutation # df_bam <- exampleBam # load genomic sequence ref_genome <- fun_load_genome(organism) chr_no <- fun_load_chr_no(organism) # analysis result <- fun_read_check(df_mutation = df_mutation, df_bam = df_bam, ref_genome = ref_genome, sample_name = sample_name, read_length = read_length, adapter_1 = adapter_1, adapter_2 = adapter_2, short_homology_search_length = 4, min_homology_search = 40, progress_bar = progress_bar) msec_read_checked <- result[[1]] homology_searched <- result[[2]] mut_depth_checked <- result[[3]] # search homologous sequences msec_homology_searched = fun_homology(msec = msec_read_checked, df_distant = homology_searched, min_homology_search = 40, ref_genome = ref_genome, chr_no = chr_no, progress_bar = progress_bar) # statistical analysis msec_summarized <- fun_summary(msec_homology_searched) msec_analyzed <- fun_analysis(msec = msec_summarized, mut_depth = mut_depth_checked, short_homology_search_length = 4, min_homology_search = 40, threshold_p = 10 ^ (-6), threshold_hairpin_ratio = 0.50, threshold_short_length = 0.75, threshold_distant_homology = 0.15, threshold_soft_clip_ratio = 0.50, threshold_low_quality_rate = 0.1, homopolymer_length = 15) # save the results as a tsv.gz file. #fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz") ``` ## Results ```{r result} msec_analyzed ``` ## Information about the current R session ```{r sessioninfo} sessionInfo() ```