---
title: "MicroSEC: Sequence artifact filtering pipeline for FFPE samples"
author: "Masachika Ikegami"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{my-vignette}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: inline
---

## Introduction
This pipeline is designed for filtering mutations found in formalin-fixed and 
paraffin-embedded (FFPE) samples. The MicroSEC filter utilizes a statistical 
analysis, and the results for mutations with less than 10 supporting reads are 
not reliable. Four files are necessary for the analysis: mutation information 
file, BAM file, and mutation supporting read ID information file.  
  
# File 1: mutation information file (mandatory)
This tsv or tsv.gz file should contain at least these columns:
     Sample Mut_type   Chr       Pos Ref Alt SimpleRepeat_TRF
     Neighborhood_sequence  
sample_name    1-snv  chr1 108130741   C   T                N
CTACCTGGAGAATGGGCCCATGTGTCCAGGTAGCAGTAAGC  
SimpleRepeat_TRF: Whether the mutation locates at a simple repeat sequence or 
not ("Y" or "N").  
Neighborhood_sequence: [5'-20 bases] + [Alt sequence] + [3'-20 bases]  
Sample, Mut_type, Chr, Pos, Ref, and Alt should be set exactly.  
If you do not know the SimpleRepeat_TRF, Mut_type, or Neighborhood_sequence, 
enter "-". Automatically detected.  
  
# File 2: BAM file (mandatory)  
  
# File 3: sample information tsv file 
(mandatory, if multiple samples are processed in a batch)  
Seven to ten columns are necessary (without column names).  
Optional columns can be deleted if they are not applicable.  
[sample name] [mutation information tsv file] [BAM file] [read length] 
[adapter sequence read 1] [optional: adapter sequence read 2] 
[sample type: Human or Mouse] [panel name] 
[optional: reference genome fastq file] 
[optional: simple repeat region bed file]  
PC9	./source/CCLE.tsv	./source/Cell_line/PC9.bam 127 
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Human TOP  
A375 ./source/CCLE.tsv.gz	./source/Cell_line/A375.bam	127	
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT Hg38 TOP ./reference/hg38.fa 
./reference/simpleRepeat.bed.gz  
  
# File 4: Reference genome: Human (hg38 or hg19) or Mouse (mm10) 
(optional, but mandatory with cram files)  
  
# File 5: simple repeat region bed file 
(optional, but mandatory to detect simple repeat derived artifacts)  
  
This pipeline contains 8 filtering processes.  
Filter 1  : Shorter-supporting lengths distribute too unevenly to occur 
(1-1 and 1-2).  
Filter 1-1: P-values are less than the threshold_p(default: 10^(-6)).  
Filter 1-2: The shorter-supporting lengths distributed over less than 75% of 
the read length.  
Filter 2  : Hairpin-structure induced error detection (2-1 and 2-2).  
Filter 2-1: Palindromic sequences exist within 150 bases. 
Filter 2-2: >=50% mutation-supporting reads contains a reverse complementary 
sequence of the opposite strand consisting >= 15 bases.  
Filter 3  : 3’-/5’-supporting lengths are too unevenly distributed to occur 
(3-1 and 3-3).  
Filter 3-1: P-values are less than the threshold_p(default: 10^(-6)).  
Filter 3-2: The distributions of 3’-/5’-supporting lengths are within 75% of 
the read length.  
Filter 4  : >=15% mutations were called by chimeric reads comprising two 
distant regions.  
Filter 5  : >=50% mutations were called by soft-clipped reads.  
Filter 6  : Mutations locating at simple repeat sequences.  
Filter 7  : Mutations locating at a >=15 homopolymer.  
Filter 8  : >=10% low quality bases (Quality score <18) in the mutation 
supporting reads.  
  
Supporting lengths are adjusted considering small repeat sequences around the 
mutations.  
  
## How to use  
Rscript MicroSEC.R [working/output directory] [sample information tsv file] 
[progress bar Y/N]  
  
# Example
```
$ Rscript MicroSEC.R ~ \
  ~/source/Sample_list_test.txt N
$ Rscript MicroSEC.R ~ \
   ~/source/sample_info_test.tsv.gz Y
```
  
If you want to know the progress visually, [progress bar Y/N] should be Y.  
Results are saved in a tsv file.  
  

github url: https://github.com/MANO-B/MicroSEC  
  
## Setting
```{r setting}
wd <- "~"
knitr::opts_chunk$set(collapse = TRUE,
                      fig.width = 12,
                      fig.height = 8,
                      echo = TRUE,
                      warning = FALSE,
                      message = TRUE,
                      comment = "#>")
options(rmarkdown.html_vignette.check_title = FALSE,
        show.error.messages = FALSE,
        warn = -1)

progress_bar <- "N"
```

## Necessary packages
```{r packages}
library(MicroSEC)
```

## Analysis
```{r analysis}
# initialize
msec <- NULL
homology_search <- NULL
mut_depth <- NULL

# test data
sample_name <- "sample"
read_length <- 150
adapter_1 <- "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
adapter_2 <- "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
organism <- "hg38"

# load mutation information
df_mutation <- fun_load_mutation(
   system.file("extdata", "mutation_list.tsv", package = "MicroSEC"),
   "sample",
   BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38,
   24)
df_bam <- fun_load_bam(
   system.file("extdata", "sample.bam", package = "MicroSEC"))

# another example data
# data(exampleMutation)
# data(exampleBam)
# df_mutation <- exampleMutation
# df_bam <- exampleBam

# load genomic sequence
ref_genome <- fun_load_genome(organism)
chr_no <- fun_load_chr_no(organism)

# analysis
result <- fun_read_check(df_mutation = df_mutation,
                         df_bam = df_bam,
                         ref_genome = ref_genome,
                         sample_name = sample_name,
                         read_length = read_length,
                         adapter_1 = adapter_1,
                         adapter_2 = adapter_2,
                         short_homology_search_length = 4,
                         min_homology_search = 40,
                         progress_bar = progress_bar)
msec_read_checked <- result[[1]]
homology_searched <- result[[2]]
mut_depth_checked <- result[[3]]

# search homologous sequences
msec_homology_searched = fun_homology(msec = msec_read_checked,
                    df_distant = homology_searched,
                    min_homology_search = 40,
                    ref_genome = ref_genome,
                    chr_no = chr_no,
                    progress_bar = progress_bar)

# statistical analysis
msec_summarized <- fun_summary(msec_homology_searched)
msec_analyzed <- fun_analysis(msec = msec_summarized,
                    mut_depth = mut_depth_checked,
                    short_homology_search_length = 4,
                    min_homology_search = 40,
                    threshold_p = 10 ^ (-6),
                    threshold_hairpin_ratio = 0.50,
                    threshold_short_length = 0.75,
                    threshold_distant_homology = 0.15,
                    threshold_soft_clip_ratio = 0.50,
                    threshold_low_quality_rate = 0.1,
                    homopolymer_length = 15)

# save the results as a tsv.gz file.
#fun_save(msec_analyzed, "~/MicroSEC_test.tsv.gz")
```

## Results
```{r result}
msec_analyzed
```

## Information about the current R session
```{r sessioninfo}
sessionInfo()
```