---
title: "Genetic File Information"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Genetic File Information}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(BinaryDosage)
```
The routines getbdinfo, getvcfinfo, and getgeninfo return a list with information about the data in the files. The list returned by each of these routines has a section common to them all and a list, additionalinfo, that is specific to the file type.
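To make that layout concrete, the sketch below builds a small stand-in list by hand and separates the common elements from additionalinfo. All values are invented and the file name is hypothetical. With a real file the list would come from getbdinfo, getvcfinfo, or getgeninfo rather than being built manually.

```{r}
# Hand-built stand-in for a returned list; every value here is invented
mock_info <- list(
  filename = "example.bdose",  # hypothetical file name
  usesfid = FALSE,
  samples = data.frame(fid = c("", ""), sid = c("S1", "S2"),
                       stringsAsFactors = FALSE),
  additionalinfo = structure(list(format = 4, subformat = 2),
                             class = "bdose-info")
)
class(mock_info) <- "genetic-info"

# Elements shared by every file type
common <- setdiff(names(mock_info), "additionalinfo")
common

# The class of additionalinfo identifies the file type
class(mock_info$additionalinfo)
```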
## Common section
The common section has the following elements:
- filename - Character value with the complete path and file name of the file with the genetic data
- usesfid - Logical value indicating if the subject data has family IDs.
- samples - Data frame containing the following information about the subjects
    + fid - Character value with family IDs
    + sid - Character value with the individual IDs
- onechr - Logical value indicating if all the SNPs are on the same chromosome
- snpidformat - Integer indicating the format of the SNP IDs as follows
    + 0 - Unknown for VCF and GEN files or user specified for binary dosage files
    + 1 - chromosome:location
    + 2 - chromosome:location:referenceallele:alternateallele
    + 3 - chromosome:location_referenceallele_alternateallele
- snps - Data frame containing the following values
    + chromosome - Character value indicating what chromosome the SNP is on
    + location - Integer value with the location of the SNP on the chromosome
    + snpid - Character value with the ID of the SNP
    + reference - Character value of the reference allele
    + alternate - Character value of the alternate allele
- snpinfo - List that contains the following information
    + aaf - Numeric vector with the alternate allele frequencies
    + maf - Numeric vector with the minor allele frequencies
    + avgcall - Numeric vector with the imputation average call
    + rsq - Numeric vector with the imputation r squared value
- datasize - Numeric vector indicating the size of data in the file for each SNP
- indices - Numeric vector indicating the starting location in the file for each SNP
The list returned has its class value set to "genetic-info".
The datasize and indices values are only returned if the parameter index is set to TRUE.
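The snpidformat codes can be illustrated with a short sketch that rebuilds SNP IDs from the columns of a snps data frame. The helper build_snpid is hypothetical and not part of BinaryDosage, and the SNP values are invented. It only shows what each format code looks like.

```{r}
# build_snpid is a hypothetical helper, not a BinaryDosage function
build_snpid <- function(snps, snpidformat) {
  # sprintf avoids scientific notation for large locations
  loc <- sprintf("%s:%d", snps$chromosome, snps$location)
  switch(as.character(snpidformat),
         "1" = loc,
         "2" = paste(loc, snps$reference, snps$alternate, sep = ":"),
         "3" = paste(loc, snps$reference, snps$alternate, sep = "_"),
         snps$snpid)  # format 0: keep the IDs as stored
}

# Invented SNP table with the columns described above
snps <- data.frame(chromosome = c("1", "1"),
                   location = c(10000L, 11000L),
                   snpid = c("rs1", "rs2"),
                   reference = c("A", "G"),
                   alternate = c("C", "T"),
                   stringsAsFactors = FALSE)

build_snpid(snps, 1)
build_snpid(snps, 2)
build_snpid(snps, 3)
```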
## Binary Dosage Additional Information
The additional information returned for binary dosage files contains the following information.
- format - Numeric value with the format of the binary dosage file
- subformat - Numeric value with the subformat of the binary dosage file
- headersize - Integer value with the size of the header in the binary dosage file
- numgroups - Integer value of the number of groups of subjects in the binary dosage file. This is usually the number of binary dosage files merged together to form the file.
- groups - Integer vector with the size of each of the groups
This list has its class value set to "bdose-info".
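As a quick consistency check on these elements, the sketch below uses an invented bdose-info list. It assumes that groups holds one size per group and that the group sizes partition the subjects, so they should add up to the number of rows in samples. The header size and subject count are made up.

```{r}
# Invented additional information for a file merged from two groups
bdinfo <- structure(
  list(format = 4, subformat = 2, headersize = 1024L,
       numgroups = 2L, groups = c(60L, 40L)),
  class = "bdose-info"
)
n_subjects <- 100L  # stand-in for the number of rows in samples

# One entry in groups per group
length(bdinfo$groups) == bdinfo$numgroups

# Assumed: the group sizes add up to the subject count
sum(bdinfo$groups) == n_subjects
```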
## VCF File Additional Information
The additional information returned for VCF files contains the following information.
- gzipped - Logical value indicating if the file has been compressed using gzip
- headerlines - Integer value indicating the number of lines in the header
- headersize - Numeric value indicating the size of the header in bytes
- quality - Character vector containing the values in the QUAL column
- filter - Character vector containing the values in the FILTER column
- info - Character vector containing the values in the INFO column
- format - Character vector containing the values in the FORMAT column
- datacolumns - Data frame summarizing the entries in the FORMAT value containing the following information
    + numcolumns - Integer value indicating the number of values in the FORMAT value
    + dosage - Integer value indicating the column containing the dosage value
    + genotypeprob - Integer value indicating the column containing the genotype probabilities
    + genotype - Integer value indicating the column containing the genotype call
This list has its class value set to "vcf-info".
The values for quality, filter, info, and format can have a length of 0 if all the values are missing. They will have a length of 1 if all the values are equal. The number of rows in the datacolumns data frame will be equal to the length of the format value.
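Because these vectors may be stored in a collapsed form, code that needs one value per SNP has to expand them first. The following is a minimal sketch of that expansion. The helper expand_per_snp is hypothetical and not part of BinaryDosage, and it assumes that any vector longer than 1 already has one entry per SNP.

```{r}
# expand_per_snp is a hypothetical helper, not a BinaryDosage function
expand_per_snp <- function(x, nsnps) {
  if (length(x) == 0L) return(rep(NA_character_, nsnps))  # all values missing
  if (length(x) == 1L) return(rep(x, nsnps))              # all values equal
  x                                                       # assumed one per SNP
}

expand_per_snp(character(0), 3)
expand_per_snp("PASS", 3)
expand_per_snp(c("GT:DS", "GT:DS:GP", "GT"), 3)
```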
## GEN File Additional Information
The additional information returned for GEN files contains the following information.
- gzipped - Logical value indicating if the GEN file has been compressed using gzip
- headersize - Integer value indicating the size of the header in bytes
- format - Integer value indicating the number of genotype probabilities for each subject with the following meanings
    + 1 - Dosage only
    + 2 - $\Pr(g=0)$ and $\Pr(g=1)$
    + 3 - $\Pr(g=0)$, $\Pr(g=1)$, and $\Pr(g=2)$
- startcolumn - Integer value indicating in which column the genetic data starts
- sep - Character value indicating what value separates the columns
$g$ indicates the number of alternate alleles the subject has.
This list has its class value set to "gen-info".
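The three format values carry the same kind of information in different amounts. The sketch below shows how an expected dosage, $\Pr(g=1) + 2\Pr(g=2)$, could be derived under each one. The helper gen_dosage is hypothetical and not part of BinaryDosage. For format 2 it assumes the missing probability is $\Pr(g=2) = 1 - \Pr(g=0) - \Pr(g=1)$. The probabilities are invented.

```{r}
# gen_dosage is a hypothetical helper, not a BinaryDosage function.
# probs has one row per subject and as many columns as the format value.
gen_dosage <- function(probs, format) {
  probs <- as.matrix(probs)
  stopifnot(ncol(probs) == format)
  if (format == 1) return(probs[, 1])  # the single column is the dosage
  p1 <- probs[, 2]
  # Format 2 omits Pr(g=2), so recover it as the complement (assumption)
  p2 <- if (format == 2) 1 - probs[, 1] - probs[, 2] else probs[, 3]
  p1 + 2 * p2
}

gen_dosage(cbind(c(0.5, 1.9)), 1)
gen_dosage(cbind(c(0.75, 0), c(0.25, 0.5)), 2)
gen_dosage(cbind(c(0.75, 0), c(0.25, 0.5), c(0, 0.5)), 3)
```

Formats 2 and 3 give the same dosages here because the third probability in the format 3 input is exactly the complement of the first two.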