--- title: "Genetic File Information" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Genetic File Information} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(BinaryDosage) ``` The routines getbdinfo, getvcfinfo, and getgeninfo return a list with information about the data in the files. The list returned by each of these routines a section common to them all and a list additionalinfo that is specific to the file type. ## Common section The common section has the following elements - filename - Character value with the complete path and file name of the file with the genetic data - usesfid - Logical value indicating if the subject data has family IDs. - samples - Data frame containing the following information about the subjects + fid - Character value with family IDs + sid - Character value with the individual IDs - onchr - Logical value indicating if all the SNPs are on the same chromosome - snpidformat - Integer indicating the format of the SNP IDs as follows + 0 - Unknown for VCF and GEN files or user specified for binary dosage files + 1 - chromosome:location + 2 - chromosome:location:referenceallele:alternateallele + 3 - chromosome:location_referenceallele_alternateallele - snps - Data frame containing the following values + chromosome - Character value indicating what chromosome the SNP is on + location - Integer value with the location of the SNP on the chromosome + snpid - Character value with the ID of the SNP + reference - Character value of the reference allele + alternate - Character value of the alternate allele - snpinfo - List that contain the following information + aaf - numeric vector with the alternate allele frequencies + maf - numeric vector with the minor allele frequencies + avgcall - Numeric vector with the imputation average call + rsq - Numeric vector with the imputation r squared value - datasize - Numeric vector indicating the size of data in the file for each SNP - indices - Numeric vector indicating the starting location in the file for each SNP The list returned has its class value set to "genetic-info". The datasize and indices values are only returned if the parameter index is set equal to TRUE ## Binary Dosage Additional Information The additional information returned for binary dosage files contains the following information. - format - numeric value with the format of the binary dosage file - subformat - numeric value with the subformat of the binary dosage file - headersize - integer value with the size of the header in the binary dosage file - numgroups - integer value of the number of groups of subjects in the binary dosage file. This is usually the number of binary dosage files merged together to form the file - groups - integer vector with size of each of the groups This list has its class value set to "bdose-info". ## VCF File Additional Information The additional information returned for VCF files contains the following information. - gzipped - Logical value indicating if the file has been compressed using gzip - headerlines - Integer value indicating the number of lines in the header - headersize - Numeric value indicating the size of the header in bytes - quality - Character vector containing the values in QUALITY column - filter - Character vector containing the values in the FILTER column - info - Character vector containing the values in the INFO column - format - Character vector containing the values in the FORMAT column - datacolumns - Data frame summarizing the entries in the FORMAT value containing the following information + numcolumns - Integer value indicating the number of values in the FORMAT value + dosage - Integer value indicating the column containing the dosage value + genotypeprob - Integer value indicating the column containing the genotype probabilities + genotype - Integer value indicating the column containing the genotype call This list has its class value set to "vcf-info". The values for quality, filter, info, and format can have a length of 0 if all the values are missing. They will have a length of 1 if all the values are equal. The number of rows in the datacolumns data frame will be equal to the length of the format value. ## GEN File Additional Information The additional information returned for GEN files contains the following information. - gzipped - Logical value indicating if the GEN file is compressed using gz - headersize - Integer value indicating the size of the header in bytes - format - Integer value indicating the number of genotype probabilities for each subject with the following meanings + 1 - Dosage only + 2 - $\Pr(g=0)$ and $\Pr(g=1)$ + 3 - $\Pr(g=0)$, $\Pr(g=1)$, and $\Pr(g=2)$ - startcolumn - Integer value indicating in which column the genetic data starts - sep - Character value indicating what value separates the columns $g$ indicates the number of alternate alleles the subject has. This list has its class value set to "gen-info".