--- title: '*LDlinkR*: An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations' author: "Timothy A. Myers, Stephen J. Chanock and Mitchell J. Machiela" date: "`r Sys.Date()`" output: rmarkdown::html_vignette width: 300 vignette: > %\VignetteIndexEntry{*LDlinkR*: An R Package for Rapidly Calculating Linkage Disequilibrium Statistics in Diverse Populations} %\VignetteEngine{knitr::rmarkdown} \usagepackage[utf8]{inputenc} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) # library("LDlinkR") ``` ------------------------------------------------------------------------ ## Introduction [LDlink](https://ldlink.nih.gov) is an interactive and powerful suite of web-based tools for querying germline variants in human population groups of interest to generate interactive tables and plots. All population genotype data originates from Phase 3 (Version 5) of the 1000 Genomes Project and variant RS numbers are indexed based on [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) 155. *LDlinkR* is an R package developed to query and download results **(internet access required)** generated by LDlink web-based applications from the R console. *LDlinkR* accelerates genomic research by providing efficient and user-friendly functions to programmatically interrogate pairwise linkage disequilibrium from large lists of genetic variants. Please see the the sections below and the online [LDlink documentation](https://ldlink.nih.gov/?tab=help#Understanding_Linkage_Disequilibrium) for more information about understanding linkage disequilibrium (LD) and additional details about how LDlink calculates patterns of LD across a variety of ancestral human populations. ## Understanding Linkage Disequilibrium What is linkage **dis**equilibrium? Perhaps it is best to start with linkage **equilibrium**. Linkage equilibrium exists when alleles from two different genetic variants occur independently of each other. The inheritance of such variants follows probabilistic patterns governed by population allele frequencies. The vast majority of genetic variants on a chromosome are in linkage equilibrium. Variants in linkage equilibrium are not considered linked. Linkage **dis**equilibrium is present when alleles from two nearby genetic variants commonly occur together in a non-random, linked fashion. This linked mode of inheritance results from genetic variants in close proximity being less likely to be separated by a recombination event and thus alleles of the variants are more commonly inherited together than expected. Alleles of variants in linkage disequilibrium are correlated; with the degree of correlation generally greater in magnitude the closer the variants are in physical distance. Measures of linkage disequilibrium include D prime (D') and R squared (R2). A haplotype is a cluster of genetic variants that are inherited together. Humans are diploid; having maternal and paternal copies of each autosomal chromosome. Each chromosomal copy is organized into segments of high linkage disequilibrium, called haplotype "blocks". Due to unique population histories and differences in variant allele frequencies, haplotype structure tends to be population specific. Although haplotypes are essential for calculating measures of linkage disequilibrium, haplotypes are seldom directly observed. Statistical chromosome phasing techniques are often necessary to infer individual haplotypes. ## LDlink Data Sources [dbSNP](https://www.ncbi.nlm.nih.gov/snp/) [(source: GRCh37 and GRCh38)](https://ftp.ncbi.nih.gov/snp/) - To investigate patterns of linkage disequilibrium, LDlink focuses on two main classes of genetic variation: single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). Every module of LDlink requires the entry of at least one variant as identified by a RefSNP number (RS number) or genomic position (chr#:position). RS numbers are unique labels assigned by dbSNP and are well-curated identifiers that follow the format "rs" followed by a number. The current implementation of LDlink references dbSNP and only accepts input for bi-allelic variants. 1000 Genomes Project (source: [GRCh37](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), [GRCh38](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/), and [GRCh38 High Coverage](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/)) - Publicly available reference haplotypes from the 1000 Genomes Project are used by LDlink to calculate population-specific measures of linkage disequilibrium. Haplotypes are available for continental populations (ex: European, African, and Admixed American) and sub-populations (ex: Finnish, Gambian, and Peruvian). All LDlink modules require the selection of at least one 1000 Genomes Project sub-population, but several sub-populations can be selected simultaneously. Available haplotypes vary by sub-population based on sample size. [UCSC RefSeq](https://genome.ucsc.edu/cgi-bin/hgTables) (source: GRCh37 and GRCh38) - Publicly available gene transcripts from the UCSC Table Browser are used by LDlink's LDassoc (currently not available in the *LDlinkR* package), LDmatrix, and LDproxy modules to display genes within the genomic window of interest. [RegulomeDB](https://regulomedb.org/) (source: GRCh37) - Publicly available scores from RegulomeDB are used by LDlink's LDassoc (currently not available in the *LDlinkR* package) and LDproxy modules to rank available data types for a single coordinate. GRCh38 support is added via [liftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver). [Genetic Map](https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html) (source: GRCh37) - Publicly available combined recombination rates (cM/Mb) from the 1000 Genomes Project are used by LDlink's LDassoc (currently not available in the *LDlinkR* package) and LDproxy modules to show recombination at specific coordinates. GRCh38 support is added via [liftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver). [GTEx Portal](https://www.gtexportal.org/home/) (source: GRCh38) - Publicly available single-tissue cis-QTL data from the GTEx Portal is used by LDlink's LDexpress module to show significant variant-gene associations in multiple tissue types. GRCh37 support is added via GTEx lookup table. [GWAS Catalog](https://www.ebi.ac.uk/gwas/home) (source: GRCh38) - Publicly available NHGRI-EBI Catalog of human genome-wide association studies from GWAS Catalog is used by LDlink's LDtrait module to search if variants have previously been associated with a trait or disease. GRCh37 support is added via [dbSNP](https://www.ncbi.nlm.nih.gov/snp/). ## Calculations LDlink modules report the following measures of linkage disequilibrium: D prime, R squared, and goodness-of-fit statistics. Below is a brief description of each measure. **D prime (D')** - an indicator of allelic segregation for two genetic variants. D' values range from 0 to 1 with higher values indicating tight linkage of alleles. A D' value of 0 indicates no linkage of alleles. A D' value of 1 indicates at least one expected haplotype combination is not observed. **R squared (R2)** - a measure of correlation of alleles for two genetic variants. R2 values range from 0 to 1 with higher values indicating a higher degree of correlation. An R2 value of 0 indicates alleles are independent, whereas an R2 value of 1 indicates an allele of one variant perfectly predicts an allele of another variant. R2 is sensitive to allele frequency. **Goodness of Fit (X2 and p-value)** - statistical test testing whether observed haplotype counts follow frequencies expected from variant allele frequencies. High chi-square statistics and low p-values are evidence that haplotype counts deviate from expected values and suggest linkage disequilibrium may be present. ## Installation - The release version of *LDlinkR* can be installed from CRAN: ```{r eval=FALSE} install.packages("LDlinkR") ``` - The development version of the *LDlinkR* package can be installed from the GitHub repository by using the *remotes* package: ```{r eval=FALSE} install.packages("remotes") remotes::install_github("CBIIT/LDlinkR") ``` *LDlinkR* depends on the following packages: - *utils*, version 3.4.2 or later - *httr*, version 1.4.0 or later Following installation, attach the *LDlinkR* package with: ```{r eval=FALSE} library(LDlinkR) ``` ## Personal Access Token - Required In order to access the LDlink API via *LDlinkR*, we use a personal access token. This is a common convention followed by many APIs and emulates the more familiar HTTPS username/password or SSH keys. You will need to: - Make a one-time request for your personal access token from a web browser at . - Once registered, your personal access token will be emailed to you. It is a string of 12 random letters and numbers. - Provide your token as an argument when using *LDlinkR*. See example below: ```{r, eval=FALSE} LDhap(snps = c("rs3", "rs4", "rs148890987"), pop = "YRI", token = "YourTokenHere123") ``` **Optional:**
However, the best security practice is to store your personal access token as an environment variable where *LDlinkR* can find it and use it on your behalf but where it will not be accidentally shared with the public. **Note:** Modifying R startup files (such as the `.Renviron`) is for the advanced R user only. Modification of these files in the wrong way could cause problems. Please proceed cautiously. Step-by-step instructions follow: After retrieving your personal access token from your email, put your token in your `.Renviron` file. `.Renviron` is a hidden file that lives in your home directory. The easiest way to both find and edit the `.Renviron` file is with a function from the *usethis* package. From the R console, do: ```{r eval=FALSE} usethis::edit_r_environ() ``` Your `.Renviron` file should open in your editor. Add a line that looks like this: ```{r eval=FALSE} LDLINK_TOKEN=YourTokenHere123 ``` **Important**, ensure you put a line break at the end by hitting the *enter/return* key. Save and close the `.Renviron` file. Restart R, as environment variables are only loaded from `.Renviron` at the start of a new R session. Now, check to see that your token is available by entering: ```{r eval = FALSE} Sys.getenv("LDLINK_TOKEN") ``` ## [1] "YourTokenHere123" You should see your personal access token print to the screen, as shown above. Now, *LDlinkR* function calls that use ```{r eval=FALSE} Sys.getenv("LDLINK_TOKEN") ``` for the `token` argument in *LDlinkR* function calls will use your personal access token in a private and secure way. This method will be used in the extended examples that follow. ------------------------------------------------------------------------ ## Functions and Examples ### LDexpress #### Function ```{r eval=FALSE} LDexpress(snps, pop = "CEU", tissue = "ALL", r2d = "r2", r2d_threshold = 0.1, p_threshold = 0.1, win_size = 500000, genome_build = "grch37", token = NULL, file = FALSE ) ``` Search if a list of genomic variants (or variants in LD with those variants) is associated with gene expression in tissues of interest. Quantitative trait loci data is downloaded from the [GTEx Portal](https://gtexportal.org/home/). #### Arguments - `snps`, between 1 - 10 variants, using an rsID or chromosome coordinate (e.g. "chr7:24966446") - `pop`, a 1000 Genomes Project population, (e.g. YRI or CEU), multiple allowed, default = "CEU". See the *list_pop* function in the utilities section below for available human populations and their abbreviation codes. - `tissue`, select from 1 - 54 non-diseased tissue sites collected for the GTEx project, multiple allowed. Acceptable user input is taken either from "tissue_name_ldexpress" or "tissue_abbrev_ldexpress" (tissue abbreviation) code listed in available GTEx tissue sites using the *list_gtex_tissues* function (e.g. "ADI_SUB" for Adipose Subcutaneous). Input is case sensitive. Default = "ALL" for all available tissue types. - `r2d`, Select either "r2" for LD R^2^ (R-squared) or "d" for LD D', default = "r2". - `r2d_threshold`, R-squared or D' (depends on 'r2d' user input parameter) threshold for LD filtering. Any variants within -/+ of the specified genomic window and R^2^ or D' less than the threshold will be removed. Value needs to be in the range 0 to 1. Default value is 0.1. - `p_threshold`, define the eQTL significance threshold used for returning query results. Default value is 0.1 which returns all GTEx eQTL associations with P-value less than 0.1. - `win_size`, set genomic base pair window size for LD calculation. Specify a value greater than or equal to zero and less than or equal to 1000000 base pairs (bp). Default value is -/+ 500000 bp. - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE. #### Usage: Single query variant, multiple populations, multiple tissue types using tissue abbreviation ```{r eval=FALSE} my_output <- LDexpress(snps = "rs4", pop = c("YRI", "CEU"), tissue = c("ADI_SUB", "ADI_VIS_OME"), win_size = "500000", token = Sys.getenv("LDLINK_TOKEN") ) ``` In the above example, output is a data frame stored in the variable `my_output`. See below. ```{r eval=FALSE} head(my_output) ``` ## Query RS_ID Position_grch37 R2 D' Gene_Symbol Gencode_ID ## 1 rs4 rs10637519 chr13:32430479 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## 2 rs4 rs10637519 chr13:32430479 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## 3 rs4 rs473641 chr13:32431244 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## 4 rs4 rs473641 chr13:32431244 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## 5 rs4 rs671746 chr13:32431263 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## 6 rs4 rs671746 chr13:32431263 0.174249321651574 0.965976331360947 RP1-257C22.2 ENSG00000279314.1 ## Tissue Non_effect_Allele_Freq Effect_Allele_Freq Effect_Size P_value ## 1 Adipose - Visceral (Omentum) G=0.565 GTC=0.435 0.207161 1.0227e-05 ## 2 Adipose - Subcutaneous G=0.565 GTC=0.435 0.225642 2.2578e-07 ## 3 Adipose - Visceral (Omentum) A=0.565 G=0.435 0.207161 1.0227e-05 ## 4 Adipose - Subcutaneous A=0.565 G=0.435 0.225642 2.2578e-07 ## 5 Adipose - Visceral (Omentum) C=0.565 T=0.435 0.207161 1.0227e-05 ## 6 Adipose - Subcutaneous C=0.565 T=0.435 0.226558 1.93289e-07 #### Usage: Multiple query variants, single population, a tissue type using the full LDexpress tissue name, no spaces, and genome build GRCh38. ```{r eval=FALSE} my_output <- LDexpress(snps = c("rs345", "rs456"), pop = "YRI", tissue = "Adipose_Visceral_Omentum", genome_build = "grch38", token = Sys.getenv("LDLINK_TOKEN") ) ``` In the above example, output is a data frame stored in the variable `my_output`. See below. ```{r eval=FALSE} head(my_output) ``` ## Query RS_ID Position_grch38 R2 D' Gene_Symbol Gencode_ID ## 1 rs345 rs12877069 chr13:32430415 0.222088835534214 1 RP1-257C22.2 ENSG00000279314.1 ## 2 rs345 rs10637519 chr13:32430479 0.10989010989011 1 RP1-257C22.2 ENSG00000279314.1 ## 3 rs345 rs473641 chr13:32431244 0.10989010989011 1 RP1-257C22.2 ENSG00000279314.1 ## 4 rs345 rs671746 chr13:32431263 0.10989010989011 1 RP1-257C22.2 ENSG00000279314.1 ## 5 rs345 rs9315146 chr13:32432193 0.222088835534214 1 RP1-257C22.2 ENSG00000279314.1 ## 6 rs345 rs657190 chr13:32432232 0.107871720116618 1 RP1-257C22.2 ENSG00000279314.1 Tissue Non_effect_Allele_Freq Effect_Allele_Freq Effect_Size P_value ## 1 Adipose - Visceral (Omentum) C=0.685 T=0.315 0.355769 6.11598e-05 ## 2 Adipose - Visceral (Omentum) G=0.519 GTC=0.481 0.207161 1.0227e-05 ## 3 Adipose - Visceral (Omentum) A=0.519 G=0.481 0.207161 1.0227e-05 ## 4 Adipose - Visceral (Omentum) C=0.519 T=0.481 0.207161 1.0227e-05 ## 5 Adipose - Visceral (Omentum) A=0.685 G=0.315 0.276884 2.20517e-08 ## 6 Adipose - Visceral (Omentum) T=0.514 C=0.486 0.207916 9.95318e-06 ------------------------------------------------------------------------ ### LDhap #### Function LDhap(snps, pop = "CEU", token = NULL, file = FALSE, table_type = "haplotype", genome_build = "grch37" ) Calculates population specific haplotype frequencies of all haplotypes observed for a list of query variants. Input is a list of variant RS numbers (concatenated list) and a population group. #### Arguments - `snps`, a list of between 1 - 30 variants, using an rsID or chromosome coordinate (e.g. "chr7:24966446") - `pop`, a 1000 Genomes Project population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE - `table_type`, Choose from one of four options available to determine output format type...`haplotype`, `variant`, `both` and `merged`. Default = "haplotype". - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: Multiple query variants, single population, and genome build GRCh38 High Coverage ```{r eval=FALSE} LDhap(snps = c("rs3", "rs4", "rs148890987"), pop = "CEU", token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38_high_coverage" ) ``` ## rs3 rs4 rs148890987 Count Frequency ## 1 C A C 183 0.9242 ## 2 T G C 15 0.0758
#### Usage: Multiple query variants, multiple populations using default genome build GRCh37 (hg19) ```{r eval=FALSE} LDhap(snps = c("rs3", "rs4", "rs148890987"), pop = c("YRI", "CEU"), token = Sys.getenv("LDLINK_TOKEN") ) ``` ## rs3 rs4 rs148890987 Count Frequency ## 1 C A C 355 0.8575 ## 2 T G C 41 0.099 ## 3 T G T 11 0.0266 ## 4 C A T 7 0.0169 Output is a table of alleles, haplotype count and haplotype frequencies.
#### Usage: Multiple query variants, single population, 'merged' output format type. ```{r eval=FALSE} LDhap(snps = c("rs660670", "rs556780", "rs355", "rs356", "rs542746"), pop = "CEU", token = Sys.getenv("LDLINK_TOKEN"), table_type = "merged", genome_build = "grch38" ) ``` ## RS_Number Position_grch38 Allele_Frequency Haplotypes ## 1 rs660670 chr13:31863887 A=0.924, G=0.076 A G ## 2 rs556780 chr13:31863023 G=0.924, A=0.076 G A ## 3 rs355 chr13:31883842 A=0.924, G=0.076 A G ## 4 rs356 chr13:31884663 T=0.924, A=0.076 T A ## 5 rs542746 chr13:31860055 G=0.924, A=0.076 G A ## 6 Haplotype_Count 183 15 ## 7 Haplotype_Frequency 0.9242 0.0758 Output is a table with query variants, genomic position GRCH38 (hg38), etc.
#### Usage: Multiple query variants, single population, 'both' output format type. ```{r eval=FALSE} LDhap(snps = c("rs660670", "rs556780", "rs355", "rs356", "rs542746"), pop = "CEU", token = Sys.getenv("LDLINK_TOKEN"), table_type = "both" ) ``` ## [[1]] ## RS_Number Position_grch37 Allele_Frequency ## 1 rs660670 chr13:32438024 A=0.924, G=0.076 ## 2 rs556780 chr13:32437160 G=0.924, A=0.076 ## 3 rs355 chr13:32457979 A=0.924, G=0.076 ## 4 rs356 chr13:32458800 T=0.924, A=0.076 ## 5 rs542746 chr13:32434192 G=0.924, A=0.076 ## ## [[2]] ## rs660670 rs556780 rs355 rs356 rs542746 Count Frequency ## 1 A G A T G 183 0.9242 ## 2 G A G A A 15 0.0758 Output is a list that contains both the 'variant' and 'haplotype' output format types. ------------------------------------------------------------------------ ### LDmatrix #### Function LDmatrix(snps, pop = "CEU", r2d = "r2", token = NULL, file = FALSE, genome_build = "grch37" ) Generates a data frame of pairwise linkage disequilibrium statistics. Input is a list of between 2 to 2500 variants. Desired output can be based on estimates of R^2^ or D'. #### Arguments - `snps`, list of between 2 - 2500 variants, using an rsID or chromosome coordinate (GRCh37/hg19) (e.g. "chr7:24966446") - `pop`, a 1000 Genomes Project population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `r2d`, use either "r2" for pairwise R^2^ statistics or "d" for pairwise D' statistics - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: Multiple query variants, single population, R^2^ and genome build GRCh38 (hg38). ```{r eval=FALSE} LDmatrix(snps = c("rs496202", "rs11147477", "rs201578600"), pop = "YRI", r2d = "r2", token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38" ) ``` ## RS_number rs496202 rs11147477 rs201578600 ## 1 rs496202 1.000 0.503 0.659 ## 2 rs11147477 0.503 1.000 0.786 ## 3 rs201578600 0.659 0.786 1.000
#### Usage: Multiple query variants (mixed use of rsID's & genomic coordinates), multiple populations, D' ```{r eval=FALSE} LDmatrix(snps = c("chr13:32444611", "rs11147477", "rs201578600"), pop = c("YRI", "CEU"), r2d = "d", token = Sys.getenv("LDLINK_TOKEN") ) ``` ## RS_number rs496202 rs11147477 rs201578600 ## 1 rs496202 1.000 0.738 0.973 ## 2 rs11147477 0.738 1.000 0.971 ## 3 rs201578600 0.973 0.971 1.000
#### Usage: Multiple query variants read from text file, multiple populations, D' ```{r} my_variants <- read.table("variant_list.txt") my_variants ``` Then, call *LDmatrix* with: ```{r eval=FALSE} LDmatrix(snps = my_variants[,1], pop = c("YRI", "CEU"), r2d = "d", token = Sys.getenv("LDLINK_TOKEN") ) ``` ## RS_number rs456 rs114 rs127 rs7805287 rs60676332 rs10239961 ## 1 rs456 1.000 0.963 0.929 0.789 0.151 1.000 ## 2 rs114 0.963 1.000 0.886 0.710 0.148 0.459 ## 3 rs127 0.929 0.886 1.000 0.818 0.180 0.912 ## 4 rs7805287 0.789 0.710 0.818 1.000 0.094 0.464 ## 5 rs60676332 0.151 0.148 0.180 0.094 1.000 0.363 ## 6 rs10239961 1.000 0.459 0.912 0.464 0.363 1.000 Output is a table with rows and columns equal to the number of query variants and pairwise linkage disequilibrium statistics. ------------------------------------------------------------------------ ### LDpair #### Function LDpair(var1, var2, pop = "CEU", token = NULL, output = "table", file = FALSE, genome_build = "grch37" ) Investigates potentially correlated alleles for a pair of variants. Input is two query variants and a 1000 Genomes Project reference population(s) of interest. #### Arguments - `var1`, the first RS number (rsID) or genomic coordinate (GRCh37/hg19) (e.g. "chr7:24966446"), must match a bi-allelic variant - `var2`, the second RS number or genomic coordinate, as above, must match a bi-allelic variant - `pop`, a 1000 Genomes Project reference population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `output`, two output format options are available, "text", which displays a two-by-two matrix displaying haplotype counts and allele frequencies along with other statistics, or "table", which displays the same data in rows and columns, default = "table" - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: With `output` argument set to "text" and genome build GRCh38 (hg38) ```{r eval=FALSE} LDpair(var1 = "rs496202", var2 = "rs11147477", pop = "YRI", token = Sys.getenv("LDLINK_TOKEN"), output = "text", genome_build = "grch38" ) ``` ## Query SNPs: ## rs496202 (chr13:32444611) ## rs11147477 (chr13:32509120) ## ## YRI Haplotypes: ## rs11147477 ## C T ## ----------------- ## C | 11 | 26 | 37 (0.171) ## rs496202 ----------------- ## G | 173 | 6 | 179 (0.829) ## ----------------- ## 184 32 216 ## (0.852) (0.148) ## ## G_C: 173 (0.801) ## C_T: 26 (0.12) ## C_C: 11 (0.051) ## G_T: 6 (0.028) ## ## D': 0.7737 ## R2: 0.5037 ## Chi-sq: 108.8005 ## p-value: <0.0001 ## ## rs496202(C) allele is correlated with rs11147477(T) allele ## rs496202(G) allele is correlated with rs11147477(C) allele
#### Usage: With no `output` argument option specified, using default "table". ```{r eval=FALSE} LDpair(var1 = "rs496202", var2 = "rs11147477", pop = "YRI", token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38" ) ``` ## var1 var2 pops var1_pos var2_pos var1_a1 var1_a2 var1_a1_freq var1_a2_freq var2_a1 ## 1 rs496202 rs11147477 YRI chr13:31870474 chr13:31934983 C G 0.173 0.827 C ## var2_a2 var2_a1_freq var2_a2_freq d_prime r2 chisq p_val ## 1 T 0.85 0.15 0.7733 0.503 107.638 1e-04 ## corr_alleles ## 1 rs496202(C)-rs11147477(T), rs496202(G)-rs11147477(C) Output of the `output` argument "text" option is a two-by-two contingency table displaying haplotype counts and allele frequencies of the two query variants. Also displayed are calculated metrics of linkage disequilibrium including: D prime (D'), R square (R^2^), and goodness-of-fit (Chi-square and p-value). Goodness-of-fit tests for deviations of expected haplotype frequencies based on allele frequencies. Correlated alleles are reported if linkage disequilibrium is present (R^2^ \> 0.1). If linkage equilibrium, no alleles are reported. Output from the `output` argument "table" option converts the data from the two-by-two contingency table into a data frame. ------------------------------------------------------------------------ ### LDpop #### Function LDpop(var1, var2, pop = "CEU", r2d = "r2", token = NULL, file = FALSE, genome_build = "grch37" ) Investigates allele frequencies and linkage disequilibrium patterns across 1000G populations. #### Arguments - `var1`, the first RS number (rsID) or genomic coordinate (GRCh37/hg19) (e.g. "chr7:24966446"), must match a bi-allelic variant - `var2`, the second RS number or genomic coordinate, as above, must match a bi-allelic variant - `pop`, a 1000 Genomes Project reference population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `r2d`, use "r2" if desired output is based on estimated R^2^ or "d" if D' - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage ```{r eval=FALSE} LDpop(var1 = "rs496202", var2 = "rs11147477", pop = "YRI", r2d = "r2", token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38_high_coverage" ) ``` ## Population Abbrev N rs496202_Allele_Freq rs11147477_Allele_Freq R2 D' Chisq P ## 1 Yoruba in Ibadan, Nigeria YRI 108 G: 82.87%, C: 17.13% C: 85.19%, T: 14.81% 0.5037 0.7737 108.8005 0 ------------------------------------------------------------------------ ### LDproxy #### Function LDproxy(snp, pop = "CEU", r2d = "r2", token = NULL, file = FALSE, genome_build = "grch37", win_size = "500000" ) Explore proxy and putative functional variants for a single query variant. Input is a single RS number and a population group. Depending on the number of query populations, this function could take some time to run. #### Arguments - `snp`, an RS number (rsID) or chromosome coordinate (GRCh37/hg19) (e.g. "chr7:24966446"), one per query, RS number must match a bi-allelic variant - `pop`, a 1000 Genomes Project reference population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `r2d`, use "r2" if desired output is based on estimated R^2^ or "d" if D' - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). - `win_size` Set base pair (bp) window size. Specify a value greater than or equal to zero and less than or equal to 1,000,000bp. Default value is 500,000bp. #### Usage: single reference population and default genome build GRCh37 (hg19). ```{r eval=FALSE} my_proxies <- LDproxy(snp = "rs456", pop = "YRI", r2d = "r2", token = Sys.getenv("LDLINK_TOKEN") ) ``` Output is a data frame stored in the variable `my_proxies` with 2455 rows and 10 columns with data. ```{r eval=FALSE} head(my_proxies) ``` ## RS_Number Coord Alleles MAF Distance Dprime R2 ## 1 rs456 chr7:24962419 (G/C) 0.1944 0 1 1.0000 ## 2 rs457 chr7:24962426 (T/C) 0.1944 7 1 1.0000 ## 3 rs28475742 chr7:24964633 (G/T) 0.1944 2214 1 1.0000 ## 4 rs123 chr7:24966446 (C/A) 0.1944 4027 1 1.0000 ## 5 rs125 chr7:24959703 (C/T) 0.2037 -2716 1 0.9436 ## 6 rs128 chr7:24958977 (C/T) 0.2037 -3442 1 0.9436 ## Correlated_Alleles RegulomeDB Function ## 1 G=G,C=C 4 ## 2 G=T,C=C 2b ## 3 G=G,C=T 4 ## 4 G=C,C=A 1f ## 5 G=C,C=T 3a ## 6 G=C,C=T 7 Includes information on all variants within the specified window size of the query variant with a pairwise R^2^ value greater than 0.01. ------------------------------------------------------------------------ ### LDproxy_batch #### Function LDproxy_batch(snp, pop = "CEU", r2d = "r2", token = NULL, append = FALSE, genome_build = "grch37", win_size = "500000" ) Query *LDproxy* using a list of query variants. *LDproxy_batch* will make sequential queries, one query per variant. Concurrent queries are not permitted by the LDlink API. Output is saved as text file(s) to the current working directory. Depending on the number of query variants and reference populations selected, this function could take some time to run. #### Arguments - `snp`, a character string or data frame listing RS numbers (rsID) or chromosome coordinates (GRCh37/hg19) (e.g. "chr7:24966446"), one per line. - `pop`, a 1000 Genomes Project reference population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `r2d`, use "r2" if desired output is based on estimated R^2^ or "d" if D' - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `append`, a logical, if TRUE, output for each query variant is appended to a single text file and saved to the current working directory. If FALSE, output for each query variant is saved in its own text file with the query variant as the filename. Default value is FALSE. - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). - `win_size` Set base pair (bp) window size. Specify a value greater than or equal to zero and less than or equal to 1,000,000bp. Default value is 500,000bp. #### Usage: multiple variants, default `pop` and `r2d` The list of query variants passed to *LDproxy_batch* can be stored as a character string. ```{r eval=FALSE} LDproxy_batch(snp = c("rs456", "rs114", "rs127"), token = Sys.getenv("LDLINK_TOKEN") ) ``` Or, a longer list of variants can be read into a data frame from a text file and passed into *LDproxy_batch*. The list should be in a simple text file, one query variant per line. For example: ```{r} my_variants <- read.table("variant_list.txt") my_variants ``` Then, call *LDproxy_batch* with: ```{r eval=FALSE} LDproxy_batch(snp = my_variants, token = Sys.getenv("LDLINK_TOKEN") ) ``` Output not displayed. All output from *LDproxy_batch* is saved to a text file(s) in the current working directory. ------------------------------------------------------------------------ ### LDtrait #### Function ```{r eval=FALSE} LDtrait(snps, pop = "CEU", r2d = "r2", r2d_threshold = 0.1, win_size = 500000, token = NULL, file = FALSE, genome_build = "grch37" ) ``` Search if a list of variants (or variants in LD with those variants) have been previously associated with a trait or disease. Trait and disease data is updated nightly from the [GWAS Catalog](https://www.ebi.ac.uk/gwas/docs/file-downloads). #### Arguments - `snps`, between 1 - 50 variants, using an rsID or chromosome coordinate (GRCh37)(e.g. "chr7:24966446"). All input variants must match a bi-allelic variant. - `pop`, a 1000 Genomes Project population, (e.g. YRI or CEU), multiple allowed, default = "CEU". See the *list_pop* function in the utilities section below for available human populations and their abbreviation codes. - `r2d`, use "r2" to filter desired output from a threshold based on estimated LD R2 (R squared) or "d" for LD D' (D-prime), default = "r2". - `r2d_threshold`, R-squared or D' (depends on 'r2d' user input parameter) threshold for LD filtering. Any variants within -/+ of the specified genomic window and R^2^ or D' less than the threshold will be removed. Value needs to be in the range 0 to 1. Default value is 0.1. - `win_size`, set genomic base pair window size for LD calculation. Specify a value greater than or equal to zero and less than or equal to 1000000 base pairs (bp). Default value is -/+ 500000 bp. - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess). - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE. - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: Single query variant, multiple reference populations and genome build GRCh38(hg38) ```{r eval=FALSE} LDtrait(snps = "rs456", pop = c("YRI", "CEU"), token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38" ) ``` The following is the output from the above function call. ## Query GWAS_Trait RS_Number Position_GRCh38 Alleles ## 1 rs456 Highest math class taken (MTAG) rs10248878 chr7:24869118 C=0.175, T=0.825 ## 2 rs456 Educational attainment (MTAG) rs457 chr7:24922807 C=0.697, T=0.303 ## R2 D' Risk_Allele Effect_Size_95_CI Beta_or_OR P_value ## 1 0.41175133337888 0.920247773906311 0.5967 0.0104 0.0071-0.0137 7e-10 ## 2 1 1 0.4495 0.0072 0.0047-0.0097 4e-08 #### Usage: Multiple query variants, multiple reference populations and *win_size* set to 750000 base pairs (bp), default genome build GRCh37(hg19). ```{r eval=FALSE} LDtrait(snps = c("rs114", "rs496202", "rs345"), pop = c("YRI", "CHB", "CEU"), win_size = "750000", token = Sys.getenv("LDLINK_TOKEN") ) ``` Output of the above function is below. ## Query GWAS_Trait RS_Number ## 1 rs114 Highest math class taken (MTAG) rs10248878 ## 2 rs114 Educational attainment (MTAG) rs457 ## 3 rs496202 Refractive error rs353 ## 4 rs345 DNA methylation variation (age effect) rs203425 ## 5 rs345 Facial morphology (factor 14, intercanthal width) rs799522 ## Position_GRCh37 Alleles R2 D' ## 1 chr7:24908737 C=0.123, T=0.877 0.200231693692643 0.897255733792921 ## 2 chr7:24962426 C=0.748, T=0.252 0.56312684849231 0.969967060647161 ## 3 chr13:32454349 A=0.902, G=0.098 1 1 ## 4 chr13:32468087 A=0.074, T=0.926 0.954994192799071 1 ## 5 chr13:32514028 C=0.769, T=0.231 0.236284178064096 0.918763102725367 ## Risk_Allele Effect_Size_95_CI Beta_or_OR P_value ## 1 0.5967 0.0104 0.0071-0.0137 7e-10 ## 2 0.4495 0.0072 0.0047-0.0097 4e-08 ## 3 1e-12 ## 4 NR 2e-08 ## 5 0.1263 0.2157 0.12-0.31 6e-06 ------------------------------------------------------------------------ ### SNPchip #### Function SNPchip(snps, chip = "ALL", token = NULL, file = FALSE, genome_build = "grch37" ) Used to find commercial genotyping chip arrays for variants. Input is a list of between 1 - 5000 variants (one per line) and desired commercial chip arrays to search. Input variants do not need to be on the same chromosome. #### Arguments - `snps`, between 1 - 5,000 variants, using an rsID or chromosome coordinate (e.g. "chr7:24966446") - `chip`, chip or arrays, platform code(s) for a SNP chip array, ALL_Illumina, ALL_Affy or ALL, default=ALL, use the `list_chips` utility (see below) to lookup available commercial SNP chip arrays and their codes. - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE. - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: Multiple variants, search "ALL" available chip arrays ```{r eval=FALSE} SNPchip(snps = c("rs3", "rs4", "rs148890987"), chip = "ALL", token = Sys.getenv("LDLINK_TOKEN") ) ``` ## WARNING: The following RS number did not have any platforms found: rs148890987, rs3. ## RS_Number Position_GRCh37 A_SNP5.0 A_CHB2 A_250S A_SNP6.0 ## 1 rs148890987 chr13:32403784 0 0 0 0 ## 2 rs3 chr13:32446842 0 0 0 0 ## 3 rs4 chr13:32447222 1 1 1 1
#### Usage: Multiple variants, search two Affymetrix arrays, and genome build GRCh38 (hg38) ```{r eval=FALSE} SNPchip(snps = c("rs3", "rs4", "rs148890987"), chip = c("A_SNP5.0", "A_CHB2"), token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch38" ) ``` ## WARNING: The following RS number did not have any platforms found: rs148890987, rs3. ## RS_Number Position_GRCh38 A_SNP5.0 A_CHB2 ## 1 rs148890987 13:31829647 0 0 ## 2 rs3 13:31872705 0 0 ## 3 rs4 13:31873085 1 1
#### Usage: Multiple variants, search all available Affymetrix arrays using, "ALL_Affy" ```{r eval=FALSE} SNPchip(snps = c("rs3", "rs4", "rs148890987"), chip = "ALL_Affy", token = Sys.getenv("LDLINK_TOKEN") ) ``` ## WARNING: The following RS number did not have any platforms found: rs148890987, rs3. ## RS_Number Position_GRCh37 A_SNP5.0 A_CHB2 A_250S A_SNP6.0 ## 1 rs148890987 chr13:32403784 0 0 0 0 ## 2 rs3 chr13:32446842 0 0 0 0 ## 3 rs4 chr13:32447222 1 1 1 1 Output is a data frame of query variant rows (RS number), genomic coordinate (GRCh37) and genotyping chip array columns. The presence of a "1" designates the variant is present on the respective commercial genotyping array and a "0" indicates that it is not present on the genotyping array. ------------------------------------------------------------------------ ### SNPclip #### Function SNPclip(snps, pop = "CEU", r2_threshold = "0.1", maf_threshold = "0.01", token = NULL, file = FALSE, genome_build = "grch37" ) Prune a list of variants by linkage disequilibrium. Input is a list of variant RS numbers (one per line) and a population group. #### Arguments - `snps`, a list of between 1 - 5,000 variants, using an RS number (rsID) or chromosome coordinate (GRCh37) (e.g. "chr7:24966446"). All input variants must be on the same chromosome and match a bi-allelic variant. - `pop`, a 1000 Genomes Project reference population, uses three letter population code, (e.g. YRI or CEU), multiple allowed, default = "CEU" - `r2_threshold`, Used to set the R^2^ threshold for LD pruning. One of each pair of variants with a R^2^ greater than the threshold is removed. Value needs to be in the range 0 to 1. Default value is 0.1. - `maf_threshold`, Used to set minor allele frequency (MAF) threshold for LD pruning. Variants with a MAF less than or equal to the threshold are removed. Value needs to be in the range 0 to 1. Default value is 0.01. - `token`, LDlink provided user access token is required, default = NULL, register for a free token on the [LDlink web site.](https://ldlink.nih.gov/?tab=apiaccess) - `file`, optional character string naming a path and file for saving results. If file = FALSE, no file will be generated, default = FALSE. - `genome_build` Choose between one of three options...`grch37` for genome build GRCh37 (hg19), `grch38` for GRCh38 (hg38), or `grch38_high_coverage` for GRCh38 High Coverage (hg38) 1000 Genome Project data sets. Default is GRCh37 (hg19). #### Usage: Multiple Variants ```{r eval=FALSE} SNPclip(snps = c("rs3", "rs4", "rs148890987", "rs115955931"), pop = "YRI", r2_threshold = "0.1", maf_threshold = "0.01", token = Sys.getenv("LDLINK_TOKEN"), genome_build = "grch37" ) ``` ## RS_Number Position Alleles ## 1 rs3 chr13:32446842 C=0.829, T=0.171 ## 2 rs4 chr13:32447222 A=0.829, G=0.171 ## 3 rs148890987 chr13:32403784 C=1.0, T=0.0 ## 4 rs115955931 chr13:32130008 G=0.954, A=0.046 ## Details ## 1 Variant kept. ## 2 Variant in LD with rs3 (R2=1.0), variant removed. ## 3 Variant MAF is 0.0, variant removed. ## 4 Variant kept. The output table provides details including query variant RS number, genomic position, alleles, and and details about whether the variant was kept or removed. ------------------------------------------------------------------------ ## Utilities and Examples ### list_chips #### Function list_chips() Provides a data frame listing the names and abbreviation codes for available commercial SNP Chip Arrays from Illumina and Affymetrix. #### Usage ```{r eval=FALSE} list_chips() ``` ------------------------------------------------------------------------ ### list_pop #### Function list_pop() Provides a data frame listing the available reference populations from the 1000 Genomes Project, continental or super-populations (e.g. European, African, Admixed American) and sub-populations (e.g Finnish, Gambian, Peruvian) #### Usage ```{r eval=FALSE} list_pop() ``` ------------------------------------------------------------------------ ### list_gtex_tissues #### Function list_gtex_tissues() Provides a data frame listing the GTEx full names, `LDexpress` full names (without spaces) and acceptable abbreviation codes of the 54 non-diseased tissue sites collected for the [GTEx Portal](https://gtexportal.org/home/) and used as input for the `LDexpress` function. #### Usage ```{r, echo=FALSE} options(width = 100) ``` ```{r eval=FALSE} list_gtex_tissues() ``` ------------------------------------------------------------------------ ## FAQs (Frequently Asked Questions) 1. What if my access token doesn't work? - Please double check that the token was typed accurately. Then, ensure the format of the function call is correct. For example, if your alphanumeric access token is: 123abc456789, then, use it as:\ ```{r eval = FALSE} df <- LDproxy(snp = "rs456", pop = "YRI", token = "123abc456789") ```
| If you still can not solve the problem, please email us at [NCILDlinkWebAdmin\@mail.nih.gov](mailto:NCILDlinkWebAdmin@mail.nih.gov){.email}.
2. Can I set a threshold or cut-off value for R^2^ or D\` values? - No. *LDlinkR* functions do not include 'threshold' as an argument. However, the returned data object can be subset using base R. For example: ```{r eval=FALSE} df <- LDproxy("rs12027135", pop = "CEU",r2d = "r2", token = "YourTokenHere123") new_df <- subset(df, R2 >= 0.8) ```
3. I need to upload hundreds of variants from a text file into LDmatrix. Why do I get an error with the following code? ```{r eval=FALSE} test <- read.table("variant_list.txt", header = FALSE) LDmatrix(snps = test, pop = "CEU", r2d = "r2", token = "YourTokenHere123") ``` `Error in LDmatrix(snps = test, pop = "CEU", r2d = "r2", token = "YourTokenHere123"), : Input is between 2 to 1000 variants.`
- In the above example, 'test' is an object of class data frame and the variants are in the first column. The first column needs to be specified in order to avoid the error. The example below will yield the desired result: ```{r eval=FALSE} test <- read.table("variant_list.txt", header = FALSE) LDmatrix(snps = test[,1], pop = "CEU", r2d = "r2", token = "YourTokenHere123") ``` ``` ## RS_number rs60676332 rs7805287 rs127 rs456 rs10239961 rs114 ## 1 rs60676332 1.000 0.008 0.013 0.017 0.286 0.039 ## 2 rs7805287 0.008 1.000 0.980 0.882 0.170 0.614 ## 3 rs127 0.013 0.980 1.000 0.900 0.167 0.632 ## 4 rs456 0.017 0.882 0.900 1.000 0.177 0.722 ## 5 rs10239961 0.286 0.170 0.167 0.177 1.000 0.008 ## 6 rs114 0.039 0.614 0.632 0.722 0.008 1.000 ```
4. What genome build does LDlink use for genomic coordinates? - All genomic coordinates are based on GRCh37/hg19.
5. How can I ask for help? - If you find that you can't answer a question or solve a problem yourself, you can email us at [NCILDlinkWebAdmin\@mail.nih.gov](mailto:NCILDlinkWebAdmin@mail.nih.gov){.email} or open an issue at the *LDlinkR* GitHub repository (). ## Session Information ```{r} sessionInfo() ```