---
title: "GENESIS tutorial"
output:
  html_document:
    toc: yes
link-citations: yes
---

# GDS format

GDS is Genomic Data Structure, a storage format that can efficiently store genomic data and provide fast random access to subsets of the data. For more information on GDS for sequence data, read the [SeqArray package vignette](https://github.com/zhengxwen/SeqArray/blob/master/vignettes/SeqArrayTutorial.Rmd).

## Convert a VCF to GDS

To use the R packages developed at the University of Washington Genetic Analysis Center for sequence data, we first need to convert a VCF file to GDS. (If the file is BCF, use [bcftools](https://samtools.github.io/bcftools/bcftools.html) to convert to VCF.)

```{r vcf2gds, message=FALSE}
library(SeqArray)
# VCF file should be stored in sbgenomics/workspace directory
vcffile <- "AnalysisFiles/1KG_phase3_subset_chr1.vcf.gz"
gdsfile <- "AnalysisFiles/1KG_phase3_subset_chr1.gds"

# Convert VCF to GDS
seqVCF2GDS(vcffile, gdsfile, fmt.import="GT", storage.option="LZMA_RA")
```

## Exploring a GDS file

We can interact with the GDS file using the SeqArray package.

```{r}
gds <- seqOpen(gdsfile)
gds
```

The `seqGetData` function is the basic function for reading data from a GDS file.

```{r seqGetData}
# the unique sample identifier comes from the VCF header
sample.id <- seqGetData(gds, "sample.id")
length(sample.id)
head(sample.id)
```

```{r}
# a unique integer ID is assigned to each variant
variant.id <- seqGetData(gds, "variant.id")
length(variant.id)
head(variant.id)
```

```{r}
chr <- seqGetData(gds, "chromosome")
head(chr)

pos <- seqGetData(gds, "position")
head(pos)

id <- seqGetData(gds, "annotation/id")
head(id)
```

There are additional useful functions for summary level data.

```{r ref_freq}
# reference allele frequency of each variant
afreq <- seqAlleleFreq(gds)
head(afreq)
summary(afreq)
```

```{r}
hist(afreq, breaks=50)
```

We can define a filter on the `gds` object. After using the `seqSetFilter` command, all subsequent reads from the `gds` object are restricted to the selected subset of data, until a new filter is defined or `seqResetFilter` is called.

```{r filter}
seqSetFilter(gds, variant.id=91:100, sample.id=sample.id[1:5])
```

Genotype data is stored in a 3-dimensional array, where the first dimension is always length 2 for diploid genotypes. The second and third dimensions are samples and variants, respectively. The values of the array denote alleles: `0` is the reference allele and `1` is the alternate allele. For multiallelic variants, other alternate alleles are represented as integers `> 1`.

```{r genotypes}
geno <- seqGetData(gds, "genotype")
dim(geno)
geno[,,1:2]
```
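
To make the array layout concrete, here is a minimal base R sketch on a made-up array. The object `geno_toy` and its values are hypothetical and are not read from the GDS file; the sketch only illustrates how counting alleles over the first dimension yields a sample-by-variant dosage matrix.

```r
# Toy genotype array with the same layout as seqGetData(gds, "genotype"):
# dimension 1 = allele copy (2 for diploid), 2 = sample, 3 = variant.
# All values are invented for illustration.
geno_toy <- array(
  c(0, 0,  0, 1,  1, 1,    # variant v1: samples s1, s2, s3
    0, 2,  0, 0,  1, 2),   # variant v2 (multiallelic): samples s1, s2, s3
  dim = c(2, 3, 2),
  dimnames = list(allele = NULL,
                  sample = paste0("s", 1:3),
                  variant = c("v1", "v2"))
)

# Count alleles over the first dimension to get sample x variant matrices.
ref_dosage <- colSums(geno_toy == 0)  # number of reference alleles
alt_dosage <- colSums(geno_toy != 0)  # number of non-reference alleles

ref_dosage
alt_dosage
```

For diploid genotypes without missing values, the two matrices sum to 2 in every cell.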

The [SeqVarTools package](http://bioconductor.org/packages/SeqVarTools) has some additional functions for interacting with SeqArray-format GDS files. There are functions providing more intuitive ways to read in genotypes.

```{r seqvartools_geno}
library(SeqVarTools)

# return genotypes in matrix format
getGenotype(gds)
```

```{r}
getGenotypeAlleles(gds)
```

```{r}
refDosage(gds)
altDosage(gds)
```

There are functions to extract variant information.

```{r seqvartools_varinfo}
# look at reference and alternate alleles
refChar(gds)
altChar(gds)
```

```{r}
# data.frame of variant information
variantInfo(gds)
```

```{r}
# reset the filter to all variants and samples
seqResetFilter(gds)

# how many alleles for each variant?
n <- seqNumAllele(gds)
table(n)

# some variants have more than one alternate allele
multi.allelic <- which(n > 2)
altChar(gds)[multi.allelic]
```

```{r}
# extract a particular alternate allele
# first alternate
altChar(gds, n=1)[multi.allelic]
# second alternate
altChar(gds, n=2)[multi.allelic]
```

```{r}
# how many variants are biallelic SNVs?
table(isSNV(gds, biallelic=TRUE))
# how many variants are SNVs vs INDELs?
table(isSNV(gds, biallelic=FALSE))
# 11 SNVs are multi-allelic
```

We can also return variant information as a `GRanges` object from the [GenomicRanges package](https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf). This format for representing sequence data is common across many Bioconductor packages. Chromosome is stored in the `seqnames` column. The `ranges` column has variant position, which can be a single base pair or a range.

```{r granges}
gr <- granges(gds)
gr
```

Always use the `seqClose` command to close your connection to a GDS file when you are done working with it. Trying to open an already opened GDS will result in an error.

```{r intro_close}
seqClose(gds)
```

## Exercises

1. Set a filter selecting only multi-allelic variants. Inspect their genotypes using the different methods you learned above. Use the `alleleDosage` method to find dosage for the second (and third, etc.) alternate allele.

2. Use the `hwe` function in SeqVarTools to run a Hardy-Weinberg Equilibrium test on each variant. Identify a variant with low p-value and inspect its genotypes. (Note that the HWE test is only valid for biallelic variants, and will return `NA` for multiallelic variants.)

[Solutions](./Solutions.html#gds-format)


# Phenotype harmonization

To increase your sample set, you may need to combine phenotype data from different studies in order to run a cross-study analysis.
The studies involved may have collected data in different ways, used different protocols or measurement units, or used different cutpoints to determine case status.
The process of manipulating the phenotype data from different studies so that they can be analyzed together is called "phenotype harmonization".

In this exercise, we assume that you have
created a phenotype harmonization plan for height,
sent it to members from three studies to perform the harmonization,
and
received a harmonized phenotype file from each study.
We will generate some diagnostic information about the harmonized phenotype.

The exercise uses 1000 Genomes data, with simulated phenotypes for study, age, and height.
The example phenotype files shown here are very simplified compared to how actual studies store and organize their data.

In this exercise, we will be using `dplyr` for a lot of the data manipulation, so load it now.

```{r, message = FALSE}
library(dplyr)
```


### Inspect individual study data in R

First, read the study phenotype files into R.
In this case, each file is tab-delimited.

```{r}
study_1 <- read.table("AnalysisFiles/pheno_data_study_1.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_1)

study_2 <- read.table("AnalysisFiles/pheno_data_study_2.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_2)

study_3 <- read.table("AnalysisFiles/pheno_data_study_3.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_3)
```

Look carefully at the output and see if anything looks suspicious.

You may have noticed that one of the studies has given their variables slightly different names than the others.
Rename them as appropriate.

```{r}
names(study_2)
```

```{r}
study_2 <- study_2 %>%
  rename(sex = Sex, age = Age, height = Height)
# Check that they are correct.
names(study_2)
```

You'll also want to calculate summaries of the data values to see if anything looks very different than what you expect.

```{r}
summary(study_1$height)

```

```{r}
summary(study_2$height)
```

```{r}
summary(study_3$height)
```

Here, the values that study_3 has given you don't seem to have the same range as those from study_1 and study_2.
In cases like this, you'll want to follow up with whoever provided the harmonized data to see what's going on.
It could represent an error in calculating the harmonized data values, a true property of the study (e.g., a study containing all children), or something else.
In this case, the values were measured in inches instead of centimeters, so they will need to be converted to centimeters to be compatible with the other studies.

```{r}
study_3 <- study_3 %>%
  mutate(height = height * 2.54)
```

Calculate the summary again and compare it to the other studies above.

```{r}
summary(study_3$height)
```

The corrected values look much more similar now.

Note that this sort of error is easy to correct, but it is not uncommon to have more subtle issues that need to be addressed when working with phenotype data.
Knowledge of the study design as well as the phenotype area of interest is essential to address them properly.
Additionally, different decisions may need to be made for different analyses based on the specific questions they are trying to answer.

### Compare study values

Next we will make some more direct comparisons between the three studies, so we will combine the data into one data frame.

First, add a study identifier to the data frame for organizational purposes.

```{r}
study_1$study <- "study_1"
study_2$study <- "study_2"
study_3$study <- "study_3"
```

Combine the three different study data frames into one large data frame for joint analysis.
Double check that all column names are the same.

```{r}
all.equal(names(study_1), names(study_2))
all.equal(names(study_1), names(study_3))
```

```{r, message=FALSE}
phen <- dplyr::bind_rows(study_1, study_2, study_3)
```
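
The column-name check matters because `bind_rows` does not stop when names differ; it fills the unmatched columns with `NA`. The sketch below shows one way to list mismatched columns before combining. The helper `check_columns` and the toy data frames `a` and `b` are hypothetical and not part of dplyr or the tutorial data.

```r
# Hypothetical helper: for each data frame, list the columns that appear
# in at least one of the other data frames but are missing from it.
check_columns <- function(...) {
  dfs <- list(...)
  all_cols <- Reduce(union, lapply(dfs, names))
  lapply(dfs, function(d) setdiff(all_cols, names(d)))
}

# Toy data: the second study capitalized the sex column.
a <- data.frame(subject_id = 1:2, sex = c("M", "F"), height = c(170, 160))
b <- data.frame(subject_id = 3:4, Sex = c("F", "M"), height = c(165, 180))

missing_cols <- check_columns(a, b)
missing_cols
```

An empty result for every data frame means the column sets agree.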


We can look at the distribution of phenotype data with text-based reports or with plots.

First, inspect distributions with `table` for categorical traits and with `summary` for quantitative traits.
The commands are shown here for study_1, but you should run them for study_2 and study_3 as well to see if you can spot any differences.

```{r}
table(study_1$sex)
```

```{r}
summary(study_1$age)
```

```{r}
summary(study_1$height)
```

It is also helpful to use plots to inspect the distributions of phenotype data.
Here, we will look at boxplots of height by study.

```{r height_study, message = FALSE}
library(ggplot2)
ggplot(phen, aes(x = study, y = height)) + geom_boxplot()
```

You may also want to see the difference in height when you include both study and sex:

```{r height_study_sex}
ggplot(phen, aes(x = study, fill = sex, y = height)) + geom_boxplot()
```


These diagnostics are helpful to get a feel for the data.
They can help you see if one study is vastly different from the others or detect outlier values that you may want to look into further.
Some of the differences could also be accounted for by covariates.

## Using regression models to compare studies

The quick diagnostics in the previous section let you see if the data from one study are completely different from the others, but such differences could be due to other factors that could be adjusted for in analysis.
To account for these other factors, we need to fit a statistical model to the data.
In this case, because the phenotype is quantitative, we will use a linear regression model.

We use the `GENESIS` R package for fitting the regression model.
It is also the same package that we use for the association analyses, so this exercise provides a brief introduction to the package and some of the associated data structures.

### Create an Annotated Data Frame

The first step in fitting the regression model is to create an AnnotatedDataFrame.
This data structure is provided by the Bioconductor [Biobase package](https://www.bioconductor.org/packages/release/bioc/manuals/Biobase/man/Biobase.pdf), and it contains both the data and metadata.
You should include a description of each variable in the metadata.

```{r, message=FALSE}
library(Biobase)
metadata <- data.frame(labelDescription = c(
  "subject identifier",
  "subject's sex",
  "age at measurement of height",
  "subject's height in cm",
  "study identifier"
))

annot <- AnnotatedDataFrame(phen, metadata)

# access the data with the pData() function
head(pData(annot))

```

```{r}
# access the metadata with the varMetadata() function
varMetadata(annot)
```

Save the AnnotatedDataFrame for future use.

```{r}
save(annot, file = "OutputFiles/phenotype_annotation.RData")
```

The `GENESIS` code to fit the regression model also requires a `sample.id` column.
Typically the `sample.id` column represents a sample identifier, not a subject id.
In this case, we are only working with subject-level data, so we can use the subject identifier as the sample identifier for model-fitting purposes.

```{r}
annot$sample.id <- annot$subject_id
```


### Fit a regression model without study

We will first fit a regression model that allows us to see if the mean of the height phenotype is different by study after adjusting for other covariates.
In this case, we will adjust for age and sex, but not for study, because we are interested in seeing differences in mean height by study. We use the `fitNullModel` function from the GENESIS package; the name "null model" comes from the association testing context and will be explained later.

```{r, message = FALSE}
outcome <- "height"
covars <- c("sex", "age")

library(GENESIS)
mod_1 <- GENESIS::fitNullModel(annot, outcome = outcome, covars = covars)
```

The output of `fitNullModel` is a list with a number of named elements:

```{r}
names(mod_1)
```

The elements that we will work with in this exercise are:

* `converged`: An indicator of whether the model successfully converged
* `model.matrix`: The matrix of subject-covariate values used to fit the model
* `fixef`: The fitted fixed effects
* `betaCov`: The covariance of the fitted fixed effects
* `fit`: A data frame containing information about the fit, in particular:
    * `resid.marginal`: The (marginal) residuals from the model, which have been adjusted for the fixed effects but not for the covariance structure
* `varComp`: The fitted variance components

Make sure the model converged.

```{r}
mod_1$converged
```

Now, add the residuals to the phenotype data frame for plotting.
We need to make sure that we are matching each residual value to the correct subject.
In this case, `model.matrix` is already in the same order as the input AnnotatedDataFrame, but this may not always be the case (for example, if subjects are excluded due to missing phenotype data).
The row names of the `fit` data frame identify the sample each residual belongs to.
We therefore use the base R function `match` to look up each `sample.id` in the annotated data frame among those row names, and use the resulting index to pull out the residual for the same subject.

```{r}
j <- match(annot$sample.id, rownames(mod_1$fit))
annot$residuals <- mod_1$fit$resid.marginal[j]
```
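
The behavior of `match` is easiest to see on a small example. In the sketch below, `fit_toy` and `sample_id` are made-up stand-ins for `mod_1$fit` and `annot$sample.id`; one sample was dropped from the fit and the rows are in a different order.

```r
# Toy "fit" table: rows are out of order and sample s2 is absent,
# as could happen when a subject has missing phenotype data.
fit_toy <- data.frame(resid.marginal = c(-1.5, 0.3, 2.0),
                      row.names = c("s3", "s1", "s4"))
sample_id <- c("s1", "s2", "s3", "s4")

# For each sample ID, find its row in the fit table (NA if absent).
j <- match(sample_id, rownames(fit_toy))
j

# Indexing with j returns the residuals in sample_id order,
# with NA for the sample that was not in the fit.
resid_aligned <- fit_toy$resid.marginal[j]
resid_aligned
```

Assigning `resid_aligned` as a new column then keeps each residual next to the right subject, with `NA` for excluded subjects.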

Next, we want to check if the different studies have the same mean height after adjustment for other covariates (here, age and sex).
We will first do this qualitatively by making a boxplot of the residuals by study.

```{r resid_1}
ggplot(pData(annot), aes(x = study, y = residuals)) + geom_boxplot()
```

From the boxplot, it is clear that the different studies have different mean heights, even after adjustment for sex and age.
At this point, you would need to determine if the differences are acceptable for use in a combined analysis.

### Fit a model with study

Next, we can look at a model that adjusts for other covariates as well as study.
This model allows us to run a statistical test on the fitted study means and to qualitatively check if the variances are the same after adjusting for mean effects.
The outcome is the same, but we now add the study as a covariate.
We also allow for group-specific residual variance by study using the `group.var` argument to `fitNullModel`.

```{r, results = 'hide', message = FALSE}
# include the study in the covariates
covars <- c("age", "sex", "study")

mod_2 <- GENESIS::fitNullModel(annot, outcome = outcome, covars = covars,
                            group.var = "study")
```

The `fixef` element now includes effects for study:
```{r}
mod_2$fixef
```

The regression model also shows the differences in mean height by study.
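
One way to turn the fitted study effects into a formal test is a joint Wald test of the study coefficients. The sketch below illustrates the idea with base R `lm` on simulated data; the data frame `toy`, the effect sizes, and the use of `coef`/`vcov` as stand-ins for the `fixef` and `betaCov` elements are all assumptions for illustration, not the GENESIS implementation.

```r
# Simulate a toy data set with a shift in mean height by study (made-up values).
set.seed(42)
n <- 300
toy <- data.frame(
  study = rep(c("study_1", "study_2", "study_3"), each = n / 3),
  age   = round(runif(n, 40, 70)),
  sex   = sample(c("F", "M"), n, replace = TRUE)
)
study_shift <- c(study_1 = 0, study_2 = 3, study_3 = -2)
toy$height <- 165 + 12 * (toy$sex == "M") - 0.05 * toy$age +
  unname(study_shift[toy$study]) + rnorm(n, sd = 6)

# Fit the model with study as a covariate.
fit <- lm(height ~ age + sex + study, data = toy)

# Joint Wald test that all study effects are zero:
# W = b' V^{-1} b, compared to a chi-square with length(b) degrees of freedom.
idx <- grep("^study", names(coef(fit)))
b <- coef(fit)[idx]
V <- vcov(fit)[idx, idx]
wald_stat <- drop(t(b) %*% solve(V) %*% b)
wald_p <- pchisq(wald_stat, df = length(b), lower.tail = FALSE)
c(statistic = wald_stat, p.value = wald_p)
```

A small p-value indicates that mean height differs by study after adjusting for age and sex. For an ordinary linear model, the Wald statistic divided by the number of tested coefficients equals the F statistic from comparing the models with and without study.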

Finally, we want to check if the height distributions from the different studies have the same variance.
Start by looking at the variance components (`varComp`) element of the model.

```{r}
mod_2$varComp
```

The variance components (`V_study_1`, `V_study_2`, and `V_study_3`) represent the residual variance in each study.
The fitted values of the variance components are different for the different studies, indicating that the distributions of height in the three studies have different variance even after accounting for the other covariates.
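
As a rough cross-check on the idea of group-specific residual variance, one can compute the variance of ordinary regression residuals within each study. The sketch below does this on simulated data with base R; the variable values are invented, and GENESIS estimates the group variances as part of the model fit, which this sketch does not reproduce.

```r
# Simulate heights where study_1 has a smaller residual SD than the others.
set.seed(7)
n_per <- 200
study <- rep(c("study_1", "study_2", "study_3"), each = n_per)
sd_by_study <- c(study_1 = 4, study_2 = 8, study_3 = 8)  # made-up values
age <- runif(3 * n_per, 40, 70)
height <- 170 - 0.05 * age + rnorm(3 * n_per, sd = sd_by_study[study])

# Adjust for mean effects, then summarize the residual variance by study.
fit <- lm(height ~ age + study)
resid_var <- tapply(resid(fit), study, var)
resid_var
```

Clearly unequal values across studies suggest that a single shared residual variance is a poor fit, which is the situation the `group.var` argument is meant to handle.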

We can also show the same information by plotting the residuals by study.
We first have to add the residuals from this model to the AnnotatedDataFrame.

```{r}
annot$residuals <- mod_2$fit$resid.marginal[match(annot$sample.id, rownames(mod_2$fit))]
```

Next make a boxplot of the residuals by study.

```{r resid_2}
ggplot(pData(annot), aes(x = study, y = residuals)) +
  geom_boxplot()
```

Both methods of looking at the variance components indicate that study 1 has a smaller residual variance than the others.

## Final considerations

We have determined that height differs across the studies in both mean and variance.
Before performing genotype-phenotype association tests with these data, you would need to think carefully about whether the phenotype is homogeneous enough to be analyzed together.
In some cases, there may be a valid reason for different means or variances, for example:

* different heights in different study populations, such as a study composed primarily of Asian participants vs. a study with primarily European participants or a study of all men vs. a study of all women;
* possible secular trends in height, such as comparing the Framingham Original cohort from ~1950 to a cohort from the present day.

In other cases, there may be good reasons to exclude one or more studies, for example:

* a systematic measurement error in one study
* miscalculation or misinterpretation of the harmonization algorithm
* study populations that are too different to be compared, such as trying to include a study composed primarily of children with one composed of adults in a height analysis

It may be necessary to look at other variables that you had not previously considered.
Studies may have used different measurement equipment or calibrated their data differently.
There might also be other batch effects due to lab procedures or assays that could result in differences in the variance or mean by study.
The other variables that you may need to consider are highly dependent both on the phenotype being harmonized and on how a given study has been designed.

Unfortunately there is no single set of guidelines you can use to decide how to proceed with analysis of a phenotype.
It is necessary to involve both domain experts and study experts to determine whether the phenotype is homogeneous enough to use in cross-study analysis.


# Association tests - Part I

These exercises introduce genetic association testing: how to identify which genetic variants are associated with a phenotype. In this example, we will test for an association between variant genotypes and height, adjusting for sex, age, and study. Here, we introduce fitting the "null model" and single-variant association testing, as is commonly performed in genome-wide association studies (GWAS).

## Null model

The first step in our association testing procedure is to fit the null model -- i.e., a model fit under the null hypothesis of no individual variant association. Operationally, this is fitting a regression model with the desired outcome phenotype, fixed effect covariates, and random effects.

### Prepare the data

To fit the null model, we will need to create an `AnnotatedDataFrame` with sample information and phenotype data; this class is defined in the Biobase package. We will merge our sample annotation file, which is indexed by a `sample.id` column matched to the GDS file, with our phenotype file, which is indexed by a `subject_id` column. We will use the [dplyr](http://dplyr.tidyverse.org) package for data.frame manipulation.

NOTE: In this example, we use the 1000 Genomes IDs for both sample and subject IDs, though we would generally advise using separate IDs for samples (sequencing instances) and subjects (individuals).

```{r null_model, message = FALSE}
# sample annotation
sampfile <- "AnalysisFiles/sample_annotation.RData"
samp <- get(load(sampfile))

library(Biobase)
# access the data with the pData() function
head(pData(samp))
```

```{r}
# access the metadata with the varMetadata() function
varMetadata(samp)
```

```{r}
# phenotype data
phenfile <- "AnalysisFiles/phenotype_annotation.RData"
phen <- get(load(phenfile))

# access the data with the pData() function
head(pData(phen))
```

```{r}
# access the metadata with the varMetadata() function
varMetadata(phen)
```

```{r}
# merge sample annotation with phenotype data
library(dplyr)
dat <- pData(samp) %>%
    left_join(pData(phen), by=c("subject.id"="subject_id", "sex"="sex"))
head(dat)

# merge the metadata
meta <- bind_rows(varMetadata(samp), varMetadata(phen)[3:5,,drop=FALSE])

# make an AnnotatedDataFrame
annot <- AnnotatedDataFrame(dat, meta)
save(annot, file="OutputFiles/sample_phenotype_annotation.RData")
```

### Fit the null model

We use the `fitNullModel` function from the GENESIS package to fit the null model. We need to specify the outcome (height) and the fixed effect covariates (sex, age, and study). If the sample set involves multiple distinct groups with different variances for the phenotype, we recommend allowing for heterogeneous residual variance among groups with the `group.var` parameter. We saw in a previous exercise that the variance of height differs by study, so we set `group.var` to `study`.

```{r null_model_fit}
library(GENESIS)

# fit the null model
nullmod <- fitNullModel(annot,
                        outcome="height",
                        covars=c("sex", "age", "study"),
                        group.var="study",
                        verbose=FALSE)
save(nullmod, file="OutputFiles/null_model.RData")
```

The `fitNullModel` function returns a lot of information about the model that was fit. We examine some of that information below; to see all of the components, try `names(nullmod)`.

```{r}
# description of the model we fit
nullmod$model
```

```{r}
# fixed effect regression estimates
nullmod$fixef
```

```{r}
# residual variance estimates by group.var
nullmod$varComp
```

```{r}
# model fit: fitted values, residuals
head(nullmod$fit)
```

```{r}
# plot the residuals vs the fitted values
library(ggplot2)
ggplot(nullmod$fit, aes(x = fitted.values, y = resid.marginal)) +
    geom_point(alpha = 0.5) +
    geom_hline(yintercept = 0) +
    geom_smooth(method = 'lm')
```

## Exercise

1. As discussed in the lecture, we recommend a fully adjusted two-stage inverse Normalization procedure for fitting the null model when phenotypes have non-Normal distributions. Using the `two.stage` option in `fitNullModel`, fit a two-stage null model. Compare these residuals with the residuals from the original null model.

[Solutions](./Solutions.html#association-tests---part-i)

## Single-variant association tests

After fitting our null model, we can use score tests to test each variant across the genome individually for association with the outcome phenotype (i.e. height in our example). Performing these single-variant tests genome-wide is commonly referred to as a GWAS (Genome-Wide Association Study).
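
To show why the null model only needs to be fit once, here is a minimal base R sketch of a score test for a single variant under an ordinary linear model with independent residuals. The data are simulated and all names are hypothetical; this is not the GENESIS implementation, which also handles group-specific variances and random effects.

```r
# Simulate covariates, a genotype dosage, and a phenotype with no
# genotype effect (made-up values).
set.seed(1)
n <- 500
age <- rnorm(n, 50, 10)
sex <- rbinom(n, 1, 0.5)
g <- rbinom(n, 2, 0.2)
y <- 170 + 0.1 * age - 10 * sex + rnorm(n, sd = 6)

# Null model: covariates only, fit once and reused for every variant.
null_fit <- lm(y ~ age + sex)
r <- resid(null_fit)
sigma2 <- sum(r^2) / null_fit$df.residual

# Per variant: adjust the genotype for the same covariates, then
# compute the score, its variance, and a 1-df chi-square test.
g_adj <- resid(lm(g ~ age + sex))
score <- sum(g_adj * r)
score_var <- sigma2 * sum(g_adj^2)
stat <- score^2 / score_var
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
c(Score = score, Score.Stat = stat, Score.pval = pval)
```

Only the genotype-dependent quantities change from variant to variant, so the cost of fitting the covariates is paid once. In this simple setting, `score / sum(g_adj^2)` equals the genotype coefficient from the full regression of `y` on `age`, `sex`, and `g`.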

We use the `assocTestSingle` function in GENESIS. First, we have to create a `SeqVarData` object including both the GDS file and the sample annotation containing phenotype data. We then create a `SeqVarBlockIterator` object, which breaks the set of all variants in the `SeqVarData` object into blocks, allowing us to analyze genome-wide in manageable pieces. The `assocTestSingle` function iterates over all blocks of variants in the `SeqVarBlockIterator` object and then concatenates and returns the results.

```{r assoc_single, message = FALSE}
library(SeqVarTools)
gdsfile <- "AnalysisFiles/1KG_phase3_subset_chr1.gds"
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)

# make the iterator object
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
iterator
```

```{r}
# run the single-variant association test
assoc <- assocTestSingle(iterator, nullmod)
dim(assoc)
```

```{r}
head(assoc)
```

We make a QQ plot to examine the results.

```{r assoc_single_qq}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(assoc$Score.pval)
```

Many of the variants we tested are very rare: the alternate allele is observed in only a few samples. Single-variant tests do not perform well for very rare variants (we will discuss testing rare variants in more detail in the next session). We can use the minor allele count (MAC) observed in the sample to filter out rare variants that we may expect to have unreliable test results.

```{r mac}
summary(assoc$MAC)
sum(assoc$MAC < 5)
```
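
For reference, the minor allele count can be computed directly from a dosage matrix. The sketch below uses a made-up sample-by-variant matrix of alternate allele dosages and assumes diploid, biallelic genotypes; it illustrates the definition rather than the GENESIS code.

```r
# Toy alternate-allele dosage matrix: 4 samples x 3 variants (invented).
dosage <- matrix(c(0, 0, 1, 0,    # v1
                   2, 2, 1, 2,    # v2
                   0, 1, 1, 2),   # v3
                 nrow = 4,
                 dimnames = list(paste0("s", 1:4), paste0("v", 1:3)))

# Number of observed alleles per variant (2 per non-missing diploid genotype).
n_alleles <- 2 * colSums(!is.na(dosage))
alt_count <- colSums(dosage, na.rm = TRUE)

# The minor allele is whichever allele is less common in the sample.
mac <- pmin(alt_count, n_alleles - alt_count)
mac
```

Note that variant `v2` has a high alternate allele count but a MAC of 1, because the reference allele is the rare one.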

```{r}
qqPlot(assoc$Score.pval[assoc$MAC >= 5])
```

We should expect the majority of variants to fall near the red `y=x` line in the QQ plot. The deviation above the line, commonly referred to as "inflation", is indicative of some model issue. In this example, the issue is likely driven by the fact that we've ignored genetic ancestry and relatedness among these subjects; we return to this later when we discuss mixed models.
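
Inflation is often summarized with a single number, the genomic inflation factor (lambda): the median observed test statistic divided by the median expected under the null. The sketch below is a minimal base R version; the helper name `lambda_gc` is hypothetical, and the p-values are simulated rather than taken from the analysis above.

```r
# Hypothetical helper: genomic inflation factor from a vector of p-values,
# assuming each p-value comes from a 1-df chi-square test.
lambda_gc <- function(pval) {
  pval <- pval[!is.na(pval)]
  chisq <- qchisq(pval, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}

set.seed(3)
# Uniform p-values mimic a well-calibrated test: lambda should be near 1.
lambda_null <- lambda_gc(runif(1e5))
# Shrinking the p-values mimics inflation: lambda should exceed 1.
lambda_inflated <- lambda_gc(runif(1e5)^1.2)
c(null = lambda_null, inflated = lambda_inflated)
```

With the results above, the same function could be applied to `assoc$Score.pval[assoc$MAC >= 5]`; values well above 1 would be consistent with the inflation seen in the QQ plot.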

```{r assoc_close1}
# close the GDS file!
seqClose(seqData)
```

## Exercise

2. GENESIS also supports testing binary (e.g. case/control) outcomes. We can fit a null model using logistic regression by specifying the argument `family=binomial` in the `fitNullModel` function. Use the `status` column in the sample annotation to fit a null model for simulated case/control status, with `sex` and `Population` as covariates. Run single-variant association tests using this model and make a QQ plot of all variants with MAC >= 5.

[Solutions](./Solutions.html#association-tests---part-i)


# Association tests - Part II

These exercises continue the introduction to genetic association testing. Here, we introduce multiple-variant association tests, which are commonly used for testing rare variants in aggregate.

## Sliding window tests

We can perform burden, SKAT, SKAT-O, fastSKAT, and SMMAT tests using the GENESIS function `assocTestAggregate`. First, we need to load the null model and `AnnotatedDataFrame` (sample annotation + phenotype data) that we created in the previous session, and we need to create our `SeqVarData` object linking the GDS file to the `AnnotatedDataFrame`.

```{r assoc_window_load, message=FALSE}
# load our null model
nullmodfile <- "AnalysisFiles/null_model.RData"
nullmod <- get(load(nullmodfile))
```

```{r}
# load our sample annotation
annotfile <- "AnalysisFiles/sample_phenotype_annotation.RData"
annot <- get(load(annotfile))
```

```{r}
# open the GDS file
library(SeqVarTools)
gdsfile <- "AnalysisFiles/1KG_phase3_subset_chr1.gds"
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)
```

### Burden test

First, we perform a burden test. We restrict the test to variants with alternate allele frequency < 0.1. (For real data, this threshold would be lower, perhaps 0.05 or 0.01.) We use a flat weighting scheme -- i.e. every variant gets the same weight. We define a sliding window across the genome using a `SeqVarWindowIterator` object.

```{r assoc_window_burden}
# make the window iterator object
iterator <- SeqVarWindowIterator(seqData, windowSize=10000, windowShift=5000, verbose=FALSE)

# run the burden test
library(GENESIS)
assoc <- assocTestAggregate(iterator,
                            nullmod,
                            test="Burden",
                            AF.max=0.1,
                            weight.beta=c(1,1),
                            verbose = FALSE)
```
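
Conceptually, a burden test collapses the variants in a window into one score per sample, a weighted sum of allele dosages, and then tests that score for association with the outcome. The sketch below shows the collapsing step on a made-up dosage matrix with flat weights; it is an illustration of the idea, not the internals of `assocTestAggregate`.

```r
# Toy dosage matrix for one window: 4 samples x 3 rare variants (invented).
dosage <- matrix(c(0, 1, 0,
                   1, 0, 0,
                   0, 0, 2,
                   0, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), paste0("v", 1:3)))

# Flat weights: every variant contributes equally.
weights <- rep(1, ncol(dosage))

# Burden score: one number per sample, the weighted count of rare alleles.
burden <- drop(dosage %*% weights)
burden
```

Because all variants are summed with the same sign, a burden test has the most power when the rare alleles in a window affect the phenotype in the same direction.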

```{r assoc_window_output}
names(assoc)
```

The function returns the primary results for each window in one table.

```{r}
# results for each window
head(assoc$results)
```

```{r}
# how many variants in each window?
table(assoc$results$n.site)
```

It also returns a list of tables that contain the variant details for each window tested.

```{r}
# variant details for windows with > 1 variant
idx <- which(assoc$results$n.site > 1)
head(assoc$variantInfo[idx])
```

We can make a QQ plot of the burden p-values from the main results table.

```{r assoc_burden_qq}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

# make a QQ plot of the burden test p-values
qqPlot(assoc$results$Score.pval)
```

### SKAT test

We can also perform a SKAT test. This time, we will use the Wu weights, which give larger weights to rarer variants.
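
The `weight.beta` argument gives the two parameters of a Beta density evaluated at each variant's minor allele frequency. The short sketch below compares the flat weights `c(1,1)` with the Wu weights `c(1,25)` at a few made-up frequencies.

```r
# Example minor allele frequencies (invented values).
maf <- c(0.001, 0.01, 0.05, 0.1)

# Beta(1, 1) is flat: every variant gets weight 1.
flat_weights <- dbeta(maf, 1, 1)

# Beta(1, 25) puts much more weight on rarer variants.
wu_weights <- dbeta(maf, 1, 25)

data.frame(maf, flat_weights, wu_weights)
```

The Wu weights fall steadily as the frequency rises, so a variant at 0.1% frequency counts roughly twelve times as much as one at 10%.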

```{r assoc_window_skat, message = FALSE}
# reset the iterator to the first window
resetIterator(iterator)

# run the SKAT test
assoc <- assocTestAggregate(iterator,
                            nullmod,
                            test="SKAT",
                            AF.max=0.1,
                            weight.beta=c(1,25),
                            verbose = FALSE)
```

```{r}
# results for each window
head(assoc$results)
```

```{r}
# variant details for windows with > 1 variant
idx <- which(assoc$results$n.site > 1)
head(assoc$variantInfo[idx])
```

```{r}
# make a QQ plot of the SKAT test p-values
qqPlot(assoc$results$pval)
```

```{r assoc_close2}
seqClose(seqData)
```


## Exercise

1. Perform a sliding window SKAT test for the outcome status. Adjust your model for the covariates sex and study. When performing your SKAT test, use all variants with alternate allele frequency < 20%, and use the Wu weights to give larger weights to rarer variants. Use the same `windowSize` and `windowShift` as in the examples. How many windows have >1 variant? Make a QQ plot of the SKAT p-values.

[Solutions](./Solutions.html#association-tests---part-ii)


# Ancestry and Relatedness Inference

## LD-pruning

We generally advise that population structure and relatedness inference be performed using a set of (nearly) independent genetic variants. To find this set of variants, we perform linkage-disequilibrium (LD) pruning on the study sample set. We typically use an LD threshold of `r^2 < 0.1` to select variants.
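
The threshold is stated on the `r^2` scale, but a correlation-based method compares against `r`, which is why the square root appears in the call. The sketch below illustrates this with simulated genotype dosages in base R; the variables are hypothetical, and it assumes the threshold is applied to the absolute correlation between dosages, which may differ in detail from the SNPRelate algorithm.

```r
# Simulate dosages for three variants (made-up data).
set.seed(11)
n <- 200
g1 <- rbinom(n, 2, 0.3)
# g2 copies g1 for most samples, so the two variants are in strong LD.
g2 <- ifelse(runif(n) < 0.9, g1, rbinom(n, 2, 0.3))
# g3 is simulated independently of g1.
g3 <- rbinom(n, 2, 0.3)

r12 <- cor(g1, g2)
r13 <- cor(g1, g3)

# An r^2 threshold of 0.1 corresponds to |r| = sqrt(0.1), about 0.316.
threshold <- sqrt(0.1)
c(r12 = r12, r13 = r13, threshold = threshold)

# TRUE means the pair is too correlated to keep both variants.
abs(c(r12, r13)) > threshold
```

In this toy example only one of `g1` and `g2` would survive pruning, while `g3` could be kept alongside either.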

```{r ld-pruning, message = FALSE}
library(SeqArray)
gdsfile <- "AnalysisFiles/1KG_phase3_subset.gds"
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# run LD pruning
library(SNPRelate)
set.seed(100) # LD pruning has a random element; so make this reproducible
snpset <- snpgdsLDpruning(gds,
                          method="corr",
                          slide.max.bp=10e6,
                          ld.threshold=sqrt(0.1))
```
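The call above passes `sqrt(0.1)` rather than `0.1` because, as we understand the SNPRelate documentation, `method="corr"` applies the threshold to the correlation coefficient `r` rather than to `r^2`. The sketch below uses two simulated (hypothetical) genotype vectors to show that the two ways of stating the criterion agree.

```r
set.seed(1)
n <- 500

# Simulated genotypes (0/1/2 alternate allele counts) for a first variant
g1 <- rbinom(n, 2, 0.3)

# A second variant that copies the first for about 30% of samples
# and is drawn independently otherwise, which induces some correlation
copy <- runif(n) < 0.3
g2 <- ifelse(copy, g1, rbinom(n, 2, 0.3))

r <- cor(g1, g2)
c(r = r, r2 = r^2)

# The r^2 criterion and the |r| criterion select the same pairs
(r^2 < 0.1) == (abs(r) < sqrt(0.1))
```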

```{r}
# how many variants on each chr?
sapply(snpset, length)
```

```{r}
# get the full list of LD-pruned variants
pruned <- unlist(snpset, use.names=FALSE)
length(pruned)
save(pruned, file = "OutputFiles/ld_pruned_variants.RData")
```

## Computing a GRM

We can use the [SNPRelate package](https://github.com/zhengxwen/SNPRelate) to compute a Genetic Relationship Matrix (GRM). A GRM captures genetic relatedness due to both distant ancestry (i.e. population structure) and recent kinship (i.e. family structure) in a single matrix.

SNPRelate offers several algorithms for computing a GRM, including the commonly-used GCTA [Yang et al 2011](https://www.ncbi.nlm.nih.gov/pubmed/21167468). The most recent algorithm added to the package is "IndivBeta" [Weir and Goudet 2017](https://www.ncbi.nlm.nih.gov/pubmed/28550018).
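For intuition, the following is a simplified base R sketch of a GCTA-style estimator: genotypes are centered and scaled by the estimated allele frequencies, and the GRM is the average cross-product over variants. It is not SNPRelate's implementation; the genotype matrix is simulated and all object names (`toy.geno`, `toy.grm`, and so on) are hypothetical.

```r
set.seed(2)
n.samp <- 6
n.var <- 200

# Simulated genotype matrix: samples in rows, variants in columns
p.true <- runif(n.var, 0.1, 0.5)
toy.geno <- sapply(p.true, function(p) rbinom(n.samp, 2, p))

# Estimated alternate allele frequency of each variant
p.hat <- colMeans(toy.geno) / 2

# Drop monomorphic variants, which cannot be standardized
keep <- p.hat > 0 & p.hat < 1

# Center by 2p and scale by sqrt(2p(1-p)) for each variant
z <- sweep(toy.geno[, keep], 2, 2 * p.hat[keep])
z <- sweep(z, 2, sqrt(2 * p.hat[keep] * (1 - p.hat[keep])), "/")

# Average the cross-products over variants
toy.grm <- tcrossprod(z) / sum(keep)
round(toy.grm, 2)
```

With so few samples the estimates are very noisy; the point is only to show the shape of the calculation.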

```{r grm}
# compute the GRM
library(SNPRelate)
grm <- snpgdsGRM(gds, method="GCTA", snp.id = pruned)
```

```{r}
names(grm)
dim(grm$grm)
```

```{r}
# look at the top corner of the matrix
grm$grm[1:5,1:5]
```

## De-convoluting ancestry and relatedness

To disentangle distant ancestry (i.e. population structure) from recent kinship (i.e. familial relatedness), we implement the analysis described in [Conomos et al., 2016](https://www.cell.com/ajhg/fulltext/S0002-9297(15)00496-6). This approach uses the [KING](http://www.ncbi.nlm.nih.gov/pubmed/20926424), [PC-AiR](http://www.ncbi.nlm.nih.gov/pubmed/25810074), and [PC-Relate](http://www.ncbi.nlm.nih.gov/pubmed/26748516) methods.

### KING

Step 1 is to get initial kinship estimates using [KING-robust](http://www.ncbi.nlm.nih.gov/pubmed/20926424), which is robust to discrete population structure but not ancestry admixture. KING-robust will be able to identify close relatives (e.g. 1st and 2nd degree) reliably, but may identify spurious pairs or miss more distant pairs of relatives in the presence of admixture. KING is available as its own software, but the KING-robust algorithm is also available in SNPRelate.

```{r king}
# run KING-robust
king <- snpgdsIBDKING(gds, snp.id=pruned)
```

```{r}
names(king)
dim(king$kinship)
```

```{r}
kingMat <- king$kinship
colnames(kingMat) <- rownames(kingMat) <- king$sample.id

# look at the top corner of the matrix
kingMat[1:5,1:5]

save(kingMat, file = "OutputFiles/king_matrix.RData")
```

We extract pairwise kinship estimates and IBS0 values (the proportion of variants for which the pair of individuals share 0 alleles identical by state) to plot.
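As a small illustration of the IBS0 definition, the sketch below computes it by hand for one pair of hypothetical genotype vectors. A pair shares 0 alleles identical by state at a variant when one individual is homozygous reference and the other is homozygous alternate.

```r
# Hypothetical alternate allele counts for two individuals at six variants
g.a <- c(0, 1, 2, 2, 0, 1)
g.b <- c(2, 1, 2, 0, 0, 1)

# IBS0: opposite homozygotes, i.e. genotypes that differ by 2
ibs0 <- mean(abs(g.a - g.b) == 2)
ibs0
```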

```{r king_plot}
kinship <- snpgdsIBDSelection(king)
head(kinship)
```

We use a hexbin plot to visualize the relatedness for all pairs of samples.

```{r}
library(ggplot2)
ggplot(kinship, aes(IBS0, kinship)) +
    geom_hline(yintercept=2^(-seq(3,9,2)/2), linetype="dashed", color="grey") +
    geom_hex(bins = 100) +
    ylab("kinship estimate") +
    theme_bw()
```

We see a few parent-offspring, full sibling, 2nd degree, and 3rd degree relative pairs. The abundance of negative estimates reflects pairs of individuals who have ancestry from different populations -- the magnitude of the negative estimate is informative of how different their ancestries are; more on this below.
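The dashed lines in the plot are the kinship cut-offs `2^(-seq(3,9,2)/2)`. The sketch below prints them and uses them to bin a few hypothetical kinship values. The degree labels follow the conventional KING cut-offs as we understand them; they are an assumption for illustration, not output from the tutorial data.

```r
# The four dashed-line thresholds, from largest to smallest
kin.breaks <- 2^(-seq(3, 9, 2) / 2)
round(kin.breaks, 4)

# Hypothetical kinship estimates to classify
toy.kin <- c(0.5, 0.25, 0.125, 0.0625, 0.01, -0.05)

# Bin each estimate by the thresholds (intervals are closed on the right)
degree <- cut(toy.kin,
              breaks = c(-Inf, rev(kin.breaks), Inf),
              labels = c("unrelated", "3rd degree", "2nd degree",
                         "1st degree", "duplicate/MZ twin"))
data.frame(kinship = toy.kin, degree)
```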

### PC-AiR

The next step is [PC-AiR](http://www.ncbi.nlm.nih.gov/pubmed/25810074), which provides robust population structure inference in samples with kinship and pedigree structure. PC-AiR is available in the GENESIS package via the function `pcair`.

First, PC-AiR partitions the full sample set into a set of mutually unrelated samples that is maximally informative about all ancestries in the sample (i.e. the unrelated set) and their relatives (i.e. the related set). We use a 3rd degree kinship threshold (`kin.thresh = 2^(-9/2)`), which corresponds to first cousins -- this defines anyone less related than first cousins as "unrelated". We use the negative KING-robust estimates as "ancestry divergence" measures (`divMat`) to identify pairs of samples with different ancestry -- we preferentially select individuals with many negative estimates for the unrelated set to ensure ancestry representation. For now, we also use the KING-robust estimates as our kinship measures (`kinMat`); more on this below.

Once the unrelated and related sets are identified, PC-AiR performs a standard Principal Component Analysis (PCA) on the unrelated set of individuals and then projects the relatives onto the PCs. Under the hood, PC-AiR uses the SNPRelate package for efficient PC computation and projection.
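The "PCA on the unrelated set, then project the relatives" idea can be sketched in base R with `prcomp` and `predict`. This is only an illustration of the principle on simulated data; it is not how `pcair` is implemented, and the sample names and the split into `toy.unrels` and `toy.rels` are hypothetical.

```r
set.seed(3)
n.var <- 50

# Simulated data matrix: 20 samples in rows, 50 variants in columns
toy.geno <- matrix(rnorm(20 * n.var), nrow = 20,
                   dimnames = list(paste0("S", 1:20), NULL))

# Hypothetical partition into related and unrelated samples
toy.rels <- c("S3", "S11")
toy.unrels <- setdiff(rownames(toy.geno), toy.rels)

# Standard PCA on the unrelated set only
pca.unrel <- prcomp(toy.geno[toy.unrels, ], center = TRUE, scale. = FALSE)

# Project the relatives onto the PCs learned from the unrelated set
pcs.rel <- predict(pca.unrel, newdata = toy.geno[toy.rels, , drop = FALSE])

# Combine, restore the original sample order, and keep the top 4 PCs
toy.pcs <- rbind(pca.unrel$x, pcs.rel)[rownames(toy.geno), 1:4]
head(toy.pcs)
```

Because the relatives do not contribute to the PC loadings, family structure cannot masquerade as an axis of ancestry.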

```{r pcair1}
# run PC-AiR
library(GENESIS)
pca <- pcair(gds,
            kinobj = kingMat,
            kin.thresh=2^(-9/2),
            divobj = kingMat,
            div.thresh=-2^(-9/2))
```

```{r}
names(pca)
```

```{r}
# the unrelated set of samples
length(pca$unrels)
head(pca$unrels)

# the related set of samples
length(pca$rels)
head(pca$rels)
```

```{r}
# extract the top 10 PCs and make a data.frame
pcs <- data.frame(pca$vectors[,1:10])
colnames(pcs) <- paste0('PC', 1:10)
pcs$sample.id <- pca$sample.id
dim(pcs)
head(pcs)
```

We'd like to determine which PCs are ancestry informative. To do this we look at the PCs in conjunction with population information for the 1000 Genomes samples. This information is stored in an `AnnotatedDataFrame`. We make a parallel coordinates plot, color-coding by 1000 Genomes population.

```{r pcair1_parcoord, message = FALSE}
library(Biobase)
sampfile <- "AnalysisFiles/sample_annotation.RData"
annot <- get(load(sampfile))

library(dplyr)
annot <- pData(annot) %>%
        dplyr::select(sample.id, Population)
pc.df <- left_join(pcs, annot, by="sample.id")

library(GGally)
library(RColorBrewer)
pop.cols <- setNames(brewer.pal(12, "Paired"),
                 c("ACB", "ASW", "CEU", "GBR", "CHB", "JPT", "CLM", "MXL", "LWK", "YRI", "GIH", "PUR"))
ggparcoord(pc.df, columns=1:10, groupColumn="Population", scale="uniminmax") +
    scale_color_manual(values=pop.cols) +
    xlab("PC") + ylab("")
```

### PC-Relate

The next step is [PC-Relate](http://www.ncbi.nlm.nih.gov/pubmed/26748516), which provides accurate kinship inference, even in the presence of population structure and ancestry admixture, by conditioning on ancestry informative PCs. As we saw above, the first 4 PCs separate populations in our study, so we condition on PCs 1-4 in our PC-Relate analysis. PC-Relate can be performed using the `pcrelate` function in GENESIS, which expects a `SeqVarIterator` object for the genotype data. The `training.set` argument allows for specification of which samples to use to "learn" the ancestry adjustment -- we recommend the unrelated set from the PC-AiR analysis.

(NOTE: this will take a few minutes to run).

```{r pcrelate1}
seqResetFilter(gds, verbose=FALSE)
library(SeqVarTools)
seqData <- SeqVarData(gds)

# filter the GDS object to our LD-pruned variants
seqSetFilter(seqData, variant.id=pruned)
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)

pcrel <- pcrelate(iterator,
                  pcs=pca$vectors[,1:4],
                  training.set=pca$unrels)
```

```{r}
names(pcrel)
```

```{r}
# relatedness between pairs of individuals
dim(pcrel$kinBtwn)
```

```{r}
head(pcrel$kinBtwn)
```

```{r}
# self-kinship estimates
dim(pcrel$kinSelf)
```

```{r}
head(pcrel$kinSelf)
```

We plot the pairwise kinship estimates against the IBD0 (`k0`) estimates (the proportion of variants for which the pair of individuals share 0 alleles identical by descent). We use a hexbin plot to visualize the relatedness for all pairs of samples.

```{r pcrelate1_plot}
ggplot(pcrel$kinBtwn, aes(k0, kin)) +
    geom_hline(yintercept=2^(-seq(3,9,2)/2), linetype="dashed", color="grey") +
    geom_hex(bins = 100) +
    geom_abline(intercept = 0.25, slope = -0.25) +
    ylab("kinship estimate") +
    theme_bw()
```
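The solid line drawn by `geom_abline(intercept = 0.25, slope = -0.25)` follows from the standard relation between kinship and IBD sharing probabilities, `kin = k1/4 + k2/2`. When `k2 = 0` we have `k1 = 1 - k0`, so `kin = 0.25 - 0.25 * k0`. The sketch below checks this against the textbook expected values for a few relationship types in outbred pairs; the data frame `ibd` is constructed for illustration only.

```r
# Expected IBD sharing probabilities for outbred pairs
ibd <- data.frame(
    relationship = c("parent-offspring", "full siblings",
                     "2nd degree (e.g. half siblings)",
                     "3rd degree (e.g. first cousins)", "unrelated"),
    k0 = c(0, 0.25, 0.5, 0.75, 1),
    k1 = c(1, 0.5, 0.5, 0.25, 0),
    k2 = c(0, 0.25, 0, 0, 0)
)

# Kinship coefficient implied by the IBD sharing probabilities
ibd$kin <- ibd$k1 / 4 + ibd$k2 / 2

# Does the pair fall on the line kin = 0.25 - 0.25 * k0?
ibd$on.line <- abs(ibd$kin - (0.25 - 0.25 * ibd$k0)) < 1e-12
ibd
```

Full siblings are the only type here with `k2 > 0`, so they sit above the line while the other relationship types fall on it.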

We see that the PC-Relate relatedness estimates for unrelated pairs (i.e. kin ~ 0 and k0 ~ 1) are much closer to expectation than those from KING-robust.

We can use the `pcrelateToMatrix` function to transform the output into an (n x n) kinship matrix (KM).

```{r pcrelate1_km}
pcrelMat <- pcrelateToMatrix(pcrel, scaleKin=1, verbose=FALSE)

# look at the top corner of the matrix
pcrelMat[1:5,1:5]

save(pcrelMat, file = "OutputFiles/pcrelate_matrix_round1.RData")
```

```{r}
seqClose(seqData)
```

## Exercises

In small samples (such as this one), we recommend performing a second iteration of PC-AiR and PC-Relate. Now that we have the PC-Relate ancestry-adjusted kinship estimates, we can better partition our sample into unrelated and related sets. This can lead to better ancestry PCs from PC-AiR and better relatedness estimates from PC-Relate.

1. Perform a second PC-AiR analysis, this time using the PC-Relate kinship matrix as the kinship estimates (you should still use the KING-robust matrix for the ancestry divergence estimates). How does the unrelated set compare to the first PC-AiR analysis?

2. Make a parallel coordinates plot of the top 10 PC-AiR PCs. How does this compare to the plot from the first iteration? How many PCs seem to reflect ancestry?

3. Perform a second PC-Relate analysis, this time using the new PC-AiR PCs to adjust for ancestry. Make a hexbin plot of estimated kinship vs IBD0.

[Solutions](./Solutions.html#ancestry-and-relatedness-inference)


# Mixed models

These exercises extend what was previously introduced in the association tests from regression models to mixed models that account for genetic relatedness among samples.

## Null model

Recall that the first step in our association testing procedure is to fit the null model. In addition to the `AnnotatedDataFrame` with phenotype data that we used previously, we will also use the ancestry PCs and pairwise kinship estimates we created in the previous session. We will use the first 4 PCs to adjust for ancestry.

```{r null_model_mm, message = FALSE}
# sample annotation
sampfile <- "AnalysisFiles/sample_phenotype_annotation.RData"
annot <- get(load(sampfile))
library(Biobase)
head(pData(annot))
```

```{r}
# load the ancestry PCs
pcfile <- "AnalysisFiles/pcs.RData"
pcs <- get(load(pcfile))
pcs <- pcs[,c("sample.id", "PC1", "PC2", "PC3", "PC4")]
head(pcs)
```

```{r}
# merge PCs with the sample annotation
library(dplyr)
dat <- left_join(pData(annot), pcs, by="sample.id")
# update the AnnotatedDataFrame
pData(annot) <- dat
save(annot, file="OutputFiles/sample_phenotype_pcs.RData")
```

We can create a kinship matrix from the output of `pcrelate`. We multiply the kinship values by 2 to get values on the same scale as the standard GRM. This matrix is represented in R as a symmetric matrix object from the Matrix package.
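To see why the factor of 2 puts the values on the GRM scale, consider a small hypothetical kinship matrix. A non-inbred individual has self-kinship 0.5, so doubling gives 1 on the diagonal, which is the expected diagonal of a standard GRM. The matrix `toy.km` below is invented for illustration.

```r
ids <- c("parent", "offspring", "unrelated")

# Hypothetical kinship matrix: self-kinship 0.5, parent-offspring 0.25
toy.km <- matrix(c(0.5,  0.25, 0,
                   0.25, 0.5,  0,
                   0,    0,    0.5),
                 nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Doubling the kinship values, as scaleKin=2 does
toy.scaled <- 2 * toy.km
toy.scaled
```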

```{r load_kinship}
kinfile <- "AnalysisFiles/pcrelate_kinship.RData"
pcrel <- get(load(kinfile))
library(GENESIS)
kinship <- pcrelateToMatrix(pcrel, scaleKin=2, verbose=FALSE)
kinship[1:5,1:5]
```

When running a mixed model analysis, we still fit the null model using the `fitNullModel` function in GENESIS. Now, we include the kinship matrix in the model with the `cov.mat` argument, which is used to specify the random effect(s) in the model with covariance structure(s) proportional to the supplied matrix or matrices. The inclusion of these random effects is what makes this a mixed model, rather than a simple regression model. We also add the ancestry PCs to the list of covariates and allow for heterogeneous residual variance by `study` with the `group.var` argument, as before.
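The covariance structure this implies for the outcome can be sketched as one genetic variance component times the kinship matrix plus a diagonal residual term that differs by group. The base R example below builds that matrix for three hypothetical samples; the variance component values and study labels are made up and are not estimates from the tutorial data.

```r
ids <- c("S1", "S2", "S3")

# Hypothetical kinship matrix on the GRM scale (S1 and S2 are first-degree relatives)
toy.K <- matrix(c(1,   0.5, 0,
                  0.5, 1,   0,
                  0,   0,   1),
                nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical group membership, playing the role of group.var
toy.study <- c("study_1", "study_1", "study_2")

# Hypothetical variance components
sigma2.g <- 0.4                               # genetic (random effect) variance
sigma2.e <- c(study_1 = 0.6, study_2 = 1.0)   # residual variance per study

# Outcome covariance: genetic component plus group-specific residual variance
V <- sigma2.g * toy.K + diag(unname(sigma2.e[toy.study]))
V
```

Related samples get a non-zero off-diagonal covariance, and samples from different studies are allowed different total variances.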

```{r null_model_fit_mm}
nullmod <- fitNullModel(annot,
                        outcome="height",
                        covars=c("sex", "age", "study", paste0("PC", 1:4)),
                        cov.mat=kinship,
                        group.var="study",
                        verbose=FALSE)
save(nullmod, file="OutputFiles/null_mixed_model.RData")
```

We can investigate the output from `fitNullModel`.

```{r}
# description of the model we fit
nullmod$model
```

```{r}
# fixed effect regression estimates
nullmod$fixef
```

```{r}
# variance component estimates by group.var
nullmod$varComp
```

```{r}
# model fit: fitted values, residuals
head(nullmod$fit)
```

```{r}
library(ggplot2)
ggplot(nullmod$fit, aes(x = fitted.values, y = resid.marginal)) +
    geom_point(alpha = 0.5) +
    geom_hline(yintercept = 0) +
    geom_smooth(method = 'lm')
```


## Single-variant tests

Now we can run a single-variant test, accounting for genetic ancestry and genetic relatedness among the subjects. We use the same `assocTestSingle` function as before; the only difference is that we pass in our new null model.

```{r assoc_single_mm, message = FALSE}
library(SeqVarTools)
gdsfile <- "AnalysisFiles/1KG_phase3_subset_chr1.gds"
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)

# make the iterator object
iterator <- SeqVarBlockIterator(seqData, verbose=FALSE)
```

```{r}
# run the single-variant association test
assoc <- assocTestSingle(iterator, nullmod)
dim(assoc)
```

```{r}
head(assoc)
```

We make the usual QQ plot, filtering to variants with minor allele count (MAC) >= 5.

```{r assoc_single_qq_mm}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(assoc$Score.pval[assoc$MAC >= 5])
```

Notice that we observe much less inflation than before, when we did not adjust for ancestry and relatedness.
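One common way to put a number on inflation is the genomic control factor lambda: the median of the observed 1-df chi-squared statistics divided by the median expected under the null. The sketch below is not part of this tutorial's workflow; it uses simulated uniform p-values (`toy.pval`, a hypothetical stand-in for `assoc$Score.pval`), for which lambda should be close to 1.

```r
set.seed(4)

# Simulated p-values under the null hypothesis (uniform on [0, 1])
toy.pval <- runif(10000)

# Convert p-values to 1-df chi-squared statistics
chisq <- qchisq(toy.pval, df = 1, lower.tail = FALSE)

# Ratio of the observed median to the expected median under the null
lambda <- median(chisq) / qchisq(0.5, df = 1)
lambda
```

Values well above 1 suggest inflation, for example from unmodeled population structure or relatedness.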

```{r assoc_mm_close}
seqClose(seqData)
```

## Exercise

1. Perform a single-variant association test for `status`. Adjust for sex, age, study, ancestry, and kinship in the model. Don't forget to consider the `family` parameter. Make a QQ plot of the p-values for all variants with MAC >= 5.

[Solutions](./Solutions.html#mixed-models)


# Variant annotation

In this session, we will learn how to use [Annotation Explorer](https://platform.sb.biodatacatalyst.nhlbi.nih.gov/u/biodatacatalyst/annotation-explorer/), an open tool available on the NHLBI BioData Catalyst cloud platform that eliminates the challenges of working with very large variant-level annotated datasets. Using Annotation Explorer, we will learn how to explore and interactively query variant annotations and integrate them in GWAS analyses. Annotation Explorer can be used pre-association testing -- for example, to generate annotation-informed variant filtering and grouping files for aggregate testing -- as well as post-association testing -- for example, to explore annotations of variants in a novel GWAS signal. We will execute three representative use cases to demonstrate both pre- and post-GWAS applications. For all the use cases, we will be using the open-access dataset `TOPMed_freeze5_open`, which everyone has access to.

Annotation Explorer has an interactive graphical user interface built on high performance databases and does not require any programming experience. It currently caps the number of users at a given time, so we will not all be able to use it live at the same time during the workshop. We request that everyone perform the hands-on exercises involving Annotation Explorer after the workshop is over, at their own convenience. We have provided a [detailed tutorial](https://docs.google.com/document/d/1_yXemTTYnBzL6Dv4fngojE0T5CAH3Z-CSxj1X5Qq8kI/edit?usp=sharing) and will provide a video recording of this demo for how to perform the following exercises using Annotation Explorer.

## Use cases

### Use case 1

User wants to generate aggregation units for rare variant association testing such that only missense variants which have `CADD phred score >20` are grouped by Ensembl gene definitions.

### Use case 2

User wants to generate aggregation units for rare variant association testing such that they retain only variants with `fathmm_MKL_non_coding_score > 0.5` grouped by user-defined genomic coordinates (for example, using ATAC-Seq peaks from the tissue of your choice).

### Use case 3

User wants to explore the annotations for a variant of their interest.

## Exercise

1. Using [Annotation Explorer](https://platform.sb.biodatacatalyst.nhlbi.nih.gov/u/biodatacatalyst/annotation-explorer/), generate a new set of aggregation units by setting up the same filtering criteria as in use case 1, but this time use a different CADD phred score cut-off (for example, 40 or 10) and study how that changes the plots under the `interactive plots` tab of Annotation Explorer. For example, how does changing the filtering criteria change the number of aggregation units with no variants? How do the distribution and number of aggregation units in each bin change in the histogram?

[Solutions](./Solutions.html#variant-annotation)


# Annotation informed aggregate association tests

## Aggregate unit for association testing exercise

Now that we know how to make genome annotation informed aggregation units using Annotation Explorer, such as the gene-based variant aggregation units, we can proceed to an association testing exercise. *NOTE: The exercises in this workshop are based on the 1000 Genomes dataset mapped to genome build GRCh37/hg19. Because the aggregation units we generated using the Annotation Explorer in the previous section are mapped to GRCh38 and are not based on 1000 Genomes data, we will NOT be using them in this section*. Instead, in this exercise we will be using pre-computed aggregation units based on 1000 Genomes mapped to GRCh37 so that the annotation positions are consistent with the build used for genotyping data in the workshop. These gene-based units include SNVs from all chromosomes (no indels, and not just chromosome 1 as before). Each genic unit was specified to include the set of SNVs falling within GENCODE-defined gene boundaries and the 20 kb flanking regions upstream and downstream of that range. This set of aggregation units is not filtered by CADD score or consequence.

The aggregation units are defined in an R dataframe in the format consistent with the output from Annotation Explorer and compatible with the GENESIS association testing workflows. Each row of the dataframe specifies a variant (chr, pos, ref, alt) and the group identifier (group_id) it is a part of. Multiple rows with different group identifiers can be specified to assign a variant to different groups (i.e., a variant can be assigned to multiple genes).
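The layout described above can be illustrated with a tiny hand-made data frame. The column names follow the description (chr, pos, ref, alt, group_id), but the column order, gene names, and variants in `toy.agg` are hypothetical. The base R code finds any variant that is assigned to more than one group.

```r
# Hypothetical aggregation units: one row per variant-group assignment
toy.agg <- data.frame(
    group_id = c("geneA", "geneA", "geneB", "geneB"),
    chr = c("1", "1", "1", "1"),
    pos = c(100L, 250L, 250L, 900L),
    ref = c("A", "C", "C", "G"),
    alt = c("G", "T", "T", "A"),
    stringsAsFactors = FALSE
)

# Count how many groups each variant belongs to
n.groups <- aggregate(group_id ~ chr + pos + ref + alt,
                      data = toy.agg, FUN = length)

# Variants assigned to more than one group
shared <- n.groups[n.groups$group_id > 1, ]
shared

# Number of variants in each aggregation unit
table(toy.agg$group_id)
```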

Begin by loading the aggregation units:

```{r agg_unit}
library(dplyr)
aggfile <- "AnalysisFiles/variants_by_gene.RData"
aggunit <- get(load(aggfile))
head(aggunit)
```

```{r}
# an example of a variant that is present in multiple groups
mult <- aggunit %>%
    group_by(chr, pos) %>%
    summarise(n=n()) %>%
    filter(n > 1)
inner_join(aggunit, mult[2,1:2])
```

## Association testing with aggregate units

We can run burden and SKAT tests on each of these units using the same `assocTestAggregate` function we used previously. We define a `SeqVarListIterator` object where each list element is an aggregate unit. The constructor expects a `GRangesList`, so we use the TopmedPipeline function `aggregateGRangesList` to quickly convert our single dataframe to the required format. This function can account for multiallelic variants (the same chromosome, position, and ref, but different alt alleles).

```{r aggVarList}
# open the GDS file
library(SeqVarTools)
gdsfile <- "AnalysisFiles/1KG_phase3_subset_chr1.gds"
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)

# sample annotation file
annotfile <- "AnalysisFiles/sample_phenotype_pcs.RData"
annot <- get(load(annotfile))

# make the SeqVarData object
seqData <- SeqVarData(gds, sampleData=annot)
```

```{r}
# subset to chromosome 1
aggunit1 <- filter(aggunit, chr == 1)

# create the GRangesList object
library(TopmedPipeline)
aggVarList <- aggregateGRangesList(aggunit1)
length(aggVarList)
head(names(aggVarList))
aggVarList[[1]]
```

```{r}
# construct the iterator using the SeqVarListIterator function
iterator <- SeqVarListIterator(seqData, variantRanges=aggVarList, verbose=FALSE)
```

As in the previous section, we must load the null model we fit earlier before running the association test.

```{r assoc_aggregate}
# load the null model
nullmodfile <- "AnalysisFiles/null_mixed_model.RData"
nullmod <- get(load(nullmodfile))

# run the burden test
library(GENESIS)
assoc <- assocTestAggregate(iterator,
                            nullmod,
                            test="Burden",
                            AF.max=0.1,
                            weight.beta=c(1,1),
                            verbose = FALSE)
```

```{r}
names(assoc)
```

```{r}
head(assoc$results)
```

```{r}
head(names(assoc$variantInfo))
```

```{r}
assoc$variantInfo[[3]]
```

We can make our usual QQ plot:

```{r}
library(ggplot2)
qqPlot <- function(pval) {
    pval <- pval[!is.na(pval)]
    n <- length(pval)
    x <- 1:n
    dat <- data.frame(obs=sort(pval),
                      exp=x/n,
                      upper=qbeta(0.025, x, rev(x)),
                      lower=qbeta(0.975, x, rev(x)))

    ggplot(dat, aes(-log10(exp), -log10(obs))) +
        geom_line(aes(-log10(exp), -log10(upper)), color="gray") +
        geom_line(aes(-log10(exp), -log10(lower)), color="gray") +
        geom_point() +
        geom_abline(intercept=0, slope=1, color="red") +
        xlab(expression(paste(-log[10], "(expected P)"))) +
        ylab(expression(paste(-log[10], "(observed P)"))) +
        theme_bw()
}

qqPlot(assoc$results$Score.pval)
```

```{r}
seqClose(seqData)
```

## Exercise

1. Since we are working with a subset of the data, many of the genes listed in `group_id` have a very small number of variants. Create a new set of aggregation units based on position, rather than gene name -- create 10 units that are 1 Mb long and span all of the chr1 variants by using the TopmedPipeline function `aggregateGRanges`. Run a SKAT test using those units and a `SeqVarRangeIterator` object.

[Solutions](./Solutions.html#annotation-informed-aggregate-association-tests)


# Links to Exercises and Solutions

#### [Link to exercises](./Exercises.html)
#### [Link to solutions](./Solutions.html)

# R API implementation

## Installation

The [*sevenbridges*](http://bioconductor.org/packages/devel/bioc/vignettes/sevenbridges/inst/doc/api.html#1_introduction) package is available on both the release and devel branches of Bioconductor.

```{r, message=FALSE, error=FALSE, results=FALSE}
# To install it from the devel branch, use:
#install.packages("BiocManager")
#BiocManager::install("sevenbridges", version = "devel")
```
```{r}
# Alternatively, install it from GitHub:
#remotes::install_github("sbg/sevenbridges-r")
```
Before you can access your account via the API, you have to provide your credentials. You can obtain your credentials in the form of an “authentication token” from the Developer Tab on the visual interface. Once you’ve obtained this, create an Auth object, so it remembers your authentication token and the path for the API. All subsequent requests will draw upon these two pieces of information.

```{r, message=FALSE}
#library("sevenbridges")

# Authentication
# Set environment
#sbg_set_env("https://api.sb.biodatacatalyst.nhlbi.nih.gov/v2", "<API-token>")

# Create an Auth object:
#api <- Auth(from = "env")
```

To draft a new task, you need to specify the following:

+ The name of the task
+ An optional description
+ The app id of the workflow you are executing
+ The inputs for your workflow. In the VCF to GDS example below, the only input supplied is the VCF file (`vcf_file`).

You can always check the App details on the visual interface for task input requirements. To find the required inputs with R, you need to get an App object first.

## VCF to GDS

Please copy the public GENESIS VCF to GDS workflow to your project prior to running the API script.

Set up project and application you want to run:
```{r}
# Set project where you want to run the task.
# You can get this from the project URL - username/project_id
#project <- api$project(id = "biodatacatalyst/genesis-tutorial")

# Set application using the app ID
# You should change this to reflect your project, same as above.
#app <- api$app(id = "biodatacatalyst/genesis-tutorial/vcf-to-gds")

# get input matrix
#app$input_matrix()
#app$input_matrix(c("id", "label", "type"))
```

Get input files and create task draft:
```{r, message=FALSE}
# get the input vcf file
#vcf_input <- project$file(name = c('1KG_phase3_subset_chr1.vcf.gz'), exact = TRUE)

# name the new task
#taskName <- "VCF to GDS - R API run"

# draft your task
# you should change the project ID here too
#task <- project$task_add(
#  name = taskName,
#  description = "VCF to GDS api run",
#  app = "biodatacatalyst/genesis-tutorial/vcf-to-gds",
#  inputs = list(vcf_file = vcf_input)
#)

```

```{r}
#task$run()
```

# References

Matthew P. Conomos, Stephanie M. Gogarten, Lisa Brown, Han Chen, Thomas Lumley, Ken Rice, Tamar Sofer, Adrienne Stilp, Timothy Thornton, Chaoyu Yu. *GENetic EStimation and Inference in Structured samples (GENESIS)*. https://bioconductor.org/packages/devel/bioc/manuals/GENESIS/man/GENESIS.pdf.

Xiuwen Zheng, Stephanie Gogarten, David Levine, Cathy Laurie. *SeqArray*. https://www.bioconductor.org/packages/release/bioc/manuals/SeqArray/man/SeqArray.pdf.

Stephanie M. Gogarten, Xiuwen Zheng, Adrienne Stilp. *SeqVarTools*. https://bioconductor.org/packages/devel/bioc/manuals/SeqVarTools/man/SeqVarTools.pdf.

Xiuwen Zheng, Stephanie Gogarten, Cathy Laurie, Bruce Weir. *SNPRelate*. https://www.bioconductor.org/packages/release/bioc/manuals/SNPRelate/man/SNPRelate.pdf.

R. Gentleman, V. Carey, M. Morgan, S. Falcon. *Biobase: Base functions for Bioconductor*. https://www.bioconductor.org/packages/release/bioc/manuals/Biobase/man/Biobase.pdf.

P. Aboyoun, H. Pagès, and M. Lawrence. *Representation and manipulation of genomic intervals*. https://bioconductor.org/packages/release/bioc/manuals/GenomicRanges/man/GenomicRanges.pdf.

*SBG R-API*. http://bioconductor.org/packages/devel/bioc/vignettes/sevenbridges/inst/doc/api.html#1_introduction.