compgen-day1

https://beta.etherpad.org/p/compgen_day1

Aim:
calculate the lengths of CpG islands that overlap with genes in set A and set B. Visualize this information in an appropriate way. you have access to set A and set B.

set A
https://www.dropbox.com/s/bzshp483qtg03r4/SetA.txt?dl=0

set B
https://www.dropbox.com/s/ssv1qd32ectizz9/SetB.txt?dl=0

cheatsheet:
http://compgenomr.github.io/book/operations-on-genomic-intervals-with-genomicranges-package.html

Task 0
Figure out which human genome assembly are setA and setB coming from ?

Task 1
Download  CpG islands
hint: http://genome-euro.ucsc.edu go to table browser, you need to select the
right table to download in BED format

Task 2
read CpG islands and gene sets as GRanges objects
hint: read.table() and GRanges() functions

```
cpg = read.table('cpi_hg18.bed')
setA = read.table('SetA.txt', header = TRUE, sep = '\t')
setB = read.table('SetB.txt', header = TRUE)

gr_cpg = GRanges(seqnames = cpg$V1, IRanges(start = cpg$V2, end = cpg$V3), name = cpg$V4)
gr_setA = GRanges(seqnames = setA$chr, IRanges(start = setA$start, end = setA$end), gene = setA$EnsembGeneID)
gr_setB = GRanges(seqnames = setB$chr, IRanges(start = setB$start, end = setB$end), gene = setB$EnsemblGeneID)
```
Task 3

    Overlap CpG islands with Gene coordinates from SetA and Set B


```
gr_ov_setA = subsetByOverlaps(gr_cpg, gr_setA)
gr_ov_setB = subsetByOverlaps(gr_cpg, gr_setB)
```

Task 4
plot length distribution of CpG islands overlapping with set A and set B
hint: width() function on GRanges objects gives you length of intervals

```
wiA = width(gr_ov_setA)
wiB = width(gr_ov_setB)

par(mfrow = c(2,1))
hist(log(wiA))
hist(log(wiB))

par(mfrow = c(2,1))
hist(wiA)
hist(wiB)
```

Extra task:
calculate the total CpG island length per gene and plot distribution of that for set A and set B. For each gene sum up the lengths of CpG islands overlapping with the gene, plot the distributions for set A and set B.

```
require(dplyr)
require(ggplot2)

# By-gene sum of CpG length for SetA
A.fO = data.frame(findOverlaps(CpG.gr, SetA.gr))
A.fO$EnsembGeneID = SetA[A.fO$subjectHits, "EnsembGeneID"]
A.fO$CpGLength = width(CpG.gr[A.fO$queryHits])
A.LengthByGene = summarise(group_by(A.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
A.LengthByGene$Set="A"

# By-gene sum of CpG length for SetB
B.fO = data.frame(findOverlaps(CpG.gr, SetB.gr))
B.fO$EnsembGeneID = SetB[B.fO$subjectHits, "EnsembGeneID"]
B.fO$CpGLength = width(CpG.gr[B.fO$queryHits])
B.LengthByGene = summarise(group_by(B.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
B.LengthByGene$Set="B"

# Combine SetA and SetB
LengthByGene = rbind(A.LengthByGene,B.LengthByGene)

# Plot
ggplot(data=LengthByGene, aes(x = CpGLengthSumByGene, fill=Set)) +
  geom_density(color=F, alpha=0.2) +
  scale_x_log10() +
  theme_minimal()
```

Making visual summaries of genomic data

Task 1
Extract -500,+500 bp regions around TSSes on chr21, there are refseq files in the compGenomRData package or you can
pull the data out of UCSC browser. Use SP1 ChIP-seq data in the compGenomRData package, access the file path via `system.file() `command

    (wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam` ) to create an average profile of read coverage around TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these possible using `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files.


```
transcriptFilechr21=system.file("extdata",
                      "refseq.hg19.chr21.bed",
                      package="compGenomRData")
gene.chr21=readTranscriptFeatures(transcriptFilechr21,
                                  up.flank = 500,down.flank = 500)
prom.chr21=gene.chr21$promoters

bamFile=system.file("extdata",
                    "wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
                    package="compGenomRData")
sm=ScoreMatrix(bamFile,prom.chr21,
               type="bam",strand.aware = TRUE)
plotMeta(sm)
```


Task 2

    Extract -500,+500 bp regions around TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the compGenomRData package and create heatmaps and average signal profiles for regions around the TSSes.


```
transcriptFile=system.file("extdata",
                      "refseq.hg19.chr20.bed",
                      package="compGenomRData")
gene.chr20=readTranscriptFeatures(transcriptFile,up.flank = 500,down.flank = 500)
prom.chr20=gene.chr20$promoters


h3k4me3File=system.file("extdata",
                      "H1.ESC.H3K4me3.chr20.bw",
                      package="compGenomRData")

H3K27acFile=system.file("extdata",
                      "H1.ESC.H3K27ac.chr20.bw",
                      package="compGenomRData")

sml=ScoreMatrixList(c(h3k4me3File,
                      H3K27acFile),prom.chr20,
                      type="bigWig",strand.aware = TRUE)

# look for the average enrichment
plotMeta(sml, profile.names = c("H3K4me3","H3K27ac"), xcoords = c(-500,500),
         ylab="log2enrichment",dispersion = "se",
         xlab="bases around TSS")

multiHeatMatrix(sml,order=TRUE,winsorize = c(0,95),
                matrix.main = c("H3K4me3","H3K27ac"))
```


Task 3
Download P300 ChIP-seq peaks data from UCSC browser. The peaks are  the locations for P300 binding, which marks enhancer regions in the genome. (HINT:  group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name , the table contains all transcription factors not just P300). You need to use the hg19 assembly, the bigWig files are from hg19 (http://genome-euro.ucsc.edu/cgi-bin/hgTables)

Use the mid-point of the peaks as anchors and extend 500 bp around the mid-point as regions of interest. You can use GenomicRanges::resize() function to get 1000bp regions of interest. Ex: `resize(p300.gr,width=1000,fix="center" )`


Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, use data from compGenomRData package. Make multi-heatmaps and metaplots, what is different from TSS profiles ?

```
library(genomation)
peaks=readBed("~/Downloads/chip.peaks.bed")
p300.gr=peaks[mcols(peaks)$name=="EP300",]

rs.p300.gr=resize(p300.gr,width=1000,fix="center" )

rs.p300.gr=rs.p300.gr[seqnames(rs.p300.gr)=="chr20",]

dnaseFile=system.file("extdata",
                      "H1.ESC.dnase.chr20.bw",
                      package="compGenomRData")

sml=ScoreMatrixList(c(h3k4me3File,
                      H3K27acFile,dnaseFile),rs.p300.gr,
                      type="bigWig",strand.aware = FALSE)

plotMeta(sml)
```


Task 4
Cluster the rows of multi-heatmaps? Are there obvious clusters ? HINT: check arguments of multiHeatMatrix() function.

```
multiHeatMatrix(sml,
                winsorize = c(0,95),
                matrix.main=c("H3K4me3","H3K27ac","DNase"),
               ,
               clustfun=function(x) kmeans(x, centers=3)$cluster)

```