Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- https://beta.etherpad.org/p/compgen_day1
- Aim:
- calculate the lengths of CpG islands that overlap with genes in set A and set B. Visualize this information in an appropriate way. you have access to set A and set B.
- set A
- https://www.dropbox.com/s/bzshp483qtg03r4/SetA.txt?dl=0
- set B
- https://www.dropbox.com/s/ssv1qd32ectizz9/SetB.txt?dl=0
- cheatsheet:
- http://compgenomr.github.io/book/operations-on-genomic-intervals-with-genomicranges-package.html
- Task 0
- Figure out which human genome assembly are setA and setB coming from ?
- Task 1
- Download CpG islands
- hint: http://genome-euro.ucsc.edu go to table browser, you need to select the
- right table to download in BED format
- Task 2
- read CpG islands and gene sets as GRanges objects
- hint: read.table() and GRanges() functions
- ```
- cpg = read.table('cpi_hg18.bed')
- setA = read.table('SetA.txt', header = TRUE, sep = '\t')
- setB = read.table('SetB.txt', header = TRUE)
- gr_cpg = GRanges(seqnames = cpg$V1, IRanges(start = cpg$V2, end = cpg$V3), name = cpg$V4)
- gr_setA = GRanges(seqnames = setA$chr, IRanges(start = setA$start, end = setA$end), gene = setA$EnsembGeneID)
- gr_setB = GRanges(seqnames = setB$chr, IRanges(start = setB$start, end = setB$end), gene = setB$EnsemblGeneID)
- ```
- Task 3
- Overlap CpG islands with Gene coordinates from SetA and Set B
- ```
- gr_ov_setA = subsetByOverlaps(gr_cpg, gr_setA)
- gr_ov_setB = subsetByOverlaps(gr_cpg, gr_setB)
- ```
- Task 4
- plot length distribution of CpG islands overlapping with set A and set B
- hint: width() function on GRanges objects gives you length of intervals
- ```
- wiA = width(gr_ov_setA)
- wiB = width(gr_ov_setB)
- par(mfrow = c(2,1))
- hist(log(wiA))
- hist(log(wiB))
- par(mfrow = c(2,1))
- hist(wiA)
- hist(wiB)
- ```
- Extra task:
- calculate the total CpG island length per gene and plot distribution of that for set A and set B. For each gene sum up the lengths of CpG islands overlapping with the gene, plot the distributions for set A and set B.
- ```
- require(dplyr)
- require(ggplot2)
- # By-gene sum of CpG length for SetA
- A.fO = data.frame(findOverlaps(CpG.gr, SetA.gr))
- A.fO$EnsembGeneID = SetA[A.fO$subjectHits, "EnsembGeneID"]
- A.fO$CpGLength = width(CpG.gr[A.fO$queryHits])
- A.LengthByGene = summarise(group_by(A.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
- A.LengthByGene$Set="A"
- # By-gene sum of CpG length for SetB
- B.fO = data.frame(findOverlaps(CpG.gr, SetB.gr))
- B.fO$EnsembGeneID = SetB[B.fO$subjectHits, "EnsembGeneID"]
- B.fO$CpGLength = width(CpG.gr[B.fO$queryHits])
- B.LengthByGene = summarise(group_by(B.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
- B.LengthByGene$Set="B"
- # Combine SetA and SetB
- LengthByGene = rbind(A.LengthByGene,B.LengthByGene)
- # Plot
- ggplot(data=LengthByGene, aes(x = CpGLengthSumByGene, fill=Set)) +
- geom_density(color=F, alpha=0.2) +
- scale_x_log10() +
- theme_minimal()
- ```
- Making visual summaries of genomic data
- Task 1
- Extract -500,+500 bp regions around TSSes on chr21, there are refseq files in the compGenomRData package or you can
- pull the data out of UCSC browser. Use SP1 ChIP-seq data in the compGenomRData package, access the file path via `system.file() `command
- (wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam` ) to create an average profile of read coverage around TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these possible using `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files.
- ```
- transcriptFilechr21=system.file("extdata",
- "refseq.hg19.chr21.bed",
- package="compGenomRData")
- gene.chr21=readTranscriptFeatures(transcriptFilechr21,
- up.flank = 500,down.flank = 500)
- prom.chr21=gene.chr21$promoters
- bamFile=system.file("extdata",
- "wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
- package="compGenomRData")
- sm=ScoreMatrix(bamFile,prom.chr21,
- type="bam",strand.aware = TRUE)
- plotMeta(sm)
- ```
- Task 2
- Extract -500,+500 bp regions around TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the compGenomRData package and create heatmaps and average signal profiles for regions around the TSSes.
- ```
- transcriptFile=system.file("extdata",
- "refseq.hg19.chr20.bed",
- package="compGenomRData")
- gene.chr20=readTranscriptFeatures(transcriptFile,up.flank = 500,down.flank = 500)
- prom.chr20=gene.chr20$promoters
- h3k4me3File=system.file("extdata",
- "H1.ESC.H3K4me3.chr20.bw",
- package="compGenomRData")
- H3K27acFile=system.file("extdata",
- "H1.ESC.H3K27ac.chr20.bw",
- package="compGenomRData")
- sml=ScoreMatrixList(c(h3k4me3File,
- H3K27acFile),prom.chr20,
- type="bigWig",strand.aware = TRUE)
- # look for the average enrichment
- plotMeta(sml, profile.names = c("H3K4me3","H3K27ac"), xcoords = c(-500,500),
- ylab="log2enrichment",dispersion = "se",
- xlab="bases around TSS")
- multiHeatMatrix(sml,order=TRUE,winsorize = c(0,95),
- matrix.main = c("H3K4me3","H3K27ac"))
- ```
- Task 3
- Download P300 ChIP-seq peaks data from UCSC browser. The peaks are the locations for P300 binding, which marks enhancer regions in the genome. (HINT: group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name , the table contains all transcription factors not just P300). You need to use the hg19 assembly, the bigWig files are from hg19 (http://genome-euro.ucsc.edu/cgi-bin/hgTables)
- Use the mid-point of the peaks as anchors and extend 500 bp around the mid-point as regions of interest. You can use GenomicRanges::resize() function to get 1000bp regions of interest. Ex: `resize(p300.gr,width=1000,fix="center" )`
- Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, use data from compGenomRData package. Make multi-heatmaps and metaplots, what is different from TSS profiles ?
- ```
- library(genomation)
- peaks=readBed("~/Downloads/chip.peaks.bed")
- p300.gr=peaks[mcols(peaks)$name=="EP300",]
- rs.p300.gr=resize(p300.gr,width=1000,fix="center" )
- rs.p300.gr=rs.p300.gr[seqnames(rs.p300.gr)=="chr20",]
- dnaseFile=system.file("extdata",
- "H1.ESC.dnase.chr20.bw",
- package="compGenomRData")
- sml=ScoreMatrixList(c(h3k4me3File,
- H3K27acFile,dnaseFile),rs.p300.gr,
- type="bigWig",strand.aware = FALSE)
- plotMeta(sml)
- ```
- Task 4
- Cluster the rows of multi-heatmaps? Are there obvious clusters ? HINT: check arguments of multiHeatMatrix() function.
- ```
- multiHeatMatrix(sml,
- winsorize = c(0,95),
- matrix.main=c("H3K4me3","H3K27ac","DNase"),
- ,
- clustfun=function(x) kmeans(x, centers=3)$cluster)
- ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement