Advertisement
Guest User

compgen-day1

a guest
May 22nd, 2019
43
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
R 7.06 KB | None | 0 0
  1. https://beta.etherpad.org/p/compgen_day1
  2.  
  3. Aim:
  4. calculate the lengths of CpG islands that overlap with genes in set A and set B. Visualize this information in an appropriate way. you have access to set A and set B.
  5.  
  6. set A
  7. https://www.dropbox.com/s/bzshp483qtg03r4/SetA.txt?dl=0
  8.  
  9. set B
  10. https://www.dropbox.com/s/ssv1qd32ectizz9/SetB.txt?dl=0
  11.  
  12. cheatsheet:
  13. http://compgenomr.github.io/book/operations-on-genomic-intervals-with-genomicranges-package.html
  14.  
  15. Task 0
  16. Figure out which human genome assembly are setA and setB coming from ?
  17.  
  18. Task 1
  19. Download  CpG islands
  20. hint: http://genome-euro.ucsc.edu go to table browser, you need to select the
  21. right table to download in BED format
  22.  
  23. Task 2
  24. read CpG islands and gene sets as GRanges objects
  25. hint: read.table() and GRanges() functions
  26.  
  27. ```
  28. cpg = read.table('cpi_hg18.bed')
  29. setA = read.table('SetA.txt', header = TRUE, sep = '\t')
  30. setB = read.table('SetB.txt', header = TRUE)
  31.  
  32. gr_cpg = GRanges(seqnames = cpg$V1, IRanges(start = cpg$V2, end = cpg$V3), name = cpg$V4)
  33. gr_setA = GRanges(seqnames = setA$chr, IRanges(start = setA$start, end = setA$end), gene = setA$EnsembGeneID)
  34. gr_setB = GRanges(seqnames = setB$chr, IRanges(start = setB$start, end = setB$end), gene = setB$EnsemblGeneID)
  35. ```
  36. Task 3
  37.  
  38.     Overlap CpG islands with Gene coordinates from SetA and Set B
  39.  
  40.  
  41. ```
  42. gr_ov_setA = subsetByOverlaps(gr_cpg, gr_setA)
  43. gr_ov_setB = subsetByOverlaps(gr_cpg, gr_setB)
  44. ```
  45.  
  46. Task 4
  47. plot length distribution of CpG islands overlapping with set A and set B
  48. hint: width() function on GRanges objects gives you length of intervals
  49.  
  50. ```
  51. wiA = width(gr_ov_setA)
  52. wiB = width(gr_ov_setB)
  53.  
  54. par(mfrow = c(2,1))
  55. hist(log(wiA))
  56. hist(log(wiB))
  57.  
  58. par(mfrow = c(2,1))
  59. hist(wiA)
  60. hist(wiB)
  61. ```
  62.  
  63. Extra task:
  64. calculate the total CpG island length per gene and plot distribution of that for set A and set B. For each gene sum up the lengths of CpG islands overlapping with the gene, plot the distributions for set A and set B.
  65.  
  66. ```
  67. require(dplyr)
  68. require(ggplot2)
  69.  
  70. # By-gene sum of CpG length for SetA
  71. A.fO = data.frame(findOverlaps(CpG.gr, SetA.gr))
  72. A.fO$EnsembGeneID = SetA[A.fO$subjectHits, "EnsembGeneID"]
  73. A.fO$CpGLength = width(CpG.gr[A.fO$queryHits])
  74. A.LengthByGene = summarise(group_by(A.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
  75. A.LengthByGene$Set="A"
  76.  
  77. # By-gene sum of CpG length for SetB
  78. B.fO = data.frame(findOverlaps(CpG.gr, SetB.gr))
  79. B.fO$EnsembGeneID = SetB[B.fO$subjectHits, "EnsembGeneID"]
  80. B.fO$CpGLength = width(CpG.gr[B.fO$queryHits])
  81. B.LengthByGene = summarise(group_by(B.fO, EnsembGeneID), CpGLengthSumByGene = sum(CpGLength))
  82. B.LengthByGene$Set="B"
  83.  
  84. # Combine SetA and SetB
  85. LengthByGene = rbind(A.LengthByGene,B.LengthByGene)
  86.  
  87. # Plot
  88. ggplot(data=LengthByGene, aes(x = CpGLengthSumByGene, fill=Set)) +
  89.   geom_density(color=F, alpha=0.2) +
  90.   scale_x_log10() +
  91.   theme_minimal()
  92. ```
  93.  
  94. Making visual summaries of genomic data
  95.  
  96. Task 1
  97. Extract -500,+500 bp regions around TSSes on chr21, there are refseq files in the compGenomRData package or you can
  98. pull the data out of UCSC browser. Use SP1 ChIP-seq data in the compGenomRData package, access the file path via `system.file() `command
  99.  
  100.     (wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam` ) to create an average profile of read coverage around TSSes. Following that, visualize the read coverage with a heatmap. **HINT**: All of these possible using `genomation` package functions. Check `help(ScoreMatrix)` to see how you can use bam files.
  101.  
  102.  
  103. ```
  104. transcriptFilechr21=system.file("extdata",
  105.                       "refseq.hg19.chr21.bed",
  106.                       package="compGenomRData")
  107. gene.chr21=readTranscriptFeatures(transcriptFilechr21,
  108.                                   up.flank = 500,down.flank = 500)
  109. prom.chr21=gene.chr21$promoters
  110.  
  111. bamFile=system.file("extdata",
  112.                     "wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
  113.                     package="compGenomRData")
  114. sm=ScoreMatrix(bamFile,prom.chr21,
  115.                type="bam",strand.aware = TRUE)
  116. plotMeta(sm)
  117. ```
  118.  
  119.  
  120. Task 2
  121.  
  122.     Extract -500,+500 bp regions around TSSes on chr20. Use H3K4me3 (`H1.ESC.H3K4me3.chr20.bw`) and H3K27ac (`H1.ESC.H3K27ac.chr20.bw`) ChIP-seq enrichment data in the compGenomRData package and create heatmaps and average signal profiles for regions around the TSSes.
  123.  
  124.  
  125. ```
  126. transcriptFile=system.file("extdata",
  127.                       "refseq.hg19.chr20.bed",
  128.                       package="compGenomRData")
  129. gene.chr20=readTranscriptFeatures(transcriptFile,up.flank = 500,down.flank = 500)
  130. prom.chr20=gene.chr20$promoters
  131.  
  132.  
  133. h3k4me3File=system.file("extdata",
  134.                       "H1.ESC.H3K4me3.chr20.bw",
  135.                       package="compGenomRData")
  136.  
  137. H3K27acFile=system.file("extdata",
  138.                       "H1.ESC.H3K27ac.chr20.bw",
  139.                       package="compGenomRData")
  140.  
  141. sml=ScoreMatrixList(c(h3k4me3File,
  142.                       H3K27acFile),prom.chr20,
  143.                       type="bigWig",strand.aware = TRUE)
  144.  
  145. # look for the average enrichment
  146. plotMeta(sml, profile.names = c("H3K4me3","H3K27ac"), xcoords = c(-500,500),
  147.          ylab="log2enrichment",dispersion = "se",
  148.          xlab="bases around TSS")
  149.  
  150. multiHeatMatrix(sml,order=TRUE,winsorize = c(0,95),
  151.                 matrix.main = c("H3K4me3","H3K27ac"))
  152. ```
  153.  
  154.  
  155. Task 3
  156. Download P300 ChIP-seq peaks data from UCSC browser. The peaks are  the locations for P300 binding, which marks enhancer regions in the genome. (HINT:  group: "regulation", track: "Txn Factor ChIP", table:"wgEncodeRegTfbsClusteredV3", you need to filter the rows for "EP300" name , the table contains all transcription factors not just P300). You need to use the hg19 assembly, the bigWig files are from hg19 (http://genome-euro.ucsc.edu/cgi-bin/hgTables)
  157.  
  158. Use the mid-point of the peaks as anchors and extend 500 bp around the mid-point as regions of interest. You can use GenomicRanges::resize() function to get 1000bp regions of interest. Ex: `resize(p300.gr,width=1000,fix="center" )`
  159.  
  160.  
  161. Check enrichment of H3K4me3, H3K27ac and DNase-seq (`H1.ESC.dnase.chr20.bw`) experiments on chr20, use data from compGenomRData package. Make multi-heatmaps and metaplots, what is different from TSS profiles ?
  162.  
  163. ```
  164. library(genomation)
  165. peaks=readBed("~/Downloads/chip.peaks.bed")
  166. p300.gr=peaks[mcols(peaks)$name=="EP300",]
  167.  
  168. rs.p300.gr=resize(p300.gr,width=1000,fix="center" )
  169.  
  170. rs.p300.gr=rs.p300.gr[seqnames(rs.p300.gr)=="chr20",]
  171.  
  172. dnaseFile=system.file("extdata",
  173.                       "H1.ESC.dnase.chr20.bw",
  174.                       package="compGenomRData")
  175.  
  176. sml=ScoreMatrixList(c(h3k4me3File,
  177.                       H3K27acFile,dnaseFile),rs.p300.gr,
  178.                       type="bigWig",strand.aware = FALSE)
  179.  
  180. plotMeta(sml)
  181. ```
  182.  
  183.  
  184. Task 4
  185. Cluster the rows of multi-heatmaps? Are there obvious clusters ? HINT: check arguments of multiHeatMatrix() function.
  186.  
  187. ```
  188. multiHeatMatrix(sml,
  189.                 winsorize = c(0,95),
  190.                 matrix.main=c("H3K4me3","H3K27ac","DNase"),
  191.                ,
  192.                clustfun=function(x) kmeans(x, centers=3)$cluster)
  193.                
  194. ```
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement