Advertisement
Guest User

Untitled

a guest
Aug 4th, 2017
107
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 9.63 KB | None | 0 0
  1. Signe White
  2. April 13, 2015
  3.  
  4. Summary of Rotation Project for Dr. Timothy Read
  5.  
  6. Creating R code to determine which SNPs correlate with antibiotic resistance in various strains of Staphylococcus aureus.
  7.  
  8. First, I created a code using MySQL to download all of the data from snp_staph into R.
  9.  
  10. library(RMySQL)
  11.  
  12. #open SQL connection
  13. con <- dbConnect(MySQL(), group = "snp_staph", user = "snp_staph", password = "snp_staph", dbname = "snp_staph")
  14. dbListTables(con)
  15.  
  16. #convert SQL tables into R data frames
  17. annotation <- dbReadTable(con, "annotation")
  18. indel <- dbReadTable(con, "indel")
  19. reference <- dbReadTable(con, "reference")
  20. remark <- dbReadTable(con,"remark")
  21. sample <- dbReadTable(con,"sample")
  22. snp <- dbReadTable(con, "snp")
  23. straintoindel <- dbReadTable(con,"straintoindel")
  24. straintosnp <- dbReadTable(con, "straintosnp")
  25.  
  26. #annotation contains id #, locus tag, gene, gi, and product
  27. #snp contains id, ref_pos, alt_base, codon, codon_pos, aid
  28. #remark are quality scores?
  29. #sample is id, name of strain, origin(?), mlst(sequence typing), cc(?)
  30. #1,2,3,4 = A, C, G, T
  31.  
  32. # gyrA has these variants associated with RES = S84L, E88K, G106D, S85P, E88G, E88L
  33.  
  34. # to determine which annotation ID corresponds to a gene (ex. gyrA), we need to look up in annotation table.
  35. # to do this gyrA <- annotation[annotation$gene == "gyrA",]
  36. # gryA is annotation id (aid) 7, therefore we can pull snps out for that gene with
  37. # snpid7 <- snp[snp$aid == 7,]
  38.  
  39. #annotation[annotation$gene == "gyrA",]
  40. #snp[snp$aid == 2468,]
  41.  
  42. Second, I found all the genes in the database that were similar to the genes in Table 2 of Gordon et al. 2014.
  43.  
  44.  
  45. The reference genome in the Gordon et al. paper is MSSA476, whereas the snp_staph data base’s reference genome is N135. Genes that were included in both reference genomes were gyrA and rpoB. I searched BLAST for the genes that were in MSSA476 but not in the reference genome to find their analogous genes. *Need to do this.
  46.  
  47. Gene gyrA has the amino acid substitutions S84L, E88K, G106D, S85P, E88G, E88L.
  48. I looked up all id’s associated with gyrA. I began with the first id, 7, that codes for DNA gyrase.
  49.  
  50. > annotation[annotation$gene == "gyrA",]
  51. id locus_tag gene gi product
  52. 7 7 SA0006 gyrA 15925711 DNA gyrase subunit A
  53. 2648 2648 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  54. 3951 3951 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  55. 6057 6057 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  56. 8756 8756 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  57. 11497 11497 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  58. 13222 13222 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  59. 15956 15956 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  60. 16364 16364 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
  61.  
  62. This command pulled out all aid’s = 7 from the snp table (list is too long to copy here).
  63. > snp[snp$aid==7,]
  64.  
  65. Next, I looked up each aid 7 and codon associated with antibiotic resistance:
  66. > snp[snp$aid ==7 & snp$codon == 84, ]
  67. id ref_pos alt_base codon codon_pos aid
  68. 64021 64021 7266 1 88 1 7
  69. 152937 152937 7267 3 88 2 7
  70. 342053 342053 7267 4 88 2 7
  71. 342147 342147 7267 2 88 2 7
  72. 342716 342716 7266 2 88 1 7
  73.  
  74. I then repeated this for the other genes included in gyrA.
  75.  
  76. I made a spreadsheet with just a codon site for each gene, filling in information in Excel in this order:
  77.  
  78. Codon | Codon Position | Original Base | Alternative Base | Reference Position | ID
  79.  
  80. To find the “original base:”
  81. > reference[reference$pos==7266,]
  82.  
  83. I then created a tab delimited text file (.txt) for each and uploaded them into R studio files. I was careful to make sure that all headers were identical before uploading.
  84.  
  85.  
  86.  
  87. I then created an R code to determine the wild type codon, mutant codon, wild type amino acid, and mutant amino acid:
  88.  
  89. For each .txt file you import, change the first line to be the name of the file (read.table) such as a gyrA.txt
  90. Change the last line (write.table) to be whatever you want output to be called, such as gyrA_muts.
  91.  
  92. x <- read.table("gyrA.txt", header = T, sep = "\t")
  93. j <- 0
  94.  
  95. for(i in x$Codon.Position){
  96. #the j keeps track of where in the matrix you are
  97. j <- j + 1
  98. if(i == 1){
  99. #lookup position of 2 following bases from 1st base in codon
  100. a <- (x[j,5])
  101. b <- (x[j,5]+1)
  102. c <- (x[j,5]+2)
  103. #now lookup what the base is in codon
  104. k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
  105. #k.alt is the alternative base for that position
  106. k.alt <- chartr("1234", "ACGT",(x[j,4]))
  107. #l and m are the other two bases in the codon so we can figure out the correct AA
  108. l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
  109. m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
  110.  
  111. #now lookup what nucleotides those are
  112. x[j,7]<- paste(k,l,m, sep="")
  113. x[j,8] <- paste(k.alt,l,m,sep="")
  114. }
  115. if(i == 2){
  116. #lookup position of 1 forward base and one behind base
  117. a <- (x[j,5] -1)
  118. b <- (x[j,5])
  119. c <- (x[j,5]+1)
  120. #now lookup what the base is in codon
  121. k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
  122. #k.alt is the alternative base for that position
  123. k.alt <- chartr("1234", "ACGT",(x[j,4]))
  124. #l and m are the other two bases in the codon so we can figure out the correct AA
  125. l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
  126. m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
  127.  
  128. #now lookup what nucleotides those are
  129. x[j,7]<- paste(k,l,m, sep="")
  130. x[j,8] <- paste(k,k.alt,m,sep="")
  131.  
  132. }
  133. if(i == 3){
  134. #lookup position of 2 previous bases
  135. a <- (x[j,5] -2)
  136. b <- (x[j,5] -1)
  137. c <- (x[j,5])
  138. #now lookup what the base is in codon
  139. k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
  140. #k.alt is the alternative base for that position
  141. k.alt <- chartr("1234", "ACGT",(x[j,4]))
  142. #l and m are the other two bases in the codon so we can figure out the correct AA
  143. l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
  144. m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
  145.  
  146. #now lookup what nucleotides those are
  147. x[j,7]<- paste(k,l,m, sep="")
  148. x[j,8] <- paste(k,l,k.alt,sep="")
  149.  
  150. }
  151. }
  152.  
  153. colnames(x)[colnames(x)=="V7"] <- "Wild.Type"
  154. colnames(x)[colnames(x)=="V8"] <- "Mutant"
  155. j <- 0
  156. #translate wild type
  157. for(k in x$Wild.Type){
  158. j <- j+1
  159. #print(j)
  160. tmp <- translate(s2c(k))
  161. x[j,9] <- tmp
  162. }
  163. colnames(x)[colnames(x)=="V9"] <- "Wild.TypeAA"
  164. j <- 0
  165. for(k in x$Mutant){
  166. j <- j+1
  167. #print(j)
  168. tmp <- translate(s2c(k))
  169. x[j,10] <- tmp
  170. }
  171. colnames(x)[colnames(x)=="V10"] <- "MutantAA"
  172.  
  173. write.table(x = x, file = "gyrA_muts",quote = FALSE)
  174.  
  175. Next, I exported this table from R as a tab delimited file and open it from within Excel. In the pop-up window click the “delimited” and “space” tabs to make sure the data is in its own cell.
  176.  
  177. Then I highlighted each relevant drug resistance mutation in the table.
  178.  
  179. It will look something like this:
  180.  
  181.  
  182.  
  183. Next, I created a code to find out which bacterial samples had this antibiotic resistant mutation, beginning with the first highlighted row:
  184. *Note: as you can see, there are many duplicates of codons and their corresponding information in this table.
  185.  
  186. #to find which bacterial samples contain codon S84L
  187. > straintosnp[straintosnp$snpid == 121958,]
  188.  
  189. The output of this command:
  190. sid snpid rid
  191. 2028147 132 121958 1
  192. 11476732 635 121958 1
  193. 11669148 646 121958 1
  194. 11687138 647 121958 1
  195. 11705116 648 121958 1
  196. 15351318 810 121958 1
  197. 21383658 1110 121958 1
  198. 21401574 1111 121958 1
  199. 21473767 1115 121958 1
  200. 22879076 1193 121958 1
  201. 24634236 1292 121958 1
  202. 37166411 1900 121958 1
  203. 44252787 2151 121958 1
  204.  
  205. Give the above command a title:
  206. gyrAS84L <- straintosnp[straintosnp$snpid == 121958,]
  207.  
  208. Next, I created a for loop to determine which strains have the above sids.
  209.  
  210. >for(i in gyrAS84L$sid){print(sample[sample$id==i,])}
  211.  
  212. Which will give you this output:
  213.  
  214. id name origin mlst cc
  215. 103 132 ERX015563 NA 0 0
  216. id name origin mlst cc
  217. 606 635 ERX042901 NA 239 239
  218. id name origin mlst cc
  219. 617 646 ERX042912 NA 239 239
  220. id name origin mlst cc
  221. 618 647 ERX042913 NA 239 239
  222. id name origin mlst cc
  223. 619 648 ERX042914 NA 239 239
  224. id name origin mlst cc
  225. 781 810 ERX062379 NA 8 8
  226. id name origin mlst cc
  227. 1081 1110 ERX063947 NA 1282 239
  228. id name origin mlst cc
  229. 1082 1111 ERX063948 NA 1282 239
  230. id name origin mlst cc
  231. 1086 1115 ERX063952 NA 1282 239
  232. id name origin mlst cc
  233. 1164 1193 ERX086645 NA 239 239
  234. id name origin mlst cc
  235. 1263 1292 ERX095299 NA 1282 239
  236. id name origin mlst cc
  237. 1871 1900 ERX136667 NA 398 398
  238. id name origin mlst cc
  239. 2122 2151 ERX139328 NA 8 8
  240.  
  241. I then saved this data as a .txt file.
  242.  
  243. Alternative gene names:
  244. In the paper, fusA = “fus” in N315
  245. In the paper, grlA = “parC” in N315
  246. In the paper, grlB = “parE” in N315 -> I could not find any of the listed aa substitutions for parE
  247. In the paper, rpoB = “drfA” in N315??
  248. In the paper, dfrB = “dfrA” in N315 -> I could not find any of the listed aa substitutions for dfrA
  249. Maybe this is the same gene in N315?
  250.  
  251. For parC -> S80F and S80Y there are too many strains that I cannot see them all in the console and I’m not sure how to make a for loop output a text file.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement