Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- Signe White
- April 13, 2015
- Summary of Rotation Project for Dr. Timothy Read
- Creating R code to determine which SNPs correlate with antibiotic resistance in various strains of Staphylococcus aureus.
- First, I created a code using MySQL to download all of the data from snp_staph into R.
- library(RMySQL)
- #open SQL connection
- con <- dbConnect(MySQL(), group = "snp_staph", user = "snp_staph", password = "snp_staph", dbname = "snp_staph")
- dbListTables(con)
- #convert SQL tables into R data frames
- annotation <- dbReadTable(con, "annotation")
- indel <- dbReadTable(con, "indel")
- reference <- dbReadTable(con, "reference")
- remark <- dbReadTable(con,"remark")
- sample <- dbReadTable(con,"sample")
- snp <- dbReadTable(con, "snp")
- straintoindel <- dbReadTable(con,"straintoindel")
- straintosnp <- dbReadTable(con, "straintosnp")
- #annotation contains id #, locus tag, gene, gi, and product
- #snp contains id, ref_pos, alt_base, codon, codon_pos, aid
- #remark are quality scores?
- #sample is id, name of strain, origin(?), mlst(sequence typing), cc(?)
- #1,2,3,4 = A, C, G, T
- # gyrA has these variants associated with RES = S84L, E88K, G106D, S85P, E88G, E88L
- # to determine which annotation ID corresponds to a gene (ex. gyrA), we need to look up in annotation table.
- # to do this gyrA <- annotation[annotation$gene == "gyrA",]
- # gryA is annotation id (aid) 7, therefore we can pull snps out for that gene with
- # snpid7 <- snp[snp$aid == 7,]
- #annotation[annotation$gene == "gyrA",]
- #snp[snp$aid == 2468,]
- Second, I found all the genes in the database that were similar to the genes in Table 2 of Gordon et al. 2014.
- The reference genome in the Gordon et al. paper is MSSA476, whereas the snp_staph data base’s reference genome is N135. Genes that were included in both reference genomes were gyrA and rpoB. I searched BLAST for the genes that were in MSSA476 but not in the reference genome to find their analogous genes. *Need to do this.
- Gene gyrA has the amino acid substitutions S84L, E88K, G106D, S85P, E88G, E88L.
- I looked up all id’s associated with gyrA. I began with the first id, 7, that codes for DNA gyrase.
- > annotation[annotation$gene == "gyrA",]
- id locus_tag gene gi product
- 7 7 SA0006 gyrA 15925711 DNA gyrase subunit A
- 2648 2648 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 3951 3951 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 6057 6057 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 8756 8756 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 11497 11497 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 13222 13222 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 15956 15956 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- 16364 16364 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
- This command pulled out all aid’s = 7 from the snp table (list is too long to copy here).
- > snp[snp$aid==7,]
- Next, I looked up each aid 7 and codon associated with antibiotic resistance:
- > snp[snp$aid ==7 & snp$codon == 84, ]
- id ref_pos alt_base codon codon_pos aid
- 64021 64021 7266 1 88 1 7
- 152937 152937 7267 3 88 2 7
- 342053 342053 7267 4 88 2 7
- 342147 342147 7267 2 88 2 7
- 342716 342716 7266 2 88 1 7
- I then repeated this for the other genes included in gyrA.
- I made a spreadsheet with just a codon site for each gene, filling in information in Excel in this order:
- Codon | Codon Position | Original Base | Alternative Base | Reference Position | ID
- To find the “original base:”
- > reference[reference$pos==7266,]
- I then created a tab delimited text file (.txt) for each and uploaded them into R studio files. I was careful to make sure that all headers were identical before uploading.
- I then created an R code to determine the wild type codon, mutant codon, wild type amino acid, and mutant amino acid:
- For each .txt file you import, change the first line to be the name of the file (read.table) such as a gyrA.txt
- Change the last line (write.table) to be whatever you want output to be called, such as gyrA_muts.
- x <- read.table("gyrA.txt", header = T, sep = "\t")
- j <- 0
- for(i in x$Codon.Position){
- #the j keeps track of where in the matrix you are
- j <- j + 1
- if(i == 1){
- #lookup position of 2 following bases from 1st base in codon
- a <- (x[j,5])
- b <- (x[j,5]+1)
- c <- (x[j,5]+2)
- #now lookup what the base is in codon
- k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
- #k.alt is the alternative base for that position
- k.alt <- chartr("1234", "ACGT",(x[j,4]))
- #l and m are the other two bases in the codon so we can figure out the correct AA
- l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
- m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
- #now lookup what nucleotides those are
- x[j,7]<- paste(k,l,m, sep="")
- x[j,8] <- paste(k.alt,l,m,sep="")
- }
- if(i == 2){
- #lookup position of 1 forward base and one behind base
- a <- (x[j,5] -1)
- b <- (x[j,5])
- c <- (x[j,5]+1)
- #now lookup what the base is in codon
- k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
- #k.alt is the alternative base for that position
- k.alt <- chartr("1234", "ACGT",(x[j,4]))
- #l and m are the other two bases in the codon so we can figure out the correct AA
- l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
- m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
- #now lookup what nucleotides those are
- x[j,7]<- paste(k,l,m, sep="")
- x[j,8] <- paste(k,k.alt,m,sep="")
- }
- if(i == 3){
- #lookup position of 2 previous bases
- a <- (x[j,5] -2)
- b <- (x[j,5] -1)
- c <- (x[j,5])
- #now lookup what the base is in codon
- k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
- #k.alt is the alternative base for that position
- k.alt <- chartr("1234", "ACGT",(x[j,4]))
- #l and m are the other two bases in the codon so we can figure out the correct AA
- l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
- m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))
- #now lookup what nucleotides those are
- x[j,7]<- paste(k,l,m, sep="")
- x[j,8] <- paste(k,l,k.alt,sep="")
- }
- }
- colnames(x)[colnames(x)=="V7"] <- "Wild.Type"
- colnames(x)[colnames(x)=="V8"] <- "Mutant"
- j <- 0
- #translate wild type
- for(k in x$Wild.Type){
- j <- j+1
- #print(j)
- tmp <- translate(s2c(k))
- x[j,9] <- tmp
- }
- colnames(x)[colnames(x)=="V9"] <- "Wild.TypeAA"
- j <- 0
- for(k in x$Mutant){
- j <- j+1
- #print(j)
- tmp <- translate(s2c(k))
- x[j,10] <- tmp
- }
- colnames(x)[colnames(x)=="V10"] <- "MutantAA"
- write.table(x = x, file = "gyrA_muts",quote = FALSE)
- Next, I exported this table from R as a tab delimited file and open it from within Excel. In the pop-up window click the “delimited” and “space” tabs to make sure the data is in its own cell.
- Then I highlighted each relevant drug resistance mutation in the table.
- It will look something like this:
- Next, I created a code to find out which bacterial samples had this antibiotic resistant mutation, beginning with the first highlighted row:
- *Note: as you can see, there are many duplicates of codons and their corresponding information in this table.
- #to find which bacterial samples contain codon S84L
- > straintosnp[straintosnp$snpid == 121958,]
- The output of this command:
- sid snpid rid
- 2028147 132 121958 1
- 11476732 635 121958 1
- 11669148 646 121958 1
- 11687138 647 121958 1
- 11705116 648 121958 1
- 15351318 810 121958 1
- 21383658 1110 121958 1
- 21401574 1111 121958 1
- 21473767 1115 121958 1
- 22879076 1193 121958 1
- 24634236 1292 121958 1
- 37166411 1900 121958 1
- 44252787 2151 121958 1
- Give the above command a title:
- gyrAS84L <- straintosnp[straintosnp$snpid == 121958,]
- Next, I created a for loop to determine which strains have the above sids.
- >for(i in gyrAS84L$sid){print(sample[sample$id==i,])}
- Which will give you this output:
- id name origin mlst cc
- 103 132 ERX015563 NA 0 0
- id name origin mlst cc
- 606 635 ERX042901 NA 239 239
- id name origin mlst cc
- 617 646 ERX042912 NA 239 239
- id name origin mlst cc
- 618 647 ERX042913 NA 239 239
- id name origin mlst cc
- 619 648 ERX042914 NA 239 239
- id name origin mlst cc
- 781 810 ERX062379 NA 8 8
- id name origin mlst cc
- 1081 1110 ERX063947 NA 1282 239
- id name origin mlst cc
- 1082 1111 ERX063948 NA 1282 239
- id name origin mlst cc
- 1086 1115 ERX063952 NA 1282 239
- id name origin mlst cc
- 1164 1193 ERX086645 NA 239 239
- id name origin mlst cc
- 1263 1292 ERX095299 NA 1282 239
- id name origin mlst cc
- 1871 1900 ERX136667 NA 398 398
- id name origin mlst cc
- 2122 2151 ERX139328 NA 8 8
- I then saved this data as a .txt file.
- Alternative gene names:
- In the paper, fusA = “fus” in N315
- In the paper, grlA = “parC” in N315
- In the paper, grlB = “parE” in N315 -> I could not find any of the listed aa substitutions for parE
- In the paper, rpoB = “drfA” in N315??
- In the paper, dfrB = “dfrA” in N315 -> I could not find any of the listed aa substitutions for dfrA
- Maybe this is the same gene in N315?
- For parC -> S80F and S80Y there are too many strains that I cannot see them all in the console and I’m not sure how to make a for loop output a text file.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement