Untitled

Signe White
April 13, 2015

Summary of Rotation Project for Dr. Timothy Read

Creating R code to determine which SNPs correlate with antibiotic resistance in various strains of Staphylococcus aureus.

First, I created a code using MySQL to download all of the data from snp_staph into R.

library(RMySQL)

#open SQL connection
con <- dbConnect(MySQL(), group = "snp_staph", user = "snp_staph", password = "snp_staph", dbname = "snp_staph")
dbListTables(con)

#convert SQL tables into R data frames
annotation <- dbReadTable(con, "annotation")
indel <- dbReadTable(con, "indel")
reference <- dbReadTable(con, "reference")
remark <- dbReadTable(con,"remark")
sample <- dbReadTable(con,"sample")
snp <- dbReadTable(con, "snp")
straintoindel <- dbReadTable(con,"straintoindel")
straintosnp <- dbReadTable(con, "straintosnp")

#annotation contains id #, locus tag, gene, gi, and product
#snp contains id, ref_pos, alt_base, codon, codon_pos, aid
#remark are quality scores?
#sample is id, name of strain, origin(?), mlst(sequence typing), cc(?)
#1,2,3,4 = A, C, G, T

# gyrA has these variants associated with RES = S84L, E88K, G106D, S85P, E88G, E88L

# to determine which annotation ID corresponds to a gene (ex. gyrA), we need to look up in annotation table.
# to do this gyrA <- annotation[annotation$gene == "gyrA",]
# gryA is annotation id (aid) 7, therefore we can pull snps out for that gene with
# snpid7 <- snp[snp$aid == 7,]

#annotation[annotation$gene == "gyrA",]
#snp[snp$aid == 2468,]

Second, I found all the genes in the database that were similar to the genes in Table 2 of Gordon et al. 2014.


The reference genome in the Gordon et al. paper is MSSA476, whereas the snp_staph data base’s reference genome is N135. Genes that were included in both reference genomes were gyrA and rpoB. I searched BLAST for the genes that were in MSSA476 but not in the reference genome to find their analogous genes. *Need to do this.

Gene gyrA has the amino acid substitutions S84L, E88K, G106D, S85P, E88G, E88L.
I looked up all id’s associated with gyrA. I began with the first id, 7, that codes for DNA gyrase.

> annotation[annotation$gene == "gyrA",]
         id    locus_tag gene        gi                                       product
7         7       SA0006 gyrA  15925711                          DNA gyrase subunit A
2648   2648 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
3951   3951 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
6057   6057 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
8756   8756 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
11497 11497 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
13222 13222 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
15956 15956 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A
16364 16364 USA300HOU_00 gyrA 161508272 DNA topoisomerase (ATP-hydrolyzing) subunit A

This command pulled out all aid’s = 7 from the snp table (list is too long to copy here).
> snp[snp$aid==7,]

Next, I looked up each aid 7 and codon associated with antibiotic resistance:
> snp[snp$aid ==7 & snp$codon == 84, ]
   id    ref_pos alt_base codon 	codon_pos aid
64021   64021    7266        1    88         1    7
152937 152937    7267        3    88         2    7
342053 342053    7267        4    88         2    7
342147 342147    7267        2    88         2    7
342716 342716    7266        2    88         1    7

I then repeated this for the other genes included in gyrA.

I made a spreadsheet with just a codon site for each gene, filling in information in Excel in this order:

Codon | Codon Position | Original Base | Alternative Base | Reference Position | ID

To find the “original base:”
> reference[reference$pos==7266,]

I then created a tab delimited text file (.txt) for each and uploaded them into R studio files. I was careful to make sure that all headers were identical before uploading.


I then created an R code to determine the wild type codon, mutant codon, wild type amino acid, and mutant amino acid:

For each .txt file you import, change the first line to be the name of the file (read.table) such as a gyrA.txt
Change the last line (write.table) to be whatever you want output to be called, such as gyrA_muts.

x <- read.table("gyrA.txt", header = T, sep = "\t")
j <- 0

for(i in x$Codon.Position){
  #the j keeps track of where in the matrix you are
  j <- j + 1
  if(i == 1){
    #lookup position of 2 following bases from 1st base in codon
    a <- (x[j,5])
    b <- (x[j,5]+1)
    c <- (x[j,5]+2)
    #now lookup what the base is in codon
    k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
    #k.alt is the alternative base for that position
    k.alt <- chartr("1234", "ACGT",(x[j,4]))
    #l and m are the other two bases in the codon so we can figure out the correct AA
    l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
    m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))

    #now lookup what nucleotides those are
    x[j,7]<- paste(k,l,m, sep="")
    x[j,8] <- paste(k.alt,l,m,sep="")
  }
  if(i == 2){
    #lookup position of 1 forward base and one behind base
    a <- (x[j,5] -1)
    b <- (x[j,5])
    c <- (x[j,5]+1)
    #now lookup what the base is in codon
    k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
    #k.alt is the alternative base for that position
    k.alt <- chartr("1234", "ACGT",(x[j,4]))
    #l and m are the other two bases in the codon so we can figure out the correct AA
    l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
    m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))

    #now lookup what nucleotides those are
    x[j,7]<- paste(k,l,m, sep="")
    x[j,8] <- paste(k,k.alt,m,sep="")

  }
  if(i == 3){
    #lookup position of 2 previous bases
    a <- (x[j,5] -2)
    b <- (x[j,5] -1)
    c <- (x[j,5])
    #now lookup what the base is in codon
    k <- chartr("1234", "ACGT",(reference$base[reference$pos == a]))
    #k.alt is the alternative base for that position
    k.alt <- chartr("1234", "ACGT",(x[j,4]))
    #l and m are the other two bases in the codon so we can figure out the correct AA
    l <- chartr("1234", "ACGT",(reference$base[reference$pos == b]))
    m <- chartr("1234", "ACGT",(reference$base[reference$pos == c]))

    #now lookup what nucleotides those are
    x[j,7]<- paste(k,l,m, sep="")
    x[j,8] <- paste(k,l,k.alt,sep="")

  }
}

colnames(x)[colnames(x)=="V7"] <- "Wild.Type"
colnames(x)[colnames(x)=="V8"] <- "Mutant"
j <- 0
#translate wild type
for(k in x$Wild.Type){
  j <- j+1
  #print(j)
  tmp <- translate(s2c(k))
  x[j,9] <- tmp
}
colnames(x)[colnames(x)=="V9"] <- "Wild.TypeAA"
j <- 0
for(k in x$Mutant){
  j <- j+1
  #print(j)
  tmp <- translate(s2c(k))
  x[j,10] <- tmp
}
colnames(x)[colnames(x)=="V10"] <- "MutantAA"

write.table(x = x, file = "gyrA_muts",quote = FALSE)

Next, I exported this table from R as a tab delimited file and open it from within Excel. In the pop-up window click the “delimited” and “space” tabs to make sure the data is in its own cell.

Then I highlighted each relevant drug resistance mutation in the table.

It will look something like this:


Next, I created a code to find out which bacterial samples had this antibiotic resistant mutation, beginning with the first highlighted row:
*Note: as you can see, there are many duplicates of codons and their corresponding information in this table.

#to find which bacterial samples contain codon S84L
> straintosnp[straintosnp$snpid == 121958,]

The output of this command:
          sid  snpid rid
2028147   132 121958   1
11476732  635 121958   1
11669148  646 121958   1
11687138  647 121958   1
11705116  648 121958   1
15351318  810 121958   1
21383658 1110 121958   1
21401574 1111 121958   1
21473767 1115 121958   1
22879076 1193 121958   1
24634236 1292 121958   1
37166411 1900 121958   1
44252787 2151 121958   1

Give the above command a title:
gyrAS84L <- straintosnp[straintosnp$snpid == 121958,]

Next, I created a for loop to determine which strains have the above sids.

>for(i in gyrAS84L$sid){print(sample[sample$id==i,])}

Which will give you this output:

     id      name origin mlst cc
103 132 ERX015563     NA    0  0
     id      name origin mlst  cc
606 635 ERX042901     NA  239 239
     id      name origin mlst  cc
617 646 ERX042912     NA  239 239
     id      name origin mlst  cc
618 647 ERX042913     NA  239 239
     id      name origin mlst  cc
619 648 ERX042914     NA  239 239
     id      name origin mlst cc
781 810 ERX062379     NA    8  8
       id      name origin mlst  cc
1081 1110 ERX063947     NA 1282 239
       id      name origin mlst  cc
1082 1111 ERX063948     NA 1282 239
       id      name origin mlst  cc
1086 1115 ERX063952     NA 1282 239
       id      name origin mlst  cc
1164 1193 ERX086645     NA  239 239
       id      name origin mlst  cc
1263 1292 ERX095299     NA 1282 239
       id      name origin mlst  cc
1871 1900 ERX136667     NA  398 398
       id      name origin mlst cc
2122 2151 ERX139328     NA    8  8

I then saved this data as a .txt file.

Alternative gene names:
In the paper, fusA = “fus” in N315
In the paper, grlA = “parC” in N315
In the paper, grlB = “parE” in N315 -> I could not find any of the listed aa substitutions for parE
In the paper, rpoB = “drfA” in N315??
In the paper, dfrB = “dfrA” in N315 -> I could not find any of the listed aa substitutions for dfrA
Maybe this is the same gene in N315?

For parC -> S80F and S80Y there are too many strains that I cannot see them all in the console and I’m not sure how to make a for loop output a text file.