Advertisement
meowcat

deprofile.R

Jun 24th, 2012
361
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
R 12.31 KB | None | 0 0
  1. # deprofile.R by Michael Stravs, Eawag <michael.stravs@eawag.ch>
  2. # this code is public domain, make of it what you want and don't blame me
  3. # Needed for na.locf, which the R-style implementation of deprofile.fwhm relies on:
  4. library(zoo)
  5.  
  6. #' Convert profile data to "centroid" mode using full-width, half-height algorithm or
  7. #' local maximum method.
  8. #'
  9. #' WARNING: Not yet thoroughly tested!
  10. #'
  11. #' The "deprofile" functions convert profile-mode high-resolution input data to "centroid"-mode
  12. #' data amenable to i.e. centWave.
  13. #'
  14. #' The "deprofile.fwhm" method is basically an R-semantic version of the "Exact Mass" m/z deprofiler
  15. #' from MZmine. It takes the center between the m/z values at half-maximum intensity for the exact peak mass.
  16. #' The logic is stolen verbatim from the Java MZmine algorithm, but it has been rewritten to use the fast
  17. #' R vector operations instead of loops wherever possible. It's slower than the Java implementation, but
  18. #' still decently fast IMO. Using matrices instead of the data frame would be more memory-efficient
  19. #' and also faster, probably.
  20. #'
  21. #' The "deprofile.localMax" method uses local maxima and is probably the same used by e.g. Xcalibur.
  22. #' For well-formed peaks, "deprofile.fwhm" gives more accurate mass results; for some applications,
  23. #' "deprofile.localMax" might be better (e.g. for fine isotopic structure peaks which are
  24. #' not separated by a valley and also not at half maximum.) For MS2 peaks, which have no isotopes,
  25. #' "deprofile.fwhm" is probably the better choice generally.
  26. #'
  27. #' The word "centroid" is used for convenience to denote not-profile-mode data.
  28. #' The data points are NOT mathematical centroids; we would like to have a better word for it.
  29. #'
  30. #' The "noise" parameter was only included for completeness, I personally don't use it.
  31. #'
  32. #' \code{deprofile.fwhm} and \code{deprofile.localMax} are the workhorses; \code{deprofile.xcmsRaw} and
  33. #' \code{deprofile.scan} take an xcmsRaw object or a 2-column scan as inputs, respectively.
  34. #' \code{deprofile} dispatches the call to the appropriate worker method.
  35. #'
  36. #' @note Known limitations: If the absolute leftmost stick or the absolute rightmost stick in
  37. #'              a scan are maxima, they will be discarded! However, I don't think this will
  38. #'              ever present a practical problem.
  39. #'
  40. #' @aliases deprofile, deprofile.xcmsRaw, deprofile.fwhm, deprofile.localMax
  41. #' @usage       deprofile.scan(scan, noise, method, ...)
  42. #'              deprofile.xcmsRaw(xraw, noise, method, ...)
  43. #'              deprofile(df, noise, method, ...)
  44. #'              deprofile.fwhm(df, noise, cut)
  45. #'              deprofile.localMax(df, noise)
  46. #' @param scan A matrix with 2 columns for m/z and intensity; from profile-mode high-resolution data. Corresponds
  47. #'              to the spectra obtained with xcms::getScan or mzR::peaks.
  48. #' @param noise The noise cutoff. A peak is not included if the maximum stick intensity of the peak
  49. #'              is below the noise cutoff.
  50. #' @param copy for \code{xcmsRaw}: if \code{TRUE}, the \code{xcmsRaw} object is deep-copied and calculations
  51. #'              are performed on the copy. Otherwise (default) the calculations are performed in-place.
  52. #' @param method "deprofile.fwhm" for full-width half-maximum or "deprofile.localMax" for
  53. #'              local maximum.
  54. #' @param colnames For deprofile.scan: return matrix with column names (xcms-style,
  55. #'              \code{TRUE}, default) or plain (mzR-style, \code{FALSE}).
  56. #' @param df A dataframe with at least the columns \code{mz} and \code{int} to perform deprofiling on.
  57. #' @param xraw An xcmsRaw object with high-resolution profile mode MS data, e.g. Orbitrap.
  58. #' @param cut A parameter for "deprofile.fwhm" indicating where the peak flanks should be taken. Standard is 0.5
  59. #'              (as the algorithm name says, at half maximum.) Setting \code{cut = 0.75} would instead determine the peak
  60. #'              width at 3/4 maximum, which might give a better accuracy for merged peaks, but could be less accurate
  61. #'              if too few data points are present.
  62. #' @return \code{deprofile.scan}: a matrix with 2 columns for m/z and intensity
  63. #'              \code{deprofile.xcmsRaw}: a new xcmsRaw object, if \code{copy = TRUE}. Otherwise,
  64. #'              nothing is returned and the conversion is performed in place.
  65. #'
  66. #' @examples
  67. #' \dontrun{
  68. #' # from xcmsRaw:
  69. #' xraw <- xcmsRaw("myfile.mzML")
  70. #' # by FWHM method:
  71. #' xraw.sticked <- deprofile.xcmsRaw(xraw, copy=T, method="deprofile.fwhm")
  72. #' # local maximum: faster, but less accurate especially for "low" resolution
  73. #' xraw.sticked <- deprofile.xcmsRaw(xraw, copy=T, method="deprofile.localMax")
  74. #' scan.xraw.profile <- getScan(xraw, 50)
  75. #' scan.xraw <- getScan(xraw.sticked, 50)
  76. #' # alternatively directly from scan:
  77. #' scan.xraw.direct <- deprofile.scan(scan.xraw.profile)
  78. #'
  79. #' # the same scan from mzR:
  80. #' mzrFile <- openMSfile("myfile.mzML")
  81. #' acqNo <- xraw@acquisitionNum[[50]]
  82. #' scan.mzML.profile <- mzR::peaks(mzrFile, acqNo)
  83. #' scan.mzML <- deprofile.scan(scan.mzML.profile)
  84. #' close(mzrFile)
  85. #' }
  86. #'
  87. #' @author Michael Stravs, Eawag <michael.stravs@eawag.ch>
  88. #' @references mzMine source code http://sourceforge.net/svn/?group_id=139835
  89. #' @export
  90. deprofile.scan <- function(scan, noise = NA, method="deprofile.fwhm", colnames=T, ...)
  91. {
  92.     # Format the input
  93.     df <- as.data.frame(scan)
  94.     colnames(df) <- c("mz", "int")
  95.     # Call the actual workhorse
  96.     peaklist <- deprofile(df, noise, method, ...)
  97.     # return the centroided peaklist
  98.     peaklist.m <- as.matrix(peaklist[,c("mz", "int")])
  99.     if(colnames)
  100.         colnames(peaklist.m) <- c("mz", "intensity")
  101.     else
  102.         colnames(peaklist.m) <- NULL
  103.     return(peaklist.m)
  104. }
  105.  
  106. # This implementation is cooler because it doesn't iterate over the scans,
  107. # but it fails on my lab PC because it needs too much memory. Also, it's slower :(
  108. # and I don't understand why.
  109. #olddoc' Additionally, if using deprofile.fwhm.xcmsRaw, a peak might be interpreted as
  110. #olddoc'                too large if it crosses 2 scans (i.e. na.locf(fromLast=T) carries a groupmax
  111. #olddoc'                into a preceding scan or vice versa, see code).
  112. #olddoc'                Again, I think this is not a practical problem. But we have to verify this.
  113. ## deprofile.xcmsRaw <- function(xraw, noise = NA, copy=F, method="deprofile.fwhm", ...)
  114. ## {
  115. ##     if(copy)
  116. ##         xraw <- deepCopy(xraw)
  117. ##     # This computes the entire centroid matrix in one flush without splitting the data by scans.
  118. ##     # Is it really error-free? We have to check.
  119. ##     # make a suitable input for masslist.fwhm
  120. ##     df <- data.frame(mz = xraw@env$mz, int = xraw@env$intensity)
  121. ##     # add additional input to be preserved: the scan index
  122. ##     df$scanindex <- NA
  123. ##     df[xraw@scanindex + 1, "scanindex"] <- xraw@scanindex
  124. ##     df$scanindex.full <- na.locf(df$scanindex)
  125. ##     peaklist <- deprofile(df, noise, method, ...)
  126. ##     # Now we can get the mz, int values from the new peaklist but we have
  127. ##     # to recompute the correct scan indices.
  128. ##     xraw@env$mz <- peaklist$mz
  129. ##     xraw@env$intensity <- peaklist$int
  130. ##     # Since the entry where scanindex != NA might have been cut and we can't rely
  131. ##     # on it always being there, we have to find the first occurrence of every scanindex
  132. ##     # and make it 0- instead of 1-based
  133. ##     peaklist$n2 <- 1:nrow(peaklist)
  134. ##     xraw@scanindex <- as.integer(which(!duplicated(peaklist$scanindex.full)) - 1)
  135. ##     # I do not recalculate TICs or the profile matrix. The profile matrix is low-resolution
  136. ##     # anyway and would look the same if recalculated.
  137. ##     if(copy)
  138. ##         return(xraw)
  139. ##     # That's all, folks.
  140. ## }
  141.  
  142. deprofile.xcmsRaw <- function(xraw, noise = NA, copy=F, method="deprofile.fwhm", ...)
  143. {
  144.     if(copy)
  145.         xraw <- deepCopy(xraw)
  146.     # This splits the data by scan and recomposes it
  147.     # because the other version causes memory issues on my lab PC.
  148.     # Using actual split() would be extremely slow.
  149.     nscans <- length(xraw@scanindex)
  150.     scanindex.lo <- xraw@scanindex + 1
  151.     scanindex.hi <- c(xraw@scanindex[-1], length(xraw@env$mz))
  152.     mz <- numeric(0)
  153.     int <- numeric(0)
  154.     scanindex <- c(0)
  155.     for(n in 1:nscans)
  156.     {
  157.         df <- data.frame(mz = xraw@env$mz[scanindex.lo[[n]]:scanindex.hi[[n]]]
  158.         , int = xraw@env$intensity[scanindex.lo[[n]]:scanindex.hi[[n]]])
  159.         peaklist <- deprofile(df, noise, method, ...)
  160.         # prolong the vectors
  161.         mz <- c(mz, peaklist$mz)
  162.         int <- c(int, peaklist$int)
  163.         scanindex <- c(scanindex, length(mz))
  164.     }
  165.     # remove the last element from scanindex:
  166.     scanindex <- scanindex[-length(scanindex)]
  167.     # Now store everything into the xraw
  168.     xraw@env$mz <- mz
  169.     xraw@env$intensity <- int
  170.     xraw@scanindex <- as.integer(scanindex)
  171.     # I do not recalculate TICs or the profile matrix. The profile matrix is low-resolution
  172.     # anyway and would look the same if recalculated.
  173.     if(copy)
  174.         return(xraw)
  175.     # That's all, folks.
  176. }
  177.  
  178.  
  179. deprofile <- function(df, noise, method, ...)
  180. {
  181.     return(do.call(method, list(df, noise, ...)))
  182. }
  183.  
  184. deprofile.fwhm <- function(df, noise=NA, cut=0.5)
  185. {
  186.     # split sticks into groups according to how MzMine does it:
  187.     # a new group starts at zeroes and at new ascending points
  188.     df$groupmax <- NA
  189.     rows <- nrow(df)
  190.     df <- within(df,
  191.             {
  192.                 # identify local maxima
  193.                 groupmax[which(diff(sign(diff(int)))<0) + 1] <- which(diff(sign(diff(int)))<0) + 1
  194.                 # make forward-filled and backward-filled list for which was the last maximum.
  195.                 # This assigns the sticks to groups.
  196.                 groupmax_f <- na.locf(groupmax, na.rm=F)
  197.                 groupmax_b <- na.locf(groupmax, fromLast=T, na.rm=F)
  198.                 # take backward-filled group as default
  199.                 # and forward-filled group where the peak was ascending
  200.                 groupmax_b[which(diff(int)<0)+1 ] <- groupmax_f[which(diff(int) <0)+1]
  201.                 # eliminate zeroes
  202.                 groupmax_b[which(int==0)]<-NA
  203.                 # add "next intensities" and "next m/z" as well as "index" (n)
  204.                 n <- 1:rows
  205.                 int1 <- c(int[-1],0)
  206.                 mz1 <- c(mz[-1],0)
  207.                 # delete all the intensity+1 values from points which are last-in-group and therefore have int1 from next group!
  208.                 int1[which(!is.na(groupmax))-1] <- 0
  209.                 # find maximal intensity point for each peak member
  210.                 maxint <- df[groupmax_b, "int"]
  211.                 hm <- maxint * cut
  212.                 up <- ifelse( (int <= hm) & (int1 >= hm) & (n < groupmax_b), groupmax_b, NA)
  213.                 down <- ifelse( (int >= hm) & (int1 <= hm) & (n > groupmax_b), groupmax_b, NA)
  214.             })
  215.  
  216.     # Compile finished peak list
  217.     peaklist <- df[which(!is.na(df$groupmax)),]
  218.     # Noise parameter:
  219.     if(!is.na(noise))
  220.         peaklist <- subset(peaklist, maxint > noise)
  221.     # Find which peaks might have a FWHM value to substitue for the maxint mz value
  222.     # We isolate the peaklists for left-hand and for right-hand FWHM peak.
  223.     # If any straight line is found, the rightmost of the left points and the leftmost of the right points is used.
  224.     peaklist.left <- df[which(!is.na(df$up) & !duplicated(df$up, fromLast=T)),]
  225.     peaklist.right <- df[which(!is.na(df$down) & !duplicated(df$down)),]
  226.     # calculate the slopes and the corresponding m/z value at half maximum
  227.     peaklist.left$slope <- (peaklist.left$int1 - peaklist.left$int) / (peaklist.left$mz1 - peaklist.left$mz)
  228.     peaklist.right$slope <- (peaklist.right$int1 - peaklist.right$int) / (peaklist.right$mz1 - peaklist.right$mz)
  229.     peaklist.left$mzleft <- peaklist.left$mz + (peaklist.left$hm - peaklist.left$int) / peaklist.left$slope
  230.     peaklist.right$mzright <- peaklist.right$mz + (peaklist.right$hm - peaklist.right$int) / peaklist.right$slope
  231.     # add the two values to the full-peaklist where they exist
  232.     peaklist <- merge(peaklist, peaklist.left[,c("up", "mzleft")], by.x="groupmax", by.y="up", all.x=T, suffix=c("", ".left"))
  233.     peaklist <- merge(peaklist, peaklist.right[,c("down", "mzright")], by.x="groupmax", by.y="down", all.x=T, suffix=c("", ".right"))
  234.     # Find which entries have both a left and a right end,
  235.     # and calculate the center mass for them.
  236.     peaklist.indexMzhm <- which(!is.na(peaklist$mzleft) & !is.na(peaklist$mzright))
  237.     peaklist[peaklist.indexMzhm, "mz"] <- (peaklist[peaklist.indexMzhm, "mzleft"] + peaklist[peaklist.indexMzhm, "mzright"]) / 2
  238.    
  239.     return(peaklist)
  240. }
  241.  
  242. deprofile.localMax <- function(df, noise=NA)
  243. {
  244.     # split sticks into groups:
  245.     # a new group starts at zeroes and at new ascending points
  246.     df$groupmax <- NA
  247.     rows <- nrow(df)
  248.     df$groupmax[which(diff(sign(diff(df$int)))<0) + 1] <- which(diff(sign(diff(df$int)))<0) + 1
  249.     peaklist <- df[which(!is.na(df$groupmax)),]
  250.     # Noise parameter:
  251.     if(!is.na(noise))
  252.         peaklist <- subset(peaklist, int > noise)
  253.     # And that's it.
  254.     return(peaklist)
  255. }
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement