# Posted Oct 22nd, 2019
#########################################################################################
# This gist is a quick walk-through of several ways to produce scales capturing
# the average value of one or more variables. Each row (observation) gets its own
# scale value. We'll assume your data are not fully "tidy" (that is, they are in
# wide format): each row is one observation whose scale value you want to
# calculate, and each variable that goes into the scale is in its own column.
#########################################################################################

#########################################################################################
# Set-up (packages, fake data, etc.)
#########################################################################################

# Load dplyr
library(dplyr)

# Set seed
set.seed(9)

# Create some fake data with variables x, y, z, w
# for example purposes
mydata <- data.frame(x = rnorm(10),
                     y = runif(10),
                     z = 1:10,
                     w = c(1, rep(NA, 9)))

# Let's take a look at our raw data. Note
# that we have missing values in variable w.
mydata

#########################################################################################
# Scale creation
#########################################################################################

# We'll now create 2 scales. Each scale will
# capture the average of two variables in our data.

# Which variables? Let's choose two variables
# for each of the scales: scale1 will reflect
# the mean of the variables named x and y, and
# scale2 will reflect the mean of variables y and z.

scale1_vars <- c("x","y")
scale2_vars <- c("y","z")

# Now let's generate our scales. If a row is missing a
# value for one of the variables, the average will be
# computed from the other variable if it is not also
# missing. If both values are missing, rowMeans() returns
# NaN, which is.na() also reports as missing.
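
# A minimal sketch of the missing-value behavior described above, using a
# hypothetical three-row data frame (toy_na and toy_means are illustrative
# names, not part of the walk-through data). It needs only base R.
toy_na <- data.frame(a = c(1, NA, NA),
                     b = c(3, 5, NA))
toy_means <- rowMeans(toy_na, na.rm = TRUE)
toy_means          # 2, 5, NaN: row 3 has no observed values to average
is.na(toy_means)   # FALSE FALSE TRUE, because is.na() is also TRUE for NaN
# If you would rather store NA than NaN, convert explicitly:
toy_means[is.nan(toy_means)] <- NA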

# The code below tells the mutate() function from the dplyr package
# that we want to generate new variables (scale1 and scale2) by
# calculating the average of the values observed in each row for
# only those columns containing the variables we previously included
# in scale1_vars or scale2_vars. A different scale is generated for
# each group of variables.
fulldf <- mydata %>%
  dplyr::mutate(scale1 = rowMeans(mydata[ , scale1_vars], na.rm = TRUE),
                scale2 = rowMeans(mydata[ , scale2_vars], na.rm = TRUE))

# A benefit of the above approach is that you can put any variables you
# want into scale1_vars or scale2_vars and you don't need to know how
# many variables you put in: the code will simply calculate the average
# across all of the listed columns.
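
# To illustrate the point above, here is a small sketch with a hypothetical
# data frame (toy3) and a hypothetical variable list (toy_vars); neither is
# part of the walk-through data. The same rowMeans() call works whether the
# list names three columns or two.
toy3 <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 9))
toy_vars <- c("x", "y", "z")
toy3_mean_xyz <- rowMeans(toy3[ , toy_vars], na.rm = TRUE)   # 3, 5
toy_vars <- c("x", "z")
toy3_mean_xz <- rowMeans(toy3[ , toy_vars], na.rm = TRUE)    # 3, 5.5
# Caveat: with a single column name, toy3[ , toy_vars] collapses to a plain
# vector and rowMeans() fails; toy3[ , toy_vars, drop = FALSE] avoids that.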

# Now let's calculate the averages manually. We can do this by
# adding up the variables we want in our scale, then dividing by the
# number of variables to get the average of the variables.
fulldf <- fulldf %>%
  dplyr::mutate(scale1_manual = (x + y) / 2,
                scale2_manual = (y + z) / 2)

# Let's make sure the manual approach and the other approach produce
# the same result. We can use the identical() function to tell us if the
# two variables are identical! TRUE if so, FALSE if not.
identical(fulldf$scale1_manual, fulldf$scale1) # TRUE
identical(fulldf$scale2_manual, fulldf$scale2) # TRUE
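
# A hedged aside on the comparison above: identical() demands exact equality,
# so two numerically equivalent calculations can fail it because of
# floating-point rounding. all.equal() compares within a small tolerance
# instead. The values below are illustrative only (sum_a and sum_b are
# hypothetical names).
sum_a <- 0.1 + 0.2
sum_b <- 0.3
identical(sum_a, sum_b)           # FALSE: the two doubles differ in their last bits
isTRUE(all.equal(sum_a, sum_b))   # TRUE: equal within the default tolerance
# Applied to the scales above, the tolerance-based check would look like:
# isTRUE(all.equal(fulldf$scale1_manual, fulldf$scale1, check.attributes = FALSE))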

# Based on the above, our approach works! The rowMeans() approach is nice because:
#   (1) you don't need to type out the number of variables in the scale, so you
#       don't run the risk of forgetting to change the divisor if/when you
#       change the number of items in your scale, and
#   (2) it calculates averages from whichever variables are observed in a row,
#       rather than simply returning NA if ANY variable is missing.
# On the other hand:
#   (a) it takes more lines of code, because you first list the variables for
#       each scale and then generate the scales in separate code,
#   (b) it is less obvious mathematically what you are doing unless you
#       immediately recognize that rowMeans() averages across the columns of
#       each row, and
#   (c) you might want to drop any observation with missing data rather than
#       use the available data to calculate an average.
# For example, if we make a scale using the variable with missing data (w),
# the two approaches produce different results:

# Approach 1
scale3_vars <- c("w","z")
fulldf <- fulldf %>% dplyr::mutate(scale3 = rowMeans(fulldf[ , scale3_vars], na.rm = TRUE))

# Approach 2
fulldf <- fulldf %>% dplyr::mutate(scale3_manual = (w + z) / 2)

# Test if identical
identical(fulldf$scale3, fulldf$scale3_manual) # FALSE

# To see why, here are the computed scale values for the two different approaches.
# scale3 has a value in every row, while scale3_manual is NA wherever w is
# missing (rows 2 through 10).
fulldf$scale3
fulldf$scale3_manual
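
# A small self-contained sketch of why the two approaches diverge, using a
# hypothetical data frame (toy_wz) shaped like w and z above. The column
# names available, manual, and n_missing are illustrative only.
toy_wz <- data.frame(w = c(1, NA, NA), z = 1:3)
toy_wz$available <- rowMeans(toy_wz[ , c("w", "z")], na.rm = TRUE)  # 1, 2, 3
toy_wz$manual    <- (toy_wz$w + toy_wz$z) / 2                       # 1, NA, NA
# Counting missing items per row shows which scale values rest on fewer items
toy_wz$n_missing <- rowSums(is.na(toy_wz[ , c("w", "z")]))          # 0, 1, 1
toy_wz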

# Note: there are more efficient ways of doing this as well, such as writing a
# function that implements the second approach from just the variable names
# and a scale name, with an option that lets the user include or omit
# rows with missing values. The examples above are reasonable ways of doing
# this for those just learning R and/or the Tidyverse.
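
# One possible helper along the lines of the note above. This is a sketch,
# not the author's function: add_scale() and its arguments are hypothetical
# names, and it is built on rowMeans() rather than on manual addition. With
# na.rm = FALSE it reproduces the manual approach's behavior (NA if any item
# is missing); with na.rm = TRUE it averages whatever is observed.
add_scale <- function(data, vars, scale_name, na.rm = TRUE) {
  stopifnot(is.data.frame(data), all(vars %in% names(data)))
  # drop = FALSE keeps a data frame even when vars names a single column
  means <- rowMeans(data[ , vars, drop = FALSE], na.rm = na.rm)
  means[is.nan(means)] <- NA   # rows with no observed values become NA
  data[[scale_name]] <- unname(means)
  data
}

# Usage on a hypothetical two-row data frame
toy_fn <- data.frame(w = c(1, NA), z = c(1, 2))
toy_fn <- add_scale(toy_fn, c("w", "z"), "avail")                  # 1, 2
toy_fn <- add_scale(toy_fn, c("w", "z"), "strict", na.rm = FALSE)  # 1, NA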