Mixed data clustering example by TDL

Aug 19th, 2018
# I think your greatest challenge in such a clustering is how you want to balance each variable with respect to the others. In your particular case, Age is a numeric variable, Gender is a binary indicator, and Interests is actually a mixture of binary indicators (i.e. for each possible interest, is the person interested in it or not).
# In clustering, one of the key determining factors is choosing a suitable distance/dissimilarity metric. The one I'm most familiar with for clustering mixed data is Gower's dissimilarity. It is quite easy to understand how it combines the columns (variables). It does not incorporate e.g. inter-variable correlations, which may be essential in your particular task: for example, people who are interested in 'Cooking' might be more likely to also be interested in 'Gardening', and you might want to emphasize this correlation.
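# A minimal sketch of Gower's dissimilarity via cluster::daisy(); the three-row toy data here is hypothetical and only illustrates the call, it is not used in the code below.

```r
# Gower's dissimilarity handles mixed column types out of the box.
# Requires the 'cluster' package (on CRAN).
library(cluster)

toy <- data.frame(
    age    = c(52, 38, 59),
    gender = factor(c("Female", "Female", "Male")),
    reads  = factor(c("yes", "no", "yes"))  # one interest as a nominal factor
)
# daisy() rescales each variable to [0, 1] and averages across columns
d <- daisy(toy, metric = "gower")
round(as.matrix(d), 2)
```

# daisy() picks a sensible treatment per column class; asymmetric binary columns can be declared explicitly via its 'type' argument.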

# Another option that does not require a distance/dissimilarity metric designed specifically for mixed data is to transform your data into numeric data (continuous and ordinal) and then apply a suitable distance metric, for example the standardized Euclidean distance. In that case you need to be careful that the distance metric remains representative of your original research question; for example, if you binarize all your 'Interests', you might accidentally weight Age by 1, Gender by 1, and Interests by the number of unique interests that exist.
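# A small sketch of the standardized-Euclidean idea on a hypothetical numeric matrix: z-score the columns first, after which the ordinary Euclidean dist() is the standardized Euclidean distance.

```r
# Hypothetical numeric data: one continuous and one binary column
x <- cbind(age = c(52, 38, 59, 27), isFemale = c(1, 1, 0, 1))

# Standardize each column to zero mean and unit standard deviation
xz <- scale(x)
d  <- dist(xz)  # method = "euclidean" is the default
```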

# Your 'Interests' column is not a true factor; rather, it is a combination of various possible interests, and it will therefore be highly sparse with regard to finding people with a similar combination of interests. I would therefore consider creating a numeric representation of your data and weighting the numeric variables accordingly. Here is an example R code following your example:

## Example code

# You had a typo in your original example, as there was a comma missing between "wemaog" and "leighw12".
id <- c("jacobl","georgea123","makeretale","hafeezaka","billiejogr","wemaog","leighw12","leannep1234","jerryf1","yesh5")

age <- c(52, 58, 59, 52, 38, 30, 59, 27, 51, 71)

gender <- c("Female","Male","Male","Male","Female","Female","Male","Female","Female","Female")

interests <- c("Cooking, Reading",
           "Cooking, Movies, Reading, Travel",
           "Gardening, IndoorGames, Movies, Music, Pets, PhotographyArt, Travel",
           "Cooking, PhotographyArt, Other",
           "Cooking, Gardening, IndoorGames, Movies, Music, CulturalFestivities, Sports, Travel",
           "Cooking, IndoorGames, Movies, Music, Pets, Reading",
           "IndoorGames, Movies, Music, Reading, Sports, Travel",
           "Gardening, IndoorGames, Movies, Music, Pets, PhotographyArt, Reading, Travel",
           "Cooking, Gardening, IndoorGames, Movies, Music, CulturalFestivities, PhotographyArt, Reading, Sports, Travel",
           "Gardening, Movies, Music, Pets, Sports, Travel, Cooking, PhotographyArt, IndoorGames")


df <- data.frame(id, gender, age, interests)

##

# id is not a variable for clustering; it is an identifier, so we move it into the rownames.
rownames(df) <- df$id
df <- df[,-1]
# Your gender column looks a bit funny, though I take it that's just because it's toy data.

# Split the comma-separated hobbies into a list where each list element is a character vector of hobbies.
personhobbies <- lapply(df$interests,
    FUN=function(z) {
        gsub(" ", "", strsplit(as.character(z), ",")[[1]])
    })
# Find all the possible unique hobbies.
possiblehobbies <- unique(unlist(personhobbies))
#> length(possiblehobbies)
#[1] 12

# Age is approximately a continuous variable, even if the decimals are not reported.

# Binarize the Female/Male gender indicator.
isFemale <- as.numeric(df$gender == "Female")
# 1 equals Female, 0 equals Male

# We will then binarize all of the hobbies. Be careful here, however: you will binarize 12 potential hobbies, and if you don't standardize these columns you will weight them 12-fold relative to e.g. gender. That is probably not what you want represented in the distance/dissimilarity matrix, which is the key ingredient in your clustering.

# We want a matrix where columns are individual hobbies and rows are the individuals
hobbymatrix <- do.call("rbind",
    # Loop over individuals
    lapply(personhobbies, FUN=function(z){
        # Loop over all the possible hobbies
        unlist(lapply(possiblehobbies, FUN=function(q){
            as.numeric(q %in% z)
        }))
    }))
rownames(hobbymatrix) <- rownames(df)
# Crop the hobby names just for readability
colnames(hobbymatrix) <- substr(possiblehobbies, 1, 5)
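# As a side note, the same 0/1 matrix can be built more compactly with sapply(); a sketch on a hypothetical two-person hobby list:

```r
# Hypothetical list of per-person hobby vectors
ph <- list(c("Cooking", "Reading"), c("Reading", "Travel"))
allhobbies <- unique(unlist(ph))

# One row per person, one column per possible hobby
m <- t(sapply(ph, function(z) as.numeric(allhobbies %in% z)))
colnames(m) <- allhobbies
m
```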

hobbymatrix
#            Cooki Readi Movie Trave Garde Indoo Music Pets Photo Other Cultu Sport
#jacobl          1     1     0     0     0     0     0    0     0     0     0     0
#georgea123      1     1     1     1     0     0     0    0     0     0     0     0
#makeretale      0     0     1     1     1     1     1    1     1     0     0     0
#hafeezaka       1     0     0     0     0     0     0    0     1     1     0     0
#billiejogr      1     0     1     1     1     1     1    0     0     0     1     1
#wemaog          1     1     1     0     0     1     1    1     0     0     0     0
#leighw12        0     1     1     1     0     1     1    0     0     0     0     1
#leannep1234     0     1     1     1     1     1     1    1     1     0     0     0
#jerryf1         1     1     1     1     1     1     1    0     1     0     1     1
#yesh5           1     0     1     1     1     1     1    1     1     0     0     1

# Notice that if we use these numeric columns as-is, we'll weight the hobbies disproportionately relative to e.g. gender. One quick fix is to scale the indicators by the number of possible choices.
hobbymatrixscaled <- hobbymatrix/length(possiblehobbies)

# Construct a numeric data matrix; we will furthermore scale the Age variable to have zero mean and unit standard deviation (z-score)
datscaled <- data.frame(zAge = scale(df$age), isFemale = isFemale, hobbymatrixscaled)
datraw <- data.frame(zAge = scale(df$age), isFemale = isFemale, hobbymatrix)

# Notice that some dist functions want a matrix rather than a data.frame as input, so you might need to apply as.matrix() first.

# Shamelessly self-promoting my own customizable heatmap; hierarchical clustering is performed simultaneously for both the rows (individuals) and the columns (age, gender, and hobbies)
if(!require("hamlet")){
    install.packages("hamlet") # Available on CRAN
    library("hamlet")
}

par(mfrow=c(1,3), mar=c(4,4,4,1))
# Left heatmap
h1 <- hmap(as.matrix(datscaled))
hmap.key(h1)
title(main="Data clustered with scaling", xlab="Variables", ylab="Individuals")
# Middle heatmap
h2 <- hmap(as.matrix(datraw), scale="none")
hmap.key(h2)
title(main="Raw binary data", xlab="Variables", ylab="Individuals")
# Plot a plain hierarchical clustering for individuals only, based on the raw data
plot(hclust(d=dist(datraw)))
# Notice that Euclidean distance is used by default in all of the above.
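# For reference, a tiny sketch of how to request a different metric from dist(); the two points here are made up.

```r
# Two hypothetical points: (0, 0) and (3, 4)
x <- matrix(c(0, 0, 3, 4), ncol = 2, byrow = TRUE)
dist(x)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")  # |3| + |4| = 7
```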


# Notice that you obtain very different conclusions depending on how you treat your variables, so you will need to give this quite a bit of thought, as well as the choice of distance/dissimilarity metric (not to mention other clustering-related strategies).
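# To make the last point concrete, a hedged sketch on synthetic, hypothetical data: cluster the same matrix with and without down-weighting a binary block, then cross-tabulate the resulting labels to see where they disagree.

```r
set.seed(1)
num <- matrix(rnorm(20), ncol = 2)              # two numeric variables, 10 individuals
bin <- matrix(rbinom(100, 1, 0.5), ncol = 10)   # ten 0/1 indicators

raw    <- cbind(num, bin)             # binary block dominates the distances
scaled <- cbind(num, bin/ncol(bin))   # binary block down-weighted

# Two-group cut of a hierarchical clustering under each treatment
craw    <- cutree(hclust(dist(raw)), k = 2)
cscaled <- cutree(hclust(dist(scaled)), k = 2)
table(craw, cscaled)  # cross-tabulation reveals where the labelings disagree
```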