Cluster Analysis

R has an amazing variety of functions for cluster analysis. In this section, I will describe three of the many approaches: hierarchical agglomerative, partitioning, and model based. While there is no single best solution to the problem of determining the number of clusters to extract, several approaches are given below.
Data Preparation

Prior to clustering data, you may want to remove or estimate missing data and rescale variables for comparability.

# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata)   # standardize variables
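
The code above simply drops incomplete rows. If that loses too much data, missing values can be estimated instead. Below is a minimal sketch of mean imputation in base R, assuming mydata is still a data frame of numeric variables (run it before scale( )); dedicated imputation packages such as mice are usually preferable.

# Estimate missing data by mean imputation (a sketch, not from the original page)
for (v in names(mydata)) {
  miss <- is.na(mydata[[v]])
  mydata[[v]][miss] <- mean(mydata[[v]], na.rm = TRUE) # fill NAs with the column mean
}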
Partitioning

K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract. A plot of the within-groups sum of squares by number of clusters extracted can help determine the appropriate number of clusters. The analyst looks for a bend in the plot, similar to a scree test in factor analysis. See Everitt & Hothorn (p. 251).

# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# K-Means Cluster Analysis
set.seed(1234) # k-means uses random starting centers; fix the seed for reproducibility
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata, by=list(fit$cluster), FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)

A robust version of K-means based on medoids can be invoked by using pam( ) in the cluster package instead of kmeans( ). The function pamk( ) in the fpc package is a wrapper for pam that also prints the suggested number of clusters based on optimum average silhouette width; a sketch of both is given below.
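
As a minimal sketch of that workflow (assuming mydata has been prepared as above; object names here are illustrative):

# Partitioning around medoids (a sketch, not from the original page)
library(cluster)
library(fpc)
fit <- pam(mydata, k=5)   # 5 medoid-based clusters
plot(fit)                 # clusplot and silhouette plot of the solution
pk <- pamk(mydata)        # chooses k by optimum average silhouette width
pk$nc                     # suggested number of clusters
pk$pamobject              # the corresponding pam solution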
Hierarchical Agglomerative

There is a wide range of hierarchical clustering approaches. I have had good luck with Ward's method, described below.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D2") # use method="ward" in R older than 3.1.0
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

[Figure: dendrogram]

The pvclust( ) function in the pvclust package provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Clusters that are highly supported by the data will have large p values. Interpretation details are provided by Suzuki. Be aware that pvclust clusters columns, not rows. Transpose your data before using.

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward.D2",
               method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)

[Figure: dendrogram with bootstrap p values]
Model Based

Model-based approaches assume a variety of data models and apply maximum likelihood estimation and Bayes criteria to identify the most likely model and number of clusters. Specifically, the Mclust( ) function in the mclust package selects the optimal model according to BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models (phew!). One chooses the model and number of clusters with the largest BIC. See help(mclustModelNames) for details on the model chosen as best.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
plot(fit) # plot results
summary(fit) # display the best model
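
To see what drove the selection, the BIC values themselves can be inspected; a brief sketch (component names as in current mclust releases):

# Inspect the BIC-based model selection (a sketch, not from the original page)
plot(fit, what = "BIC")   # BIC by model type and number of clusters
fit$BIC                   # the underlying BIC table
head(fit$classification)  # cluster assignments from the chosen model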

[Figure: model-based clustering scatter plots]
Plotting Cluster Solutions

It is always a good idea to look at the cluster results.

# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)

# Cluster Plot against 1st 2 principal components

# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
         labels=2, lines=0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)

[Figure: clusplot and discriminant plot]
Validating Cluster Solutions

The function cluster.stats() in the fpc package provides a mechanism for comparing the similarity of two cluster solutions using a variety of validation criteria (Hubert's gamma coefficient, the Dunn index, and the corrected Rand index).

# comparing 2 cluster solutions
library(fpc)
cluster.stats(d, fit1$cluster, fit2$cluster)

where d is a distance matrix among objects, and fit1$cluster and fit2$cluster are integer vectors containing classification results from two different clusterings of the same data.
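
For instance, a sketch comparing the k-means and Ward solutions built earlier on this page (object names are illustrative):

# Compare a k-means and a hierarchical solution (a sketch, not from the original page)
library(fpc)
d <- dist(mydata, method = "euclidean")
fit1 <- kmeans(mydata, 5)                                            # k-means solution
fit2 <- list(cluster = cutree(hclust(d, method = "ward.D2"), k = 5)) # Ward solution
cluster.stats(d, fit1$cluster, fit2$cluster)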

Copyright © 2014 Robert I. Kabacoff, Ph.D.