---
title: "Cluster Validation"
author:
- Craciun Dorina (group 405)
- Tapirdea Alexandru (group 405)
date: "`r format(Sys.time(), '%d %B %Y')`"
# output:
#   ioslides_presentation:
#     build: yes
#     incremental: yes
#     smaller: yes
#   slidy_presentation:
#     incremental: yes
output:
  xaringan::moon_reader
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(xaringan)
```

# Cluster validation
Cluster validation consists of measuring the performance of the results obtained after applying a clustering algorithm.

Before applying any clustering algorithm, the following steps are carried out:

<ol>
<li> assessing the clustering tendency </li>
<li> determining the optimal number of clusters </li>
<li> the clustering itself </li>
<li> validating the clustering results </li>
</ol>
???
Before applying any clustering algorithm to a data set, the first thing to do is to assess the clustering tendency, that is, whether clustering is suitable for the data at all. If yes, then how many clusters are there? Next, you can perform hierarchical clustering or partitioning clustering (with a pre-specified number of clusters). Finally, you can use a number of measures, described in this part, to evaluate the goodness of the clustering results.

---
## Assessing clustering tendency
Before applying any clustering method, it is important to decide whether the data set contains meaningful clusters (non-random structures). If it does, the number of clusters should also be determined.

```{r echo=FALSE}
# install.packages(c("factoextra", "clustertend", "ggplot2"))
```
```{r echo=TRUE}
head(iris, 3)
```
---
```{r echo=TRUE}
# Iris data set, excluding the column Species
df <- iris[, -5]
# Random data generated from the iris data set
random_df <- apply(df, 2,
                   function(x){runif(length(x), min(x), max(x))})
random_df <- as.data.frame(random_df)
```
```{r echo=FALSE, warning=FALSE, message=FALSE}
# Standardize the data sets
df <- iris.scaled <- scale(df)
random_df <- scale(random_df)
library("factoextra")
library("ggplot2")
set.seed(123)
# Plot the iris data set
fviz_pca_ind(prcomp(df), title = "PCA - Iris data",
             habillage = iris$Species, palette = "jco",
             geom = "point", ggtheme = theme_classic(),
             legend = "bottom")
```
???
One problem is that clustering algorithms will return clusters even when none exist or when they are poorly defined. This is why we have to assess the data set before applying an algorithm. To illustrate this we chose the iris data set, shown above. We also generated a random data set starting from the original one, and we will apply several algorithms to both sets to observe the differences.
---

```{r echo=TRUE}
# Plot the random data set
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data",
             geom = "point", ggtheme = theme_classic())
```
---
##### We apply hierarchical clustering and k-means to each of the two data sets
```{r echo=TRUE, out.height="400px"}
# K-means on the iris data set
km.res1 <- kmeans(df, 3)
fviz_cluster(list(data = df, cluster = km.res1$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())
```
---
```{r echo=TRUE}
# Hierarchical clustering on the iris data set
fviz_dend(hclust(dist(df)), k = 3, k_colors = "jco",
          as.ggplot = TRUE, show_labels = FALSE)
```
---
```{r echo=TRUE}
# K-means on the random data set
km.res2 <- kmeans(random_df, 3)
fviz_cluster(list(data = random_df, cluster = km.res2$cluster),
             ellipse.type = "norm", geom = "point", stand = FALSE,
             palette = "jco", ggtheme = theme_classic())
```
---
```{r echo=TRUE}
# Hierarchical clustering on the random data set
fviz_dend(hclust(dist(random_df)), k = 3, k_colors = "jco",
          as.ggplot = TRUE, show_labels = FALSE)
```


???
Notice that the algorithms force a partition even on uniformly distributed data, where there are clearly no meaningful clusters. This is why we need methods for assessing clustering tendency.
---

### Methods for assessing clustering tendency
##### 1. Statistical methods (Hopkins statistic)
##### 2. Visual methods (Visual Assessment of cluster Tendency - the VAT algorithm)
<hr/>
##### 1. Statistical methods (Hopkins statistic)

- measures the probability that a given data set was generated by a uniform distribution
Let D be a real data set.
1. sample n points ( $p_1$ ,..., $p_n$ ) uniformly from D
2. for each point $p_i$ , find its nearest neighbor $p_j$ and compute the distance between them, $x_i$ = dist( $p_i$ , $p_j$ )
3. generate a random data set ( $q_1$ ,..., $q_n$ ) from a uniform distribution with the same standard deviation as D
4. for each point $q_i$ , find its nearest neighbor $q_j$ and compute the distance between them, $y_i$ = dist( $q_i$ , $q_j$ )
5. compute the Hopkins statistic as the sum of the nearest-neighbor distances in the real data set divided by the sum of the distances in both the real and the random set (this is the convention used by clustertend::hopkins(), so values near 0 indicate clusterable data):
$$H = \frac{\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$$
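
For illustration, here is a minimal hand-rolled sketch of the computation above. The `hopkins_manual` helper is hypothetical and only mirrors the steps listed; the `clustertend::hopkins()` call used later is the standard implementation.

```{r echo=TRUE, eval=FALSE}
hopkins_manual <- function(data, n = 10) {
  data <- as.matrix(data)
  # x_i: nearest-neighbor distances of n points sampled from the real data
  idx <- sample(nrow(data), n)
  x <- sapply(idx, function(i) {
    min(sqrt(colSums((t(data[-i, ]) - data[i, ])^2)))
  })
  # y_i: nearest-neighbor distances (to the data) of n uniform random points
  q <- apply(data, 2, function(col) runif(n, min(col), max(col)))
  y <- apply(q, 1, function(p) {
    min(sqrt(colSums((t(data) - p)^2)))
  })
  # values near 0 suggest clusterable data, near 0.5 a uniform one
  sum(x) / (sum(x) + sum(y))
}
hopkins_manual(df)
```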
---
##### 1. Statistical methods (Hopkins statistic)
- A value close to 0.5 means that the data set D is uniformly distributed.
- H0 (null hypothesis): D is uniformly distributed (no meaningful clusters)
- H1 (alternative hypothesis): the data set D is not uniformly distributed (contains meaningful clusters)
- H value ~ 0.5 => H0
- H value ~ 0 => H1

```{r echo=FALSE}
library(clustertend)
set.seed(123)
```

```{r echo=TRUE}
# Compute Hopkins statistic for the iris data set
hopkins(df, n = nrow(df)-1)
```
```{r echo=FALSE}
set.seed(123)
```
```{r echo=TRUE}
# Compute Hopkins statistic for a random data set
hopkins(random_df, n = nrow(random_df)-1)
```
---

##### 2. Visual methods
- The VAT algorithm:
1. compute the dissimilarity matrix (DM) using the Euclidean distance
2. sort the DM in ascending order of the distances => ODM
3. display the ODM as an ordered dissimilarity image (ODI)

```{r echo=TRUE, out.height="350px"}
fviz_dist(dist(df), show_labels = FALSE)+labs(title = "Iris data")
```
---
```{r echo=TRUE, out.height="350px"}
fviz_dist(dist(random_df), show_labels = FALSE)+
  labs(title = "Random data")
```
- The color is proportional to the dissimilarity value: the purer the red, the closer dist( $x_i$ , $x_j$ ) is to 0, and the purer the blue, the closer dist( $x_i$ , $x_j$ ) is to 1.
???
Red: high similarity, blue: low similarity.
The DM confirms that the iris data set has a clustered structure.
The VAT algorithm assesses the clustering tendency in a visual form, by counting the dark square-shaped blocks along the main diagonal.
---

### Methods for determining the optimal number of clusters
#### 1. Direct methods:
- optimize a criterion (sums of squares, the average silhouette)
##### Elbow method
##### Average silhouette method

#### 2. Statistical testing methods:
- compare the data against a null hypothesis
##### Gap statistic method

#### In R we have the following functions:
- fviz_nbclust()
- NbClust()

???
1. Direct methods
Elbow method
- WSS = within-cluster sum of squares
- total WSS -> measures the compactness of the clustering; to be minimized.
- The elbow method treats the total WSS as a function of the number of clusters:
- Once a suitable number of clusters is reached, adding another one barely improves the total WSS.
1. run the clustering algorithm for different numbers of clusters ( $k_i$ )
2. for each $k_i$ , compute the total WSS
3. plot the WSS curve against the number of clusters $k_i$
4. the point where the curve changes shape (the "elbow") is considered a good number of clusters; the WSS being minimized is written out below.
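
For reference, the total within-cluster sum of squares mentioned above can be written as follows (the formula is a standard addition, not part of the original notes):

$$W_{\text{total}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,$$

where $\mu_k$ is the centroid of cluster $C_k$.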

Average silhouette method
- This method measures the quality of the clustering.
- It determines how well each object lies within its cluster.
- The optimal number of clusters is the one that maximizes the average silhouette over the possible values of k.
1. run the clustering algorithm for different numbers of clusters ( $k_i$ )
2. for each $k_i$ , compute the average silhouette
3. plot the average silhouette curve against the number of clusters $k_i$
4. the location of the maximum is considered the appropriate number of clusters.

2. Statistical testing methods
Gap statistic method
- Compares the variation of the total WSS for different numbers of clusters with its expected value under a null reference distribution, and returns the number of clusters that maximizes this statistic:
1. cluster the data with different numbers of clusters k = 1, ..., kmax and compute the total WSS ( $W_k$ )
2. generate B reference data sets with a random uniform distribution; cluster each of them as well and compute $W_{kb}$
3. compute the gap statistic as
$$\text{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log(W_{kb}) - \log(W_k)$$
and the standard deviation of the statistics, $s_k$
4. choose the smallest number of clusters k for which (a short sketch of this rule follows below):
$$\text{Gap}(k) \geq \text{Gap}(k+1) - s_{k+1}$$
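
As a sketch of this selection rule in R (using `cluster::clusGap()`, which is not used elsewhere in these slides, so the exact call is an assumption based on the cluster package API):

```{r echo=TRUE, eval=FALSE}
library(cluster)
set.seed(123)
# B = 50 reference sets keeps it fast; a larger B (e.g. 500) is more stable
gap <- clusGap(df, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
# smallest k with Gap(k) >= Gap(k+1) - s(k+1): the "Tibs2001SEmax" rule
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
```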

1. fviz_nbclust() (factoextra)
It can be used to compute the three different methods [elbow, silhouette and gap statistic] for any partitioning clustering method.
2. NbClust() (NbClust)
It provides 30 indices for determining the relevant number of clusters and proposes the best clustering scheme to the user from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
It can simultaneously compute all the indices and determine the number of clusters in a single function call.
---

### Example: USArrests data
```{r echo=FALSE}
#install.packages("NbClust")
library(factoextra)
library(NbClust)
# Standardize the data
df <- scale(USArrests)
set.seed(123)
```
```{r echo=TRUE}
head(df)
```
---
```{r echo=TRUE}
# Elbow method
fviz_nbclust(df, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")
```
---
```{r echo=TRUE}
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")
```
---
```{r echo=TRUE, out.height="450px"}
# Gap statistic
# nboot = 50 to keep the function speedy; recommended value: nboot = 500
# verbose = FALSE to hide computing progression.
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat",
             nboot = 50)+labs(subtitle = "Gap statistic method")
```
???
NbClust(data = NULL, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 15, method = NULL)
- data: matrix
- diss: dissimilarity matrix to be used. By default diss = NULL, but if it is replaced by a dissimilarity matrix, distance should be NULL
- distance: the distance measure to be used to compute the dissimilarity matrix (euclidean, manhattan or NULL)
- min.nc, max.nc: minimal and maximal number of clusters, respectively
- method: the cluster analysis method to be used, including ward.D, ward.D2, single, complete, average, kmeans and more.
To compute NbClust() for kmeans, use method = "kmeans".
To compute NbClust() for hierarchical clustering, method should be one of c("ward.D", "ward.D2", "single", "complete", "average").
---
```{r echo=FALSE, include=FALSE}
library("NbClust")
library("factoextra")
nb <- NbClust(df, distance = "euclidean", min.nc = 2,
              max.nc = 10, method = "kmeans")
```

```{r echo=TRUE, out.height="200px"}
fviz_nbclust(nb)
```
---
### Cluster Validation Statistics
#### 1. Internal cluster validation
- uses only internal information from the clustering process
- estimates the number of clusters and the most suitable algorithm
- Silhouette coefficient
- Dunn index
---
#### 2. External cluster validation
- compares the results of a cluster analysis with an externally known result
- used to choose the best algorithm for a given data set
- Rand index
- Meila's variation of information (VI) index

#### 3. Relative cluster validation
- evaluates the clustering by varying the parameters of the same algorithm
- used to determine the optimal number of clusters

???
Silhouette coefficient
- measures how well a data set is clustered and estimates the average distance between clusters (the standard formula is given below)
- it plots, for each point of a cluster, its distance to the points in the neighboring clusters
The Dunn index is maximal when the data has been clustered optimally.
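
For reference, the silhouette width of an observation $i$ is usually defined as (a standard definition added here, not part of the original notes):

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where $a(i)$ is the average distance from $i$ to the other points of its own cluster and $b(i)$ is the average distance from $i$ to the points of the nearest neighboring cluster.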

---

```{r echo=F}
library(factoextra)
library(fpc)
library(NbClust)
### Data preparation
# Excluding the column "Species" at position 5
df <- iris[, -5]
# Standardize
df <- scale(df)
```
##### To illustrate, we will use the iris data set
```{r eval=F}
# K-means clustering
km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE)
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", ellipse.type = "norm",
             palette = "jco", ggtheme = theme_minimal())
# Hierarchical clustering
hc.res <- eclust(df, "hclust", k = 3, hc_metric = "euclidean",
                 hc_method = "ward.D2", graph = FALSE)
# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
          palette = "jco", as.ggplot = TRUE)
```
```{r echo=T, eval=T, fig.show='hold', out.width="50%"}
# K-means clustering & visualize
km.res <- eclust(df, "kmeans", k = 3, nstart = 25, graph = FALSE)
fviz_cluster(km.res, geom = "point", ellipse.type = "norm",
             palette = "jco", ggtheme = theme_minimal())
# Hierarchical clustering & visualize
hc.res <- eclust(df, "hclust", k = 3, hc_metric = "euclidean",
                 hc_method = "ward.D2", graph = FALSE)
fviz_dend(hc.res, show_labels = FALSE, as.ggplot = TRUE, palette = "jco")
```
---
```{r echo=T}
# Silhouette plot
# fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
# Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
```
---
```{r echo=T}
# Average silhouette width of each cluster
silinfo$clus.avg.widths
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
# The size of each cluster
km.res$size
# Silhouette width of each observation
sil <- km.res$silinfo$widths[, 1:3]
# Objects with negative silhouette width
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
```
???
cluster.stats(d = NULL, clustering, alt.clustering = NULL)
- d: a distance object between cases, as generated by the dist() function
- clustering: vector containing the cluster number of each observation
- alt.clustering: vector such as for clustering, indicating an alternative clustering

The function cluster.stats() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:
- cluster.number: number of clusters
- cluster.size: vector containing the number of points in each cluster
- average.distance, median.distance: vector containing the cluster-wise within-cluster average/median distances
- average.between: average distance between clusters. We want it to be as large as possible
- average.within: average distance within clusters. We want it to be as small as possible
- clus.avg.silwidths: vector of cluster average silhouette widths. Recall that the silhouette width is also an estimate of the average distance between clusters. Its value lies between -1 and 1, with a value of 1 indicating a very good cluster.
- within.cluster.ss: a generalization of the within-clusters sum of squares (the k-means objective function), which is obtained if d is a Euclidean distance matrix.
- dunn, dunn2: Dunn index
- corrected.rand, vi: two indices for assessing the similarity of two clusterings: the corrected Rand index and Meila's VI
---

### Dunn Index
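For reference, the Dunn index is commonly defined as the ratio between the smallest between-cluster separation and the largest within-cluster diameter (a standard definition added here, not part of the original slides):

$$D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)},$$

so larger values indicate compact, well-separated clusters.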
```{r echo=T}
# Statistics for k-means clustering
km_stats <- cluster.stats(dist(df), km.res$cluster)
# Dunn index
km_stats$dunn
km_stats
```
---
### External cluster validation
```{r echo=T}
table(iris$Species, km.res$cluster)
```
- All setosa species (n = 50) have been classified in cluster 1.
- A large number of versicolor species (n = 39) have been classified in cluster 3; some of them (n = 11) have been classified in cluster 2.
- A large number of virginica species (n = 36) have been classified in cluster 2; some of them (n = 14) have been classified in cluster 3.
```{r echo=T}
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(df), species, km.res$cluster)
# Corrected Rand index
# Measures the similarity between two partitions (the clustering and the
# split by species) and takes values in [-1, 1]. The closer the value is
# to 1, the better the clustering.
clust_stats$corrected.rand
```
---
```{r echo=T}
# VI
clust_stats$vi
```
The same analysis can be carried out with other clustering algorithms, such as hierarchical clustering or PAM (Partitioning Around Medoids).
##### PAM
```{r echo=T}
# Agreement between species and PAM clusters
pam.res <- eclust(df, "pam", k = 3, graph = FALSE)
table(iris$Species, pam.res$cluster)
cluster.stats(d = dist(iris.scaled),
              species, pam.res$cluster)$vi
```
---
##### HC
```{r echo=T}
# Agreement between species and HC clusters
res.hc <- eclust(df, "hclust", k = 3, graph = FALSE)
table(iris$Species, res.hc$cluster)
cluster.stats(d = dist(iris.scaled),
              species, res.hc$cluster)$vi
```
---
## Choosing the best clustering algorithm
To compare several clustering algorithms simultaneously, we can use clValid.
clValid compares clustering algorithms using two different kinds of measures for validating the results:

1. Internal measures
2. Stability measures

### 1. Internal measures
These use intrinsic information about the clustering, such as:

- Connectivity
- Silhouette coefficient
- Dunn index

---
### 2. Stability measures
These evaluate the consistency of the final clustering by comparing it with the clusterings obtained after removing one column at a time.
The stability measures are:
- APN (the average proportion of non-overlap) --> values in [0, 1], where low values indicate high consistency. APN is the average proportion of observations that were NOT placed IN THE SAME CLUSTER by the clustering based on the full data and by the clusterings based on the data with one column removed at a time.
- AD (the average distance) --> values in [0, ∞), where low values are desirable. AD is the average distance between the observations placed IN THE SAME CLUSTER in both cases (full data and data with one column removed).

- ADM (the average distance between means) --> values in [0, 1], with low values desirable. ADM is the average distance between the cluster centers of the observations placed IN THE SAME CLUSTER (in both cases).
- FOM (the figure of merit) --> values in [0, 1], where low values indicate high consistency. FOM is the average intra-cluster variance of the removed column, where the clustering is based on the remaining columns.
---
```{r echo=T}
library(clValid)
# Iris data set:
# - Remove Species column and scale
df <- scale(iris[, -5])
# Compute clValid
clmethods <- c("hierarchical","kmeans","pam")
intern <- clValid(df, nClust = 2:6,
                  clMethods = clmethods, validation = "internal")
# Summary
summary(intern)
```
---
## R code for the clValid function
```{r echo=T, eval=F}
clValid(obj, nClust, clMethods = "hierarchical",
        validation = "stability", maxitems = 600,
        metric = "euclidean", method = "average")
```
Parameters:
- obj: the data to be clustered
- nClust: the number of clusters to evaluate
- clMethods: the clustering method used
- validation: the type of validation used (internal, stability or biological)
- maxitems: the maximum number of items that can be clustered
- metric: the metric used ("euclidean", "correlation", "manhattan")
- method: [hierarchical clustering only] the agglomeration method used ("ward", "single", "complete")
---
### The stability measures can be computed as follows:
```{r echo=T, eval=T}
# Stability measures
clmethods <- c("hierarchical","kmeans","pam")
stab <- clValid(df, nClust = 2:6, clMethods = clmethods,
                validation = "stability")
# Display only optimal scores
optimalScores(stab)
```
---
# Computing p-values for hierarchical clustering
We use pvclust, which applies the following algorithm:

1. Generate thousands of bootstrap samples by randomly sampling elements of the data.

2. Compute hierarchical clustering on each bootstrap copy.

3. For each cluster:
- compute the bootstrap probability (BP) value, which corresponds to the frequency with which the cluster is identified in the bootstrap copies.
- compute the approximately unbiased (AU) probability values (p-values) by multiscale bootstrap resampling.
---
# The pvclust function
```{r echo=T, eval=F}
pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)
```
The pvclust function performs the clustering on the columns of the data set.
Parameters:
- data: the data matrix used
- method.hclust: the agglomeration method used. Possible values:
    + average --> default
    + ward
    + single
    + complete
    + mcquitty
    + median
    + centroid
- method.dist: the distance measure used:
    + correlation
    + uncentered
    + abscor
    + euclidean
    + manhattan
- nboot: the number of bootstrap replications. Default 1000
- iseed: an integer used for generating random seeds. It should be set if we want to be able to reproduce the results later.
---
# Example of using pvclust()
```{r echo=T, eval=F}
library(pvclust)
set.seed(123)
res.pv <- pvclust(df, method.dist = "cor",
                  method.hclust = "average", nboot = 10)
# Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)
```