Advertisement
Guest User

Untitled

a guest
Nov 20th, 2014
212
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 15.41 KB | None | 0 0
  1. """Classes to perform KMeans Clustering"""
  2.  
  3. import math
  4. import random
  5. import numpy
  6.  
  7. # HELPER FUNCTIONS FOR ASSERTS GO HERE
  8. def is_point(thelist):
  9. """Return: True if thelist is a list of int or float"""
  10. if (type(thelist) != list):
  11. return False
  12.  
  13. # All float
  14. okay = True
  15. for x in thelist:
  16. if (not type(x) in [int,float]):
  17. okay = False
  18.  
  19. return okay
  20.  
  21.  
  22. # CLASSES
  23. class Dataset(object):
  24. """Instance is a dataset for k-means clustering.
  25.  
  26. The data is stored as a list of list of numbers
  27. (ints or floats). Each component list is a data point.
  28.  
  29. Instance Attributes:
  30. _dimension [int > 0. Value never changes after initialization]:
  31. the point dimension for this dataset
  32. _contents [a 2D list of numbers (float or int), possibly empty]:
  33. the dataset contents
  34.  
  35. Additional Invariants:
  36. The number of columns in _contents is equal to _dimension. That is,
  37. for every item _contents[i] in the list _contents,
  38. len(_contents[i]) == dimension.
  39.  
  40. None of the attributes should be accessed directly outside of the class
  41. Dataset (e.g. in the methods of class Cluster or KMeans). Instead, this class
  42. has getter and setter style methods (with the appropriate preconditions) for
  43. modifying these values.
  44. """
  45.  
  46. def __init__(self, dim, contents=None):
  47. """Initializer: Makes a database for the given point dimension.
  48.  
  49. The parameter dim is the initial value for attribute _dimension.
  50.  
  51. The optional parameter contents is the initial value of the
  52. attribute _contents. When assigning contents to the attribute
  53. _contents it COPIES the list contents. If contents is None, the
  54. initializer assigns _contents an empty list. The parameter contents
  55. is None by default.
  56.  
  57. Precondition: dim is an int > 0. contents is either None or
  58. it is a 2D list of numbers (int or float). If contents is not None,
  59. then contents if not empty and the number of columns of contents is
  60. equal to dim.
  61. """
  62.  
  63. assert type(dim) is int and dim > 0
  64. self._contents = []
  65. if contents is not None:
  66. for x in contents:
  67. assert len(x) is dim
  68. self._contents.append(x[:])
  69.  
  70. self._dimension = dim
  71.  
  72.  
  73. def getDimension(self):
  74. """Return: The point dimension of this data set.
  75. """
  76. return self._dimension
  77.  
  78. def getSize(self):
  79. """Return: the number of elements in this data set.
  80. """
  81. return len(self._contents)
  82.  
  83. def getContents(self):
  84. """Return: The contents of this data set as a list.
  85.  
  86. This method returns the attribute _contents directly. Any changes
  87. made to this list will modify the data set. If you want to access
  88. the data set, but want to protect yourself from modifying the data,
  89. use getPoint() instead.
  90. """
  91. return self._contents
  92.  
  93. def getPoint(self, i):
  94. """Returns: A COPY of the point at index i in this data set.
  95.  
  96. Often, we want to access a point in the data set, but we want a copy
  97. to make sure that we do not accidentally modify the data set. That
  98. is the purpose of this method.
  99.  
  100. If you actually want to modify the data set, use the method getContents().
  101. That returns the list storing the data set, and any changes to that
  102. list will alter the data set.
  103. While it is possible, to access the points of the data set via
  104. the method getContents(), that method
  105.  
  106. Precondition: i is an int that refers to a valid position in the data
  107. set (e.g. i is between 0 and getSize()).
  108. """
  109. assert type(i) == int
  110. assert i >= 0 and i < self.getSize()
  111.  
  112. copy = self._contents[:]
  113. return self._contents[i][:]
  114.  
  115. def addPoint(self,point):
  116. """Adds a COPY of point at the end of _contents.
  117.  
  118. This method does not add the point directly. It adds a copy of the point.
  119.  
  120. Precondition: point is a list of numbers (int or float),
  121. len(point) = _dimension.
  122. """
  123. assert type(point) == list
  124. assert is_point(point)
  125. assert len(point) is self._dimension
  126.  
  127. return self._contents.append(point[:])
  128.  
  129. # PROVIDED METHODS: Do not modify!
  130. def __str__(self):
  131. """Returns: String representation of the centroid of this cluster."""
  132. return str(self._contents)
  133.  
  134. def __repr__(self):
  135. """Returns: Unambiguous representation of this cluster. """
  136. return str(self.__class__) + str(self)
  137.  
  138.  
  139. class Cluster(object):
  140. """An instance is a cluster, a subset of the points in a dataset.
  141.  
  142. A cluster is represented as a list of integers that give the indices
  143. in the dataset of the points contained in the cluster. For instance,
  144. a cluster consisting of the points with indices 0, 4, and 5 in the
  145. dataset's data array would be represented by the index list [0,4,5].
  146.  
  147. A cluster instance also contains a centroid that is used as part of
  148. the k-means algorithm. This centroid is an n-D point (where n is
  149. the dimension of the dataset), represented as a list of n numbers,
  150. not as an index into the dataset. (This is because the centroid
  151. is generally not a point in the dataset, but rather is usually in between
  152. the data points.)
  153.  
  154. Instance attributes:
  155. _dataset [Dataset]: the dataset this cluster is a subset of
  156. _indices [list of int]: the indices of this cluster's points in the dataset
  157. _centroid [list of numbers]: the centroid of this cluster
  158. Extra Invariants:
  159. len(_centroid) == _dataset.getDimension()
  160. 0 <= _indices[i] < _dataset.getSize(), for all 0 <= i < len(_indices)
  161. """
  162.  
  163. # Part A
  164. def __init__(self, ds, centroid):
  165. """A new empty cluster whose centroid is a copy of <centroid> for the
  166. given dataset ds.
  167.  
  168. Pre: ds is an instance of a subclass of Dataset.
  169. centroid is a list of ds.getDimension() numbers.
  170. """
  171. assert isinstance(ds,Dataset)
  172. assert type(centroid) == list and len(centroid) == ds.getDimension()
  173.  
  174. self._dataset = ds
  175. self._centroid = centroid[:]
  176. self._indices = []
  177.  
  178. def getCentroid(self):
  179. """Returns: the centroid of this cluster.
  180.  
  181. This getter method is to protect access to the centroid.
  182. """
  183. return self._centroid
  184.  
  185. def getIndices(self):
  186. """Returns: the indices of points in this cluster
  187.  
  188. This method returns the attribute _indices directly. Any changes
  189. made to this list will modify the cluster.
  190. """
  191. return self._indices
  192.  
  193.  
  194. def addIndex(self, index):
  195. """Add the given dataset index to this cluster.
  196.  
  197. If the index is already in this cluster, this method leaves the
  198. cluster unchanged.
  199.  
  200. Precondition: index is a valid index into this cluster's dataset.
  201. That is, index is an int in the range 0.._dataset.getSize().
  202. """
  203. assert type(index) == int
  204. assert 0 <= index < self._dataset.getSize()
  205.  
  206. i = 0
  207. while i<len(self._indices):
  208. if self._indices[i] == index:
  209. return self._indices
  210. else:
  211. i = i+ 1
  212. return self._indices.append(index)
  213.  
  214. def clear(self):
  215. """Remove all points from this cluster, but leave the centroid unchanged.
  216. """
  217. self._indices[:] = []
  218. return self._indices
  219.  
  220. def getContents(self):
  221. """Return: a new list containing copies of the points in this cluster.
  222.  
  223. The result is a list of list of numbers. It has to be computed from
  224. the indices.
  225. """
  226. contents = []
  227. for x in self._indices:
  228. contents.append(self._dataset.getPoint(x))
  229. return contents
  230.  
  231.  
  232.  
  233. # Part B
  234. def distance(self, point):
  235. """Return: The euclidean distance from point to this cluster's centroid.
  236.  
  237. Pre: point is a list of numbers (int or float)
  238. len(point) = _ds.getDimension()
  239. """
  240. assert type(point) == list
  241. assert is_point(point)
  242. assert len(point) == self._dataset.getDimension()
  243.  
  244. a = numpy.array(point)
  245. b = numpy.array(self._centroid)
  246. distance = numpy.linalg.norm(a-b)
  247.  
  248. return float(distance)
  249.  
  250.  
  251. def updateCentroid(self):
  252. """Returns: Trues if the centroid remains the same after recomputation,
  253. and False otherwise.
  254.  
  255. This method recomputes the _centroid attribute of this cluster. The
  256. new _centroid attribute is the average of the points of _contents
  257. (To average a point, average each coordinate separately).
  258.  
  259. Whether the centroid "remained the same" after recomputation is
  260. determined by numpy.allclose. The return value should be interpreted
  261. as an indication of whether the starting centroid was a "stable"
  262. position or not.
  263.  
  264. If there are no points in the cluster, the centroid. does not change.
  265. """
  266.  
  267. s = 0 #sum of coordinate
  268. new_centroid = []
  269. for n in range(0, len(self._centroid)):
  270. for x in self._indices:
  271. s += self._dataset.getPoint(x)[n]
  272. s /= len(self._indices)
  273. new_centroid.append(s)
  274. s = 0
  275.  
  276. if numpy.allclose(self._centroid, new_centroid):
  277. return True
  278. else:
  279. self._centroid = new_centroid[:]
  280. return False
  281.  
  282. # PROVIDED METHODS: Do not modify!
  283. def __str__(self):
  284. """Returns: String representation of the centroid of this cluster."""
  285. return str(self._centroid)
  286.  
  287. def __repr__(self):
  288. """Returns: Unambiguous representation of this cluster. """
  289. return str(self.__class__) + str(self)
  290.  
  291.  
  292. class ClusterGroup(object):
  293. """An instance is a set of clusters of the points in a dataset.
  294.  
  295. Instance attributes:
  296. _dataset [Dataset]: the dataset which this is a clustering of
  297. _clusters [list of Cluster]: the clusters in this clustering (not empty)
  298. """
  299.  
  300. # Part A
  301. def __init__(self, ds, k, seed_inds=None):
  302. """A clustering of the dataset ds into k clusters.
  303.  
  304. The clusters are initialized by randomly selecting k different points
  305. from the database to be the centroids of the clusters. If seed_inds
  306. is supplied, it is a list of indices into the dataset that specifies
  307. which points should be the initial cluster centroids.
  308.  
  309. Pre: ds is an instance of a subclass of Dataset.
  310. k is an int, 0 < k <= ds.getSize().
  311. seed_inds is None, or a list of k valid indices into ds.
  312. """
  313. assert isinstance(ds,Dataset)
  314. assert type(k)== int
  315. assert (k > 0) and (k <= ds.getSize())
  316. assert (seed_inds == None) or (type(seed_inds) == list and len(seed_inds) == k) #assert that the elements inside the list are ints
  317. if seed_inds == list:
  318. for x in seed_inds:
  319. assert 0 <= x < ds.getSize()
  320.  
  321. self._dataset = ds
  322. self._clusters = []
  323.  
  324. if seed_inds == None:
  325. simplelist = random.sample(ds._contents,k) #gives us the centroids
  326. for x in simplelist:
  327. clust = Cluster(ds,x)
  328. self._clusters.append(clust)
  329. else:
  330. for x in seed_inds:
  331. clust1 = Cluster(ds,Dataset.getPoint(self._dataset,x))
  332. self._clusters.append(clust1)
  333.  
  334. #ryl34 helped with this function
  335.  
  336. def getClusters(self):
  337. """Return: The list of clusters in this object.
  338.  
  339. This method returns the attribute _clusters directly. Any changes
  340. made to this list will modify the set of clusters.
  341. """
  342. return self._clusters
  343.  
  344. # Part B
  345. def _nearest_cluster(self, point):
  346. """Returns: Cluster nearest to point
  347.  
  348. This method uses the distance method of each Cluster to compute
  349. the distance between point and the cluster centroid. It returns
  350. the Cluster that is the closest.
  351.  
  352. Ties are broken in favor of clusters occurring earlier in the
  353. list of self._clusters.
  354.  
  355. Pre: point is a list of numbers (int or float),
  356. len(point) = self._dataset.getDimension().
  357. """
  358. assert type(point) == list
  359. assert is_point(point)
  360. assert len(point) == self._dataset.getDimension()
  361.  
  362. somecluster = self._clusters #cluster thing
  363. near = somecluster[0] #one cluster in somecluster
  364. odist = near.distance(point)#name of centroid
  365.  
  366. for x in somecluster:
  367. distspace = x.distance(point)
  368. if distspace < odist:
  369. near = x
  370. return near
  371.  
  372.  
  373. def _partition(self):
  374. """Repartition the dataset so each point is in exactly one Cluster.
  375. """
  376. # First, clear each cluster of its points. Then, for each point in the
  377. # dataset, find the nearest cluster and add the point to that cluster.
  378.  
  379. for x in range(len(self._clusters)):
  380. self._clusters[x] = self._clusters[x].clear()
  381.  
  382. print self._dataset._contents[0]
  383.  
  384. for point in self._dataset._contents:
  385. nearpoint = self._nearest_cluster([0.0,0.0])
  386. self._clusters.append(nearpoint)
  387.  
  388. # Part C
  389. def _update(self):
  390. """Return:True if all centroids are unchanged after an update; False otherwise.
  391.  
  392. This method first updates the centroids of all clusters'. When it is done, it
  393. checks whether any of them have changed. It then returns the appropriate value.
  394. """
  395. # IMPLEMENT ME
  396. pass
  397.  
  398. def step(self):
  399. """Return: True if the algorithm converges after one step; False otherwise.
  400.  
  401. This method performs one cycle of the k-means algorithm. It then checks if
  402. the algorithm has converged and returns the appropriate value.
  403. """
  404. # In a cycle, we partition the points and then update the means.
  405. # IMPLEMENT ME
  406. pass
  407.  
  408. # Part D
  409. def run(self, maxstep):
  410. """Continue clustering until either it converges or maxstep steps
  411. (which ever comes first).
  412.  
  413. Precondition: maxstep is int >= 0.
  414. """
  415. assert type(maxstep) == int and maxstep > 0
  416. # Call step repeatedly, up to maxstep times, until the algorithm
  417. # converges. Stop after maxstep iterations even if the algorithm has not
  418. # converged.
  419.  
  420. # IMPLEMENT ME
  421. pass
  422.  
  423. # PROVIDED METHODS: Do not modify!
  424. def __str__(self):
  425. """Returns: String representation of the centroid of this cluster."""
  426. return str(self._clusters)
  427.  
  428. def __repr__(self):
  429. """Returns: Unambiguous representation of this cluster. """
  430. return str(self.__class__) + str(self)
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement