
My project idea is a variation of the k-means algorithm that uses a multiplicative (ratio-based) distance in place of Euclidean distance. Because such a distance is not sensitive to the scale of the individual features, there is no need to scale the features beforehand. This approach might also mitigate the effects of the curse of dimensionality.
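The proposal does not pin down a formula for the multiplicative distance, so the following is only a minimal sketch under one assumption: the distance is the Euclidean norm of the per-feature log-ratios. The function name `multiplicative_distance` and the sample values are hypothetical, and all feature values are assumed to be strictly positive. The sketch shows that changing the units of one feature leaves this distance unchanged, while the Euclidean distance changes a lot.

```python
import math

def multiplicative_distance(a, b):
    """Hypothetical log-ratio distance between two strictly positive vectors."""
    # Each feature contributes log(a_i / b_i), which depends only on the ratio,
    # so rescaling a feature in both points by the same constant changes nothing.
    return math.sqrt(sum(math.log(x / y) ** 2 for x, y in zip(a, b)))

def euclidean_distance(a, b):
    """Ordinary Euclidean distance, for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up points: (elevation in meters, slope in degrees).
p, q = [2000.0, 10.0], [3000.0, 20.0]
# The same two points with elevation expressed in kilometers.
p_km, q_km = [2.0, 10.0], [3.0, 20.0]

d_mult = multiplicative_distance(p, q)
d_mult_km = multiplicative_distance(p_km, q_km)   # same as d_mult
d_euc = euclidean_distance(p, q)
d_euc_km = euclidean_distance(p_km, q_km)         # very different from d_euc
```

A caveat of this particular choice is that it is undefined for zero or negative values, so such attributes would need separate handling.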

Non-Euclidean distances normally don’t work with k-means, but that is because the standard centroid, the arithmetic mean, is derived from Euclidean distance: it is the point that minimizes the sum of squared Euclidean distances to the cluster’s members. This approach also computes the centroid using geometric relationships, so the centroid stays consistent with the geometric distance.
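One way to read "computing the centroid using geometric relationships" is to take the per-feature geometric mean, which minimizes the sum of squared log-ratios in the same way the arithmetic mean minimizes the sum of squared differences. The sketch below assumes that reading and strictly positive values; `geometric_centroid` and `log_ratio_cost` are hypothetical helper names, and the cluster values are made up.

```python
import math

def geometric_centroid(points):
    """Per-feature geometric mean of strictly positive points."""
    n = len(points)
    # exp(mean(log x)) is the geometric mean, computed in log space
    # so that products of many values do not overflow.
    return [
        math.exp(sum(math.log(v) for v in column) / n)
        for column in zip(*points)
    ]

def log_ratio_cost(points, center):
    """Sum of squared log-ratios between every point and a candidate center."""
    return sum(
        math.log(v / c) ** 2
        for point in points
        for v, c in zip(point, center)
    )

cluster = [[1.0, 10.0], [4.0, 1000.0]]        # made-up cluster members
center = geometric_centroid(cluster)           # approximately [2.0, 100.0]
arithmetic = [sum(col) / len(cluster) for col in zip(*cluster)]  # [2.5, 505.0]

# Under the log-ratio cost, the geometric centroid does better than the
# arithmetic mean for this cluster.
geometric_is_better = log_ratio_cost(cluster, center) < log_ratio_cost(cluster, arithmetic)
```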

This approach would essentially convert each data value into the ratio between that data point’s attribute value and the geometric mean of that attribute over the whole dataset. The distance between points is then based on these ratios rather than on Euclidean distance between the raw values.
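The conversion described above can be sketched as follows. This is an illustration, not a fixed design: `to_geometric_ratios` is a hypothetical name, the data is made up, and every value is assumed to be strictly positive so that the geometric mean is defined.

```python
import math

def to_geometric_ratios(rows):
    """Divide every value by the geometric mean of its column (values must be > 0)."""
    n = len(rows)
    # Geometric mean of each attribute over the whole dataset.
    geo_means = [
        math.exp(sum(math.log(v) for v in column) / n)
        for column in zip(*rows)
    ]
    return [[v / g for v, g in zip(row, geo_means)] for row in rows]

data = [[1.0, 10.0], [4.0, 1000.0]]            # made-up dataset, two attributes
ratios = to_geometric_ratios(data)             # approximately [[0.5, 0.1], [2.0, 10.0]]

# Rescaling the first attribute (e.g. a change of units) gives the same ratios,
# because the geometric mean of that attribute is rescaled by the same factor.
rescaled = [[row[0] * 1000.0, row[1]] for row in data]
ratios_rescaled = to_geometric_ratios(rescaled)
```

After this conversion every attribute has a geometric mean of 1, so a value of 2.0 reads as "twice the typical value" regardless of the original units.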


For my dataset, I am thinking of using UCI’s forest cover type dataset, https://archive.ics.uci.edu/ml/datasets/Covertype. I chose it because it has a moderate number of attributes, the attributes use many different scales, and the clustering results are easy to verify: the dataset includes the actual cover type of each sample, which I can compare my clusters against.
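The proposal does not say how the clusters would be compared with the cover types. One simple option, assumed here purely for illustration, is cluster purity: the fraction of samples whose cover type matches the most common cover type in their cluster. The function name `cluster_purity` and the tiny label lists are hypothetical, not taken from the dataset.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples whose label equals the majority label of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        # Count how often each true label appears inside each cluster.
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(true_labels)

cluster_ids = [0, 0, 0, 1, 1, 1]   # made-up cluster assignments
cover_types = [1, 1, 2, 2, 2, 2]   # made-up cover type labels
purity = cluster_purity(cluster_ids, cover_types)   # 5 of 6 samples match
```

Purity tends to rise as the number of clusters grows, so it is most meaningful when the number of clusters is fixed in advance, for instance to the number of cover types.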