CSC277 - Project 5
K-Means Clustering

Assigned Mar 31st 11 a.m.
Due April 7th 9:45 a.m.


Algorithm

Write a program to calculate the clusters for a given dataset and a given value of K using the k-means algorithm:

kmeans <datafile> <k> 

Your input vector data will be in a standard data format, where the first line of the file will be the number of genes, followed by the number of experiments per gene on the second line, and followed finally by the data.

The greedy algorithm discussed in class for K-means is as follows:

kmeans(dataset, k):
    Read in data vectors
    Randomly select k data points to be centers of clusters
    While the clusters have changed composition:
    	Asign every data point to it's nearest cluster center (euclidean distance)
    	Recalculate the center for each cluster as the average of all points in that cluster

The goal of K-means is to minimize the squared error distortion, which for a set V of data points and X of centers, is defined as

for the center X closest to each vi.

Output

The output of your program should be the clusters found and their membership, as well as the squared error distortion.

Testing

Test you above algorithm on the data found in the expression.txt file provided.

Run your algorithm for various values of k and plot the resulting squared error distortion versus the choice of k. What do you believe is the optimal number of clusters?

Turn in your code and sample output from testing in your csc277 directory on the cs.centenary.edu server in a project 5 folder.