kmeans <datafile> <k>
Your input vector data will be in a standard data format, where the first line of the file will be the number of genes, followed by the number of experiments per gene on the second line, and followed finally by the data.
The greedy algorithm discussed in class for K-means is as follows:
kmeans(dataset, k): Read in data vectors Randomly select k data points to be centers of clusters While the clusters have changed composition: Asign every data point to it's nearest cluster center (euclidean distance) Recalculate the center for each cluster as the average of all points in that cluster
The goal of K-means is to minimize the squared error distortion, which for a set V of data points and X of centers, is defined as
for the center X closest to each vi.
Run your algorithm for various values of k and plot the resulting squared error distortion versus the choice of k. What do you believe is the optimal number of clusters?
Turn in your code and sample output from testing in your csc277 directory on the cs.centenary.edu server in a project 5 folder.