CSCI 385 - Lab 4
K-means Clustering and PCA
Dataset of top 50 downloaded books from Project Gutenberg
Description
For this lab, you will be analyzing text using the unsupervised learning techniques
of K-means clustering and Principal Component Analysis.
You will be working individually on this assignment.
Specifically, you should create an annotated IPython notebook that performs the following steps.
- Load in the 50 books provided above from Project Gutenberg.
- Calculate the 20 most frequent words in the whole corpus. Be sure to clean up the
data (standardize capitalization, remove punctuation, etc.). Regular
expressions might help.
- Create a 20-dimensional vector for each book based on the normalized frequency
of the top 20 words for each book.
- Cluster these books using K-means clustering, determining the best K by creating
an elbow chart for K from 2 to 10.
- Make sure to note which books go into which clusters. You can find the title for each
book near the beginning of each text file. Do these clusters make sense
based on what you can find out about these books?
- Perform Principal Component Analysis on the data vectors.
- Plot the data using the first two principal components, and color the data according
to the clusters found above.
- Examine the first two principal components, and determine which words were the most
important factors in these new transformed dimensions.
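The steps above can be sketched end-to-end as follows. This is a minimal, hedged outline, not the required solution: the three short strings stand in for the 50 downloaded Gutenberg texts, N=5 top words stands in for the required 20, and the K range is truncated to keep the toy run small (use 2 to 10 in the lab).

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Tiny stand-in corpus; in the real lab, read each downloaded
# Project Gutenberg text file here instead.
books = {
    "Book A": "The cat sat on the mat. The cat was happy!",
    "Book B": "A dog ran in the park. The dog saw the cat.",
    "Book C": "The mat was old, and the park was green.",
}

def tokenize(text):
    # Standardize capitalization and strip punctuation with a regex.
    return re.findall(r"[a-z']+", text.lower())

# Top-N most frequent words across the whole corpus (N=20 in the lab).
N = 5
corpus_counts = Counter()
for text in books.values():
    corpus_counts.update(tokenize(text))
top_words = [w for w, _ in corpus_counts.most_common(N)]

# One N-dimensional vector per book: normalized frequency of each top word.
vectors = []
for text in books.values():
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    vectors.append([counts[w] / total for w in top_words])
X = np.array(vectors)

# Elbow chart data: fit K-means over a range of K and record inertia.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 4)}  # use range(2, 11) in the lab

# Cluster with a chosen K, then project onto two principal components.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

# Loadings: which top words weigh most heavily in each component.
pc1_loadings = dict(zip(top_words, pca.components_[0]))
```

For the elbow chart, plot `inertias` (K on the x-axis, inertia on the y-axis) and look for the bend; for the final figure, scatter `X2[:, 0]` against `X2[:, 1]` colored by `kmeans.labels_`, e.g. with matplotlib's `plt.scatter(X2[:, 0], X2[:, 1], c=kmeans.labels_)`.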
Successfully completing the above steps will earn you a 90% on this assignment.
For a 100%, download a book not listed in the top 50 from Project Gutenberg and determine its
cluster membership. Project this book into the new axes orientation produced by PCA and
plot it, noting its location.
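One way to approach this extra-credit step: featurize the new book with the same top-20 word list, then reuse the already-fitted K-means and PCA objects via `predict` and `transform`. The sketch below uses random stand-in vectors and a made-up word list in place of the real fitted state from your notebook; everything named here is an assumption for illustration.

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def vectorize(text, top_words):
    # Same featurization used for the original books: normalized
    # frequency of each corpus-wide top word.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return np.array([counts[w] / total for w in top_words])

# Stand-ins for the fitted state from the main analysis.
top_words = ["the", "and", "of", "to", "a"]
X = np.random.default_rng(0).random((10, 5))  # placeholder book vectors
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

# A new book downloaded separately from Project Gutenberg.
new_text = "The wind and the rain came to the valley."
v = vectorize(new_text, top_words).reshape(1, -1)

cluster = kmeans.predict(v)[0]  # cluster membership of the new book
point = pca.transform(v)[0]     # its location in the PCA plane
```

Plotting `point` on top of the existing PCA scatter (e.g. with a distinct marker) then shows where the new book lands relative to the clusters.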
For funsies, remove the stop words from the corpus first, and compare your analysis.
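Stop-word removal can be dropped into the tokenization step. The small hand-picked list below is a stand-in; in practice you might use NLTK's stopwords corpus or scikit-learn's built-in English stop-word list instead.

```python
import re
from collections import Counter

# Small stand-in stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "on"}

def tokenize(text, drop_stop_words=False):
    words = re.findall(r"[a-z']+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

text = "The cat sat on the mat, and the cat was happy."
with_stops = Counter(tokenize(text))
without_stops = Counter(tokenize(text, drop_stop_words=True))
```

Recomputing the top-20 words on the filtered corpus typically replaces function words like "the" with more content-bearing words, which can change both the clusters and the PCA loadings.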
What to Turn In
This assignment is due at 5 pm on Wednesday, September 24th in Moodle.
Use Moodle to turn in your IPython notebook.