CSCI 385 - Lab 4
K-means Clustering and PCA
Dataset of top 50 downloaded books from Project Gutenberg
Description
For this lab, you will be analyzing text using the unsupervised learning techniques
of K-means clustering and Principal Component Analysis.
You will be working individually on this assignment.
Specifically, you should create an annotated IPython notebook that performs the following steps.
- Load in the 50 books provided above from Project Gutenberg.
- Calculate the 20 most frequent words in the whole corpus. Be sure to clean up the
data (standardize capitalization, remove punctuation, etc.). Regular
expressions might help.
- Create a 20-dimensional vector for each book based on the normalized frequency
of the top 20 words for each book.
- Cluster these books using K-means clustering, determining the best K by creating
an elbow chart for K from 2 to 10.
- Make sure to note which books go into which clusters. You can find the title for each
book near the beginning of each text file. Do these clusters make sense
based on what you can find out about these books?
- Perform Principal Component Analysis on the data vectors.
- Plot the data using the first two principal components, and color the data according
to the clusters found above.
- Examine the first two principal components, and determine which words were the most
important factors in these new transformed dimensions.
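The steps above can be sketched end-to-end as follows. This is a minimal, hedged outline, not the required solution: the three short strings stand in for the 50 downloaded Gutenberg texts, N=5 top words stands in for the required 20, and the K range is truncated to keep the toy run small (use 2 to 10 in the lab).

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Tiny stand-in corpus; in the real lab, read each downloaded
# Project Gutenberg text file here instead.
books = {
    "Book A": "The cat sat on the mat. The cat was happy!",
    "Book B": "A dog ran in the park. The dog saw the cat.",
    "Book C": "The mat was old, and the park was green.",
}

def tokenize(text):
    # Standardize capitalization and strip punctuation with a regex.
    return re.findall(r"[a-z']+", text.lower())

# Top-N most frequent words across the whole corpus (N=20 in the lab).
N = 5
corpus_counts = Counter()
for text in books.values():
    corpus_counts.update(tokenize(text))
top_words = [w for w, _ in corpus_counts.most_common(N)]

# One N-dimensional vector per book: normalized frequency of each top word.
vectors = []
for text in books.values():
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    vectors.append([counts[w] / total for w in top_words])
X = np.array(vectors)

# Elbow chart data: fit K-means over a range of K and record inertia.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 4)}  # use range(2, 11) in the lab

# Cluster with a chosen K, then project onto two principal components.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

# Loadings: which top words weigh most heavily in each component.
pc1_loadings = dict(zip(top_words, pca.components_[0]))
```

For the elbow chart, plot `inertias` (K on the x-axis, inertia on the y-axis) and look for the bend; for the final figure, scatter `X2[:, 0]` against `X2[:, 1]` colored by `kmeans.labels_`, e.g. with matplotlib's `plt.scatter(X2[:, 0], X2[:, 1], c=kmeans.labels_)`.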
Successfully completing the above steps will earn you a 90% on this assignment.
For a 100%, download a book not listed in the top 50 from Project Gutenberg and determine its
cluster membership. Project this book into the new axes orientation produced by PCA and
plot it, noting its location.
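One way to approach this extra-credit step: featurize the new book with the same top-20 word list, then reuse the already-fitted K-means and PCA objects via `predict` and `transform`. The sketch below uses random stand-in vectors and a made-up word list in place of the real fitted state from your notebook; everything named here is an assumption for illustration.

```python
import re
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def vectorize(text, top_words):
    # Same featurization used for the original books: normalized
    # frequency of each corpus-wide top word.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return np.array([counts[w] / total for w in top_words])

# Stand-ins for the fitted state from the main analysis.
top_words = ["the", "and", "of", "to", "a"]
X = np.random.default_rng(0).random((10, 5))  # placeholder book vectors
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

# A new book downloaded separately from Project Gutenberg.
new_text = "The wind and the rain came to the valley."
v = vectorize(new_text, top_words).reshape(1, -1)

cluster = kmeans.predict(v)[0]  # cluster membership of the new book
point = pca.transform(v)[0]     # its location in the PCA plane
```

Plotting `point` on top of the existing PCA scatter (e.g. with a distinct marker) then shows where the new book lands relative to the clusters.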
For funsies, remove the stop words from the corpus first, and compare your analysis.
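Stop-word removal can be dropped into the tokenization step. The small hand-picked list below is a stand-in; in practice you might use NLTK's stopwords corpus or scikit-learn's built-in English stop-word list instead.

```python
import re
from collections import Counter

# Small stand-in stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "on"}

def tokenize(text, drop_stop_words=False):
    words = re.findall(r"[a-z']+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

text = "The cat sat on the mat, and the cat was happy."
with_stops = Counter(tokenize(text))
without_stops = Counter(tokenize(text, drop_stop_words=True))
```

Recomputing the top-20 words on the filtered corpus typically replaces function words like "the" with more content-bearing words, which can change both the clusters and the PCA loadings.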
What to Turn In
This assignment is due at 5 pm on Wednesday, September 24th in Moodle.
Use Moodle to turn in your IPython notebook.