CSCI 385 - Lab 4
K-means Clustering and PCA


Dataset of top 50 downloaded books from Project Gutenberg

Description

For this lab, you will be analyzing text using the unsupervised learning techniques of K-means clustering and Principal Component Analysis.

You will be working individually on this assignment.

Specifically, you should create an annotated IPython notebook that performs the following steps.

Successfully completing the above steps will earn you a 90% on this assignment.

For a 100%, download a book not listed in the top 50 from Project Gutenberg and determine its cluster membership. Project this book onto the principal axes produced by PCA and plot it alongside the original books, noting its location.
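One way to approach the extra-credit step is sketched below, assuming scikit-learn and TF-IDF features; the assignment does not prescribe either. The four-document corpus, the new_book string, and the choice of 2 clusters are made-up placeholders, not the real Gutenberg data. The point to notice is that the vectorizer, K-means model, and PCA are all fit on the original corpus only and then reused to transform the new book.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the top-50 corpus (hypothetical; load the real book texts instead).
corpus = [
    "whale sea ship captain harpoon",
    "ship sea captain voyage whale",
    "ball dance marriage sister estate",
    "sister marriage dance letter ball",
]
# Stand-in for the extra book downloaded separately (hypothetical).
new_book = "captain ship sea storm"

# Fit the vocabulary on the original corpus only, then reuse it for the new book.
# Words the corpus never saw (here "storm") are simply ignored by transform().
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
x_new = vectorizer.transform([new_book]).toarray()

# Fit K-means and PCA on the original books; k=2 is an arbitrary choice for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

# Cluster membership of the new book, and its coordinates on the first two principal axes.
new_cluster = kmeans.predict(x_new)[0]
corpus_points = pca.transform(X)
new_point = pca.transform(x_new)[0]


def plot_projection(corpus_points, labels, new_point):
    """Scatter the corpus in PCA space and mark the new book with a star."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without matplotlib

    plt.scatter(corpus_points[:, 0], corpus_points[:, 1], c=labels)
    plt.scatter(new_point[0], new_point[1], marker="*", s=200, c="red")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```

In a notebook, calling plot_projection(corpus_points, kmeans.labels_, new_point) would draw the plot. Refitting PCA or K-means with the new book included would change the axes and centroids, so the sketch deliberately avoids that.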

For funsies, remove the stop words from the corpus first, and compare the results with your original analysis.
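As a rough illustration of stop-word removal, here is a minimal sketch using only the Python standard library. The STOP_WORDS set is a tiny made-up list for demonstration and the helper names are hypothetical; a real stop list is much longer (scikit-learn's TfidfVectorizer, for example, accepts stop_words="english" if you would rather not roll your own).

```python
import re
from collections import Counter

# A tiny illustrative stop list (an assumption; real lists contain hundreds of words).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "was"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]


sample = "It was the best of times, it was the worst of times"
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)

# Compare the most frequent words before and after filtering.
print(Counter(tokens).most_common(3))
print(Counter(filtered).most_common(3))
```

Running the same clustering and PCA steps on the filtered text and on the raw text side by side makes it easier to see how much the common function words were driving the original result.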

What to Turn In

This assignment is due at 5 pm on Wednesday, September 24th in Moodle.

Use Moodle to turn in your IPython notebook.