CSCI 385 - Lab 7
Markov Models


Materials

Markov Chains

When modern browsers like Chrome visit a webpage, there is sometimes an option to automatically translate the text into English from another language. How does the browser detect other languages? One way is to pass the text of the webpage through prebuilt Markov chains for many languages, and determine which language has the highest probability of generating this text.

Download the four classic books above, and learn the probabilities for a Markov chain that models the adjacent letter patterns found in each of the given languages. Ignore the header and footer text in the files, only focus on the actual book text. Your states in the Markov chain will be each possible letter in any of the four languages and the space character. The Markov chain should be fully connected, so that any state can be followed by any other state. Use the concept of a pseudocount to initialize every possible observation to be 1, then increment these counts for the adjacent pairs found in the text, and finally normalize your counts to be probabilities. For simplicity, ignore all punctuation and capitalization. You will need to incorporate log probabilities.

Use your learned Markov Chains to predict the language used in the following phrases. Clearly show the probabilities found for each of the four possible languages.

Hidden Markov Models

We can also analyze languages and uncover syntactic properties using HMMs. Use the text from Alice in Wonderland and the Baum-Welsh Algorithm to learn the parameters of a Hidden Markov Model with four hidden states. Again, treat each letter as a state, ignore punctuation and capitalization, but retain spaces. Initialize your transition and emission probabilities randomly as discussed in class. You will need to incorporate log probabilities. Treat each line of the text as a separate sequence for training purposes.

What do you notice about the transition and emission probabilities of the states you learned? Do they align with anything in English?

Use the Viterbi algorithm to determine the most probable path for the following phrases:

Food for Thought

Ideas similar to what we have examined in this lab can help us understand relationships between different languages, and recover information from dying languages or interstellar communication.