CSC207 - Lab 12
Gibberish Generator


Overview

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
This assignment explores the use of the dictionary object to generate pronounceable nonsense words, known as pseudowords or logatomes, similar to those found in "Jabberwocky" by Lewis Carroll. We will build three probabilistic models of word creation that become progressively more accurate.

Naive Design

As a first approach, we can select characters from the alphabet uniformly at random. To determine how many letters should be in each word, write a function called word_length to calculate a histogram of word lengths based on a comprehensive list of words in the English language. This function should return a dictionary with integer lengths as keys and frequency counts as values. If you plot this histogram, you will see that word lengths are not uniformly distributed.
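
A minimal sketch of word_length, assuming the word list has already been read into a Python list of strings (the file name words.txt below is only a placeholder for whatever list you are given):

    def word_length(words):
        """Return a histogram mapping word length -> frequency count."""
        hist = {}
        for word in words:
            hist[len(word)] = hist.get(len(word), 0) + 1
        return hist

    # Example usage, assuming a one-word-per-line file named words.txt:
    # words = open("words.txt").read().split()
    # print(word_length(["cat", "dog", "bird"]))  # {3: 2, 4: 1}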

Next, write a second function called proportionalprob that takes a histogram as a parameter and returns a key chosen at random with probability proportional to its count. For example, if words of length 4 make up 17% of the word list, your function should return 4 about 17% of the time.
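
One way to sketch this, though not the only valid approach, is to pick a random integer up to the total count and walk through the histogram until the running sum passes it:

    import random

    def proportionalprob(hist):
        """Return a key from hist with probability proportional to its count."""
        pick = random.randint(1, sum(hist.values()))
        running = 0
        for key, count in hist.items():
            running += count
            if pick <= running:
                return key

In Python 3.6 and later, random.choices(list(hist), weights=list(hist.values()))[0] achieves the same effect in one line.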

Finally, write a function called nonsense that takes a count of how many words to generate. Calculate the word length frequency dictionary once, then for each word, randomly select a length with proportionalprob and fill the word with characters chosen uniformly at random.
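
A sketch of nonsense building on the two functions above; passing the word list in as a second parameter is an assumption (a global list would also work):

    import random
    import string

    def nonsense(count, words):
        """Generate `count` random words with realistic lengths."""
        # uses word_length and proportionalprob from above
        lengths = word_length(words)          # compute the histogram once
        results = []
        for _ in range(count):
            n = proportionalprob(lengths)     # realistic length
            letters = [random.choice(string.ascii_lowercase) for _ in range(n)]
            results.append("".join(letters))
        return results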

Letter Frequency

The words generated naively above are still unpronounceable. Our second approach is to change the letter selection from uniform to proportional as well. Recall that the frequency of letters in English is not uniform: E is the most frequent letter, followed by T, A, O, I, N, and so on. Write a function called letter_freq that calculates a histogram dictionary with letters as keys and letter frequency counts as values.
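
A sketch of letter_freq, following the same pattern as word_length:

    def letter_freq(words):
        """Return a histogram mapping letter -> frequency count."""
        hist = {}
        for word in words:
            for letter in word:
                hist[letter] = hist.get(letter, 0) + 1
        return hist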

Revise your naive function above to use the proportional probability function twice: once for the word length, and then once for each letter, to generate the words.
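
The revision is small: swap the uniform character choice for a second call to proportionalprob. One possible sketch, reusing the earlier functions:

    def nonsense(count, words):
        """Generate `count` words with realistic lengths and letter frequencies."""
        lengths = word_length(words)
        letters = letter_freq(words)
        results = []
        for _ in range(count):
            n = proportionalprob(lengths)                       # proportional length
            results.append("".join(proportionalprob(letters)    # proportional letters
                                   for _ in range(n)))
        return results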

Markov Chains

We are getting closer to pronounceability, but to really make the right kind of words, we need to account for the arrangement of the letters as well as their frequency. For example, the letter T is much more likely to be followed by an H than by a Z. To capture this, we build a Markov chain that generates the next letter in a sequence based on the previous letters.

We start with a Markov chain of order 1; the order of the chain is how many previous letters to use. An order-1 chain can be pictured as a graph in which each letter is connected to every other letter, including itself, and each edge carries the probability of choosing the connected letter next. If we also include "\n" as a possible character, our words will naturally terminate.

Write a function markov that takes a word list and an order n. This function will create a dictionary with the previous n letters of a word as the key and a second dictionary as the value. These inner dictionaries are just like the letter frequency dictionaries from before and can be used in the same way to generate the next letter. Revise your generator above to use this dictionary to produce words. When you generate a "\n" character, the current word is complete; start the next one.
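
A sketch of markov, plus one hedged way to generate from it; the name markov_nonsense is ours, and seeding each word with the first n letters of a random dictionary word is an assumption, since the assignment does not prescribe how generation should start:

    import random

    def markov(words, n):
        """Map each n-letter context to a histogram of next characters."""
        chain = {}
        for word in words:
            padded = word + "\n"                  # "\n" marks the end of a word
            for i in range(len(padded) - n):
                context = padded[i:i + n]
                following = padded[i + n]
                inner = chain.setdefault(context, {})
                inner[following] = inner.get(following, 0) + 1
        return chain

    def markov_nonsense(count, words, n):
        """Generate `count` words from an order-n Markov chain."""
        # reuses proportionalprob from earlier in the lab
        chain = markov(words, n)
        starts = [w[:n] for w in words if len(w) >= n]   # word-initial contexts
        results = []
        for _ in range(count):
            word = random.choice(starts)
            letter = proportionalprob(chain[word[-n:]])
            while letter != "\n":
                word += letter
                letter = proportionalprob(chain[word[-n:]])
            results.append(word)
        return results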

Experiment with your function to find the order that best creates nonsense words. What happens when the order is very large?

Spells

As a final task, replace the English word list with a Latin one to generate nonsense spells, such as "patuscus immoveo" or "stulor oportus."
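
If the Latin word list comes in the same one-word-per-line format, only the input file changes (latin.txt below is a placeholder name):

    latin = open("latin.txt").read().split()
    print(" ".join(markov_nonsense(2, latin, 2)))   # a two-word "spell"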
© Mark Goadrich 2010, Centenary College of Louisiana