CSC277 - Project 4
CpG Islands


Overview

CpG Islands are genomic regions that contain a high frequency of Cytosine nucleotides next to Guanine nucleotides. They are usually associated with the promoter region of genes. We can help find them by creating a hidden markov model and using the Viterbi algorithm to find the most probable path through the hidden states. For an interesting recent article on finding both CpG-related and non-CpG-related promoter regions in order to locate first exons, see Davuluri R. V., Grosse I., Zhang M.Q. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001 Dec;29(4):412-7.

Algorithm

Write a program to calculate the most likely location of CpG islands using a Hidden Markov Model.

viterbi <transitions> <emissions> <sequences>

Your input sequences will be in FASTA format, and your adjacency matrix can be found in transitions.txt. There are 8 states, one for A, C, T, and G in the island, and a, c, t, and g not inside the island. The emmision probabilities will be 100% for the letter of the associated state, and 0% for the other three letters.

Assume there is a silent starting state with equal probability to transition to the 8 internal states. Also, be sure to use log probabilities in your calculations to prevent underflow (when the probabilities become too small to store in the computer).

Helpful Data Structures

You may reuse many of your data structures from Project 3. Namely, you will need both a score matrix and an arrow matrix to generate the path.

Output

The output of your program should be the resulting predictions for the states.

Testing

Test you above algorithm on the nine sequences found in the cpgisland.fasta

Turn in your code and sample output from testing in your csc277 directory on the cs.centenary.edu server.