CSC277 - Project 4
CpG Islands
Overview
CpG Islands are genomic regions that contain
a high frequency of Cytosine nucleotides next to Guanine nucleotides. They are usually
associated with the promoter region of genes. We can help find them by creating a
hidden markov model and using the Viterbi algorithm to find the most probable path
through the hidden states.
For an interesting recent article on finding both
CpG-related and non-CpG-related promoter regions in order to locate first
exons, see
Davuluri R. V., Grosse I., Zhang M.Q.
Computational identification of promoters and first exons in the human genome.
Nat Genet. 2001 Dec;29(4):412-7.
Algorithm
Write a program to calculate the most likely location of CpG islands using a Hidden Markov Model.
viterbi <transitions> <emissions> <sequences>
Your input sequences will be in FASTA
format, and your adjacency matrix can be found in transitions.txt.
There are 8 states, one for A, C, T, and G in the island, and a, c, t, and g not inside the island.
The emmision probabilities will be 100% for the letter of the associated state, and 0% for the other
three letters.
Assume there is a silent starting state with equal probability to transition to
the 8 internal states. Also, be sure to use log probabilities in your calculations
to prevent underflow (when the probabilities become too small to store in the computer).
Helpful Data Structures
You may reuse many of your data structures from Project 3. Namely, you will need both a
score matrix and an arrow matrix to generate the path.
Output
The output of your program should be the resulting predictions for the states.
Testing
Test you above algorithm on the nine sequences found in the
cpgisland.fasta
Turn in your code and sample output from testing in your csc277
directory on the cs.centenary.edu server.