One thread common to all life on Earth is deoxyribonucleic acid (DNA). DNA is composed of long strands of four basic molecules, called bases: Adenine, Cytosine, Guanine and Thymine. Two strands of these bases coil together to form a double-helix structure, and we have 23 pairs of such structures, called chromosomes, in nearly every cell of our body. The bases have the interesting property of pairing up across the double helix, such that Adenine always pairs with Thymine, and Cytosine always pairs with Guanine. For our purposes as computer scientists, we can abbreviate these bases as A, C, G, and T.
DNA contains segments called genes, which detail how every piece of our cells should work. A gene becomes a protein through a process known as the central dogma of molecular biology: DNA is transcribed into RNA (ribonucleic acid), and RNA is then translated into proteins. Proteins are the machinery that makes our cells work; they give the cell its structure, transport molecules around the cell, and carry out most of its other tasks. It is this process of DNA to RNA to proteins that we will study in this lab, using Python and its string functions to analyze small pieces of DNA. The process may seem a bit complex, but with Python we can make very fast progress.
s = "catgcttcgcataacatgactgct"
Depending on the database we use to find our DNA sequences, some may be in lowercase, while others are all capital letters. To make our lives easier, we can convert any string to all uppercase by calling its upper() method, s.upper() (older Python 2 code wrote this as string.upper(s), using the string module), which eliminates any variation from the source.
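As a minimal sketch of this step, assuming a Python 3 interpreter, the conversion looks like the following. The name s_upper is only an illustration and is not part of the lab.

```python
# Sample strand from the text, in lowercase as some databases supply it.
s = "catgcttcgcataacatgactgct"

# str.upper() returns a new string; the original s is left unchanged.
s_upper = s.upper()
print(s_upper)
```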
G-C Content
After capitalization, we now have
s = "CATGCTTCGCATAACATGACTGCT"
One way to determine if there might be a gene hiding in a strand of DNA is to examine its G-C content. Because of how genes are encoded, most genes have a balanced number of Gs and Cs in relation to the number of As and Ts. To determine the G-C content of a DNA strand, we add up the number of Gs and Cs we find, and divide this by the total number of bases in the string. In this case, there are 11 Gs and Cs out of 24 total bases, giving us 45.83%. Most genes have between 25% and 75% Gs and Cs, so this gives our strand a good chance of containing a gene. Python lets us count the bases with the string method s.count(substring) (string.count(string, substring) in older Python 2 code).
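The calculation can be sketched as below. The helper name gc_content is hypothetical, and the code assumes Python 3, where / is true division.

```python
def gc_content(dna):
    """Return the fraction of bases in dna that are G or C."""
    dna = dna.upper()
    if not dna:
        return 0.0  # avoid dividing by zero on an empty strand
    return (dna.count("G") + dna.count("C")) / len(dna)

s = "CATGCTTCGCATAACATGACTGCT"
fraction = gc_content(s)
print(f"{fraction:.2%}")  # 11 of 24 bases, printed as 45.83%
```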
Transcription from DNA to RNA
Our next step is to transcribe the DNA into RNA. RNA is a single-stranded molecule, as opposed to the double-stranded DNA. This means it cannot replicate itself as DNA does, but RNA still plays many useful roles in cell function. Primarily, it is RNA that gets translated into proteins, so keeping these steps separate means the DNA itself does not have to do all the work. In the transcription from DNA to RNA, every Thymine molecule is replaced with Uracil. So when manipulating these strings in Python, we can replace T with U using the string method s.replace(old, new) (string.replace(string, find, replace) in older Python 2 code). This gives us the RNA strand
r = "CAUGCUUCGCAUAACAUGACUGCU"
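A minimal sketch of the transcription step, assuming Python 3; the function name transcribe is an illustration only.

```python
def transcribe(dna):
    """Return the RNA string for dna by replacing every T with U."""
    return dna.upper().replace("T", "U")

s = "CATGCTTCGCATAACATGACTGCT"
r = transcribe(s)
print(r)
```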
Translation from RNA to proteins
With our strand now in RNA form, we are ready to make the final step to proteins. This works by reading the genetic code of RNA. Just as words are composed of individual letters, genes are divided up into segments called codons. A codon consists of three sequential bases, so with 4 choices for each base (A, C, G, or U), this gives us 4 × 4 × 4 = 64 different combinations. Each of these 64 possibilities translates into one of 20 amino acids or a stop signal (redundancy is built into this system to decrease the effects of random mutations), and it is the chain of these amino acids that ultimately folds up to form a protein. Not all of the bases in a strand of RNA will encode for a gene; in fact, only 10% - 20% of our DNA is currently known to encode genes, and the other 80% - 90% of our genome is mostly unexplained and unexplored for different modes of functionality.
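The count of 64 can be checked by listing every possible codon. This is only a sketch, assuming Python 3 and the standard itertools module; the names bases and codons are illustrative.

```python
from itertools import product

bases = "ACGU"

# Every ordered choice of three bases, joined into a three-letter string.
codons = ["".join(triple) for triple in product(bases, repeat=3)]
print(len(codons))
```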
When you are reading a strand of RNA, there are two important locations to find: where to start reading and where to stop reading. There are actually three different ways to divide a strand into codons. We could start with the first base at index 0, such that our codons are the substrings 0:3, 3:6, 6:9, etc. Starting with index 1 makes our substrings 1:4, 4:7, 7:10, etc., and starting with index 2 gives us 2:5, 5:8, 8:11, etc.
These initial indices (0, 1, 2) define the reading frame for this gene.
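The three reading frames can be sketched with string slicing. The helper codons_in_frame is hypothetical, and it drops any leftover bases at the end that do not fill a whole codon.

```python
def codons_in_frame(rna, frame):
    """Split rna into complete three-base codons starting at index frame."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]

r = "CAUGCUUCGCAUAACAUGACUGCU"
for frame in (0, 1, 2):
    print(frame, codons_in_frame(r, frame))
```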
In English we start every sentence with a capital letter, and stop each sentence with a period. RNA signals the start of a gene by designating one particular codon, AUG, as the start codon. For our purposes, wherever you find AUG, this marks the start of a gene; this also defines our reading frame, which can be found by taking the index of the start codon modulo 3. In our sample sequence above, we can use the string method r.find(substring) (string.find(string, substring) in older Python 2 code) to locate the index of "AUG", which is 1, and determine that the reading frame is 1 % 3 = 1.
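Locating the start codon can be sketched as follows, assuming Python 3. The function name find_start is illustrative; note that str.find returns -1 when the substring is absent, so that case is handled explicitly.

```python
def find_start(rna):
    """Return (index, reading_frame) of the first AUG, or None if absent."""
    index = rna.find("AUG")
    if index == -1:
        return None
    return index, index % 3

r = "CAUGCUUCGCAUAACAUGACUGCU"
print(find_start(r))  # (1, 1)
```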
For robustness, there are three codons which tell us when to stop translating RNA into amino acids: UAG, UGA and UAA. As soon as any one of these three stop codons is found in the same reading frame after the start codon, translation stops and the protein is complete. In our RNA strand, we first find UAA at index 11; however, 11 % 3 = 2, so it is not in the same reading frame as the start codon. Our next instance of a stop codon is UGA at index 16, and since 16 % 3 = 1, we have found our stop codon. This means that the gene to be translated, the bases between the start and stop codons, is
gene = "CUUCGCAUAACA"
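Putting the start and stop rules together, one possible sketch is shown below. The name find_gene and the choice to return None when no in-frame stop codon exists are assumptions for illustration, not part of the lab.

```python
STOP_CODONS = ("UAG", "UGA", "UAA")

def find_gene(rna):
    """Return the bases between the first AUG and the next in-frame stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return None  # no start codon at all
    # Step three bases at a time so we stay in the start codon's reading frame.
    for i in range(start + 3, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start + 3:i]
    return None  # start codon found, but no in-frame stop codon

r = "CAUGCUUCGCAUAACAUGACUGCU"
gene = find_gene(r)
print(gene)  # CUUCGCAUAACA
```

Stepping by three from the start codon skips the out-of-frame UAA at index 11 automatically, so no separate modulo check is needed.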
We will leave the final translation step into the actual amino acids for a later project. For now, it will suffice to find the start and stop codons within our RNA strand.
Download histone_h4.txt to your working Python directory (the default where your files are saved). This is a relatively small section of DNA (most genes are between 1000 and 10,000 bases long) that encodes Histone H4. Histone H4 plays a major role in how the DNA strands wind up to form chromosomes, and this gene is highly conserved across almost all eukaryotic organisms (those which have a nucleus).
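Reading the downloaded file into a single string might look like the sketch below. It assumes the file holds only bases and whitespace (no header line), which has not been checked against the actual file; load_dna is a hypothetical helper name.

```python
import os

def load_dna(path):
    """Read a text file of bases and return one uppercase string."""
    with open(path) as handle:
        # split() with no arguments drops newlines, spaces and tabs.
        return "".join(handle.read().split()).upper()

# Only attempt the load if the file is in the working directory.
if os.path.exists("histone_h4.txt"):
    dna = load_dna("histone_h4.txt")
    print(len(dna), "bases")
```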
Perform the following steps on the DNA strand, and record the results in lab3_evaluation.txt:
Connect to cs.centenary.edu through either Secure FTP or WinSCP using your cs login and password. Create a subdirectory of csc104 called lab3 and give it read, write and execute privileges for the owner only. Write up your results from step 2 in a file called lab3_evaluation.txt and copy it into this directory.
You must hand in: