CSC104 - Lab 4
Strands of DNA

Assigned September 16th 2 p.m.
Due September 18th 2 p.m.

Overview

DNA molecules provide the basis for all known life on Earth. In this lab we will explore the String data type by analyzing and manipulating strands of DNA. I first provide a brief overview of DNA and RNA, followed by steps in Python to carry out these molecular processes.

Materials

String Methods in Python
histone_h4.txt
Lab Partner

Description

The search for intelligent life in the universe, according to some conservative estimates for the Drake equation, may take a very long time before we are successful. It is much more likely that we will first find evidence of life at a much more basic level. Astrobiology is the study of life itself and where it might exist in the universe, intelligent or not. By studying life on Earth and the conditions under which it survives, be they normal temperate climates or harsh extremes where only Extremophiles can be found, we can begin to reason about other locations in our solar system and galaxy that might also harbor life.

One thread common to all life on Earth is Deoxyribonucleic acid (DNA). DNA is composed of long strands of four basic molecules: Adenine, Cytosine, Guanine and Thymine. These four bases coil together to form a double-helix structure, and we have 23 such structures called chromosomes in every cell of our body. The bases have the interesting property of pairing up across the double-helix, such that Adenine always matches Thymine, and Cytosine always matches Guanine. For our purposes as computer scientists, we can abbreviate these bases as A, C, G, and T.

DNA contains segments called genes detailing how every piece of our cells should work. A gene becomes a protein through a process known as the Central dogma of molecular biology: DNA is transcribed into RNA (Ribonucleic Acid) and then translated into proteins. Proteins are the machinery that makes our cells work; they form the cell walls, transport molecules around the cell, and in general make everything work. It is this process of DNA to RNA to Proteins that we will study in this lab, using Python and String functions to analyze small pieces of DNA. The process may seem a bit complex, but with Python we can make very fast progress.

Step 1

We'll start with a small example to get familiar with how to analyze these DNA strings with Python. Our sample DNA for these first experiments will be

s = "catgcttcgcataacatgactgct"

Depending on the database we use to find our DNA sequences, some may be in lowercase, while others are all capital letters. To make our life easier, we can convert any string to all uppercase by using the s.upper() method of a string s to eliminate any variations from the source.

G-C Content

After capitalization, we now have

s = "CATGCTTCGCATAACATGACTGCT"

One way to determine if there might be a gene hiding in a strand of DNA is to examine it's G-C content. Because of how genes are encoded, most genes have a balanced number of Gs and Cs in relation to the number of As and Ts. To determine the G-C content of a DNA strand, we can add up the number of Gs and Cs we find, and divide this by the total number of bases in the string. In this case, there would be 11 Gs and Cs out of 24 total bases, giving us 45.83%. Most genes have between 25% and 75% Gs and Cs, so this gives our strand a good shot of being a gene. Python lets us do this with the s.count(substring) method.

Transcription from DNA to RNA

Our next step is to transcribe the DNA into RNA. RNA is a single-stranded molecule as opposed to the double-stranded DNA. This means it cannot replicate itself as does DNA, but RNA still plays many useful roles in cell function. Primarily, it is RNA that gets translated into proteins, so having these steps be separate eases the burden on DNA to process all the work. In the transcription from DNA to RNA, every Thymine molecule is replace with Uracil. So when manipulating these strings in Python, we can replace T with U using the s.replace(find, replace) method. This gives us the RNA strand

r = "CAUGCUUCGCAUAACAUGACUGCU"

Translation from RNA to proteins

With our strand now in RNA form, we are ready to make the final step to proteins. This works by reading the Genetic Code of RNA. Just like words are composed of individual letters, genes are divided up into segments called codons. A codon consists of three sequential bases, so with 4 choices for each base (from A, C, G, and U), this gives us 4 X 4 X 4 = 64 different combinations. Each of these 64 possibilities translates into one of 20 amino acids (redundancy is built into this system to decrease the effects of random mutations), and it is the conjunction of these amino acids that ultimately fold up and create proteins. Not all of the bases in a strand of RNA will encode for a gene, and actually, only 10% - 20% of our current DNA is known to encode genes; the other 80% - 90% of our genome is mostly unexplained and unexplored for different modes of functionality.

When you are reading a strand of RNA, there are two important locations to find: where to start reading and where to stop reading. There are actually three different ways to change a strand into amino acids. We could start with the first base at index 0, such that our codons are substrings from 0:3, 3:6, 6:9, etc. Staring with index 1 makes our substrings 1:4, 4:7, 7:10, etc, and starting with index 2 gives us 2:5, 5:8, 8:11, etc.

These initial indices (0, 1, 2) define the reading frame for this gene. In English we start every sentence with a capital letter, and stop each sentence with a period. RNA uses an interesting way of signaling the start of a gene by designating one particular codon, AUG, as the start codon. For our purposes, wherever you find AUG, this marks the start of a gene; this also defines our reading frame, which can be found by modding the initial index of the start codon by 3. In our sample sequence above, we can use the method s.find(substring) to locate the index for "AUG", and determine that the reading frame is 1 % 3 = 1.

For robustness, there are three codons which tell us when to stop translating RNA into amino acids: UAG, UGA and UAA. As soon as any one of these three stop codons is found in the same reading frame after the start codon, translation stops and the protein is complete. In our RNA strand, we first find UAA at index 11, however 11 % 3 = 2 and is not in the same reading frame as the start codon. Our next instance of a stop codon is at index 16, and since 16 % 3 = 1, we have found our stop codon. This means that the gene to be translated is

gene = "CUUCGCAUAACA"

We will leave the final translation step into the actual amino acids for a later project. For now, it will suffice to find the start and stop codons within our RNA strand.

Step 2

Now that you have some familiarity with the processing of DNA, we will be using the interpreter to analyze very long DNA strings.

Download histone_h4.txt to your working Python directory (the default where your files are saved). This relatively small section of DNA (most genes are between 1000 and 10,000 bases long) that encodes for Histone H4. Histone H4 plays a major role in how the DNA strands wind up to form chromosomes, and this gene is highly conserved across almost all Eukaryotic organisms (those which have a nucleus).

Perform the following steps on the DNA strand, and record the results in lab4_evaluation.txt:

Open this file through the interpreter and read in the data into a string.
Convert this string to uppercase and record the first 10 bases.
Find and record the GC content of the DNA as a percent of the overall length of the DNA.
Translate this gene into RNA. Record how many bases were changed from T to U.
Determine the reading frame for the first gene present in the RNA.
Find the segment of RNA that encodes the first gene, excluding the start and stop codon. This will take some trial and error from you; use the functions described in the string module. Remember that the stop codon must be in the same reading frame as the start codon. Record the locations of both the start and stop codon.

What to Hand In

Log in to cs.centenary.edu through either Secure FTP or WinSCP using your cs login and password. Create a subdirectory from csc207 called lab4 and make it read, write and execute privileges for only the owner. Write up your results from step 2 into a file called lab4_evaluation.txt and copy it into this directory.

You must hand in:

lab4_evaluation.txt

CSC104 - Lab 4Strands of DNA

Assigned September 16th 2 p.m. Due September 18th 2 p.m.