CSC277 - Project 2
Bioinformatics Scavenger Hunt
Green Fluorescent Protein
One of the most thoroughly studied proteins in molecular biology
is the green fluorescent protein, or GFP.
For this project, you will visit popular online biological database websites
and gather information on GFP.
GenBank:
- GenBank is a database of nucleotide sequences. It can be accessed at the
NCBI (National Center for Biotechnology Information) website at
http://www.ncbi.nlm.nih.gov/.
In the search pull-down menu at the top, make sure "Nucleotide" is selected.
In the search text box at the top of the screen, type "GFP" and click the "Go"
button.
- This search will bring up more than 11,000 results. To narrow the search, click on "Limits" just
to the right of the box where you typed "GFP".
Limit the Search Field Tag to "Gene Name" (in the dropdown box) and
click the "Go" button again. You will now have approximately 152 results. Go to page 3 of the list (you
will have to click "next" a few times; the "next" link
appears to the right).
- The 58th and 59th entries, M62653 and M62654, are from a seminal 1992 paper. Click on M62653,
look over the GenBank record, and answer the following questions:
- How long is the nucleotide sequence?
- How many "guanines" appear in the gene's DNA sequence? Is there a bias towards any
particular nucleotide?
- What is the Latin name of the organism whose DNA was sequenced for this GFP?
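Counting nucleotides by hand is error-prone, so a short script can help check your tally. The following is a minimal Python sketch, assuming you have copied the sequence text from the bottom of the GenBank record (position numbers and spaces included) into a string. The function name count_bases and the 20-base SAMPLE are made up for illustration and are not taken from M62653.

```python
from collections import Counter


def count_bases(sequence_text):
    """Count each nucleotide letter in GenBank-style sequence text."""
    # Keep only letters: this drops position numbers, spaces, and newlines.
    seq = "".join(ch for ch in sequence_text.lower() if ch.isalpha())
    return Counter(seq)


# Made-up 20-base fragment for illustration; not the GFP sequence.
SAMPLE = """
        1 acgtacgtaa ggttccaaag
"""

counts = count_bases(SAMPLE)
total = sum(counts.values())
for base in "acgt":
    # Print the raw count and the fraction of the whole sequence.
    print(base, counts[base], f"{counts[base] / total:.0%}")
```

Comparing the four percentages is one simple way to judge whether a sequence is biased towards a particular nucleotide.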
Swiss-Prot:
- UniProt is a database of amino acid sequences that can
be accessed at http://www.uniprot.org/. At the UniProt homepage, type
"GFP" and click the "Search" button. Find the entry in the Swiss-Prot section (not the TrEMBL section)
named GFP_AEQVI (P42212).
- Examine the web page for this protein, and answer the following:
- How many references are cited?
- This Swiss-Prot record has links to other databases. Pfam (Protein Families) is a database of multiple
sequence alignments. Pfam accession numbers begin with the letters PF, followed by
five digits (e.g. PF12345).
What is the Pfam accession number for GFP_AEQVI? (NOTE: An
accession number is simply a tag that you can use to refer to a
particular item in a database.
Many of the databases you will use have accession numbers. There is no
standard format for accession numbers across databases.)
- The UniProt database is available via FTP. To see the data in its text format (i.e. what you
get when you download it via FTP), scroll back to the top of the GFP_AEQVI web page
and click the link that says "TEXT". Answer the following
questions:
- The first two letters on each line identify what kind of line it is (e.g. ID
= Identifier, DT = date, etc.).
Find the line that has the Latin name for the species.
What two letters appear at the beginning of the line?
- What two symbols, which appear on a line by themselves at the bottom of the
file, indicate the end of the record for GFP_AEQVI? (In the FTP file that you can
download, these symbols are the "record separators".)
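Because every line in the text format starts with a two-letter code, the file is easy to process with a program. The following is a minimal Python sketch that groups lines by that code. It assumes the data on each line starts at the sixth character (the code followed by three spaces); the function name group_by_line_code and the SAMPLE fragment are made up for illustration and are not real UniProt data.

```python
from collections import defaultdict


def group_by_line_code(flat_text):
    """Map each two-letter line code to the list of data strings it carries."""
    groups = defaultdict(list)
    for line in flat_text.splitlines():
        code = line[:2]
        if not code.strip():
            continue  # skip blank lines and lines that carry no code
        # Assumption: the data begins after the code and three spaces.
        groups[code].append(line[5:].strip())
    return dict(groups)


# Made-up record fragment for illustration; not real UniProt data.
SAMPLE = """\
ID   EXAMPLE_PROT            Reviewed;          10 AA.
DT   01-JAN-2000, integrated into the database.
DT   01-JAN-2001, sequence version 1.
"""

groups = group_by_line_code(SAMPLE)
for code, values in groups.items():
    # Show each code, how many lines use it, and the first such line.
    print(code, len(values), values[0])
```

Running the same function on the text you see after clicking "TEXT" lists every line code used in the record, which is a quick way to check your answers above.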
Protein Data Bank:
- The PDB (Protein Data Bank) is a database of protein
structures at http://www.rcsb.org/pdb.
From that page, enter "1EMB" and
click the "Search" button. Click on the result. Then click on the "Display Files"
link (on the top right). Then, click on the link to display the structure file
in PDB file format complete with coordinates as HTML (PDB File).
- In this file the majority of lines are "ATOM" lines. Scroll down until you see those
lines and note how the atoms are numbered (in this case, 1 to 1908). Answer the following questions:
- What kind of atom is #16? (3rd column)
- What kind of amino acid is atom #16 in? (4th column)
- What are the (x,y,z) coordinates of atom #16?
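The ATOM lines can also be read by a program. The following is a minimal Python sketch that pulls the atom number, atom name, residue name, and coordinates out of one ATOM line. It slices by the fixed character positions that the PDB file format defines for ATOM records rather than splitting on spaces, since neighbouring fields can run together in some files. The function name parse_atom_line and the SAMPLE line are made up for illustration; the line is not taken from 1EMB.

```python
def parse_atom_line(line):
    """Extract a few fields from one PDB ATOM record by character position."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),          # atom number (columns 7-11)
        "atom_name": line[12:16].strip(),   # kind of atom (columns 13-16)
        "residue": line[17:20].strip(),     # amino acid name (columns 18-20)
        "xyz": (
            float(line[30:38]),             # x (columns 31-38)
            float(line[38:46]),             # y (columns 39-46)
            float(line[46:54]),             # z (columns 47-54)
        ),
    }


# Made-up ATOM line built with the same column widths; not taken from 1EMB.
SAMPLE = "ATOM  {:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " N", "ALA", "A", 1, 1.000, 2.500, -3.250)

atom = parse_atom_line(SAMPLE)
print(atom["serial"], atom["atom_name"], atom["residue"], atom["xyz"])
```

To check your answers, you could paste the line for atom #16 from the 1EMB file in place of SAMPLE.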