CSC277 - Project 2
Bioinformatics Scavenger Hunt


Green Fluorescent Protein

One of the most well-studied proteins in molecular biology is the green fluorescent protein, or GFP. For this project, you will visit popular online biological database websites and gather information on GFP.

GenBank:

  1. GenBank is a database of nucleotide sequences. It can be accessed at the NCBI (National Center for Biotechnology Information) website at http://www.ncbi.nlm.nih.gov/.  In the search pull-down menu at the top, make sure "Nucleotide" is selected.  In the search text box at the top of the screen, type "GFP" and click the "Go" button.
  2. This search will bring up over 11,201 results.  To narrow the search, click on "Limits" just to the right of the box where you typed "GFP".  Limit the Search Field Tag to "Gene Name" (in the dropdown box) and click the "Go" button again.  You will now have approximately 152 results.  Go to page 3 of the list (you will have to click "next" a few times; the "next" link appears to the right).
  3. The 58th and 59th entries, M62653 and M62654, are from a seminal 1992 paper.  Click on M62653, look over the GenBank record, and answer the following questions:
    1. How long is the nucleotide sequence?
    2. How many "guanines" appear in the gene's DNA sequence?  Is there a bias towards any particular nucleotide?
    3. What is the Latin name of the organism whose DNA was sequenced for this GFP?
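
Optional: if you would rather not count nucleotides by hand, the tally for questions 1 and 2 can be sketched in a few lines of Python. This is only one possible approach. The function name base_composition and the short sequence below are made up for illustration and are not the M62653 sequence. GenBank's ORIGIN block prints the sequence in lowercase with position numbers and spaces, so the sketch keeps letters only before counting.

```python
from collections import Counter


def base_composition(sequence):
    """Return (length, counts) for a DNA sequence.

    Digits, spaces, and line breaks (as found in a pasted ORIGIN block)
    are dropped, and the remaining letters are upper-cased before counting.
    """
    cleaned = "".join(ch for ch in sequence if ch.isalpha()).upper()
    return len(cleaned), Counter(cleaned)


if __name__ == "__main__":
    # Made-up 12-base example in ORIGIN-like layout; NOT the real record.
    example = "        1 atggcgtaac gt"
    length, counts = base_composition(example)
    print("length:", length)
    for base in "ACGT":
        # Show each base's count and its share of the whole sequence.
        print(f"{base}: {counts[base]} ({counts[base] / length:.1%})")
```

To use it on the real record, paste the ORIGIN block from the M62653 page in place of the example string. Roughly equal shares for A, C, G, and T suggest no bias; a clearly larger share for one base suggests a bias towards it.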

UniProt (Swiss-Prot):

  1. UniProt is a database of amino acid sequences that can be accessed at http://www.uniprot.org/.  At the UniProt homepage, type GFP and click the Search button.  Find the entry GFP_AEQVI (P42212), which is in the Swiss-Prot section (not the TrEMBL section).
  2. Examine the web page for this protein, and answer the following:
    1. How many references are cited?
    2. This Swiss-Prot record has links to other databases.  Pfam (Protein Families) is a database of multiple alignments. Pfam accession numbers begin with the letters PF, followed by five digits (e.g. PF12345).  What is the Pfam accession number for GFP_AEQVI? (NOTE: An accession number is simply a tag that you can use to refer to a particular item in a database.  Many of the databases you will use will have accession numbers.  There is no standard formatting for accession numbers across databases.)
  3. The UniProt database is available via FTP.  To see the data in its textual format (i.e. what you get when you download it via FTP), scroll back to the top of the GFP_AEQVI web page, and click the link that says "TEXT".  Answer the following questions:
    1. The first two letters on each line identify what kind of line it is (e.g. ID = Identifier, DT = date, etc.)  Find the line that has the Latin name for the species.   What two letters appear at the beginning of the line?
    2. What two symbols, which appear on a line by themselves at the bottom of the file, indicate the end of the record for GFP_AEQVI?  (In the ftp file that you can download, these symbols are the "record separators".)
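
Optional: the two formatting rules described in this section (Pfam accessions are PF plus five digits, and each line of the text format starts with a two-letter code) can be sketched in Python as below. The function names is_pfam_accession and group_by_line_code are hypothetical, and the sample record is a made-up placeholder that uses only the ID and DT codes mentioned above, so it does not give away any answers.

```python
import re
from collections import defaultdict

# "PF" followed by exactly five digits, e.g. PF12345.
PFAM_PATTERN = re.compile(r"PF\d{5}")


def is_pfam_accession(text):
    """Return True if text has the form of a Pfam accession number."""
    return PFAM_PATTERN.fullmatch(text) is not None


def group_by_line_code(record_text):
    """Group the lines of a flat-file record by their first two characters."""
    lines_by_code = defaultdict(list)
    for line in record_text.splitlines():
        if len(line) < 2:
            continue  # skip blank or one-character lines
        code, content = line[:2], line[2:].strip()
        lines_by_code[code].append(content)
    return dict(lines_by_code)


if __name__ == "__main__":
    # Made-up placeholder record; NOT the real GFP_AEQVI entry.
    sample = "ID   EXAMPLE_ENTRY\nDT   01-JAN-2000\nDT   02-FEB-2001"
    print(is_pfam_accession("PF12345"))
    for code, contents in group_by_line_code(sample).items():
        print(code, contents)
```

Running group_by_line_code on the real text file lists every two-letter code in the record, which is a quick way to check your answers to the questions above.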

Protein Data Bank:

  1. The PDB (Protein Data Bank) is a database of protein structures at http://www.rcsb.org/pdb. From that page, enter "1EMB" and click the "Search" button. Click on the result. Then click on the "Display Files" link (on the top right). Then, click on the link to display the structure file in PDB file format complete with coordinates as HTML (PDB File).
  2. In this file the majority of lines are "ATOM" lines.  Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1908).  Answer the following questions:
    1. What kind of atom is #16? (3rd column)
    2. What kind of amino acid is atom #16 in? (4th column)
    3. What are the (x,y,z) coordinates of atom #16?
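
Optional: ATOM lines in the PDB file format are fixed-width, so a program should read each field by character position rather than by splitting on spaces. The sketch below shows one way to pull out the fields asked about above (atom number, atom name, residue name, and coordinates). The function name parse_atom_line is hypothetical, and the sample line is made up for illustration; it is not taken from 1EMB.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width PDB ATOM line.

    Column ranges follow the PDB format: serial number in columns 7-11,
    atom name in 13-16, residue name in 18-20, and x, y, z in
    columns 31-38, 39-46, and 47-54.
    """
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM line")
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "xyz": (
            float(line[30:38]),
            float(line[38:46]),
            float(line[46:54]),
        ),
    }


if __name__ == "__main__":
    # Made-up ATOM line for illustration; NOT copied from the 1EMB file.
    sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
    print(parse_atom_line(sample))
```

To check your answers, copy the line for atom #16 from the 1EMB file and pass it to parse_atom_line in place of the sample.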