BioChem 508 Project

Travis Emmitt





Part 1 - Protein Family Membership




Protein Family: Cysteine Proteases


  1. Located the cysteine proteases family in the PFAM database (and others):

  2. Examined PFAM's multiple alignments and SwissProt's list of family members, and noticed that FASTA and other alignment-related databases use different nomenclature for sequence labeling. (i.e., GenBank and PIR numbers). There would be no easy way to compare FASTA results to the favorites of the PFAM family.



    Prepared for Searching


  3. Looked at the sequences in the PFAM multiple alignment (family4.htm) and downloaded to a local directory the corresponding SwissProt database entry for each sequence:

  4. Concatenated all of the downloaded entries into one big page and then stripped out everything except the PFAM sequence name (e.g., CATL_CHICK) and the DR fields (the ones that provide GenBank, PIR, and Prosite numbers). I updated all the URLs to absolute (non-relative) links, so that so that I could directly reference the different database entries for each of the family members.

    I now had a map of PFAM names to Genbank and PIR numbers, and was able to commence my database searching.

  5. Selected a sequence in my family (CATL_HUMAN) that aligned well with several of the other members and whose description on the ProSite family page (family2.htm) made it sound pretty representative of the family (i.e. fairly safe bet that this sequence would not end up being a borderline family member).

    Examined its graphical view via Entrez to find out what part of it was being included in the PFAM alignment.

  6. Did the same thing for another sequence (CYSP_PLAFA), one that tended to match a different subset of the family



    Sequence Similarity Searching


  7. Ran a BLAST search against the GenBank database using CATL_HUMAN's nucleotide sequence.

  8. Checked the search results against my family list using my PFAM-to-PIR map (members2.htm). Some of the time I had to look at the full concatenation (members1.htm) because many sequences' numbers in the search results were not on the SwissProt database pages.

  9. Made a table of family members and the expectation values that I got for them from the BLAST search.

  10. Ran a FASTA search against the same database (Genbank) with the same sequence (CATL_HUMAN), and updated the expectation values table accordingly. FASTA kept failing when I tried to search all GenBank databases, so I restricted the search to rodents, primates, and other mammals.

  11. Ran another FASTA search, this time using a different sequence, CYSP_PLAFA, against the PIR database.

  12. Ran another FASTA search on the PIR database, on CATL_HUMAN.

  13. Ran several other searches, trying to find more similarities between the members, but almost all the other close matches I got after that involved matches in the SwissProt database instead of GenBank and PIR.


    Distantly Related (or Unrelated) Sequence


  14. CAT3_FASHE (Putative Cthepsin L3) does not appear to be closely related to any of the other members of the cysteine protease family. Its expectation score relative to the other searches have all been worse than the suggested threshold of 10^-4.

  15. Performed a FASTA search on CAT3_FASHE. Its best expectation values were around 10^-5, which is just barely under the 10^-4 threshold. In fact, only three other sequences matched it with expectation values better than the cutoff, and those only slightly (less than an order of magnitude).

  16. Did a BLAST search on CAT3_FASHE and those scores were even worse - the best expectation value next to its own was 0.022, which is horrible.

  17. You can also tell by looking at the PFAM database's cysteine protease multiple alignments that CAT3_FASHE isn't very well aligned with most of the other sequences:

  18. Tried to run motifs on CAT3_FASHE but it couldn't find any patterns. This is probably because CAT3_FASHE is only 19 characters long.

  19. Noticed by looking at the Entrez graphical views that CAT3_FASHE doesn't have any sites to share with the rest of the sequences.

  20. Ran PRSS on CAT3_FASHE versus CATL_CHICK and got:
      Lambda: 0.24963 K: 0.16414; P(95)= 3.8128e-08
      For 500 sequences, a score >=95 is expected 1.91e-05 times
    

    This implies that CAT3_FASHE and CATL_CHICK are indeed related, and is consistent with the score that CATL_CHICK got when FASTA was run on CAT3_FASHE.

  21. Found a PubMed article which briefly mentions reason why it and other developmental proteins can be considered part of the cysteine proteases family:





    Part 2 - Multiple Sequence Alignments




    Member Selection


  22. Selected 16 "favorite" family members for the next part of the assignment. (I was going to trim down to 10 but got so deep into things that I didn't want to redo all my input and output files). I selected them based on the identity scores given by a recent FASTA search. (i.e. I looked at their identities relative to CYSP_PLAFA)



    Complete AA Sequence Alignment


  23. Made a file containing each of the favorites' amino acid sequences, which I obtained from the SwissProt database pages.

    (I later stumbled upon PFAM family PFAM-B_194, which happened to contain 7 of my 16 favorites.)

  24. Performed a multiple sequence alignment with ClustalW. I used two different online versions of ClustalW; the first one we discussed in the class, but the second one indicates conserved positions. They work a little differently when dealing with sequences whose matching sections aren't in close ordinal proximity.

  25. Performed the same alignment using PileUp (via SeqLab). I set the gap penalties to -8 (start) and -2 (continue), as suggested in class. I did not enable leading and trailing gap penalties, since for the complete sequences, those tended to be so high.

  26. MSA could only handle 3 of the 16 sequences, since it had a 800 character limit and each of the sequences was at least a couple hundred characters long. I picked three of the smallest sequences (CATL_CHICK, CATS_BOVIN, and and PAP2_CARPA), and aligned them.

  27. Downloaded and compiled treealign. I had to edit the source in order to get it to build (added a #define to rename one of his duplicate function definitions). In order to get it to run, I had to adjust my input file to a new format. I wasn't permitted to set the gap extension penalty to less than 3 (this restriction was explicitly stated in the README file).



    Partial AA Sequence Alignment


  28. Since the first and last part of the sequence alignments contained so many gaps, and since I wanted to trim these gaps before constructing the phylogenetic trees, I decided to find a portion of the alignments that was consistently matched and then just use that.

    In the ClustalW alignment, the first position by which all of the sequences had started was position 114 of CATL_HUMAN. The last position in which each all of the sequences were present was position 333 in CATL_HUMAN.

    I looked back at the original PFAM alignments (family4.htm) and saw that the span of CATL_HUMAN in that alignments was from positions 114 to 332. I got similar results for the other sequences.

    I therefore constructed another sequences file, this time containing only those positions which appeared in the PFAM alignment (family4.htm). I called this set of sequences the "partial sequences," as opposed to the complete sequences. I wanted to see how my tree- based on alignment of partial sequences differed from trees based on alignments of complete sequences.

  29. Ran ClustalW on the partial sequences and got the following results.

    Notice how the PFAM alignments cut off a significant portion of the end of the CYSP_P* sequences. This eliminated section includes the "extra" middle portion starting at around position 500 of CYSP_PLAFA. Because of this descrepancy, I decided to do trees of both complete (slow, but seemingly more accurate) and partial (fast, but missing data for CYSP_P*) sequences.

  30. Ran PileUp on the partial sequences, and got similar results.

  31. Ran MSA on the partial sequences. This time, since the sequences were smaller, it could handle 4 of them: CYSP_PLAFA, CYSP_PLAVI, CYSP_PLAVN, and PAPA_CARPA.

  32. Ran treealign on the partial sequences. This ran a lot faster than the complete sequences.



    DNA Sequence Alignment


  33. Repeated the above steps, this time using nucleotides instead of amino acids. I used BackTranslate (from the command line) to convert each of the amino acid sequences into nucleotide sequences (I used this instead of the pre-written database translations because there were multiple database back-translations and by doing it myself, I could be consistent, though maybe not historically accurate).

    Each time I ran BackTranslate, I selected the most probable sequence, and used the default codon frequency file (GenRunData:ecohigh.cod). I used Reformat to prepare my sequences before the back translation, and CompressText to remove the spaces and trailing blanks afterwards.

  34. Ran ClustalW on both sets of DNA sequences:

  35. Ran PileUp on both sets of DNA sequences. I enabled leading and trailing gap penalties for only the partial sequences, since the complete sequences had so many.

  36. Could not run MSA on any of the DNA sequences, since each was at least 400 characters long and there would be nothing to compare it against. Not even the partial DNA sequences were small enough.

  37. Ran treealign on the DNA sequences. This only worked for partial sequences - the program ran out of memory when trying to process the complete sequences.





    Part 3 - Phylogenetic Tree Construction




    Choosing an Alignment


  38. Of all the alignment methods, the best looking results were with ClustalW. PileUp had a few more gaps (especially annoying was the gap right after the start of the well-aligned middle section); MSA could only handle three sequences, so wasn't much use as far as building trees; and treealign had a large number of gaps and coudln't handled the complete DNA sequences (which I'll get to later...)


    File Format Conversions


  39. Converted ClustalW AA sequence alignments (complete and partial) into Pearson/FASTA format using ReadSeq:

  40. Converted the FASTA AA sequence alignments into DNA sequence alignments using mrtrans. I used dna1.msf and dna2.msf for my cDNA libraries. I had to reorder the cDNA library so that its sequences were in the same order as the AA alignment's sequences.

  41. Converted the FASTA DNA alignments into Phylip DNA alignments using ReadSeq.



    Parsimony Method


  42. Ran dnapars on both sets of input sequences, first as a single run, and then bootstrapped (using 100 trees). I jumbled the input 10 times for each data set. I kept my trees rootless, so did not specify an outgroup.

    Notice that the complete sequence bootstrapping results in the same tree over and over again throughout the 100 data sets (there are only a couple of exceptions). This consistency greatly increases our confidence in a single-run parsomony tree's lack of arbitrariness.

    The partial sequence bootstrapping results in more varied trees within the 100. This might be because of the missing portions of the CYSP_P* sequences.



    Maximum Likelihood Method


  43. Ran dnaml on both sets of input sequences.

    dnaml ran much too slowly producing a single output tree for me to want to risk slowing down the network by jumbling and producing multiple output trees.

  44. Ran dnamlk (Maximum Likelihood method with molecular clock enabled). The results were:



    Distance Methods


  45. Ran dnadist on both sets of input sequences.

  46. Ran each the fitch model on the obtained distance scores.

  47. Ran the kitsch model on the obtained distance scores.

  48. Ran the neighbor model on the obtained distance scores.