BioChem 508 Project

Travis Emmitt

Part 1 - Protein Family Membership

Protein Family: Cysteine Proteases

Located the cysteine proteases family in the PFAM database (and others):
- family1.htm - PFAM's entry for cysteine protease family
- family2.htm - Prosite's description of the cysteine protease family
- family3.htm - SwissProt's intro to peptidases and list of cysteine peptidase clans and sub-families
- family4.htm - PFAM's multiple sequence alignment of cysteine protease members
Examined PFAM's multiple alignments and SwissProt's list of family members, and noticed that FASTA and other alignment-related databases use different nomenclature for sequence labeling. (i.e., GenBank and PIR numbers). There would be no easy way to compare FASTA results to the favorites of the PFAM family.
- formats.htm - an example of the different databases' output, to illustrate the nonuniformity of sequence nomenclature
Prepared for Searching
Looked at the sequences in the PFAM multiple alignment (family4.htm) and downloaded to a local directory the corresponding SwissProt database entry for each sequence:
- sp_exmpl.htm - example of a SwissProt database entry for one of the cysteine protease family members
Concatenated all of the downloaded entries into one big page and then stripped out everything except the PFAM sequence name (e.g., CATL_CHICK) and the DR fields (the ones that provide GenBank, PIR, and Prosite numbers). I updated all the URLs to absolute (non-relative) links, so that so that I could directly reference the different database entries for each of the family members.
- members1.htm - complete concatenation of pages (very long)
- members2.htm - stripped version, sans everything except PFAM name and Genbank/PIR numbers
I now had a map of PFAM names to Genbank and PIR numbers, and was able to commence my database searching.
Selected a sequence in my family (CATL_HUMAN) that aligned well with several of the other members and whose description on the ProSite family page (family2.htm) made it sound pretty representative of the family (i.e. fairly safe bet that this sequence would not end up being a borderline family member).
- sp_human.htm - SwissProt page for CATL_HUMAN
Examined its graphical view via Entrez to find out what part of it was being included in the PFAM alignment.
- en_human.gif - graphical overview of CATL_HUMAN
- en_human.htm - zoomed in look at specific regions
Did the same thing for another sequence (CYSP_PLAFA), one that tended to match a different subset of the family
- sp_plafa.htm - SwissProt page for CYSP_PLAFA
- en_plafa.gif - graphical overview of CYSP_PLAFA
- en_plafa.htm - zoomed in look at specific regions
Sequence Similarity Searching
Ran a BLAST search against the GenBank database using CATL_HUMAN's nucleotide sequence.
- blast1.txt - results of BLAST GenBank search on CATL_HUMAN
Checked the search results against my family list using my PFAM-to-PIR map (members2.htm). Some of the time I had to look at the full concatenation (members1.htm) because many sequences' numbers in the search results were not on the SwissProt database pages.
Made a table of family members and the expectation values that I got for them from the BLAST search.
- e_values.txt - family members and their associated expectation values
Ran a FASTA search against the same database (Genbank) with the same sequence (CATL_HUMAN), and updated the expectation values table accordingly. FASTA kept failing when I tried to search all GenBank databases, so I restricted the search to rodents, primates, and other mammals.
- fasta1.txt - results of partial FASTA GenBank search on CATL_HUMAN
Ran another FASTA search, this time using a different sequence, CYSP_PLAFA, against the PIR database.
- fasta2.txt - results of FASTA PIR search on CYSP_PLAFA
Ran another FASTA search on the PIR database, on CATL_HUMAN.
- fasta3.txt - results of FASTA PIR search on CATL_HUMAN
Ran several other searches, trying to find more similarities between the members, but almost all the other close matches I got after that involved matches in the SwissProt database instead of GenBank and PIR.

Distantly Related (or Unrelated) Sequence
CAT3_FASHE (Putative Cthepsin L3) does not appear to be closely related to any of the other members of the cysteine protease family. Its expectation score relative to the other searches have all been worse than the suggested threshold of 10^-4.
- sp_fache.htm - SwissProt entry for Putative Cathepsin L3
Performed a FASTA search on CAT3_FASHE. Its best expectation values were around 10^-5, which is just barely under the 10^-4 threshold. In fact, only three other sequences matched it with expectation values better than the cutoff, and those only slightly (less than an order of magnitude).
- fasta4.txt - results of FASTA search on CAT3_FACHE
Did a BLAST search on CAT3_FASHE and those scores were even worse - the best expectation value next to its own was 0.022, which is horrible.
You can also tell by looking at the PFAM database's cysteine protease multiple alignments that CAT3_FASHE isn't very well aligned with most of the other sequences:
- family4.htm - PFAM's cysteine protease multiple alignments
Tried to run motifs on CAT3_FASHE but it couldn't find any patterns. This is probably because CAT3_FASHE is only 19 characters long.
Noticed by looking at the Entrez graphical views that CAT3_FASHE doesn't have any sites to share with the rest of the sequences.
Ran PRSS on CAT3_FASHE versus CATL_CHICK and got:
```
  Lambda: 0.24963 K: 0.16414; P(95)= 3.8128e-08
  For 500 sequences, a score >=95 is expected 1.91e-05 times
```
This implies that CAT3_FASHE and CATL_CHICK are indeed related, and is consistent with the score that CATL_CHICK got when FASTA was run on CAT3_FASHE.
Found a PubMed article which briefly mentions reason why it and other developmental proteins can be considered part of the cysteine proteases family:
- pm_fashe.htm - PubMed abstract
Part 2 - Multiple Sequence Alignments

Member Selection
Selected 16 "favorite" family members for the next part of the assignment. (I was going to trim down to 10 but got so deep into things that I didn't want to redo all my input and output files). I selected them based on the identity scores given by a recent FASTA search. (i.e. I looked at their identities relative to CYSP_PLAFA)
- favorit1.txt - selected family members and their identities relative to CYSP_PLAFA
- favorit2.htm - same as the PFAM multiple alignments (family4.htm), but includes only my selected favorites.
Complete AA Sequence Alignment
Made a file containing each of the favorites' amino acid sequences, which I obtained from the SwissProt database pages.
- aa1.raw - favorites' sequences before attempting alignments
- aa1.msf - same file, after using ReadSeq to convert it into a more acceptable format
(I later stumbled upon PFAM family PFAM-B_194, which happened to contain 7 of my 16 favorites.)
Performed a multiple sequence alignment with ClustalW. I used two different online versions of ClustalW; the first one we discussed in the class, but the second one indicates conserved positions. They work a little differently when dealing with sequences whose matching sections aren't in close ordinal proximity.
- aa1_cla.out - results using BCM launcher
- aa1_clb.out - results using EEI's engine
- aa1_clb.htm - same thing sans header, ready for printing
Performed the same alignment using PileUp (via SeqLab). I set the gap penalties to -8 (start) and -2 (continue), as suggested in class. I did not enable leading and trailing gap penalties, since for the complete sequences, those tended to be so high.
- aa1_pil.out - alignment results
- aa1_pil.htm - same thing sans header, ready for printing
- aa1_pil.gif - GIF of the generated tree
MSA could only handle 3 of the 16 sequences, since it had a 800 character limit and each of the sequences was at least a couple hundred characters long. I picked three of the smallest sequences (CATL_CHICK, CATS_BOVIN, and and PAP2_CARPA), and aligned them.
- aa1_msa.out - results for 3 complete sequences
- aa1_msa.htm - same thing sans header, ready for printing
Downloaded and compiled treealign. I had to edit the source in order to get it to build (added a #define to rename one of his duplicate function definitions). In order to get it to run, I had to adjust my input file to a new format. I wasn't permitted to set the gap extension penalty to less than 3 (this restriction was explicitly stated in the README file).
- aa1_ta.in - sequences in treealign's input file format
- aa1_ta.out - resulting alignment
- aa1_ta.htm - same thing sans header, ready for printing
- aa1_ta.tree - resulting tree
Partial AA Sequence Alignment
Since the first and last part of the sequence alignments contained so many gaps, and since I wanted to trim these gaps before constructing the phylogenetic trees, I decided to find a portion of the alignments that was consistently matched and then just use that.
In the ClustalW alignment, the first position by which all of the sequences had started was position 114 of CATL_HUMAN. The last position in which each all of the sequences were present was position 333 in CATL_HUMAN.
- aa1_clb.htm - ClustalW (EEI) results, with "unanimously present" subset bolded.
I looked back at the original PFAM alignments (family4.htm) and saw that the span of CATL_HUMAN in that alignments was from positions 114 to 332. I got similar results for the other sequences.
I therefore constructed another sequences file, this time containing only those positions which appeared in the PFAM alignment (family4.htm). I called this set of sequences the "partial sequences," as opposed to the complete sequences. I wanted to see how my tree- based on alignment of partial sequences differed from trees based on alignments of complete sequences.
- aa2.raw - initial partial sequences
- aa2.msf - partial sequences after running ReadSeq
Ran ClustalW on the partial sequences and got the following results.
- aa2_cla.out - results using BCM launcher
- aa2_clb.out - results using EEI's engine
Notice how the PFAM alignments cut off a significant portion of the end of the CYSP_P* sequences. This eliminated section includes the "extra" middle portion starting at around position 500 of CYSP_PLAFA. Because of this descrepancy, I decided to do trees of both complete (slow, but seemingly more accurate) and partial (fast, but missing data for CYSP_P*) sequences.
Ran PileUp on the partial sequences, and got similar results.
- aa2_pil.out - results on partial sequences
- aa2_pil.gif - GIF of the new generated tree
Ran MSA on the partial sequences. This time, since the sequences were smaller, it could handle 4 of them: CYSP_PLAFA, CYSP_PLAVI, CYSP_PLAVN, and PAPA_CARPA.
- aa2_msa.out - results for 4 partial sequences
Ran treealign on the partial sequences. This ran a lot faster than the complete sequences.
- aa2_ta.in - input file
- aa2_ta.out - alignment
- aa2_ta.tree - tree
DNA Sequence Alignment
Repeated the above steps, this time using nucleotides instead of amino acids. I used BackTranslate (from the command line) to convert each of the amino acid sequences into nucleotide sequences (I used this instead of the pre-written database translations because there were multiple database back-translations and by doing it myself, I could be consistent, though maybe not historically accurate).
Each time I ran BackTranslate, I selected the most probable sequence, and used the default codon frequency file (GenRunData:ecohigh.cod). I used Reformat to prepare my sequences before the back translation, and CompressText to remove the spaces and trailing blanks afterwards.
- dna1.raw - initial complete DNA sequences
- dna1.msf - after ReadSeq
- dna2.raw - initial partial DNA sequences
- dna2.msf - after ReadSeq
Ran ClustalW on both sets of DNA sequences:
- dna1_cla.out - using BCM (didn't work)
- dna1_clb.out - using EEI
- dna2_cla.out - using BCM
- dna2_clb.out - using EEI
Ran PileUp on both sets of DNA sequences. I enabled leading and trailing gap penalties for only the partial sequences, since the complete sequences had so many.
- dna1_pil.out - complete DNA sequences
- dna1_pil.gif - GIF of tree
- dna2_pil2.out - partial DNA sequences
- dna2_pil.gif - GIF of tree
Could not run MSA on any of the DNA sequences, since each was at least 400 characters long and there would be nothing to compare it against. Not even the partial DNA sequences were small enough.
Ran treealign on the DNA sequences. This only worked for partial sequences - the program ran out of memory when trying to process the complete sequences.
- dna2_ta.in - input (partial sequences only)
- dna2_ta.out - alignment
- dna2_ta.tree - tree
Part 3 - Phylogenetic Tree Construction

Choosing an Alignment
Of all the alignment methods, the best looking results were with ClustalW. PileUp had a few more gaps (especially annoying was the gap right after the start of the well-aligned middle section); MSA could only handle three sequences, so wasn't much use as far as building trees; and treealign had a large number of gaps and coudln't handled the complete DNA sequences (which I'll get to later...)

File Format Conversions
Converted ClustalW AA sequence alignments (complete and partial) into Pearson/FASTA format using ReadSeq:
- aa1_clb.fasta - complete sequence alignments in FASTA format
- aa2_clb.fasta - partial sequence alignments in FASTA format
Converted the FASTA AA sequence alignments into DNA sequence alignments using mrtrans. I used dna1.msf and dna2.msf for my cDNA libraries. I had to reorder the cDNA library so that its sequences were in the same order as the AA alignment's sequences.
- dna1.fasta - complete DNA alignments, still in FASTA format
- dna2.fasta - partial DNA sequences...
Converted the FASTA DNA alignments into Phylip DNA alignments using ReadSeq.
- dna1.in - complete DNA sequences in Phylip format
- dna2.in - partial DNA sequences...
Parsimony Method
Ran dnapars on both sets of input sequences, first as a single run, and then bootstrapped (using 100 trees). I jumbled the input 10 times for each data set. I kept my trees rootless, so did not specify an outgroup.
- par1_one.dsc - single run - tree description file (complete DNA sequences)
- par1_one.out - single run - program output
- par1_boot.dsc - bootstrapped - tree description file
- par1_boot.out - bootstrapped - program output
- par1_cons.dsc - consensus - tree description file
- par1_cons.out - consensus - program output
- par2_one.dsc - single run - tree description file (partial DNA sequences)
- par2_one.out - single run - program output
- par2_boot.dsc - bootstrapped - tree description file
- par2_boot.out - bootstrapped - program output
- par2_cons.dsc - consensus - tree description file
- par2_cons.out - consensus - program output
Notice that the complete sequence bootstrapping results in the same tree over and over again throughout the 100 data sets (there are only a couple of exceptions). This consistency greatly increases our confidence in a single-run parsomony tree's lack of arbitrariness.
The partial sequence bootstrapping results in more varied trees within the 100. This might be because of the missing portions of the CYSP_P* sequences.

Maximum Likelihood Method
Ran dnaml on both sets of input sequences.
- ml1_one.dsc - single run - tree description file (complete DNA sequences)
- ml1_one.out - single run - program output
- ml2_one.dsc - single run - tree description file (partial DNA sequences)
- ml2_one.out - single run - program output
dnaml ran much too slowly producing a single output tree for me to want to risk slowing down the network by jumbling and producing multiple output trees.
Ran dnamlk (Maximum Likelihood method with molecular clock enabled). The results were:
- mlk1_one.dsc - single run - tree description file (complete DNA sequences)
- mlk1_one.out - - single run - program output
- mlk2_one.dsc - single run - tree description file (partial DNA sequences)
- mlk2_one.out - single run - program output
Distance Methods
Ran dnadist on both sets of input sequences.
- dis1_one.tmp - single run - distance scores (complete DNA sequences)
- dis2_one.tmp - single run - distance scores (partial DNA sequences)
Ran each the fitch model on the obtained distance scores.
- fit1_one.dsc - - single run - tree description file (complete DNA sequences, fitch model)
- fit1_one.out - - single run - output
- fit1_boot.dsc - - bootstrapped - tree description file
- fit1_boot.out - - bootstrapped - output
- fit1_cons.dsc - - consensus - tree description file
- fit1_cons.out - - consensus - output
- fit2_one.dsc - - single run - tree description file (partial DNA sequences, fitch model)
- fit2_one.out - - single run - output
- fit2_boot.dsc - - bootstrapped - tree description file
- fit2_boot.out - - bootstrapped - output
- fit2_cons.dsc - - consensus - tree description file
- fit2_cons.out - - consensus - output
Ran the kitsch model on the obtained distance scores.
- kit1_one.dsc - - single run - tree description file (complete DNA sequences, kitsch model)
- kit1_one.out - - single run - output
- kit1_boot.dsc - - bootstrapped - tree description file
- kit1_boot.out - - bootstrapped - output
- kit1_cons.dsc - - consensus - tree description file
- kit1_cons.out - - consensus - output
- kit2_one.dsc - - single run - tree description file (partial DNA sequences, kitsch model)
- kit2_one.out - - single run - output
- kit2_boot.dsc - - bootstrapped - tree description file
- kit2_boot.out - - bootstrapped - output
- kit2_cons.dsc - - consensus - tree description file
- kit2_cons.out - - consensus - output
Ran the neighbor model on the obtained distance scores.
- nei1_one.dsc - - single run - tree description file (complete DNA sequences, neighbor model)
- nei1_one.out - - single run - output
- nei1_boot.dsc - - bootstrapped - tree description file
- nei1_boot.out - - bootstrapped - output
- nei1_cons.dsc - - consensus - tree description file
- nei1_cons.out - - consensus - output
- nei2_one.dsc - - single run - tree description file (partial DNA sequences, neighbor model)
- nei2_one.out - - single run - output
- nei2_boot.dsc - - bootstrapped - tree description file
- nei2_boot.out - - bootstrapped - output
- nei2_cons.dsc - - consensus - tree description file
- nei2_cons.out - - consensus - output

BioChem 508 Project Travis Emmitt

Part 1 - Protein Family Membership

Protein Family: Cysteine Proteases

Prepared for Searching

Sequence Similarity Searching

Distantly Related (or Unrelated) Sequence

Part 2 - Multiple Sequence Alignments

Member Selection

Complete AA Sequence Alignment

Partial AA Sequence Alignment

DNA Sequence Alignment

Part 3 - Phylogenetic Tree Construction

Choosing an Alignment

File Format Conversions

Parsimony Method

Maximum Likelihood Method

Distance Methods

BioChem 508 Project

Travis Emmitt