- Located the cysteine proteases family in the
PFAM database (and others):
- family1.htm
- PFAM's entry for cysteine protease family
- family2.htm
- Prosite's description of the cysteine
protease family
- family3.htm
- SwissProt's intro to peptidases
and list of cysteine peptidase clans and
sub-families
- family4.htm
- PFAM's multiple sequence alignment of
cysteine protease members
- Examined PFAM's multiple alignments and
SwissProt's list of family members, and
noticed that FASTA and other alignment-related
databases use different nomenclature for sequence
labeling. (i.e., GenBank and PIR numbers).
There would be no easy way to compare FASTA
results to the favorites of the PFAM family.
- formats.htm
- an example of the different databases' output,
to illustrate the nonuniformity of sequence
nomenclature
Prepared for Searching
- Looked at the sequences in the PFAM multiple
alignment (family4.htm)
and downloaded to a local directory the corresponding
SwissProt database entry for each sequence:
- sp_exmpl.htm
- example of a SwissProt database entry
for one of the cysteine protease family members
- Concatenated all of the downloaded entries into one big
page and then stripped out everything except the
PFAM sequence name (e.g., CATL_CHICK) and
the DR fields (the ones that provide GenBank,
PIR, and Prosite numbers). I updated
all the URLs to absolute (non-relative) links, so that
so that I could directly reference the different
database entries for each of the family members.
- members1.htm
- complete concatenation of pages (very long)
- members2.htm
- stripped version, sans everything except
PFAM name and Genbank/PIR numbers
I now had a map of PFAM names to Genbank
and PIR numbers, and was able to commence my
database searching.
- Selected a sequence in my family (CATL_HUMAN) that
aligned well with several of the other members
and whose description on the ProSite family page
(family2.htm)
made it sound pretty representative of the family
(i.e. fairly safe bet that this sequence would not
end up being a borderline family member).
Examined its graphical
view via Entrez to find out what part of it
was being included in the PFAM alignment.
- Did the same thing for another sequence (CYSP_PLAFA),
one that tended to match a different subset of the family
Sequence Similarity Searching
- Ran a BLAST search against the GenBank database
using CATL_HUMAN's nucleotide sequence.
- blast1.txt
- results of BLAST GenBank search on
CATL_HUMAN
- Checked the search results against my family list
using my PFAM-to-PIR map
(members2.htm).
Some of the time I had to look at the full concatenation
(members1.htm) because many
sequences' numbers in the search results were not
on the SwissProt database pages.
- Made a table of family members and the expectation
values that I got for them from the BLAST
search.
- e_values.txt
- family members and their associated expectation
values
- Ran a FASTA search against the same database
(Genbank) with the same sequence
(CATL_HUMAN), and updated the expectation values
table accordingly. FASTA kept failing
when I tried to search all GenBank
databases, so I restricted the search to rodents,
primates, and other mammals.
- fasta1.txt
- results of partial FASTA GenBank search on
CATL_HUMAN
- Ran another FASTA search, this time using
a different sequence, CYSP_PLAFA, against the
PIR database.
- fasta2.txt
- results of FASTA PIR search on
CYSP_PLAFA
- Ran another FASTA search on the
PIR database, on CATL_HUMAN.
- fasta3.txt
- results of FASTA PIR search on
CATL_HUMAN
- Ran several other searches, trying to find
more similarities between the members, but
almost all the other close matches I got
after that involved matches in the SwissProt
database instead of GenBank and
PIR.
Distantly Related (or Unrelated) Sequence
- CAT3_FASHE (Putative Cthepsin L3) does not appear
to be closely related to any of the
other members of the cysteine protease family.
Its expectation score relative to the other searches
have all been worse than the suggested threshold
of 10^-4.
- Performed a FASTA
search on CAT3_FASHE. Its best expectation values were
around 10^-5, which is just barely under the 10^-4
threshold. In fact, only three other sequences matched it
with expectation values better than the cutoff, and those
only slightly (less than an order of magnitude).
- fasta4.txt
- results of FASTA search on
CAT3_FACHE
- Did a BLAST search on CAT3_FASHE and
those scores were even worse - the best expectation
value next to its own was 0.022, which is horrible.
- You can also tell by looking at the PFAM
database's cysteine protease multiple alignments
that CAT3_FASHE isn't very well aligned with most
of the other sequences:
- family4.htm
- PFAM's cysteine protease multiple alignments
- Tried to run motifs on CAT3_FASHE but it couldn't
find any patterns. This is probably because CAT3_FASHE
is only 19 characters long.
- Noticed by looking at the Entrez graphical
views that CAT3_FASHE doesn't have any sites to share
with the rest of the sequences.
- Ran PRSS on CAT3_FASHE versus CATL_CHICK and got:
Lambda: 0.24963 K: 0.16414; P(95)= 3.8128e-08
For 500 sequences, a score >=95 is expected 1.91e-05 times
This implies that CAT3_FASHE and CATL_CHICK are indeed
related, and is consistent with the score that CATL_CHICK
got when FASTA was run on CAT3_FASHE.
- Found a PubMed article which briefly mentions
reason why it and other developmental proteins
can be considered part of the
cysteine proteases family:
Part 2 - Multiple Sequence Alignments
Member Selection
- Selected 16 "favorite" family members for the next part of
the assignment. (I was going to trim down to 10 but
got so deep into things that I didn't want to redo all my
input and output files).
I selected them based on the identity scores given by a
recent FASTA search. (i.e. I looked at their
identities relative to CYSP_PLAFA)
- favorit1.txt
- selected family members and their identities
relative to CYSP_PLAFA
- favorit2.htm
- same as the PFAM multiple alignments
(family4.htm),
but includes only my selected favorites.
Complete AA Sequence Alignment
- Made a file containing each of the favorites' amino
acid sequences, which I obtained from the
SwissProt database pages.
- aa1.raw - favorites'
sequences before attempting alignments
- aa1.msf - same file,
after using ReadSeq to convert it into a
more acceptable format
(I later stumbled upon PFAM family
PFAM-B_194,
which happened to contain 7 of my
16 favorites.)
- Performed a multiple sequence alignment with
ClustalW. I used
two different online versions of ClustalW; the first
one we discussed in the class, but the second one indicates
conserved positions. They work a little differently when
dealing with sequences whose matching sections aren't in
close ordinal proximity.
- Performed the same alignment using PileUp (via SeqLab).
I set the gap penalties to -8 (start) and -2 (continue), as
suggested in class. I did not enable leading and trailing
gap penalties, since for the complete sequences, those
tended to be so high.
- MSA could only handle 3 of the 16 sequences, since it
had a 800 character limit and each of the sequences
was at least a couple hundred characters long. I
picked three of the
smallest sequences (CATL_CHICK, CATS_BOVIN, and
and PAP2_CARPA), and aligned them.
- Downloaded and compiled treealign.
I had to edit the
source in order to get it to build (added a #define
to rename one of his duplicate function definitions).
In order to get it to run, I had to adjust my input
file to a new format.
I wasn't permitted to set the gap extension penalty to
less than 3 (this restriction was explicitly
stated in the README file).
Partial AA Sequence Alignment
- Since the first and last part of the sequence alignments
contained so many gaps, and since I wanted to trim
these gaps before constructing the phylogenetic trees,
I decided to find a portion of
the alignments that was consistently matched and then
just use that.
In the ClustalW alignment, the
first position by which all of the sequences had started
was position 114 of CATL_HUMAN. The last position in
which each all of the sequences were present was position
333 in CATL_HUMAN.
- aa1_clb.htm
- ClustalW (EEI)
results, with "unanimously present" subset bolded.
I looked back at the original PFAM alignments
(family4.htm)
and saw that the span of CATL_HUMAN in that alignments
was from positions 114 to 332. I got similar results
for the other sequences.
I therefore constructed another sequences file, this
time containing only those positions which appeared in
the PFAM alignment
(family4.htm). I called
this set of sequences the "partial sequences," as opposed
to the complete sequences. I wanted to see how my tree-
based on alignment of partial sequences differed from
trees based on alignments of complete sequences.
- aa2.raw - initial
partial sequences
- aa2.msf -
partial sequences after running ReadSeq
- Ran ClustalW on the partial sequences and got
the following results.
Notice how the PFAM alignments cut off a significant
portion of the end of the CYSP_P* sequences. This
eliminated section includes the "extra" middle portion
starting at around position 500 of CYSP_PLAFA. Because of
this descrepancy, I decided to do trees of both complete
(slow, but seemingly more accurate) and partial (fast, but
missing data for CYSP_P*) sequences.
- Ran PileUp on the partial sequences, and got
similar results.
- Ran MSA on the partial sequences. This time,
since the sequences were
smaller, it could handle 4 of them: CYSP_PLAFA,
CYSP_PLAVI, CYSP_PLAVN, and PAPA_CARPA.
- Ran treealign on the partial sequences. This ran
a lot faster than the complete sequences.
DNA Sequence Alignment
- Repeated the above steps, this time using nucleotides
instead of amino acids. I used BackTranslate
(from the
command line) to convert each of the amino acid
sequences into nucleotide sequences (I used this instead
of the pre-written database translations because there
were multiple database back-translations and by doing it
myself, I could be consistent, though maybe not
historically accurate).
Each time I ran BackTranslate, I selected the
most probable sequence, and used the default codon frequency file
(GenRunData:ecohigh.cod).
I used Reformat to prepare my sequences before the
back translation, and CompressText to remove the spaces
and trailing blanks afterwards.
- Ran ClustalW on both sets of DNA sequences:
- Ran PileUp on both sets of DNA sequences.
I enabled leading and trailing gap penalties for only the
partial sequences, since the complete sequences had
so many.
- Could not run MSA on any of the DNA sequences, since
each was at least 400 characters long and there would
be nothing to compare it against. Not even the partial
DNA sequences were small enough.
- Ran treealign on the DNA sequences. This only worked
for partial sequences - the program ran out of memory
when trying to process the complete sequences.
Part 3 - Phylogenetic Tree Construction
Choosing an Alignment
- Of all the alignment methods, the best looking results were
with ClustalW. PileUp had a few more gaps
(especially annoying was the gap right after the start of
the well-aligned middle section); MSA could
only handle three sequences, so wasn't much use as far
as building trees; and treealign had a large
number of gaps and coudln't handled the complete DNA
sequences (which I'll get to later...)
File Format Conversions
- Converted ClustalW AA sequence alignments (complete
and partial) into Pearson/FASTA format using
ReadSeq:
- Converted the FASTA AA sequence alignments into DNA sequence
alignments using mrtrans. I used
dna1.msf and
dna2.msf
for my cDNA libraries. I had to reorder the cDNA library
so that its sequences were in the same order as the
AA alignment's sequences.
- Converted the FASTA DNA alignments into
Phylip DNA alignments using ReadSeq.
- dna1.in
- complete DNA sequences in Phylip format
- dna2.in
- partial DNA sequences...
Parsimony Method
- Ran dnapars on both sets of input
sequences, first as a single run, and then
bootstrapped (using 100 trees). I jumbled
the input 10 times for each data set.
I kept my trees rootless,
so did not specify an outgroup.
Notice that the complete sequence bootstrapping results
in the same tree over and over again throughout the
100 data sets (there are only a couple of exceptions).
This consistency greatly increases our
confidence in a single-run parsomony tree's lack of
arbitrariness.
The partial sequence bootstrapping results in more
varied trees within the 100. This might be because of the
missing portions of the CYSP_P* sequences.
Maximum Likelihood Method
- Ran dnaml on both sets
of input sequences.
- ml1_one.dsc
- single run - tree description file (complete DNA sequences)
- ml1_one.out
- single run - program output
- ml2_one.dsc
- single run - tree description file (partial DNA sequences)
- ml2_one.out
- single run - program output
dnaml ran much too slowly producing a single output
tree for me to want to risk slowing down the
network by jumbling and producing multiple output
trees.
- Ran dnamlk (Maximum Likelihood method
with molecular clock enabled). The results were:
Distance Methods
- Ran dnadist on both sets of input
sequences.
- dis1_one.tmp
- single run - distance scores (complete DNA sequences)
- dis2_one.tmp
- single run - distance scores (partial DNA sequences)
- Ran each the fitch model on the obtained distance scores.
- Ran the kitsch model on the obtained distance scores.
- Ran the neighbor model on the obtained distance scores.