WWW Sequence Database Searching Practical
Michael Nilges
Unité de Bio-Informatique Structurale
Institut Pasteur
February 12, 2004
This practical is largely based on a practical by Toby Gibson's Group at EMBL Heidelberg

In this practical we will run some database search tools available through the WWW. Examining the outputs may reveal differences between the results, depending on the type of algorithm or the options used in the sequence comparison. This shows that it is important to put a little thought into the comparison and not to trust the default values blindly.


WWW DB search Tools

We will use:


Getting started

The teaching machines are X-terminals, and you are connected to the central computer at the Institut Pasteur.


Step 1 Choosing an snRNP SM protein as query

SM proteins are found in snRNP complexes. There are quite a number in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible) to detect them all in a search with a single query sequence. All SM proteins share a small globular domain, but many have a C-terminal non-globular domain too. This will be used to illustrate the problems of searching with multi-domain proteins.

You now have the sequence of human SM-B protein available in a form that can be cut and pasted into the DB query forms (at the bottom of the entry).
The sequence numbers, new lines, and blanks do not matter; however, you should not copy the "//" at the end of the sequence.
The sequence can also be found here.
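If you prefer to tidy the sequence before pasting it, the rule above (ignore numbers, line breaks and blanks, and stop before the "//") can be sketched in a few lines of Python. The function name `clean_sequence` and the example text are made up for illustration; the example is an artificial string, not the real SM-B sequence.

```python
def clean_sequence(raw: str) -> str:
    """Keep only the residue letters of a pasted sequence block.

    Digits, blanks and line breaks are dropped, and reading stops at the
    "//" line that terminates a database entry.
    """
    residues = []
    for line in raw.splitlines():
        if line.strip() == "//":
            break  # end-of-entry marker: nothing after it belongs to the sequence
        residues.extend(ch.upper() for ch in line if ch.isalpha())
    return "".join(residues)


# Artificial example block (NOT the SM-B sequence).
example = """
     ACDEF GHIKL   10
     mnpqr         15
//
"""
print(clean_sequence(example))  # ACDEFGHIKLMNPQR
```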


Step 2 BLAST search with human SM-B protein

BLAST2 is an upgraded version of BLAST, one of the most widely used database search packages. The BLAST programs find the best matching ungapped sections in a sequence comparison. The most important modification for the user to note in BLAST2 is that neighbouring ungapped segments can now be concatenated by allowing gaps between them. This improves both sensitivity and interpretation of the results.
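To make the idea of a "best matching ungapped section" concrete, here is a minimal Python sketch that scans every diagonal of the comparison and keeps the highest-scoring run without gaps. It is only an illustration: the real BLAST programs use word seeding, extension heuristics and a residue exchange matrix, whereas this sketch assumes a simple match/mismatch score, and the function name and toy sequences are hypothetical.

```python
def best_ungapped_segment(a: str, b: str, match: int = 2, mismatch: int = -1):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped local match."""
    best = (0, 0, 0, 0)
    # A diagonal is the set of cell pairs (i, j) with a fixed offset j - i.
    for offset in range(-(len(a) - 1), len(b)):
        i = max(0, -offset)
        j = i + offset
        running = 0   # score of the segment currently being extended
        start = i     # where that segment starts in a
        while i < len(a) and j < len(b):
            running += match if a[i] == b[j] else mismatch
            if running <= 0:
                # The segment is no longer worth extending: restart after this position.
                running = 0
                start = i + 1
            elif running > best[0]:
                best = (running, start, start + offset, i - start + 1)
            i += 1
            j += 1
    return best


# Toy sequences sharing the ungapped stretch "MKVLAA".
score, start_a, start_b, length = best_ungapped_segment("GGMKVLAAPP", "TMKVLAAS")
print(score, "GGMKVLAAPP"[start_a:start_a + length])
```

A gapped search, as in BLAST2, would additionally try to join several such segments that lie on neighbouring diagonals.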

Questions

Step 2B BLAST search with SM-B and a filter

Now repeat the search but filter out segments of "reduced sequence complexity".
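What "reduced sequence complexity" means can be illustrated with a small Python sketch that masks windows dominated by only a few residue types. This is not the filter the BLAST servers actually run; it is a simplified stand-in based on Shannon entropy, and the window length of 12 and the threshold of 2.2 bits are arbitrary assumptions chosen for the example.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2) -> str:
    """Replace every residue inside a low-entropy window by 'X'."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Artificial example: a repetitive proline/glycine stretch between two varied stretches.
toy = "ACDEFGHIKLMN" + "PPGPPGPPGPPG" + "QRSTVWYACDEF"
print(mask_low_complexity(toy))
```

Masked residues no longer contribute to matches, so hits that rely only on a repetitive, compositionally biased region drop out of the result list.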

Questions


Step 2C PSI-BLAST search with SM-B and a filter

Now repeat the search with a different BLAST server which can set up and run profile searches automatically.

Questions


Now run iteration 2.


Questions


Step 3. Comparing a sequence to a database of protein domains

Since profile searches are so sensitive, it would make sense to query an unknown sequence against a set of profiles for known protein families. There are several very useful databases of modules that are found in multidomain proteins, including PFAM at the Sanger Centre, PROSITE at ISREC and SMART at EMBL. They use a form of profile technically described as a "hidden Markov model", but the end result is very much like the profiles we just ran. We will search for protein domains in an "unknown" protein using the SMART server.

Questions

Optional: Step 4A Bic_SW search with human SM-B protein

As an alternative to the BLAST and PSI-BLAST searches above, we can do a full Smith-Waterman search, construct a multiple sequence alignment (e.g., with ClustalW), build a sequence profile, and do a full Smith-Waterman profile search. As you will see, this is much more time-consuming, but it can be more sensitive in some cases. One of the few publicly available resources for this is the Bioccelerator at the EMBL. The queue may be quite busy.

The Bioccelerator is fast, dedicated hardware designed exclusively to speed up dynamic programming (i.e. slow but sensitive) sequence comparison. It is built by the Israeli company Compugen. It can perform several kinds of search, including basic Smith-Waterman, profile searches and protein vs. DNA frame-shifting comparisons. The Smith-Waterman search finds the best matching segments between any two sequences, allowing gaps to be inserted at any position.
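For readers who want to see what the dynamic programming step does, here is a minimal Smith-Waterman sketch in Python. It is only an illustration under simplifying assumptions: a flat match/mismatch score instead of a residue exchange matrix, a linear gap penalty, and toy sequences invented for the example. It says nothing about how the Bioccelerator implements the search.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix: each cell holds the best score of an alignment ending there.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = smith_waterman("MKVLAAGHW", "MKVLGHW")
print(score)
print(top)
print(bottom)
```

Every cell of the matrix is computed, so the running time grows with the product of the two sequence lengths. That cost, repeated for every database entry, is why dedicated hardware is attractive for this kind of search.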

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output and compare the results with BLAST2.

Questions

Optional: Step 4B Bic_SW search with the SM domain only

Now repeat the search but use the globular N-terminal domain only.

Questions

Optional: Step 5. Bic_profilesearch based on an alignment of SM proteins

Profile searches are one of the most sensitive search tools currently available. The raw materials for profile searching are a multiple sequence alignment in conjunction with a residue exchange matrix (e.g. the Gonnet Pam250 matrix). A profile scores the amino acids at each position in the alignment: conserved positions score more strongly than unconserved ones (whereas in a single sequence, they are all equally significant). We can compare the sensitivity to the searches with a single sequence as query.
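The following Python sketch shows why conserved alignment columns weigh more heavily in a profile. Note the swapped-in technique: instead of combining the alignment with a residue exchange matrix such as Gonnet Pam250, it derives log-odds scores from column frequencies with pseudocounts against a uniform background. The toy alignment, the pseudocount of 1 and all function names are assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount: float = 1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    profile = []
    for column in zip(*alignment):
        # Gap characters and unknown symbols are simply not counted.
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment: str) -> float:
    """Sum the position-specific scores of an ungapped segment against the profile."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, segment))


# Toy alignment: column 0 is fully conserved (G), column 3 is variable (L/I/M).
toy_alignment = ["GKVLA", "GRVIA", "GKVMS", "GQVLT"]
profile = build_profile(toy_alignment)
print(round(profile[0]["G"], 2), round(profile[3]["L"], 2))
print(score_segment(profile, "GKVLA") > score_segment(profile, "WWWWW"))
```

The conserved glycine column gives a higher score for G than the variable column gives for L, which is the position-specific weighting that a single query sequence cannot provide.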

The multiple sequence alignment has already been prepared by the Gibson group. If there is time, you can

Very Optional:

Optional: Step 5A. Preparing a profile from an SM alignment

Step 5B. BIC_Profilesearch with an SM domain profile prepared with the Gonnet Pam250 matrix

The search will take a couple of minutes (unless the Bic is busy). When it is finished you can look at the high-score list and alignments in the output. Use the SRS links to learn about the top hits.

Questions