WWW Sequence Database Searching Practical
Michael Nilges
Unité de Bio-Informatique Structurale
Institut Pasteur
March 5, 2002
This practical is largely based on a practical by Toby Gibson's
Group at EMBL Heidelberg
In this practical we will run some database search tools available through
the WWW. Examination of the outputs may reveal some differences between
the results, depending on
the type of algorithm or the options used in the sequence comparison.
This shows that it is important to put some thought into the comparison
and not to trust the default values
blindly.
WWW DB search Tools
We will use:
-
SRS6 to extract query sequences. We
will use the SRS server at EBI, Hinxton.
-
SRS6 is also installed at Pasteur,
but for some reason this practical does not work there.
-
See also the link to the list of public SRS servers on the
SRS6 top page at EBI.
-
SRS - Sequence Retrieval System - is a powerful and widely used tool
to retrieve information from sequence and related databases.
-
The EBI BLAST server for BLAST
searches (fast ungapped comparison).
-
Blast servers can be found at many locations.
-
The NCBI psi-BLAST server
for automated profile searches with BLAST.
-
The NCBI server has several useful features, such as a direct link to a
domain database, and the psi-BLAST option.
-
The Bork group's SMART server
at the EMBL to compare a sequence against a protein domain database.
-
Optionally, we may use the Gibson group's Bioccelerators
for Smith-Waterman and profile searches (exhaustive gapped comparison).
-
Optionally, we may use WWWProfileWeight
to make profiles from a sequence alignment, and set up a profile search.
Getting started
The teaching machines are X-terminals and you are connected to the central
computer
at the Institut Pasteur.
-
Login with your user name and password.
-
Start netscape.
-
Load this page into it. You find it at http://www.pasteur.fr/recherche/unites/binfs/courses/practical2002
-
Check that javascript and style sheets are enabled in the
netscape preferences in the advanced options.
Step 1 Choosing an snRNP SM protein as query
SM proteins are found in snRNP complexes. There are quite a number
in Swiss-Prot and they are fairly divergent, so it is difficult (or impossible)
to detect them all in a search with a single query sequence. All SM proteins
share a small globular domain, but many have a C-terminal non-globular
domain too. This will be used to illustrate the problems of searching with
multi-domain proteins.
-
Open a new navigator window and load this page into it.
-
Load SRS6 and Start the session.
-
If you do this by holding down the Alt key, this should automatically open
a new window.
-
Look at the action of the + and - buttons in the library selection.
-
Tick the box for the Swiss-Prot database and click the Standard
query
form button.
-
Set the field selection (by default All Text) to Description.
-
Type snrnp & sm & b in the Description box, then
Do
Query.
-
Click on the entry RSMB_Human.
-
Swiss-Prot entries often have useful hypertext links:
-
What does the PFAM link do?
-
Look at the features: mouse over the diagram. What do you notice?
You now have the sequence of human SM-B protein available in a form that
can be cut and pasted into the DB query forms (at the bottom of the entry).
The sequence numbers, new lines, and blanks do not matter; however,
you should not copy the "//" at the end of the sequence.
The sequence can also be found here.
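If you prefer to tidy the sequence before pasting it, the rule above (numbers, new lines and blanks are harmless, the "//" must go) can be sketched in a few lines of Python. This is only an illustration: the function name `clean_sequence` is hypothetical, the example residues are a made-up toy fragment rather than the real RSMB_Human sequence, and the sketch assumes you copied only the sequence lines, not the annotation lines above them.

```python
def clean_sequence(pasted: str) -> str:
    """Return only the residue letters from a pasted sequence block."""
    residues = []
    for line in pasted.splitlines():
        # "//" marks the end of the entry: stop here and do not copy it.
        if line.strip() == "//":
            break
        # Keep letters only; digits (sequence numbers) and blanks are dropped.
        residues.extend(ch.upper() for ch in line if ch.isalpha())
    return "".join(residues)


# Toy input (invented residues), laid out like a numbered sequence block.
toy_block = """\
     MKTAY IAKQR        10
     QISFV
//
"""
print(clean_sequence(toy_block))
```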
Step 2 BLAST searching with human SM-B protein
BLAST2 is an upgraded version of BLAST, one of the most widely used
database search packages. The BLAST programs find the best matching ungapped
sections in a sequence comparison. The most important modification for
the user to note in BLAST2 is that neighbouring ungapped segments can now
be concatenated by allowing gaps between them. This improves both sensitivity
and interpretation of the results.
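To make "best matching ungapped section" concrete, here is a minimal sketch that scans every diagonal of a two-sequence comparison and reports the highest-scoring ungapped segment. It is not how BLAST is implemented (BLAST uses word seeding, a substitution matrix and statistics); the function name, the simple match/mismatch scores and the toy sequences are all assumptions made for illustration.

```python
def best_ungapped_segment(a: str, b: str, match: int = 2, mismatch: int = -1):
    """Return (score, start_in_a, start_in_b, length) of the best ungapped
    local match, using 0-based start positions."""
    best = (0, 0, 0, 0)
    # One diagonal = one fixed offset between the sequences, so no gaps occur.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run_score, run_start = 0, 0
        k = 0
        while i + k < len(a) and j + k < len(b):
            run_score += match if a[i + k] == b[j + k] else mismatch
            if run_score <= 0:
                # A segment that drops to zero is abandoned; restart after it.
                run_score, run_start = 0, k + 1
            elif run_score > best[0]:
                best = (run_score, i + run_start, j + run_start, k - run_start + 1)
            k += 1
    return best


score, start_a, start_b, length = best_ungapped_segment("AAGCTTCC", "TTGCTTGG")
print(score, "AAGCTTCC"[start_a:start_a + length])  # prints: 8 GCTT
```

Joining several such segments with gaps between them, as BLAST2 does, is what the full dynamic programming methods later in this practical do exhaustively.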
-
Open a new navigator window and load this page into it.
-
Load a BLAST2 query submission
page.
-
It is worth familiarising yourself with the layout and consulting the help pages.
-
Paste in the RSMB_Human sequence from the SRS browser (consult the
help for format).
-
Select the Swiss-Prot database.
-
Set the filter option to none.
-
Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
-
Start the BLAST search: this should take a few minutes at most.
-
Save output to a new file, so that you do not lose it.
-
Examine output, and investigate the detected entries by using the links.
Questions
-
1. How many SM proteins are detected above the first false positive?
-
2. Is there another class of protein that is strongly detected?
-
3. If so, is this biologically meaningful?
-
4. Are the P-values a reliable guide to homology?
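If you saved the hit list, question 1 can be answered mechanically once you have decided which hits are real SM proteins. The sketch below assumes a hit list already sorted from best to worst and a set of identifiers you have judged to be family members; the function name and the identifiers are hypothetical placeholders, not real Swiss-Prot entries.

```python
def hits_above_first_false_positive(ranked_hits, family_members):
    """Count family members ranked above the first hit outside the family."""
    count = 0
    for hit in ranked_hits:
        if hit not in family_members:
            break  # first false positive reached
        count += 1
    return count


# Placeholder identifiers, ordered as they would appear in the output.
ranked = ["SM_1", "SM_2", "SM_3", "OTHER_1", "SM_4"]
known_sm = {"SM_1", "SM_2", "SM_3", "SM_4"}
print(hits_above_first_false_positive(ranked, known_sm))  # prints: 3
```

Note that SM_4 is not counted: it is a true hit, but it ranks below a false positive, which is exactly the situation the question asks you to look for.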
Step 2B BLAST search with SM-B and a filter
Now repeat the search but filter out segments of "reduced sequence complexity".
-
Reload a BLAST2 query submission
page.
-
Paste in the RSMB_Human sequence.
-
Select the Swiss-Prot database.
-
Set the filter option to SEG+XNU.
-
Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
-
Start the BLAST search.
-
Save output to a new file, so that you do not lose it.
-
Examine output, and investigate the detected entries by using the links.
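To get a feeling for what "reduced sequence complexity" means, the sketch below masks windows whose amino acid composition has low Shannon entropy. This is a deliberately simplified stand-in and not the SEG or XNU algorithm; the window length, the entropy cutoff and the function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_bits: float = 2.2) -> str:
    """Replace every residue in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < min_bits:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Toy sequence: a proline run flanked by compositionally diverse stretches.
toy = "ACDEFGHIKLMN" + "P" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(toy))
```

With these settings a few residues flanking the proline run are masked too, because windows that overlap the run are still dominated by it. A masked query can no longer produce hits through its repetitive region, which is relevant to the questions that follow.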
Questions
-
1. How many SM proteins are detected above the first false positive?
-
2. Is there another class of protein that is strongly detected?
-
3. Why are rather few sequences listed?
-
4. How does this setup compare in sensitivity to the unfiltered search?
-
5. Are the P-values a reliable guide to homology?
Step 2C psi-BLAST search with SM-B and a filter
Now repeat the search with a different BLAST server which can set up
and run profile searches automatically.
-
Load a BLAST query
submission page at the NCBI.
-
Click the psi-BLAST button.
-
Paste in the RSMB_Human sequence.
-
Select the nrdb database.
-
Check if the filter option is turned on (this is the default).
-
Check the number of top hit descriptions and alignments to be shown: set
to 100 or so.
-
Check the inclusion threshold for psi-BLAST (should be 0.005).
-
Start the BLAST search.
-
The output is different from the EBI BLAST server.
-
The search is first against a domain database (DART, which is a combination
of Pfam and SMART).
-
Mouse over the diagram, then Alt-click the symbol.
-
Click the FORMAT button and wait...wait...
-
Save the output in a new file.
Questions
-
1. How many SM proteins are detected above the first false positive?
-
2. Is there another class of protein that is strongly detected?
-
3. Why are rather few sequences listed?
-
4. How does this setup compare in sensitivity to the unfiltered search?
-
5. Are the P-values a reliable guide to homology?
Now run iteration 2.
-
Remove the obvious false positive(s), judging from the annotation.
-
Press the iteration 2 button.
-
Press the FORMAT button after a while, and wait.
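The choice you make between iterations can be written down as a simple rule: a hit contributes to the next profile if its E-value passes the inclusion threshold and you have not removed it by hand. The sketch below shows only this selection step, not how psi-BLAST builds or searches with its profile; the function name, the hit names and the E-values are invented for illustration, and whether the server compares with "less than" or "less than or equal" is an assumption here.

```python
def select_for_next_iteration(hits, inclusion_threshold=0.005, removed=()):
    """Return the names of hits that would be used to build the next profile.

    hits    -- iterable of (name, e_value) pairs
    removed -- names unticked by hand as obvious false positives
    """
    return [
        name
        for name, e_value in hits
        if e_value <= inclusion_threshold and name not in removed
    ]


# Invented hit list: names and E-values are placeholders.
iteration_1 = [("hit_a", 1e-30), ("hit_b", 0.001), ("hit_c", 0.004), ("hit_d", 0.2)]
print(select_for_next_iteration(iteration_1, removed={"hit_c"}))
# prints: ['hit_a', 'hit_b']
```

Here hit_d is excluded by the threshold and hit_c by manual inspection. A false positive that slips into the profile can pull in more of its own relatives in the next round, which is why the manual check matters.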
Questions
-
1. What changed from iteration 1?
-
2. Did the false positive appear again?
-
3. Were there any new sequences found?
Step 3. Comparing a sequence to a database of protein
domains
Since profile searches are so sensitive, it would make sense to query
an unknown sequence against a set of profiles for known protein families.
There are several very useful databases of modules that are found in multidomain
proteins, including PFAM
at the Sanger Centre, PROSITE
at ISREC and SMART at EMBL.
They use a form of profile technically described as a "hidden Markov model",
but the end result is very much like the profiles we just ran. We will
search for protein domains in an "unknown" protein using the SMART server.
-
Open a new navigator window and load this page in it.
-
Load the SMART query page.
-
Toggle on PFAM domains (includes more domains).
-
Get the "unknown
sequence" and cut and paste it into SMART's Sequence box.
-
Click on the Sequence SMART Button.
-
The search should take about a minute unless the server is busy.
-
When you get the results, note the domain "bubble" diagram and the table
of matching domains.
Questions
-
Based on your recent experience, would you say the E-value scores are
good?
-
What happens if you click on a domain bubble?
-
Is the domain common?
-
Is there any literature on the domain?
-
Are there structures for any of these domains?
-
Is there any known genetic disease related to this domain?
-
Is this protein likely to be in the nucleus, cytoplasm or extracellular
compartments?
-
Can you say what kind of protein it is?
-
Do you think this protein has especially many or few domains?
-
Try repeating the SMART search with FBN1_HUMAN,
the Marfan Syndrome protein. The sequence can also be found here.
Optional: Step 4A Bic_SW search with human SM-B protein
As an alternative to the BLAST and psi-BLAST searches above, we can do
a full Smith-Waterman search, construct a multiple sequence alignment (e.g.,
with ClustalW), build a sequence profile, and do a full Smith-Waterman
profile search. As you will see, this is much more time-consuming, but
can be more sensitive in some cases. One of the few publicly available
resources to do this is the Bioccelerator at the EMBL. The queue may be
quite busy.
The Bioccelerator is fast dedicated hardware exclusively
designed to speed up dynamic programming (i.e. slow but sensitive) sequence
comparison. It is built by the Israeli company Compugen. It can perform
a number of search permutations including basic Smith-Waterman, profile
searches and Protein v. DNA frame-shifting comparisons. The Smith-Waterman
search finds the best matching segments between any two sequences, allowing
for gaps to be inserted at any position.
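The dynamic programming behind a Smith-Waterman search can be sketched in a few lines. The version below returns only the best local alignment score and uses a simple match/mismatch score with a linear gap penalty; these scoring values and the function name are assumptions for illustration. A real search, such as the one run on the Bioccelerator, uses an amino acid exchange matrix and separate gap opening and extension penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (gaps allowed anywhere)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never goes below zero.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The second sequence has one extra T; bridging it with a gap pays off.
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))  # prints: 14
```

In this toy case the best ungapped segment scores only 10 (five matching residues), whereas the gapped alignment joins both halves for a score of 14. The price is that every cell of the comparison matrix has to be computed, which is why dedicated hardware is used to speed it up.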
-
Open a new navigator window and load this page into it.
-
Load the Bioccelerator
home page.
-
Go to the Bioccelerator Searches page.
-
Select application sw_p.
-
It is worth familiarising yourself with the layout and consulting the help
links.
-
Paste the RSMB_human sequence from the SRS browser into the Query
Sequence box.
-
Select the Swiss-Prot database. (It may already be the default selection
in gcg format).
-
Now Do Search to start the Bioccelerator run.
-
When you get the output, save to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output and compare the results with BLAST2.
Questions
-
1. How are SM proteins distributed in the output?
-
2. What position is the highest false positive?
-
3. Is another class of proteins strongly detected?
-
4. Are the E-values a reliable guide to the SM protein detections?
-
5. Compared to BLAST:
-
(a) Which, if any, is more sensitive?
-
(b) Which output is easier to understand?
Optional: Step 4B Bic-SW search with the SM Domain only
Now repeat the search but use the globular N-terminal domain only.
-
Reload the Bioccelerator
home page.
-
Go to the Bioccelerator Searches page.
-
Select application sw_p.
-
Paste the range 1-82 of RSMB_human into the Bic query form.
-
Select the Swiss-Prot database. (It may already be the default selection).
-
Now Do Search to start the Bioccelerator run.
-
When you get the output, save to a new file, so that you do not lose it.
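Sequence ranges such as 1-82 are counted from 1 and include both end points, which differs from the 0-based, end-exclusive slicing of most programming languages. If you cut out the domain with a script rather than by hand, a small helper avoids off-by-one errors. The function name is hypothetical and the example uses a dummy sequence, not RSMB_human.

```python
def subsequence(seq: str, start: int, end: int) -> str:
    """Return residues start..end, numbered from 1, both ends included."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"range {start}-{end} is outside 1-{len(seq)}")
    return seq[start - 1:end]


dummy = "ABCDEFGHIJ"
print(subsequence(dummy, 1, 3))  # prints: ABC
```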
Questions
-
1. Are more or fewer SM proteins detected?
-
2. Is another class of proteins strongly detected?
-
3. Are the E-values a reliable guide to the SM protein detections?
-
4. Compared to the BLAST filtered search which, if any, is more sensitive?
-
5. Collect a multiple alignment using the buttons in the header:
-
Is this useful to judge the detections?
-
Which entries have incomplete sequence fragments?
Optional: Step 5. Bic_profile search based on an alignment
of SM proteins
Profile searches are one of the most sensitive search tools currently
available. The raw materials for profile searching are a multiple sequence
alignment in conjunction with a residue exchange matrix (e.g. the Gonnet
Pam250 matrix). A profile scores the amino acids at each position in the
alignment: conserved positions score more strongly than unconserved ones
(whereas in a single sequence, they are all equally significant). We can
compare the sensitivity to the searches with a single sequence as query.
The multiple sequence alignment has already been prepared by the
Gibson group. If there is time, you can inspect or re-align it yourself.
Very Optional:
-
Load the alignment SM
domain.aln in a new netscape window.
-
Save the file locally.
-
Start clustalx (by typing clustalx in a terminal window).
-
Load the alignment into clustalx (via the file - load sequences menu).
-
Now you can look at or re-align the sequences.
-
If you have no alignment file to start with, you can create one from the
Bioccelerator page:
-
In the Bioccelerator output, select the sequences you want to include.
-
Choose "create multiple sequence alignment" or "get selected sequences
with SRS".
Optional: Step 5A. Preparing a profile from an SM alignment
-
Open a new navigator window and load this page into it.
-
Load WWW ProfileWeight.
-
Load the alignment
SM domain.aln in a new netscape window.
-
Cut and Paste the alignment into the Paste box.
-
Run ProfileWeight to make the profile.
-
Look at the resulting profile:
-
(a) See how scores for amino acids vary for each position in the alignment.
-
(b) See how the position-specific gap penalties are lowered at existing
gaps.
-
(c) Note the suggested gap penalties in the header: these are only a rough
guide.
-
Save Profile to save the profile to a file (e.g. as Sm.prf)
for use in the profile search.
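The idea behind the profile you have just looked at can be sketched as follows: for every alignment column, each amino acid gets the average exchange score against the residues observed in that column, so conserved columns reward the conserved residue strongly. This is not the ProfileWeight algorithm: the sketch uses a toy exchange function instead of the Gonnet Pam250 matrix, applies no sequence weighting and computes no position-specific gap penalties. All names and the three-sequence alignment are invented for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_exchange(x: str, y: str) -> int:
    """Toy stand-in for a residue exchange matrix: identity scores 2, else -1."""
    return 2 if x == y else -1


def build_profile(alignment):
    """Return one {amino acid: score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        observed = [row[col] for row in alignment if row[col] != "-"]
        scores = {}
        for aa in AMINO_ACIDS:
            total = sum(toy_exchange(aa, res) for res in observed)
            scores[aa] = total / len(observed) if observed else 0.0
        profile.append(scores)
    return profile


def score_ungapped(profile, segment: str) -> float:
    """Score a segment placed against the profile without gaps."""
    return sum(column[aa] for column, aa in zip(profile, segment))


toy_alignment = ["GKT", "GRT", "G-S"]
profile = build_profile(toy_alignment)
print(profile[0]["G"], profile[1]["K"], profile[2]["T"])  # prints: 2.0 0.5 1.0
print(score_ungapped(profile, "GKT"))  # prints: 3.5
```

The fully conserved first column gives G the maximum score of 2.0, while the variable second column gives K only 0.5. In a single-sequence query both positions would count equally.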
Step 5B. BIC_Profilesearch with an SM domain profile prepared
with the Gonnet Pam250 matrix
-
Open a new navigator window and load this page into it.
-
Load the Bioccelerator
home page.
-
Go to the Bioccelerator Searches Page.
-
Select Profilesearch in the Application box.
-
In the Upload a file box, give the full path of the SM
protein profile file.
-
(Alternatively you could cut and paste into the Query Sequence box.)
-
Give Gap opening penalty 1.0 and extension penalty 0.2.
-
Select the Swiss-Prot database.
-
Do Search.
-
Save the output to a new file, so that you do not lose it.
The search will take a couple of minutes (unless the Bic is busy). When
it is finished you can look at the high-score list and alignments in the
output. Use the SRS links to learn about the top hits.
Questions
-
1. How are SM entries distributed in the output?
-
2. Are the E-values a reliable guide to SM protein detections?
-
3. Is the profile search more or less sensitive than the single sequence
queries?
-
4. Collect a multiple alignment using the buttons in the header:
-
Is this useful to judge the detections?
-
Can you see any conserved positions in the alignment?