17.5. Searching for patterns.

17.5.1. Introduction to regular expressions

Regular expression is a term used in computer science to refer to the language that is used to describe patterns to be searched for in a text. The term comes from the computing languages theory where regular expressions are used to denote regular languages. In turn, regular languages are defined as the languages that can be recognized by a finite-state automaton (a topic that will be introduced in the algorithmic course).

The aim of a pattern is to define not only one word to be searched for, but a set of words. This definition is provided in a given language, depending on the system you are working with; the set of corresponding words also depends on the system.

In the Unix shell, for instance:

ls s*
means list all the files beginning by 's', and potentially followed by anything. The command:
ls s[ie]n*
means list all the files beginning by 's', followed by either 'i' or 'e', and followed by anything, including nothing. So, in the context of the ls function within a Unix shell, the set of words is defined as the files existing on the filesytem (as opposed, for instance, to the files stored on another computer, not being made available through a distributed filesystem).

In the Prosite database, patterns describing protein domains are described, such as:

which represent the following set of amino-acid sequences: sequences begining by 'HCH', followed by 3 positions containing any amino-acid letter, followed by 'H', followed again by 3 free positions, followed by either 'A' or 'G', followed by either 'L' or 'M'. As you can notice, the language to define patterns is different from the Unix shell, as well as the set of corresponding words. Here, they are just sequences of amino-acids letters.

In the grep Unix command, a command to search for patterns in files, although similar to the shell pattern syntax, there is a slight difference. Say that a file contains:

the following command:
grep 's*' file
will return ... all the lines, because 's*' means all the words beginning by 0 or any number of 's'. In the grep command, the set of words is composed of the lines of the file. So the term "set of words" must be understood in a broad sense. It must be understood also that the set is not actually generated, of course: it can be infinite! Instead, an operational representation of this set is built, through a finite-state automaton.

In SQL languages, you can also provide patterns:

select clone_name from CLONE where clone_id like '[^A]%[02468]'
means that you restrict the query to the clones whose identifier does not begin with a 'A', is followed by any character N times, and ends with an even number (Sybase).

While the set of words corresponding to a pattern is described by the given expression and the semantics of the system being used, the set of found words, also called occurrences, depends on data. So, occurrences are the words from the set of words that were actually found within data. Figure 17.5 summarizes the concepts.

Figure 17.5. Pattern searching

17.5.2. Regular expressions in Python

A detailed presentation of Python regular expressions is available here: Regular Expression HOWTO. To get information about the re module, see pydoc, but also the sre module (Support for regular expressions), for which re is a wrapper.

In Python, regular expressions are handled by a module: re:

>>> import re
Before searching for a pattern, you must first compile it (this builds the "set of words", or rather the representation of this set):
>>> expression = '[AP]{1,2}D'
>>> pattern = re.compile(expression)
pattern is a pattern object. You then issue a search, for instance in the small sequence seq, by a request to the pattern object:
>>> seq = "RPAD"
>>> match = pattern.search(seq)
This establishes the matching, or correspondances, between the set of possible words and data. match is called a match object. To get the occurrences, you can ask the match object for the start and end of the match in the searched text:
>>> print match.start(), match.end(), seq[match.start():match.end()]
1 4 PAD
or the group:
>>> match.group(0)
Figure 17.6 summarizes this system.

Figure 17.6. Python regular expressions

Example 17.6. Searching for the occurrence of PS00079 and PS00080 Prosite patterns in the Human Ferroxidase protein

import sys
import re
from Bio.SwissProt import SProt

sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq

PS00079 = 'G.[FYW].[LIVMFYW].[CST].{8,8}G[LM]...[LIVMFYW]'                (1)
pattern = re.compile(PS00079)                                             (2)
match = pattern.search(seq.tostring())                                    (3)
print PS00079
print match.start(), match.end(), seq[match.start():match.end()]          (4)

The regular expression is stored in a string.


The regular expression is compiled in a pattern.


The compiled pattern is searched in the sequence.


The result of the search is printed.

There are several methods to search: search and match, the difference being that match looks for a match at the beginning of the string. So, back to the example, the following statement:

match = pattern.match(seq.tostring())
would return a positive result only if the sequence begins by an occurrence of PS00079.

A convenient feature enables to associate a name to sub-parts of the matched text:

import sys
import re
from Bio.SwissProt import SProt

sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq

PS00080 = '(?P<copper3>H)CH...H...[AG](?P<copper1>[LM])'                  (1)
pattern = re.compile(PS00080)
match = pattern.search(seq.tostring())
print PS00080
print match.start(), match.end(), seq[match.start():match.end()]

print 'copper type 3 binding residue: ', match.group('copper3')           (2)
print 'copper type 1 binding residue: ', match.group('copper1')


The regular expression now contains 2 identifiers: copper1 and copper3.


You can print the sub-parts of the result identified by variables: copper1 and copper3.

Shortcuts.  The re module provides shortcuts to directly search for an expression without compiling the pattern into a pattern object:

>>> match = re.search('[AP]{1,2}D', "RPAD")
>>> match.group(0)
You can also directly get the occurrences of a pattern object in a string:
>>> pattern = re.compile('[AP]{1,2}D', )
>>> pattern.findall("RPAD")
or even the occurences of an expression, without compiling the pattern:
>>> re.findall('[AP]{1,2}D', "RPAD")
Figure 17.7 summarizes re module objects and methods to perform pattern searching.

Figure 17.7. Python regular expressions: classes and methods summary

Search modes. 

Text substitutions. 

17.5.3. Prosite

This section presents the Prosite classes in Biopython, which have a common interface with the Python pattern module. Prosite Dictionary

Biopython defines several dictionaries to access biological databases. Having a dictionary means that you can fetch an entry by:

	  entry = prosite['PS00079']
For this to work, you first need to create the dictionary:
	  prosite = Bio.Prosite.ExPASyDictionary()
As you can guess by the name of the module, you actually fetch the Prosite entry on the Web. You could also fetch the Prosite entry from a local database with the golden program (see ???). The entry fetched above is actually a string. In order to have the dictionary return a record, you must rather create it like this:
	  prosite = Bio.Prosite.ExPASyDictionary(parser=Bio.Prosite.RecordParser()) Prosite patterns

The Bio.Prosite package defines a Pattern class that enables to create patterns which may be searched for in sequences objects, as in the re Python module for regular expressions. The result of a search is a PrositeMatch, that behaves in a way similar to a regular expression match.

17.5.4. Searching for patterns and parsing

As a conclusion, let us summarize how pattern searching and parsing interact when analyzing text. Basic text analysis is performed by searching for patterns, that are extracted (or scanned) from a text. Then patterns occurrences may be analyzed (or parsed) as a more general structure. In more complex parsing architectures, as the ones that have been introduced in Section 17.4, it is the parser engine which drives the process, by asking a scanner to feed him with basic token found in the text. In such systems, regular expressions are defined and used in the scanner.