Regular expression is a term used in computer science to refer to the language that is used to describe patterns to be searched for in a text. The term comes from the computing languages theory where regular expressions are used to denote regular languages. In turn, regular languages are defined as the languages that can be recognized by a finite-state automaton (a topic that will be introduced in the algorithmic course).
The aim of a pattern is to define not only one word to be searched for, but a set of words. This definition is provided in a given language, depending on the system you are working with; the set of corresponding words also depends on the system.
In the Unix shell, for instance:
ls s*
means list all the files beginning by 's', and potentially followed
by anything. The command:
ls s[ie]n*
means list all the files beginning by 's', followed by either 'i'
or 'e', and followed by anything, including nothing. So, in the
context of the ls function within a Unix
shell, the set of words is defined as the files existing on the
filesytem (as opposed, for instance, to the files stored on another
computer, not being made available through a distributed filesystem).
In the Prosite database, patterns describing protein domains are described, such as:
H-C-H-x(3)-H-x(3)-[AG]-[LM]
which represent the following set of amino-acid sequences:
sequences begining by 'HCH', followed by 3 positions containing
any amino-acid letter, followed by 'H', followed again by 3 free
positions, followed by either 'A' or 'G', followed by either 'L'
or 'M'. As you can notice, the language to define patterns is
different from the Unix shell, as well as the set of corresponding
words. Here, they are just sequences of amino-acids letters.
In the grep Unix command, a command to search for patterns in files, although similar to the shell pattern syntax, there is a slight difference. Say that a file contains:
science
s
another
the following command:
grep 's*' file
will return ... all the lines, because 's*' means all the words
beginning by 0 or any number of 's'. In the grep
command, the set of words is composed of the lines of the file. So
the term "set of words" must be understood in a broad sense. It must
be understood also that the set is not actually
generated, of course: it can be infinite! Instead, an
operational representation of this set is built, through a
finite-state automaton.
In SQL languages, you can also provide patterns:
select clone_name from CLONE where clone_id like '[^A]%[02468]'
means that you restrict the query to the clones whose identifier
does not begin with a 'A', is followed by any character N times, and ends
with an even number (Sybase).
While the set of words corresponding to a pattern is described by the given expression and the semantics of the system being used, the set of found words, also called occurrences, depends on data. So, occurrences are the words from the set of words that were actually found within data. Figure 17.5 summarizes the concepts.
A detailed presentation of Python regular expressions is available here: Regular Expression HOWTO. To get information about the re module, see pydoc, but also the sre module (Support for regular expressions), for which re is a wrapper.
In Python, regular expressions are handled by a module: re:
>>> import re
Before searching for a pattern, you must first
compile it (this builds the "set of words",
or rather the representation of this set):
>>> expression = '[AP]{1,2}D'
>>> pattern = re.compile(expression)
pattern is a pattern object.
You then issue a search, for instance in the
small sequence seq, by a request to the pattern
object:
>>> seq = "RPAD"
>>> match = pattern.search(seq)
This establishes the matching, or
correspondances, between the set of possible words
and data. match is called a match
object. To get the occurrences, you can ask the match
object for the start and end of the match in the searched text:
>>> print match.start(), match.end(), seq[match.start():match.end()]
1 4 PAD
or the group:
>>> match.group(0)
'PAD'
Figure 17.6 summarizes this system.
Example 17.6. Searching for the occurrence of PS00079 and PS00080 Prosite patterns in the Human Ferroxidase protein
import sys
import re
from Bio.SwissProt import SProt
sp = open(sys.argv[1])
iterator = SProt.Iterator(sp, SProt.SequenceParser())
seq = iterator.next().seq
sp.close()
PS00079 = 'G.[FYW].[LIVMFYW].[CST].{8,8}G[LM]...[LIVMFYW]'
pattern = re.compile(PS00079)
match = pattern.search(seq.tostring())
print PS00079
print match.start(), match.end(), seq[match.start():match.end()]
![]() | The regular expression is stored in a string. |
![]() | The regular expression is compiled in a pattern. |
![]() | The compiled pattern is searched in the sequence. |
![]() | The result of the search is printed. |
There are several methods to search: search and match, the difference being that match looks for a match at the beginning of the string. So, back to the example, the following statement:
match = pattern.match(seq.tostring())would return a positive result only if the sequence begins by an occurrence of PS00079.
A convenient feature enables to associate a name to sub-parts of the matched text:
import sys import re from Bio.SwissProt import SProt sp = open(sys.argv[1]) iterator = SProt.Iterator(sp, SProt.SequenceParser()) seq = iterator.next().seq sp.close() PS00080 = '(?P<copper3>H)CH...H...[AG](?P<copper1>[LM])'pattern = re.compile(PS00080) match = pattern.search(seq.tostring()) print PS00080 print match.start(), match.end(), seq[match.start():match.end()] print 'copper type 3 binding residue: ', match.group('copper3')
print 'copper type 1 binding residue: ', match.group('copper1')
![]() | The regular expression now contains 2 identifiers: copper1 and copper3. |
![]() | You can print the sub-parts of the result identified by variables: copper1 and copper3. |
Shortcuts. The re module provides shortcuts to directly search for an expression without compiling the pattern into a pattern object:
>>> match = re.search('[AP]{1,2}D', "RPAD")
>>> match.group(0)
'PAD'
You can also directly get the occurrences of a pattern object
in a string:
>>> pattern = re.compile('[AP]{1,2}D', )
>>> pattern.findall("RPAD")
['PAD']
or even the occurences of an expression, without compiling the
pattern:
>>> re.findall('[AP]{1,2}D', "RPAD")
['PAD']
Figure 17.7 summarizes
re module
objects and methods to perform pattern searching.
Search modes.
Text substitutions.
This section presents the Prosite classes in Biopython, which have a common interface with the Python pattern module.
Biopython defines several dictionaries to access biological databases. Having a dictionary means that you can fetch an entry by:
entry = prosite['PS00079']For this to work, you first need to create the dictionary:
prosite = Bio.Prosite.ExPASyDictionary()As you can guess by the name of the module, you actually fetch the Prosite entry on the Web. You could also fetch the Prosite entry from a local database with the golden program (see ???). The entry fetched above is actually a string. In order to have the dictionary return a record, you must rather create it like this:
prosite = Bio.Prosite.ExPASyDictionary(parser=Bio.Prosite.RecordParser())
The Bio.Prosite package defines a Pattern class that enables to create patterns which may be searched for in sequences objects, as in the re Python module for regular expressions. The result of a search is a PrositeMatch, that behaves in a way similar to a regular expression match.
As a conclusion, let us summarize how pattern searching and parsing interact when analyzing text. Basic text analysis is performed by searching for patterns, that are extracted (or scanned) from a text. Then patterns occurrences may be analyzed (or parsed) as a more general structure. In more complex parsing architectures, as the ones that have been introduced in Section 17.4, it is the parser engine which drives the process, by asking a scanner to feed him with basic token found in the text. In such systems, regular expressions are defined and used in the scanner.