In bioinformatics, parsing is very important, since it lets you extract information from data files or from the results produced by various analysis programs, and make it available in your own programs. For instance, a Blast parser transforms a text output into a list of hits and their alignments, which can be made available as a data structure, such as Biopython Bio.Blast.Record objects, that you can use in a Python program.
The purpose of this section is not to present everything about parsing, but just to introduce some basic notions.
Parsing means analyzing a text and producing structured data in a form that is useful for programs. The result can be a list of strings, a set of class instances, or just a boolean: this depends on your needs and on the parsing system you are using. An important aspect of parsing is the architecture that is used to process the text that you want to analyze.
Parsing can be done by just reading a file or a stream line by line, and by looking for the occurrence(s) of a word, or a pattern. In Figure 17.1, lines are searched for a header pattern, or a hit pattern, and processed accordingly.
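As an illustration of this line-by-line approach, here is a minimal sketch. The report layout, the two patterns and the scan function are assumptions made up for this example; real Blast output is more complex and is not handled here.

```python
import io
import re

# Hypothetical, simplified report layout invented for this sketch.
REPORT = """\
Query= seq1 example query
>hit1 first example hit
 Score = 120
>hit2 second example hit
 Score = 75
"""

# Assumed patterns: a header line starts with "Query=", a hit line with ">".
HEADER_RE = re.compile(r"^Query=\s*(\S+)")
HIT_RE = re.compile(r"^>(\S+)")


def scan(stream):
    """Read the stream line by line and collect the query name and hit names."""
    query = None
    hits = []
    for line in stream:
        header = HEADER_RE.match(line)
        if header:
            query = header.group(1)
            continue
        hit = HIT_RE.match(line)
        if hit:
            hits.append(hit.group(1))
    return query, hits


query, hits = scan(io.StringIO(REPORT))
print(query, hits)
```

Here the loop itself is the engine: it decides which pattern to try on each line and what to do when one matches.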
You can tell an external component which set of words you are interested in, by providing a description of the set of words to be found. You then feed this component a document to parse, and wait for it to tell you when these words are found and to send you the occurrences. This kind of system is said to be event-driven, and XML provides tools for this type of parsing.
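One possible sketch of event-driven parsing uses the xml.sax module of the Python standard library. The XML layout and the HitHandler class below are invented for this illustration: the parser reads the document and calls the handler methods each time it meets an element start, some text, or an element end.

```python
import xml.sax

# Hypothetical XML layout, invented for this illustration.
DOCUMENT = b"""<report>
  <hit id="hit1"><score>120</score></hit>
  <hit id="hit2"><score>75</score></hit>
</report>"""


class HitHandler(xml.sax.ContentHandler):
    """Receives events from the parser and records the ones about hits."""

    def __init__(self):
        super().__init__()
        self.hits = []            # list of (id, score) pairs
        self._current_id = None
        self._in_score = False
        self._text = []

    def startElement(self, name, attrs):
        if name == "hit":
            self._current_id = attrs["id"]
        elif name == "score":
            self._in_score = True
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._in_score:
            self._text.append(content)

    def endElement(self, name):
        if name == "score":
            score = int("".join(self._text))
            self.hits.append((self._current_id, score))
            self._in_score = False


handler = HitHandler()
xml.sax.parseString(DOCUMENT, handler)
print(handler.hits)
```

The program never loops over the document itself: it only reacts to the events that the parser sends.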
You can describe the whole document as a set of hierarchical subparts, and associate with each subpart actions to be automatically executed by a parsing engine. Figure 17.3 shows such a system to parse a Blast report. A Blast report is described as a header followed by a list of hits. A hit is described as a score and a list of HSPs, and an HSP is described as a score and a list of alignments. You define these subparts by a set of rules, sometimes using patterns, in what is usually called a grammar, or even a decorated grammar, since you decorate each subpart with an associated action. The lex and yacc system is such a parsing engine.
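The idea of a grammar with actions can be sketched in plain Python. This is not lex and yacc: it is a small hand-written recursive-descent parser, and the report layout, the token names and the Parser class are all assumptions made up for this example. Each method corresponds to one rule of the grammar, and its return statement plays the role of the action attached to that rule.

```python
import re

# Hypothetical, simplified report layout invented for this sketch.
REPORT = """\
Query= seq1
>hit1 Score = 200
 HSP Score = 120
  ALIGN ACGT
  ALIGN ACGA
 HSP Score = 80
  ALIGN TTGA
>hit2 Score = 75
 HSP Score = 75
  ALIGN GGCC
"""

# Lexical rules: each line of the report becomes one (kind, values) token.
TOKEN_RULES = [
    ("HEADER", re.compile(r"^Query=\s*(\S+)$")),
    ("HIT", re.compile(r"^>(\S+)\s+Score\s*=\s*(\d+)$")),
    ("HSP", re.compile(r"^\s*HSP\s+Score\s*=\s*(\d+)$")),
    ("ALIGN", re.compile(r"^\s*ALIGN\s+(\S+)$")),
]


def tokenize(text):
    tokens = []
    for line in text.splitlines():
        for kind, pattern in TOKEN_RULES:
            match = pattern.match(line)
            if match:
                tokens.append((kind, match.groups()))
                break
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Return the kind of the next token, or None at the end."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise ValueError(f"expected {kind}, found {self.peek()}")
        values = self.tokens[self.pos][1]
        self.pos += 1
        return values

    def report(self):  # rule: report := HEADER hit*
        (query,) = self.expect("HEADER")
        hits = []
        while self.peek() == "HIT":
            hits.append(self.hit())
        if self.peek() is not None:
            raise ValueError(f"unexpected {self.peek()} after the hits")
        return {"query": query, "hits": hits}  # action

    def hit(self):  # rule: hit := HIT hsp*
        name, score = self.expect("HIT")
        hsps = []
        while self.peek() == "HSP":
            hsps.append(self.hsp())
        return {"name": name, "score": int(score), "hsps": hsps}  # action

    def hsp(self):  # rule: hsp := HSP ALIGN*
        (score,) = self.expect("HSP")
        alignments = []
        while self.peek() == "ALIGN":
            alignments.append(self.expect("ALIGN")[0])
        return {"score": int(score), "alignments": alignments}  # action


result = Parser(tokenize(REPORT)).report()
print(result["query"], [hit["name"] for hit in result["hits"]])
```

With a tool of the lex and yacc family, you would write only the rules and the actions, and the tool would generate the code that drives them.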
You can have a parsing engine process your data according to a grammar and return a hierarchical data structure in a form that your program can understand (for instance, in Python, as a set of objects). The XML/DOM engine behaves like this.
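A minimal sketch of this behaviour, using the xml.dom.minidom module of the Python standard library, could look like the following. The XML layout is invented for this illustration. The engine parses the whole document first and returns a tree of objects, which the program then explores.

```python
from xml.dom import minidom

# Hypothetical XML layout, invented for this illustration.
DOCUMENT = """<report>
  <hit id="hit1"><score>120</score></hit>
  <hit id="hit2"><score>75</score></hit>
</report>"""

# The engine builds the complete tree before the program sees anything.
dom = minidom.parseString(DOCUMENT)

hits = []
for hit in dom.getElementsByTagName("hit"):
    score_node = hit.getElementsByTagName("score")[0]
    # The text of <score> is stored in a child text node.
    hits.append((hit.getAttribute("id"), int(score_node.firstChild.data)))

print(hits)
```

Compared with the event-driven approach, the whole document is held in memory, but the program can walk the tree in any order.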
So, in all cases, there is an engine driving the whole process, be it a simple loop or a specialized component. In this chapter, we will just do some "manual" parsing with the patterns introduced in Section 17.5. Some event-driven parsing will be done as practical work on abstract frameworks (see Exercise 19.4) and during the Web/XML course.