Python course in Bioinformatics

Katja Schuerer

Catherine Letondal

Abstract

Introduction to Python and Biopython with biological examples.

The picture above shows the 3D structure of the Human Ferroxidase protein, which we use in some of the exercises in this course.

This course is designed for biologists who already have some programming experience in other languages, such as Perl or C. For this reason, although it presents a substantial introduction to the Python language, it is not an introduction to programming itself (for that, see [Tis2001] or our course in informatics for biology, which includes an online programming course in Python). What distinguishes this course from general introductory Python courses is its focus on biological examples, which are used throughout, and its exercises drawn from the field of biology. The second half of the course describes the Biopython (http://www.biopython.org/) set of modules. This course can be considered a complement to the Biopython tutorial: it refers to the tutorial often and adds practical exercises that use these components.

Contact: help@pasteur.fr

Comments are welcome.

PDF version of this course


Table of Contents

1. General introduction
1.1. Running Python
1.2. Documentation
1.2.1. General information
1.2.2. Getting information
1.2.3. Making documentation
1.3. Working environment
1.3.1. Emacs
2. Introduction to basic types in Python
2.1. Strings
2.2. Lists
2.3. Tuples
2.4. Sequence types: Summary
2.4.1. Lists and Tuples
2.4.2. Xrange types
2.4.3. Strings and Unicode strings
2.4.4. Buffers
2.5. Dictionaries
2.6. Numbers
2.7. Type conversions
2.8. Files
2.8.1. The print statement
3. Syntax rules
3.1. Indentation
3.1.1. Line structure
3.1.2. Block structure
3.2. Special objects
4. Variables and namespaces
4.1. Variables
4.1.1. Multiple assignments
4.2. Assignments, references and copies of objects
4.3. Namespaces
4.3.1. Accessing namespaces
5. Control flow
5.1. Conditionals
5.2. Loops
5.2.1. while
5.2.2. for
5.2.3. More about loops
6. Functions
6.1. Some definitions
6.2. Operators
6.2.1. Order of evaluation
6.2.2. Object comparisons
6.2.3. . (dot) operator
6.2.4. String formatting
6.3. Defining functions
6.4. Passing arguments to parameters
6.4.1. Reference arguments
6.4.2. Passing arguments by keywords
6.5. Default values of parameters
6.6. Variable number of parameters
7. Functional programming or more about lists
8. Exceptions
8.1. General Mechanism
8.2. Python built-in exceptions
8.3. Raising exceptions
8.4. Defining exceptions
9. Modules and packages
9.1. Modules
9.1.1. Where are the modules?
9.1.2. Loading
9.2. Packages
9.2.1. Loading
10. Classes: Using classes
10.1. Creating instances
10.2. Getting information on a class
11. Biopython: Introduction
11.1. Introduction
11.2. Documentation
11.3. Bio.Seq and Bio.SeqRecord modules
11.3.1. Using Seq class
11.3.2. Reading and writing sequences
11.3.3. Bio classes for sequences
11.4. Bio.SwissProt.SProt and Bio.WWW.ExPASy
11.4.1. Reading entries
11.4.2. Regular expressions in Python
11.4.3. Prosite
11.5. Bio.GenBank
11.5.1. Reading entries
11.6. Running Blast and Clustalw
11.6.1. Blast
11.6.2. Clustalw
12. Classes: Defining a new class
12.1. Basic class definition
12.2. Defining operators for classes
12.3. Inheritance
12.4. Class variables
13. Biopython, continued
13.1. Parsers
13.1.1. Introduction
13.1.2. Exercises: building parsing classes for Enzyme
13.1.3. Iterator
13.1.4. Exercises: building parsing classes for Enzyme (cont)
13.1.5. Dictionary
13.1.6. Using the parser classes
13.1.7. Building parsing classes for phylogenetic trees
13.2. Practical: studying disulfide bonds in Human Ferroxidase 3D structure and alignments
13.2.1. Working with PDB
13.2.2. Study of disulfide bonds
14. Graphics in Python
14.1. Tutorials
14.2. Software
14.3. Summary of examples and exercises with some graphics in this course
A. Solutions
A.1. Introduction to basic types in Python
A.2. Control Flow
A.3. Functions
A.4. Modules and packages
A.5. Biopython: Introduction
A.5.1. Bio.Seq package
A.5.2. Bio.SwissProt.SProt and Bio.WWW.ExPASy
A.5.3. GenBank
A.5.4. Blast
A.5.5. Clustalw
A.6. Classes
A.7. Biopython, continued
A.7.1. Enzyme
A.7.2. Building parsing classes for phylogenetic trees
A.7.3. PDB
B. Bibliography

List of Figures

2.1. Diagram of some built-in types
4.1. Assignment by referencing
4.2. Reference copy
6.1. Referencing Arguments
8.1. Exceptions class hierarchy
9.1. Loading specific components
11.1. Overview of the Biopython course
11.2. Seq, SeqRecord and SeqFeatures modules and class hierarchies
11.3. SeqRecord links to other classes
13.1. Parsers class hierarchy
A.1. Plotting codon frequencies
A.2. Cys conserved positions
A.3. Biopython Alphabet class hierarchy

List of Tables

2.1. Built-in sequence types
2.2. Sequence types: Operators and Functions
2.3. Operations on mutable sequence types
2.4. List methods
2.5. Dictionary methods and operations
2.6. Number built-in types
2.7. Type conversion functions
2.8. File methods
2.9. File modes
6.1. Order of operator evaluation (highest to lowest)
6.2. String formatting: Conversion characters
6.3. String formatting: Modifiers

List of Examples

2.1. Introduction to strings
2.2. slices
2.3. Find substrings
2.4. Introduction to lists
2.5. Functions returning a list
2.6. Generate all possible digests with two enzymes
2.7. Distance of two points in space
2.8. Introduction to dictionaries
2.9. Protein 3-Letter-Code to 1-Letter-Code
2.10. Calculation with complex numbers
2.11. Reading Fasta
2.12. Print statement
3.1. None and pass
4.1. Local variable definition
4.2. Global statement
4.3.
4.4. Assignment by referencing
4.5. Copy composed objects
4.6. Independent copy
4.7. Function execution namespaces
5.1. Test the character of a DNA base
5.2. More complex tests
5.3. Find all occurrences of a restriction site
5.4. Remove whitespace characters from a string
5.5. Find a unique occurrence of a restriction site
5.6. Find all possible start codons in a cds
6.1. Differences between functions and procedures
6.2. Defining functions
6.3. Remove enzymes with ambiguous restriction patterns
6.4. Passing arguments by keywords
6.5. Default values of parameters
6.6. Variable number of parameters
6.7. Optional arguments as keywords
8.1. Filename error
8.2. Raising an exception in case of a wrong DNA character
8.3. Raising your own exception in case of a wrong DNA character
8.4. Exceptions defined in Biopython
9.1. A module
9.2. Loading a module's components
9.3. Using the Bio.Fasta package
11.1. Building Seq sequences from strings
11.2. Reading a FASTA sequence with the Bio.Fasta package
11.3. Plotting codon frequency
11.4. Fetching a SwissProt entry from a file
11.5. Searching for the occurrence of PS00079 and PS00080 Prosite patterns in the Human Ferroxidase protein
11.6. Using an NCBIDictionary
11.7. GenBank Iterator class
11.8. Loading a Clustalw file
11.9. Get the consensus sequence of an alignment
12.1. A sequence class
12.2. Seq operators
12.3. biopython FastaAlignment class
12.4. Exceptions class hierarchy
12.5. Bio.Data.CodonTable class variables
13.1. Using SProt.RecordParser and SProt.SequenceParser

List of Exercises

2.1. GC content
2.2. DNA complement
2.3. Restriction site occurrences as a list
2.4. Restriction digest
2.5. Get the codon list from a DNA sequence
2.6. Reverse Complement of DNA
2.7. String methods
2.8. Translate a DNA sequence
2.9. Operators
2.10. Write a sequence in fasta format
2.11. Header function
5.1. Count ambiguous bases
5.2. Check DNA alphabet
6.1. DNA complement function
6.2. Variable number of arguments
9.1. Loading and using modules
9.2. Creating a module for DNA utilities
9.3. Locating modules
9.4. Locating components in modules
9.5. Bio.Seq module
9.6. Bio.SwissProt package
9.7. Using a class from a module
9.8. Import from Bio.Clustalw
11.1. Length of a Seq sequence
11.2. GC content of a Seq sequence
11.3. Write a sequence in FASTA format
11.4. Code reading: Bio.SeqUtils
11.5. Random mutation of a sequence
11.6. Random mutation of a sequence: count codon frequency
11.7. Random mutation of a sequence: plot codon frequency
11.8. Code reading: connecting with ExPASy and parsing SwissProt records
11.9. SwissProt to FASTA
11.10. Fetch an entry from a local SwissProt database
11.11. Enzymes referenced in a SwissProt entry
11.12. Print the pattern of a Prosite entry
11.13. Display the Prosite references of a SwissProt protein
11.14. Search for occurrences of a protein's Prosite patterns in its sequence
11.15. Extracting the complete CDS from a GenBank entry
11.16. Local Blast, run and display results
11.17. Remote Blast, run and save results
11.18. Extract 10 best HSPs
11.19. Local PSI-Blast
11.20. Search Prosite patterns with PHI-blast
11.21. Running FASTA
11.22. Doing a Clustalw alignment
11.23. Align Blast HSPs
11.24. Get the PSSM from an alignment
11.25. Plotting Cys conserved positions
12.1. A class to store PDB residues
12.2. A class to store PDB residues (cont)
12.3. A class to store PDB residues (cont)
12.4. Code reading: Bio.GenBank.Dictionary class
12.5. Biopython Alphabet class hierarchy
12.6. A class to store PDB residues (cont)
13.1. EnzymeConsumer, reading one entry from a file
13.2. EnzymeConsumer, reading n entries from a file
13.3. EnzymeParser
13.4. Code reading: Bio.SwissProt.SProt.Iterator class
13.5. EnzymeIterator
13.6. EnzymeIterator with lookup
13.7. EnzymeDictionary
13.8. EnzymeParsing module
13.9. Fetching enzymes referenced in a SwissProt entry and display related proteins
13.10. Scanner
13.11. Consumer
13.12. Parser
13.13. Fetch a PDB entry from the RCSB Web server
13.14. Define a PDBStructure class
13.15. Define a PDBConsumer class
13.16. Compute disulfide bonds in 1KCW
13.17. Compare 3D disulfide bonds with Cys positions in the alignment (take #1)
13.18. Compare 3D disulfide bonds with Cys positions in the alignment (take #2)
14.1. Code reading: Drawing by Numbers