16.5. Documentation strings

The code of the program written in this section (Example 16.2) fills about one page. If you put it away and look at it again a week later, it may be difficult to remember all the choices made during development, and understanding it again may take a lot of time. This underlines the importance of documenting programs.

For this purpose you can use comments, which are ignored by the interpreter. In Python, everything from a # character to the end of the line is a comment. But Python has a more powerful mechanism for documenting programs: if the first statement of a function or file (module) is a string, this string is used as the documentation string, which is available through the pydoc command in a shell or the help function of the interpreter.
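The difference between a comment and a documentation string can be sketched as follows. The function gc_content is a made-up example that is not part of the restriction program, and the code is written so that it also runs under Python 3. The comment is discarded by the interpreter, whereas the string placed as the first statement is kept in the function's __doc__ attribute.

```python
import inspect

def gc_content(seq):
    """Return the fraction of G and C characters in a DNA sequence.

    The sequence may be given in upper or lower case.
    """
    # This comment is ignored by the interpreter and is not part of __doc__.
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / float(len(seq))

# The documentation string is stored on the function object itself.
raw_doc = gc_content.__doc__

# inspect.getdoc() returns the same text with the indentation cleaned up.
clean_doc = inspect.getdoc(gc_content)

print(clean_doc)
```

Calling help(gc_content) in the interpreter displays this text together with the function signature.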

Here is a documented version of our restriction program.

#!/usr/bin/python
"""
program that finds all occurrences of restriction enzymes in a sequence

at the moment the sequence is contained in the seq variable
  todo: read the sequence from a file

enzyme data are read from the file: rebase.dat
and results are written to file: restrict.res

restriction: the program finds only occurrences of EXACT restriction patterns
"""

def isexact(pat):
    """
checks if a pattern is an exact DNA sequence
all strings containing only the following characters are considered to be
an exact DNA sequence: atgcATGC
    """
    for c in pat.upper():
        if c not in 'ATGC':
            return 0
    return 1

def print_matches(ofh, enz, matches):
    """
write starting positions of occurrences of a restriction site to a file
positions in the sequence start at 1
    """
    if matches:
        print >>ofh, "Enzyme %s matches at:" % enz,
        for m in matches:
            print >>ofh, m+1,
        print >>ofh
    else:
        print >>ofh, "No match found for enzyme %s." % enz

def get_site_only(pat):
    """
clean a pattern string read from Rebase to obtain only the recognition site
sequence
    """
    newpat = ""
    for c in pat:
        if c.isalpha():
            newpat += c
    return newpat

def findpos(seq, pat):
    """
find all occurrences of restriction site pattern in a sequence
RETURN VALUE: a list containing the start positions of all occurrences
RESTRICTION: can only process exact patterns; searching with an ambiguous
pattern string does not produce an error, the pattern is matched literally
    """
    matches = []
    current_match = seq.find(pat)
    while current_match != -1:
        matches.append(current_match)
        current_match = seq.find(pat, current_match+1)
    return matches

# starting main stream

seq = """atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa""".replace("\n","").upper()

# open the input and output files
ifh = open("../data/rebase.dat")
ofh = open("rebase.res", "w")

# process enzyme by enzyme
line = ifh.readline()
while line:
    # extract enzyme name and pattern string
    fields = line.split()
    name = fields[0]
    pat = get_site_only(fields[2])

    # find pattern and write the result
    if isexact(pat):
        print_matches(ofh, name, findpos(seq, pat))
    # get the next enzyme
    line = ifh.readline()

# close opened files
ofh.close() 
ifh.close() 
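To see what the docstrings above promise, the three helper functions that do not write to a file can be tried on small inputs. This is only a sketch: it repeats the function bodies with shortened docstrings, the test strings other than the ZraI pattern are made up, and print_matches is left out because it uses the Python 2 print statement, whereas the code below also runs under Python 3.

```python
def isexact(pat):
    """Return 1 if pat contains only the characters atgcATGC, else 0."""
    for c in pat.upper():
        if c not in 'ATGC':
            return 0
    return 1

def get_site_only(pat):
    """Keep only the letters of a pattern string read from Rebase."""
    newpat = ""
    for c in pat:
        if c.isalpha():
            newpat += c
    return newpat

def findpos(seq, pat):
    """Return the start positions (counted from 0) of all occurrences."""
    matches = []
    current_match = seq.find(pat)
    while current_match != -1:
        matches.append(current_match)
        # resume one position further, so overlapping matches are found too
        current_match = seq.find(pat, current_match + 1)
    return matches

# "GAC'GTC" is the pattern string of the enzyme ZraI in rebase.dat
site = get_site_only("GAC'GTC")
print(site)                       # GACGTC
print(isexact(site))              # 1
print(isexact("GANTC"))           # 0, N is an ambiguous base
print(findpos("AAAA", "AA"))      # [0, 1, 2]
print(findpos("ATGC", "GANTC"))   # [], no error is raised
```

The positions returned by findpos are counted from 0; print_matches adds 1 to each of them before writing, as its docstring states.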

The documentation of this program, as formatted by pydoc, looks like this:

Python Library Documentation: module restrict_documented

NAME
    restrict_documented - Program that finds all occurrences of restriction enzymes in a sequence

FILE
    /home/maufrais/cours_python/code/restrict_documented.py

DESCRIPTION
    At the moment the sequence is contained in the seq variable
         
    Enzyme data are read from the file: rebase.dat
    and results are written to file: restrict.res
    
    restriction: the program finds only occurrences of EXACT restriction patterns

FUNCTIONS
    get_site_only(pat)
        clean a pattern string read from Rebase to obtain only the recognition site
        sequence
    
    isexact(pat)
        checks if a pattern is an exact DNA sequence
        all strings containing only the following characters are considered to be
        an exact DNA sequence: atgcATGC
    
    print_matches(ofh, enz, matches)
        write starting positions of occurrences of a restriction site to a file
        positions in the sequence start at 1
    
    findpos(seq, pat)
        find all occurrences of restriction site pattern in a sequence
        RETURN VALUE: a list containing the start positions of all occurrences
        RESTRICTION: can only process exact patterns; searching with an ambiguous
        pattern string does not produce an error, the pattern is matched literally

DATA
    __file__ = './restrict_documented.pyc'
    __name__ = 'restrict_documented'
    fields = ['ZraI', '3', "GAC'GTC", '0', '!', 'AatII', '>I', '136']
    ifh = <closed file '../data/rebase.dat', mode 'r'>
    line = ''
    name = 'ZraI'
    ofh = <closed file 'rebase.res', mode 'w'>
    pat = 'GACGTC'
    seq = 'ATGAGTGAACGTCTGAGCATTACCCCGCTGGGGCCGTATATCGGCGC...CATCGGGCGACGA...
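
The same kind of text can also be produced from inside Python. The sketch below assumes the render_doc and plain functions of the standard pydoc module, as found in current Python 3 versions, and applies them to a copy of isexact. The exact header line and layout depend on the Python version.

```python
import pydoc

def isexact(pat):
    """
    checks if a pattern is an exact DNA sequence
    all strings containing only the following characters are considered to be
    an exact DNA sequence: atgcATGC
    """
    for c in pat.upper():
        if c not in 'ATGC':
            return 0
    return 1

# render_doc() builds the text that help() would display; plain() removes
# the overstrike characters that pydoc uses for bold text on a terminal.
text = pydoc.plain(pydoc.render_doc(isexact))
print(text)
```

Note that pydoc imports a module in order to document it. This is why the DATA section above shows the module-level variables as they were left after the main stream ran, for instance the closed file objects ifh and ofh.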