16.4. Design problems

Here is the complete program to find all occurrences of restriction sites from a set of enzymes in a sequence. The enzyme data are read from a file and the results are also stored in a file.

def findpos(seq, pat):
    matches = []
    current_match = seq.find(pat)
    while current_match != -1:
        matches.append(current_match)
        current_match =seq.find(pat, current_match+1)
    return matches
    
def isexact(pat):
    for c in pat.upper():
        if c not in 'ATGC':
            return 0
    return 1

def print_matches(ofh, enz, matches):
    if matches:
        print >>ofh, "Enzyme %s matches at:" % enz,
        for m in matches:
            print >>ofh, m,
        print >>ofh
    else:
        print >>ofh, "No match found for enzyme %s." % enz

def get_site_only(pat):
    newpat = ""
    for c in pat:
        if c.isalpha():
            newpat += c
    return newpat

def read_rebase(filename):
    enz_dict = {}
    infh = open(filename)
    for line in infh.xreadlines():
        fields = line.split()
        name = fields[0]
        pat = fields[2]
        enz_dict[name] = get_site_only(pat)

    infh.close()
    return enz_dict

def findpos(seq, pat):
    matches = []
    current_match = seq.find(pat)
    while current_match != -1:
        matches.append(current_match)
        current_match =seq.find(pat, current_match+1)
    return matches


seq = """atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa""".replace("\n","").upper()


enzdict = read_rebase("../data/rebase.dat")

ofh = open("rebase.res", "w")       
for enzname in enzdict.keys():
    pat = enzdict[enzname]
    if isexact(pat):
        print_matches(ofh, enzname, findpos(seq, pat))
ofh.close() 



    

We have written this program step by step by broadening the problem at each step, but we have never looked at the global design of the program. Therefore we have used two strategies while reading and writing. In the reading step all data are read at once, whereas the results are written each time we got them.

In biological problems it is often necessary to handle lots of data and this can lead to problems if you try to handle all of them in memory, as we have done with the enzyme data. Sometimes we need all the data to solve a problem, but often we can treat them each at a time. If the dataset is small, the choice of the strategy does not change a lot, but if you handle a large amount of data, it is worth asking whether you really need it all in memory?

In our restriction problem it is not really necessary to store all the enzyme data in the memory. It is possible to treat enzyme by enzyme with one loop that reads the enzyme data, processes the sequence and writes the result.

Example 16.2. Restriction of a DNA sequence

def isexact(pat):
    for c in pat.upper():
        if c not in 'ATGC':
            return 0
    return 1

def print_matches(ofh, enz, matches):
    if matches:
        print >>ofh, "Enzyme %s matches at:" % enz,
        for m in matches:
            print >>ofh, m,
        print >>ofh
    else:
        print >>ofh, "No match found for enzyme %s." % enz

def get_site_only(pat):
    newpat = ""
    for c in pat:
        if c.isalpha():
            newpat += c
    return newpat

def findpos(seq, pat):
    matches = []
    current_match = seq.find(pat)
    while current_match != -1:
        matches.append(current_match)
        current_match =seq.find(pat, current_match+1)
    return matches


seq = """atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa""".replace("\n","").upper()


ifh = open("../data/rebase.dat")
ofh = open("rebase.res", "w")       

line = ifh.readline()

while line:
    fields = line.split()
    name = fields[0]
    pat = get_site_only(fields[2])

    if isexact(pat):
        print_matches(ofh, name, findpos(seq, pat))
        line = ifh.readline()
    else:
    	line = ifh.readline()

ofh.close() 
ifh.close()

Important

Notice that there are two files opened at the same time during the loop.

Figure 16.2 shows three flowchart comparing our first solution, with the new design and a version that first reads all data in, handles them and writes all results at the end.

Figure 16.2. Flowchart of the processing of the sequence

Exercise 16.1. Multiple sequences for all enzymes

Think about alternative solutions, their advantages and problems if you need to process more than one sequence for all restriction enzymes.