Let us work on a concrete example. In Section 11.3, we have written the findpos function that finds all occurrences of a pattern in a sequence. In the following example, we will show how we can read restriction site patterns of enzymes described in the ReBase database.
ReBase database format. In our example we will use an excerpt of the gcgenz.dat file of the restriction enzyme ReBase database. The information for each restriction enzyme is stored in one line of this file. The information about isoschizomers and the header of the file containing explanations have been omitted in the excerpt that we will use. Figure 16.1 shows a part of this file. Each line starts with the enzyme name, followed by the number of bases after which it cuts the restriction site pattern, the sequence of the restriction site, the length of the overhang and some comments about the isoschizomeres and the availability of the enzyme.
Figure 16.1. ReBase™ file format
AarI 11 CACCTGCnnnn'nnnn_ 4 ! >F 201,386 AatII 5 G_ACGT'C -4 ! ZraI >AEFGIKMNOR 624 AccI 2 GT'mk_AC 2 ! FblI,XmiI >ABEGJKMNORSU 288,374,751 Acc65I 1 G'GTAC_C 4 ! KpnI,Asp718I >FGINR 505 AciI 1 C'CG_C 2 ! >N 497 AclI 2 AA'CG_TT 2 ! Psp1406I >IN 140 AfeI 3 AGC'GCT 0 ! Eco47III,Aor51HI,FunI >IN 15 AflII 1 C'TTAA_G 4 ! BfrI,BspTI,Bst98I,MspCI,Vha464I >ABJKNO 722 AflIII 1 A'CryG_T 4 ! >BGMNS 722 AgeI 1 A'CCGG_T 4 ! AsiAI,BshTI,CspAI,PinAI >GJNR 739
Using the general scheme of reading lines from files shown above (Example 16.1), the following shows one possibility for extracting the restriction site pattern from the line.
Procedure 16.1. read_rebase
INPUT: a file in rebase format
OUTPUT: a dictionary enz_dict containing all restriction patterns indexed by their name
here is one possibility how to translate this procedure into Python:
def get_site_only(pat):
newpat = ""
for c in pat:
if c.isalpha():
newpat +=c
return newpat
def read_rebase(filename):
enz_dict={}
infh= open(filename)
for line in infh.xreadlines():
fields = line.split()
name = fields[0]
pat = fields[2]
enz_dict[name] = get_site_only(pat)
infh.close()
return enz_dict
print read_rebase("../data/rebase.dat")