16.2. Reading data from files

Let us work on a concrete example. In Section 11.3, we have written the findpos function that finds all occurrences of a pattern in a sequence. In the following example, we will show how we can read restriction site patterns of enzymes described in the ReBase database.

ReBase database format.  In our example we will use an excerpt of the gcgenz.dat file of the restriction enzyme ReBase database. The information for each restriction enzyme is stored in one line of this file. The information about isoschizomers and the header of the file containing explanations have been omitted in the excerpt that we will use. Figure 16.1 shows a part of this file. Each line starts with the enzyme name, followed by the number of bases after which it cuts the restriction site pattern, the sequence of the restriction site, the length of the overhang and some comments about the isoschizomeres and the availability of the enzyme.

Figure 16.1. ReBase™ file format

AarI      11 CACCTGCnnnn'nnnn_   4 !                                    >F	201,386
AatII      5 G_ACGT'C      -4 !  ZraI                              >AEFGIKMNOR	624
AccI       2 GT'mk_AC       2 !  FblI,XmiI                         >ABEGJKMNORSU	288,374,751
Acc65I     1 G'GTAC_C       4 !  KpnI,Asp718I                      >FGINR	505
AciI       1 C'CG_C         2 !                                    >N	497
AclI       2 AA'CG_TT       2 !  Psp1406I                          >IN	140
AfeI       3 AGC'GCT        0 !  Eco47III,Aor51HI,FunI             >IN	15
AflII      1 C'TTAA_G       4 !  BfrI,BspTI,Bst98I,MspCI,Vha464I   >ABJKNO	722
AflIII     1 A'CryG_T       4 !                                    >BGMNS	722
AgeI       1 A'CCGG_T       4 !  AsiAI,BshTI,CspAI,PinAI           >GJNR	739

Using the general scheme of reading lines from files shown above (Example 16.1), the following shows one possibility for extracting the restriction site pattern from the line.

Procedure 16.1. read_rebase

INPUT: a file in rebase format

OUTPUT: a dictionary enz_dict containing all restriction patterns indexed by their name

  1. enz_dict <- empty dictionary
  2. infh <- open file for reading
  3. for each line read from infh:
    1. split line in its fields
    2. name <- first field
    3. pat <- third field
    4. clean pat to get a string containing only the sequence recognized by the restriction enzyme
    5. add the cleaned pat to enz_dict with name as key
  4. close infh
  5. return enz_dict

here is one possibility how to translate this procedure into Python:

def get_site_only(pat):
    newpat = ""
    for c in pat:
        if c.isalpha():
            newpat +=c
    return newpat

def read_rebase(filename):
    infh= open(filename)
    for line in infh.xreadlines():
        fields = line.split()
        name = fields[0]
        pat = fields[2]
        enz_dict[name] = get_site_only(pat)

    return enz_dict

print read_rebase("../data/rebase.dat")