11.2. The for loop

Verify protein sequences.  Let us start with an example that checks whether all characters of a protein sequence are valid amino acid characters. How could we do this by hand? One possibility is to start at the beginning of the protein sequence and check each character one by one. Here is a more formal description of the algorithm:

  1. for each character of the protein sequence
    1. if the current character is not an amino acid
      1. print the invalid character
    2. end if
  2. end for

In Python this can be performed with the for statement.

The for loop

The for loop enables to iterate on an ordered collection of objects and to execute the same sequence of statements for each element.

This sort of iteration is also known as a traversal, because the collection is traversed from left to right and a particular task is applied to each element.

So in our example, the collection of elements is a string which we traverse with the for loop. We also have to specify what a valid amino acid is, because computers do not know about it. Nevertheless we have already written a function doing this (Example 8.2).

>>> protein = "SERLSITPLGPYIGAQIJSGADLTRPLSDNQFEQLYHAVLRHQVVFLRDQAITPQQQRALA"

>>> for aa in protein:
...    if not isAminoAcid(aa):
...        print "protein contains unknown amino acid: ", aa
...
protein contains unknown amino acid:  J
      

As def and if, the for statement is also a statement containing a block. The header line gives the two essential information for the traversal. It starts with for followed by a variable name to which each element of the collection is assigned during the execution of the block of statements. The second half of the header indicates the collection in which the elements can be found. As in all functions, operations and expressions, the collection can be specified by a value, a variable or even a composed expression.

Figure 11.1. The for loop

Getting the position of invalid amino acids.  As written above we cannot show the position of the invalid amino acid in the protein sequence, because only the amino acid character is known in the body of the for loop. Therefore, we have to iterate over the indices of the sequence. Here is the algorithm:

  1. for each position in the protein sequence
    1. if the character at this position is not an amino acid
      1. print the position and the invalid character
    2. end if
  2. end for

What are the possible values of the positions? Remember the string indices, they start at 0 and end at one before the length of the string. Python has a function that gives the possibility to create a list of numbers that is called range. Using this function, we can translate the algorithms above into Python:

>>> for i in range(len(protein)):
...    if not isAminoAcid(protein[i]):
...        print "protein contains unknown amino acid " , protein[i] , \
...              " at position ", i
      

Note

There is a convention in most programming languages to use i, k, l as variable names for the iteration variable.

Translate cds sequences.  Let's take another example. Say that we have a cds sequence and that we would like to know such things as the corresponding amino acid sequence or the codon usage. How could we achieve this? For sure, it looks like a traversal of the cds sequence, but the things we are looking for do not correspond anymore to one nucleotide of the cds sequence but to a codon which is a three letter substring. If we could access the codons one by one, the translation into the amino acid sequence would look like this:

In the body of the loop we need to establish the correspondence between a codon sequence and the amino acid. We know how to do this by hand, by looking up a codon in the codon translation table. A dictionary is the data type that gives us such sort of access to the collection of codons.

>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> code['atg']
'M'
      

Now let us go back to the first part of the problem: getting the codons. If we know where a codon starts, we only have to extract the substring of length 3 starting from this position. Here is the algorithm:

Fortunately, the range function can take an optional step argument.

>>> range(10) 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> range(4,10)
[4, 5, 6, 7, 8, 9]
>>> range(0,10,3)
[0, 3, 6, 9]
      

We can now print the codon list:

>>> cds = "atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaataa"

>>> for i in range(0,len(cds),3):
...     print cds[i:i+3],
... 
atg
agt
gaa
cgt
ctg
agc
att
acc
ccg
ctg
ggg
ccg
tat
atc
ggc
gca
caa
taa
taa
      

If we replace the print statement with the access to the dictionary of translation, we can translate the codon into the amino acid.

Example 11.1. Translate a cds sequence into its corresponding protein sequence

>>> def translate(cds, code):
...    prot = ""
...    for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        prot = prot + code[codon]
...    return prot
>>> translate(cds, code)
'MSERLSITPLGPYIGAQ*'
      

What about the computing of the codon usage? We do not need the translation table anymore. But we have to count each codon now. We also need a data structure accessible by codons, although it is not to get information, but to store the result. Here is the algorithm: do not forget that accessing a key which does not exist in a dictionary, is not allowed.

Here is the corresponding Python code:

>>> def count_codons(cds):
...     usage = {}
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        if usage.has_key(codon):
...           usage[codon] += 1
...        else:
...           usage[codon] = 1
...     return usage
...
>>> count_codons(cds)
{'acc': 1, 'atg': 1, 'atc': 1, 'gca': 1, 'agc': 1, 'ggg': 1, 'att': 1, 'ctg': 2,
 'taa': 1, 'ggc': 1, 'tat': 1, 'ccg': 2, 'agt': 1, 'caa': 1, 'cgt': 1, 'gaa': 1}
      

Exercise 11.1. Write the complete codon usage function

Transform the count_codons function into a function getting the real codon usage. Hint: You need to divide each count by the total number of codons.