Chapter 1. Introduction

Table of Contents

1.1. First session
1.1.1. Let's run the python interpreter
1.1.2. Variables
1.1.3. Strings
1.1.4. Functions
1.1.5. Lists
1.1. A first program
1.2. Why Python
1.3. Programming Languages

1.1. First session

1.1.1. Let's run the python interpreter

Python 2.4.2 (#1, Dec 20 2005, 16:25:40) 
[GCC 4.0.0 (Apple Computer, Inc. build 5026)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 5
6
>>> 2 * 5
10
>>> 1 / 2
0

Is it the right answer?

>>> float(1 / 2)
0.0
>>> 1 / 2.0
0.5
>>> float(1)/2
0.5


>>> 'aaa'
'aaa'
>>> len('aaa')
3

What happened?

>>> len('aaa') + len('ttt')
6
>>> len('aaa') + len('ttt') + 1
7
>>> 'aaa' + 'ttt'
'aaattt'
>>> 'aaa' + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects

What does the error message mean?
'str' objects are character strings ('aaa' is a string of characters). 'int' objects are integers. The '+' operator cannot combine a string with an integer.

You can check what kind of values you are manipulating or, in
programming-language parlance, what their type is:

>>> type(1)
<type 'int'>
>>> type('1')
<type 'str'>
>>> type(1.0)
<type 'float'>
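
The TypeError shown earlier goes away once both operands have the same type. The following is a minimal sketch of that idea, assuming the built-in conversion functions str() and int(); the variable names are only illustrative:

```python
# Illustrative names; only the built-in str() and int() conversions are used.
value = 5
label = 'aaa' + str(value)    # str() turns the integer into a string
total = int('1') + value      # int() turns a numeric string into an integer
print(label)
print(total)
```

Which conversion to use depends on what you want: a longer string, or a sum.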
    

1.1.2. Variables

So far, we have manipulated values (integers, floats and strings). Instead of manipulating values directly, literally, you can associate a name with a value and access the value through that name:


>>> a = 3 
>>> a
3

This name is called a 'variable', because its value can change.

Above, the interpreter displays the value (3) of the variable (a).

>>> myVar = 'one sentence'
>>> myVar
'one sentence'

Variable names follow specific rules. See what happens below:

>>> 1string = 'one string'
  File "<stdin>", line 1
      1string = 'one string'
            ^
SyntaxError: invalid syntax


>>> myvar
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'myvar' is not defined

What happened?

>>> a = 2
>>> a
2
>>> a * 5
10
>>> b = a * 5
>>> b
10
>>> a = 1
>>> b
10

Why hasn't b changed?

>>> a = 1      # in this case a is a number
>>> a + 2
3
>>> a = '1'    # in this case a is a string
>>> a + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects

What do you conclude?
      

1.1.3. Strings

Some magic that will be explained later:
>>> from string import *

We can also perform computations on character strings:

>>> codon='atg'
>>> codon * 3
'atgatgatg'
>>> seq1 = 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga'
>>> seq2 = 'cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata'


How do you concatenate seq1 and seq2 in a single string?


>>> seq = seq1 + seq2

What is the length of the string seq?

>>> len(seq)
120

Does the string seq contain the ambiguous 'n' base?
>>> 'n' in seq
False

Does it contain an adenine base?
>>> 'a' in seq
True

>>> seq[1]
'g'   

Why?
Because in Python, as in most programming languages, the characters of a string
are numbered from 0 to the string length - 1, so the first character is:
>>> seq[0]
'a'  

Display the 12th base.

>>> seq[11]
't'

Find the index of the last character.

>>> len(seq)
120

So, because we know the sequence length, we can display the last character
with:
>>> seq[119]
'a'

But the index 119 is only valid for this sequence, not for all the sequences we will work on.
Find a more generic way to do it.

>>> seq[len(seq) - 1] 
'a'

Python provides a special form to get the characters from the end of a string:

>>> seq[-1]
'a'
>>> seq[-2]
't'

Find a way to get the first codon from the sequence:
>>> seq[0] + seq[1] + seq[2]
'agc'

Python provides a form to get 'slices' from strings:
>>> seq[0:3]
'agc'

You can omit the first index of the range if it's 0:
>>> seq[:3]
'agc'

>>> seq[3:6]
'gcc'



How would you access the slice of the last 3 characters?
Hint: you want the positions -3, -2 and -1.
Do not forget that the last index of the slice should be the last position + 1.

>>> seq[-3:len(seq)]
'ata'

You can omit the second index of the range if it's len(seq):
>>> seq[-3:]
'ata'
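
As a recap of the slice notation, the sketch below cuts a short sequence into codons. The sequence and the variable names are made up for the illustration, and the sketch assumes that the sequence length is a multiple of 3:

```python
# Hypothetical 9-base sequence used only for this illustration.
seq = 'agcgccttg'
first_codon = seq[:3]      # positions 0, 1, 2
second_codon = seq[3:6]    # positions 3, 4, 5
last_codon = seq[-3:]      # the last three positions
# A negative start is counted from the end, so both forms give the same slice.
same_slice = seq[-3:] == seq[len(seq) - 3:]
print(first_codon + ' ' + second_codon + ' ' + last_codon)
```

In each slice the second index is excluded, which is why consecutive codons share a boundary value (3, then 6).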

How many of each base does this sequence contain?

>>> count(seq, 'a')
35
>>> count(seq, 'c')
21
>>> count(seq, 'g')
44
>>> count(seq, 't')
20

Compute the percentage of each base in the sequence.
Example for adenine:

>>> seq_len = len(seq)
>>> nb_a = count(seq, 'a')
>>> (nb_a / seq_len) * 100
0

What happened? How could 35 bases out of 120 be 0 percent?
Both nb_a and seq_len are integers, so the division is an integer division: 35 / 120 is truncated to 0 before the multiplication by 100. Converting one operand to a float gives the expected result:

>>> float(nb_a) / seq_len * 100
29.166666666666668
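
The same float conversion can be applied to every base. This is a small sketch rather than the course's own solution; it uses the count() method of strings (which does the same job as the count() function used above), and the variable names are illustrative:

```python
seq1 = 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga'
seq2 = 'cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata'
seq = seq1 + seq2
seq_len = len(seq)

# float() forces a floating-point division, so the fraction is not truncated.
pct_a = 100 * float(seq.count('a')) / seq_len
pct_c = 100 * float(seq.count('c')) / seq_len
pct_g = 100 * float(seq.count('g')) / seq_len
pct_t = 100 * float(seq.count('t')) / seq_len

# The four percentages should add up to (approximately) 100.
total = pct_a + pct_c + pct_g + pct_t
print(total)
```

Checking that the percentages add up to 100 is a cheap way to catch a miscounted base.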

Now, let us say that you want to find a specific pattern in a DNA sequence:

>>> dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""
>>> EcoRI = 'GAATTC'
>>> BamHI = 'GGATCC' 

Looking at the sequence you will see that EcoRI is present twice and
BamHI just once:

tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
 ~~~~~~                                                ~~~ 
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat
~~~                          ~~~~~~

>>> count(dna, EcoRI)
0

Why?

>>> 'atgc' == 'atgc'
True
>>> 'atgc' == 'gcta'
False
>>> 'atgc' == 'ATGC'
False

Why are 'atgc' and 'ATGC' different?

We can convert the case of a string:

>>> EcoRI = lower(EcoRI)
>>> EcoRI
'gaattc'
>>> count(dna, EcoRI)
2
>>> find(dna, EcoRI)
1
>>> find(dna, EcoRI, 2)
88
>>> BamHI = lower(BamHI)
>>> count(dna, BamHI)
0

Why?

Tip: display the sequence:

>>> dna
'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga\ntccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'

What is this '\n' character?

How can we remove it?

>>> dna = replace(dna, '\n', '')
>>> dna
'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctggatccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'
>>> find(dna, BamHI)
55
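
The clean-up steps above (removing the newline, matching the case) can be grouped before searching. The sketch below is one possible way to do it, using the string methods replace(), lower(), count() and find() instead of the string module functions; the names clean_dna, ecori_count, bamhi_count and bamhi_pos are assumptions made for the example:

```python
dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""
EcoRI = 'GAATTC'
BamHI = 'GGATCC'

# Remove the line break and use one case for both sequence and patterns.
clean_dna = dna.replace('\n', '').lower()
ecori_count = clean_dna.count(EcoRI.lower())
bamhi_count = clean_dna.count(BamHI.lower())
# find() returns the index of the first occurrence, or -1 if there is none.
bamhi_pos = clean_dna.find(BamHI.lower())
print(ecori_count)
print(bamhi_count)
print(bamhi_pos)
```

Cleaning the sequence once, before any search, avoids having to remember which patterns might span a line break.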

Using the mechanisms we have learnt so far, produce the complement of
the dna sequence.	    
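
One possible answer, given only as a sketch: chain replace() calls, writing the new bases in upper case so that a base that has already been complemented is not swapped back by a later replace(). The short sequence and the variable names are made up for the illustration, and the string methods replace() and lower() are used in place of the string module functions:

```python
# Hypothetical short sequence used only for this illustration.
dna = 'tgaattctatg'

# Swap a<->t and c<->g. The replacements are upper case, so the later
# replace() calls (which look for lower-case letters) leave them alone.
complement = dna.replace('a', 'T').replace('t', 'A')
complement = complement.replace('c', 'G').replace('g', 'C')

# Go back to lower case once every base has been swapped.
complement = complement.lower()
print(complement)
```

Replacing 'a' by 't' and then 't' by 'a' directly would turn every 'a' and 't' into 'a', which is why the intermediate upper-case step is needed.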

      

1.1.4. Functions

So far, we have used functions that perform a specific computation: len(), count(), lower(), find(), replace(), ... We can also define our own functions:
>>> def welcome():
...   print "Welcome to Paris"
...   print "I hope you will enjoy it"
...
>>>

Notice the indented block of statements below
the def. This block contains the statements
that are executed when the function is called.

Once the function is defined, you can use it:
>>> welcome()
Welcome to Paris
I hope you will enjoy it
>>>

Now, let's say that we want to welcome a specific person:
Welcome to Paris, Isabelle!

We could rewrite the function:
>>> def welcome():
...   print "Welcome to Paris, Isabelle!"
...   print "I hope you will enjoy it"
...
>>>

But what if we want to welcome Pascal? We are not going to write
another function, right? We can specify that our function will take a
parameter:

>>> def welcome(person):
...   print "Welcome to Paris, " + person + "!"
...   print "I hope you will enjoy it"
...
>>> welcome("Isabelle")
Welcome to Paris, Isabelle!
I hope you will enjoy it

>>> welcome("Pascal")
Welcome to Paris, Pascal!
I hope you will enjoy it
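
A function can take more than one parameter. The sketch below is a hypothetical variant, not part of the course material: the name welcome_to and the city parameter are assumptions, and it uses the return statement (presented with the first program later in this chapter) so that the greeting can be stored or printed by the caller:

```python
def welcome_to(person, city):
    # Build the greeting with '+' so that no extra spaces are inserted.
    return "Welcome to " + city + ", " + person + "!"

greeting = welcome_to("Isabelle", "Paris")
print(greeting)
print(welcome_to("Pascal", "Lyon"))
```

The arguments are matched to the parameters by position: the first value goes to person, the second to city.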

      

1.1.5. Lists

You can assemble values of any type into a list:

>>> l = ['a', 'b', 1, 'a cat']

and access its elements with indices, exactly as you did with character strings:
>>> l[1]
'b'

Guess what the '+' operator does with lists?

>>> ['a', 1] + [2, 'toto']
['a', 1, 2, 'toto']

... it produces a new list!
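
To make the "new list" point concrete, here is a minimal sketch (the variable names are illustrative) showing that the two original lists are left untouched by '+':

```python
first = ['a', 1]
second = [2, 'toto']
combined = first + second   # a third, independent list

# The operands keep their original content and length.
print(combined)
print(first)
print(len(first) + len(second) == len(combined))
```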
	          
Now, let's produce a list of numbers:

>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You can associate this list with a variable named l:

>>> l = range(10)
>>>

... How would you produce the list of numbers from 10 to 20?

To get help on the range() function, try:

>>> help(range)

Now, how do you get the list of even numbers from 0 to 100?

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 
42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 
82, 84, 86, 88, 90, 92, 94, 96, 98, 100]

And how would you produce the following list?

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
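
One possible set of answers to the three questions above, as a sketch: range() accepts a start, a stop (which is excluded) and a step. The variable names are illustrative. The list() call is an addition for newer interpreters, where range() no longer returns a list directly; it is harmless where range() already does:

```python
# range(start, stop, step): stop is excluded, step defaults to 1.
ten_to_twenty = list(range(10, 21))      # 21 so that 20 is included
even_numbers = list(range(0, 101, 2))    # every second number up to 100
countdown = list(range(9, -1, -1))       # negative step, stops before -1

print(ten_to_twenty)
print(len(even_numbers))
print(countdown)
```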

    

1.1. A first program

Let us conclude this first presentation of Python with our first little program. This program will ask the user for a DNA sequence, compute its GC%, and print it on the screen. This introductory example highlights some key notions that will be explained and developed in further chapters.

from string import *

def gc_counter(seq):
   nbre_c    = count(seq, 'c')
   nbre_g    = count(seq, 'g')
   long_seq  = len(seq) 
   gc_percent = 100 * float(nbre_c + nbre_g) / long_seq
   return gc_percent

gc =  gc_counter("atgcggctgtgatgtagtcctttc")
print gc
seq = raw_input("enter a dna sequence: ")
print gc_counter(seq)
      

This program contains several components. The first one:

from string import *
      
will be explained later in this course. For now, let's just say that it enables our program to use functions that perform common computations on character strings, such as counting elements.

The second element provides the definition of a function:

def gc_counter(seq):
   nbre_c    = count(seq, 'c')
   nbre_g    = count(seq, 'g')
   long_seq  = len(seq) 
   gc_percent = 100 * float(nbre_c + nbre_g) / long_seq
   return gc_percent
      
The function name is: gc_counter. This is just a definition: in order to invoke this function, a call must be issued, as we will see later in the program.

This function performs the steps needed to compute a GC%. It counts the number of 'g' and 'c', adds them, divides the sum by the length of the sequence, and multiplies the result by 100.
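
The function above counts only lower-case 'g' and 'c', and it fails on an empty sequence because of the division by the length. The following variant is a hedged sketch, not the course's version: the name gc_percent_safe is hypothetical, it uses the lower() and count() string methods, and it relies on an if statement, a construct not covered in this chapter:

```python
def gc_percent_safe(seq):
    # Hypothetical variant of gc_counter: accepts upper or lower case.
    seq = seq.lower()
    # An empty sequence has no bases; return 0.0 instead of dividing by zero.
    if len(seq) == 0:
        return 0.0
    nb_gc = seq.count('g') + seq.count('c')
    # float() keeps the fractional part of the division.
    return 100 * float(nb_gc) / len(seq)

print(gc_percent_safe("atgcggctgtgatgtagtcctttc"))
print(gc_percent_safe("ATGC"))
```

Returning 0.0 for an empty sequence is a design choice made for this sketch; reporting an error would be an equally reasonable option.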

A special statement, return, specifies which value is handed back to the statement that invoked the function: here, the value of gc_percent. So, in the following statement:

gc =  gc_counter("atgcggctgtgatgtagtcctttc")
      
evaluating the gc_counter() call yields the returned value, which is then stored in the variable gc.

The next elements are 4 statements:

gc =  gc_counter("atgcggctgtgatgtagtcctttc")
print gc
seq = raw_input("enter a dna sequence: ")
print gc_counter(seq)
      
The 1st statement computes the GC% of a character string: "atgcggctgtgatgtagtcctttc" and stores the result in a variable named gc. For this purpose, it calls the gc_counter function and provides this function with the character string as an argument. The next statement prints the resulting GC%.

The 3rd statement prints a message on the screen and reads a string entered by the user. The last statement prints the result of calling the gc_counter function on the string just entered by the user.

You can put this program into a file and try it. For this purpose, run your text editor and copy the Python code above into the editor. Save this as a file called: my_gc.py. It is common and convenient to give a Python file a name ending with the ".py" extension.

Once saved, you can run your program by issuing the following command from the command line prompt of your terminal:

python my_gc.py

This program, even though it is quite simple and short, is typical of programs that are used in scientific computing (except for graphical and interactive programs): you provide data as input, a computation is performed and the program outputs the results.