    Python 2.4.2 (#1, Dec 20 2005, 16:25:40)
    [GCC 4.0.0 (Apple Computer, Inc. build 5026)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 1 + 5
    6
    >>> 2 * 5
    10
    >>> 1 / 2
    0

Is it the right answer?

    >>> float(1 / 2)
    0.0
    >>> 1 / 2.0
    0.5
    >>> float(1)/2
    0.5
    >>> 'aaa'
    'aaa'
    >>> len('aaa')
    3

What happened?

    >>> len('aaa') + len('ttt')
    6
    >>> len('aaa') + len('ttt') + 1
    7
    >>> 'aaa' + 'ttt'
    'aaattt'
    >>> 'aaa' + 5
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: cannot concatenate 'str' and 'int' objects

What does the error message mean? 'str' objects are character strings ('aaa' is a string of characters). 'int' objects are integers. You can check what kind of values you are manipulating or, in programming language parlance, what their type is:

    >>> type(1)
    <type 'int'>
    >>> type('1')
    <type 'str'>
    >>> type(1.0)
    <type 'float'>
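The division and type examples above can be collected into a short script. This is only a sketch: it writes integer division explicitly as `//` so that it behaves the same on Python 2 and Python 3 (on Python 3, `1 / 2` gives 0.5), and the variable names are chosen here for illustration.

```python
# Integer (floor) division, written explicitly so the result is the same
# on Python 2 and Python 3.
int_result = 1 // 2

# Converting one operand to float *before* dividing keeps the fraction.
float_result = float(1) / 2

# Converting *after* an integer division is too late: the fraction is gone.
too_late = float(1 // 2)

# '+' cannot mix str and int; convert the number to a string explicitly.
label = 'aaa' + str(5)

print(int_result)
print(float_result)
print(too_late)
print(label)
```

The point of the comparison is the order of operations: `float()` only helps if it is applied to an operand before the division takes place.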
    >>> a = 3
    >>> a
    3

This name is called a 'variable', because its value can change. Above, the interpreter displays the value (3) of the variable (a).

    >>> myVar = 'one sentence'
    >>> myVar
    'one sentence'

Variable names follow specific rules. See what happens below:

    >>> 1string = 'one string'
      File "<stdin>", line 1
        1string = 'one string'
              ^
    SyntaxError: invalid syntax
    >>> myvar
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    NameError: name 'myvar' is not defined

What happened?

    >>> a = 2
    >>> a
    2
    >>> a * 5
    10
    >>> b = a * 5
    >>> b
    10
    >>> a = 1
    >>> b
    10

Why hasn't b changed?

    >>> a = 1      # in this case a is a number
    >>> a + 2
    3
    >>> a = '1'    # in this case a is a string
    >>> a + 1
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: cannot concatenate 'str' and 'int' objects

What do you conclude?
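One way to reason about the two questions above is that an assignment evaluates its right-hand side once and binds the name to the resulting value. The sketch below replays the session as a script; the names `b_after_rebinding`, `b_recomputed`, `as_text` and `as_number` are introduced here only for illustration.

```python
a = 2
b = a * 5                  # the expression is evaluated now; b is bound to 10
a = 1                      # rebinding a leaves b untouched
b_after_rebinding = b

b = a * 5                  # recompute explicitly to pick up the new value of a
b_recomputed = b

# A name can be rebound to a value of another type. What '+' does then
# depends on the type of the value, not on the name.
a = '1'
as_text = a + '1'          # str + str: concatenation
as_number = int(a) + 1     # convert explicitly to add numerically

print(b_after_rebinding)
print(b_recomputed)
print(as_text)
print(as_number)
```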
    >>> from string import *

We can also perform operations on character strings:

    >>> codon = 'atg'
    >>> codon * 3
    'atgatgatg'
    >>> seq1 = 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga'
    >>> seq2 = 'cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata'

How do you concatenate seq1 and seq2 into a single string?

    >>> seq = seq1 + seq2

What is the length of the string seq?

    >>> len(seq)
    120

Does the string seq contain the ambiguous 'n' base?

    >>> 'n' in seq
    False

Does it contain an adenine base?

    >>> 'a' in seq
    True
    >>> seq[1]
    'g'

Why? Because in computer science, the characters of a string are numbered from 0 to the string length - 1, so the first character is:

    >>> seq[0]
    'a'

Display the 12th base.

    >>> seq[11]
    't'

Find the index of the last character.

    >>> len(seq)
    120

So, because we know the sequence length, we can display the last character with:

    >>> seq[119]
    'a'

But this will not work for every sequence we will handle. Find a more generic way to do it.

    >>> seq[len(seq) - 1]
    'a'

Python provides a special form to get characters counted from the end of a string:

    >>> seq[-1]
    'a'
    >>> seq[-2]
    't'

Find a way to get the first codon of the sequence.

    >>> seq[0] + seq[1] + seq[2]
    'agc'

Python provides a form to get 'slices' from strings:

    >>> seq[0:3]
    'agc'

You can omit the first index of the range if it is 0:

    >>> seq[:3]
    'agc'
    >>> seq[3:6]
    'gcc'

How would you access the slice made of the last 3 characters? Hint: you want the positions -3, -2 and -1. Do not forget that the last index of the slice should be the last position + 1.

    >>> seq[-3:len(seq)]
    'ata'

You can omit the second index of the range if it is len(seq):

    >>> seq[-3:]
    'ata'

How many of each base does this sequence contain?

    >>> count(seq, 'a')
    35
    >>> count(seq, 'c')
    21
    >>> count(seq, 'g')
    44
    >>> count(seq, 't')
    20

Compute the percentage of each base in the sequence. Example for adenine:

    >>> long = len(seq)
    >>> nb_a = count(seq, 'a')
    >>> (nb_a / long) * 100
    0

What happened? How could 35 bases out of 120 be 0 percent?
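The slicing and counting steps above can be replayed as a script. This sketch assumes the string methods `seq.count(...)` in place of the `string` module functions used in the session (the methods exist in both Python 2 and Python 3); the names `pct_a`, `pct_c`, `pct_g` and `pct_t` are chosen here for illustration.

```python
seq1 = 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga'
seq2 = 'cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata'
seq = seq1 + seq2
length = len(seq)

# Slices: the start index is included, the stop index is excluded.
first_codon = seq[:3]
last_three = seq[-3:]

# float() is applied before the division so the fraction is not truncated.
pct_a = 100 * float(seq.count('a')) / length
pct_c = 100 * float(seq.count('c')) / length
pct_g = 100 * float(seq.count('g')) / length
pct_t = 100 * float(seq.count('t')) / length

print(first_codon)
print(last_three)
print(pct_a)
print(pct_c)
print(pct_g)
print(pct_t)
```

A quick sanity check is that the four counts add up to the sequence length, so the four percentages should add up to 100.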
This is because nb_a and long are both integers: in Python 2, dividing one integer by another gives an integer, so 35 / 120 is truncated to 0. Converting one operand to a floating-point number before dividing gives the expected result:

    >>> float(nb_a) / long * 100
    29.166666666666668

Now, let us say that you want to find a specific pattern in a DNA sequence:

    >>> dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
    ... tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""
    >>> EcoRI = 'GAATTC'
    >>> BamHI = 'GGATCC'

Looking at the sequence, you will see that EcoRI is present twice and BamHI just once:

    tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
     ~~~~~~                                                ~~~
    tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat
    ~~~                          ~~~~~~

    >>> count(dna, EcoRI)
    0

Why?

    >>> 'atgc' == 'atgc'
    True
    >>> 'atgc' == 'gcta'
    False
    >>> 'atgc' == 'ATGC'
    False

Why are 'atgc' and 'ATGC' different? We can convert the case of a string:

    >>> EcoRI = lower(EcoRI)
    >>> EcoRI
    'gaattc'
    >>> count(dna, EcoRI)
    2
    >>> find(dna, EcoRI)
    1
    >>> find(dna, EcoRI, 2)
    88
    >>> BamHI = lower(BamHI)
    >>> count(dna, BamHI)
    0

Why? Tip: display the sequence:

    >>> dna
    'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga\ntccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'

What is this '\n' character? How can it be removed?

    >>> dna = replace(dna, '\n', '')
    >>> dna
    'tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctggatccctagctaagatgtattattctgctgtgaattcgatcccactaaagat'
    >>> find(dna, BamHI)
    55

Using the mechanisms we have learnt so far, produce the complement of the dna sequence.
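The search steps above (lower-casing the pattern, removing the newline, then calling find repeatedly with a start position) can be combined. The sketch below is one possible way to do it, not part of the original course: `find_all` is a hypothetical helper, it uses a function definition and a `while` loop that have not been introduced at this point, and it relies on the string methods `replace`, `lower` and `find` rather than the `string` module functions.

```python
dna = """tgaattctatgaatggactgtccccaaagaagtaggacccactaatgcagatcctgga
tccctagctaagatgtattattctgctgtgaattcgatcccactaaagat"""

def find_all(sequence, pattern):
    """Return the start index of every occurrence of pattern in sequence."""
    positions = []
    start = sequence.find(pattern)           # -1 means "not found"
    while start != -1:
        positions.append(start)
        # Resume the search one position after the previous hit.
        start = sequence.find(pattern, start + 1)
    return positions

# Normalise the sequence once: drop the line break, use a single case.
clean = dna.replace('\n', '').lower()

ecori_sites = find_all(clean, 'GAATTC'.lower())
bamhi_sites = find_all(clean, 'GGATCC'.lower())

print(ecori_sites)
print(bamhi_sites)
```

Note that positions are reported in the cleaned sequence, so the second EcoRI site is one position earlier than in the session above, where the '\n' character was still counted.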
    >>> def welcome():
    ...     print "Welcome to Paris"
    ...     print "I hope you will enjoy it"
    ...
    >>>

Notice the indented block of statements below the def. This block contains the statements that will be executed when the function is called. Once the function is defined, you can use it:

    >>> welcome()
    Welcome to Paris
    I hope you will enjoy it
    >>>

Now, let's say that we want to welcome a specific person:

    Welcome to Paris, Isabelle!

We could rewrite the function:

    >>> def welcome():
    ...     print "Welcome to Paris, Isabelle!"
    ...     print "I hope you will enjoy it"
    ...
    >>>

But what if we want to welcome Pascal? We are not going to write another function, right? We can specify that our function takes a parameter:

    >>> def welcome(person):
    ...     print "Welcome to Paris, " + person + "!"
    ...     print "I hope you will enjoy it"
    ...
    >>> welcome("Isabelle")
    Welcome to Paris, Isabelle!
    I hope you will enjoy it
    >>> welcome("Pascal")
    Welcome to Paris, Pascal!
    I hope you will enjoy it
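As a variation on the parameterised function, the greeting text can be built in a separate helper so that it can be inspected without printing. This is a sketch only: `greeting` is a hypothetical helper, it uses the `return` statement described later in the course, and `print(...)` is written with parentheses so the script runs on both Python 2 and Python 3.

```python
def greeting(person):
    # '+' joins the strings exactly as written, without adding spaces.
    return "Welcome to Paris, " + person + "!"

def welcome(person):
    print(greeting(person))
    print("I hope you will enjoy it")

# The same function serves any number of people.
welcome("Isabelle")
welcome("Pascal")
```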
    >>> l = ['a', 'b', 1, 'a cat']

You can access the elements of a list with indices, exactly as you did with character strings:

    >>> l[1]
    'b'

Guess what the '+' operator does with lists?

    >>> ['a', 1] + [2, 'toto']
    ['a', 1, 2, 'toto']

... it produces a new list! Now, let's produce a list of numbers:

    >>> range(10)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

You can assign this list to a variable named l:

    >>> l = range(10)
    >>> ...

How would you produce the list of numbers from 10 to 20? To get help on the range() function, try:

    >>> help(range)

Now, how do you get the list of even numbers from 0 to 100?

    [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]

And how would you produce the following list?

    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
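One possible set of answers to the three questions above uses the optional start, stop and step arguments of range(). In this sketch the calls are wrapped in list() so that the script also works on Python 3, where range() no longer returns a list; on Python 2 the wrapping is harmless. The variable names are chosen here for illustration.

```python
# range(start, stop): start is included, stop is excluded, hence 21.
ten_to_twenty = list(range(10, 21))

# range(start, stop, step): a step of 2 keeps only the even numbers.
evens = list(range(0, 101, 2))

# A negative step counts down; the stop value -1 is excluded, so the
# last element produced is 0.
countdown = list(range(9, -1, -1))

print(ten_to_twenty)
print(evens)
print(countdown)
```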
    from string import *

    def gc_counter(seq):
        nbre_c = count(seq, 'c')
        nbre_g = count(seq, 'g')
        long_seq = len(seq)
        gc_percent = 100 * float(nbre_c + nbre_g) / long_seq
        return gc_percent

    gc = gc_counter("atgcggctgtgatgtagtcctttc")
    print gc

    seq = raw_input("enter a dna sequence: ")
    print gc_counter(seq)
This program contains several components. The first one:
    from string import *

will be explained later in this course. For now, let's just say that it enables our program to use functions that perform common operations on character strings: counting elements, searching, changing case, etc.
The second element provides the definition of a function:
    def gc_counter(seq):
        nbre_c = count(seq, 'c')
        nbre_g = count(seq, 'g')
        long_seq = len(seq)
        gc_percent = 100 * float(nbre_c + nbre_g) / long_seq
        return gc_percent

The function name is gc_counter. This is just a definition: in order to invoke this function, a call must be issued, as we will see later in the program.

This function performs the steps needed to compute a GC%. It counts the number of 'g' and 'c' characters, adds the two counts, divides the sum by the length of the sequence, and multiplies the result by 100 to obtain a percentage.
A special statement, return, specifies which value is handed back to the statement that invoked the function, i.e. the value given in the return statement, here gc_percent. So, in the following statement:

    gc = gc_counter("atgcggctgtgatgtagtcctttc")

the evaluation of the gc_counter() call produces the returned value, which is then stored in gc.
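The computation and the role of return can be checked by hand on the example sequence: it has 24 bases, of which 5 are 'c' and 7 are 'g', so the expected GC% is 100 * 12 / 24 = 50. The sketch below is a variant of the function that assumes the string method `seq.count(...)` instead of the `string` module function, so that it runs on both Python 2 and Python 3.

```python
def gc_counter(seq):
    """Return the GC percentage of a lowercase DNA string."""
    nbre_c = seq.count('c')
    nbre_g = seq.count('g')
    long_seq = len(seq)
    # float() avoids integer division on Python 2.
    gc_percent = 100 * float(nbre_c + nbre_g) / long_seq
    return gc_percent

# The value given in the return statement becomes the value of the call.
gc = gc_counter("atgcggctgtgatgtagtcctttc")
print(gc)
```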
The next elements are 4 statements:
    gc = gc_counter("atgcggctgtgatgtagtcctttc")
    print gc
    seq = raw_input("enter a dna sequence: ")
    print gc_counter(seq)

The 1st statement computes the GC% of a character string, "atgcggctgtgatgtagtcctttc", and stores the result in a variable named gc. For this purpose, it calls the gc_counter function and provides this function with the character string as an argument. The next statement prints the resulting GC%.
The 3rd statement prints a message on the screen, reads a string typed by the user and stores it in the variable seq. The last statement prints the result of calling the gc_counter function on the string just entered by the user.
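Text typed by a user may contain uppercase letters or stray spaces, and gc_counter only counts lowercase 'c' and 'g'. The sketch below shows one way to tidy the input before computing; it is an illustration under stated assumptions, not part of the original program. `clean_sequence` and `gc_or_none` are hypothetical helpers, the `if` statement has not been introduced yet in this course, and the code works on a string passed as an argument rather than calling raw_input() (named input() on Python 3).

```python
def gc_counter(seq):
    """Return the GC percentage of a lowercase DNA string."""
    return 100 * float(seq.count('c') + seq.count('g')) / len(seq)

def clean_sequence(raw):
    """Remove surrounding whitespace and convert to lowercase."""
    return raw.strip().lower()

def gc_or_none(raw):
    """Return the GC% of the cleaned input, or None if it is empty."""
    seq = clean_sequence(raw)
    if len(seq) == 0:
        # An empty sequence would cause a division by zero.
        return None
    return gc_counter(seq)

print(gc_or_none("  ATGC\n"))
print(gc_or_none(""))
```

In the original program, the string returned by raw_input() could be passed to such a helper in place of calling gc_counter directly.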
You can put this program into a file and try it. To do so, open your text editor and copy the Python code above into it. Save the file as my_gc.py. It is common and convenient to give a Python file a name ending with the ".py" extension.
Once saved, you can run your program by issuing the following command:
    python my_gc.py

from the command line prompt of your terminal.
This program, even though it is quite simple and short, is typical of programs used in scientific computing (except for graphical and interactive programs): you provide data as input, a computation is performed, and the program outputs the results.