For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016.¶

Dr. Pavel Polishchuk

The basic elements of Python 3¶

Python is an object-oriented language; however, it also supports procedural programming, which makes it perfect for fast development of simple scripts.
Python is open-source
Python is cross-platform
Python is powerful:
- dynamic typing of variables - no need to declare variable types
- automatic memory management (garbage collector)
- many built-in and third party libraries
Python has API to many languages
Python is easy to learn and easy to use

My Python experience¶

measured in lines of code:¶

!find ~/Python -type f -name '*.py' -exec cat {} \; | sed '/^\s*#/d;/^\s*$/d;/^\s*\/\//d' | wc -l

23243

measured in the number of developed open-source tools:¶

SiRMS - https://github.com/DrrDom/sirms
Simplex Representation of Molecular Structure
A tool for calculating fragment descriptors for single compounds, mixtures, "quasi"-mixtures and reactions, with atoms labeled by different user-defined properties (charge, lipophilicity, H-bonding, etc).
SPCI - https://github.com/DrrDom/spci
Structural and Physico-Chemical Interpretation of QSAR models
A tool with a GUI for automatic mining of chemical datasets which performs model building, validation and interpretation and provides chemically meaningful output.

More details are here: http://qsar4u.com/

Content¶

PEP 8 -- Style Guide for Python Code
Built-in data types
Data type conversion
List comprehensions
Generators
Variable assignment, shallow and deep copy of objects
Some built-in functions
Functions
File I/O
Create scripts
Classes
Multiprocessing

PEP 8 -- Style Guide for Python Code¶

https://www.python.org/dev/peps/pep-0008/

Indentation - four spaces.
Lines should contain up to 79 characters.
Blank lines to separate functions, classes, logical blocks in code, etc.
Import each module on a separate line.
Use whitespace around operators in expressions.
Use single or double quotes for strings consistently.
Leave useful comments: in-line, block or docstrings.
Name classes in CapitalizedWords, name functions in lowercase_with_underscores.
Avoid name conflicts.
etc.
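
A short sketch illustrating several of these conventions. All names here (MoleculeRecord, mean_activity) are made up for illustration and do not come from any library.

```python
"""Illustration of some PEP 8 conventions (all names are hypothetical)."""

import math          # each module is imported on a separate line
import statistics


class MoleculeRecord:
    """Class names use CapitalizedWords."""

    def __init__(self, name, activity):
        self.name = name
        self.activity = activity


def mean_activity(records):
    """Function names use lowercase_with_underscores."""
    values = [record.activity for record in records]   # four-space indentation
    return statistics.mean(values)


records = [MoleculeRecord('Mol_1', 8.6), MoleculeRecord('Mol_2', 7.2)]
print(math.isclose(mean_activity(records), 7.9))   # True
```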

Built-in data types¶

Immutable:

Numbers
Strings
Tuples

Mutable:

Lists
Dictionaries
Sets

Files
Classes

Numbers¶

integer
float
complex

n1 = 1
type(n1)

int

n2 = 1.0
type(n2)

float

n3 = 3 + 3j
type(n3)

complex

Python integers have unlimited precision

n4 = 2 ** 1000
n4

10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376

Python supports all common math operations on numbers

5 + 2  # addition

7

5 * 2  # multiplication

10

5 / 2  # division

2.5

5 // 2  # integer part of division

2

5 % 2  # remainder of division

1

5 ** 2  # exponentiation

25

Strings¶

are sequences of characters (surrounded by single or double quotes)

s1 = 'Olomouc'
s2 = "Olomouc"
s1 == s2

True

Using double quotes you may write a string that contains apostrophes

s3 = "Mom's son"
print(s3)

Mom's son

You may create multiline strings with triple quotes (they are often used as multiline comments and docstrings)

s4 = """This is
a very long
comment"""
s4

'This is\na very long\ncomment'

print(s4)

This is
a very long
comment

Strings may be concatenated

s1 + " is a nice city"

'Olomouc is a nice city'

or repeated

s1 * 3

'OlomoucOlomoucOlomouc'

There are a lot of methods which can be applied to strings

s1.find("uc")  # beware! indexing starts from 0

5

s1.replace("omou", "ympi")

'Olympic'

s5 = "  String with heading and trailing spaces   \n"
s5

'  String with heading and trailing spaces   \n'

s5.strip()   # remove leading and trailing whitespace

'String with heading and trailing spaces'

s4.split("\n")

['This is', 'a very long', 'comment']

Lists¶

are ordered collections of items

ls1 = ['abc', 3, s1]
ls1

['abc', 3, 'Olomouc']

ls2 = [3, 4, 5]

ls1 + ls2  # concatenation of lists returns new list

['abc', 3, 'Olomouc', 3, 4, 5]

ls1.extend(ls2)  # adding items from another list in place changes the original list
ls1

['abc', 3, 'Olomouc', 3, 4, 5]

ls1.append(10)  # append to the list
ls1

['abc', 3, 'Olomouc', 3, 4, 5, 10]

ls1.append(ls2)  # nested lists
ls1

['abc', 3, 'Olomouc', 3, 4, 5, 10, [3, 4, 5]]

[0, 1] * 5   # repeat of list items

[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

Tuples¶

are ordered collections of items like lists, but they are immutable. This is particularly useful when you exchange data between classes or modules and want to make sure that it will not be changed accidentally.

t1 = (2, 3, 4)
t1

(2, 3, 4)
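
A small sketch of what immutability means in practice. It also shows that immutability is shallow: a mutable item stored inside a tuple can still be changed. Variable names are illustrative only.

```python
t = (2, 3, 4)
try:
    t[0] = 10                  # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# the tuple itself cannot change, but a list stored inside it can
t_nested = (1, [2, 3])
t_nested[1].append(4)
print(t_nested)                # (1, [2, 3, 4])
```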

Ordered collections (lists and tuples) of string items can be joined into one string with a specified separator, which is very useful when you store data in text files.

"\t".join(['1', '2', '3'])

'1\t2\t3'

" ".join(('1', '2', '3'))

'1 2 3'

Slicing and indexing of strings, lists and tuples¶

s1

'Olomouc'

s1[0]  # access items by index

'O'

s1[1]

'l'

s1[-1]  # access last item

'c'

s1[len(s1) - 1]  # the same

'c'

s1[0:3]

'Olo'

s1[:3]  # the first and the last indices may be omitted

'Olo'

s1[4:]

'ouc'

s1[2:-1]

'omou'

s1[:]  # creates a copy of an object, which is particularly useful when working with lists

'Olomouc'

Dictionaries¶

consist of key-value pairs, like hash tables or associative arrays.
Keys can be of any immutable (hashable) type: number, string or tuple.
Values are items of any type without restrictions.

Dictionaries are very fast and efficient. Items are accessed only by key.

d = {1: 'Olomouc', 2: 'nice', 3: "city"}
print(d)

{1: 'Olomouc', 2: 'nice', 3: 'city'}

d[1]   # get item with key 1
       # (slices do not work as in lists; to get several values, look up each key separately)

'Olomouc'

# d[0]   # get item with key 0 which is absent and thus it leads to error

list(d.keys())

[1, 2, 3]

list(d.values())

['Olomouc', 'nice', 'city']

list(d.items())

[(1, 'Olomouc'), (2, 'nice'), (3, 'city')]

if 0 in d:     # check for key existence
    print("Success")
else:
    print("Failure")

Failure

d['list'] = [1, 2, 4]  # add new value; this overwrites existing data if the key already exists
d

{1: 'Olomouc', 2: 'nice', 3: 'city', 'list': [1, 2, 4]}

d[3] = 3  # replace with new value
d

{1: 'Olomouc', 2: 'nice', 3: 3, 'list': [1, 2, 4]}

d[3] = d[3] + 4   # update existing item
d

{1: 'Olomouc', 2: 'nice', 3: 7, 'list': [1, 2, 4]}

d[3] += 4   # the same
d

{1: 'Olomouc', 2: 'nice', 3: 11, 'list': [1, 2, 4]}

del d[3]   # remove item from dict
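
Accessing an absent key raises KeyError. A minimal sketch of the usual ways to handle that, using a separate illustrative dict (d_example is not part of the examples above):

```python
d_example = {1: 'Olomouc', 2: 'nice'}

try:
    d_example[0]                      # key 0 is absent
except KeyError as err:
    print("Missing key:", err)

print(d_example.get(0))               # None: get() returns None for a missing key
print(d_example.get(0, 'unknown'))    # 'unknown': an explicit default value
print(0 in d_example)                 # False: membership is tested on the dict itself
```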

Sets¶

are unordered collections of unique immutable (hashable) items

s1 = set([1, 2, 3])   # set can be created from iterable
s1

{1, 2, 3}

s2 = {4, 5, 1, 2}  # set can be created from separate items
s2

{1, 2, 4, 5}

s1 & s2   # intersection

{1, 2}

s1 | s2   # union

{1, 2, 3, 4, 5}

s1 - s2   # difference

{3}

s2 - s1   # difference is not symmetrical

{4, 5}
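
When a symmetric result is needed, the built-in symmetric difference can be used instead. A short sketch with illustrative variable names:

```python
x = {1, 2, 3}
y = {4, 5, 1, 2}

print(x ^ y)                          # items present in exactly one of the two sets
print(x.symmetric_difference(y))      # the same as a method call
print((x - y) | (y - x) == x ^ y)     # True
```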

Data type conversion¶

int('12')   # string to integer

12

int(12.2)   # float to integer

12

float('12')   # string to float

12.0

str(12)   # number to string

'12'

int('10001101', 2)   # convert string to integer with base 2

141

a = [1, 1, 2, 3, 4]
a

[1, 1, 2, 3, 4]

tuple(a)   # converts to tuple

(1, 1, 2, 3, 4)

set(a)   # converts to set and keeps only unique items

{1, 2, 3, 4}

list(set(a))   # converts to set and back to list - can be used to remove duplicates (the order of items is not guaranteed)

[1, 2, 3, 4]
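
Sets are unordered, so the round trip through a set may reorder the items. If the original order matters, one possible approach (assuming Python 3.7+, where dicts keep insertion order) is dict.fromkeys:

```python
items = ['b', 'a', 'b', 'c', 'a']

# keys of a dict are unique and keep the order of first appearance
unique_in_order = list(dict.fromkeys(items))
print(unique_in_order)   # ['b', 'a', 'c']
```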

List comprehensions¶

simplify generation of iterable objects (lists, dicts, sets, tuples)

Let's generate a list containing the number of characters of each word in a sentence

s = "Chemoinformatics is a bright star on in the scientific universe"

How can this be done? Solution 1.

output = []
for word in s.split(' '):
    output.append(len(word))
output

[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]

Solution 2 using list comprehensions.

output = [len(word) for word in s.split(' ')]
output

[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]

It is possible to create a tuple instead of a list

output = tuple(len(word) for word in s.split(' '))
output

(16, 2, 1, 6, 4, 2, 2, 3, 10, 8)

or even a dict with words as keys and their lengths as values

output = {word: len(word) for word in s.split(' ')}
output

{'Chemoinformatics': 16,
 'a': 1,
 'bright': 6,
 'in': 2,
 'is': 2,
 'on': 2,
 'scientific': 10,
 'star': 4,
 'the': 3,
 'universe': 8}

or set

output = {len(word) for word in s.split(' ')}
output

{1, 2, 3, 4, 6, 8, 10, 16}
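
Comprehensions may also filter items with an if clause. A small sketch on the same sentence (the threshold of 3 characters is arbitrary):

```python
sentence = "Chemoinformatics is a bright star on in the scientific universe"

# keep only words longer than three characters
long_words = [word for word in sentence.split(' ') if len(word) > 3]
print(long_words)
# ['Chemoinformatics', 'bright', 'star', 'scientific', 'universe']
```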

Generators¶

are simple functions which yield a sequence of items lazily, one at a time.

def gen_subseq(seq, length):
    for i in range(len(seq) - length + 1):   # + 1 to include the last subsequence
        yield seq[i:i+length]

s = 'AGTGGTCA'
gen_subseq(s, 3)

<generator object gen_subseq at 0x7f8026b258e0>

list(gen_subseq(s, 3))

['AGT', 'GTG', 'TGG', 'GGT', 'GTC', 'TCA']

for subseq in gen_subseq(s, 3):
    if subseq == "GGT":
        break
    else:
        print(subseq)

AGT
GTG
TGG

Recursive generators are very simple starting from Python 3.3 (yield from). Below is a generator of integers starting from the specified one.

def infinity(start):
    yield start
    yield from infinity(start + 1)

However, recursion has a maximum depth. If a program reaches it, a RecursionError is raised. You may increase the limit with sys.setrecursionlimit() or reimplement the procedure without recursion.
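
A possible non-recursive variant, as a sketch (the name infinity_loop is hypothetical). The standard library function itertools.count provides the same behaviour out of the box.

```python
from itertools import islice


def infinity_loop(start):
    """Yield start, start + 1, ... using a loop instead of recursion."""
    current = start
    while True:
        yield current
        current += 1


# islice takes a finite number of items from the endless generator
first_five = list(islice(infinity_loop(10), 5))
print(first_five)                                    # [10, 11, 12, 13, 14]

# skipping 5000 items is no problem because no recursion is involved
print(next(islice(infinity_loop(0), 5000, None)))    # 5000
```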

Variable assignment, shallow and deep copy of objects¶

Variables are assigned by reference, not by value: assignment binds a name to an object and does not copy it. This may lead to unexpected behaviour with mutable data types. Compare the following situations.

a = 4
b = a
a = 5
print(a)
print(b)

5
4

L = [1, 2, 3]
M = L
L[0] = 9
print(L)
print(M)

[9, 2, 3]
[9, 2, 3]

M is L   # check identity of referenced objects

True

N = L[:]
N is L

False

L[0] = 'p'
print(L)
print(N)

['p', 2, 3]
[9, 2, 3]

L = [1, [2, 3]]
M = L[:]
M is L

False

print(L)
print(M)

[1, [2, 3]]
[1, [2, 3]]

L[1][1] = 5
print(L)
print(M)

[1, [2, 5]]
[1, [2, 5]]

from copy import deepcopy
L = [1, [2, 3]]
M = deepcopy(L)
L[1][1] = 5
print(L)
print(M)

[1, [2, 5]]
[1, [2, 3]]

Some built-in functions¶

min(), max(), sum()

ls = [1, 2, 3, 4]

min(ls)

1

max(ls)

4

sum(ls)

10

zip(*iterables) - makes an iterator that aggregates elements from each of the iterables

s = 'ABCD'

zip(ls, s)

<zip at 0x7f8026b38148>

list(zip(ls, s))

[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]

d = dict(zip(ls, s))   # useful for creating dict from separate lists of keys and values
d

{1: 'A', 2: 'B', 3: 'C', 4: 'D'}

enumerate()

enumerate(s)

<enumerate at 0x7f8026ba1e58>

list(enumerate(s))

[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')]

They are particularly useful in for loops:

for i, (number, letter) in enumerate(zip(ls, s)):    # unpacking zipped values is not necessary
    print('Iteration %i: number %i is assigned to letter %s' % (i, number, letter))

Iteration 0: number 1 is assigned to letter A
Iteration 1: number 2 is assigned to letter B
Iteration 2: number 3 is assigned to letter C
Iteration 3: number 4 is assigned to letter D

for i, item in enumerate(zip(ls, s)):                # item is a tuple
    print('Iteration %i: number %i is assigned to letter %s' % (i, item[0], item[1]))

Iteration 0: number 1 is assigned to letter A
Iteration 1: number 2 is assigned to letter B
Iteration 2: number 3 is assigned to letter C
Iteration 3: number 4 is assigned to letter D

Functions¶

are blocks of organized and reusable code which perform a particular action. This helps to keep your code modular and flexible. There are a lot of built-in functions like print, len, sum, etc. Let's create our own function which will calculate the mean value of a list.

def mean(lst):
    if lst:    # check if list is not empty
        return sum(lst) / len(lst)
    else:
        return None

mean([1, 2, 3, 4])

2.5

The same using exception handling

def mean(lst):
    try:
        return sum(lst) / len(lst)
    except ZeroDivisionError:
        return None

print(mean([]))

None

def sd(lst):
    if not lst:
        return None        # if list is empty return None
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # note: this expression is the sample variance; its square root is the standard deviation
        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)

sd([1, 2, 3, 4])

1.6666666666666667

sd([5])

0

sd([])
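
Note that the expression used in sd above (sum of squared deviations divided by n - 1) is the sample variance; the standard deviation is its square root. A minimal sketch of the square-root version, checked against the standard library (the name sample_sd is hypothetical):

```python
import math
import statistics


def sample_sd(lst):
    """Sample standard deviation; None for an empty list, 0 for a single item."""
    if not lst:
        return None
    if len(lst) == 1:
        return 0
    m = sum(lst) / len(lst)
    variance = sum((item - m) ** 2 for item in lst) / (len(lst) - 1)
    return math.sqrt(variance)


values = [1, 2, 3, 4]
print(sample_sd(values))
print(math.isclose(sample_sd(values), statistics.stdev(values)))   # True
```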

Arguments are passed to functions by reference to the object, so a function that modifies a mutable argument also changes the caller's object, which may cause errors.

def func(lst):
    lst.append(mean(lst))   # add mean value to the list
    return sum(lst)         # calc sum and return the value

ls = [1, 2, 3, 4]
s = func(ls)
print(s)
print(ls)                   # list was changed!

12.5
[1, 2, 3, 4, 2.5]

To avoid such behaviour one needs to copy or deepcopy the modified object inside the function before using it.
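
A sketch of such a copy inside the function (func_safe is a hypothetical name). A shallow copy is enough here because the list holds only numbers; nested data would need copy.deepcopy.

```python
def func_safe(lst):
    """Same calculation as func, but performed on a copy of the argument."""
    tmp = list(lst)                     # shallow copy of the incoming list
    tmp.append(sum(tmp) / len(tmp))     # add the mean value to the copy
    return sum(tmp)


ls = [1, 2, 3, 4]
print(func_safe(ls))   # 12.5
print(ls)              # [1, 2, 3, 4] - the original list is unchanged
```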

Arguments may be passed to a function by position and by name. However, a positional argument cannot follow a named argument.

def div(x, y):
    return x / y

print(div(2, 5))
print(div(5, 2))
print(div(y=5, x=2))

0.4
2.5
0.4

You may set default values of function arguments

def div(x, y=10):
    return x / y

print(div(2))
print(div(2, 5))
print(div(y=2, x=5))

0.2
0.4
2.5

You may pass an arbitrary number of positional (not named) and named arguments. Extra positional arguments are collected by a parameter prefixed with *, extra named arguments by a parameter prefixed with **. Positional arguments are passed as a tuple, named arguments as a dict.

def func(arg1, *args, **kargs):
    print("arg1 = ", arg1)
    print("not named args = ", args)
    print("named args = ", kargs)

func(1, 2, 3)

arg1 =  1
not named args =  (2, 3)
named args =  {}

func(1, arg2 = 2, arg3 = 3)

arg1 =  1
not named args =  ()
named args =  {'arg2': 2, 'arg3': 3}

func(1, 3, key1=10, key2=2)

arg1 =  1
not named args =  (3,)
named args =  {'key1': 10, 'key2': 2}
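
The same * and ** syntax works in the opposite direction at the call site, unpacking a sequence into positional arguments and a dict into named arguments. A small sketch (describe is an illustrative function, not part of the examples above):

```python
def describe(arg1, *args, **kwargs):
    """Return the arguments as they were received."""
    return arg1, args, kwargs


positional = [1, 2, 3]
named = {'key1': 10, 'key2': 2}

# equivalent to describe(1, 2, 3, key1=10, key2=2)
print(describe(*positional, **named))
# (1, (2, 3), {'key1': 10, 'key2': 2})
```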

File I/O¶

Let us read a file which has a header and in which each line contains a compound name and an activity value separated by a tab (\t), calculate the average and standard deviation of activity values for each compound, and store the results in another text file.

Compound_name pIC50
Mol_1 8.6
Mol_1 8.7
Mol_2 7.2
Mol_3 6.5
Mol_3 6.5
Mol_1 9
Mol_4 7.5
Mol_5 6.9
Mol_6 8.1
Mol_7 9.2
Mol_2 4.1

There are several file modes: r - read, w - write, a - append, t - text, b - binary.

f = open("data/activity.txt", 'rt')   # open file for reading in text mode

A file object has several attributes:

print("Name of the file: ", f.name)
print("File closed?: ", f.closed)
print("File mode : ", f.mode)

Name of the file:  data/activity.txt
File closed?:  False
File mode :  rt

Iterate over lines and save them in dict

d = {}   # create dict where we will store reading results as a list of values for each compound
         # because some compounds can have several values
f.readline()                             # read the first line from file (header) to skip it
for line in f:
    if line.strip():                     # check if line is not empty (skip empty lines)
        tmp = line.strip().split('\t')   # remove whitespaces and split line on tabs (\t) 
                                         # this will avoid errors if compound names contain spaces
        if tmp[0] not in d.keys():
            d[tmp[0]] = [float(tmp[1])]
        else:
            d[tmp[0]].append(float(tmp[1]))
f.close()                                # close file descriptor
                                         # otherwise it may be inaccessible by other applications

d

{'Mol_1': [8.6, 8.7, 9.0],
 'Mol_2': [7.2, 4.1],
 'Mol_3': [6.5, 6.5],
 'Mol_4': [7.5],
 'Mol_5': [6.9],
 'Mol_6': [8.1],
 'Mol_7': [9.2]}

The full text of the commands above with handling of possible exceptions:

f = open("data/activity.txt", 'rt')
d = {}
try:
    f.readline()
    for line in f:
        if line.strip():
            tmp = line.strip().split('\t')
            if tmp[0] not in d.keys():
                d[tmp[0]] = [float(tmp[1])]
            else:
                d[tmp[0]].append(float(tmp[1]))
finally:
    f.close()   # if you use the f = open(...) statement you need a try-finally block to be sure
                # that in the case of an exception your file will be closed and the file descriptor released
                # otherwise the file can be blocked for access by other applications
                # (this is true if you open the file for editing)

Alternative solution with several improvements:

from collections import defaultdict   # import classes, functions, etc from a module

d = defaultdict(list)   # create dict whose default value is an empty list
                        # accessing a non-existent key creates it with an empty list
    
with open("data/activity.txt", 'rt') as f:           # files opened using with statement 
                                                     # will be closed automatically
    f.readline()                                     # skip header
    for line in f:
        if line.strip():                             # skip empty lines
            name, value = line.strip().split('\t')   # since we know that only two elements are in each line
                                                     # we may use such unpacking
            d[name].append(float(value))

d

defaultdict(list,
            {'Mol_1': [8.6, 8.7, 9.0],
             'Mol_2': [7.2, 4.1],
             'Mol_3': [6.5, 6.5],
             'Mol_4': [7.5],
             'Mol_5': [6.9],
             'Mol_6': [8.1],
             'Mol_7': [9.2]})

Now let's calculate average and standard deviation of our values

output = {}
for k, v in d.items():    # iterate over pairs of keys and values
    avg = mean(v)
    std = sd(v)
    output[k] = (avg, std)

output

{'Mol_1': (8.766666666666666, 0.04333333333333344),
 'Mol_2': (5.65, 4.8050000000000015),
 'Mol_3': (6.5, 0.0),
 'Mol_4': (7.5, 0),
 'Mol_5': (6.9, 0),
 'Mol_6': (8.1, 0),
 'Mol_7': (9.2, 0)}

One-liner solution using a dict comprehension

output = {k: (mean(v), sd(v)) for k, v in d.items()}

output

{'Mol_1': (8.766666666666666, 0.04333333333333344),
 'Mol_2': (5.65, 4.8050000000000015),
 'Mol_3': (6.5, 0.0),
 'Mol_4': (7.5, 0),
 'Mol_5': (6.9, 0),
 'Mol_6': (8.1, 0),
 'Mol_7': (9.2, 0)}

Save results to a text file

with open("data/activity_stat.txt", "wt") as f:             # if the file exists it will be
                                                            # silently overwritten
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")   # map applies the specified function 
                                                            # over all items of the given iterable

The whole text of the script:

from collections import defaultdict   # import classes, functions, etc from a module

d = defaultdict(list)
    
with open("data/activity.txt", 'rt') as f:
    f.readline()                                     
    for line in f:
        if line.strip():                             
            name, value = line.strip().split('\t')   
            d[name].append(float(value))
            
output = {k: (mean(v), sd(v)) for k, v in d.items()}

with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")

Create scripts¶

Let's create a script which will take a text file with compounds and their activities as input and return a text file with the average and standard deviation of activity values for each compound, as shown in the example above.

We already have the backbone of our script, which performs I/O and all calculations. However, to use it we would need to edit the file names each time.

from collections import defaultdict  


def mean(lst):
    if lst:   
        return sum(lst) / len(lst)
    else:
        return None
    
    
def sd(lst):
    if not lst:
        return None       
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # note: this expression is the sample variance; its square root is the standard deviation
        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)
    
    
d = defaultdict(list)
    
with open("data/activity.txt", 'rt') as f:
    f.readline()                                     
    for line in f:
        if line.strip():                             
            name, value = line.strip().split('\t')   
            d[name].append(float(value))
            
output = {k: (mean(v), sd(v)) for k, v in d.items()}

with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")

To run a script from the command line we need to pass input and output file names to it and parse these command line arguments. Below is a backbone for such a script:

Add functions from above:
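
A minimal sketch of what such a script might look like. It assumes the standard argparse module and the option names -i/--input and -o/--output that appear in the commands below; printing to stdout when -o is omitted and the helper name read_activity are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Sketch of calc.py: mean and sd of activity values for each compound."""

import argparse
import sys
from collections import defaultdict


def mean(lst):
    return sum(lst) / len(lst) if lst else None


def sd(lst):
    # same formula as in the sd function above
    if not lst:
        return None
    if len(lst) == 1:
        return 0
    m = mean(lst)
    return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)


def read_activity(lines):
    """Collect values per compound from tab-separated lines; the first line is a header."""
    d = defaultdict(list)
    for i, line in enumerate(lines):
        if i == 0 or not line.strip():       # skip the header and empty lines
            continue
        name, value = line.strip().split('\t')
        d[name].append(float(value))
    return d


def main(argv=None):
    parser = argparse.ArgumentParser(description='Calculate mean and sd of activity values.')
    parser.add_argument('-i', '--input', required=True, help='input text file')
    parser.add_argument('-o', '--output', default=None,
                        help='output text file (default: print to stdout)')
    argv = sys.argv[1:] if argv is None else argv
    if not argv:                             # no arguments at all: show help and stop
        parser.print_help()
        return
    args = parser.parse_args(argv)

    with open(args.input, 'rt') as f:
        d = read_activity(f)

    out = open(args.output, 'wt') if args.output else sys.stdout
    try:
        for k, v in d.items():
            stats = (mean(v), sd(v))
            out.write(k + '\t' + '\t'.join(map(str, stats)) + '\n')
    finally:
        if out is not sys.stdout:            # never close stdout
            out.close()


if __name__ == '__main__':
    main()
```

Keeping the parsing in main(argv) rather than at module level makes the script importable and lets it be called from other code with an explicit list of arguments.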

Save the script, change permission to executable and run it from command line with different arguments

calc.py -h

calc.py -i input_file_name.txt

calc.py --input input_file_name.txt > output_file_name.txt

calc.py -i input_file_name.txt -o output_file_name.txt

Classes¶

provide better modularity and flexibility of code.

Class: A user-defined prototype for an object that defines a set of attributes that characterize any object of the class. The attributes are data members (class variables and instance variables) and methods, accessed via dot notation.
Instance: An individual object of a certain class.
Class variable: A variable that is shared by all instances of a class. Class variables are defined within a class but outside any of the class's methods. Class variables are not used as frequently as instance variables are.
Instance variable: A variable that is defined inside a method and belongs only to the current instance of a class.
Method: A special kind of function that is defined in a class definition.
Method overloading: The assignment of more than one behavior to a particular method. The operation performed varies by the types of objects or arguments involved.
Inheritance: The transfer of the characteristics of a class to other classes that are derived from it.
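
A compact sketch illustrating several of these terms at once. The classes Compound and Drug are hypothetical and used only for illustration.

```python
class Compound:
    count = 0                        # class variable, shared by all instances

    def __init__(self, name):
        self.name = name             # instance variable, belongs to one instance
        Compound.count += 1

    def describe(self):              # method, accessed via dot notation
        return "Compound " + self.name


class Drug(Compound):                # inheritance: Drug is derived from Compound
    def describe(self):              # the derived class overrides the parent method
        return "Drug " + self.name


compound = Compound('Mol_1')         # instances of the two classes
drug = Drug('Mol_2')

print(Compound.count)                # 2 - both instances updated the shared counter
print(compound.describe())           # Compound Mol_1
print(drug.describe())               # Drug Mol_2
```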

Create class

class A:
    def __init__(self, name, value=0):
        self.name = name                # public instance attribute
        self.__value = value            # private instance attribute
    def get_attr(self):
        return "parent class value: " + str(self.__value)
    def set_attr(self, value):
        self.__value = value
    
class B(A):
    def __init__(self, name, value):
        super(B, self).__init__(name)   # init parent class if necessary
        self.__value = value            # it will not override parent value
    def get_attr(self):                 # override parent method
        return "child class value: " + str(self.__value)

a = A('Main class', 24)
b = B('Derived class', 42)

print(a.name)
print(a.get_attr())

Main class
parent class value: 24

print(b.name)
print(b.get_attr())
print(super(B, b).get_attr())

Derived class
child class value: 42
parent class value: 0

b.set_attr(55)
print(b.get_attr())

child class value: 42

b.__dict__    # look at the namespace

{'_A__value': 55, '_B__value': 42, 'name': 'Derived class'}

Multiprocessing¶

Create a script which will run a long calculation function in a single thread as usual.

$ time ./single_process.py -i rand_int.txt -o output_single.txt

real 0m22.080s
user 0m22.060s
sys 0m0.016s

Implement the same script using multiprocessing module.
Objects which are passed to a process should be picklable (manual serialization may be required).
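
A minimal sketch of the general pattern with multiprocessing.Pool. This is not the code of the scripts timed here: long_calc is a hypothetical stand-in for the long calculation and the input numbers are arbitrary.

```python
from multiprocessing import Pool


def long_calc(n):
    """Stand-in for a long calculation; defined at module level so it can be pickled."""
    return n, sum(i * i for i in range(n))


def run(numbers, ncpu=2):
    with Pool(ncpu) as pool:
        # imap_unordered yields each result as soon as it is ready,
        # so the order of results may differ from the order of inputs
        return list(pool.imap_unordered(long_calc, numbers))


if __name__ == '__main__':           # required so that worker processes do not rerun this block
    results = run([10000, 20000, 30000, 40000])
    print(sorted(results))
```

Sorting the results, or using pool.map or pool.imap instead, restores the input order at the cost of waiting for slower items.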

$ time ./multi_process_2.py -i rand_int.txt -o multi_output_2.txt -c 2

real 0m13.403s
user 0m26.552s
sys 0m0.028s

The obtained files are identical after sorting, because the implemented multiprocessing does not guarantee the same order of output results.

diff <(sort output_single.txt) <(sort multi_output_2.txt)

Explanation of differences between imap/imap_unordered and map/map_async
http://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap

An alternative implementation which uses Queue to pass data to processes.

$ time ./multi_process.py -i rand_int.txt -o multi_output.txt -c 2

real 0m14.779s
user 0m28.624s
sys 0m0.160s

The time gain does not depend linearly on the number of cores, for several reasons:

- some overhead is always present, so specifying more cores does not necessarily increase the speed
- synchronization between processes reduces efficiency
- reading and writing files may be a bottleneck

Literature and other knowledge sources¶

Learning Python, 5th Edition, Mark Lutz
StackOverflow: http://stackoverflow.com/questions/tagged/python
online courses