Dr. Pavel Polishchuk
!find ~/Python -type f -name '*.py' -exec cat {} \; | sed '/^\s*#/d;/^\s*$/d;/^\s*\/\//d' | wc -l
More details are here: http://qsar4u.com/
https://www.python.org/dev/peps/pep-0008/
Immutable:
Mutable:
n1 = 1
type(n1)
n2 = 1.0
type(n2)
n3 = 3 + 3j
type(n3)
Python integers have unlimited precision
n4 = 2 ** 1000
n4
Python supports all the usual arithmetic operations on numbers
5 + 2 # addition
5 * 2 # multiplication
5 / 2 # division
5 // 2 # floor (integer) division
5 % 2 # remainder of division
5 ** 2 # exponentiation
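The difference between floor division and truncation only shows up with negative numbers. The short sketch below is an illustration with arbitrarily chosen values, not part of the original examples.

```python
# Floor division rounds towards negative infinity, not towards zero
floor_div = -7 // 2       # -4
remainder = -7 % 2        # 1, because (-4) * 2 + 1 == -7
truncated = int(-7 / 2)   # -3, int() drops the fractional part of -3.5

# divmod returns the quotient and the remainder in one call
q, r = divmod(-7, 2)
print(floor_div, remainder, truncated, (q, r))
```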
Strings are sequences of characters (surrounded by single or double quotes)
s1 = 'Olomouc'
s2 = "Olomouc"
s1 == s2
Using double quotes you may write a string which contains apostrophes
s3 = "Mom's son"
print(s3)
You may create multiline strings with triple quotes (they are often used as multiline comments)
s4 = """This is
a very long
comment"""
s4
print(s4)
Strings may be concatenated
s1 + " is a nice city"
or repeated
s1 * 3
There are a lot of methods which can be applied to strings
s1.find("uc") # beware! indexing starts from 0
s1.replace("omou", "ympi")
s5 = " String with leading and trailing spaces \n"
s5
s5.strip() # remove leading and trailing whitespace
s4.split("\n")
Lists are ordered collections of items
ls1 = ['abc', 3, s1]
ls1
ls2 = [3, 4, 5]
ls1 + ls2 # concatenation of lists returns a new list
ls1.extend(ls2) # extend the list with items from another list; this changes the original list
ls1
ls1.append(10) # append to the list
ls1
ls1.append(ls2) # nested lists
ls1
[0, 1] * 5 # repeat of list items
Tuples are ordered collections of items, like lists, but they are immutable. This is particularly useful when you exchange data between classes or modules and want to make sure that it will not be changed accidentally.
t1 = (2, 3, 4)
t1
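A minimal sketch of what immutability means in practice; the variable names are chosen only for this illustration.

```python
t = (2, 3, 4)
try:
    t[0] = 10          # tuples do not support item assignment
    changed = True
except TypeError as err:
    changed = False
    print("TypeError:", err)

# A modified version has to be built as a new tuple
t_new = (10,) + t[1:]
print(t_new)
```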
Ordered collections (lists and tuples) of string items can be joined into one string with a specified separator, which is very useful when you store data in text files.
"\t".join(['1', '2', '3'])
" ".join(('1', '2', '3'))
s1
s1[0] # access items by index
s1[1]
s1[-1] # access last item
s1[len(s1) - 1] # the same
s1[0:3]
s1[:3] # the first and the last indices may be omitted
s1[4:]
s1[2:-1]
s1[:] # creates a (shallow) copy of a sequence, which is particularly useful when working with lists
Dictionaries consist of key-value pairs, like hash tables or associative arrays.
Keys can be of any immutable type: number, string or tuple.
Values are items of any type without restrictions.
Dictionaries are very fast and efficient. They can be accessed only by keys.
d = {1: 'Olomouc', 2: 'nice', 3: "city"}
print(d)
d[1] # get item with key 1
# (you cannot use slices like in lists, you need to iterate all keys to return corresponding values)
# d[0] # get item with key 0 which is absent and thus it leads to error
list(d.keys())
list(d.values())
list(d.items())
if 0 in d: # check for key existence
    print("Success")
else:
    print("Failure")
d['list'] = [1, 2, 4] # add a new value; this overwrites any value already stored under this key
d
d[3] = 3 # replace with new value
d
d[3] = d[3] + 4 # update existing item
d
d[3] += 4 # the same
d
del d[3] # remove item from dict
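Accessing an absent key raises KeyError, as noted above for d[0]. A small sketch of the usual ways to avoid that, using the dict methods get and pop with default values (the dictionary here is a shortened copy made for the illustration):

```python
d = {1: 'Olomouc', 2: 'nice'}

try:
    d[0]                 # key 0 is absent
    raised = False
except KeyError:
    raised = True

missing = d.get(0)                # returns None instead of raising
fallback = d.get(0, 'unknown')    # returns the supplied default
removed = d.pop(2, None)          # removes key 2 and returns its value
absent = d.pop(2, None)           # key is already gone, default is returned
print(raised, missing, fallback, removed, absent)
```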
Sets are unordered collections of unique immutable items
s1 = set([1, 2, 3]) # set can be created from iterable
s1
s2 = {4, 5, 1, 2} # set can be created from separate items
s2
s1 & s2 # intersection
s1 | s2 # union
s1 - s2 # difference
s2 - s1 # difference is not symmetrical
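When the items present in only one of the two sets are needed regardless of direction, the symmetric difference operator ^ can be used. A short sketch with the same two sets:

```python
s1 = {1, 2, 3}
s2 = {4, 5, 1, 2}

sym = s1 ^ s2                      # items that are in exactly one of the sets
same = (s1 - s2) | (s2 - s1)       # the same result built from two differences
print(sym)
```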
int('12') # string to integer
int(12.2) # float to integer
float('12') # string to float
str(12) # number to string
int('10001101', 2) # convert string to integer with base 2
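Two details of these conversions are easy to miss: int() truncates floats towards zero rather than rounding, and it does not accept a string that contains a decimal point. A brief sketch with illustrative values:

```python
truncated_pos = int(12.9)     # 12, the fractional part is dropped
truncated_neg = int(-12.9)    # -12, truncation is towards zero
rounded = round(12.9)         # 13, use round() when rounding is intended

try:
    int('12.2')               # a string with a decimal point is rejected
    parsed = True
except ValueError:
    parsed = False

via_float = int(float('12.2'))  # convert to float first, then to int
print(truncated_pos, truncated_neg, rounded, parsed, via_float)
```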
a = [1, 1, 2, 3, 4]
a
tuple(a) # converts to tuple
set(a) # converts to a set and keeps only unique items
list(set(a)) # converts to a set and back to a list - can be used to remove duplicates from the list
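Because sets are unordered, list(set(a)) does not promise to keep the original order of items. One possible order-preserving alternative, assuming Python 3.7 or later where dicts keep insertion order, is dict.fromkeys:

```python
a = [3, 1, 3, 2, 1]

# dict keys are unique and keep the order of first appearance
unique_in_order = list(dict.fromkeys(a))
print(unique_in_order)
```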
Comprehensions simplify generation of iterable objects (lists, dicts, sets, tuples)
Let's generate a list containing the number of characters in each word of a sentence
s = "Chemoinformatics is a bright star on in the scientific universe"
How can this be done? Solution 1.
output = []
for word in s.split(' '):
    output.append(len(word))
output
Solution 2 using list comprehensions.
output = [len(word) for word in s.split(' ')]
output
It is possible to create a tuple instead of a list
output = tuple(len(word) for word in s.split(' '))
output
or even a dict with words as keys and their lengths as values
output = {word: len(word) for word in s.split(' ')}
output
or set
output = {len(word) for word in s.split(' ')}
output
Generators are simple functions which return the items of a sequence one at a time.
def gen_subseq(seq, length):
    for i in range(len(seq) - length + 1): # + 1 so that the last subsequence is included
        yield seq[i:i+length]
s = 'AGTGGTCA'
gen_subseq(s, 3)
list(gen_subseq(s, 3))
for subseq in gen_subseq(s, 3):
    if subseq == "GGT":
        break
    else:
        print(subseq)
Recursive generators are very simple to write starting from Python 3.3. Below is a generator of integers starting from the specified one.
def infinity(start):
    yield start
    yield from infinity(start + 1)
However, recursion has a maximum depth. If a program reaches it, an error (RecursionError) is raised. You may increase the recursion limit with sys.setrecursionlimit() or reimplement the procedure without recursion.
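A minimal sketch of the non-recursive reimplementation mentioned above. The name infinity_iter is chosen for this illustration; the standard library function itertools.count provides the same behaviour.

```python
from itertools import islice

def infinity_iter(start):
    # A plain loop instead of recursion: no recursion depth is consumed
    n = start
    while True:
        yield n
        n += 1

first_five = list(islice(infinity_iter(3), 5))

# Skip the first 5000 items and take the next one; this goes well past
# the default recursion limit without raising an error
far = next(islice(infinity_iter(0), 5000, None))
print(first_five, far)
```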
Variables are assigned by reference, not by value. This may lead to unexpected results with mutable data types. Compare the following cases.
a = 4
b = a
a = 5
print(a)
print(b)
L = [1, 2, 3]
M = L
L[0] = 9
print(L)
print(M)
M is L # check identity of referenced objects
N = L[:]
N is L
L[0] = 'p'
print(L)
print(N)
L = [1, [2, 3]]
M = L[:]
M is L
print(L)
print(M)
L[1][1] = 5
print(L)
print(M)
from copy import deepcopy
L = [1, [2, 3]]
M = deepcopy(L)
L[1][1] = 5
print(L)
print(M)
min(), max(), sum()
ls = [1, 2, 3, 4]
min(ls)
max(ls)
sum(ls)
zip(*iterables)
- makes an iterator that aggregates elements from each of the iterables
s = 'ABCD'
zip(ls, s)
list(zip(ls, s))
d = dict(zip(ls, s)) # useful for creating dict from separate lists of keys and values
d
enumerate()
enumerate(s)
list(enumerate(s))
They are particularly useful in loops:
for i, (number, letter) in enumerate(zip(ls, s)): # unpacking zipped values is not necessary
    print('Iteration %i: number %i is assigned to letter %s' % (i, number, letter))
for i, item in enumerate(zip(ls, s)): # item is a tuple
    print('Iteration %i: number %i is assigned to letter %s' % (i, item[0], item[1]))
A function is a block of organized and reusable code which performs a particular action. Functions help to keep your code modular and flexible. There are a lot of built-in functions like print, len, sum, etc. Let's create our own function which will calculate the mean value of a list.
def mean(lst):
    if lst: # check if list is not empty
        return sum(lst) / len(lst)
    else:
        return None
mean([1, 2, 3, 4])
The same using error handling
def mean(lst):
    try:
        return sum(lst) / len(lst)
    except ZeroDivisionError:
        return None
print(mean([]))
def sd(lst):
    if not lst:
        return None # if list is empty return None
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # sample standard deviation is the square root of the sample variance
        return (sum((item - m) ** 2 for item in lst) / (len(lst) - 1)) ** 0.5
sd([1, 2, 3, 4])
sd([5])
sd([])
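As a cross-check, the sample standard deviation (the square root of the sum of squared deviations divided by n - 1) can be compared with statistics.stdev from the standard library. This is only a sketch; the variable names are chosen for the illustration.

```python
import statistics
from math import isclose

values = [1, 2, 3, 4]
m = sum(values) / len(values)

# sample variance uses n - 1 in the denominator
variance = sum((x - m) ** 2 for x in values) / (len(values) - 1)
manual_sd = variance ** 0.5

# statistics.stdev also computes the sample standard deviation
library_sd = statistics.stdev(values)
print(manual_sd, library_sd)
```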
Arguments in Python are passed to functions by reference, so changes which a function makes to a mutable argument are visible to the caller. This may cause errors.
def func(lst):
    lst.append(mean(lst)) # add mean value to the list
    return sum(lst) # calc sum and return the value
ls = [1, 2, 3, 4]
s = func(ls)
print(s)
print(ls) # list was changed!
To avoid such behaviour one needs to copy or deepcopy the modified object inside the function before using it.
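A minimal sketch of that advice. The name func_safe is hypothetical; a shallow copy is enough here because the list holds only numbers, while nested structures would need deepcopy.

```python
def func_safe(lst):
    lst = list(lst)                   # work on a shallow copy of the argument
    lst.append(sum(lst) / len(lst))   # add the mean value to the copy
    return sum(lst)                   # calc sum and return the value

ls = [1, 2, 3, 4]
s = func_safe(ls)
print(s)
print(ls)  # the original list is unchanged
```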
Arguments may be passed to a function by position or by name. However, a positional argument cannot follow a named argument.
def div(x, y):
    return x / y
print(div(2, 5))
print(div(5, 2))
print(div(y=5, x=2))
You may set default values for function arguments
def div(x, y=10):
    return x / y
print(div(2))
print(div(2, 5))
print(div(y=2, x=5))
You may pass an arbitrary number of named and unnamed (positional) arguments. Extra unnamed arguments are collected by a parameter prefixed with *, extra named arguments by a parameter prefixed with **. Unnamed arguments are passed as a tuple, named arguments are passed as a dict.
def func(arg1, *args, **kargs):
    print("arg1 = ", arg1)
    print("not named args = ", args)
    print("named args = ", kargs)
func(1, 2, 3)
func(1, arg2 = 2, arg3 = 3)
func(1, 3, key1=10, key2=2)
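The same * and ** prefixes also work in the opposite direction, at the call site, where they unpack a tuple or a dict into separate arguments. A short sketch reusing the div function with a default value:

```python
def div(x, y=10):
    return x / y

args = (5, 2)
kwargs = {"y": 2, "x": 5}

r1 = div(*args)      # same as div(5, 2)
r2 = div(**kwargs)   # same as div(y=2, x=5)
r3 = div(*(5,))      # y falls back to its default value 10
print(r1, r2, r3)
```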
Let us read a file which has a header and in which each line contains a compound name and an activity value separated by a tab (\t), calculate the average and standard deviation of the activity values for each compound, and store the results in another text file.
Compound_name pIC50
Mol_1 8.6
Mol_1 8.7
Mol_2 7.2
Mol_3 6.5
Mol_3 6.5
Mol_1 9
Mol_4 7.5
Mol_5 6.9
Mol_6 8.1
Mol_7 9.2
Mol_2 4.1
There are several file modes: r - read, w - write, a - append, t - text, b - binary.
f = open("data/activity.txt", 'rt') # open file for reading in text mode
The file object has several attributes:
print("Name of the file: ", f.name)
print("File closed?: ", f.closed)
print("File mode : ", f.mode)
Iterate over lines and save them in dict
d = {} # create dict where we will store reading results as a list of values for each compound
       # because some compounds can have several values
f.readline() # read the first line from file (header) to skip it
for line in f:
    if line.strip(): # check if line is not empty (skip empty lines)
        tmp = line.strip().split('\t') # remove whitespace and split line on tabs (\t)
                                       # this will avoid errors if compound names contain spaces
        if tmp[0] not in d:
            d[tmp[0]] = [float(tmp[1])]
        else:
            d[tmp[0]].append(float(tmp[1]))
f.close() # close file descriptor
          # otherwise it may be inaccessible by other applications
d
Full text of the above commands with handling of possible exceptions:
f = open("data/activity.txt", 'rt')
d = {}
try:
    f.readline()
    for line in f:
        if line.strip():
            tmp = line.strip().split('\t')
            if tmp[0] not in d:
                d[tmp[0]] = [float(tmp[1])]
            else:
                d[tmp[0]].append(float(tmp[1]))
finally:
    f.close() # if you use the f = open(...) statement you need a try-finally block to be sure
              # that in case of an exception your file will be closed and the file descriptor released,
              # otherwise the file can be blocked from access by other applications
              # (this is true if you open the file for editing)
Alternative solution with several improvements:
from collections import defaultdict # import classes, functions, etc from a module
d = defaultdict(list) # create dict whose default value is an empty list:
                      # accessing a non-existent key returns an empty list
with open("data/activity.txt", 'rt') as f: # files opened using the with statement
                                            # will be closed automatically
    f.readline() # skip header
    for line in f:
        if line.strip(): # skip empty lines
            name, value = line.strip().split('\t') # since we know that there are only two elements in each line
                                                   # we may use such unpacking
            d[name].append(float(value))
d
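The same parsing loop can be tried without a file on disk. The sketch below assumes io.StringIO as a stand-in for the opened file and uses a few lines in the format of the example data. It also shows a side effect of defaultdict: merely reading a missing key creates it.

```python
import io
from collections import defaultdict

# in-memory text in the same tab-separated format, including an empty line
text = "Compound_name\tpIC50\nMol_1\t8.6\nMol_1\t8.7\n\nMol_2\t7.2\n"

d = defaultdict(list)
with io.StringIO(text) as f:
    f.readline()                      # skip header
    for line in f:
        if line.strip():              # skip empty lines
            name, value = line.strip().split('\t')
            d[name].append(float(value))

parsed = dict(d)                      # plain dict snapshot of the parsed data

# Reading a missing key returns an empty list AND stores it in the dict
empty = d['Mol_9']
created = 'Mol_9' in d
print(parsed, empty, created)
```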
Now let's calculate the average and standard deviation of our values
output = {}
for k, v in d.items(): # iterate over pairs of keys and values
    avg = mean(v)
    std = sd(v)
    output[k] = (avg, std)
output
One-liner solution using a dict comprehension
output = {k: (mean(v), sd(v)) for k, v in d.items()}
output
Save results to a text file
with open("data/activity_stat.txt", "wt") as f: # if file exists it will be
                                                 # silently overwritten
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n") # map applies the specified function
                                                          # to all items of the given iterable
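The map(str, v) call is needed because str.join accepts only strings. A short sketch with made-up numbers (not results computed from the example file):

```python
values = (8.65, 0.07)   # illustrative pair of (mean, sd), not real results

try:
    "\t".join(values)   # joining floats directly fails
    joined_directly = True
except TypeError:
    joined_directly = False

# converting every item to a string first makes the join work
line = "Mol_1" + "\t" + "\t".join(map(str, values)) + "\n"
print(repr(line))
```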
The whole text of the script:
from collections import defaultdict # import classes, functions, etc from a module
d = defaultdict(list)
with open("data/activity.txt", 'rt') as f:
    f.readline()
    for line in f:
        if line.strip():
            name, value = line.strip().split('\t')
            d[name].append(float(value))
output = {k: (mean(v), sd(v)) for k, v in d.items()}
with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")
Let's create a script which will take a text file with compounds and their activities as input and return a text file with the average and standard deviation of activity values for each compound, as shown in the example above.
We already have backbone of our script which makes I/O and all calculations. However to use it we will need to edit file names each time.
from collections import defaultdict
def mean(lst):
    if lst:
        return sum(lst) / len(lst)
    else:
        return None
def sd(lst):
    if not lst:
        return None
    if len(lst) == 1:
        return 0
    else:
        m = mean(lst)
        # sample standard deviation is the square root of the sample variance
        return (sum((item - m) ** 2 for item in lst) / (len(lst) - 1)) ** 0.5
d = defaultdict(list)
with open("data/activity.txt", 'rt') as f:
    f.readline()
    for line in f:
        if line.strip():
            name, value = line.strip().split('\t')
            d[name].append(float(value))
output = {k: (mean(v), sd(v)) for k, v in d.items()}
with open("data/activity_stat.txt", "wt") as f:
    for k, v in output.items():
        f.write(k + "\t" + "\t".join(map(str, v)) + "\n")
To run the script from the command line we need to pass the input and output file names to it and parse these command-line arguments. Below is a backbone for such a script:
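A minimal sketch of how such argument parsing might look, assuming the standard argparse module. The option names -i/--input and -o/--output follow the example commands in this section; the function name parse_args, the help texts and the choice to print to stdout when no output file is given are assumptions made for this illustration.

```python
import argparse

def parse_args(argv=None):
    # argv=None makes argparse read the arguments from sys.argv
    parser = argparse.ArgumentParser(
        description="Calculate mean and sd of activity values for each compound.")
    parser.add_argument("-i", "--input", required=True,
                        help="input text file with compound names and activity values")
    parser.add_argument("-o", "--output", default=None,
                        help="output text file; if omitted, results go to stdout")
    return parser.parse_args(argv)

# In a script one would call parse_args() without arguments;
# here an explicit list is passed to show the result
args = parse_args(["-i", "input_file_name.txt"])
print(args.input, args.output)
```

The -h option is added by argparse automatically and prints the description and the help texts.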
Add functions from above:
Save the script, change permission to executable and run it from command line with different arguments
calc.py -h
calc.py -i input_file_name.txt
calc.py --input input_file_name.txt > output_file_name.txt
calc.py -i input_file_name.txt -o output_file_name.txt
Classes provide better modularity and flexibility of code.
Class: A user-defined prototype for an object that defines a set of attributes that characterize any object of the class. The attributes are data members (class variables and instance variables) and methods, accessed via dot notation.
Instance: An individual object of a certain class.
Class variable: A variable that is shared by all instances of a class. Class variables are defined within a class but outside any of the class's methods. Class variables are not used as frequently as instance variables are.
Instance variable: A variable that is defined inside a method and belongs only to the current instance of a class.
Method: A special kind of function that is defined in a class definition.
Method overloading: The assignment of more than one behavior to a particular method. The operation performed varies by the types of objects or arguments involved.
Inheritance: The transfer of the characteristics of a class to other classes that are derived from it.
Create class
class A:
    def __init__(self, name, value=0):
        self.name = name # public instance attribute
        self.__value = value # private instance attribute
    def get_attr(self):
        return "parent class value: " + str(self.__value)
    def set_attr(self, value):
        self.__value = value
class B(A):
    def __init__(self, name, value):
        super(B, self).__init__(name) # init parent class if necessary
        self.__value = value # it will not override parent value
    def get_attr(self): # override parent method
        return "child class value: " + str(self.__value)
a = A('Main class', 24)
b = B('Derived class', 42)
print(a.name)
print(a.get_attr())
print(b.name)
print(b.get_attr())
print(super(B, b).get_attr())
b.set_attr(55)
print(b.get_attr())
b.__dict__ # look at the namespace
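The reason the child's __value does not override the parent's one is name mangling: inside a class body, an attribute name that starts with two underscores is stored with the class name prepended. A reduced sketch with hypothetical classes Parent and Child:

```python
class Parent:
    def __init__(self, value=0):
        self.__value = value      # stored as _Parent__value

class Child(Parent):
    def __init__(self, value):
        super().__init__()
        self.__value = value      # stored as _Child__value

c = Child(42)
names = sorted(vars(c))           # attribute names in the instance namespace
print(names)
print(c._Parent__value, c._Child__value)
```

The attributes are therefore private only by convention: they remain reachable through their mangled names.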
Create a script which will run a long calculation function in a single thread as usual.
$ time ./single_process.py -i rand_int.txt -o output_single.txt
real 0m22.080s
user 0m22.060s
sys 0m0.016s
Implement the same script using multiprocessing module.
Objects which are passed to a process should be picklable (manual serialization may be required).
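Whether an object can be sent to another process may be checked in advance by trying to pickle it. The helper name is_picklable below is hypothetical; the sketch only relies on pickle.dumps from the standard library.

```python
import pickle

def is_picklable(obj):
    # Try to serialize the object; unpicklable objects raise one of these errors
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

data_ok = is_picklable([1, 2, {'a': 3.5}])        # plain data structures are fine
lambda_ok = is_picklable(lambda x: x * x)         # lambdas cannot be pickled
gen_ok = is_picklable(i for i in range(3))        # generators cannot be pickled
print(data_ok, lambda_ok, gen_ok)
```

Functions defined at the top level of a module are pickled by name, which is why worker functions for a process pool are normally defined there rather than as lambdas or nested functions.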
$ time ./multi_process_2.py -i rand_int.txt -o multi_output_2.txt -c 2
real 0m13.403s
user 0m26.552s
sys 0m0.028s
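The gain from the two timings reported above can be expressed as a speed-up (single-process time divided by multi-process time) and a parallel efficiency (speed-up divided by the number of processes). A short calculation using the "real" times:

```python
t_single = 22.080   # real time of single_process.py, seconds
t_multi = 13.403    # real time of multi_process_2.py with -c 2, seconds
n_proc = 2

speedup = t_single / t_multi
efficiency = speedup / n_proc
print("speed-up: %.2f, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

With two processes the run is about 1.65 times faster, which is roughly 82% of the ideal twofold gain.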
The obtained files are identical once they are sorted. Sorting is needed because the implemented multiprocessing does not guarantee the same order of output results.
diff <(sort output_single.txt) <(sort multi_output_2.txt)
Explanation of the differences between imap/imap_unordered and map/map_async:
http://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap
An alternative implementation which uses Queue to pass data to processes.
$ time ./multi_process.py -i rand_int.txt -o multi_output.txt -c 2
real 0m14.779s
user 0m28.624s
sys 0m0.160s
The time gain does not depend linearly on the number of cores, for several reasons: