{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016. \n",
"Dr. Pavel Polishchuk"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The basic elements of Python 3\n",
"\n",
"1. Python is an object-oriented language, however it supports procedural programming that makes it perfect for fast development of simple scenario scripts.\n",
"2. Python is open-source\n",
"3. Python is cross-platform\n",
"4. Python is powerful: \n",
" - dynamic typing of variables - no need to declare variable types\n",
" - automatic memory management (garbage collector)\n",
" - many built-in and third party libraries\n",
"5. Python has API to many languages\n",
"6. Python is easy to learn and easy to use\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### My Python experience \n",
"##### measured in lines of code:"
]
},
{
"cell_type": "code",
"execution_count": 265,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"23243\r\n"
]
}
],
"source": [
"!find ~/Python -type f -name '*.py' -exec cat {} \\; | sed '/^\\s*#/d;/^\\s*$/d;/^\\s*\\/\\//d' | wc -l"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### measured in the number of developed open-source tools:\n",
"1. SiRMS - https://github.com/DrrDom/sirms \n",
"Simplex Representation of Molecular Structure \n",
"The tool for calculating of fragment descriptors for single compounds, mixtures, \"quasi\"-mixtures and reactions with atom labeled by different user-defined properties (charge, lipophilicity, H-bonding, etc).\n",
"2. SPCI - https://github.com/DrrDom/spci \n",
"Structural and Physico-Chemical Interpretation of QSAR models \n",
"The tool with GUI for automatic mining of chemical datasets which performs model building, validation and interpretation and provides with chemically meaningful output.\n",
"\n",
"More detals are here: http://qsar4u.com/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Content\n",
"1. PEP 8 -- Style Guide for Python Code\n",
"1. Built-in data types\n",
" 1. Numbers\n",
" 1. Strings\n",
" 1. Lists\n",
" 1. Tuples\n",
" 1. Slicing and indexing of strings, lists and tuples\n",
" 1. Dictionaries\n",
" 1. Sets\n",
"1. Data type conversion\n",
"1. List comprehensions\n",
"1. Generators\n",
"1. Variable assignment, shallow and deep copy of objects\n",
"1. Some built-in functions\n",
"1. Functions\n",
"1. File I/O\n",
"1. Create scripts\n",
"1. Classes\n",
"1. Multiprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### PEP 8 -- Style Guide for Python Code\n",
"https://www.python.org/dev/peps/pep-0008/\n",
"\n",
"1. Indentation - four spaces.\n",
"2. Lines should contain up to 79 symbols.\n",
"3. Blank lines to separate functions, classes, logical blocks in code, etc.\n",
"4. Import each module on a separate line.\n",
"5. Use whitespaces in expressions.\n",
"6. Use single or double quotes for strings consistently.\n",
"7. Leave useful comments: in-line, block or docstrings.\n",
"8. Name classes in CapitalizeWords, name function_with_underscore.\n",
"9. Avoid name conflicts. \n",
"etc."
]
},
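{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how several of these rules look together in code. The names `Circle` and `mean_area` are hypothetical and serve only as an illustration.\n",
"\n",
"```python\n",
"import math\n",
"import statistics\n",
"\n",
"\n",
"class Circle:\n",
"    \"\"\"A circle described by its radius.\"\"\"\n",
"\n",
"    def __init__(self, radius):\n",
"        self.radius = radius\n",
"\n",
"    def area(self):\n",
"        return math.pi * self.radius ** 2\n",
"\n",
"\n",
"def mean_area(circles):\n",
"    # whitespace after commas and around operators\n",
"    return statistics.mean(circle.area() for circle in circles)\n",
"\n",
"\n",
"circles = [Circle(1.0), Circle(2.0)]\n",
"print(round(mean_area(circles), 2))  # 7.85\n",
"```\n",
"\n",
"Each module is imported on its own line, two blank lines separate top-level definitions, and the class and function names follow the two naming styles."
]
},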
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Built-in data types\n",
"\n",
"Immutable:\n",
"- Numbers\n",
"- Strings\n",
"- Tuples\n",
"\n",
"Mutable:\n",
"- Lists\n",
"- Dictionaries\n",
"- Sets\n",
"\n",
"\n",
"- Files\n",
"- Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Numbers\n",
"\n",
"- integer\n",
"- float\n",
"- complex"
]
},
{
"cell_type": "code",
"execution_count": 266,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"int"
]
},
"execution_count": 266,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n1 = 1\n",
"type(n1)"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"float"
]
},
"execution_count": 267,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n2 = 1.0\n",
"type(n2)"
]
},
{
"cell_type": "code",
"execution_count": 268,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"complex"
]
},
"execution_count": 268,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n3 = 3 + 3j\n",
"type(n3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python provides unlimited precision of integers"
]
},
{
"cell_type": "code",
"execution_count": 269,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376"
]
},
"execution_count": 269,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n4 = 2 ** 1000\n",
"n4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pyhton supports all math operations under numbers"
]
},
{
"cell_type": "code",
"execution_count": 270,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 270,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 + 2 # addition"
]
},
{
"cell_type": "code",
"execution_count": 271,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 271,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 * 2 # multiplication"
]
},
{
"cell_type": "code",
"execution_count": 272,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.5"
]
},
"execution_count": 272,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 / 2 # division"
]
},
{
"cell_type": "code",
"execution_count": 273,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 273,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 // 2 # integer part of division"
]
},
{
"cell_type": "code",
"execution_count": 274,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 274,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 % 2 # residue of division"
]
},
{
"cell_type": "code",
"execution_count": 275,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"25"
]
},
"execution_count": 275,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"5 ** 2 # exponentiation"
]
},
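{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `//` rounds toward negative infinity, which matters for negative numbers. A short sketch with illustrative values:\n",
"\n",
"```python\n",
"print(divmod(5, 2))  # (2, 1), quotient and remainder together\n",
"print(-5 // 2)       # -3, rounded down, not toward zero\n",
"print(-5 % 2)        # 1, so that (-5 // 2) * 2 + (-5 % 2) == -5\n",
"print(int(-5 / 2))   # -2, int() truncates toward zero\n",
"```"
]
},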
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Strings\n",
"are sequences of characters (surrounded by single or double quotations)"
]
},
{
"cell_type": "code",
"execution_count": 276,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 276,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = 'Olomouc'\n",
"s2 = \"Olomouc\"\n",
"s1 == s2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using double quotes you may represent the string with apostrophes"
]
},
{
"cell_type": "code",
"execution_count": 277,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mom's son\n"
]
}
],
"source": [
"s3 = \"Mom's son\"\n",
"print(s3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may create multiline comments with triple quotes"
]
},
{
"cell_type": "code",
"execution_count": 278,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'This is\\na very long\\ncomment'"
]
},
"execution_count": 278,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s4 = \"\"\"This is\n",
"a very long\n",
"comment\"\"\"\n",
"s4"
]
},
{
"cell_type": "code",
"execution_count": 279,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This is\n",
"a very long\n",
"comment\n"
]
}
],
"source": [
"print(s4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Strings may be concatenated"
]
},
{
"cell_type": "code",
"execution_count": 280,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olomouc is a nice city'"
]
},
"execution_count": 280,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 + \" is a nice city\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or repeated"
]
},
{
"cell_type": "code",
"execution_count": 281,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'OlomoucOlomoucOlomouc'"
]
},
"execution_count": 281,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 * 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There a lot of method which can be applied to strings"
]
},
{
"cell_type": "code",
"execution_count": 282,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 282,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.find(\"uc\") # beware! indexing starts from 0"
]
},
{
"cell_type": "code",
"execution_count": 283,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olympic'"
]
},
"execution_count": 283,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1.replace(\"omou\", \"ympi\")"
]
},
{
"cell_type": "code",
"execution_count": 284,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"' String with heading and trailing spaces \\n'"
]
},
"execution_count": 284,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s5 = \" String with heading and trailing spaces \\n\"\n",
"s5"
]
},
{
"cell_type": "code",
"execution_count": 285,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'String with heading and trailing spaces'"
]
},
"execution_count": 285,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s5.strip() # remove heading and trailing whitespaces"
]
},
{
"cell_type": "code",
"execution_count": 286,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['This is', 'a very long', 'comment']"
]
},
"execution_count": 286,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s4.split(\"\\n\")"
]
},
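{
"cell_type": "markdown",
"metadata": {},
"source": [
"String methods are often chained when parsing text records. A minimal sketch, assuming a hypothetical comma-separated line that holds a compound name and its SMILES string:\n",
"\n",
"```python\n",
"line = '  ethanol,CCO  '\n",
"name, smiles = line.strip().split(',')\n",
"print(name.upper())            # ETHANOL\n",
"print(smiles.lower())          # cco\n",
"print(name.startswith('eth'))  # True\n",
"```\n",
"\n",
"Strings are immutable, so each method returns a new string and `line` itself stays unchanged."
]
},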
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Lists\n",
"are ordered collections of items"
]
},
{
"cell_type": "code",
"execution_count": 287,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['abc', 3, 'Olomouc']"
]
},
"execution_count": 287,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ls1 = ['abc', 3, s1]\n",
"ls1"
]
},
{
"cell_type": "code",
"execution_count": 288,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ls2 = [3, 4, 5]"
]
},
{
"cell_type": "code",
"execution_count": 289,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['abc', 3, 'Olomouc', 3, 4, 5]"
]
},
"execution_count": 289,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ls1 + ls2 # concatenation of lists returns new list"
]
},
{
"cell_type": "code",
"execution_count": 290,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['abc', 3, 'Olomouc', 3, 4, 5]"
]
},
"execution_count": 290,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ls1.extend(ls2) # update list with items from another changes the original list\n",
"ls1"
]
},
{
"cell_type": "code",
"execution_count": 291,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['abc', 3, 'Olomouc', 3, 4, 5, 10]"
]
},
"execution_count": 291,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ls1.append(10) # append to the list\n",
"ls1"
]
},
{
"cell_type": "code",
"execution_count": 292,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['abc', 3, 'Olomouc', 3, 4, 5, 10, [3, 4, 5]]"
]
},
"execution_count": 292,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ls1.append(ls2) # nested lists\n",
"ls1"
]
},
{
"cell_type": "code",
"execution_count": 293,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 0, 1, 0, 1, 0, 1, 0, 1]"
]
},
"execution_count": 293,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[0, 1] * 5 # repeat of list items"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tuples\n",
"are ordered collections of items as lists but they are immutable\n",
"That is particularly very useful when you exchange data between classes or modules to make sure that they will not be changed accidently."
]
},
{
"cell_type": "code",
"execution_count": 294,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(2, 3, 4)"
]
},
"execution_count": 294,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t1 = (2, 3, 4)\n",
"t1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ordered collections (lists and tuples) of string items can be converted to one string with specified separator that is very useful when you store data to text files."
]
},
{
"cell_type": "code",
"execution_count": 295,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'1\\t2\\t3'"
]
},
"execution_count": 295,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"\\t\".join(['1', '2', '3'])"
]
},
{
"cell_type": "code",
"execution_count": 296,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'1 2 3'"
]
},
"execution_count": 296,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\" \".join(('1', '2', '3'))"
]
},
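{
"cell_type": "markdown",
"metadata": {},
"source": [
"`join()` accepts only string items, so numbers must be converted first. A minimal sketch with illustrative values:\n",
"\n",
"```python\n",
"values = [1, 2.5, 'abc']\n",
"line = ', '.join(str(v) for v in values)\n",
"print(line)  # 1, 2.5, abc\n",
"```"
]
},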
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Slicing and indexing of strings, lists and tuples"
]
},
{
"cell_type": "code",
"execution_count": 297,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olomouc'"
]
},
"execution_count": 297,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1"
]
},
{
"cell_type": "code",
"execution_count": 298,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'O'"
]
},
"execution_count": 298,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[0] # access items by index"
]
},
{
"cell_type": "code",
"execution_count": 299,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'l'"
]
},
"execution_count": 299,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[1]"
]
},
{
"cell_type": "code",
"execution_count": 300,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'c'"
]
},
"execution_count": 300,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[-1] # access last item"
]
},
{
"cell_type": "code",
"execution_count": 301,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'c'"
]
},
"execution_count": 301,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[len(s1) - 1] # the same"
]
},
{
"cell_type": "code",
"execution_count": 302,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olo'"
]
},
"execution_count": 302,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[0:3]"
]
},
{
"cell_type": "code",
"execution_count": 303,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olo'"
]
},
"execution_count": 303,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[:3] # the first and the last indices may be omitted"
]
},
{
"cell_type": "code",
"execution_count": 304,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'ouc'"
]
},
"execution_count": 304,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[4:]"
]
},
{
"cell_type": "code",
"execution_count": 305,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'omou'"
]
},
"execution_count": 305,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[2:-1]"
]
},
{
"cell_type": "code",
"execution_count": 306,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olomouc'"
]
},
"execution_count": 306,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1[:] # creates a copy of an object, that is particularly useful when work with lists"
]
},
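{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same indexing and slicing syntax works for lists and tuples, and an optional third value sets the step. A short sketch with illustrative names:\n",
"\n",
"```python\n",
"letters = ['a', 'b', 'c', 'd', 'e']\n",
"point = (10, 20, 30)\n",
"\n",
"print(letters[1:4])   # ['b', 'c', 'd']\n",
"print(letters[::2])   # ['a', 'c', 'e'], every second item\n",
"print(letters[::-1])  # ['e', 'd', 'c', 'b', 'a'], reversed copy\n",
"print(point[-2:])     # (20, 30), a slice of a tuple is a tuple\n",
"```"
]
},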
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dictionaries\n",
"consists of key-value pairs like hash tables or associative arrays. \n",
"Keys can be of any immutable type: number, string or tuple. \n",
"Values are items of any type without restrictions.\n",
"\n",
"Dictionaries are very fast and efficient. They can be accessed only by keys."
]
},
{
"cell_type": "code",
"execution_count": 307,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{1: 'Olomouc', 2: 'nice', 3: 'city'}\n"
]
}
],
"source": [
"d = {1: 'Olomouc', 2: 'nice', 3: \"city\"}\n",
"print(d)"
]
},
{
"cell_type": "code",
"execution_count": 308,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Olomouc'"
]
},
"execution_count": 308,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[1] # get item with key 1 \n",
" # (you cannot use slices like in lists, you need to iterate all keys to return corresponding values)"
]
},
{
"cell_type": "code",
"execution_count": 309,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# d[0] # get item with key 0 which is absent and thus it leads to error"
]
},
{
"cell_type": "code",
"execution_count": 310,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[1, 2, 3]"
]
},
"execution_count": 310,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(d.keys())"
]
},
{
"cell_type": "code",
"execution_count": 311,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Olomouc', 'nice', 'city']"
]
},
"execution_count": 311,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(d.values())"
]
},
{
"cell_type": "code",
"execution_count": 312,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(1, 'Olomouc'), (2, 'nice'), (3, 'city')]"
]
},
"execution_count": 312,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(d.items())"
]
},
{
"cell_type": "code",
"execution_count": 313,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Failure\n"
]
}
],
"source": [
"if 0 in d.keys(): # check for key existence\n",
" print(\"Success\")\n",
"else:\n",
" print(\"Failure\")"
]
},
{
"cell_type": "code",
"execution_count": 314,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'Olomouc', 2: 'nice', 3: 'city', 'list': [1, 2, 4]}"
]
},
"execution_count": 314,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d['list'] = [1, 2, 4] # add new value, this will rewrite your data if it is already exists with this key\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 315,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'Olomouc', 2: 'nice', 3: 3, 'list': [1, 2, 4]}"
]
},
"execution_count": 315,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[3] = 3 # replace with new value\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 316,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'Olomouc', 2: 'nice', 3: 7, 'list': [1, 2, 4]}"
]
},
"execution_count": 316,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[3] = d[3] + 4 # update existing item\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 317,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'Olomouc', 2: 'nice', 3: 11, 'list': [1, 2, 4]}"
]
},
"execution_count": 317,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[3] += 4 # the same\n",
"d"
]
},
{
"cell_type": "code",
"execution_count": 318,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"del d[3] # remove item from dict"
]
},
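{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two common patterns are reading a key that may be absent and iterating over key-value pairs. A minimal sketch with a hypothetical dictionary of atom counts:\n",
"\n",
"```python\n",
"atom_counts = {'C': 2, 'H': 6, 'O': 1}\n",
"\n",
"print(atom_counts.get('N', 0))  # 0, no KeyError for a missing key\n",
"\n",
"for symbol, count in atom_counts.items():\n",
"    print(symbol, count)\n",
"\n",
"atom_counts['N'] = atom_counts.get('N', 0) + 1\n",
"print(atom_counts['N'])         # 1\n",
"```"
]
},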
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Sets\n",
"\n",
"are unordered sets of unique immutable items"
]
},
{
"cell_type": "code",
"execution_count": 319,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2, 3}"
]
},
"execution_count": 319,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 = set([1, 2, 3]) # set can be created from iterable\n",
"s1"
]
},
{
"cell_type": "code",
"execution_count": 320,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2, 4, 5}"
]
},
"execution_count": 320,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s2 = {4, 5, 1, 2} # set can be created from separate items\n",
"s2"
]
},
{
"cell_type": "code",
"execution_count": 321,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2}"
]
},
"execution_count": 321,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 & s2 # intersection"
]
},
{
"cell_type": "code",
"execution_count": 322,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2, 3, 4, 5}"
]
},
"execution_count": 322,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 | s2 # union"
]
},
{
"cell_type": "code",
"execution_count": 323,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{3}"
]
},
"execution_count": 323,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s1 - s2 # difference"
]
},
{
"cell_type": "code",
"execution_count": 324,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{4, 5}"
]
},
"execution_count": 324,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s2 - s1 # difference is not symmetrical"
]
},
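{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the operators above, sets support symmetric difference, subset tests and fast membership checks. A short sketch with illustrative names (chosen so that `s1` and `s2` are not overwritten):\n",
"\n",
"```python\n",
"left = {1, 2, 3}\n",
"right = {4, 5, 1, 2}\n",
"\n",
"print(left ^ right)    # items present in exactly one of the two sets\n",
"print({1, 2} <= left)  # True, {1, 2} is a subset of left\n",
"print(3 in right)      # False, membership test\n",
"left.add(3)            # adding an existing item changes nothing\n",
"print(len(left))       # 3\n",
"```"
]
},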
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data type conversion"
]
},
{
"cell_type": "code",
"execution_count": 325,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 325,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"int('12') # string to integer"
]
},
{
"cell_type": "code",
"execution_count": 326,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"execution_count": 326,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"int(12.2) # float to integer"
]
},
{
"cell_type": "code",
"execution_count": 327,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"12.0"
]
},
"execution_count": 327,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"float('12') # string to float"
]
},
{
"cell_type": "code",
"execution_count": 328,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'12'"
]
},
"execution_count": 328,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"str(12) # number to string"
]
},
{
"cell_type": "code",
"execution_count": 329,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"141"
]
},
"execution_count": 329,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"int('10001101', 2) # convert string to integer with base 2"
]
},
{
"cell_type": "code",
"execution_count": 330,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[1, 1, 2, 3, 4]"
]
},
"execution_count": 330,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a = [1, 1, 2, 3, 4]\n",
"a"
]
},
{
"cell_type": "code",
"execution_count": 331,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1, 1, 2, 3, 4)"
]
},
"execution_count": 331,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tuple(a) # converts to tuple"
]
},
{
"cell_type": "code",
"execution_count": 332,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2, 3, 4}"
]
},
"execution_count": 332,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(a) # converts to set and keep only unique items"
]
},
{
"cell_type": "code",
"execution_count": 333,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[1, 2, 3, 4]"
]
},
"execution_count": 333,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(set(a)) # converts to the set and back to the list - can be used to remove duplicates from the list"
]
},
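{
"cell_type": "markdown",
"metadata": {},
"source": [
"Converting to a set does not guarantee that items keep their original order. If order matters, one option is `dict.fromkeys`, assuming Python 3.7+, where dictionaries preserve insertion order. A minimal sketch:\n",
"\n",
"```python\n",
"items = [3, 1, 3, 2, 1]\n",
"unique_in_order = list(dict.fromkeys(items))\n",
"print(unique_in_order)  # [3, 1, 2]\n",
"```"
]
},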
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List comprehensions\n",
"\n",
"simplify generation of iterable objects (lists, dicts, sets, tuples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's generate list containing the number of characters in each word in the sentence"
]
},
{
"cell_type": "code",
"execution_count": 334,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = \"Chemoinformatics is a bright star on in the scientific universe\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How this can be done. Solution 1."
]
},
{
"cell_type": "code",
"execution_count": 335,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]"
]
},
"execution_count": 335,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = []\n",
"for word in s.split(' '):\n",
" output.append(len(word))\n",
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Solution 2 using list comprehensions."
]
},
{
"cell_type": "code",
"execution_count": 336,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[16, 2, 1, 6, 4, 2, 2, 3, 10, 8]"
]
},
"execution_count": 336,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = [len(word) for word in s.split(' ')]\n",
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is possible to create tuple instead of a list"
]
},
{
"cell_type": "code",
"execution_count": 337,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(16, 2, 1, 6, 4, 2, 2, 3, 10, 8)"
]
},
"execution_count": 337,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = tuple(len(word) for word in s.split(' '))\n",
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or even dict with words as a key and their length will be values"
]
},
{
"cell_type": "code",
"execution_count": 338,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'Chemoinformatics': 16,\n",
" 'a': 1,\n",
" 'bright': 6,\n",
" 'in': 2,\n",
" 'is': 2,\n",
" 'on': 2,\n",
" 'scientific': 10,\n",
" 'star': 4,\n",
" 'the': 3,\n",
" 'universe': 8}"
]
},
"execution_count": 338,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = {word: len(word) for word in s.split(' ')}\n",
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or set"
]
},
{
"cell_type": "code",
"execution_count": 339,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{1, 2, 3, 4, 6, 8, 10, 16}"
]
},
"execution_count": 339,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output = {len(word) for word in s.split(' ')}\n",
"output"
]
},
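{
"cell_type": "markdown",
"metadata": {},
"source": [
"A comprehension can also filter items with an `if` clause. A minimal sketch, using an illustrative sentence and variable names:\n",
"\n",
"```python\n",
"sentence = 'Python is easy to learn and easy to use'\n",
"long_words = [word for word in sentence.split(' ') if len(word) > 3]\n",
"print(long_words)  # ['Python', 'easy', 'learn', 'easy']\n",
"```"
]
},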
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generators\n",
"\n",
"are simple functions which return an iterable set of items, one at a time."
]
},
{
"cell_type": "code",
"execution_count": 340,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 340,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def gen_subseq(seq, length):\n",
" for i in range(len(seq) - length):\n",
" yield seq[i:i+length]\n",
" \n",
"s = 'AGTGGTCA'\n",
"gen_subseq(s, 3)"
]
},
{
"cell_type": "code",
"execution_count": 341,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['AGT', 'GTG', 'TGG', 'GGT', 'GTC']"
]
},
"execution_count": 341,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(gen_subseq(s, 3))"
]
},
{
"cell_type": "code",
"execution_count": 342,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AGT\n",
"GTG\n",
"TGG\n"
]
}
],
"source": [
"for subseq in gen_subseq(s, 3):\n",
" if subseq == \"GGT\":\n",
" break\n",
" else:\n",
" print(subseq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recursive generators is very simple starting from Python 3.3. Below is a generator of integers starting from the specified one."
]
},
{
"cell_type": "code",
"execution_count": 343,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def infinity(start):\n",
" yield start\n",
" yield from infinity(start + 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However recursion has a maximum depth. If a program will reach it an error will be raisen. You may increase the recursion depth in system settings or reimplement the procedure without recursion."
]
},
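{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to avoid the recursion limit is a plain loop. The sketch below (the name `count_from` is illustrative) yields the same values as `infinity` and uses `itertools.islice` to take only the first few items:\n",
"\n",
"```python\n",
"from itertools import islice\n",
"\n",
"\n",
"def count_from(start):\n",
"    # an endless generator without recursion\n",
"    while True:\n",
"        yield start\n",
"        start += 1\n",
"\n",
"\n",
"print(list(islice(count_from(5), 3)))  # [5, 6, 7]\n",
"```\n",
"\n",
"Because the generator is lazy, the endless loop is safe as long as the consumer stops asking for items."
]
},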
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Variable assignment, shallow and deep copy of objects\n",
"\n",
"Variables are assigned by reference not by value. This may lead to some unxpected situations in case of mutable data types. Compare different situations."
]
},
{
"cell_type": "code",
"execution_count": 344,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"4\n"
]
}
],
"source": [
"a = 4\n",
"b = a\n",
"a = 5\n",
"print(a)\n",
"print(b)"
]
},
{
"cell_type": "code",
"execution_count": 345,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[9, 2, 3]\n",
"[9, 2, 3]\n"
]
}
],
"source": [
"L = [1, 2, 3]\n",
"M = L\n",
"L[0] = 9\n",
"print(L)\n",
"print(M)"
]
},
{
"cell_type": "code",
"execution_count": 346,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 346,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M is L # check identity of referenced objects"
]
},
{
"cell_type": "code",
"execution_count": 347,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 347,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"N = L[:]\n",
"N is L"
]
},
{
"cell_type": "code",
"execution_count": 348,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['p', 2, 3]\n",
"[9, 2, 3]\n"
]
}
],
"source": [
"L[0] = 'p'\n",
"print(L)\n",
"print(N)"
]
},
{
"cell_type": "code",
"execution_count": 349,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 349,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L = [1, [2, 3]]\n",
"M = L[:]\n",
"M is L"
]
},
{
"cell_type": "code",
"execution_count": 350,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, [2, 3]]\n",
"[1, [2, 3]]\n"
]
}
],
"source": [
"print(L)\n",
"print(M)"
]
},
{
"cell_type": "code",
"execution_count": 351,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, [2, 5]]\n",
"[1, [2, 5]]\n"
]
}
],
"source": [
"L[1][1] = 5\n",
"print(L)\n",
"print(M)"
]
},
{
"cell_type": "code",
"execution_count": 352,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, [2, 5]]\n",
"[1, [2, 3]]\n"
]
}
],
"source": [
"from copy import deepcopy\n",
"L = [1, [2, 3]]\n",
"M = deepcopy(L)\n",
"L[1][1] = 5\n",
"print(L)\n",
"print(M)"
]
},
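{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above can be summarised in one sketch. It assumes only the standard `copy` module; the names `shallow` and `deep` are illustrative. A shallow copy (`copy(L)`, equivalent to `L[:]` for lists) creates a new outer list but shares the nested objects, while `deepcopy` duplicates the nested objects as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from copy import copy, deepcopy\n",
"\n",
"L = [1, [2, 3]]\n",
"shallow = copy(L)   # new outer list, shared inner list\n",
"deep = deepcopy(L)  # new outer list and new inner list\n",
"\n",
"print(shallow is L, shallow[1] is L[1])  # False True\n",
"print(deep is L, deep[1] is L[1])        # False False\n",
"\n",
"L[1][1] = 5\n",
"print(shallow)  # [1, [2, 5]]\n",
"print(deep)     # [1, [2, 3]]"
]
},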
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some built-in functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`min(), max(), sum()`"
]
},
{
"cell_type": "code",
"execution_count": 353,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ls = [1, 2, 3, 4]"
]
},
{
"cell_type": "code",
"execution_count": 354,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 354,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"min(ls)"
]
},
{
"cell_type": "code",
"execution_count": 355,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 355,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"max(ls)"
]
},
{
"cell_type": "code",
"execution_count": 356,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10"
]
},
"execution_count": 356,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum(ls)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`zip(*iterables)` - makes an iterator that aggregates elements from each of the iterables"
]
},
{
"cell_type": "code",
"execution_count": 357,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"s = 'ABCD'"
]
},
{
"cell_type": "code",
"execution_count": 358,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 358,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"zip(ls, s)"
]
},
{
"cell_type": "code",
"execution_count": 359,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D')]"
]
},
"execution_count": 359,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(zip(ls, s))"
]
},
{
"cell_type": "code",
"execution_count": 360,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{1: 'A', 2: 'B', 3: 'C', 4: 'D'}"
]
},
"execution_count": 360,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dict(zip(ls, s)) # useful for creating dict from separate lists of keys and values\n",
"d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`enumerate()`"
]
},
{
"cell_type": "code",
"execution_count": 361,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 361,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"enumerate(s)"
]
},
{
"cell_type": "code",
"execution_count": 362,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D')]"
]
},
"execution_count": 362,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(enumerate(s))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"They are particularly useful in loops:"
]
},
{
"cell_type": "code",
"execution_count": 363,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 0: number 1 is assigned to letter A\n",
"Iteration 1: number 2 is assigned to letter B\n",
"Iteration 2: number 3 is assigned to letter C\n",
"Iteration 3: number 4 is assigned to letter D\n"
]
}
],
"source": [
"for i, (number, letter) in enumerate(zip(ls, s)):  # unpacking zipped values is optional\n",
"    print('Iteration %i: number %i is assigned to letter %s' % (i, number, letter))"
]
},
{
"cell_type": "code",
"execution_count": 364,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Iteration 0: number 1 is assigned to letter A\n",
"Iteration 1: number 2 is assigned to letter B\n",
"Iteration 2: number 3 is assigned to letter C\n",
"Iteration 3: number 4 is assigned to letter D\n"
]
}
],
"source": [
"for i, item in enumerate(zip(ls, s)):  # item is a tuple\n",
"    print('Iteration %i: number %i is assigned to letter %s' % (i, item[0], item[1]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Functions\n",
"\n",
"A function is a block of organized, reusable code that performs a particular action. Functions help to keep your code modular and flexible. There are many built-in functions such as `print`, `len`, `sum`, etc.  \n",
"Let's create our own function which will calculate the mean value of a list."
]
},
{
"cell_type": "code",
"execution_count": 365,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def mean(lst):\n",
"    if lst:  # check if list is not empty\n",
"        return sum(lst) / len(lst)\n",
"    else:\n",
"        return None"
]
},
{
"cell_type": "code",
"execution_count": 366,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"2.5"
]
},
"execution_count": 366,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean([1, 2, 3, 4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same using error handling"
]
},
{
"cell_type": "code",
"execution_count": 367,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n"
]
}
],
"source": [
"def mean(lst):\n",
"    try:\n",
"        return sum(lst) / len(lst)\n",
"    except ZeroDivisionError:\n",
"        return None\n",
"\n",
"print(mean([]))"
]
},
{
"cell_type": "code",
"execution_count": 368,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def sd(lst):\n",
"    if not lst:\n",
"        return None  # if list is empty return None\n",
"    if len(lst) == 1:\n",
"        return 0\n",
"    else:\n",
"        m = mean(lst)\n",
"        # note: this expression is the sample variance;\n",
"        # its square root (** 0.5) gives the standard deviation\n",
"        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)"
]
},
{
"cell_type": "code",
"execution_count": 369,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1.6666666666666667"
]
},
"execution_count": 369,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sd([1, 2, 3, 4])"
]
},
{
"cell_type": "code",
"execution_count": 370,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 370,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sd([5])"
]
},
{
"cell_type": "code",
"execution_count": 371,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sd([])"
]
},
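{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, the standard library `statistics` module provides `mean`, `variance` and `stdev`. The sketch below is an illustration, not part of the original material. Note that the expression used in `sd` above (the sum of squared deviations divided by n - 1) corresponds to `statistics.variance`; `statistics.stdev` is its square root."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math\n",
"import statistics\n",
"\n",
"values = [1, 2, 3, 4]\n",
"\n",
"print(statistics.mean(values))      # arithmetic mean\n",
"print(statistics.variance(values))  # sample variance (n - 1 in the denominator)\n",
"print(statistics.stdev(values))     # sample standard deviation\n",
"\n",
"# the standard deviation is the square root of the variance\n",
"print(math.isclose(statistics.stdev(values), math.sqrt(statistics.variance(values))))  # True"
]
},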
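{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arguments in Python are passed to functions by reference to the object, not as copies. A function that modifies a mutable argument in place therefore also changes the caller's object, which may cause errors."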
]
},
{
"cell_type": "code",
"execution_count": 372,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"12.5\n",
"[1, 2, 3, 4, 2.5]\n"
]
}
],
"source": [
"def func(lst):\n",
"    lst.append(mean(lst))  # add mean value to the list\n",
"    return sum(lst)  # calc sum and return the value\n",
"\n",
"ls = [1, 2, 3, 4]\n",
"s = func(ls)\n",
"print(s)\n",
"print(ls)  # list was changed!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid such behaviour, make a copy (or a deep copy) of the object inside the function and modify the copy instead."
]
},
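{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of this advice. The name `func_safe` is hypothetical; a shallow copy made with `list()` is enough here because the list holds only numbers, while nested mutable items would call for `copy.deepcopy`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def func_safe(lst):\n",
"    lst = list(lst)  # rebind the local name to a shallow copy\n",
"    lst.append(sum(lst) / len(lst))  # the copy is modified, not the caller's list\n",
"    return sum(lst)\n",
"\n",
"ls = [1, 2, 3, 4]\n",
"print(func_safe(ls))  # 12.5\n",
"print(ls)             # [1, 2, 3, 4], the original list is unchanged"
]
},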
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arguments may be passed to a function by position and by name. However, a positional argument cannot follow a named argument."
]
},
{
"cell_type": "code",
"execution_count": 373,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.4\n",
"2.5\n",
"0.4\n"
]
}
],
"source": [
"def div(x, y):\n",
"    return x / y\n",
"\n",
"print(div(2, 5))\n",
"print(div(5, 2))\n",
"print(div(y=5, x=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may set default values of function arguments"
]
},
{
"cell_type": "code",
"execution_count": 374,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.2\n",
"0.4\n",
"2.5\n"
]
}
],
"source": [
"def div(x, y=10):\n",
"    return x / y\n",
"\n",
"print(div(2))\n",
"print(div(2, 5))\n",
"print(div(y=2, x=5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may pass an arbitrary number of positional (not named) and named arguments. Extra positional arguments are collected by a parameter whose name starts with \\*, extra named arguments by a parameter whose name starts with \\*\\*. Positional arguments are received as a tuple, named arguments as a dict."
]
},
{
"cell_type": "code",
"execution_count": 375,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def func(arg1, *args, **kargs):\n",
"    print(\"arg1 = \", arg1)\n",
"    print(\"not named args = \", args)\n",
"    print(\"named args = \", kargs)"
]
},
{
"cell_type": "code",
"execution_count": 376,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"arg1 = 1\n",
"not named args = (2, 3)\n",
"named args = {}\n"
]
}
],
"source": [
"func(1, 2, 3)"
]
},
{
"cell_type": "code",
"execution_count": 377,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"arg1 = 1\n",
"not named args = ()\n",
"named args = {'arg2': 2, 'arg3': 3}\n"
]
}
],
"source": [
"func(1, arg2 = 2, arg3 = 3)"
]
},
{
"cell_type": "code",
"execution_count": 378,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"arg1 = 1\n",
"not named args = (3,)\n",
"named args = {'key1': 10, 'key2': 2}\n"
]
}
],
"source": [
"func(1, 3, key1=10, key2=2)"
]
},
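{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same \\* and \\*\\* symbols also work in the opposite direction, at the call site, where they unpack a sequence or a dict into separate arguments. A short sketch; the function `collect` and the variable names are hypothetical and introduced only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def collect(arg1, *args, **kwargs):\n",
"    return arg1, args, kwargs  # return what was received\n",
"\n",
"positional = [1, 2, 3]\n",
"named = {'key1': 10, 'key2': 2}\n",
"\n",
"# * unpacks the list into positional arguments, ** unpacks the dict into named ones\n",
"print(collect(*positional, **named))  # (1, (2, 3), {'key1': 10, 'key2': 2})"
]
},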
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### File I/O"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's read a file which has a header and in which each line contains a compound name and an activity value separated by a tab (\\t), calculate the average and standard deviation of the activity values for each compound, and store the results in another text file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compound_name\tpIC50  \n",
"Mol_1\t8.6  \n",
"Mol_1\t8.7  \n",
"Mol_2\t7.2  \n",
"Mol_3\t6.5  \n",
"Mol_3\t6.5  \n",
"Mol_1\t9  \n",
"Mol_4\t7.5  \n",
"Mol_5\t6.9  \n",
"Mol_6\t8.1  \n",
"Mol_7\t9.2  \n",
"Mol_2\t4.1  "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"There are several file modes: r - read, w - write, a - append, t - text, b - binary."
]
},
{
"cell_type": "code",
"execution_count": 379,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"f = open(\"data/activity.txt\", 'rt') # open file for reading in text mode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file object has several attributes:"
]
},
{
"cell_type": "code",
"execution_count": 380,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name of the file: data/activity.txt\n",
"File closed?: False\n",
"File mode : rt\n"
]
}
],
"source": [
"print(\"Name of the file: \", f.name)\n",
"print(\"File closed?: \", f.closed)\n",
"print(\"File mode : \", f.mode)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Iterate over lines and save them in dict"
]
},
{
"cell_type": "code",
"execution_count": 381,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"d = {}  # create dict where we will store reading results as a list of values for each compound\n",
"        # because some compounds can have several values\n",
"f.readline()  # read the first line from file (header) to skip it\n",
"for line in f:\n",
"    if line.strip():  # check if line is not empty (skip empty lines)\n",
"        tmp = line.strip().split('\\t')  # remove surrounding whitespace and split line on tabs (\\t)\n",
"                                        # this will avoid errors if compound names contain spaces\n",
"        if tmp[0] not in d:\n",
"            d[tmp[0]] = [float(tmp[1])]\n",
"        else:\n",
"            d[tmp[0]].append(float(tmp[1]))\n",
"f.close()  # close the file\n",
"           # otherwise it may be inaccessible by other applications"
]
},
{
"cell_type": "code",
"execution_count": 382,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'Mol_1': [8.6, 8.7, 9.0],\n",
" 'Mol_2': [7.2, 4.1],\n",
" 'Mol_3': [6.5, 6.5],\n",
" 'Mol_4': [7.5],\n",
" 'Mol_5': [6.9],\n",
" 'Mol_6': [8.1],\n",
" 'Mol_7': [9.2]}"
]
},
"execution_count": 382,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full text of the above commands with handling of possible exceptions:"
]
},
{
"cell_type": "code",
"execution_count": 383,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"f = open(\"data/activity.txt\", 'rt')\n",
"d = {}\n",
"try:\n",
"    f.readline()\n",
"    for line in f:\n",
"        if line.strip():\n",
"            tmp = line.strip().split('\\t')\n",
"            if tmp[0] not in d:\n",
"                d[tmp[0]] = [float(tmp[1])]\n",
"            else:\n",
"                d[tmp[0]].append(float(tmp[1]))\n",
"finally:\n",
"    f.close()  # if you use the f = open(...) statement you need a try-finally block to be sure\n",
"               # that in the case of exceptions your file will be closed and the file descriptor released,\n",
"               # otherwise the file can be blocked from access by other applications\n",
"               # (this is true if you open a file for editing)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Alternative solution with several improvements:"
]
},
{
"cell_type": "code",
"execution_count": 384,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import defaultdict  # import classes, functions, etc from a module\n",
"\n",
"d = defaultdict(list)  # create dict whose default value is an empty list:\n",
"                       # accessing a missing key creates it with an empty list\n",
"\n",
"with open(\"data/activity.txt\", 'rt') as f:  # files opened using the with statement\n",
"                                            # will be closed automatically\n",
"    f.readline()  # skip header\n",
"    for line in f:\n",
"        if line.strip():  # skip empty lines\n",
"            name, value = line.strip().split('\\t')  # since we know that only two elements are in each line\n",
"                                                    # we may use such unpacking\n",
"            d[name].append(float(value))"
]
},
{
"cell_type": "code",
"execution_count": 385,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(list,\n",
" {'Mol_1': [8.6, 8.7, 9.0],\n",
" 'Mol_2': [7.2, 4.1],\n",
" 'Mol_3': [6.5, 6.5],\n",
" 'Mol_4': [7.5],\n",
" 'Mol_5': [6.9],\n",
" 'Mol_6': [8.1],\n",
" 'Mol_7': [9.2]})"
]
},
"execution_count": 385,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d"
]
},
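{
"cell_type": "markdown",
"metadata": {},
"source": [
"To try the parsing logic without a file on disk, the same loop can be run on an in-memory text stream. This is only a sketch: `io.StringIO` stands in for the opened file, and the sample text is a shortened, made-up excerpt in the format shown above. It also shows one side effect of `defaultdict` worth remembering: merely reading a missing key creates it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import io\n",
"from collections import defaultdict\n",
"\n",
"# illustrative sample data: header, three records and one empty line\n",
"text = \"Compound_name\\tpIC50\\nMol_1\\t8.6\\nMol_1\\t8.7\\n\\nMol_2\\t7.2\\n\"\n",
"\n",
"d = defaultdict(list)\n",
"with io.StringIO(text) as f:  # behaves like a file opened in text mode\n",
"    f.readline()  # skip header\n",
"    for line in f:\n",
"        if line.strip():  # skip empty lines\n",
"            name, value = line.strip().split('\\t')\n",
"            d[name].append(float(value))\n",
"\n",
"print(dict(d))       # {'Mol_1': [8.6, 8.7], 'Mol_2': [7.2]}\n",
"print(d['Mol_9'])    # [] (a missing key returns the default value ...)\n",
"print('Mol_9' in d)  # True (... and is added to the dict as a side effect)"
]
},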
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's calculate average and standard deviation of our values"
]
},
{
"cell_type": "code",
"execution_count": 386,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"output = {}\n",
"for k, v in d.items():  # iterate over pairs of keys and values\n",
"    avg = mean(v)\n",
"    std = sd(v)\n",
"    output[k] = (avg, std)"
]
},
{
"cell_type": "code",
"execution_count": 387,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'Mol_1': (8.766666666666666, 0.04333333333333344),\n",
" 'Mol_2': (5.65, 4.8050000000000015),\n",
" 'Mol_3': (6.5, 0.0),\n",
" 'Mol_4': (7.5, 0),\n",
" 'Mol_5': (6.9, 0),\n",
" 'Mol_6': (8.1, 0),\n",
" 'Mol_7': (9.2, 0)}"
]
},
"execution_count": 387,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One-liner solution using a dict comprehension:"
]
},
{
"cell_type": "code",
"execution_count": 388,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"output = {k: (mean(v), sd(v)) for k, v in d.items()}"
]
},
{
"cell_type": "code",
"execution_count": 389,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'Mol_1': (8.766666666666666, 0.04333333333333344),\n",
" 'Mol_2': (5.65, 4.8050000000000015),\n",
" 'Mol_3': (6.5, 0.0),\n",
" 'Mol_4': (7.5, 0),\n",
" 'Mol_5': (6.9, 0),\n",
" 'Mol_6': (8.1, 0),\n",
" 'Mol_7': (9.2, 0)}"
]
},
"execution_count": 389,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save results to a text file"
]
},
{
"cell_type": "code",
"execution_count": 390,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with open(\"data/activity_stat.txt\", \"wt\") as f:  # if the file exists it will be\n",
"                                                 # silently overwritten\n",
"    for k, v in output.items():\n",
"        f.write(k + \"\\t\" + \"\\t\".join(map(str, v)) + \"\\n\")  # map applies the specified function\n",
"                                                           # to all items of the given iterable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The whole text of the script:"
]
},
{
"cell_type": "code",
"execution_count": 391,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import defaultdict  # import classes, functions, etc from a module\n",
"\n",
"d = defaultdict(list)\n",
"\n",
"with open(\"data/activity.txt\", 'rt') as f:\n",
"    f.readline()\n",
"    for line in f:\n",
"        if line.strip():\n",
"            name, value = line.strip().split('\\t')\n",
"            d[name].append(float(value))\n",
"\n",
"output = {k: (mean(v), sd(v)) for k, v in d.items()}\n",
"\n",
"with open(\"data/activity_stat.txt\", \"wt\") as f:\n",
"    for k, v in output.items():\n",
"        f.write(k + \"\\t\" + \"\\t\".join(map(str, v)) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create scripts\n",
"\n",
"Let's create a script which takes a text file with compounds and their activities as input and returns a text file with the average and standard deviation of activity values for each compound, as shown in the example above."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We already have the backbone of our script, which performs the I/O and all calculations. However, to use it we would need to edit the file names each time."
]
},
{
"cell_type": "code",
"execution_count": 392,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"\n",
"\n",
"def mean(lst):\n",
"    if lst:\n",
"        return sum(lst) / len(lst)\n",
"    else:\n",
"        return None\n",
"\n",
"\n",
"def sd(lst):\n",
"    if not lst:\n",
"        return None\n",
"    if len(lst) == 1:\n",
"        return 0\n",
"    else:\n",
"        m = mean(lst)\n",
"        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)\n",
"\n",
"\n",
"d = defaultdict(list)\n",
"\n",
"with open(\"data/activity.txt\", 'rt') as f:\n",
"    f.readline()\n",
"    for line in f:\n",
"        if line.strip():\n",
"            name, value = line.strip().split('\\t')\n",
"            d[name].append(float(value))\n",
"\n",
"output = {k: (mean(v), sd(v)) for k, v in d.items()}\n",
"\n",
"with open(\"data/activity_stat.txt\", \"wt\") as f:\n",
"    for k, v in output.items():\n",
"        f.write(k + \"\\t\" + \"\\t\".join(map(str, v)) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To run the script from the command line we need to pass the input and output file names to it and parse these command line arguments. Below is a backbone for such a script:"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"# the shebang line above specifies the path to the Python interpreter;\n",
"# it must be the first line and must not carry a trailing comment\n",
"\n",
"import argparse  # import module to parse command line args and create a help message\n",
"\n",
"\n",
"\n",
"\n",
"if __name__ == '__main__':  # entry point of the script\n",
"\n",
"    parser = argparse.ArgumentParser(description='Calculate average and standard deviation of compound activity.')\n",
"    parser.add_argument('-i', '--input', metavar='input.txt', required=True,\n",
"                        help='text file with header and two tab-separated columns with compound name and '\n",
"                             'activity value.')\n",
"    parser.add_argument('-o', '--out', metavar='output.txt', required=False, default=None,\n",
"                        help='output text file in tab separated format with average and sd for each compound. '\n",
"                             'If the name is omitted, output will be written to stdout.')\n",
"\n",
"    args = vars(parser.parse_args())\n",
"    for k, v in args.items():\n",
"        if k == \"input\": input_fname = v\n",
"        if k == \"out\": output_fname = v\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Add functions from above:"
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"\n",
"import argparse\n",
"from collections import defaultdict\n",
"\n",
"\n",
"def mean(lst):\n",
"    if lst:\n",
"        return sum(lst) / len(lst)\n",
"    else:\n",
"        return None\n",
"\n",
"\n",
"def sd(lst):\n",
"    if not lst:\n",
"        return None\n",
"    if len(lst) == 1:\n",
"        return 0\n",
"    else:\n",
"        m = mean(lst)\n",
"        return sum((item - m) ** 2 for item in lst) / (len(lst) - 1)\n",
"\n",
"\n",
"def load_file(fname):\n",
"    d = defaultdict(list)\n",
"    with open(fname, 'rt') as f:\n",
"        f.readline()\n",
"        for line in f:\n",
"            if line.strip():\n",
"                name, value = line.strip().split('\\t')\n",
"                d[name].append(float(value))\n",
"    return d\n",
"\n",
"\n",
"def save_file(fname, data):\n",
"    \"\"\"\n",
"    fname - file name\n",
"    data - dict, keys are compound names, values are tuples of average and sd activity values\n",
"    \"\"\"\n",
"    with open(fname, \"wt\") as f:\n",
"        for k, v in data.items():  # use the function argument, not a global variable\n",
"            f.write(k + \"\\t\" + \"\\t\".join(map(str, v)) + \"\\n\")\n",
"\n",
"\n",
"if __name__ == '__main__':  # entry point of the script\n",
"\n",
"    parser = argparse.ArgumentParser(description='Calculate average and standard deviation of compound activity.')\n",
"    parser.add_argument('-i', '--input', metavar='input.txt', required=True,\n",
"                        help='text file with header and two tab-separated columns with compound name and '\n",
"                             'activity value.')\n",
"    parser.add_argument('-o', '--out', metavar='output.txt', required=False, default=None,\n",
"                        help='output text file in tab separated format with average and sd for each compound. '\n",
"                             'If the name is omitted, output will be written to stdout.')\n",
"\n",
"    args = vars(parser.parse_args())\n",
"    for k, v in args.items():\n",
"        if k == \"input\": input_fname = v\n",
"        if k == \"out\": output_fname = v\n",
"\n",
"    d = load_file(input_fname)\n",
"\n",
"    output = {k: (mean(v), sd(v)) for k, v in d.items()}\n",
"\n",
"    if output_fname is None:  # print to stdout (may also be done with sys.stdout.write)\n",
"        for k, v in output.items():\n",
"            print(k, *v, sep=\"\\t\")  # * unpacks iterable\n",
"    else:\n",
"        save_file(output_fname, output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the script as `calc.py`, make it executable (`chmod +x calc.py`) and run it from the command line with different arguments:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"    calc.py -h"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"    calc.py -i input_file_name.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"    calc.py --input input_file_name.txt > output_file_name.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"    calc.py -i input_file_name.txt -o output_file_name.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classes\n",
"\n",
"Classes provide better modularity and flexibility of code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Class**: A user-defined prototype for an object that defines a set of attributes that characterize any object of the class. The attributes are data members (class variables and instance variables) and methods, accessed via dot notation.  \n",
"**Instance**: An individual object of a certain class.  \n",
"**Class variable**: A variable that is shared by all instances of a class. Class variables are defined within a class but outside any of the class's methods. Class variables are not used as frequently as instance variables are.  \n",
"**Instance variable**: A variable that is defined inside a method and belongs only to the current instance of a class.  \n",
"**Method**: A special kind of function that is defined in a class definition.  \n",
"**Method overloading**: The assignment of more than one behavior to a particular method. The operation performed varies by the types of objects or arguments involved.  \n",
"**Inheritance**: The transfer of the characteristics of a class to other classes that are derived from it.  "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create class"
]
},
{
"cell_type": "code",
"execution_count": 393,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class A:\n",
"    def __init__(self, name, value=0):\n",
"        self.name = name  # public instance attribute\n",
"        self.__value = value  # private (name-mangled) instance attribute\n",
"    def get_attr(self):\n",
"        return \"parent class value: \" + str(self.__value)\n",
"    def set_attr(self, value):\n",
"        self.__value = value\n",
"\n",
"class B(A):\n",
"    def __init__(self, name, value):\n",
"        super(B, self).__init__(name)  # init parent class if necessary\n",
"        self.__value = value  # it will not override parent value\n",
"    def get_attr(self):  # override parent method\n",
"        return \"child class value: \" + str(self.__value)"
]
},
{
"cell_type": "code",
"execution_count": 394,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"a = A('Main class', 24)\n",
"b = B('Derived class', 42)"
]
},
{
"cell_type": "code",
"execution_count": 395,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Main class\n",
"parent class value: 24\n"
]
}
],
"source": [
"print(a.name)\n",
"print(a.get_attr())"
]
},
{
"cell_type": "code",
"execution_count": 396,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Derived class\n",
"child class value: 42\n",
"parent class value: 0\n"
]
}
],
"source": [
"print(b.name)\n",
"print(b.get_attr())\n",
"print(super(B, b).get_attr())"
]
},
{
"cell_type": "code",
"execution_count": 397,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"child class value: 42\n"
]
}
],
"source": [
"b.set_attr(55)\n",
"print(b.get_attr())"
]
},
{
"cell_type": "code",
"execution_count": 398,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'_A__value': 55, '_B__value': 42, 'name': 'Derived class'}"
]
},
"execution_count": 398,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b.__dict__ # look at the namespace"
]
},
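{
"cell_type": "markdown",
"metadata": {},
"source": [
"The namespace above shows two separate entries because attributes whose names start with two underscores are name-mangled: inside a class body, `__value` is stored as `_ClassName__value`. A short sketch of this behaviour; the classes `Base` and `Child` are hypothetical and used only for illustration. Mangling protects against accidental name clashes in subclasses, but it is not real access control."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class Base:\n",
"    def __init__(self):\n",
"        self.__value = 1  # stored as _Base__value\n",
"\n",
"class Child(Base):\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        self.__value = 2  # stored as _Child__value, the parent value is kept\n",
"\n",
"obj = Child()\n",
"print(sorted(obj.__dict__))                 # ['_Base__value', '_Child__value']\n",
"print(hasattr(obj, '__value'))              # False, the plain name does not exist\n",
"print(obj._Base__value, obj._Child__value)  # 1 2, mangled names are still reachable"
]
},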
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a script which runs a long calculation function in a single process as usual."
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"\n",
"import argparse\n",
"\n",
"\n",
"def long_calc(n):\n",
"    i = 0\n",
"    while i < n * 1000:\n",
"        i += 1\n",
"    return n * 2\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"\n",
"    parser = argparse.ArgumentParser(description='Read text file with numbers, calculate results and save them.')\n",
"    parser.add_argument('-i', '--input', metavar='input.txt', required=True,\n",
"                        help='input text file with single numbers on separate lines.')\n",
"    parser.add_argument('-o', '--out', metavar='output.txt', required=True,\n",
"                        help='output text file.')\n",
"\n",
"    args = vars(parser.parse_args())\n",
"    for o, v in args.items():\n",
"        if o == \"input\": in_fname = v\n",
"        if o == \"out\": out_fname = v\n",
"\n",
"    with open(in_fname) as f:\n",
"        with open(out_fname, \"wt\") as f_out:\n",
"            for line in f:\n",
"                v = line.strip()\n",
"                if v:\n",
"                    f_out.write(str(long_calc(int(v))) + \"\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```$ time ./single_process.py -i rand_int.txt -o output_single.txt``` \n",
" \n",
"``real 0m22.080s`` \n",
"``user 0m22.060s`` \n",
"``sys 0m0.016s``"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Implement the same script using the `multiprocessing` module.  \n",
"Objects which are passed to a process must be picklable (manual serialization may be required)."
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"\n",
"import os\n",
"import argparse\n",
"from multiprocessing import Pool, cpu_count\n",
"\n",
"\n",
"def long_calc(line):\n",
"    if line.strip():\n",
"        n = int(line.strip())\n",
"        i = 0\n",
"        while i < n * 500:\n",
"            i += 1\n",
"        return n * 2\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"\n",
"    parser = argparse.ArgumentParser(description='Multiprocessing example. Read text file with numbers, calculate results and save them.')\n",
"    parser.add_argument('-i', '--input', metavar='input.txt', required=True,\n",
"                        help='input text file with single numbers on separate lines.')\n",
"    parser.add_argument('-o', '--out', metavar='output.txt', required=True,\n",
"                        help='output text file.')\n",
"    parser.add_argument('-c', '--ncpu', metavar='NCPU', required=False, default=None,\n",
"                        help='number of cores used for calculation. By default all but one will be used.')\n",
"\n",
"    args = vars(parser.parse_args())\n",
"    for o, v in args.items():\n",
"        if o == \"input\": in_fname = v\n",
"        if o == \"out\": out_fname = v\n",
"        if o == \"ncpu\":\n",
"            # use at least one process, even on a single-core machine\n",
"            ncpu = max(cpu_count() - 1, 1) if v is None else max(min(int(v), cpu_count()), 1)\n",
"\n",
"    if os.path.isfile(out_fname):\n",
"        os.remove(out_fname)\n",
"\n",
"    # the with statement shuts down the pool and closes both files when the work is done\n",
"    with Pool(ncpu) as p, open(in_fname) as f_in, open(out_fname, \"wt\") as f:\n",
"        for res in p.imap_unordered(long_calc, f_in, chunksize=10):\n",
"            if res is not None:  # long_calc returns None for empty lines\n",
"                f.write(str(res) + \"\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`$ time ./multi_process_2.py -i rand_int.txt -o multi_output_2.txt -c 2` \n",
" \n",
"`real 0m13.403s` \n",
"`user 0m26.552s` \n",
"`sys 0m0.028s`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The obtained files are identical after sorting, because the implemented multiprocessing does not guarantee the order of the output results.  \n",
"\n",
"`diff <(sort output_single.txt) <(sort multi_output_2.txt)`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Explanation of differences between `imap/imap_unordered` and `map/map_async` \n",
"http://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap"
]
},
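{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the practical difference, meant to be saved and run as a script like the examples above. The function `square` and the pool size of 2 are arbitrary choices for illustration. `map` blocks until all results are ready and returns them as a list in input order; `imap_unordered` returns an iterator that yields each result as soon as it is finished, so the order may vary between runs."
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"\n",
"from multiprocessing import Pool\n",
"\n",
"\n",
"def square(n):  # illustrative function, must be defined at module level to be picklable\n",
"    return n * n\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"\n",
"    with Pool(2) as p:\n",
"        # map: returns a list, results are in the order of the input\n",
"        print(p.map(square, range(5)))  # [0, 1, 4, 9, 16]\n",
"\n",
"        # imap_unordered: lazy iterator, results arrive in completion order,\n",
"        # therefore they are sorted here before printing\n",
"        print(sorted(p.imap_unordered(square, range(5))))  # [0, 1, 4, 9, 16]\n"
]
},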
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative implementation which uses `Queue` to pass data to processes."
]
},
{
"cell_type": "raw",
"metadata": {
"collapsed": true
},
"source": [
"#!/usr/bin/env python3\n",
"\n",
"import os\n",
"import argparse\n",
"from multiprocessing import Process, Manager, cpu_count\n",
"\n",
"\n",
"def long_calc(n):\n",
"    i = 0\n",
"    while i < n * 500:\n",
"        i += 1\n",
"    return n * 2\n",
"\n",
"\n",
"def long_calc_queue(q, output_file, lock):\n",
"\n",
"    while True:\n",
"        value = q.get()\n",
"        if value is None:  # sentinel value: no more data\n",
"            break\n",
"        res = long_calc(value)\n",
"        lock.acquire()\n",
"        try:\n",
"            with open(output_file, \"at\") as f_out:  # close the file before releasing the lock\n",
"                f_out.write(str(res) + \"\\n\")\n",
"        finally:\n",
"            lock.release()\n",
"\n",
"\n",
"if __name__ == '__main__':\n",
"\n",
"    parser = argparse.ArgumentParser(description='Multiprocessing example. Read text file with numbers, calculate results and save them.')\n",
"    parser.add_argument('-i', '--input', metavar='input.txt', required=True,\n",
"                        help='input text file with single numbers on separate lines.')\n",
"    parser.add_argument('-o', '--out', metavar='output.txt', required=True,\n",
"                        help='output text file.')\n",
"    parser.add_argument('-c', '--ncpu', metavar='NCPU', required=False, default=None,\n",
"                        help='number of cores used for calculation. By default all but one will be used.')\n",
"\n",
"    args = vars(parser.parse_args())\n",
"    for o, v in args.items():\n",
"        if o == \"input\": in_fname = v\n",
"        if o == \"out\": out_fname = v\n",
"        if o == \"ncpu\":\n",
"            # use at least one process, even on a single-core machine\n",
"            ncpu = max(cpu_count() - 1, 1) if v is None else max(min(int(v), cpu_count()), 1)\n",
"\n",
"    if os.path.isfile(out_fname):\n",
"        os.remove(out_fname)\n",
"\n",
"    manager = Manager()\n",
"    lock = manager.Lock()\n",
"    q = manager.Queue(10 * ncpu)\n",
"\n",
"    pool = []\n",
"    for _ in range(ncpu):\n",
"        p = Process(target=long_calc_queue, args=(q, out_fname, lock))\n",
"        p.start()\n",
"        pool.append(p)\n",
"\n",
"    with open(in_fname) as f:\n",
"        for line in f:\n",
"            v = line.strip()\n",
"            if v:\n",
"                q.put(int(v))\n",
"\n",
"    for _ in range(ncpu):\n",
"        q.put(None)  # one sentinel per worker process\n",
"\n",
"    for p in pool:\n",
"        p.join()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`$ time ./multi_process.py -i rand_int.txt -o multi_output.txt -c 2` \n",
" \n",
"`real 0m14.779s` \n",
"`user 0m28.624s` \n",
"`sys 0m0.160s` "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time gain does not linearly depend on the number of cores due to different reasons: \n",
"- some overhead is always present, therefore there is not necessary that if one specifies more cores it will increase the speed\n",
"- synchronization between processes reduces effectiveness\n",
"- read/write files may be a bottleneck"
]
},
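{
"cell_type": "markdown",
"metadata": {},
"source": [
"The points above can be roughly illustrated with Amdahl's law, which is not part of the original example and is used here only as a simplified model: if a fraction `p` of the run time can be parallelized, the ideal speedup on `n` cores is `1 / ((1 - p) + p / n)`. The function name `amdahl_speedup` and the value `p = 0.95` are assumptions chosen for illustration, not measurements of the scripts above.\n",
"\n",
"```python\n",
"def amdahl_speedup(p, n):\n",
"    # p - fraction of the run time that can be parallelized (assumed value)\n",
"    # n - number of cores\n",
"    return 1 / ((1 - p) + p / n)\n",
"\n",
"\n",
"# print the ideal speedup for an increasing number of cores\n",
"for n in (1, 2, 4, 8, 16):\n",
"    print(n, round(amdahl_speedup(0.95, n), 2))\n",
"```\n",
"\n",
"Under this model the speedup can never exceed `1 / (1 - p)`, that is 20 for `p = 0.95`, no matter how many cores are used."
]
},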
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Literature and other knowledge sources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Learning Python, 5th Edition, Mark Lutz \n",
"- StackOverflow: http://stackoverflow.com/questions/tagged/python\n",
"- online courses"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python3-main",
"language": "python",
"name": "python3-main"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}