{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016. \n",
"Dr. Pavel Polishchuk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic of RDKit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"RDKit is a an open-source cross-platform chemoinformatics toolkit. \n",
"Written in C++, supports Python 2 and 3, Java and C#. \n",
"BSD license \n",
"\n",
"2000-2006: Developed and used at Rational Discovery for building predictive models for ADME, Tox, biological activity \n",
"June 2006: Open-source (BSD license) release of software, Rational Discovery shuts down \n",
"to present: Open-source development continues, use within Novartis, contributions from Novartis back to open-source version "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Detailed documentation with tutorials and examples are available on http://www.rdkit.org/docs/index.html \n",
"Recently new github repository was created to manage RDKit tutorials: https://github.com/rdkit/rdkit-tutorials \n",
"I highly recommend to subscribe on RDKit maillist https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading and writing molecules"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"RDKit supports various formats: SMILES, Mol, SDF, Mol2, PDB, FASTA, etc. "
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from rdkit import Chem"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from rdkit.Chem.Draw import IPythonConsole\n",
"from rdkit.Chem import Draw\n",
"IPythonConsole.ipython_useSVG=True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading\n",
"In the case of successful reading the functions return mol object, otherwise `None`. The latter can be used to check whether reading was successful or not."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SMILES \n",
"Coordinates for 2D depiction are generated automatically."
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m = Chem.MolFromSmiles(\"c1ccccc1OC\")"
]
},
{
"cell_type": "code",
"execution_count": 169,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAEXElEQVR4nO3d3XLaShCFUXQq7//K\nOhfEjoOBYLZ+ZrrXKl+4KmVbmOGrHkkxy7quFwDe9d/ZBwAwNxkFiMgoQERGASIyChCRUYCIjAJE\nZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQg\nIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChD5dfYBUNmyLNdP1nU9\n90hgPzLKxj7TeflSz2VZlJSqLG5SX7t5eTx4KilVWdn82IvdvPuF1hv1WNa85O5W/b3vY8lRjDXN\nfW+PnK98Z6uOSixoftuvm3d/loVHGVZza1tt1d/70dYeNbjhqaNjbucUSprwv5iaWtd178at63pz\nouD1f4WJyGg7Rw6JSkoHMsq+lJTyZJTdKSm1yShHUFIKk9FeHp0YPaBiSkpVMspxlJSSZBTX7iEi\no40Mcj+8klKMjHKCOUq6LH8+4DEZ7e6sEXX0ki7LZV3/fCgpj8loF4Ps6F93ZkmvDf37aJSUR2SU\n0/wzlIeW1Oadd/kLT5zpGsrTxuSv3ZxqVGcoMtraCDv95yXdvrPSydZktIURcvnE7iWVTvYkowxh\n85JeT6quH1//06O5vcr0/aITfJDRvkYbUcOSbvxeUjeX5kf6RTEaGa1vtFw+8aOS7v4efJP80jid\njDKWV0r6+fmBx/WNbT4f3Dfa1Mgj6iv3k55/8G7I54OMFjdyLp94VNKxHo6ScrlcZJRhDZTLJ5QU\nGe1prJludkranoxCTEl7k9HKik2dxR4OZchoO2K0CwNpYzIKG1HSrmS0rGJT5xwPR0lbklHY1Olv\nf8LhZLSXOWa6yZ3/RlIcS0ZrKpbL6R6OkrYio7ALJe1DRhuZbqabnZI2IaOwIyXtQEYLejR1TjqK\nGqIZnIzCvgyk5cloQV63o/GM1CajNZV53ZbZ0Zd5RvhORsvyuh2NZ6QqGa3M63Y0NSZrbshocVOX\ntMyOntpktL6pSwrjk1GAiIy2YCCF/choF9OV1IlRZiGjjUxXUpiCjPaipLA5GW1nipLa0TMRGe1o\nhJIuH849DMj9OvsAOMe1pAdPfF+jadikDFun1g4o6RvptKNnLqbR1naaSU2dtCKj3W1VUumkLRnl\n/ZJKJ1ycG+XTiyXdO51OjDId0yi/PZ9JP+u5X+Pc/MSkZJSX7FRPpwUowAaKv+y9p74ZOS0/CpBR\nbm1bUt2kPBnljrCktuq0IqPc96OSGjnpTEZ56JUL91dWEZ3JKM98LamtOtwlo/zDAXeMwtRkFCDi\nzzYDRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwC\nRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQU\nICKjABEZBYjIKEDkfz4OV+uUbmtMAAAAAElFTkSuQmCC\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 169,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading of a structure with errors leads to errors, result will be `None`."
]
},
{
"cell_type": "code",
"execution_count": 170,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"RDKit ERROR: [21:09:49] ERROR: non-ring atom 26 marked aromatic\n",
"RDKit ERROR: [21:53:10] \n",
"RDKit ERROR: \n",
"RDKit ERROR: ****\n",
"RDKit ERROR: Pre-condition Violation\n",
"RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()\n",
"RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp\n",
"RDKit ERROR: Failed Expression: d_explicitValence > -1\n",
"RDKit ERROR: ****\n",
"RDKit ERROR: \n",
"RDKit ERROR: [21:53:39] \n",
"RDKit ERROR: \n",
"RDKit ERROR: ****\n",
"RDKit ERROR: Pre-condition Violation\n",
"RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()\n",
"RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp\n",
"RDKit ERROR: Failed Expression: d_explicitValence > -1\n",
"RDKit ERROR: ****\n",
"RDKit ERROR: \n",
"RDKit ERROR: [22:19:58] Explicit valence for atom # 6 O, 3, is greater than permitted\n"
]
}
],
"source": [
"m = Chem.MolFromSmiles(\"c1ccccc1O(C)C\")"
]
},
{
"cell_type": "code",
"execution_count": 171,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 171,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m is None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### reading from MolBlock \n",
"MolBlock doesn't contain property fields like in SDF file."
]
},
{
"cell_type": "code",
"execution_count": 172,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAADW0lEQVR4nO3d0W6bMABA0Xja//8y\ne4jUbVXXJlybjnDOUyqZhgd0ZQOBsW3bDYC9fnz3DgCcm4wCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwC\nJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDI\nKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkPz87h2A2cb4/Xnbvm8/uAoZ5SBjjO2AqI3xVzrf/QkL\nyCjLjTFut9u2bW8f1n3T+2hum5Kymoyy0Ltu3j8sjykcS0ZZ5V+r+LeYKimvQUaZ75H55hFrfDiE\nG56Y7D7NfCSO92Hjzwvru75v7Xj4ioUV0+yeXe7c8B7Ed1s9cqX+ww1hLxllginL8yf+yecdfPC+\nUTFlEhmleupi0ZeDPx8wxthuM9vnShedY4j9npqEPj74w5Hrrke50kUko+yxKKAfbnVM5sSU3WSU\npz2+EO5tOnjRbY3PDu4bZZUzJul0O8z/wH2jzDfG2NHQegMpfBOzUWZyhpELklHmEFAuS0aZ4Iyn\nQWEW50aZQEO5MhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZJQJPJyJK5NRJrg/o15MuSaP\nJmGO+8/qPeeJC5JRZhJTLsiinvm2bXt7Fd1TWy3aH1hKRlnFCVMuwtN22ePIFyw/v3dwKBllv0Ux\nFVDORUapnnqDyJeDvY+E03HIMsGU+aNJKCclo0yzeyIpoJyajDLTjiBaxXN2jmDmezCmJqG8Bhll\nlc+nmSahvAyHMgt9ON80CeXFyCjLvXVTQHlJMspBrOJ5VX5Tz0E0lFclowCJjAIkMgqQyChAIqMA\niYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQy\nCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA\nIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA8gsqvUD55tyDWgAAAABJRU5E\nrkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 172,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"molblock = \"\"\"\n",
" Mrv1661310131608212D \n",
"\n",
" 16 16 0 0 0 0 999 V2000\n",
" -5.9598 1.6732 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -6.6743 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -6.6743 0.4357 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -5.9598 0.0232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -5.2454 0.4357 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -5.2454 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -4.5309 1.6732 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -3.8164 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -5.9598 2.4982 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -7.3888 1.6732 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -7.3888 0.0232 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -5.9598 -0.8018 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -4.5309 0.0232 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -4.2289 0.5462 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -3.1019 0.8482 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -3.4039 1.9752 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 4 0 0 0 0\n",
" 2 3 4 0 0 0 0\n",
" 3 4 4 0 0 0 0\n",
" 4 5 4 0 0 0 0\n",
" 5 6 4 0 0 0 0\n",
" 1 6 4 0 0 0 0\n",
" 6 7 1 0 0 0 0\n",
" 7 8 1 0 0 0 0\n",
" 1 9 1 0 0 0 0\n",
" 2 10 1 0 0 0 0\n",
" 3 11 1 0 0 0 0\n",
" 4 12 1 0 0 0 0\n",
" 5 13 1 0 0 0 0\n",
" 8 14 1 0 0 0 0\n",
" 8 15 1 0 0 0 0\n",
" 8 16 1 0 0 0 0\n",
"M END\n",
"\"\"\"\n",
"m = Chem.MolFromMolBlock(molblock)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Mol files"
]
},
{
"cell_type": "code",
"execution_count": 173,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m = Chem.MolFromMolFile(\"data/anisole.mol\")"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAADW0lEQVR4nO3d0W6bMABA0Xja//8y\ne4jUbVXXJlybjnDOUyqZhgd0ZQOBsW3bDYC9fnz3DgCcm4wCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwC\nJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDI\nKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkPz87h2A2cb4/Xnbvm8/uAoZ5SBjjO2AqI3xVzrf/QkL\nyCjLjTFut9u2bW8f1n3T+2hum5Kymoyy0Ltu3j8sjykcS0ZZ5V+r+LeYKimvQUaZ75H55hFrfDiE\nG56Y7D7NfCSO92Hjzwvru75v7Xj4ioUV0+yeXe7c8B7Ed1s9cqX+ww1hLxllginL8yf+yecdfPC+\nUTFlEhmleupi0ZeDPx8wxthuM9vnShedY4j9npqEPj74w5Hrrke50kUko+yxKKAfbnVM5sSU3WSU\npz2+EO5tOnjRbY3PDu4bZZUzJul0O8z/wH2jzDfG2NHQegMpfBOzUWZyhpELklHmEFAuS0aZ4Iyn\nQWEW50aZQEO5MhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZJQJPJyJK5NRJrg/o15MuSaP\nJmGO+8/qPeeJC5JRZhJTLsiinvm2bXt7Fd1TWy3aH1hKRlnFCVMuwtN22ePIFyw/v3dwKBllv0Ux\nFVDORUapnnqDyJeDvY+E03HIMsGU+aNJKCclo0yzeyIpoJyajDLTjiBaxXN2jmDmezCmJqG8Bhll\nlc+nmSahvAyHMgt9ON80CeXFyCjLvXVTQHlJMspBrOJ5VX5Tz0E0lFclowCJjAIkMgqQyChAIqMA\niYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQy\nCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA\nIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA8gsqvUD55tyDWgAAAABJRU5E\nrkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### reading from SDF"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"5\n",
"5\n",
"5\n"
]
}
],
"source": [
"iterator = Chem.SDMolSupplier(\"data/logBB.sdf\")\n",
"for m in iterator:\n",
" if m is not None: # test whether molecule was read\n",
" print(m.GetNumAtoms()) # returns number of heavy atoms only, Hs were stripped"
]
},
{
"cell_type": "code",
"execution_count": 176,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 176,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mols = [m for m in iterator]\n",
"len(mols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the second case if some molecules failed to read there will be some `None`s in the output list which should be removed"
]
},
{
"cell_type": "code",
"execution_count": 177,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 177,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mols = [m for m in iterator if m is not None]\n",
"len(mols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may use Supplier as a random-access object:"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAC4UlEQVR4nO3dQZKiQBRAwWZibkTf\n/wR6puqd4eBqfGC1mLkyXBgsiGd9KHEZY3wB8Kw/sw8A4L3JKEAiowCJjAIkMgqQyChAIqMAiYwC\nJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDI\nKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMA\niYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQ/J19AJzT9/W6eeeyrq88gOW63F6Pddze\nvL2GvcgoR3lxN+9tcqmeHMpQz9k8RlNDOZSMAiSGeo5yf3l04oAPR5NRjiKdfAhDPUAio5zNWMf9\nbqevfzc/we4M9ZzQpqTu1HOoZQxnGMDzDPUAiYwCJDLKqz3+3H5H7ibxeq6NMsH39brZVbosz+Rv\nc/b67TxTuFPPr+DrnPdlqGeCy7ruPtpbijKLjDLHviXVUCZybZRplmW302/Hj4L/ZTXKNGOM5+4s\nbWgoc8koQCKjzNQXpJaiTGfDE/PtMtrDLDLKTNaSnIChHiCRUaaxFOUcZJRp1stl9iHADmSUOR6f\nTgJvSkaZQEM5ExkFSGSUV7MU5WTcKgVIrEYBEhkFSGQUIJFRgERGARIZBUg8KI9DPP5dnb2inJWM\nchTd5EMY6gESGQVIDPUc5f7yqAGfE5NRjiKdfAhDPUAiowCJjAIknjcKkFiNAiQyCpDIKEAiowCJ\njAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIK\nkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQ/2qR7jngffjsAAAAASUVORK5CYII=\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = iterator[2]\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAACm0lEQVR4nO3dS0rEQBRAUSPuSPe/\nAl1TOWv8BJG+LfWkzxmFkEENwiWhKqljrfUAwLUedw8A4H+TUYBERgESGQVIZBQgkVGAREYBEhkF\nSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCR\nUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYB\nEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgGSp90DgHPH23E5Xs/rcvJyDEPIKBN9yaV6MpmXesb5\nHk0NZTIZBUhkFCCRUYBERgESGWWc9bw+rnZ6+Lz4Caax4ImJvpTUTD2THWu5QQGu56UeIJFRgERG\nmc78EsPJKEAiowCJjAIkMgqQyChAYvk90x2Hu5TRPI0CJDIKkPg1CX9i+4Z02wfA/ZBRbm/7hnTb\nB8BdkVFu7Pcb0p1/5flyduXx+crXk2t+eOTUUP6UjLLNed2+nTNTz3CmmAASGQVIZBQgkVFubPuG\ndNsHwL0xxcTtbd+QbvsAuCvmQJnOTD3DeakHSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQg\nkVGAREYBEhkFSGQUIJFRpvOXPIaTUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhk\nFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGA\nREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZ\nBUhkFCCRUYBERgESGQVIZBQgeQdH5m+/8KNj/gAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = iterator[0]\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### reading from gzipped SDF and other file-like objects\n",
"In this case you cannot use random access."
]
},
{
"cell_type": "code",
"execution_count": 180,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 180,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gzip\n",
"iterator = Chem.ForwardSDMolSupplier(gzip.open(\"data/logBB.sdf.gz\"))\n",
"mols = [m for m in iterator if m is not None]\n",
"len(mols)"
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAACm0lEQVR4nO3dS0rEQBRAUSPuSPe/\nAl1TOWv8BJG+LfWkzxmFkEENwiWhKqljrfUAwLUedw8A4H+TUYBERgESGQVIZBQgkVGAREYBEhkF\nSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCR\nUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYB\nEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgGSp90DgHPH23E5Xs/rcvJyDEPIKBN9yaV6MpmXesb5\nHk0NZTIZBUhkFCCRUYBERgESGWWc9bw+rnZ6+Lz4Caax4ImJvpTUTD2THWu5QQGu56UeIJFRgERG\nmc78EsPJKEAiowCJjAIkMgqQyChAYvk90x2Hu5TRPI0CJDIKkPg1CX9i+4Z02wfA/ZBRbm/7hnTb\nB8BdkVFu7Pcb0p1/5flyduXx+crXk2t+eOTUUP6UjLLNed2+nTNTz3CmmAASGQVIZBQgkVFubPuG\ndNsHwL0xxcTtbd+QbvsAuCvmQJnOTD3DeakHSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQg\nkVGAREYBEhkFSGQUIJFRpvOXPIaTUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhk\nFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGA\nREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZ\nBUhkFCCRUYBERgESGQVIZBQgeQdH5m+/8KNj/gAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mols[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hydrogens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default hydrogens are removed during reading."
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAADW0lEQVR4nO3d0W6bMABA0Xja//8y\ne4jUbVXXJlybjnDOUyqZhgd0ZQOBsW3bDYC9fnz3DgCcm4wCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwC\nJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDI\nKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkPz87h2A2cb4/Xnbvm8/uAoZ5SBjjO2AqI3xVzrf/QkL\nyCjLjTFut9u2bW8f1n3T+2hum5Kymoyy0Ltu3j8sjykcS0ZZ5V+r+LeYKimvQUaZ75H55hFrfDiE\nG56Y7D7NfCSO92Hjzwvru75v7Xj4ioUV0+yeXe7c8B7Ed1s9cqX+ww1hLxllginL8yf+yecdfPC+\nUTFlEhmleupi0ZeDPx8wxthuM9vnShedY4j9npqEPj74w5Hrrke50kUko+yxKKAfbnVM5sSU3WSU\npz2+EO5tOnjRbY3PDu4bZZUzJul0O8z/wH2jzDfG2NHQegMpfBOzUWZyhpELklHmEFAuS0aZ4Iyn\nQWEW50aZQEO5MhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZJQJPJyJK5NRJrg/o15MuSaP\nJmGO+8/qPeeJC5JRZhJTLsiinvm2bXt7Fd1TWy3aH1hKRlnFCVMuwtN22ePIFyw/v3dwKBllv0Ux\nFVDORUapnnqDyJeDvY+E03HIMsGU+aNJKCclo0yzeyIpoJyajDLTjiBaxXN2jmDmezCmJqG8Bhll\nlc+nmSahvAyHMgt9ON80CeXFyCjLvXVTQHlJMspBrOJ5VX5Tz0E0lFclowCJjAIkMgqQyChAIqMA\niYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQy\nCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA\nIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChA8gsqvUD55tyDWgAAAABJRU5E\nrkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 182,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromMolFile(\"data/anisole.mol\")\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may add them back manually, however their coordinates will not be recalculated."
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAFeElEQVR4nO3dW26jSgBFUbt1Z+xp\nxGOmPywRLi8DB5oqWEv9kTi8ora3Cgo7z6ZpHgBs9efsAwCom4wCRGQUICKjABEZBYjIKEBERgEi\nMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQ\nkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGA\niIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRP47+wCo3PP5+3XTnHcccBoZJfB8/i+dvW/h\nHpzUX9bz+Xx2h4oH7KAfzaZ5HLpHKJLR6AV96tk0Ta+kjaEiHEBGL6UN6OfbzxfdqnYfB3YhoxfR\nC2hXG9P2p4aosCMZvYJuIqe0o9GmaboLSyqEvr/8KNnMIHRmlanlv5/1z0/NDyedzN1zAzJaqw0B\nXb7uyBD188hwla/3jU6tCFcho1Vacha/ZCOPBSHeZ2JKTLkuGa3M2kHo1+B+WaBzVp5fRd2l/lAa\nT+tqbAjowuVHl5xffXNSk2sRUCYZrcBxAR1da9vu/sERQplktHSrIpXnKTnvXjtEFVOuwX2jFei+\nGWn4+EcJSRq9HXXmkHq/l55SKRmtxrAyvbDudffoLpa/Y2q4pJ5SFxmt3miwej860fJ3TOkplZLR\nKo2ewu87XD3CwrN+PaUuMlqf5efjo/cwbdjOEZac9espVZDR+mxOSW8w2P3ovHDLiSVDVD2lZDJ6\na6PD1RNT9XWIqqcUSEb59RmfljBn9XViSk8ph4wyqZA5q15Sf35+Ho/H6/Xq/VRPOYuMssL8nNVw\ngeMO4P1+tw9+kqqnnMWbQUt30Hz61GbD3Z0yXG2T2g5RewfjSc6hZLR0dWV0uLXeI4c+3/SUU8ho\n6arO6Oj2u98etK/hKX9v75727EhGS3exjA5313tk972PDlH1lB2ZYqrPlT5DfupmgM90/GNwer5B\nu4XuELU7H9Wb+oe1ZLQ+Gz5ZuSK9X6rbvo/Nveuu2G52OPWvp6x1nXHNVX39e8j7/lmk8oe6vbCG\n1RteRZ26rgpTSn/NkP5NupVrlZ/Rnh2rOhyQGqKyRGWvmRta0rUNw9LLZLRnl4sAesoqdb9m7mB5\n13a5YFp7RoeS4erMKb+e0rraa+Z6hrcEtf7B25BmdlSpzcPVXkD1lJaMVmBqmDlT2FFfr7FOLbNk\nR/U+kdYOV3tDVFNSyGg18r9Wv292R3e37zZPsaqqhqg8ZLQ6G2I6s/D81tZmd8mBVVfbhRcBpoao\nenoHMlqlfDbpoBv49xrwlnwZ4etwtdtQPb0DGa3Y5hSWMx0/lcv2zaBDbY+G646udXS/ZoarenoT\npbyc2GxVTC/wLtJhtnL7pm04XJ06Zkm9Bhm9iK99vEBAt9kxu9uqN9PQySHqzDWN+/0Plk9GL2W0\nlbcN6DZ5dhdO7s+t+HyO53LqcU4loxfUdlNA/4Ft2R39uKmZjL7f79frJaNl8kF5F/TpZjsDs+G+\npW17vKe1Z/qfaB5xhZezyOhlnfsh9seZmcSHU8goO7jzaBRkFErkrL8iMgol6l1yVdWSmamH8rhv\ntCp/zj4AoO/98/Nomu6/30coj4wCRJzUA0SMRgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEi\nMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQ\nkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGAiIwCRGQUICKjABEZBYjIKEBERgEiMgoQkVGA\niIwCRGQUICKjAJG/b2du2nY1SNcAAAAASUVO
RK5CYII=\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 183,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.AddHs(m)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may generate 2D coordinates for atoms to a obtain more reasonable depiction"
]
},
{
"cell_type": "code",
"execution_count": 184,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit.Chem import AllChem"
]
},
{
"cell_type": "code",
"execution_count": 185,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 185,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"AllChem.Compute2DCoords(m)"
]
},
{
"cell_type": "code",
"execution_count": 186,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAFrUlEQVR4nO3dXXKjOBSAUZiaHdPL\nMGtmHsg4bmxjwUUIS+c8dVd+nEq5vlyBgH6apg6Avf4p/QMAfDcZBQiRUYAQGQUIkVGAEBkFCJFR\ngBAZBQiRUYAQGQUIkVGAEBkFCJFRgBAZBQiRUYAQGSWvcRxX/gsVkFGAEBkFCPm39A9A/SzkqZuM\nkt0wDPd/Syr1sagHCJFRgBAZBQjpPaceIMI0yhmcWaJiMgoQIqMAITIKECKjACEyChAio2Q3juPj\n9aBQGRkFCJFRgBAZBQiRUYAQ19STXd97m1Ez0yhAiIwChMjo1/MEYyhLRgFCZJS8nF+iep4MWgML\neShIRmvgCcZQkEU9QIiM1sxkCieQ0ZoNwzCOY8GY9n1f6qXhNM6iNuH8O37OAZ3fXY//hvrIaCvm\nmTR3TO/j5/P7ys4nauWd3ZZ8Y2nKyKmkVMnbujnHjqUr4+fK55d81z0ervXm5wgy2qj4WBoJYrGx\ntO//Sufiv7CLjLZrdwfTv3All2VKKqNk4Cqmdt1Po29aj3dbyjtN07vvP39o03eDa5LR1q2U7i7S\nu5Vcbu34HnP6lZqcbL/np3TPW+X7/03TFCnd/OXvtuKvfCik73/W7NduqNvFVsA0Stc9DYY5ltvn\nLfD3TaCOk7KXjPIr9/HK7Av8jwGdprcbnuYPKSnbyShLWc/5rOcy5UDta+kTaEpkz42phfy3k1EK\nOHCB3/f91B0Xvvn7nDuWul3st3OKiTLWTzqlnHe6n/46PnmLtf/Ryt52i8OZRvkg34ak9cHz3cS6\nY/vqvh8uxwL/nBvEcDIZ5df5VxalHCp9/LQudz3/fvmVn22T++wpoFVyMSi/3k1/J7xJ1l+l7PVO\nkVc3frbANMolfDyzVPDv/Y7NWH3f3263TkDbIKOsOXOZf8a1oQGJWwjcKKBBMsq17N86mt9K6E86\n8cUlySiXc/ESFT7xxfXIKFd3weFUQHlk+z3sccG4U4qMwmYayiMZhW0eG5rrZql8FRnlhwkrhd8S\nz2SUtyRj4eUvxECKjEISf1R4R0bhs/WGGkgbJ6PwwTiO5lBWyCisGccx5fYi0zS5E3OzZJRLK3tE\nMrGhs2EYlLRNMgqvbWroTEnbJKPwwo6G0iwZ5bWW9/dEGmogbZCM0nVtR3MhPocqaWtkFH4dtZZX\n0qbIKPxyPJQdZJTXrPGDDKTtcESMS9/F3UFbrs802rS+7+dOzVeFN3Jh+GJINDMS5FlMLXr5GMvi\nzzd+HoqNonwFGW3Lx/V7kecbX/moAnwko60Yx/HPnz8pqVo8QDirUgG1kOdAMlq/ORnDMKTv5sm9\nwH95VOFMj7+K3Eld7EV1mWl9ZLRm94Du+/IcC/z08bOR811UQEbrFAzo3YEL/K0BLX7KCxLJaG2O\nCuhdMGfp6/d3+wcqOAHlUGzd/Kmvx+EBXdha0n3j51Gvfh2OjVbP9vsvs7J1fNNJpB3St+g/7ur/\n+Gkpn9l5bBwXZlHPBh8X+IkT6L51eh0LfOpjGmWzlcEwcQJNGT9Xvr9DjVzKtx5vatZzQUodaNs0\nGB6+UTT3gWBIZ1H/fc7cOr4i8Qx+pmX4/EtwuoYrkFFC3m3RP+c6pfs9PcWUgmSUqMWZn5PPAhlL\nKU5GOUCRgD4yllKQU0wc5go75I2lnK/8+55qXCGjcD77RgFCZJQ6eeASp5FRgBAZ5RgOjNIsG56o\nloU855BRqnWRq2apnkU9QIiMAoTIKMe43W6lfwQoQ0ap0+KSUFeIko+MAoTIKECIjAKEyCgHcHs6\nWiajACEyChAiowAhMgoQIqMAIe4RCRBiGgUIkVGAEBkFCJFRgBAZZSdPMIaZjAKEyChAiCeDsp+F\nPHQySoQnGENnUQ8QJKMAITIKEOLWJAAhplGA
EBkFCJFRgBAZBQiRUYAQGQUIkVGAEBkFCJFRgBAZ\nBQiRUYCQ/wASuYLKtN9cZgAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 186,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To avoid lose of hydrogens add `removeHs = False`, it will keep them during reading."
]
},
{
"cell_type": "code",
"execution_count": 187,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAEiklEQVR4nO3d3W6rOACF0TCaN04f\nIzwzc4GUqQglmI35XUvnoo2SJpV6PtlgcNN13QOApf7Z+wMAnJuMAkRkFCAiowARGQWIyChAREYB\nIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIK\nEJFRgIiMAkRkFCAiowARGQWIyChAREapq23br4/AqckoQERGASIyChD5d+8PwPU5GMq1ySjVPZ/P\n39+qKhdjUg8QkVGAiIwCRJqu6/b+DAAnZjQKEJFRgIiMshHrnLgqGYUv3F2FaTIKEJFRgIiMAkRc\nUw/fORjKBBllIz8/P4N7lJyIu6swwaQeICKjABEZvZ3rr4Jsmv//QX2OjXItTfP4fbedwbeLfB7S\nPe9BXmqQUaprmubxeHRd9/6i3jsNo9l1q5QUJsgoFQ262X9RPaawLRm9o20OhjbN+N1s3zE9fknb\ntjV/5ysZvaPaqyDnjDe3mONnNJSZnKlnqG3bJKz9MHNOHPunNeH59NKXO33P2oxGGeqHYH1Ji4Zj\ny0aXy4elfRB/v+rzhNJfJ50GL/xgKMp8Msq4opiG0/PiU08THXxX8v3t6HMmf4iGUuQEh/nZ3XRW\nik4WfX3y9BOapukeXwaSRUbf7hSnvzgOfy7MMjpULBqEzn9y/l5FBj9ZQynlL4YCyxbSLyvgdov2\nP97OfwqK+Iuh2PzQ5AXcOGoaygJOMVHL6ZLUf+CDr2blgGSU9S2exR8hXq5YpZTl96ypaZr5y+8P\nZRDx/le42i0EqcNolHWcffj2er0+H3w+nwsuQ+BuDjGN4lw+Z9+rzMd3XML5db29mDJBRilWKW17\nZXT+NUtiyiiTephLQBnlFBO35vJ5cjLKrSUNvf7mgMwjowARGQWIyChAxJl6WM7BUB4yConamwNy\nCib1ABEZZQXp7p6VWZlEVTLKCvrbdB48plCJY6Os44a36fxcuu+CqHuSUdZ0w5iCjLK+d0yLSlo1\nuw6GUo+MUsuh9jWyMol6ZJQCpZseJ3P8jTdYhsVklFkGLSvqY2lMk/eC7VnwxHdt247uUtc/OHOd\n08wn/7Uj3vvl5uMcjU1EmDJz24xVhorzf4jNPDgUGWXcglQt3jdpWYXFlIOQUYaSPC0IYrhpnV1A\n2J2Msr6ZMXXWiGuQUWqZHmZuswE9bMCZ+tvZ7HZHf92vpH+wRkPdyYldWDdKRYMln2bxXJKMUt2y\nS+zhLEzqASJGo3d04SOGF/7VOCwZvaML3+7owr8ah2VSDxCRUTbyer32/ghQhYwCRKxBYSMufueq\njEYBIjIKEJFRgIhjowARo1GAiIwCRGQUICKjABEZBYjIKEBERgEiMkpdtpnj8mQUICKjABEZBYjY\ni4nqHAzl2mSU6mwzx7WZ1ANEZBQgIqMAEbdtBogYjQJEZBQgIqMAERkFiMgoQERGASIyChCRUYCI\njAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJE\n/gPpbkSyqwbC9wAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 187,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromMolFile(\"data/anisole.mol\", removeHs = False)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" This can be particularly important for molecules with chiral centers with attached hydrogens."
]
},
{
"cell_type": "code",
"execution_count": 188,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAC2ElEQVR4nO3cQY6bMABA0VD1Run9\nTzA5E11UQimZ6aTzB2LQeysLEYkF+rJlh2me5wsAX/Xj1Q8AcGwyCpDIKEAiowCJjAIkMgqQyChA\nIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImM\nAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQ\nyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAyc9XPwCsTbdpGc/X\nebm4jGEoMspYVrlUT8ZnUc9AHqOpoYxPRgESGQVIZBQgkVGAREYZyHyd7087Xf4+/ARjcuCJsaxK\naqee8U3z7DUF+DqLeoBERgESGWVc9pc4BBkFSGQUIJFRgERGARIZBUgcv2dc0+T95ADMRgESGQVI\nZBQgkVGAxIfyGNo0rf8Pen17W8Zv1+vjT37dbqsr794G30VGGct9Nz/dpr8v5n0rdZM9ySiv9zjl\nvDzR0ItcMgYZ5QXe7eY9x0U5EBllJ5+mc9Eb+tFiH7Ygo2zo+XQuvmUeKp3syYEnNvS/TbSW54hk\nlFFoKAclo2zryThqKMflCzpszr4852Y2yub+XUkN5ehklD181EoN5QRklJfRUM5BRtnJKpoaymnI\nKC+goZyJjLKfP/XUUE5GRtmVhnI+MgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJ\njAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIK\nkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowCJjAIkMgqQyChAIqMAiYwCJDIKkPwG7HZ8lbty8XUAAAAASUVORK5CYII=\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 188,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromMolFile(\"data/chlorofluoroethane.mol\")\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 189,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAC4klEQVR4nO3cUW6bQBRA0VB1R+7O\nso7sLFkT/ahkudhpEt/WPKpzvjDCkj/Q1YwZZlnX9QmAe33b+wcAHJuMAiQyCpDIKEAiowCJjAIk\nMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgo\nQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJ\njAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEDyfe8fAFvL23I+Xk/r+eT5GEaRUWbZ\n5FI9mc+knkGuo6mhzCejAImMAiQyCpDIKEAiowyyntbL1U5Pvy9+gpkseGKWTUk9qWe+ZV3dpgD3\nM6kHSGQUIJFR5vJ8iUOQUYBERgESGQVIZBQgkVGAxPJ75loW9ycHYDQKkMgoQCKjAImMAiQ2ymO0\nZdm+D3p6fT0fv55O11/58fa2OXPzMvhbZJSJftXzw8f0l8W8bKVu8kgyyiw3A/peVeWSCWSUKb4U\nUJhDRtnfe628ufz+M2vy35vsw78go+zsD1m8e2QqnTySjLKbz2fx+kqTfeaQUXYgoPxPZJSH+lIE\nN/N9AWUmO+jwICWCAspkRqM8wt1b3gko8xmNMpf9RjkEW5Mwl4ZyCDIKkMgoQCKjAImMAiQyypG8\nvLx8eAYeTEYBEhkFSGQUIPEyKAfjz1CmkVEO5vn5+fKjqrI7k3qAREYBEhkFSGxEBpAYjQIkMgqQ\nyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKj\nAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIk\nMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEDyE9kKlrg0xl0YAAAAAElFTkSuQmCC\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 189,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromMolFile(\"data/chlorofluoroethane.mol\", removeHs = False)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SMILES reading automatically takes into account chiral hydrogen and preserves them."
]
},
{
"cell_type": "code",
"execution_count": 190,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAADQElEQVR4nO3dQW6bQABAUVPlRu79\nT9CciS6QrBTHSeyPYdK8t0IkNrMYfQFj42me5xMAj/p19AAAvjcZBUhkFCCRUYBERgESGQVIZBQg\nkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERG\nARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVI\nZBQgeTl6APC+6XW6bM/n+bLzsg2DkFFGtMqlejIyF/UM5zqaGsrIZBQgkVGAREYBEhkFSGSU4czn\n+e2nnU7/fvgJRuMDT4xoVVIr9YxsmmcTFOBxLuoBEhkFSGSU0VlfYnAyCpDIKEAiowCJjAIkMgqQ\nyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEDi6feMbprMUobmbBQgkVGAREYBEhkFSGQUIJFR\ngERGARIZBUhkFCCRUYBERgESGQVIZBQgeTl6AHDTNE2n02me52Vj2T50RPAOjyBjRJeArvZcmLeM\nQ0YZy3VAV39aMYE5nIwyig8Cev1v18xkjiKjDOGuR9zfKulJTDmCjHKwL56EvvuqD5jY7EZGOcxj\nAV29/FNmOM8moxwgBnT1Pl9kqvMkMsretv2lz7tKujDn2ZaMsp+tTkJvvfO9TH42IaPs4XkBXR3i\n68x8tuLLoDzXDgFdvP3OKOzJ2SjPsltArw/6KdOeDXnCE0+xrCPtXyt9ZH8yysamadp2Lf5enx5a\natmWi3o2c8hV/C2+es9uLDGxgaECulgG8/Gt0t+vr6s9f87nJ46J/5SMkgwY0LdWy/fX49RNOvdG\nedxR60h3GXx4/AdklMd9l0It4/wuo+XbscTEz7W6N+oCn8e4N8qPJp10LuoBEhkFSGQUILHEBJA4\nGwVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQU\nIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBE\nRgESGQVIZBQgkVGA5C83geFCA3ZQFAAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 190,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles(\"[C@@H](F)(Cl)C\")\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, explicitly specified Hs will be removed."
]
},
{
"cell_type": "code",
"execution_count": 191,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAC/UlEQVR4nO3cQW6bUBRAUVNlR8n+\nV1CviQ4iWWniqBXXwLNyzgh5gBigq/fhm2Vd1wsAW/06+wIAnpuMAiQyCpDIKEAiowCJjAIkMgqQ\nyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKj\nAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIk\nMgqQvJx9AXDfcl1ux+vrevvxdgxDyCgTfcqlejKZRT3jfI2mhjKZjAIkMgqQyChAIqMAiYwyzvq6\nftztdPl78xNMY8MTE30qqTf1TLasqxsUYDuLeoBERgESGWU675cYTkYBEhkFSGQUIJFRgERGARIZ\nBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSHz9numWxV3KaKZRgERGARIZBUhkFCCRUYBERgES\nGQVIZBQgkVGAREYBEhkFSGQUIJFRgOTl7AuAby3Lcrlc1nW9HZx9RXCHT5Ax0d1uiikzySiz/LOV\nPj/KNO5Ipvj/YdNYyigyyggbZkxjKUO4ETlZGS2NpUwgo5zmUREUU85l3ygnWJblfUn+kPa9n+ft\neu2ngg1Moxxtv2ea7yX9/fq6x8nhOzLKcY5ZfYspB5NRjnD840sx5TAyyr7Off8jphxARtnLnBfo\nYsquZJRdDNwbL6bsZNy9zrObM4Te9Xa9KimPJaM8zPCAwk58b5QHeNKAft2xb1BlAxkledKA3ugm\nnYyy3cD3SHA8/6lnOw2Fi2mUH+7j41ELfLaRUX406aSzqAdIZBQgkVGAxIYVgMQ0CpDIKEAiowCJ\njAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIK\nkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAi\nowDJHxP3ziAkhbdlAAAAAElFTkSuQmCC\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 191,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles(\"C([H])(F)(Cl)C\")\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Writing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### SMILES\n",
"By default saving to SMILES provides canonical SMILES but without chirality"
]
},
{
"cell_type": "code",
"execution_count": 192,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'CC(F)Cl'"
]
},
"execution_count": 192,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles(\"[C@@H](F)(Cl)C\")\n",
"Chem.MolToSmiles(m)"
]
},
{
"cell_type": "code",
"execution_count": 193,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'C[C@H](F)Cl'"
]
},
"execution_count": 193,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Chem.MolToSmiles(m, isomericSmiles = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### MolBlock"
]
},
{
"cell_type": "code",
"execution_count": 194,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" RDKit \n",
"\n",
" 4 3 0 0 0 0 0 0 0 0999 V2000\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 1 0\n",
" 1 3 1 0\n",
" 1 4 1 0\n",
"M END\n",
"\n"
]
}
],
"source": [
"molblock = Chem.MolToMolBlock(m)\n",
"print(molblock)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, chirality was lost. Specifying `includeStereo = True` doesn't help."
]
},
{
"cell_type": "code",
"execution_count": 195,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" RDKit \n",
"\n",
" 4 3 0 0 0 0 0 0 0 0999 V2000\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 1 0\n",
" 1 3 1 0\n",
" 1 4 1 0\n",
"M END\n",
"\n"
]
}
],
"source": [
"molblock = Chem.MolToMolBlock(m, includeStereo = True)\n",
"print(molblock)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may note that `MolToMolBlock` produces output with all coordinates identical. Therefore it may be useful to generate them to obtain reasonable 2D depiction. At the same time stereoinformation will appear."
]
},
{
"cell_type": "code",
"execution_count": 196,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" RDKit 2D\n",
"\n",
" 4 3 0 0 0 0 0 0 0 0999 V2000\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1.2990 -0.7500 0.0000 F 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 1.5000 0.0000 Cl 0 0 0 0 0 0 0 0 0 0 0 0\n",
" -1.2990 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 1 1\n",
" 1 3 1 0\n",
" 1 4 1 0\n",
"M END\n",
"\n"
]
}
],
"source": [
"AllChem.Compute2DCoords(m)\n",
"molblock = Chem.MolToMolBlock(m)\n",
"print(molblock)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Mol files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may use Python file objects to write mol files."
]
},
{
"cell_type": "code",
"execution_count": 197,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(Chem.MolToMolBlock(m), file=open('data/foo.mol','w+'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"#### SDF"
]
},
{
"cell_type": "code",
"execution_count": 198,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = Chem.SDWriter('data/foo.sdf')\n",
"for m in mols:\n",
" w.write(m)\n",
"w.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may use file object and write molecules to gzipped sdf"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"f = gzip.open('data/foo.sdf.gz', 'a')\n",
"w = Chem.SDWriter(f)\n",
"for m in mols:\n",
" w.write(m)\n",
"w.close()\n",
"f.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analogously there is SmilesWriter to create text files containing SMILES representation of molecules"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Structure Sanitization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"http://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization \n",
"The molecule parsing functions all, by default, perform a “sanitization” operation on the molecules read. The idea is to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to ensure that the molecules are “reasonable”: that they can be represented with octet-complete Lewis dot structures.\n",
"\n",
"Here are the steps involved, in order.\n",
"\n",
"1. `clearComputedProps`: removes any computed properties that already exist on the molecule and its atoms and bonds. This step is always performed.\n",
"\n",
"2. `cleanUp`: standardizes a small number of non-standard valence states. The clean up operations are:\n",
"\n",
" - Neutral 5 valent Ns with double bonds to Os are converted to the zwitterionic form. Example: N(=O)=O -> \\[N+\\](=O)O-] \n",
" - Neutral 5 valent Ns with triple bonds to another N are converted to the zwitterionic form. Example: C-N=N#N -> C-N=[N+]=[N-] \n",
" - Neutral 5 valent phosphorus with one double bond to an O and another to either a C or a P are converted to the zwitterionic form. Example: C=P(=O)O -> C=\\[P+\\]([O-])O \n",
" - Neutral Cl, Br, or I with exclusively O neighbors, and a valence of 3, 5, or 7, are converted to the zwitterionic form. This covers things like chlorous acid, chloric acid, and perchloric acid. Example: O=Cl(=O)O -> \\[O-\\]\\[Cl+2\\][O-]O \n",
"This step should not generate exceptions. \n",
" \n",
" \n",
"3. `updatePropertyCache`: calculates the explicit and implicit valences on all atoms. This generates exceptions for atoms in higher-than-allowed valence states. This step is always performed, but if it is “skipped” the test for non-standard valences will not be carried out. \n",
"\n",
"4. `symmetrizeSSSR`: calls the symmetrized smallest set of smallest rings algorithm (discussed in the Getting Started document). \n",
"\n",
"5. `Kekulize`: converts aromatic rings to their Kekule form. Will raise an exception if a ring cannot be kekulized or if aromatic bonds are found outside of rings. \n",
"\n",
"6. `assignRadicals`: determines the number of radical electrons (if any) on each atom. \n",
"\n",
"7. `setAromaticity`: identifies the aromatic rings and ring systems (see above), sets the aromatic flag on atoms and bonds, sets bond orders to aromatic. \n",
"\n",
"8. `setConjugation`: identifies which bonds are conjugated \n",
"\n",
"9. `setHybridization`: calculates the hybridization state of each atom \n",
"\n",
"10. `cleanupChirality`: removes chiral tags from atoms that are not sp3 hybridized. \n",
"\n",
"11. `adjustHs`: adds explicit Hs where necessary to preserve the chemistry. This is typically needed for heteroatoms in aromatic rings. The classic example is the nitrogen atom in pyrrole. \n",
"\n",
"The individual steps can be toggled on or off when calling `MolOps::sanitizeMol` or `Chem.SanitizeMol`."
]
},
{
"cell_type": "code",
"execution_count": 199,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"RDKit ERROR: [22:19:59] non-ring atom 1 marked aromatic\n"
]
}
],
"source": [
"m = Chem.MolFromSmiles('Cn(:o):o')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Working with molecules"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Looping over atoms and bonds"
]
},
{
"cell_type": "code",
"execution_count": 200,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m = Chem.MolFromSmiles('C1OC=C1')"
]
},
{
"cell_type": "code",
"execution_count": 201,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6\n",
"8\n",
"6\n",
"6\n"
]
}
],
"source": [
"for atom in m.GetAtoms():\n",
" print(atom.GetAtomicNum())"
]
},
{
"cell_type": "code",
"execution_count": 202,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SINGLE\n",
"SINGLE\n",
"DOUBLE\n",
"SINGLE\n"
]
}
],
"source": [
"for bond in m.GetBonds():\n",
" print(bond.GetBondType())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Individual atoms and bonds can be accesses as well as their properties"
]
},
{
"cell_type": "code",
"execution_count": 203,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'O'"
]
},
"execution_count": 203,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetAtomWithIdx(1).GetSymbol()"
]
},
{
"cell_type": "code",
"execution_count": 204,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 204,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetAtomWithIdx(2).GetExplicitValence()"
]
},
{
"cell_type": "code",
"execution_count": 205,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 205,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetBondWithIdx(0).GetBeginAtomIdx()"
]
},
{
"cell_type": "code",
"execution_count": 206,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"rdkit.Chem.rdchem.BondType.SINGLE"
]
},
"execution_count": 206,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetBondBetweenAtoms(0,1).GetBondType()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may loop through neigbours of particular atoms"
]
},
{
"cell_type": "code",
"execution_count": 207,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"root atom C is connected to O by SINGLE bond\n",
"root atom C is connected to C by DOUBLE bond\n"
]
}
],
"source": [
"atom = m.GetAtomWithIdx(2)\n",
"\n",
"for nei in atom.GetNeighbors():\n",
" print(\"root atom %s is connected to %s by %s bond\" % (atom.GetSymbol(), nei.GetSymbol(), m.GetBondBetweenAtoms(atom.GetIdx(), nei.GetIdx()).GetBondType()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Molecules properties"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may set and read properties of molecules, which can be stored in property fields of sdf files."
]
},
{
"cell_type": "code",
"execution_count": 208,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"m.SetProp(\"Activity\", \"inactive\")"
]
},
{
"cell_type": "code",
"execution_count": 209,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m.SetIntProp(\"Boiling point\", 40)"
]
},
{
"cell_type": "code",
"execution_count": 210,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'40'"
]
},
"execution_count": 210,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetProp(\"Boiling point\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Magic properties"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot of 'magic' properties of atoms/bonds and molecules. More details can be found at http://www.rdkit.org/docs/RDKit_Book.html#magic-property-values \n",
"One of them is a title or a name of a molecule (\"\\_Name\")"
]
},
{
"cell_type": "code",
"execution_count": 211,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m.SetProp(\"_Name\", \"molecule name\")"
]
},
{
"cell_type": "code",
"execution_count": 212,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'molecule name'"
]
},
"execution_count": 212,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetProp(\"_Name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you save this molecule to sdf file, \\_Name property will be stored as a title, all others as ordinary property fields."
]
},
{
"cell_type": "code",
"execution_count": 213,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"w = Chem.SDWriter('data/bar.sdf')\n",
"w.write(m)\n",
"w.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3D structures and conformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generation of 3D structure"
]
},
{
"cell_type": "code",
"execution_count": 214,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAEEklEQVR4nO3d0VLjOBRFUXtq/v+X\nPQ/0pNOOoSEnjnWv1ioeqK4CHFpsJMUR67ZtCwDP+ufqCwCoTUYBIjIKEJFRgIiMAkRkFCAiowAR\nGQWIyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWI\nyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIKEJFRgIiMAkT+vfoC6Gxd1493tm27\n9krgPDLKi93SudzVc11XJaUrg5vUfTeXzyeeSkpXRjY/9s1uHn6g8UY/hjXfcrhUf+7zGHI0Y0xz\n7Okp53c+s1FHJwY0v5zXzcOvZeDRhtE8tVct1Z/70sYePRjKMxrkdk4lpQevYprUtm2XJ2zbtt1O\nAlQko9MZag64bZuQUp2McrFtW5SU0mSU6ykppckoQ1BS6pLRuQy1MbqjpBQlowxESalIRgEiMjqR\nkVf0NyaklCOjDEdJqUVGGZGSUog/IsKgPkp64SbEICcPMD4ZnUWJjdGdN1/vZ0cFVvzW8U4yytS+\nc1TgxxEqSspnZHQWGvDh/2z+bMGupHzByJjDtbuMl9o9VZV8G5SUQ2ajdPPCbu6Yk3JIRqnhcT69\n+5dbPU+tnJLyyH2jE+iyov/6TtJt+/V2Nof2syOjlOGefMYko/BjJqTck1EqGWdCqqTcyGh3XTZG\nb5SU0cgoPE9JWWSUisaZkC5Kiow2125FfzPUw1LSyckoNTx2U0kZhIy2NlRpulPSackovIySzklG\nqaFKnZR0QjIKL6aks5FRCih3x4GSTkVG4RRKOg8ZBYg4gLaX++lPl//Zciv6e854noHT7xt5PA7e\nD/DVnJY/A4v6Lh6jOdQrzydmk7Q9GYXTKWlvMsrQ2uxMKGljMgpvoqRdySi8j+eaWpLRLh6fUKq/\nHq7/CJiCG54a2ZW0Q4HWZWnwKGhORnvpkE4oxqJ+GtWe3HDXOlXI6DTcjQ/nkNGZKCmcQEYno6Tw\najI6nwoltTFKITIKEJHRKVWYkEIVMjqrgUtqRU8tMjqxgUsKhcjo3JQUYjI6PSWFjIyyLCOdg2lj\nlHJklGUZ40ThdV0vvwZ4gt/8/Pb+meB9Nw1FipJR/nB2SXfzTcOPBmSUvdeWVDdpT0Y5EJbUUp2p\nyCjHflRSU05mJqN86ouS6ibcyChfuS+ppTocklH+4lZPQwUOyShAxKuYACIyChCRUYCIjAJEZBQg\nIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMA\nERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQg8h/JuZzj\nGKpoOwAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 214,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles('O1CCN(CC)CC1')\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since by default RDKit doesn't keep hydrogens they should be added before 3D structure generation to obtain rasonable geometry"
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAGkklEQVR4nO3dXXKcOBSA0WZqdtxZ\nhnvNzAMTCtN/Aklwhc6pPEwqLjfO2F+uEA3DOI43APb65+wDAGibjAJkkVGALDIKkEVGAbLIKEAW\nGQXIIqMAWWQUIIuMAmSRUYAsMgqQRUYBssgoQBYZBcgio3zyeDw+/Ba4yShAJhkFyPLv2QdAdEcu\n5B+Px/1+f/dbiElG+WLVtROPBGKyqOci7IZxFtMoscgfzZFRYnEOgebIaBd2b92sPiz4ho/scgoZ\nZY+Ye+gmWU5hi4k97ve7TsHENNqLyNWbZ9u2ziHAREZ7UXzBOw2k+aWLeX4A0lnUs1/+0r5gQw+e\nZF2mykxGOY05lGuQUbIE32uKfGxchnOjXai64N13kvQCo6hGMzGN9uj0n//DGlp1WL4vVHoJmiCj\nFLCpVheYQ2FJRntUY0a73+/DMHz9sOMbGvzsLRcgoxxnGIam51A55iUZpZhxHD8MpMMwjON45PHM\nigykqznaG66YySh853wuH5w2IHC6Sml4OXWeOIrO9n290xiroXzgulEKm5b2y2hGaOg+hlBSWNRT\nV5yGbj1DqqEkklHK+7zXdKI/f/4kfqSGkk5G+3XABZVxRtFJYt81lE2cG6WWaA1N1OhhcyLfMV2r\nkYxp3Js+bcwkfTiqmAdMcKZRypgXy8sMzYvoJtqkoezj+6Z3me14Wc/ir1Lc8/FEO0IaYhplp01j\n5vPFpKFEPjbik1G2SRw/n4Va4C+zrqFkklGS7K7nUsBshToYGiWjfFF8hBzHcRhuEdqloRQho7xW\nZPx8ZxxvZ5V09XUpKfl8D/ErJVXr+fS60wvVfp3ptd5+XXHO2NIo0yj/O74m00tVHUtT/lUwlpLJ\ne+pDW73nvdRb4Iffbn8jckpHpgV+WcPw/9c4/pVwGEFvp0J8ptFWpSf1+bZG0a48n0uafxSLz7P5\ncwW/uJWwZLRV6bcgauJmRfkL/CIhjnBx6+r+Um43FZ+MRtfV0yj37eCX3apyqpStZDS61WBS/PNH\n68WmY6m312+BTzoZpUkHXCx1Ykm7WoJcgIzSpGPidtip0tXzR2svQShLRolrNXKe8sanqqdK50Ta\nRGqajIa2+unyw3aW4gv81fhJ02SU0E589/3TkRRY4Bs/L8leZNeCb0ZPAZ0zGqSnO/7ShmH4+fm5\nqedFmUaJLs5AOtm0wD/9Yn4OIKOw2deSHnmjLE4no11r5Ue8xu1LMr07VWr87JCMwk6ra6EEtFuh\ndxioLtRJx2YJaOfcb5Tooi3nV4Jf7cABZBT201BuMto1K/py3Dy/ZzJKaJFTvxpFlbRbMgp7PB4P\ny3kmMgqbvXuwh4G0TzIK23x+OJKSdkhGAbLIaMec2tsu5Tmd4zi6ZX1XZJS4om3Tpz/r+H6/K2k/\nZBRSuVsoL8kokTW8V2Mg7YeMQi1K2gkZBcjixgr9WV7VGPj//mXu+pG+MUWjTKOdmTa/518uFP9t\ntQYvsiS3tL88GQXIIqNQnYH02jyLCX6p1DunRy9MRuGXZe+MkKSwqOevSNtNl9mmpwem0c6sdueX\nqZr+SLxgIxntz4dQzpEVU0hm6cQrp46ln297/PPzM//3y32b1eXurn6nNtMor5y0wJ8COt1APuUf\n+NUWUPxcSvwlyShvHLjAn8fPOZ2JJdUgIpBR3psqVnMsncfPsp/WhUocSUb5psIC/3n8LCvytZ/R\njod8MkqCv0+7zK9epfGzIZETzz4ySpIpfLuvit8xfqZvNMG5ZJQNxo1jae3FO0Qgo2yTOJaeuHhf\nbd/bzac2iyZ2elnSsuOnRT1N8G3KfsuRs9L4qaTEZ1HPflUDOlm+9RNicqM8spgWQUbZT0PhJqPs\ntmzofCFUcZ5iRHzOjbKZdyLBkmmUbaYh9Lmh
9QZSCE5G2cDJUHgmo6T62lADKX2SUZI8Hg9zKLwk\no3yX/qyLGgOpzXqCk1G+2Pq8IO87ojcyyic7nrlmeKQ3MspbcZ5bGeQw4CUZ5bWchhpI6YqM8kL+\nHKqk9ENGWYuzlocmyCi/FGyogZROeG8fdZltuTzTKNGtRloTLtHIKHVZ2nN5Mtqd44c7i3quzW2b\naYB5lshklAYs51lJJRoZ7ZESQUEy2iPDHRRkiwkgi4xykN1XCKw2+u37E42MAmSR0e4Y7qAsW0wc\nx3YWlySjHMcVAlySRT1AFhkFyCKjAFncthkgi2kUIIuMAmSRUYAsMgqQRUYBssgoQBYZBcgiowBZ\nZBQgi4wCZPkP7zTIC53uSsoAAAAASUVORK5CYII=\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 215,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.AddHs(m)\n",
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This command generates 3D structure for a molecule usinf distance matrix approach"
]
},
{
"cell_type": "code",
"execution_count": 216,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 216,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"AllChem.EmbedMolecule(m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The obtained geometry usually is quite ugly and refinement is neccessary. This can be done by using universal force field (UFF) or Merck molecular force field (MMFF)."
]
},
{
"cell_type": "code",
"execution_count": 217,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 217,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"AllChem.UFFOptimizeMolecule(m)"
]
},
{
"cell_type": "code",
"execution_count": 218,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAFoUlEQVR4nO3dXZKbOBSAUUhlx/Yy\nzJrJA9UuDMGNfQXo55yHVE1nKrEzk68lIcn9OI4dAN/6c/ULACibjAKEyChAiIwChMgoQIiMAoTI\nKECIjAKEyChAiIwChMgoQIiMAoTIKECIjAKEyChAiIwChMgoQIiMAoTIKECIjAKEyChAiIwChMgo\nQIiMAoTIaCuGYfj1K8AXZBQgREYBQmQUIOTv1S+A81gMhSPIaENut9v8H1UVkjCpBwiRUYAQGQUI\n6cdxvPo1ABTMaBQgREYBQmS0LTY5QXIyChAiowAhMgoQIqMNGYZhcR4UiJNRgBAZBQiR0WL4FBDI\nk4w25H6/X/Vb+x5AxWS0FX3v/gQ4hIwChLj9viTvJ8LTz9rSBCeT0ZK8/xSQ6WezjanFUGolo7WZ\nx3R6pjSOYw4Loz4JilrJaJ2mZk0/9n3fecQEh5HRJkwBnXoqppCWjBZjvdz56QKomMIRTPQqtzWX\nF1NIxWi0UfORaaenEGD7fc22hqLPp+Tjj77vn0kFPiKjdF3XPR6PI2JqVxMtkNF6fb7DaT4yjfe0\nhVuiXblCZ220Wn3ffbvc+Yyvx1Cwh4w2as+gKbJBqoWhKExktEY7hqL7j2Z+EVMNpSkyWp3AdP6N\n/RukWjt1ajEUGa3LMQ19smy65soVZJRvbM30WxuKQiejVdk9FI0fz584pA+djFblopAtlk2hNaZg\nJDPN6J3TpzVOMbXroIchzunTGpN6jvI62Tc2pVpGo41KvkN+6xn9z9i06/uu4rGpfU4tMxrlJFNj\np5IamVITGeVUYkp9ZLQ084nxtx064sz74/HY/y8vXvh6w+vBp7EgJRktyqIuFcWmordCczxiKse6\nNNODm8/d7/c0L+lHfHj77VuJSnLvshutGiejACEyWpHCtxRdNSCFIGujFZk/BZ9/5VXyS5gSTmmn\nkq5fnZVTciaj1Xl9BlXHEaLnQPWI9xHfOX+/362NtkxGy7Eeqv06SPvJ56HXhaQtyNbU/rgNp+5d\nJkhGi7JozO6cLG6tz/xy5TeLpHbvkyEZLU04Hhnetbx+Fb8NsrtOTMmGjDZkPgjNMKafElMykfXk\njrS25vKXxDTtw/dpraK76FtC5oskHM1otBXDMGz9VT9/ZJp8A9M4dl1X/PiaQsloE/Zs7axgmt/V\n8i4oi1NMvDjh8z9O2Ev/67uI72rqZzqf6Nc2Ga3fF6eM3mcocp3HmeeRtt7Fzj+Q/q1xZvq9FmGl\nHTLKpnmGkqThkjOd65jOL7jaGcq1+a/wLOni4/yGYbCZvwXWRiuX4gq7l637hS44vn6+XspjXf+9\nr3r+yz5L6sBorWS0cgn/6s5L9MUvm8P1ItM48eTvBM8/Kz2tlYzymXEch2GYYrr/g0Nq3Vn50WB/\n0VMxrYaM8o3FvqL3K4D5NDSTV7K+DOXXr5AzGeV7e6b5mZTrIGJHJ6PEzaf5i2LW3VCYyChprI8P\naSiN8D962fJcVku4nSihbLP+35Xly/8jsp/RKOk99xU9H+iLwntu4C+ajHKg5+gvhy0++7dnwUcc\nBuUMt9vtdrstDkdGzuZDPoxGi5dherZWIYue2ue5DE0OZLR4ltVKt26xOpfFpJ4mGDlyHKNRrmTs\nTAVklPT2PxMva0Ui85fHVWS0bJbVzlRW9DmNtVESswpJa2SUJig7x5FRLmNFgjrIKOwi+myRUVKy\nMEqDZBQgJNMbGAFKYTQKECKjVMtFfJxDRgFCZBQgREYBQlxNQs0shnICGaVm7mTiBCb1hHgaDjIK\nECKjACEySrXcycQ5PGIiymIojZNRojwNp3Em9QAhMgoQIqMAIa5tBggxGgUIkVGAEBkFCJFRgBAZ\nBQiRUYAQGQUIkVGAEBkFCJFRgBAZBQiRUYAQ
GQUIkVGAEBkFCJFRgBAZBQiRUYAQGQUI+QcCUhJp\n/0DrJQAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 218,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 219,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"AllChem.MMFFOptimizeMolecule(m)"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAFrUlEQVR4nO3da27bRhSAUbLIjqNl\nmGue/phGVUSLlniH5DzOQVGgjuHQaPxlXiTnlNIEwF7/XH0BAG2TUYAQGQUIkVGAEBkFCJFRgBAZ\nBQiRUYAQGQUIkVGAEBkFCJFRgBAZBQiRUYAQGQUIkVGAEBkFCJFRgBAZBQiRUYAQGQUIkVGAEBkF\nCJFRgBAZBQiR0SEsy/LjR4B9ZBQgREYBQmQUIOTX1RfASSyGwkFkdBS/f/9+/E9VhVJM6gFCZBQg\nREYBQuaU0tXXANAwo1GAEBkFCJHRgTjkBEeQUYAQGQUIkVGAEBkdxbIsT/eDAkXIKECIjAKEyChA\niJtBRzHP/l/DIYxGOYN36tExGW2GEkGdZBQgREaHYGEUjuNdTP24z/HrPGZvCYJeyWhLtkt0r2f+\ntNvtNk1TPYNQ79SjVzLakjdLlD8t/3ue5/xB83o4iLXRzuV0ppRSSvM836sKlCKjA3mM6VNPTbFh\nNxnt3Houn1L6+vo6eXC63vWqcx8MdrBe1rNX66GPD827l9SfBNjHFtPocj2XZck9FVP4lIz2a3Nr\nfr0Ymj+57ODUs6IZgYx2ap6nzQ6+Ojt1r2d8cKqhDEJG+d4Rg1Pokoz26Keh6Pt2D04NRRmHA0/d\nKdfQRxtnTtcGbKjHGI5MRvvyXkN3n+JMf2zHNN/OD4MwqWePjZVTN+8zGhntyDHT+Q0Ft/WhXTLa\nkQ8rVnAF82lwOuaA1GLosGSUYvKa6VNSp2FGqR6oOiwZpZjHQejTamn+2PTxiBkaYKeew/3Z3p9S\nmub5/3+q4sQSuxmNDuqqo52Po9FcUuNTWmc0ymXy+DRbD05rG65u80DVkckoZcR359vqJtyZ1Lfm\nMTY1zYe/vr6CXyGvnF74PVkMZR8ZbcpTZgLVud1uZp1P4ieWBnyYAJNJfUvW0czjt4509w0xBBml\nLq9KWuEZKchM6vty0crpCZPZ/N04I0WFZLQv62OZ0/TfLUTttGdjan9cTOMnliyMDktG+3W/L3Oa\nptZuct9eJDUypSoy2o71gaBPdurXN7lX9Rym9YX8eGliSiVsMTXl6ab0QDzeeYj9+y6czD7eqr+D\ns6LEVTQe4RzrQeiFD10ue94+D7Knt7+XsquZVY3uOZNJPX89dPnMEBS/ZymlKS8Fexo/Z5LRsSzL\n8iouJ7+Y/tD7Pt/5i8FQlFJkdCDvhOOc1yudc++89+5xDhkdxaeDr+Nm+ic/f8R79ziajLKleEwv\nfIbT+r1761+FHRx4GkJwHfDxdFSuz75Xblz7HLws5zKtzG9bf01LBIMzGh1Ckb2Up9nxp1+z8ta8\nf21KyhMZ5WMppWVZhl1qXH/LuaGnHXKgNib17LSe6W+oZ7BW/EruX3C9PlDwd6Fmtfzhpi3rldDb\n7Ta9GIjV09DpyIx++0v53Sqe/NQ3k3p2+vbJcuuZflUNPcLGS6ju3/jj3zqS2h8ZpaSr7iut3GM6\n70nV027IKOU9ndCsSvwNpo92nCRbf/76i3gCdFtsMfVg3ynOo903XvJ+y/K3+6fVefHwPqNR9vjo\nlRtm9/TNaBT2M/VmMhrlOBXu0Ve75mgdo2ky2ommfw6bvvgi1ltMV10JO8hoJ5r+OWz64sHaKEcp\ne7QIqiWjACEyykDq3F/66PQYFZLRHjT9c9jExbtHgA0yyiGqPVoExckoQIiMAoQ4NwpvsRjKKzIK\nb3GPAK+Y1HOISvaX7LBzAhkFCJFRgBAZhZ81cY8AV7HFROcshnI0GaVzdtg5mkk9QIiMUoajRQxL\nRgFCZJSe2WHnBDIKEGKnnmIshjImGaUYR4sYk0k9QIiMAoTIKEDInFK6+hoAGmY0ChAiowAhMgoQ\nIqMAITIKECKjACEyChAiowAhMgoQIqMAITIK
ECKjACEyChAiowAhMgoQIqMAITIKECKjACEyChAi\nowAhMgoQ8i8269tgytJldQAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 220,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Generation of multiple conformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot of input options. The major ones are the number of conformers and RMS threshold, which will help to discard too similar 3D structures and keep diverse conformers."
]
},
{
"cell_type": "code",
"execution_count": 221,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cids = AllChem.EmbedMultipleConfs(m, numConfs=10, pruneRmsThresh=1)"
]
},
{
"cell_type": "code",
"execution_count": 222,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"7"
]
},
"execution_count": 222,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(cids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You may see that from 10 required conformers only 7 were generated. If you will descrese the RMS threshold value more conformers will be kept up to specified maximum value (10)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generated conformers require geometry optimization to produce more reasonable 3D structures."
]
},
{
"cell_type": "code",
"execution_count": 223,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for cid in cids:\n",
" AllChem.MMFFOptimizeMolecule(m, confId=cid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For conformers you may corresponding return energy values. Since we optimized geometry with MMFF we will use the same force field for energy calculation."
]
},
{
"cell_type": "code",
"execution_count": 224,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"64.09232266554531\n",
"56.52251338020919\n",
"64.09232254579031\n",
"64.0923226735707\n",
"56.52251334421818\n",
"56.522513373936874\n",
"64.53225822753267\n"
]
}
],
"source": [
"for cid in cids:\n",
" ff = AllChem.MMFFGetMoleculeForceField(m, AllChem.MMFFGetMoleculeProperties(m), confId=cid)\n",
" print(ff.CalcEnergy())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Substructure search"
]
},
{
"cell_type": "code",
"execution_count": 225,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAAEEklEQVR4nO3d0VLjOBRFUXtq/v+X\nPQ/0pNOOoSEnjnWv1ioeqK4CHFpsJMUR67ZtCwDP+ufqCwCoTUYBIjIKEJFRgIiMAkRkFCAiowAR\nGQWIyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWI\nyChAREYBIjIKEJFRgIiMAkRkFCAiowARGQWIyChAREYBIjIKEJFRgIiMAkT+vfoC6Gxd1493tm27\n9krgPDLKi93SudzVc11XJaUrg5vUfTeXzyeeSkpXRjY/9s1uHn6g8UY/hjXfcrhUf+7zGHI0Y0xz\n7Okp53c+s1FHJwY0v5zXzcOvZeDRhtE8tVct1Z/70sYePRjKMxrkdk4lpQevYprUtm2XJ2zbtt1O\nAlQko9MZag64bZuQUp2McrFtW5SU0mSU6ykppckoQ1BS6pLRuQy1MbqjpBQlowxESalIRgEiMjqR\nkVf0NyaklCOjDEdJqUVGGZGSUog/IsKgPkp64SbEICcPMD4ZnUWJjdGdN1/vZ0cFVvzW8U4yytS+\nc1TgxxEqSspnZHQWGvDh/2z+bMGupHzByJjDtbuMl9o9VZV8G5SUQ2ajdPPCbu6Yk3JIRqnhcT69\n+5dbPU+tnJLyyH2jE+iyov/6TtJt+/V2Nof2syOjlOGefMYko/BjJqTck1EqGWdCqqTcyGh3XTZG\nb5SU0cgoPE9JWWSUisaZkC5Kiow2125FfzPUw1LSyckoNTx2U0kZhIy2NlRpulPSackovIySzklG\nqaFKnZR0QjIKL6aks5FRCih3x4GSTkVG4RRKOg8ZBYg4gLaX++lPl//Zciv6e854noHT7xt5PA7e\nD/DVnJY/A4v6Lh6jOdQrzydmk7Q9GYXTKWlvMsrQ2uxMKGljMgpvoqRdySi8j+eaWpLRLh6fUKq/\nHq7/CJiCG54a2ZW0Q4HWZWnwKGhORnvpkE4oxqJ+GtWe3HDXOlXI6DTcjQ/nkNGZKCmcQEYno6Tw\najI6nwoltTFKITIKEJHRKVWYkEIVMjqrgUtqRU8tMjqxgUsKhcjo3JQUYjI6PSWFjIyyLCOdg2lj\nlHJklGUZ40ThdV0vvwZ4gt/8/Pb+meB9Nw1FipJR/nB2SXfzTcOPBmSUvdeWVDdpT0Y5EJbUUp2p\nyCjHflRSU05mJqN86ouS6ibcyChfuS+ppTocklH+4lZPQwUOyShAxKuYACIyChCRUYCIjAJEZBQg\nIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMA\nERkFiMgoQERGASIyChCRUYCIjAJEZBQgIqMAERkFiMgoQERGASIyChCRUYCIjAJEZBQg8h/JuZzj\nGKpoOwAAAABJRU5ErkJggg==\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 225,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles('O1CCN(CC)CC1')\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 226,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"q = Chem.MolFromSmarts('CCN')"
]
},
{
"cell_type": "code",
"execution_count": 227,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 227,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.HasSubstructMatch(q)"
]
},
{
"cell_type": "code",
"execution_count": 228,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1, 2, 3)"
]
},
"execution_count": 228,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetSubstructMatch(q)"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"((1, 2, 3), (5, 4, 3), (7, 6, 3))"
]
},
"execution_count": 229,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.GetSubstructMatches(q)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TASK 1. Write a script to retrive compound names having carboxylic acid group from the sdf file - `logBB_big.sdf`. \n",
"TASK 2. Write a script to retrive compound names having more than one carboxylic acid group."
]
},
{
"cell_type": "code",
"execution_count": 230,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MolID_208\n",
"MolID_215\n",
"MolID_280\n",
"MolID_281\n",
"MolID_317\n",
"MolID_172\n",
"MolID_71\n",
"MolID_41\n",
"MolID_180\n",
"MolID_118\n",
"MolID_160\n",
"MolID_184\n",
"MolID_101\n",
"MolID_152\n",
"MolID_100\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 0 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 15523\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 0 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 17846\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19055\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19968\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 20965\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 20 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22625\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 20 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22844\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 19 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 23484\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 19 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 12 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 24286\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 12 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 23 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 25470\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 23 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 26 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 26268\n"
]
}
],
"source": [
"for m in Chem.SDMolSupplier('data/logBB_big.sdf'):\n",
" if m is not None:\n",
" if m.HasSubstructMatch(Chem.MolFromSmarts('C(=O)[OH]')):\n",
" print(m.GetProp(\"_Name\"))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Descriptors calculation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a lot of descriptors available in RDKit - http://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors"
]
},
{
"cell_type": "code",
"execution_count": 231,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit.Chem import Descriptors\n",
"m = Chem.MolFromSmiles('c1ccncc1C(=O)O')"
]
},
{
"cell_type": "code",
"execution_count": 232,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"50.19"
]
},
"execution_count": 232,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Descriptors.TPSA(m)"
]
},
{
"cell_type": "code",
"execution_count": 233,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.7797999999999998"
]
},
"execution_count": 233,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Descriptors.MolLogP(m)"
]
},
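{
"cell_type": "markdown",
"metadata": {},
"source": [
"Partial charges are computed differently: `ComputeGasteigerCharges` processes the whole molecule and stores the result on each atom as the `_GasteigerCharge` property."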
]
},
{
"cell_type": "code",
"execution_count": 234,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"atom C id=0 has -0.044587 charge\n",
"atom C id=1 has -0.042959 charge\n",
"atom C id=2 has 0.026786 charge\n",
"atom N id=3 has -0.263835 charge\n",
"atom C id=4 has 0.041281 charge\n",
"atom C id=5 has 0.077736 charge\n",
"atom C id=6 has 0.336759 charge\n",
"atom O id=7 has -0.246333 charge\n",
"atom O id=8 has -0.477599 charge\n"
]
}
],
"source": [
"AllChem.ComputeGasteigerCharges(m)\n",
"for a in m.GetAtoms():\n",
"    print(\"atom %s id=%i has %f charge\" % (a.GetSymbol(), a.GetIdx(), float(a.GetProp('_GasteigerCharge'))))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Fingerprints"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Topological fingerprints"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"These fingerprints encode topological paths (bond paths through the molecular graph) whose length falls within a specified range (defaults: minimum 1 bond, maximum 7 bonds)."
]
},
{
"cell_type": "code",
"execution_count": 235,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit import DataStructs\n",
"from rdkit.Chem.Fingerprints import FingerprintMols"
]
},
{
"cell_type": "code",
"execution_count": 236,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m1 = Chem.MolFromSmiles(\"CCCCO\")\n",
"m2 = Chem.MolFromSmiles(\"c1ccccc1CO\")"
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fp1 = FingerprintMols.FingerprintMol(m1)\n",
"fp2 = FingerprintMols.FingerprintMol(m2)"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.14"
]
},
"execution_count": 238,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1, fp2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Tanimoto similarity is used by default, but other metrics are implemented as well: Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky."
]
},
{
"cell_type": "code",
"execution_count": 239,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.24561403508771928"
]
},
"execution_count": 239,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1,fp2, metric=DataStructs.DiceSimilarity)"
]
},
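{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what these metrics measure, the sketch below computes Tanimoto and Dice similarity in plain Python for two fingerprints represented as sets of on-bit indices. The function names and the example bit sets are hypothetical and not part of RDKit; for real fingerprints, use the `DataStructs` functions shown above.\n",
"\n",
"```python\n",
"def tanimoto(a, b):\n",
"    # |A & B| / |A | B| for two sets of on-bit indices\n",
"    common = len(a & b)\n",
"    return common / (len(a) + len(b) - common)\n",
"\n",
"def dice(a, b):\n",
"    # 2 * |A & B| / (|A| + |B|)\n",
"    return 2 * len(a & b) / (len(a) + len(b))\n",
"\n",
"# hypothetical on-bit sets, for illustration only\n",
"bits1 = {1, 5, 8, 13}\n",
"bits2 = {5, 8, 21, 34, 55}\n",
"\n",
"t = tanimoto(bits1, bits2)\n",
"d = dice(bits1, bits2)\n",
"print(t, d)\n",
"# for bit vectors the two metrics are related: Dice = 2T / (1 + T)\n",
"print(abs(d - 2 * t / (1 + t)) < 1e-12)\n",
"```\n",
"\n",
"The same relation holds for the values computed above: a Tanimoto score of 0.14 corresponds to a Dice score of 2 * 0.14 / (1 + 0.14), which is about 0.2456."
]
},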
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### MACCS keys"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MACCS keys are SMARTS patterns of different functional groups and common fragments: http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html"
]
},
{
"cell_type": "code",
"execution_count": 240,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit.Chem import MACCSkeys"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fp1 = MACCSkeys.GenMACCSKeys(m1)\n",
"fp2 = MACCSkeys.GenMACCSKeys(m2)"
]
},
{
"cell_type": "code",
"execution_count": 242,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.3684210526315789"
]
},
"execution_count": 242,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1, fp2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Morgan fingerprints (Circular Fingerprints)"
]
},
{
"cell_type": "code",
"execution_count": 243,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fp1 = AllChem.GetMorganFingerprint(m1, 2)\n",
"fp2 = AllChem.GetMorganFingerprint(m2, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, Morgan fingerprints are count vectors, so Dice similarity should be used as the metric."
]
},
{
"cell_type": "code",
"execution_count": 244,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.16666666666666666"
]
},
"execution_count": 244,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.DiceSimilarity(fp1, fp2)"
]
},
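{
"cell_type": "markdown",
"metadata": {},
"source": [
"For count fingerprints, Dice similarity compares how often each feature occurs rather than just whether it is present. The plain-Python sketch below shows the idea with `collections.Counter`. The feature identifiers and counts are made up for illustration, and the formula is the commonly used count-based Dice definition, which is assumed here to mirror what `DataStructs.DiceSimilarity` does for count vectors.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def dice_counts(c1, c2):\n",
"    # 2 * (sum of shared counts) / (total counts in c1 + total counts in c2)\n",
"    shared = sum((c1 & c2).values())  # Counter & keeps the minimum count per key\n",
"    return 2 * shared / (sum(c1.values()) + sum(c2.values()))\n",
"\n",
"# hypothetical feature counts, for illustration only\n",
"fp_a = Counter({'f1': 2, 'f2': 1, 'f3': 1})\n",
"fp_b = Counter({'f1': 1, 'f3': 3, 'f4': 2})\n",
"\n",
"# shared counts: f1 -> 1, f3 -> 1; totals: 4 and 6\n",
"print(dice_counts(fp_a, fp_b))\n",
"```"
]
},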
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to calculate Morgan fingerprints as bit vectors."
]
},
{
"cell_type": "code",
"execution_count": 245,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)\n",
"fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2)"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.24"
]
},
"execution_count": 246,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.DiceSimilarity(fp1,fp2)"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.13636363636363635"
]
},
"execution_count": 247,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1, fp2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Atoms can be labeled by feature types (like pharmacophore features): http://www.rdkit.org/docs/GettingStartedInPython.html#feature-definitions-used-in-the-morgan-fingerprints"
]
},
{
"cell_type": "code",
"execution_count": 248,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, useFeatures=True)\n",
"fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, useFeatures=True)"
]
},
{
"cell_type": "code",
"execution_count": 249,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.3"
]
},
"execution_count": 249,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.DiceSimilarity(fp1,fp2)"
]
},
{
"cell_type": "code",
"execution_count": 250,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.17647058823529413"
]
},
"execution_count": 250,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1, fp2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TASK 3. Write a script that retrieves the SMILES of the compounds in the logBB_big.sdf file that are most similar to the molecule N1CCNCC1c1ccccc1, ranking them by Tanimoto score computed on topological and Morgan fingerprints."
]
},
{
"cell_type": "code",
"execution_count": 251,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 26 marked aromatic\n",
"RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic\n",
"RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561\n",
"RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 0 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 15523\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 0 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 17846\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19055\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19968\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 20965\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 20 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22625\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 20 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22844\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 19 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 23484\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 19 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 12 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 24286\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 12 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 23 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 25470\n",
"RDKit ERROR: [22:20:00] ERROR: non-ring atom 23 marked aromatic\n",
"RDKit ERROR: [22:20:00] non-ring atom 26 marked aromatic\n",
"RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 26268\n"
]
}
],
"source": [
"mol = Chem.MolFromSmiles(\"N1CCNCC1c1ccccc1\")\n",
"tp = FingerprintMols.FingerprintMol(mol)\n",
"mg = AllChem.GetMorganFingerprintAsBitVect(mol, 2, useFeatures=True)\n",
"\n",
"d = {}\n",
"for m in Chem.SDMolSupplier('data/logBB_big.sdf'):\n",
"    if m is not None:\n",
"        tp_ = FingerprintMols.FingerprintMol(m)\n",
"        mg_ = AllChem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True)\n",
"        d[Chem.MolToSmiles(m)] = (DataStructs.FingerprintSimilarity(tp, tp_), DataStructs.FingerprintSimilarity(mg, mg_))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort the resulting dictionary in descending order of topological similarity and return the top 5 molecules."
]
},
{
"cell_type": "code",
"execution_count": 252,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333)),\n",
" ('CCCOC(C)=O', (0.5238095238095238, 0.037037037037037035)),\n",
" ('CC(=O)OC(C)C', (0.5238095238095238, 0.04)),\n",
" ('CC(N)Cc1ccccc1', (0.5054945054945055, 0.30434782608695654)),\n",
" ('FC(F)OC(F)C(F)(F)F', (0.5041322314049587, 0.037037037037037035))]"
]
},
"execution_count": 252,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(d.items(), key=lambda x: -x[1][0])[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same for the Morgan similarity."
]
},
{
"cell_type": "code",
"execution_count": 253,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('CC(C)(C)c1ccccc1', (0.35978835978835977, 0.3684210526315789)),\n",
" ('c1cncc(C2CCCN2)c1', (0.4752186588921283, 0.3448275862068966)),\n",
" ('Cc1ccccc1', (0.4603174603174603, 0.3333333333333333)),\n",
" ('Cc1ccccc1C', (0.4051724137931034, 0.3333333333333333)),\n",
" ('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333))]"
]
},
"execution_count": 253,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sorted(d.items(), key=lambda x: -x[1][1])[:5]"
]
},
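{
"cell_type": "markdown",
"metadata": {},
"source": [
"For large files, sorting every entry just to take the first few is more work than needed; `heapq.nlargest` from the standard library returns the top entries directly. The following is a minimal sketch that uses a few (SMILES, (topological, Morgan)) pairs copied from the outputs above. The helper name `top_n` is hypothetical.\n",
"\n",
"```python\n",
"import heapq\n",
"\n",
"def top_n(scores, index, n=5):\n",
"    # scores maps SMILES -> (topological similarity, Morgan similarity);\n",
"    # index selects which of the two similarities to rank by\n",
"    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1][index])\n",
"\n",
"# a few entries copied from the outputs above\n",
"scores = {\n",
"    'CNC(C)Cc1ccccc1': (0.5567567567567567, 0.3333333333333333),\n",
"    'CCCOC(C)=O': (0.5238095238095238, 0.037037037037037035),\n",
"    'CC(C)(C)c1ccccc1': (0.35978835978835977, 0.3684210526315789),\n",
"    'c1cncc(C2CCCN2)c1': (0.4752186588921283, 0.3448275862068966),\n",
"}\n",
"\n",
"print(top_n(scores, 0, n=2))  # ranked by topological similarity\n",
"print(top_n(scores, 1, n=2))  # ranked by Morgan similarity\n",
"```\n",
"\n",
"The two rankings differ, which is why it is worth inspecting both fingerprint types."
]
},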
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2D pharmacophore fingerprints"
]
},
{
"cell_type": "code",
"execution_count": 254,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit.Chem.Pharm2D import Gobbi_Pharm2D, Generate"
]
},
{
"cell_type": "code",
"execution_count": 255,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"m1 = Chem.MolFromSmiles(\"NCCCCO\")\n",
"m2 = Chem.MolFromSmiles(\"NCCCCCCO\")"
]
},
{
"cell_type": "code",
"execution_count": 256,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fp1 = Generate.Gen2DFingerprint(m1, Gobbi_Pharm2D.factory)\n",
"fp2 = Generate.Gen2DFingerprint(m2, Gobbi_Pharm2D.factory)"
]
},
{
"cell_type": "code",
"execution_count": 257,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.06153846153846154"
]
},
"execution_count": 257,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataStructs.FingerprintSimilarity(fp1, fp2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reactions"
]
},
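{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reactions can be constructed from SMARTS definitions of reactants and products. The example below forms an amide from a carboxylic acid and a nitrogen bearing at least one hydrogen (such as a primary or secondary amine)."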
]
},
{
"cell_type": "code",
"execution_count": 258,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')"
]
},
{
"cell_type": "code",
"execution_count": 259,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"p = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'),Chem.MolFromSmiles('NC')))"
]
},
{
"cell_type": "code",
"execution_count": 260,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 260,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(p)"
]
},
{
"cell_type": "code",
"execution_count": 261,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAcIAAACWCAIAAADCEh9HAAADSklEQVR4nO3c207qQABA0dMT//+X\nxwcSjAiK7N5d64lLokMCOzOdttMY4x8Ar/q/9QAAjk1GARIZBUhkFCCRUYBERgESGQVIZBQgkVGA\nREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZ\nBUhkFCCRUYBERgESGQVIZBQgkVGAREYBEhkFSGQUIJFRgERGARIZBUhkFCCRUYBERgESGQVI3rYe\nAH/GNH08HmO7ccDMZJRVTNOndN48hSOzqGd5X6M5xqfJKRyZjAIkMgqQyChAIqMAiYyyvK8bSnbq\nOREnPLGKm5JqKCcio6xFOjkpi3qAREYBEhlleY8uWHIhE6cgowCJjAIkMgqQyChAIqMAiYwCJDIK\nkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMsriHt0Oz23yOAcZBUhkFCCRUYBERgESGQVIZBQgkVGA\nREYBEhkFSGQUIJFRgERGARIZZVXTdIYbkpzjUzCXaYyx9Rj4Ey7pGWNcH2w9olccevAsREZZ3N30\nHK5Hhxswq5FRFvRjeo7SpmnyS+EhXw6W8nx69hzTPY+NnZBR5vdaenYYLJNQnuFbwpx6CncS08tW\n/Naj4BhklHnMm79p2ixhAspvvW09AM5g9sXvGBvkTEB5jYySLLcGv/zJddImoBQyyovWOYh5jely\n/2fDAwicg2OjvGL9LWwTRnbLbJTf2WonfYVpKbxGRnnWHk5F2mTrCb4no/xsDwG9WnPrCZ4ho/xg\nn1fyWOOzHzLKQ7uahN6146Hxh7htM3dM03SZhO65oTe+3kn5+so3b0FnNson+5+BfsMan03IKB/2\neRj0eZd9/CN/Ag7Jop4Ph24obMVslFN5NCF1MJTlyChnc7ekN09VlRlZ1AMkMsoJXa8ZhRXIKOdk\nt4zVHPsEF4DNmY0CJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgo\nQCKjAImMAiQyCpDIKEAiowCJjAIkMgqQyChAIqMAiYwCJDIKkMgoQCKjAImMAiQyCpDIKEAiowCJ\njAIkMgqQyChAIqMAiYwCJDIKkMgoQPIOa2TTHPKxwjkAAAAASUVORK5CYII=\n",
"image/svg+xml": [
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 261,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p[0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generation of a combinatorial library"
]
},
{
"cell_type": "code",
"execution_count": 262,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"acids = ['c1ccccc1C(=O)O', 'CC(=O)O', 'OC(=O)CC(=O)O']\n",
"bases = ['CCN', 'CCNC', 'CNCCN']\n",
"\n",
"a = [Chem.MolFromSmiles(s) for s in acids]\n",
"b = [Chem.MolFromSmiles(s) for s in bases]"
]
},
{
"cell_type": "code",
"execution_count": 263,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from itertools import product\n",
"\n",
"p = []\n",
"for s in product(a, b):\n",
" p.extend(rxn.RunReactants(s)) "
]
},
{
"cell_type": "code",
"execution_count": 264,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CCNC(=O)c1ccccc1\n",
"CCN(C)C(=O)c1ccccc1\n",
"CN(CCN)C(=O)c1ccccc1\n",
"CNCCNC(=O)c1ccccc1\n",
"CCNC(C)=O\n",
"CCN(C)C(C)=O\n",
"CC(=O)N(C)CCN\n",
"CNCCNC(C)=O\n",
"CCNC(=O)CC(=O)O\n",
"CCNC(=O)CC(=O)O\n",
"CCN(C)C(=O)CC(=O)O\n",
"CCN(C)C(=O)CC(=O)O\n",
"CN(CCN)C(=O)CC(=O)O\n",
"CNCCNC(=O)CC(=O)O\n",
"CN(CCN)C(=O)CC(=O)O\n",
"CNCCNC(=O)CC(=O)O\n"
]
}
],
"source": [
"for item in p:\n",
"    print(Chem.MolToSmiles(item[0]))"
]
},
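{
"cell_type": "markdown",
"metadata": {},
"source": [
"Three acids and three amines give nine reactant pairs, yet sixteen products were printed. A likely explanation is that `RunReactants` returns one product set for every combination of template matches, so a reactant with two reactive sites contributes twice. The sketch below reproduces the count in plain Python. The site counts are read off the SMILES by hand (malonic acid has two carboxyl groups, `CNCCN` has two nitrogens bearing hydrogen) and are an assumption of this illustration.\n",
"\n",
"```python\n",
"from itertools import product\n",
"\n",
"# number of carboxylic acid groups per acid (counted by hand)\n",
"acid_sites = {'c1ccccc1C(=O)O': 1, 'CC(=O)O': 1, 'OC(=O)CC(=O)O': 2}\n",
"# number of nitrogens bearing at least one hydrogen per amine (counted by hand)\n",
"amine_sites = {'CCN': 1, 'CCNC': 1, 'CNCCN': 2}\n",
"\n",
"pairs = list(product(acid_sites, amine_sites))\n",
"print(len(pairs))  # number of reactant pairs\n",
"\n",
"# each pair yields (acid sites) x (amine sites) product sets\n",
"n_products = sum(acid_sites[a] * amine_sites[b] for a, b in pairs)\n",
"print(n_products)\n",
"```"
]
},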
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the products may contain duplicates, which can be removed based on their canonical SMILES."
]
},
{
"cell_type": "code",
"execution_count": 265,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'CC(=O)N(C)CCN',\n",
" 'CCN(C)C(=O)CC(=O)O',\n",
" 'CCN(C)C(=O)c1ccccc1',\n",
" 'CCN(C)C(C)=O',\n",
" 'CCNC(=O)CC(=O)O',\n",
" 'CCNC(=O)c1ccccc1',\n",
" 'CCNC(C)=O',\n",
" 'CN(CCN)C(=O)CC(=O)O',\n",
" 'CN(CCN)C(=O)c1ccccc1',\n",
" 'CNCCNC(=O)CC(=O)O',\n",
" 'CNCCNC(=O)c1ccccc1',\n",
" 'CNCCNC(C)=O'}"
]
},
"execution_count": 265,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(Chem.MolToSmiles(item[0]) for item in p)"
]
},
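{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `set` does not preserve the order in which the products were generated. If that order matters, `collections.OrderedDict.fromkeys` removes duplicates while keeping the first occurrence of each SMILES. A small sketch on some of the SMILES strings printed above:\n",
"\n",
"```python\n",
"from collections import OrderedDict\n",
"\n",
"# some of the product SMILES printed above, including duplicates\n",
"smiles = [\n",
"    'CCNC(=O)CC(=O)O',\n",
"    'CCNC(=O)CC(=O)O',\n",
"    'CCN(C)C(=O)CC(=O)O',\n",
"    'CCN(C)C(=O)CC(=O)O',\n",
"    'CN(CCN)C(=O)CC(=O)O',\n",
"    'CNCCNC(=O)CC(=O)O',\n",
"    'CN(CCN)C(=O)CC(=O)O',\n",
"    'CNCCNC(=O)CC(=O)O',\n",
"]\n",
"\n",
"# keep the first occurrence of each SMILES, in generation order\n",
"unique_smiles = list(OrderedDict.fromkeys(smiles))\n",
"print(unique_smiles)\n",
"```"
]
},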
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Drawing molecules"
]
},
{
"cell_type": "code",
"execution_count": 266,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rdkit.Chem import Draw"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a molecule and compute coordinates"
]
},
{
"cell_type": "code",
"execution_count": 267,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 267,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = Chem.MolFromSmiles('CC(=O)O')\n",
"AllChem.Compute2DCoords(m)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An image can be saved directly to a file or obtained as a PIL image."
]
},
{
"cell_type": "code",
"execution_count": 268,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"Draw.MolToFile(m, 'data/mol.png')"
]
},
{
"cell_type": "code",
"execution_count": 269,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAGQAAABkCAIAAAD/gAIDAAAB3klEQVR4nO3cwXKCMAAGYdLp+79y\nenAGRdS6hiQ/4+7JemjhmwRSREqtdbH3+pm9AWdKLJBYILFAYoHEAokFEgskFkgskFggsUBigcQC\niQUSCyQWSCzQ7+wN2FXK9XXYJe8wrFI2QHc/zi5pGu5pat0MtNklYcUnFkgskFigJKz94TzsbBi2\ndLjzSpJa4rCWOKDbkqZhfGKBkrAeLtZdwZ80sUBigcQCiQUSCyQWSCyQWCCxQGKBxAKJBRILFIT1\n8FpM0AWaKKz8xAKJBRILJBZILJBYILFAYoHEAokFEgsUilWSbnFYK2nPorkw1VrXF7O36FoQ1kOd\nKLIIrH9FSsnYzrkb8f7ASRhiM7E+GC9zyeZgNe7zLLLRWEfu5/C75Mfd2n38cFhvmh9FNgKr46y5\n/M5RZN2xRpz1V7LOf6gj1ujDcP9Z2QVr2gm+8xA7GCth6dhviB2JFfJPybL0OvAfgxUxoPYdPStb\nsUKZbksYWSdgetaz73/ux+D2nU+wTsy0ND1nA2MFHcU/6NlzNt7bI3wN/sRSzeV9R3puLz8oEWvb\n/gB/U+hHYZl9GVbbcza+bxo2PGfjzOuA4X3ZNGxLLJBYILFAYoHEAokFEgskFkgskFggsUBigcQC\n/QGof7u5Mfvo1gAAAABJRU5ErkJggg==\n",
"text/plain": [
""
]
},
"execution_count": 269,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Draw.MolToImage(m, (100, 100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Multiple molecules can be visualized as a grid. Let's visualize the unique products from the combinatorial library generation."
]
},
{
"cell_type": "code",
"execution_count": 270,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['CCN(C)C(=O)c1ccccc1',\n",
" 'CCN(C)C(C)=O',\n",
" 'CCNC(=O)CC(=O)O',\n",
" 'CC(=O)N(C)CCN',\n",
" 'CCNC(C)=O',\n",
" 'CN(CCN)C(=O)CC(=O)O',\n",
" 'CN(CCN)C(=O)c1ccccc1',\n",
" 'CNCCNC(C)=O',\n",
" 'CNCCNC(=O)c1ccccc1',\n",
" 'CNCCNC(=O)CC(=O)O',\n",
" 'CCNC(=O)c1ccccc1',\n",
" 'CCN(C)C(=O)CC(=O)O']"
]
},
"execution_count": 270,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pu = list(set(Chem.MolToSmiles(item[0]) for item in p))\n",
"pu"
]
},
{
"cell_type": "code",
"execution_count": 271,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"execution_count": 271,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Draw.MolsToGridImage([Chem.MolFromSmiles(s) for s in pu], molsPerRow=4, subImgSize=(200,200), legends=pu)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Recommended sources:\n",
"\n",
"1. RDKit official website: http://www.rdkit.org/docs/index.html\n",
"2. RDKit mailing list: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}