For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016.

Dr. Pavel Polishchuk

Basic of RDKit

RDKit is a an open-source cross-platform chemoinformatics toolkit.
Written in C++, supports Python 2 and 3, Java and C#.
BSD license

2000-2006: Developed and used at Rational Discovery for building predictive models for ADME, Tox, biological activity
June 2006: Open-source (BSD license) release of software, Rational Discovery shuts down
to present: Open-source development continues, use within Novartis, contributions from Novartis back to open-source version

Detailed documentation with tutorials and examples are available on http://www.rdkit.org/docs/index.html
Recently new github repository was created to manage RDKit tutorials: https://github.com/rdkit/rdkit-tutorials
I highly recommend to subscribe on RDKit maillist https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/

Reading and writing molecules

RDKit supports various formats: SMILES, Mol, SDF, Mol2, PDB, FASTA, etc.

In [166]:
from rdkit import Chem
In [167]:
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True

Reading

In the case of successful reading the functions return mol object, otherwise None. The latter can be used to check whether reading was successful or not.

SMILES

Coordinates for 2D depiction are generated automatically.

In [168]:
m = Chem.MolFromSmiles("c1ccccc1OC")
In [169]:
m
Out[169]:
O

Reading of a structure with errors leads to errors, result will be None.

In [170]:
m = Chem.MolFromSmiles("c1ccccc1O(C)C")
RDKit ERROR: [21:09:49] ERROR: non-ring atom 26 marked aromatic
RDKit ERROR: [21:53:10] 
RDKit ERROR: 
RDKit ERROR: ****
RDKit ERROR: Pre-condition Violation
RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()
RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp
RDKit ERROR: Failed Expression: d_explicitValence > -1
RDKit ERROR: ****
RDKit ERROR: 
RDKit ERROR: [21:53:39] 
RDKit ERROR: 
RDKit ERROR: ****
RDKit ERROR: Pre-condition Violation
RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()
RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp
RDKit ERROR: Failed Expression: d_explicitValence > -1
RDKit ERROR: ****
RDKit ERROR: 
RDKit ERROR: [22:19:58] Explicit valence for atom # 6 O, 3, is greater than permitted
In [171]:
m is None
Out[171]:
True

reading from MolBlock

MolBlock doesn't contain property fields like in SDF file.

In [172]:
molblock = """
  Mrv1661310131608212D          

 16 16  0  0  0  0            999 V2000
   -5.9598    1.6732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.6743    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.6743    0.4357    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598    0.0232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2454    0.4357    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2454    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.5309    1.6732    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.8164    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598    2.4982    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3888    1.6732    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3888    0.0232    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598   -0.8018    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.5309    0.0232    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2289    0.5462    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1019    0.8482    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4039    1.9752    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  4  0  0  0  0
  2  3  4  0  0  0  0
  3  4  4  0  0  0  0
  4  5  4  0  0  0  0
  5  6  4  0  0  0  0
  1  6  4  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  1  9  1  0  0  0  0
  2 10  1  0  0  0  0
  3 11  1  0  0  0  0
  4 12  1  0  0  0  0
  5 13  1  0  0  0  0
  8 14  1  0  0  0  0
  8 15  1  0  0  0  0
  8 16  1  0  0  0  0
M  END
"""
m = Chem.MolFromMolBlock(molblock)
m
Out[172]:
O

Mol files

In [173]:
m = Chem.MolFromMolFile("data/anisole.mol")
In [174]:
m
Out[174]:
O

reading from SDF

In [175]:
iterator = Chem.SDMolSupplier("data/logBB.sdf")
for m in iterator:
    if m is not None:              # test whether molecule was read
        print(m.GetNumAtoms())     # returns number of heavy atoms only, Hs were stripped
5
5
5
5
In [176]:
mols = [m for m in iterator]
len(mols)
Out[176]:
4

In the second case if some molecules failed to read there will be some Nones in the output list which should be removed

In [177]:
mols = [m for m in iterator if m is not None]
len(mols)
Out[177]:
4

You may use Supplier as a random-access object:

In [178]:
m = iterator[2]
m
Out[178]:
F F Cl
In [179]:
m = iterator[0]
m
Out[179]:
Cl Cl Cl

reading from gzipped SDF and other file-like objects

In this case you cannot use random access.

In [180]:
import gzip
iterator = Chem.ForwardSDMolSupplier(gzip.open("data/logBB.sdf.gz"))
mols = [m for m in iterator if m is not None]
len(mols)
Out[180]:
4
In [181]:
mols[0]
Out[181]:
Cl Cl Cl

Hydrogens

By default hydrogens are removed during reading.

In [182]:
m = Chem.MolFromMolFile("data/anisole.mol")
m
Out[182]:
O

You may add them back manually, however their coordinates will not be recalculated.

In [183]:
m = Chem.AddHs(m)
m
Out[183]:
O H H H H H H H H

You may generate 2D coordinates for atoms to a obtain more reasonable depiction

In [184]:
from rdkit.Chem import AllChem
In [185]:
AllChem.Compute2DCoords(m)
Out[185]:
0
In [186]:
m
Out[186]:
O H H H H H H H H

To avoid lose of hydrogens add removeHs = False, it will keep them during reading.

In [187]:
m = Chem.MolFromMolFile("data/anisole.mol", removeHs = False)
m
Out[187]:
O H H H H H H H H

This can be particularly important for molecules with chiral centers with attached hydrogens.

In [188]:
m = Chem.MolFromMolFile("data/chlorofluoroethane.mol")
m
Out[188]:
F Cl H
In [189]:
m = Chem.MolFromMolFile("data/chlorofluoroethane.mol", removeHs = False)
m
Out[189]:
F Cl H

SMILES reading automatically takes into account chiral hydrogen and preserves them.

In [190]:
m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
m
Out[190]:
F Cl H

However, explicitly specified Hs will be removed.

In [191]:
m = Chem.MolFromSmiles("C([H])(F)(Cl)C")
m
Out[191]:
F Cl

Writing

SMILES

By default saving to SMILES provides canonical SMILES but without chirality

In [192]:
m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
Chem.MolToSmiles(m)
Out[192]:
'CC(F)Cl'
In [193]:
Chem.MolToSmiles(m, isomericSmiles = True)
Out[193]:
'C[C@H](F)Cl'

MolBlock

In [194]:
molblock = Chem.MolToMolBlock(m)
print(molblock)
     RDKit          

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
  1  4  1  0
M  END

However, chirality was lost. Specifying includeStereo = True doesn't help.

In [195]:
molblock = Chem.MolToMolBlock(m, includeStereo = True)
print(molblock)
     RDKit          

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
  1  4  1  0
M  END

You may note that MolToMolBlock produces output with all coordinates identical. Therefore it may be useful to generate them to obtain reasonable 2D depiction. At the same time stereoinformation will appear.

In [196]:
AllChem.Compute2DCoords(m)
molblock = Chem.MolToMolBlock(m)
print(molblock)
     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990   -0.7500    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.5000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
   -1.2990   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  1
  1  3  1  0
  1  4  1  0
M  END

Mol files

You may use Python file objects to write mol files.

In [197]:
print(Chem.MolToMolBlock(m), file=open('data/foo.mol','w+'))

SDF

In [198]:
w = Chem.SDWriter('data/foo.sdf')
for m in mols:
    w.write(m)
w.close()

You may use file object and write molecules to gzipped sdf

f = gzip.open('data/foo.sdf.gz', 'a') w = Chem.SDWriter(f) for m in mols: w.write(m) w.close() f.close()

Analogously there is SmilesWriter to create text files containing SMILES representation of molecules

Structure Sanitization

http://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization
The molecule parsing functions all, by default, perform a “sanitization” operation on the molecules read. The idea is to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to ensure that the molecules are “reasonable”: that they can be represented with octet-complete Lewis dot structures.

Here are the steps involved, in order.

  1. clearComputedProps: removes any computed properties that already exist on the molecule and its atoms and bonds. This step is always performed.

  2. cleanUp: standardizes a small number of non-standard valence states. The clean up operations are:

    • Neutral 5 valent Ns with double bonds to Os are converted to the zwitterionic form. Example: N(=O)=O -> [N+](=O)O-]
    • Neutral 5 valent Ns with triple bonds to another N are converted to the zwitterionic form. Example: C-N=N#N -> C-N=[N+]=[N-]
    • Neutral 5 valent phosphorus with one double bond to an O and another to either a C or a P are converted to the zwitterionic form. Example: C=P(=O)O -> C=[P+]([O-])O
    • Neutral Cl, Br, or I with exclusively O neighbors, and a valence of 3, 5, or 7, are converted to the zwitterionic form. This covers things like chlorous acid, chloric acid, and perchloric acid. Example: O=Cl(=O)O -> [O-][Cl+2][O-]O
      This step should not generate exceptions.
  1. updatePropertyCache: calculates the explicit and implicit valences on all atoms. This generates exceptions for atoms in higher-than-allowed valence states. This step is always performed, but if it is “skipped” the test for non-standard valences will not be carried out.

  2. symmetrizeSSSR: calls the symmetrized smallest set of smallest rings algorithm (discussed in the Getting Started document).

  3. Kekulize: converts aromatic rings to their Kekule form. Will raise an exception if a ring cannot be kekulized or if aromatic bonds are found outside of rings.

  4. assignRadicals: determines the number of radical electrons (if any) on each atom.

  5. setAromaticity: identifies the aromatic rings and ring systems (see above), sets the aromatic flag on atoms and bonds, sets bond orders to aromatic.

  6. setConjugation: identifies which bonds are conjugated

  7. setHybridization: calculates the hybridization state of each atom

  8. cleanupChirality: removes chiral tags from atoms that are not sp3 hybridized.

  9. adjustHs: adds explicit Hs where necessary to preserve the chemistry. This is typically needed for heteroatoms in aromatic rings. The classic example is the nitrogen atom in pyrrole.

The individual steps can be toggled on or off when calling MolOps::sanitizeMol or Chem.SanitizeMol.

In [199]:
m = Chem.MolFromSmiles('Cn(:o):o')
RDKit ERROR: [22:19:59] non-ring atom 1 marked aromatic

Working with molecules

Looping over atoms and bonds

In [200]:
m = Chem.MolFromSmiles('C1OC=C1')
In [201]:
for atom in m.GetAtoms():
    print(atom.GetAtomicNum())
6
8
6
6
In [202]:
for bond in m.GetBonds():
    print(bond.GetBondType())
SINGLE
SINGLE
DOUBLE
SINGLE

Individual atoms and bonds can be accesses as well as their properties

In [203]:
m.GetAtomWithIdx(1).GetSymbol()
Out[203]:
'O'
In [204]:
m.GetAtomWithIdx(2).GetExplicitValence()
Out[204]:
3
In [205]:
m.GetBondWithIdx(0).GetBeginAtomIdx()
Out[205]:
0
In [206]:
m.GetBondBetweenAtoms(0,1).GetBondType()
Out[206]:
rdkit.Chem.rdchem.BondType.SINGLE

You may loop through neigbours of particular atoms

In [207]:
atom = m.GetAtomWithIdx(2)

for nei in atom.GetNeighbors():
    print("root atom %s is connected to %s by %s bond" % (atom.GetSymbol(), nei.GetSymbol(), m.GetBondBetweenAtoms(atom.GetIdx(), nei.GetIdx()).GetBondType()))
root atom C is connected to O by SINGLE bond
root atom C is connected to C by DOUBLE bond

Molecules properties

You may set and read properties of molecules, which can be stored in property fields of sdf files.

In [208]:
m.SetProp("Activity", "inactive")
In [209]:
m.SetIntProp("Boiling point", 40)
In [210]:
m.GetProp("Boiling point")
Out[210]:
'40'

Magic properties

There are a lot of 'magic' properties of atoms/bonds and molecules. More details can be found at http://www.rdkit.org/docs/RDKit_Book.html#magic-property-values
One of them is a title or a name of a molecule ("_Name")

In [211]:
m.SetProp("_Name", "molecule name")
In [212]:
m.GetProp("_Name")
Out[212]:
'molecule name'

When you save this molecule to sdf file, _Name property will be stored as a title, all others as ordinary property fields.

In [213]:
w = Chem.SDWriter('data/bar.sdf')
w.write(m)
w.close()

3D structures and conformers

Generation of 3D structure

In [214]:
m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m
Out[214]:
O N

Since by default RDKit doesn't keep hydrogens they should be added before 3D structure generation to obtain rasonable geometry

In [215]:
m = Chem.AddHs(m)
m
Out[215]:
O N H H H H H H H H H H H H H

This command generates 3D structure for a molecule usinf distance matrix approach

In [216]:
AllChem.EmbedMolecule(m)
Out[216]:
0

The obtained geometry usually is quite ugly and refinement is neccessary. This can be done by using universal force field (UFF) or Merck molecular force field (MMFF).

In [217]:
AllChem.UFFOptimizeMolecule(m)
Out[217]:
0
In [218]:
m
Out[218]:
O N H H H H H H H H H H H H H
In [219]:
AllChem.MMFFOptimizeMolecule(m)
Out[219]:
0
In [220]:
m
Out[220]:
O N H H H H H H H H H H H H H

Generation of multiple conformers

There are a lot of input options. The major ones are the number of conformers and RMS threshold, which will help to discard too similar 3D structures and keep diverse conformers.

In [221]:
cids = AllChem.EmbedMultipleConfs(m, numConfs=10, pruneRmsThresh=1)
In [222]:
len(cids)
Out[222]:
7

You may see that from 10 required conformers only 7 were generated. If you will descrese the RMS threshold value more conformers will be kept up to specified maximum value (10).

Generated conformers require geometry optimization to produce more reasonable 3D structures.

In [223]:
for cid in cids:
    AllChem.MMFFOptimizeMolecule(m, confId=cid)

For conformers you may corresponding return energy values. Since we optimized geometry with MMFF we will use the same force field for energy calculation.

In [224]:
for cid in cids:
    ff = AllChem.MMFFGetMoleculeForceField(m, AllChem.MMFFGetMoleculeProperties(m), confId=cid)
    print(ff.CalcEnergy())
64.09232266554531
56.52251338020919
64.09232254579031
64.0923226735707
56.52251334421818
56.522513373936874
64.53225822753267
In [225]:
m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m
Out[225]:
O N
In [226]:
q = Chem.MolFromSmarts('CCN')
In [227]:
m.HasSubstructMatch(q)
Out[227]:
True
In [228]:
m.GetSubstructMatch(q)
Out[228]:
(1, 2, 3)
In [229]:
m.GetSubstructMatches(q)
Out[229]:
((1, 2, 3), (5, 4, 3), (7, 6, 3))

TASK 1. Write a script to retrive compound names having carboxylic acid group from the sdf file - logBB_big.sdf.
TASK 2. Write a script to retrive compound names having more than one carboxylic acid group.

In [230]:
for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
    if m is not None:
        if m.HasSubstructMatch(Chem.MolFromSmarts('C(=O)[OH]')):
            print(m.GetProp("_Name"))
MolID_208
MolID_215
MolID_280
MolID_281
MolID_317
MolID_172
MolID_71
MolID_41
MolID_180
MolID_118
MolID_160
MolID_184
MolID_101
MolID_152
MolID_100
RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561
RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 0 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 15523
RDKit ERROR: [22:19:59] ERROR: non-ring atom 0 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 17846
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19055
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19968
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 20965
RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 20 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22625
RDKit ERROR: [22:19:59] ERROR: non-ring atom 20 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22844
RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 19 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 23484
RDKit ERROR: [22:19:59] ERROR: non-ring atom 19 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 12 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 24286
RDKit ERROR: [22:19:59] ERROR: non-ring atom 12 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 23 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 25470
RDKit ERROR: [22:19:59] ERROR: non-ring atom 23 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 26 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 26268

Descriptors calculation

In [231]:
from rdkit.Chem import Descriptors
m = Chem.MolFromSmiles('c1ccncc1C(=O)O')
In [232]:
Descriptors.TPSA(m)
Out[232]:
50.19
In [233]:
Descriptors.MolLogP(m)
Out[233]:
0.7797999999999998

Charges are computed differently

In [234]:
AllChem.ComputeGasteigerCharges(m)
for a in m.GetAtoms():
    print("atom %s id=%i has %f charge" % (a.GetSymbol(), a.GetIdx(), float(a.GetProp('_GasteigerCharge'))))
atom C id=0 has -0.044587 charge
atom C id=1 has -0.042959 charge
atom C id=2 has 0.026786 charge
atom N id=3 has -0.263835 charge
atom C id=4 has 0.041281 charge
atom C id=5 has 0.077736 charge
atom C id=6 has 0.336759 charge
atom O id=7 has -0.246333 charge
atom O id=8 has -0.477599 charge

Fingerprints

Topological fingerprints

They are topological paths between pairs of atoms on a specified distance (defaults: min 1, max 7)

In [235]:
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
In [236]:
m1 = Chem.MolFromSmiles("CCCCO")
m2 = Chem.MolFromSmiles("c1ccccc1CO")
In [237]:
fp1 = FingerprintMols.FingerprintMol(m1)
fp2 = FingerprintMols.FingerprintMol(m2)
In [238]:
DataStructs.FingerprintSimilarity(fp1, fp2)
Out[238]:
0.14

By default for calculation of similarity Tanimoto is used, but there are some other implemented metrics: Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky

In [239]:
DataStructs.FingerprintSimilarity(fp1,fp2, metric=DataStructs.DiceSimilarity)
Out[239]:
0.24561403508771928

MACCS keys

They are SMARTS patterns of different functiong groups and common fragments: http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html

In [240]:
from rdkit.Chem import MACCSkeys
In [241]:
fp1 = MACCSkeys.GenMACCSKeys(m1)
fp2 = MACCSkeys.GenMACCSKeys(m2)
In [242]:
DataStructs.FingerprintSimilarity(fp1, fp2)
Out[242]:
0.3684210526315789

Morgan fingerprints (Circular Fingerprints)

In [243]:
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)

Morgan fingerprints by default are counts, therefore dice similarity should be used as a metric.

In [244]:
DataStructs.DiceSimilarity(fp1, fp2)
Out[244]:
0.16666666666666666

But it is possible calculate Morgan fingerprints as bit vectors.

In [245]:
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2)
In [246]:
DataStructs.DiceSimilarity(fp1,fp2)
Out[246]:
0.24
In [247]:
DataStructs.FingerprintSimilarity(fp1, fp2)
Out[247]:
0.13636363636363635

Atoms can be labeled by feature types (like pharmacophore features): http://www.rdkit.org/docs/GettingStartedInPython.html#feature-definitions-used-in-the-morgan-fingerprints

In [248]:
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, useFeatures=True)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, useFeatures=True)
In [249]:
DataStructs.DiceSimilarity(fp1,fp2)
Out[249]:
0.3
In [250]:
DataStructs.FingerprintSimilarity(fp1, fp2)
Out[250]:
0.17647058823529413

TASK 3. Write a script to retrive SMILES of the similar compounds from logBB_big.sdf file to the N1CCNCC1c1ccccc1 molecule based on Tanimoto score and topological and Morgan fingerprints.

In [251]:
mol = Chem.MolFromSmiles("N1CCNCC1c1ccccc1")
tp = FingerprintMols.FingerprintMol(mol)
mg = AllChem.GetMorganFingerprintAsBitVect(mol, 2, useFeatures=True)

d = {}
for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
    if m is not None:
        tp_ = FingerprintMols.FingerprintMol(m)
        mg_ = AllChem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True)
        d[Chem.MolToSmiles(m)] = (DataStructs.FingerprintSimilarity(tp, tp_), DataStructs.FingerprintSimilarity(mg, mg_))
RDKit ERROR: [22:19:59] ERROR: non-ring atom 26 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561
RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 0 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 15523
RDKit ERROR: [22:20:00] ERROR: non-ring atom 0 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 17846
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19055
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19968
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 20965
RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 20 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22625
RDKit ERROR: [22:20:00] ERROR: non-ring atom 20 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22844
RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 19 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 23484
RDKit ERROR: [22:20:00] ERROR: non-ring atom 19 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 12 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 24286
RDKit ERROR: [22:20:00] ERROR: non-ring atom 12 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 23 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 25470
RDKit ERROR: [22:20:00] ERROR: non-ring atom 23 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 26 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 26268

Sort output dictionary by descending order of the topological similarity and return top 5 molecules

In [252]:
sorted(d.items(), key=lambda x: -x[1][0])[:5]
Out[252]:
[('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333)),
 ('CCCOC(C)=O', (0.5238095238095238, 0.037037037037037035)),
 ('CC(=O)OC(C)C', (0.5238095238095238, 0.04)),
 ('CC(N)Cc1ccccc1', (0.5054945054945055, 0.30434782608695654)),
 ('FC(F)OC(F)C(F)(F)F', (0.5041322314049587, 0.037037037037037035))]

Do the same with Morgan similarity

In [253]:
sorted(d.items(), key=lambda x: -x[1][1])[:5]
Out[253]:
[('CC(C)(C)c1ccccc1', (0.35978835978835977, 0.3684210526315789)),
 ('c1cncc(C2CCCN2)c1', (0.4752186588921283, 0.3448275862068966)),
 ('Cc1ccccc1', (0.4603174603174603, 0.3333333333333333)),
 ('Cc1ccccc1C', (0.4051724137931034, 0.3333333333333333)),
 ('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333))]

2D pharmacophore fingerprints

In [254]:
from rdkit.Chem.Pharm2D import Gobbi_Pharm2D, Generate
In [255]:
m1 = Chem.MolFromSmiles("NCCCCO")
m2 = Chem.MolFromSmiles("NCCCCCCO")
In [256]:
fp1 = Generate.Gen2DFingerprint(m1, Gobbi_Pharm2D.factory)
fp2 = Generate.Gen2DFingerprint(m2, Gobbi_Pharm2D.factory)
In [257]:
DataStructs.FingerprintSimilarity(fp1, fp2)
Out[257]:
0.06153846153846154

Reactions

Reactions can be constructed based on SMARTS definitions of reactants and products

In [258]:
rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')
In [259]:
p = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'),Chem.MolFromSmiles('NC')))
In [260]:
len(p)
Out[260]:
1
In [261]:
p[0][0]
Out[261]:
O NH

Generation of combinatorial library

In [262]:
acids = ['c1ccccc1C(=O)O', 'CC(=O)O', 'OC(=O)CC(=O)O']
bases = ['CCN', 'CCNC', 'CNCCN']

a = [Chem.MolFromSmiles(s) for s in acids]
b = [Chem.MolFromSmiles(s) for s in bases]
In [263]:
from itertools import product

p = []
for s in product(a, b):
    p.extend(rxn.RunReactants(s))    
In [264]:
for item in p:
    print(Chem.MolToSmiles(item[0]))
CCNC(=O)c1ccccc1
CCN(C)C(=O)c1ccccc1
CN(CCN)C(=O)c1ccccc1
CNCCNC(=O)c1ccccc1
CCNC(C)=O
CCN(C)C(C)=O
CC(=O)N(C)CCN
CNCCNC(C)=O
CCNC(=O)CC(=O)O
CCNC(=O)CC(=O)O
CCN(C)C(=O)CC(=O)O
CCN(C)C(=O)CC(=O)O
CN(CCN)C(=O)CC(=O)O
CNCCNC(=O)CC(=O)O
CN(CCN)C(=O)CC(=O)O
CNCCNC(=O)CC(=O)O

However, products may contain duplicates, they can be removed based on canonical SMILES of products

In [265]:
set(Chem.MolToSmiles(item[0]) for item in p)
Out[265]:
{'CC(=O)N(C)CCN',
 'CCN(C)C(=O)CC(=O)O',
 'CCN(C)C(=O)c1ccccc1',
 'CCN(C)C(C)=O',
 'CCNC(=O)CC(=O)O',
 'CCNC(=O)c1ccccc1',
 'CCNC(C)=O',
 'CN(CCN)C(=O)CC(=O)O',
 'CN(CCN)C(=O)c1ccccc1',
 'CNCCNC(=O)CC(=O)O',
 'CNCCNC(=O)c1ccccc1',
 'CNCCNC(C)=O'}

Drawing molecules

In [266]:
from rdkit.Chem import Draw

Create a molecule and compute coordinates

In [267]:
m = Chem.MolFromSmiles('CC(=O)O')
AllChem.Compute2DCoords(m)
Out[267]:
0

Image can be directly saved to a file or obtaind as PIL image

In [268]:
Draw.MolToFile(m, 'data/mol.png')
In [269]:
Draw.MolToImage(m, (100, 100))
Out[269]:

Multiple molecules can be visualized as a grid. Let's visualize unique products from conbinatorial library generation.

In [270]:
pu = list(set(Chem.MolToSmiles(item[0]) for item in p))
pu
Out[270]:
['CCN(C)C(=O)c1ccccc1',
 'CCN(C)C(C)=O',
 'CCNC(=O)CC(=O)O',
 'CC(=O)N(C)CCN',
 'CCNC(C)=O',
 'CN(CCN)C(=O)CC(=O)O',
 'CN(CCN)C(=O)c1ccccc1',
 'CNCCNC(C)=O',
 'CNCCNC(=O)c1ccccc1',
 'CNCCNC(=O)CC(=O)O',
 'CCNC(=O)c1ccccc1',
 'CCN(C)C(=O)CC(=O)O']
In [271]:
Draw.MolsToGridImage([Chem.MolFromSmiles(s) for s in pu], molsPerRow=4, subImgSize=(200,200), legends=pu)
Out[271]:
N O CCN(C)C(=O)c1ccccc1 N O CCN(C)C(C)=O NH O O OH CCNC(=O)CC(=O)O O N H2N CC(=O)N(C)CCN NH O CCNC(C)=O N NH2 O O OH CN(CCN)C(=O)CC(=O)O N NH2 O CN(CCN)C(=O)c1ccccc1 NH NH O CNCCNC(C)=O NH NH O CNCCNC(=O)c1ccccc1 NH NH O O OH CNCCNC(=O)CC(=O)O NH O CCNC(=O)c1ccccc1 N O O HO CCN(C)C(=O)CC(=O)O

Recommended sources:

  1. RDKit official web-site: http://www.rdkit.org/docs/index.html
  2. RDKit maillists: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/