For the school on chemoinformatics (BIGCHEM project). Munich, 17-21 October, 2016.¶

Dr. Pavel Polishchuk

Basic of RDKit¶

RDKit is a an open-source cross-platform chemoinformatics toolkit.
Written in C++, supports Python 2 and 3, Java and C#.
BSD license

2000-2006: Developed and used at Rational Discovery for building predictive models for ADME, Tox, biological activity
June 2006: Open-source (BSD license) release of software, Rational Discovery shuts down
to present: Open-source development continues, use within Novartis, contributions from Novartis back to open-source version

Detailed documentation with tutorials and examples are available on http://www.rdkit.org/docs/index.html
Recently new github repository was created to manage RDKit tutorials: https://github.com/rdkit/rdkit-tutorials
I highly recommend to subscribe on RDKit maillist https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/

Reading and writing molecules¶

RDKit supports various formats: SMILES, Mol, SDF, Mol2, PDB, FASTA, etc.

from rdkit import Chem

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True

Reading¶

In the case of successful reading the functions return mol object, otherwise None. The latter can be used to check whether reading was successful or not.

SMILES¶

Coordinates for 2D depiction are generated automatically.

m = Chem.MolFromSmiles("c1ccccc1OC")

m

Reading of a structure with errors leads to errors, result will be None.

m = Chem.MolFromSmiles("c1ccccc1O(C)C")

RDKit ERROR: [21:09:49] ERROR: non-ring atom 26 marked aromatic
RDKit ERROR: [21:53:10] 
RDKit ERROR: 
RDKit ERROR: ****
RDKit ERROR: Pre-condition Violation
RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()
RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp
RDKit ERROR: Failed Expression: d_explicitValence > -1
RDKit ERROR: ****
RDKit ERROR: 
RDKit ERROR: [21:53:39] 
RDKit ERROR: 
RDKit ERROR: ****
RDKit ERROR: Pre-condition Violation
RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()
RDKit ERROR: Violation occurred on line 174 in file /home/rdkit/miniconda/conda-bld/work/Code/GraphMol/Atom.cpp
RDKit ERROR: Failed Expression: d_explicitValence > -1
RDKit ERROR: ****
RDKit ERROR: 
RDKit ERROR: [22:19:58] Explicit valence for atom # 6 O, 3, is greater than permitted

m is None

True

reading from MolBlock¶

MolBlock doesn't contain property fields like in SDF file.

molblock = """
  Mrv1661310131608212D          

 16 16  0  0  0  0            999 V2000
   -5.9598    1.6732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.6743    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -6.6743    0.4357    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598    0.0232    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2454    0.4357    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.2454    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.5309    1.6732    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.8164    1.2607    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598    2.4982    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3888    1.6732    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -7.3888    0.0232    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -5.9598   -0.8018    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.5309    0.0232    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -4.2289    0.5462    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1019    0.8482    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4039    1.9752    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  4  0  0  0  0
  2  3  4  0  0  0  0
  3  4  4  0  0  0  0
  4  5  4  0  0  0  0
  5  6  4  0  0  0  0
  1  6  4  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  1  9  1  0  0  0  0
  2 10  1  0  0  0  0
  3 11  1  0  0  0  0
  4 12  1  0  0  0  0
  5 13  1  0  0  0  0
  8 14  1  0  0  0  0
  8 15  1  0  0  0  0
  8 16  1  0  0  0  0
M  END
"""
m = Chem.MolFromMolBlock(molblock)
m

Mol files¶

m = Chem.MolFromMolFile("data/anisole.mol")

m

reading from SDF¶

iterator = Chem.SDMolSupplier("data/logBB.sdf")
for m in iterator:
    if m is not None:              # test whether molecule was read
        print(m.GetNumAtoms())     # returns number of heavy atoms only, Hs were stripped

5
5
5
5

mols = [m for m in iterator]
len(mols)

4

In the second case if some molecules failed to read there will be some Nones in the output list which should be removed

mols = [m for m in iterator if m is not None]
len(mols)

4

You may use Supplier as a random-access object:

m = iterator[2]
m

m = iterator[0]
m

reading from gzipped SDF and other file-like objects¶

In this case you cannot use random access.

import gzip
iterator = Chem.ForwardSDMolSupplier(gzip.open("data/logBB.sdf.gz"))
mols = [m for m in iterator if m is not None]
len(mols)

4

mols[0]

Hydrogens¶

By default hydrogens are removed during reading.

m = Chem.MolFromMolFile("data/anisole.mol")
m

You may add them back manually, however their coordinates will not be recalculated.

m = Chem.AddHs(m)
m

You may generate 2D coordinates for atoms to a obtain more reasonable depiction

from rdkit.Chem import AllChem

AllChem.Compute2DCoords(m)

0

m

To avoid lose of hydrogens add removeHs = False, it will keep them during reading.

m = Chem.MolFromMolFile("data/anisole.mol", removeHs = False)
m

This can be particularly important for molecules with chiral centers with attached hydrogens.

m = Chem.MolFromMolFile("data/chlorofluoroethane.mol")
m

m = Chem.MolFromMolFile("data/chlorofluoroethane.mol", removeHs = False)
m

SMILES reading automatically takes into account chiral hydrogen and preserves them.

m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
m

However, explicitly specified Hs will be removed.

m = Chem.MolFromSmiles("C([H])(F)(Cl)C")
m

Writing¶

SMILES¶

By default saving to SMILES provides canonical SMILES but without chirality

m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
Chem.MolToSmiles(m)

'CC(F)Cl'

Chem.MolToSmiles(m, isomericSmiles = True)

'C[C@H](F)Cl'

MolBlock¶

molblock = Chem.MolToMolBlock(m)
print(molblock)

     RDKit          

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
  1  4  1  0
M  END

However, chirality was lost. Specifying includeStereo = True doesn't help.

molblock = Chem.MolToMolBlock(m, includeStereo = True)
print(molblock)

     RDKit          

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  1  3  1  0
  1  4  1  0
M  END

You may note that MolToMolBlock produces output with all coordinates identical. Therefore it may be useful to generate them to obtain reasonable 2D depiction. At the same time stereoinformation will appear.

AllChem.Compute2DCoords(m)
molblock = Chem.MolToMolBlock(m)
print(molblock)

     RDKit          2D

  4  3  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2990   -0.7500    0.0000 F   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    1.5000    0.0000 Cl  0  0  0  0  0  0  0  0  0  0  0  0
   -1.2990   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  1
  1  3  1  0
  1  4  1  0
M  END

Mol files¶

You may use Python file objects to write mol files.

print(Chem.MolToMolBlock(m), file=open('data/foo.mol','w+'))

SDF¶

w = Chem.SDWriter('data/foo.sdf')
for m in mols:
    w.write(m)
w.close()

You may use file object and write molecules to gzipped sdf

Analogously there is SmilesWriter to create text files containing SMILES representation of molecules

Structure Sanitization¶

http://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization
The molecule parsing functions all, by default, perform a “sanitization” operation on the molecules read. The idea is to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to ensure that the molecules are “reasonable”: that they can be represented with octet-complete Lewis dot structures.

Here are the steps involved, in order.

clearComputedProps: removes any computed properties that already exist on the molecule and its atoms and bonds. This step is always performed.
cleanUp: standardizes a small number of non-standard valence states. The clean up operations are:
- Neutral 5 valent Ns with double bonds to Os are converted to the zwitterionic form. Example: N(=O)=O -> [N+](=O)O-]
- Neutral 5 valent Ns with triple bonds to another N are converted to the zwitterionic form. Example: C-N=N#N -> C-N=[N+]=[N-]
- Neutral 5 valent phosphorus with one double bond to an O and another to either a C or a P are converted to the zwitterionic form. Example: C=P(=O)O -> C=[P+]([O-])O
- Neutral Cl, Br, or I with exclusively O neighbors, and a valence of 3, 5, or 7, are converted to the zwitterionic form. This covers things like chlorous acid, chloric acid, and perchloric acid. Example: O=Cl(=O)O -> [O-][Cl+2][O-]O
  This step should not generate exceptions.

updatePropertyCache: calculates the explicit and implicit valences on all atoms. This generates exceptions for atoms in higher-than-allowed valence states. This step is always performed, but if it is “skipped” the test for non-standard valences will not be carried out.
symmetrizeSSSR: calls the symmetrized smallest set of smallest rings algorithm (discussed in the Getting Started document).
Kekulize: converts aromatic rings to their Kekule form. Will raise an exception if a ring cannot be kekulized or if aromatic bonds are found outside of rings.
assignRadicals: determines the number of radical electrons (if any) on each atom.
setAromaticity: identifies the aromatic rings and ring systems (see above), sets the aromatic flag on atoms and bonds, sets bond orders to aromatic.
setConjugation: identifies which bonds are conjugated
setHybridization: calculates the hybridization state of each atom
cleanupChirality: removes chiral tags from atoms that are not sp3 hybridized.
adjustHs: adds explicit Hs where necessary to preserve the chemistry. This is typically needed for heteroatoms in aromatic rings. The classic example is the nitrogen atom in pyrrole.

The individual steps can be toggled on or off when calling MolOps::sanitizeMol or Chem.SanitizeMol.

m = Chem.MolFromSmiles('Cn(:o):o')

RDKit ERROR: [22:19:59] non-ring atom 1 marked aromatic

Working with molecules¶

Looping over atoms and bonds¶

m = Chem.MolFromSmiles('C1OC=C1')

for atom in m.GetAtoms():
    print(atom.GetAtomicNum())

6
8
6
6

for bond in m.GetBonds():
    print(bond.GetBondType())

SINGLE
SINGLE
DOUBLE
SINGLE

Individual atoms and bonds can be accesses as well as their properties

m.GetAtomWithIdx(1).GetSymbol()

'O'

m.GetAtomWithIdx(2).GetExplicitValence()

3

m.GetBondWithIdx(0).GetBeginAtomIdx()

0

m.GetBondBetweenAtoms(0,1).GetBondType()

rdkit.Chem.rdchem.BondType.SINGLE

You may loop through neigbours of particular atoms

atom = m.GetAtomWithIdx(2)

for nei in atom.GetNeighbors():
    print("root atom %s is connected to %s by %s bond" % (atom.GetSymbol(), nei.GetSymbol(), m.GetBondBetweenAtoms(atom.GetIdx(), nei.GetIdx()).GetBondType()))

root atom C is connected to O by SINGLE bond
root atom C is connected to C by DOUBLE bond

Molecules properties¶

You may set and read properties of molecules, which can be stored in property fields of sdf files.

m.SetProp("Activity", "inactive")

m.SetIntProp("Boiling point", 40)

m.GetProp("Boiling point")

'40'

Magic properties¶

There are a lot of 'magic' properties of atoms/bonds and molecules. More details can be found at http://www.rdkit.org/docs/RDKit_Book.html#magic-property-values
One of them is a title or a name of a molecule ("_Name")

m.SetProp("_Name", "molecule name")

m.GetProp("_Name")

'molecule name'

When you save this molecule to sdf file, _Name property will be stored as a title, all others as ordinary property fields.

w = Chem.SDWriter('data/bar.sdf')
w.write(m)
w.close()

3D structures and conformers¶

Generation of 3D structure

m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m

Since by default RDKit doesn't keep hydrogens they should be added before 3D structure generation to obtain rasonable geometry

m = Chem.AddHs(m)
m

This command generates 3D structure for a molecule usinf distance matrix approach

AllChem.EmbedMolecule(m)

0

The obtained geometry usually is quite ugly and refinement is neccessary. This can be done by using universal force field (UFF) or Merck molecular force field (MMFF).

AllChem.UFFOptimizeMolecule(m)

0

m

AllChem.MMFFOptimizeMolecule(m)

0

m

Generation of multiple conformers¶

There are a lot of input options. The major ones are the number of conformers and RMS threshold, which will help to discard too similar 3D structures and keep diverse conformers.

cids = AllChem.EmbedMultipleConfs(m, numConfs=10, pruneRmsThresh=1)

len(cids)

7

You may see that from 10 required conformers only 7 were generated. If you will descrese the RMS threshold value more conformers will be kept up to specified maximum value (10).

Generated conformers require geometry optimization to produce more reasonable 3D structures.

for cid in cids:
    AllChem.MMFFOptimizeMolecule(m, confId=cid)

For conformers you may corresponding return energy values. Since we optimized geometry with MMFF we will use the same force field for energy calculation.

for cid in cids:
    ff = AllChem.MMFFGetMoleculeForceField(m, AllChem.MMFFGetMoleculeProperties(m), confId=cid)
    print(ff.CalcEnergy())

64.09232266554531
56.52251338020919
64.09232254579031
64.0923226735707
56.52251334421818
56.522513373936874
64.53225822753267

Substructure search¶

m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m

q = Chem.MolFromSmarts('CCN')

m.HasSubstructMatch(q)

True

m.GetSubstructMatch(q)

(1, 2, 3)

m.GetSubstructMatches(q)

((1, 2, 3), (5, 4, 3), (7, 6, 3))

TASK 1. Write a script to retrive compound names having carboxylic acid group from the sdf file - logBB_big.sdf.
TASK 2. Write a script to retrive compound names having more than one carboxylic acid group.

for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
    if m is not None:
        if m.HasSubstructMatch(Chem.MolFromSmarts('C(=O)[OH]')):
            print(m.GetProp("_Name"))

MolID_208
MolID_215
MolID_280
MolID_281
MolID_317
MolID_172
MolID_71
MolID_41
MolID_180
MolID_118
MolID_160
MolID_184
MolID_101
MolID_152
MolID_100

RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561
RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 0 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 15523
RDKit ERROR: [22:19:59] ERROR: non-ring atom 0 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 17846
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19055
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 19968
RDKit ERROR: [22:19:59] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 20965
RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 20 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22625
RDKit ERROR: [22:19:59] ERROR: non-ring atom 20 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 22844
RDKit ERROR: [22:19:59] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 19 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 23484
RDKit ERROR: [22:19:59] ERROR: non-ring atom 19 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 12 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 24286
RDKit ERROR: [22:19:59] ERROR: non-ring atom 12 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 23 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 25470
RDKit ERROR: [22:19:59] ERROR: non-ring atom 23 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 26 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 26268

Descriptors calculation¶

There are a lot of descriptors available in RDKit - http://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors

from rdkit.Chem import Descriptors
m = Chem.MolFromSmiles('c1ccncc1C(=O)O')

Descriptors.TPSA(m)

50.19

Descriptors.MolLogP(m)

0.7797999999999998

Charges are computed differently

AllChem.ComputeGasteigerCharges(m)
for a in m.GetAtoms():
    print("atom %s id=%i has %f charge" % (a.GetSymbol(), a.GetIdx(), float(a.GetProp('_GasteigerCharge'))))

atom C id=0 has -0.044587 charge
atom C id=1 has -0.042959 charge
atom C id=2 has 0.026786 charge
atom N id=3 has -0.263835 charge
atom C id=4 has 0.041281 charge
atom C id=5 has 0.077736 charge
atom C id=6 has 0.336759 charge
atom O id=7 has -0.246333 charge
atom O id=8 has -0.477599 charge

Fingerprints¶

Topological fingerprints¶

They are topological paths between pairs of atoms on a specified distance (defaults: min 1, max 7)

from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

m1 = Chem.MolFromSmiles("CCCCO")
m2 = Chem.MolFromSmiles("c1ccccc1CO")

fp1 = FingerprintMols.FingerprintMol(m1)
fp2 = FingerprintMols.FingerprintMol(m2)

DataStructs.FingerprintSimilarity(fp1, fp2)

0.14

By default for calculation of similarity Tanimoto is used, but there are some other implemented metrics: Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky

DataStructs.FingerprintSimilarity(fp1,fp2, metric=DataStructs.DiceSimilarity)

0.24561403508771928

MACCS keys¶

They are SMARTS patterns of different functiong groups and common fragments: http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html

from rdkit.Chem import MACCSkeys

fp1 = MACCSkeys.GenMACCSKeys(m1)
fp2 = MACCSkeys.GenMACCSKeys(m2)

DataStructs.FingerprintSimilarity(fp1, fp2)

0.3684210526315789

Morgan fingerprints (Circular Fingerprints)¶

fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)

Morgan fingerprints by default are counts, therefore dice similarity should be used as a metric.

DataStructs.DiceSimilarity(fp1, fp2)

0.16666666666666666

But it is possible calculate Morgan fingerprints as bit vectors.

fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2)

DataStructs.DiceSimilarity(fp1,fp2)

0.24

DataStructs.FingerprintSimilarity(fp1, fp2)

0.13636363636363635

Atoms can be labeled by feature types (like pharmacophore features): http://www.rdkit.org/docs/GettingStartedInPython.html#feature-definitions-used-in-the-morgan-fingerprints

fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, useFeatures=True)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, useFeatures=True)

DataStructs.DiceSimilarity(fp1,fp2)

0.3

DataStructs.FingerprintSimilarity(fp1, fp2)

0.17647058823529413

TASK 3. Write a script to retrive SMILES of the similar compounds from logBB_big.sdf file to the N1CCNCC1c1ccccc1 molecule based on Tanimoto score and topological and Morgan fingerprints.

mol = Chem.MolFromSmiles("N1CCNCC1c1ccccc1")
tp = FingerprintMols.FingerprintMol(mol)
mg = AllChem.GetMorganFingerprintAsBitVect(mol, 2, useFeatures=True)

d = {}
for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
    if m is not None:
        tp_ = FingerprintMols.FingerprintMol(m)
        mg_ = AllChem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True)
        d[Chem.MolToSmiles(m)] = (DataStructs.FingerprintSimilarity(tp, tp_), DataStructs.FingerprintSimilarity(mg, mg_))

RDKit ERROR: [22:19:59] ERROR: non-ring atom 26 marked aromatic
RDKit ERROR: [22:19:59] non-ring atom 4 marked aromatic
RDKit ERROR: [22:19:59] ERROR: Could not sanitize molecule ending on line 6561
RDKit ERROR: [22:19:59] ERROR: non-ring atom 4 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 0 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 15523
RDKit ERROR: [22:20:00] ERROR: non-ring atom 0 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 17846
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19055
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 19968
RDKit ERROR: [22:20:00] ERROR: non-ring atom 16 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 20965
RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 20 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22625
RDKit ERROR: [22:20:00] ERROR: non-ring atom 20 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 22844
RDKit ERROR: [22:20:00] ERROR: non-ring atom 18 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 19 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 23484
RDKit ERROR: [22:20:00] ERROR: non-ring atom 19 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 12 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 24286
RDKit ERROR: [22:20:00] ERROR: non-ring atom 12 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 23 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 25470
RDKit ERROR: [22:20:00] ERROR: non-ring atom 23 marked aromatic
RDKit ERROR: [22:20:00] non-ring atom 26 marked aromatic
RDKit ERROR: [22:20:00] ERROR: Could not sanitize molecule ending on line 26268

Sort output dictionary by descending order of the topological similarity and return top 5 molecules

sorted(d.items(), key=lambda x: -x[1][0])[:5]

[('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333)),
 ('CCCOC(C)=O', (0.5238095238095238, 0.037037037037037035)),
 ('CC(=O)OC(C)C', (0.5238095238095238, 0.04)),
 ('CC(N)Cc1ccccc1', (0.5054945054945055, 0.30434782608695654)),
 ('FC(F)OC(F)C(F)(F)F', (0.5041322314049587, 0.037037037037037035))]

Do the same with Morgan similarity

sorted(d.items(), key=lambda x: -x[1][1])[:5]

[('CC(C)(C)c1ccccc1', (0.35978835978835977, 0.3684210526315789)),
 ('c1cncc(C2CCCN2)c1', (0.4752186588921283, 0.3448275862068966)),
 ('Cc1ccccc1', (0.4603174603174603, 0.3333333333333333)),
 ('Cc1ccccc1C', (0.4051724137931034, 0.3333333333333333)),
 ('CNC(C)Cc1ccccc1', (0.5567567567567567, 0.3333333333333333))]

2D pharmacophore fingerprints¶

from rdkit.Chem.Pharm2D import Gobbi_Pharm2D, Generate

m1 = Chem.MolFromSmiles("NCCCCO")
m2 = Chem.MolFromSmiles("NCCCCCCO")

fp1 = Generate.Gen2DFingerprint(m1, Gobbi_Pharm2D.factory)
fp2 = Generate.Gen2DFingerprint(m2, Gobbi_Pharm2D.factory)

DataStructs.FingerprintSimilarity(fp1, fp2)

0.06153846153846154

Reactions¶

Reactions can be constructed based on SMARTS definitions of reactants and products

rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')

p = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'),Chem.MolFromSmiles('NC')))

len(p)

1

p[0][0]

Generation of combinatorial library

acids = ['c1ccccc1C(=O)O', 'CC(=O)O', 'OC(=O)CC(=O)O']
bases = ['CCN', 'CCNC', 'CNCCN']

a = [Chem.MolFromSmiles(s) for s in acids]
b = [Chem.MolFromSmiles(s) for s in bases]

from itertools import product

p = []
for s in product(a, b):
    p.extend(rxn.RunReactants(s))

for item in p:
    print(Chem.MolToSmiles(item[0]))

CCNC(=O)c1ccccc1
CCN(C)C(=O)c1ccccc1
CN(CCN)C(=O)c1ccccc1
CNCCNC(=O)c1ccccc1
CCNC(C)=O
CCN(C)C(C)=O
CC(=O)N(C)CCN
CNCCNC(C)=O
CCNC(=O)CC(=O)O
CCNC(=O)CC(=O)O
CCN(C)C(=O)CC(=O)O
CCN(C)C(=O)CC(=O)O
CN(CCN)C(=O)CC(=O)O
CNCCNC(=O)CC(=O)O
CN(CCN)C(=O)CC(=O)O
CNCCNC(=O)CC(=O)O

However, products may contain duplicates, they can be removed based on canonical SMILES of products

set(Chem.MolToSmiles(item[0]) for item in p)

{'CC(=O)N(C)CCN',
 'CCN(C)C(=O)CC(=O)O',
 'CCN(C)C(=O)c1ccccc1',
 'CCN(C)C(C)=O',
 'CCNC(=O)CC(=O)O',
 'CCNC(=O)c1ccccc1',
 'CCNC(C)=O',
 'CN(CCN)C(=O)CC(=O)O',
 'CN(CCN)C(=O)c1ccccc1',
 'CNCCNC(=O)CC(=O)O',
 'CNCCNC(=O)c1ccccc1',
 'CNCCNC(C)=O'}

Drawing molecules¶

from rdkit.Chem import Draw

Create a molecule and compute coordinates

m = Chem.MolFromSmiles('CC(=O)O')
AllChem.Compute2DCoords(m)

0

Image can be directly saved to a file or obtaind as PIL image

Draw.MolToFile(m, 'data/mol.png')

Draw.MolToImage(m, (100, 100))

Multiple molecules can be visualized as a grid. Let's visualize unique products from conbinatorial library generation.

pu = list(set(Chem.MolToSmiles(item[0]) for item in p))
pu

['CCN(C)C(=O)c1ccccc1',
 'CCN(C)C(C)=O',
 'CCNC(=O)CC(=O)O',
 'CC(=O)N(C)CCN',
 'CCNC(C)=O',
 'CN(CCN)C(=O)CC(=O)O',
 'CN(CCN)C(=O)c1ccccc1',
 'CNCCNC(C)=O',
 'CNCCNC(=O)c1ccccc1',
 'CNCCNC(=O)CC(=O)O',
 'CCNC(=O)c1ccccc1',
 'CCN(C)C(=O)CC(=O)O']

Draw.MolsToGridImage([Chem.MolFromSmiles(s) for s in pu], molsPerRow=4, subImgSize=(200,200), legends=pu)

Recommended sources:

RDKit official web-site: http://www.rdkit.org/docs/index.html
RDKit maillists: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/