Dr. Pavel Polishchuk
RDKit is a an open-source cross-platform chemoinformatics toolkit.
Written in C++, supports Python 2 and 3, Java and C#.
BSD license
2000-2006: Developed and used at Rational Discovery for building predictive models for ADME, Tox, biological activity
June 2006: Open-source (BSD license) release of software, Rational Discovery shuts down
to present: Open-source development continues, use within Novartis, contributions from Novartis back to open-source version
Detailed documentation with tutorials and examples are available on http://www.rdkit.org/docs/index.html
Recently new github repository was created to manage RDKit tutorials: https://github.com/rdkit/rdkit-tutorials
I highly recommend to subscribe on RDKit maillist https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/
RDKit supports various formats: SMILES, Mol, SDF, Mol2, PDB, FASTA, etc.
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True
In the case of successful reading the functions return mol object, otherwise None
. The latter can be used to check whether reading was successful or not.
Coordinates for 2D depiction are generated automatically.
m = Chem.MolFromSmiles("c1ccccc1OC")
m
Reading of a structure with errors leads to errors, result will be None
.
m = Chem.MolFromSmiles("c1ccccc1O(C)C")
m is None
MolBlock doesn't contain property fields like in SDF file.
molblock = """
Mrv1661310131608212D
16 16 0 0 0 0 999 V2000
-5.9598 1.6732 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.6743 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-6.6743 0.4357 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.9598 0.0232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2454 0.4357 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.2454 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-4.5309 1.6732 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-3.8164 1.2607 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-5.9598 2.4982 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-7.3888 1.6732 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-7.3888 0.0232 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-5.9598 -0.8018 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-4.5309 0.0232 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-4.2289 0.5462 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.1019 0.8482 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
-3.4039 1.9752 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 4 0 0 0 0
2 3 4 0 0 0 0
3 4 4 0 0 0 0
4 5 4 0 0 0 0
5 6 4 0 0 0 0
1 6 4 0 0 0 0
6 7 1 0 0 0 0
7 8 1 0 0 0 0
1 9 1 0 0 0 0
2 10 1 0 0 0 0
3 11 1 0 0 0 0
4 12 1 0 0 0 0
5 13 1 0 0 0 0
8 14 1 0 0 0 0
8 15 1 0 0 0 0
8 16 1 0 0 0 0
M END
"""
m = Chem.MolFromMolBlock(molblock)
m
m = Chem.MolFromMolFile("data/anisole.mol")
m
iterator = Chem.SDMolSupplier("data/logBB.sdf")
for m in iterator:
if m is not None: # test whether molecule was read
print(m.GetNumAtoms()) # returns number of heavy atoms only, Hs were stripped
mols = [m for m in iterator]
len(mols)
In the second case if some molecules failed to read there will be some None
s in the output list which should be removed
mols = [m for m in iterator if m is not None]
len(mols)
You may use Supplier as a random-access object:
m = iterator[2]
m
m = iterator[0]
m
In this case you cannot use random access.
import gzip
iterator = Chem.ForwardSDMolSupplier(gzip.open("data/logBB.sdf.gz"))
mols = [m for m in iterator if m is not None]
len(mols)
mols[0]
By default hydrogens are removed during reading.
m = Chem.MolFromMolFile("data/anisole.mol")
m
You may add them back manually, however their coordinates will not be recalculated.
m = Chem.AddHs(m)
m
You may generate 2D coordinates for atoms to a obtain more reasonable depiction
from rdkit.Chem import AllChem
AllChem.Compute2DCoords(m)
m
To avoid lose of hydrogens add removeHs = False
, it will keep them during reading.
m = Chem.MolFromMolFile("data/anisole.mol", removeHs = False)
m
This can be particularly important for molecules with chiral centers with attached hydrogens.
m = Chem.MolFromMolFile("data/chlorofluoroethane.mol")
m
m = Chem.MolFromMolFile("data/chlorofluoroethane.mol", removeHs = False)
m
SMILES reading automatically takes into account chiral hydrogen and preserves them.
m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
m
However, explicitly specified Hs will be removed.
m = Chem.MolFromSmiles("C([H])(F)(Cl)C")
m
By default saving to SMILES provides canonical SMILES but without chirality
m = Chem.MolFromSmiles("[C@@H](F)(Cl)C")
Chem.MolToSmiles(m)
Chem.MolToSmiles(m, isomericSmiles = True)
molblock = Chem.MolToMolBlock(m)
print(molblock)
However, chirality was lost. Specifying includeStereo = True
doesn't help.
molblock = Chem.MolToMolBlock(m, includeStereo = True)
print(molblock)
You may note that MolToMolBlock
produces output with all coordinates identical. Therefore it may be useful to generate them to obtain reasonable 2D depiction. At the same time stereoinformation will appear.
AllChem.Compute2DCoords(m)
molblock = Chem.MolToMolBlock(m)
print(molblock)
You may use Python file objects to write mol files.
print(Chem.MolToMolBlock(m), file=open('data/foo.mol','w+'))
w = Chem.SDWriter('data/foo.sdf')
for m in mols:
w.write(m)
w.close()
You may use file object and write molecules to gzipped sdf
Analogously there is SmilesWriter to create text files containing SMILES representation of molecules
http://www.rdkit.org/docs/RDKit_Book.html#molecular-sanitization
The molecule parsing functions all, by default, perform a “sanitization” operation on the molecules read. The idea is to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to ensure that the molecules are “reasonable”: that they can be represented with octet-complete Lewis dot structures.
Here are the steps involved, in order.
clearComputedProps
: removes any computed properties that already exist on the molecule and its atoms and bonds. This step is always performed.
cleanUp
: standardizes a small number of non-standard valence states. The clean up operations are:
updatePropertyCache
: calculates the explicit and implicit valences on all atoms. This generates exceptions for atoms in higher-than-allowed valence states. This step is always performed, but if it is “skipped” the test for non-standard valences will not be carried out.
symmetrizeSSSR
: calls the symmetrized smallest set of smallest rings algorithm (discussed in the Getting Started document).
Kekulize
: converts aromatic rings to their Kekule form. Will raise an exception if a ring cannot be kekulized or if aromatic bonds are found outside of rings.
assignRadicals
: determines the number of radical electrons (if any) on each atom.
setAromaticity
: identifies the aromatic rings and ring systems (see above), sets the aromatic flag on atoms and bonds, sets bond orders to aromatic.
setConjugation
: identifies which bonds are conjugated
setHybridization
: calculates the hybridization state of each atom
cleanupChirality
: removes chiral tags from atoms that are not sp3 hybridized.
adjustHs
: adds explicit Hs where necessary to preserve the chemistry. This is typically needed for heteroatoms in aromatic rings. The classic example is the nitrogen atom in pyrrole.
The individual steps can be toggled on or off when calling MolOps::sanitizeMol
or Chem.SanitizeMol
.
m = Chem.MolFromSmiles('Cn(:o):o')
m = Chem.MolFromSmiles('C1OC=C1')
for atom in m.GetAtoms():
print(atom.GetAtomicNum())
for bond in m.GetBonds():
print(bond.GetBondType())
Individual atoms and bonds can be accesses as well as their properties
m.GetAtomWithIdx(1).GetSymbol()
m.GetAtomWithIdx(2).GetExplicitValence()
m.GetBondWithIdx(0).GetBeginAtomIdx()
m.GetBondBetweenAtoms(0,1).GetBondType()
You may loop through neigbours of particular atoms
atom = m.GetAtomWithIdx(2)
for nei in atom.GetNeighbors():
print("root atom %s is connected to %s by %s bond" % (atom.GetSymbol(), nei.GetSymbol(), m.GetBondBetweenAtoms(atom.GetIdx(), nei.GetIdx()).GetBondType()))
You may set and read properties of molecules, which can be stored in property fields of sdf files.
m.SetProp("Activity", "inactive")
m.SetIntProp("Boiling point", 40)
m.GetProp("Boiling point")
There are a lot of 'magic' properties of atoms/bonds and molecules. More details can be found at http://www.rdkit.org/docs/RDKit_Book.html#magic-property-values
One of them is a title or a name of a molecule ("_Name")
m.SetProp("_Name", "molecule name")
m.GetProp("_Name")
When you save this molecule to sdf file, _Name property will be stored as a title, all others as ordinary property fields.
w = Chem.SDWriter('data/bar.sdf')
w.write(m)
w.close()
Generation of 3D structure
m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m
Since by default RDKit doesn't keep hydrogens they should be added before 3D structure generation to obtain rasonable geometry
m = Chem.AddHs(m)
m
This command generates 3D structure for a molecule usinf distance matrix approach
AllChem.EmbedMolecule(m)
The obtained geometry usually is quite ugly and refinement is neccessary. This can be done by using universal force field (UFF) or Merck molecular force field (MMFF).
AllChem.UFFOptimizeMolecule(m)
m
AllChem.MMFFOptimizeMolecule(m)
m
There are a lot of input options. The major ones are the number of conformers and RMS threshold, which will help to discard too similar 3D structures and keep diverse conformers.
cids = AllChem.EmbedMultipleConfs(m, numConfs=10, pruneRmsThresh=1)
len(cids)
You may see that from 10 required conformers only 7 were generated. If you will descrese the RMS threshold value more conformers will be kept up to specified maximum value (10).
Generated conformers require geometry optimization to produce more reasonable 3D structures.
for cid in cids:
AllChem.MMFFOptimizeMolecule(m, confId=cid)
For conformers you may corresponding return energy values. Since we optimized geometry with MMFF we will use the same force field for energy calculation.
for cid in cids:
ff = AllChem.MMFFGetMoleculeForceField(m, AllChem.MMFFGetMoleculeProperties(m), confId=cid)
print(ff.CalcEnergy())
m = Chem.MolFromSmiles('O1CCN(CC)CC1')
m
q = Chem.MolFromSmarts('CCN')
m.HasSubstructMatch(q)
m.GetSubstructMatch(q)
m.GetSubstructMatches(q)
TASK 1. Write a script to retrive compound names having carboxylic acid group from the sdf file - logBB_big.sdf
.
TASK 2. Write a script to retrive compound names having more than one carboxylic acid group.
for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
if m is not None:
if m.HasSubstructMatch(Chem.MolFromSmarts('C(=O)[OH]')):
print(m.GetProp("_Name"))
There are a lot of descriptors available in RDKit - http://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
from rdkit.Chem import Descriptors
m = Chem.MolFromSmiles('c1ccncc1C(=O)O')
Descriptors.TPSA(m)
Descriptors.MolLogP(m)
Charges are computed differently
AllChem.ComputeGasteigerCharges(m)
for a in m.GetAtoms():
print("atom %s id=%i has %f charge" % (a.GetSymbol(), a.GetIdx(), float(a.GetProp('_GasteigerCharge'))))
They are topological paths between pairs of atoms on a specified distance (defaults: min 1, max 7)
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
m1 = Chem.MolFromSmiles("CCCCO")
m2 = Chem.MolFromSmiles("c1ccccc1CO")
fp1 = FingerprintMols.FingerprintMol(m1)
fp2 = FingerprintMols.FingerprintMol(m2)
DataStructs.FingerprintSimilarity(fp1, fp2)
By default for calculation of similarity Tanimoto is used, but there are some other implemented metrics: Tanimoto, Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky
DataStructs.FingerprintSimilarity(fp1,fp2, metric=DataStructs.DiceSimilarity)
They are SMARTS patterns of different functiong groups and common fragments: http://rdkit.org/Python_Docs/rdkit.Chem.MACCSkeys-pysrc.html
from rdkit.Chem import MACCSkeys
fp1 = MACCSkeys.GenMACCSKeys(m1)
fp2 = MACCSkeys.GenMACCSKeys(m2)
DataStructs.FingerprintSimilarity(fp1, fp2)
fp1 = AllChem.GetMorganFingerprint(m1, 2)
fp2 = AllChem.GetMorganFingerprint(m2, 2)
Morgan fingerprints by default are counts, therefore dice similarity should be used as a metric.
DataStructs.DiceSimilarity(fp1, fp2)
But it is possible calculate Morgan fingerprints as bit vectors.
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2)
DataStructs.DiceSimilarity(fp1,fp2)
DataStructs.FingerprintSimilarity(fp1, fp2)
Atoms can be labeled by feature types (like pharmacophore features): http://www.rdkit.org/docs/GettingStartedInPython.html#feature-definitions-used-in-the-morgan-fingerprints
fp1 = AllChem.GetMorganFingerprintAsBitVect(m1, 2, useFeatures=True)
fp2 = AllChem.GetMorganFingerprintAsBitVect(m2, 2, useFeatures=True)
DataStructs.DiceSimilarity(fp1,fp2)
DataStructs.FingerprintSimilarity(fp1, fp2)
TASK 3. Write a script to retrive SMILES of the similar compounds from logBB_big.sdf file to the N1CCNCC1c1ccccc1 molecule based on Tanimoto score and topological and Morgan fingerprints.
mol = Chem.MolFromSmiles("N1CCNCC1c1ccccc1")
tp = FingerprintMols.FingerprintMol(mol)
mg = AllChem.GetMorganFingerprintAsBitVect(mol, 2, useFeatures=True)
d = {}
for m in Chem.SDMolSupplier('data/logBB_big.sdf'):
if m is not None:
tp_ = FingerprintMols.FingerprintMol(m)
mg_ = AllChem.GetMorganFingerprintAsBitVect(m, 2, useFeatures=True)
d[Chem.MolToSmiles(m)] = (DataStructs.FingerprintSimilarity(tp, tp_), DataStructs.FingerprintSimilarity(mg, mg_))
Sort output dictionary by descending order of the topological similarity and return top 5 molecules
sorted(d.items(), key=lambda x: -x[1][0])[:5]
Do the same with Morgan similarity
sorted(d.items(), key=lambda x: -x[1][1])[:5]
from rdkit.Chem.Pharm2D import Gobbi_Pharm2D, Generate
m1 = Chem.MolFromSmiles("NCCCCO")
m2 = Chem.MolFromSmiles("NCCCCCCO")
fp1 = Generate.Gen2DFingerprint(m1, Gobbi_Pharm2D.factory)
fp2 = Generate.Gen2DFingerprint(m2, Gobbi_Pharm2D.factory)
DataStructs.FingerprintSimilarity(fp1, fp2)
Reactions can be constructed based on SMARTS definitions of reactants and products
rxn = AllChem.ReactionFromSmarts('[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')
p = rxn.RunReactants((Chem.MolFromSmiles('CC(=O)O'),Chem.MolFromSmiles('NC')))
len(p)
p[0][0]
Generation of combinatorial library
acids = ['c1ccccc1C(=O)O', 'CC(=O)O', 'OC(=O)CC(=O)O']
bases = ['CCN', 'CCNC', 'CNCCN']
a = [Chem.MolFromSmiles(s) for s in acids]
b = [Chem.MolFromSmiles(s) for s in bases]
from itertools import product
p = []
for s in product(a, b):
p.extend(rxn.RunReactants(s))
for item in p:
print(Chem.MolToSmiles(item[0]))
However, products may contain duplicates, they can be removed based on canonical SMILES of products
set(Chem.MolToSmiles(item[0]) for item in p)
from rdkit.Chem import Draw
Create a molecule and compute coordinates
m = Chem.MolFromSmiles('CC(=O)O')
AllChem.Compute2DCoords(m)
Image can be directly saved to a file or obtaind as PIL image
Draw.MolToFile(m, 'data/mol.png')
Draw.MolToImage(m, (100, 100))
Multiple molecules can be visualized as a grid. Let's visualize unique products from conbinatorial library generation.
pu = list(set(Chem.MolToSmiles(item[0]) for item in p))
pu
Draw.MolsToGridImage([Chem.MolFromSmiles(s) for s in pu], molsPerRow=4, subImgSize=(200,200), legends=pu)
Recommended sources: