Advanced in silico drug design workshop. Olomouc, 30 January - 1 February, 2017.

QSAR tutorial.

Dr. Pavel Polishchuk

Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hněvotínská 1333/5, 779 00 Olomouc, Czech Republic
http://imtm.cz
http://qsar4u.com
pavel_polishchuk@ukr.net
pavlo.polishchuk@upol.cz

In [1]:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors
In [2]:
# from rdkit.Chem.Draw import IPythonConsole
# from rdkit.Chem import Draw
# IPythonConsole.ipython_useSVG=True
In [3]:
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef
from sklearn.externals import joblib
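# note: in newer scikit-learn versions sklearn.externals.joblib has been removed; use "import joblib" instead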

Reading molecules and activity from SDF

In [4]:
fname = "data/logBB.sdf"

mols = []
y = []
for mol in Chem.SDMolSupplier(fname):
    if mol is not None:
        mols.append(mol)
        y.append(mol.GetIntProp("logBB_class"))

Calculate descriptors (fingerprints) and convert them into a numpy array

In [5]:
# generate binary Morgan fingerprint with radius 2
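# (default fingerprint length is 2048 bits; pass nBits=... to change it)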
fp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
In [6]:
def rdkit_numpy_convert(fp):
    output = []
    for f in fp:
        arr = np.zeros((1,))
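        # ConvertToNumpyArray resizes arr to the fingerprint length and fills it with the bits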
        DataStructs.ConvertToNumpyArray(f, arr)
        output.append(arr)
    return np.asarray(output)
In [7]:
x = rdkit_numpy_convert(fp)
In [8]:
x.shape
Out[8]:
(321, 2048)
In [9]:
# check whether the data set is balanced (fraction of class 1 compounds)
sum(y) / len(y)
Out[9]:
0.5545171339563862

Set random seed to make all further calculations reproducible

In [10]:
seed = 42

Split the whole set into training and test sets

In [11]:
# randomly select 20% of compounds as test set
x_tr, x_ts, y_tr, y_ts = train_test_split(x, y, test_size=0.20, random_state=seed)

Create folds for cross-validation

In [12]:
cv = StratifiedKFold(n_splits=5, random_state=seed)
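# note: random_state only has an effect when shuffle=True; recent scikit-learn versions
# raise an error if random_state is set while shuffle is False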
In [13]:
# print out the train/test indices of each fold
for i, (train_index, test_index) in enumerate(cv.split(x_tr, y_tr)):
    print("\nFold_" + str(i+1))
    print("TRAIN:", train_index)
    print("TEST:", test_index)
Fold_1
TRAIN: [ 44  48  49  51  52  57  58  59  60  61  62  63  64  65  66  67  68  69
  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87
  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105
 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123
 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141
 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177
 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195
 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213
 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231
 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249
 250 251 252 253 254 255]
TEST: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 50 53 54
 55 56]

Fold_2
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  45  46  47  50  53  54  55  56 100 101
 103 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158
 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194
 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212
 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248
 249 250 251 252 253 254 255]
TEST: [ 44  48  49  51  52  57  58  59  60  61  62  63  64  65  66  67  68  69
  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87
  88  89  90  91  92  93  94  95  96  97  98  99 102 104 105]

Fold_3
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 102 104 105 154 155 156 157 158
 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194
 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212
 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248
 249 250 251 252 253 254 255]
TEST: [100 101 103 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138
 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153]

Fold_4
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 198 202 207 208 209 210 211 212
 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248
 249 250 251 252 253 254 255]
TEST: [154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189
 190 191 192 193 194 195 196 197 199 200 201 203 204 205 206]

Fold_5
TRAIN: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161
 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179
 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197
 199 200 201 203 204 205 206]
TEST: [198 202 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222
 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255]

Scale X

This step may be crucial for certain modeling approaches like SVM. In the case of binary fingerprints it is less critical, since all bits are already on the same 0/1 scale.

In [14]:
# fit a scaler on the training set; the same object can later be applied to scale any other data consistently with it
scale = StandardScaler().fit(x_tr)
x_tr = scale.transform(x_tr)
In [15]:
# it is a good idea to save it for future use
joblib.dump(scale, "data/logBB_scale.pkl", compress=3)
Out[15]:
['data/logBB_scale.pkl']

Search for optimal tuning parameters and build the model

In [16]:
# create grid search dictionary
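# max_features is varied over roughly 1/10, 1/7, 1/5 and 1/3 of the total number of features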
param_grid = {"max_features": [x_tr.shape[1] // 10, x_tr.shape[1] // 7, x_tr.shape[1] // 5, x_tr.shape[1] // 3], 
              "n_estimators": [100, 250, 500]}
In [17]:
# setup model building
m = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=2, cv=cv, verbose=1)
In [18]:
# run model building
m.fit(x_tr, y_tr)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   23.6s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   34.2s finished
Out[18]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=False),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=2,
       param_grid={'max_features': [204, 292, 409, 682], 'n_estimators': [100, 250, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [19]:
m.best_params_
Out[19]:
{'max_features': 409, 'n_estimators': 500}
In [20]:
m.best_score_
Out[20]:
0.76171875
In [21]:
m.cv_results_
Out[21]:
{'mean_fit_time': array([ 0.30032048,  0.70333066,  1.54845624,  0.38332524,  0.93453407,
         1.63609896,  0.39608006,  0.91207976,  1.87488923,  0.43854413,
         1.13343296,  2.34495249]),
 'mean_score_time': array([ 0.00976992,  0.02427993,  0.04539742,  0.01114545,  0.02395878,
         0.0441246 ,  0.00987105,  0.02152734,  0.04689326,  0.00867548,
         0.02405515,  0.04517374]),
 'mean_test_score': array([ 0.74609375,  0.74609375,  0.75      ,  0.75390625,  0.7421875 ,
         0.74609375,  0.734375  ,  0.74609375,  0.76171875,  0.73828125,
         0.74609375,  0.75      ]),
 'mean_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'param_max_features': masked_array(data = [204 204 204 292 292 292 409 409 409 682 682 682],
              mask = [False False False False False False False False False False False False],
        fill_value = ?),
 'param_n_estimators': masked_array(data = [100 250 500 100 250 500 100 250 500 100 250 500],
              mask = [False False False False False False False False False False False False],
        fill_value = ?),
 'params': ({'max_features': 204, 'n_estimators': 100},
  {'max_features': 204, 'n_estimators': 250},
  {'max_features': 204, 'n_estimators': 500},
  {'max_features': 292, 'n_estimators': 100},
  {'max_features': 292, 'n_estimators': 250},
  {'max_features': 292, 'n_estimators': 500},
  {'max_features': 409, 'n_estimators': 100},
  {'max_features': 409, 'n_estimators': 250},
  {'max_features': 409, 'n_estimators': 500},
  {'max_features': 682, 'n_estimators': 100},
  {'max_features': 682, 'n_estimators': 250},
  {'max_features': 682, 'n_estimators': 500}),
 'rank_test_score': array([ 5,  5,  3,  2, 10,  5, 12,  5,  1, 11,  5,  3], dtype=int32),
 'split0_test_score': array([ 0.75      ,  0.78846154,  0.78846154,  0.76923077,  0.76923077,
         0.76923077,  0.76923077,  0.76923077,  0.78846154,  0.78846154,
         0.78846154,  0.76923077]),
 'split0_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'split1_test_score': array([ 0.64705882,  0.62745098,  0.62745098,  0.64705882,  0.60784314,
         0.60784314,  0.60784314,  0.60784314,  0.60784314,  0.60784314,
         0.62745098,  0.60784314]),
 'split1_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'split2_test_score': array([ 0.7254902 ,  0.7254902 ,  0.74509804,  0.74509804,  0.7254902 ,
         0.74509804,  0.7254902 ,  0.7254902 ,  0.78431373,  0.70588235,
         0.7254902 ,  0.74509804]),
 'split2_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'split3_test_score': array([ 0.74509804,  0.74509804,  0.7254902 ,  0.74509804,  0.74509804,
         0.74509804,  0.7254902 ,  0.74509804,  0.7254902 ,  0.7254902 ,
         0.7254902 ,  0.7254902 ]),
 'split3_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'split4_test_score': array([ 0.8627451 ,  0.84313725,  0.8627451 ,  0.8627451 ,  0.8627451 ,
         0.8627451 ,  0.84313725,  0.88235294,  0.90196078,  0.8627451 ,
         0.8627451 ,  0.90196078]),
 'split4_train_score': array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 'std_fit_time': array([ 0.0424009 ,  0.10868227,  0.20148138,  0.0357053 ,  0.06916632,
         0.16002563,  0.00846596,  0.0547276 ,  0.14710231,  0.0264884 ,
         0.06880288,  0.04048173]),
 'std_score_time': array([ 0.00143728,  0.00338692,  0.00511344,  0.00158214,  0.00351071,
         0.00377128,  0.00010672,  0.00169637,  0.00537366,  0.00075465,
         0.00136962,  0.00371764]),
 'std_test_score': array([ 0.06893957,  0.07162219,  0.07713122,  0.06865052,  0.0818342 ,
         0.08141083,  0.07631771,  0.0877498 ,  0.09562268,  0.08502275,
         0.07777439,  0.09394619]),
 'std_train_score': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])}
In [22]:
m.cv_results_['mean_test_score']
Out[22]:
array([ 0.74609375,  0.74609375,  0.75      ,  0.75390625,  0.7421875 ,
        0.74609375,  0.734375  ,  0.74609375,  0.76171875,  0.73828125,
        0.74609375,  0.75      ])
In [23]:
m.cv_results_['params']
Out[23]:
({'max_features': 204, 'n_estimators': 100},
 {'max_features': 204, 'n_estimators': 250},
 {'max_features': 204, 'n_estimators': 500},
 {'max_features': 292, 'n_estimators': 100},
 {'max_features': 292, 'n_estimators': 250},
 {'max_features': 292, 'n_estimators': 500},
 {'max_features': 409, 'n_estimators': 100},
 {'max_features': 409, 'n_estimators': 250},
 {'max_features': 409, 'n_estimators': 500},
 {'max_features': 682, 'n_estimators': 100},
 {'max_features': 682, 'n_estimators': 250},
 {'max_features': 682, 'n_estimators': 500})

Save model

In [24]:
joblib.dump(m, "data/logBB_rf_morgan.pkl", compress=3)
Out[24]:
['data/logBB_rf_morgan.pkl']
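
As a quick check that the saved artefacts work together, the scaler and the model can be reloaded and applied to a brand new structure. Below is a minimal sketch; the SMILES is an arbitrary example and not part of the data set.

# reload the saved scaler and model and predict the logBB class of a new compound
query = Chem.MolFromSmiles("COc1ccccc1")           # arbitrary example structure
query_fp = AllChem.GetMorganFingerprintAsBitVect(query, 2)
query_x = rdkit_numpy_convert([query_fp])          # 1 x 2048 array
query_x = joblib.load("data/logBB_scale.pkl").transform(query_x)
model = joblib.load("data/logBB_rf_morgan.pkl")
print(model.predict(query_x))                      # predicted class
print(model.predict_proba(query_x))                # class probabilities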

Predict test set compounds

In [25]:
# load scale if necessary
scale = joblib.load("data/logBB_scale.pkl")
In [26]:
# scale descriptors of the test set compounds
x_ts = scale.transform(x_ts)
In [27]:
# predict logBB class
pred_rf = m.predict(x_ts)
In [28]:
pred_rf
Out[28]:
array([0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1])

Calculate statistics for test set predictions

In [29]:
accuracy_score(y_ts, pred_rf)
Out[29]:
0.75384615384615383
In [30]:
matthews_corrcoef(y_ts, pred_rf)
Out[30]:
0.49953579461097208
In [31]:
cohen_kappa_score(y_ts, pred_rf)
Out[31]:
0.49855351976856321

Applicability domain estimates

In [32]:
# if the model consists of several individual models, like RF or consensus models (or is a probabilistic model),
# we can use the consistency of predictions among those models to estimate the applicability domain
pred_prob = m.predict_proba(x_ts)
In [33]:
# predicted class probabilities (columns correspond to class 0 and class 1)
pred_prob
Out[33]:
array([[ 0.898,  0.102],
       [ 0.212,  0.788],
       [ 0.224,  0.776],
       [ 0.23 ,  0.77 ],
       [ 0.392,  0.608],
       [ 0.212,  0.788],
       [ 0.008,  0.992],
       [ 0.028,  0.972],
       [ 0.85 ,  0.15 ],
       [ 0.926,  0.074],
       [ 0.136,  0.864],
       [ 0.794,  0.206],
       [ 0.586,  0.414],
       [ 0.912,  0.088],
       [ 0.396,  0.604],
       [ 0.57 ,  0.43 ],
       [ 0.1  ,  0.9  ],
       [ 0.018,  0.982],
       [ 0.002,  0.998],
       [ 0.088,  0.912],
       [ 0.588,  0.412],
       [ 0.002,  0.998],
       [ 0.232,  0.768],
       [ 0.508,  0.492],
       [ 0.266,  0.734],
       [ 0.378,  0.622],
       [ 0.61 ,  0.39 ],
       [ 0.268,  0.732],
       [ 0.806,  0.194],
       [ 0.63 ,  0.37 ],
       [ 0.016,  0.984],
       [ 0.986,  0.014],
       [ 0.112,  0.888],
       [ 0.812,  0.188],
       [ 0.322,  0.678],
       [ 0.542,  0.458],
       [ 0.92 ,  0.08 ],
       [ 0.808,  0.192],
       [ 0.624,  0.376],
       [ 0.   ,  1.   ],
       [ 0.908,  0.092],
       [ 0.   ,  1.   ],
       [ 0.288,  0.712],
       [ 0.536,  0.464],
       [ 0.244,  0.756],
       [ 0.778,  0.222],
       [ 0.158,  0.842],
       [ 0.722,  0.278],
       [ 0.008,  0.992],
       [ 0.23 ,  0.77 ],
       [ 0.68 ,  0.32 ],
       [ 0.152,  0.848],
       [ 0.404,  0.596],
       [ 0.614,  0.386],
       [ 0.788,  0.212],
       [ 0.83 ,  0.17 ],
       [ 0.668,  0.332],
       [ 0.156,  0.844],
       [ 0.942,  0.058],
       [ 0.012,  0.988],
       [ 0.014,  0.986],
       [ 0.194,  0.806],
       [ 0.856,  0.144],
       [ 0.322,  0.678],
       [ 0.012,  0.988]])
In [34]:
# set up a probability threshold for the applicability domain
threshold = 0.8
In [35]:
# calc maximum predicted probability for each row (compound) and compare to the threshold
da = np.amax(pred_prob, axis=1) > threshold
In [36]:
da
Out[36]:
array([ True, False, False, False, False, False,  True,  True,  True,
        True,  True, False, False,  True, False, False,  True,  True,
        True,  True, False,  True, False, False, False, False, False,
       False,  True, False,  True,  True,  True,  True, False, False,
        True,  True, False,  True,  True,  True, False, False, False,
       False,  True, False,  True, False, False,  True, False, False,
       False,  True, False,  True,  True,  True,  True,  True,  True,
       False,  True], dtype=bool)
In [37]:
# calc statistics
accuracy_score(np.asarray(y_ts)[da], pred_rf[da])
Out[37]:
0.87878787878787878
In [38]:
matthews_corrcoef(np.asarray(y_ts)[da], pred_rf[da])
Out[38]:
0.75862561287687913
In [39]:
cohen_kappa_score(np.asarray(y_ts)[da], pred_rf[da])
Out[39]:
0.75280898876404501
In [40]:
# calc coverage (the fraction of test set compounds inside the applicability domain)
sum(da) / len(da)
Out[40]:
0.50769230769230766

Build SVM model

In [41]:
# create grid search dictionary
param_grid = {"C": [10 ** i for i in range(0, 5)],
              "gamma": [10 ** i for i in range(-6, 0)]}
In [42]:
# setup model building
svm = GridSearchCV(SVC(kernel='rbf', probability=True), param_grid, n_jobs=2, cv=cv, verbose=1)
In [43]:
# run model building
svm.fit(x_tr, y_tr)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   17.1s
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed:   56.0s finished
Out[43]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=False),
       error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=2,
       param_grid={'gamma': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1], 'C': [1, 10, 100, 1000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [44]:
svm.best_score_
Out[44]:
0.75390625
In [45]:
svm.best_params_
Out[45]:
{'C': 10, 'gamma': 0.0001}
In [46]:
# save model
joblib.dump(svm, "data/logBB_svm_morgan.pkl", compress=3)
Out[46]:
['data/logBB_svm_morgan.pkl']
In [47]:
# predict logBB for the test set compounds
pred_svm = svm.predict(x_ts)
In [48]:
pred_svm
Out[48]:
array([0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
In [49]:
# calc statistics
print("Accuracy = ", accuracy_score(y_ts, pred_svm))
print("MCC = ", matthews_corrcoef(y_ts, pred_svm))
print("Kappa = ", cohen_kappa_score(y_ts, pred_svm))
Accuracy =  0.769230769231
MCC =  0.520325450879
Kappa =  0.506329113924
In [50]:
# estimate applicability domain and calc stat
pred_prob = svm.predict_proba(x_ts)
In [51]:
da = np.amax(pred_prob, axis=1) > threshold
In [52]:
print("Accuracy = ", accuracy_score(np.asarray(y_ts)[da], pred_svm[da]))
print("MCC = ", matthews_corrcoef(np.asarray(y_ts)[da], pred_svm[da]))
print("Kappa = ", cohen_kappa_score(np.asarray(y_ts)[da], pred_svm[da]))
print("Coverage = ", sum(da) / len(da))
Accuracy =  0.823529411765
MCC =  0.632626627812
Kappa =  0.627737226277
Coverage =  0.523076923077

Build a third model (GBM) and compute consensus predictions from the RF, SVM and GBM models

In [53]:
# setup model building
param_grid = {"n_estimators": [100, 200, 300, 400, 500]}
gbm = GridSearchCV(GradientBoostingClassifier(subsample=0.5, max_features=0.5), 
                   param_grid, n_jobs=2, cv=cv, verbose=1)
In [54]:
# run model building
gbm.fit(x_tr, y_tr)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:   14.5s finished
Out[54]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=False),
       error_score='raise',
       estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=0.5, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=0.5, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=2,
       param_grid={'n_estimators': [100, 200, 300, 400, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [55]:
gbm.best_score_
Out[55]:
0.76171875
In [56]:
gbm.best_params_
Out[56]:
{'n_estimators': 100}
In [57]:
pred_gbm = gbm.predict(x_ts)
In [58]:
# calc statistics
print("Accuracy = ", accuracy_score(y_ts, pred_gbm))
print("MCC = ", matthews_corrcoef(y_ts, pred_gbm))
print("Kappa = ", cohen_kappa_score(y_ts, pred_gbm))
Accuracy =  0.753846153846
MCC =  0.488687824151
Kappa =  0.487684729064

Consensus model

In [59]:
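# majority vote: class 1 is assigned if at least two of the three models predict class 1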
pred_c = 1 * (((pred_rf + pred_svm + pred_gbm) / 3) >= 0.5)
In [60]:
pred_c
Out[60]:
array([0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
In [61]:
# calc statistics
print("Accuracy = ", accuracy_score(y_ts, pred_c))
print("MCC = ", matthews_corrcoef(y_ts, pred_c))
print("Kappa = ", cohen_kappa_score(y_ts, pred_c))
Accuracy =  0.784615384615
MCC =  0.552858952575
Kappa =  0.551724137931

Add some other descriptors to the Morgan fingerprints and look at the model performance

In [62]:
# calc some descriptors
descr = []
for mol in mols:   # use a new variable name to avoid overwriting the fitted model m
    descr.append([Descriptors.MolLogP(mol),
                  Descriptors.TPSA(mol),
                  Descriptors.NHOHCount(mol),
                  Descriptors.NOCount(mol),
                  Descriptors.NumHAcceptors(mol),
                  Descriptors.NumHDonors(mol),
                  Descriptors.NumRotatableBonds(mol),
                  Descriptors.NumHeteroatoms(mol),
                  Descriptors.FractionCSP3(mol)])
descr = np.asarray(descr)
In [63]:
descr.shape
Out[63]:
(321, 9)
In [64]:
# add them to morgan fingerprints
x = np.concatenate((x, descr), axis=1)
In [65]:
x.shape
Out[65]:
(321, 2057)
In [66]:
# randomly select 20% of compounds as test set
x_tr, x_ts, y_tr, y_ts = train_test_split(x, y, test_size=0.20, random_state=seed)
In [67]:
scale = StandardScaler().fit(x_tr)
x_tr = scale.transform(x_tr)
In [68]:
# create grid search dictionary
param_grid = {"max_features": [x_tr.shape[1] // 10, x_tr.shape[1] // 7, x_tr.shape[1] // 5, x_tr.shape[1] // 3], 
              "n_estimators": [100, 250, 500]}
In [69]:
# setup model building
m = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=2, cv=cv, verbose=1)
In [70]:
# run model building
m.fit(x_tr, y_tr)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:   22.0s
[Parallel(n_jobs=2)]: Done  60 out of  60 | elapsed:   32.7s finished
Out[70]:
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=False),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=2,
       param_grid={'max_features': [205, 293, 411, 685], 'n_estimators': [100, 250, 500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)
In [71]:
m.best_score_
Out[71]:
0.78125
In [72]:
# scale descriptors of the test set compounds
x_ts = scale.transform(x_ts)
In [73]:
# predict logBB for the test set compounds
pred = m.predict(x_ts)
In [74]:
pred
Out[74]:
array([0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1])
In [75]:
# calc statistics
print("Accuracy = ", accuracy_score(y_ts, pred))
print("MCC = ", matthews_corrcoef(y_ts, pred))
print("Kappa = ", cohen_kappa_score(y_ts, pred))
Accuracy =  0.861538461538
MCC =  0.713588637384
Kappa =  0.710252600297
In [76]:
# estimate applicability domain and calc stat
pred_prob = m.predict_proba(x_ts)
In [77]:
da = np.amax(pred_prob, axis=1) > threshold
In [78]:
print("Accuracy = ", accuracy_score(np.asarray(y_ts)[da], pred[da]))
print("MCC = ", matthews_corrcoef(np.asarray(y_ts)[da], pred[da]))
print("Kappa = ", cohen_kappa_score(np.asarray(y_ts)[da], pred[da]))
print("Coverage = ", sum(da) / len(da))
Accuracy =  0.944444444444
MCC =  0.882304674396
Kappa =  0.875432525952
Coverage =  0.553846153846

The model has better accuracy: the added descriptors improved its predictivity.

Let's try to analyse which variables are the most important in the model

In [79]:
# rebuild RF model manually using best parameters to be able to extract additional information from the model
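# (with refit=True, m.best_estimator_ already holds a model refitted with these parameters)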
rf = RandomForestClassifier(n_estimators=m.best_params_["n_estimators"], 
                           max_features=m.best_params_["max_features"],
                           random_state=seed)
rf.fit(x_tr, y_tr)
Out[79]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=205, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
In [80]:
imp = rf.feature_importances_
In [81]:
imp
Out[81]:
array([ 0.        ,  0.00258024,  0.0008886 , ...,  0.00765072,
        0.03531639,  0.02393124])
In [82]:
indices = np.argsort(imp)[::-1]

print("Feature ranking:")

# print top 10 features
for i in range(10):
    print("%d. feature %d (%f)" % (i + 1, indices[i], imp[indices[i]]))
Feature ranking:
1. feature 2049 (0.099145)
2. feature 2051 (0.053106)
3. feature 2048 (0.046293)
4. feature 2055 (0.035316)
5. feature 2052 (0.033604)
6. feature 650 (0.029188)
7. feature 2056 (0.023931)
8. feature 2053 (0.018663)
9. feature 2050 (0.017778)
10. feature 807 (0.013761)

The feature indices printed above are 0-based, so indices 0-2047 correspond to the Morgan fingerprint bits and the appended descriptors are:

2048 - MolLogP
2049 - TPSA
2050 - NHOHCount
2051 - NOCount
2052 - NumHAcceptors
2053 - NumHDonors
2054 - NumRotatableBonds
2055 - NumHeteroatoms
2056 - FractionCSP3

Thus the top-ranked features are TPSA, NOCount and MolLogP, followed by several other physicochemical descriptors and a few individual Morgan bits.
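
To avoid this manual bookkeeping, a short sketch (reusing imp and indices from the cells above and assuming the same descriptor order as in the cell that built descr) prints the same ranking with readable names:

descr_names = ["MolLogP", "TPSA", "NHOHCount", "NOCount", "NumHAcceptors",
               "NumHDonors", "NumRotatableBonds", "NumHeteroatoms", "FractionCSP3"]
# the first 2048 features are Morgan bits, the last 9 are the descriptors above
feature_names = ["MorganBit_%i" % i for i in range(2048)] + descr_names
for rank, idx in enumerate(indices[:10], 1):
    print("%d. %s (%f)" % (rank, feature_names[idx], imp[idx]))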
