Large-Scale Testing of MIT Property Prediction Model

Regina: In our conversations with member companies, we have discovered that all of you are interested in a model for property prediction. Our model, based on a graph convolutional network, is available and ready for you to use. However, before transitioning it to your organizations, we wanted to test it on a large and varied collection of benchmark datasets and compare it with other state-of-the-art models. Since so many of you have asked how we fare against Stanford, here is a one-line summary: we do way better :). Another important point is that we did not tune the model for each benchmark, but used it out of the box. The rationale was to see how the model behaves when you apply it to your own datasets as is, without adapting the architecture to each dataset. Note that this is not the case for the other baselines, which were tuned per dataset. Below, Wengong describes in more detail how the comparison was performed.

We already transitioned the model to Amgen last week and are waiting to hear about their experience. We hope that within the next few weeks this transition will happen with all of you. Our ultimate goal is to understand whether this tool is useful in practical settings, and how it compares against the models you currently use in-house for virtual testing. We hope to get your feedback.

Wengong: We tested our property prediction model on 14 DeepChem benchmark datasets (http://moleculenet.ai/), ranging from physical chemistry to biophysics. For a fair comparison, we followed the same setup as DeepChem, splitting each dataset 80%/10%/10% into training, validation, and test sets. For each dataset, we used the same splitting method (random splitting or scaffold splitting) as DeepChem. All results are aggregated over three independent runs, each with a different random seed.
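For concreteness, here is a minimal sketch of that protocol using DeepChem's MoleculeNet loaders. It is illustrative only: Tox21 stands in for any of the 14 datasets, and the loader's keyword names and return shapes vary somewhat across DeepChem releases.

```python
import deepchem as dc

# Load one MoleculeNet benchmark (Tox21 as an example); in recent
# DeepChem releases, splitter=None returns the unsplit dataset.
tasks, (dataset,), transformers = dc.molnet.load_tox21(
    featurizer='GraphConv', splitter=None)

# Scaffold splitting groups molecules by their Bemis-Murcko scaffold, so
# test molecules are structurally distinct from training molecules; a
# RandomSplitter is used instead where DeepChem prescribes random splits.
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)
```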

We first compared our model against the graph convolutional model in DeepChem. For classification tasks, we report performance on each dataset measured by ROC-AUC or PRC-AUC, using whichever metric DeepChem specifies for that dataset.
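Both metrics are standard, and a small sketch with scikit-learn shows how they are computed (the toy arrays below are purely illustrative). PRC-AUC is the informative choice on heavily imbalanced datasets such as MUV and PCBA, where active compounds are rare and ROC-AUC can look deceptively high.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

def prc_auc(y_true, y_score):
    # Area under the precision-recall curve.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

y_true = np.array([0, 0, 1, 1, 0, 1])                # toy labels, one task
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted scores
print(roc_auc_score(y_true, y_score), prc_auc(y_true, y_score))
```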

Dataset | Metric (higher is better) | Ours | GraphConv (DeepChem)
BACE | ROC-AUC | 0.825 ± 0.011 | 0.783 ± 0.014
BBBP | ROC-AUC | 0.692 ± 0.015 | 0.690 ± 0.009
Tox21 | ROC-AUC | 0.849 ± 0.006 | 0.829 ± 0.006
ToxCast | ROC-AUC | 0.726 ± 0.014 | 0.716 ± 0.014
SIDER | ROC-AUC | 0.638 ± 0.020 | 0.638 ± 0.012
ClinTox | ROC-AUC | 0.919 ± 0.048 | 0.807 ± 0.047
MUV | PRC-AUC | 0.067 ± 0.050 | 0.046 ± 0.031
HIV | ROC-AUC | 0.763 ± 0.001 | 0.763 ± 0.016
PCBA | PRC-AUC | 0.218 ± 0.001 | 0.136 ± 0.003

As shown above, our model matches or outperforms DeepChem's implementation on all classification datasets (with ties on SIDER and HIV). For regression tasks, we report root mean square error (RMSE) or mean absolute error (MAE) on each dataset. Compared against DeepChem's graph convolutional model and message passing neural network (MPNN), our model performs better on 4 out of 5 tasks.
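The regression metrics, and the mean ± standard deviation aggregation used throughout these tables, can be sketched as follows (toy numbers again):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy predictions from three runs with different seeds; each table entry
# is reported as mean ± standard deviation over such runs.
y_true = np.array([1.2, 0.4, -0.3])
preds = [np.array([1.0, 0.6, -0.1]),
         np.array([1.3, 0.2, -0.4]),
         np.array([1.1, 0.5, -0.2])]
for metric in (rmse, mean_absolute_error):
    scores = [metric(y_true, p) for p in preds]
    print(f"{metric.__name__}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```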

Dataset | Metric (lower is better) | Ours | GraphConv/MPNN (DeepChem)
Delaney | RMSE | 0.66 ± 0.07 | 0.58 ± 0.03
FreeSolv | RMSE | 1.06 ± 0.19 | 1.15 ± 0.12
Lipo | RMSE | 0.642 ± 0.065 | 0.655 ± 0.036
QM8 | MAE | 0.0116 ± 0.0010 | 0.0143 ± 0.0011
QM9 | MAE | 2.6 ± 0.1 | 3.2 ± 1.5

We also compared against traditional methods (random forest and kernel SVM), using ECFP (extended-connectivity fingerprints) as input features. Training a random forest or a kernel SVM is computationally intensive, so the benchmarks include results for those models only on the smaller datasets.
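As a rough sketch of how such a baseline is assembled, using RDKit for the fingerprints and scikit-learn for the models (the SMILES strings, labels, and hyperparameters below are placeholders, not the benchmark settings):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def ecfp4(smiles, n_bits=2048):
    # ECFP4 corresponds to a Morgan fingerprint of radius 2 in RDKit.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros(n_bits)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Placeholder molecules and labels; the benchmarks draw these from DeepChem.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
labels = [0, 1, 0, 1]
X = np.stack([ecfp4(s) for s in smiles])

rf = RandomForestClassifier(n_estimators=500).fit(X, labels)
svm = SVC(kernel="rbf", probability=True).fit(X, labels)
```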

Classification Dataset | Ours | Random Forest | Kernel SVM
BACE | 0.825 ± 0.011 | 0.867 ± 0.008 | 0.862 ± 0.000
BBBP | 0.692 ± 0.015 | 0.714 ± 0.000 | 0.729 ± 0.000
Tox21 | 0.849 ± 0.006 | 0.769 ± 0.015 | 0.822 ± 0.006
ToxCast | 0.726 ± 0.014 | n/a | 0.669 ± 0.014
SIDER | 0.638 ± 0.020 | 0.684 ± 0.009 | 0.682 ± 0.013
ClinTox | 0.919 ± 0.048 | 0.713 ± 0.056 | 0.669 ± 0.092
MUV | 0.067 ± 0.050 | n/a | 0.137 ± 0.033
HIV | 0.763 ± 0.001 | n/a | 0.792 ± 0.000
PCBA | 0.218 ± 0.001 | n/a | n/a

Regression Dataset | Ours | Random Forest | Kernel Ridge Reg.
Delaney | 0.66 ± 0.07 | 1.07 ± 0.19 | 1.53 ± 0.06
FreeSolv | 1.06 ± 0.19 | 2.03 ± 0.22 | 2.11 ± 0.07
Lipo | 0.642 ± 0.065 | 0.876 ± 0.040 | 0.899 ± 0.043
QM8 | 0.0116 ± 0.0010 | n/a | 0.0195 ± 0.0003
QM9 | 2.6 ± 0.1 | n/a | n/a

As the two tables above show, our model outperforms the random forest on 5 out of 8 tasks and the kernel SVM (kernel ridge regression on the regression tasks) on 7 out of 12 tasks, counting only tasks where results are available for both models.

To conclude, our model achieves very promising results on the public DeepChem benchmarks compared against the top-performing baselines.