Regina: In our conversations with member companies, we have discovered that all of you are interested in a model for property prediction. Our model, based on a graph convolutional network, is available and ready for you to use. However, before transitioning it to your organizations, we wanted to test it on a large and varied collection of benchmark datasets and compare it with other state-of-the-art models. Since so many of you have asked how we fare against Stanford, here is a one-line summary: we do way better :). Another important point is that we didn't tune the model for each benchmark, but used it out of the box. The rationale was to see how it behaves when you apply it to your own datasets as is, without adapting the model architecture to each dataset. Note that this is not the case for the other baselines, which were tuned per dataset. Below, Wengong describes in more detail how the comparison was performed.
We transitioned the model to Amgen last week and are waiting to hear about their experience. We hope that within the next few weeks this transition will happen with all of you. Our ultimate goal is to understand whether this tool is useful in practical settings, and how it compares against the models you are currently using for virtual testing in-house. We hope to get your feedback.
Wengong: We tested our property prediction model on 14 DeepChem benchmark datasets (http://moleculenet.ai/), ranging from physical chemistry to biophysics properties. For a fair comparison, we followed the same setup as DeepChem, splitting each dataset into 80%/10%/10% for training, validation, and testing. For each dataset, we used the same splitting method (random or scaffold splitting) as DeepChem. All results are aggregated over three independent runs, each with a different random seed.
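For concreteness, here is a minimal sketch of this protocol using DeepChem's MoleculeNet loaders, shown for Tox21 with scaffold splitting. Keyword names can differ slightly across DeepChem versions, so treat this as illustrative rather than the exact benchmark script:

```python
# Illustrative evaluation protocol: three runs, 80/10/10 split, mean ± std.
import numpy as np
import deepchem as dc

scores = []
for seed in (0, 1, 2):               # three independent runs
    np.random.seed(seed)             # a different random seed per run
    # MoleculeNet loaders return an 80%/10%/10% train/valid/test split
    tasks, (train, valid, test), transformers = dc.molnet.load_tox21(
        featurizer="GraphConv", splitter="scaffold")
    model = dc.models.GraphConvModel(len(tasks), mode="classification")
    model.fit(train, nb_epoch=50)
    metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
    res = model.evaluate(test, [metric], transformers)
    scores.append(list(res.values())[0])

print(f"ROC-AUC: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```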
We first compared our model against the graph convolution model in DeepChem. For classification tasks, we report ROC-AUC or PRC-AUC for each dataset, using the metric specified in DeepChem.
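For reference, both metrics can be computed per task with scikit-learn; the labels and scores below are made-up values, and average precision is the standard estimator of PRC-AUC:

```python
# A minimal illustration of the two classification metrics for one task.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 1, 1, 0, 1]                 # binary labels for one task
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # predicted probabilities

roc_auc = roc_auc_score(y_true, y_score)
prc_auc = average_precision_score(y_true, y_score)  # estimator of PRC-AUC
print(f"ROC-AUC: {roc_auc:.3f}, PRC-AUC: {prc_auc:.3f}")
```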
Dataset | Metric (the higher the better) | Ours | GraphConv (DeepChem) |
BACE | ROC-AUC | 0.825 ± 0.011 | 0.783 ± 0.014 |
BBBP | ROC-AUC | 0.692 ± 0.015 | 0.690 ± 0.009 |
Tox21 | ROC-AUC | 0.849 ± 0.006 | 0.829 ± 0.006 |
ToxCast | ROC-AUC | 0.726 ± 0.014 | 0.716 ± 0.014 |
SIDER | ROC-AUC | 0.638 ± 0.020 | 0.638 ± 0.012 |
ClinTox | ROC-AUC | 0.919 ± 0.048 | 0.807 ± 0.047 |
MUV | PRC-AUC | 0.067 ± 0.050 | 0.046 ± 0.031 |
HIV | ROC-AUC | 0.763 ± 0.001 | 0.763 ± 0.016 |
PCBA | PRC-AUC | 0.218 ± 0.001 | 0.136 ± 0.003 |
As shown above, our model matches or outperforms DeepChem's implementation on all classification datasets (the two models tie on SIDER and HIV). On regression tasks, we report root mean square error (RMSE) or mean absolute error (MAE) on each dataset. Compared against DeepChem's graph convolutional model and message passing neural network (MPNN), our model performs better on 4 out of 5 tasks.
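For completeness, the two regression metrics on made-up values:

```python
# RMSE and MAE; np.sqrt keeps this compatible with older scikit-learn
# versions that lack the squared=False option.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([1.2, 0.5, -0.3])   # measured property values
y_pred = np.array([1.0, 0.7, -0.1])   # model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```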
Dataset | Metric (the lower the better) | Ours | GraphConv/MPNN (DeepChem) |
Delaney | RMSE | 0.66 ± 0.07 | 0.58 ± 0.03 |
FreeSolv | RMSE | 1.06 ± 0.19 | 1.15 ± 0.12 |
Lipophilicity | RMSE | 0.642 ± 0.065 | 0.655 ± 0.036 |
QM8 | MAE | 0.0116 ± 0.0010 | 0.0143 ± 0.0011 |
QM9 | MAE | 2.6 ± 0.1 | 3.2 ± 1.5 |
We also compared against traditional methods (random forest and kernel SVM), with ECFP (extended connectivity fingerprints) as input features, using the same metrics and splits as above. Training a random forest or a kernel SVM is computationally intensive, so the benchmarks only include results for these models on the smaller datasets.
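For reference, here is a minimal sketch of how such baselines are typically set up, assuming RDKit for the ECFP (Morgan) fingerprints and scikit-learn for the models; the molecules, labels, and hyperparameters below are illustrative, not those used in the benchmark:

```python
# Toy ECFP + random forest / kernel SVM baseline pipeline.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def ecfp(smiles, radius=2, n_bits=2048):
    """ECFP4-style bit vector for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

train_smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN"]  # toy set
train_labels = [0, 1, 1, 0]
X = np.stack([ecfp(s) for s in train_smiles])

rf = RandomForestClassifier(n_estimators=500).fit(X, train_labels)
svm = SVC(kernel="rbf", probability=True).fit(X, train_labels)  # kernel SVM
print(rf.predict_proba(X)[:, 1], svm.predict_proba(X)[:, 1])
```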
Classification Dataset | Ours | Random Forest | Kernel SVM |
BACE | 0.825 ± 0.011 | 0.867 ± 0.008 | 0.862 ± 0.000 |
BBBP | 0.692 ± 0.015 | 0.714 ± 0.000 | 0.729 ± 0.000 |
Tox21 | 0.849 ± 0.006 | 0.769 ± 0.015 | 0.822 ± 0.006 |
ToxCast | 0.726 ± 0.014 | n/a | 0.669 ± 0.014 |
SIDER | 0.638 ± 0.020 | 0.684 ± 0.009 | 0.682 ± 0.013 |
ClinTox | 0.919 ± 0.048 | 0.713 ± 0.056 | 0.669 ± 0.092 |
MUV | 0.067 ± 0.050 | n/a | 0.137 ± 0.033 |
HIV | 0.763 ± 0.001 | n/a | 0.792 ± 0.000 |
PCBA | 0.218 ± 0.001 | n/a | n/a |
Regression Dataset | Ours | Random Forest | Kernel Ridge Reg. |
Delaney | 0.66 ± 0.07 | 1.07 ± 0.19 | 1.53 ± 0.06 |
FreeSolv | 1.06 ± 0.19 | 2.03 ± 0.22 | 2.11 ± 0.07 |
Lipophilicity | 0.642 ± 0.065 | 0.876 ± 0.040 | 0.899 ± 0.043 |
QM8 | 0.0116 ± 0.0010 | n/a | 0.0195 ± 0.0003 |
QM9 | 2.6 ± 0.1 | n/a | n/a |
As the two tables above show, our model outperforms the random forest in 5 out of 8 tasks, and outperforms the kernel baselines (SVM for classification, ridge regression for regression) in 7 out of 12 tasks, counting only tasks where results are available for both models.
To conclude, our model achieves very promising results on the public DeepChem benchmarks when compared against top-performing baselines.