Large-Scale Testing of MIT Property Prediction Model

Regina: In our conversations with member companies, we have discovered that all of you are interested in a model for property prediction. Our model, based on a graph convolutional network, is available and ready for you to use. However, before transitioning it to your organizations, we wanted to test it on a large and varied collection of benchmark datasets and compare it with other state-of-the-art models. Since so many of you have asked how we fare against Stanford, here is a one-line summary: we do way better :). Another important point is that we did not tune the model for each benchmark, but used it out of the box. The rationale was to see how the model behaves when you apply it to your own datasets as is, without adapting the architecture to each dataset. Note that this is not the case for the other baselines, which were tuned per dataset. Below, Wengong describes in more detail how the comparison was performed.

We already transitioned the model to Amgen last week and are waiting to hear about their experience. We hope that within the next few weeks this transition will happen with all of you. Our ultimate goal is to understand whether this tool is useful in practical settings, and how it compares against the models you currently use in-house for virtual testing. We hope to get your feedback.

Wengong: We tested our property prediction model on 14 DeepChem benchmark datasets (http://moleculenet.ai/), ranging from physical chemistry to biophysics. For a fair comparison, we followed the same setup as DeepChem, splitting each dataset 80%/10%/10% into training, validation, and test sets. For each dataset, we used the same splitting method (random splitting or scaffold splitting) as DeepChem. All results are aggregated over three independent runs, each with a different random seed.
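For concreteness, here is a minimal sketch of that protocol using DeepChem's MoleculeNet loaders. It is illustrative only: Tox21 stands in for any of the 14 datasets, and the loader's keyword names and return shapes vary somewhat across DeepChem releases.

```python
import deepchem as dc

# Load one MoleculeNet benchmark (Tox21 as an example); in recent
# DeepChem releases, splitter=None returns the unsplit dataset.
tasks, (dataset,), transformers = dc.molnet.load_tox21(
    featurizer='GraphConv', splitter=None)

# Scaffold splitting groups molecules by their Bemis-Murcko scaffold, so
# test molecules are structurally distinct from training molecules; a
# RandomSplitter is used instead where DeepChem prescribes random splits.
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)
```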

We first compared our model against the graph convolutional model in DeepChem. For classification tasks, we report performance on each dataset measured by ROC-AUC or PRC-AUC, using whichever metric DeepChem specifies for that dataset.
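Both metrics are standard, and a small sketch with scikit-learn shows how they are computed (the toy arrays below are purely illustrative). PRC-AUC is the informative choice on heavily imbalanced datasets such as MUV and PCBA, where active compounds are rare and ROC-AUC can look deceptively high.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

def prc_auc(y_true, y_score):
    # Area under the precision-recall curve.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

y_true = np.array([0, 0, 1, 1, 0, 1])                # toy labels, one task
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted scores
print(roc_auc_score(y_true, y_score), prc_auc(y_true, y_score))
```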

Dataset | Metric (higher is better) | Ours | GraphConv (DeepChem)
BACE | ROC-AUC | 0.825 ± 0.011 | 0.783 ± 0.014
BBBP | ROC-AUC | 0.692 ± 0.015 | 0.690 ± 0.009
Tox21 | ROC-AUC | 0.849 ± 0.006 | 0.829 ± 0.006
ToxCast | ROC-AUC | 0.726 ± 0.014 | 0.716 ± 0.014
SIDER | ROC-AUC | 0.638 ± 0.020 | 0.638 ± 0.012
ClinTox | ROC-AUC | 0.919 ± 0.048 | 0.807 ± 0.047
MUV | PRC-AUC | 0.067 ± 0.050 | 0.046 ± 0.031
HIV | ROC-AUC | 0.763 ± 0.001 | 0.763 ± 0.016
PCBA | PRC-AUC | 0.218 ± 0.001 | 0.136 ± 0.003

As shown above, our model matches or outperforms DeepChem's implementation on all classification datasets (with ties on SIDER and HIV). For regression tasks, we report root mean square error (RMSE) or mean absolute error (MAE) on each dataset. Compared against DeepChem's graph convolutional model and message passing neural network (MPNN), our model performs better on 4 out of 5 tasks.
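The regression metrics, and the mean ± standard deviation aggregation used throughout these tables, can be sketched as follows (toy numbers again):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy predictions from three runs with different seeds; each table entry
# is reported as mean ± standard deviation over such runs.
y_true = np.array([1.2, 0.4, -0.3])
preds = [np.array([1.0, 0.6, -0.1]),
         np.array([1.3, 0.2, -0.4]),
         np.array([1.1, 0.5, -0.2])]
for metric in (rmse, mean_absolute_error):
    scores = [metric(y_true, p) for p in preds]
    print(f"{metric.__name__}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```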

Dataset | Metric (lower is better) | Ours | GraphConv/MPNN (DeepChem)
Delaney | RMSE | 0.66 ± 0.07 | 0.58 ± 0.03
FreeSolv | RMSE | 1.06 ± 0.19 | 1.15 ± 0.12
Lipo | RMSE | 0.642 ± 0.065 | 0.655 ± 0.036
QM8 | MAE | 0.0116 ± 0.0010 | 0.0143 ± 0.0011
QM9 | MAE | 2.6 ± 0.1 | 3.2 ± 1.5

We also compared against traditional methods (random forest and kernel SVM), using ECFP (extended-connectivity fingerprints) as input features. Training a random forest or a kernel SVM is computationally intensive, so the benchmarks include results for those models only on the smaller datasets.
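As a rough sketch of how such a baseline is assembled, using RDKit for the fingerprints and scikit-learn for the models (the SMILES strings, labels, and hyperparameters below are placeholders, not the benchmark settings):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def ecfp4(smiles, n_bits=2048):
    # ECFP4 corresponds to a Morgan fingerprint of radius 2 in RDKit.
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros(n_bits)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Placeholder molecules and labels; the benchmarks draw these from DeepChem.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
labels = [0, 1, 0, 1]
X = np.stack([ecfp4(s) for s in smiles])

rf = RandomForestClassifier(n_estimators=500).fit(X, labels)
svm = SVC(kernel="rbf", probability=True).fit(X, labels)
```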

Classification Dataset | Ours | Random Forest | Kernel SVM
BACE | 0.825 ± 0.011 | 0.867 ± 0.008 | 0.862 ± 0.000
BBBP | 0.692 ± 0.015 | 0.714 ± 0.000 | 0.729 ± 0.000
Tox21 | 0.849 ± 0.006 | 0.769 ± 0.015 | 0.822 ± 0.006
ToxCast | 0.726 ± 0.014 | n/a | 0.669 ± 0.014
SIDER | 0.638 ± 0.020 | 0.684 ± 0.009 | 0.682 ± 0.013
ClinTox | 0.919 ± 0.048 | 0.713 ± 0.056 | 0.669 ± 0.092
MUV | 0.067 ± 0.050 | n/a | 0.137 ± 0.033
HIV | 0.763 ± 0.001 | n/a | 0.792 ± 0.000
PCBA | 0.218 ± 0.001 | n/a | n/a

Regression Dataset | Ours | Random Forest | Kernel Ridge Reg.
Delaney | 0.66 ± 0.07 | 1.07 ± 0.19 | 1.53 ± 0.06
FreeSolv | 1.06 ± 0.19 | 2.03 ± 0.22 | 2.11 ± 0.07
Lipo | 0.642 ± 0.065 | 0.876 ± 0.040 | 0.899 ± 0.043
QM8 | 0.0116 ± 0.0010 | n/a | 0.0195 ± 0.0003
QM9 | 2.6 ± 0.1 | n/a | n/a

As the two tables above show, our model outperforms the random forest on 5 out of 8 tasks and the kernel SVM (kernel ridge regression on the regression tasks) on 7 out of 12 tasks, counting only tasks where results are available for both models.

To conclude, our model achieves very promising results on the public DeepChem benchmarks compared against the top-performing baselines.