The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity

Author: Robert P. Sheridan
Journal: Journal of Chemical Information and Modeling
Publication year: 2015
Publication date: June 22, 2015
Volume: 55
Issue: 6
Pages: 1098-1107
Full-text size: 505K
ISSN: 1549-960X

Abstract

In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities (an “activity model”). The aim of the field of domain applicability (DA) is to estimate the uncertainty of prediction of a specific molecule on a specific activity model. A number of DA metrics have been proposed in the literature for this purpose. A quantitative model of the prediction uncertainty (an “error model”) can be built using one or more of these metrics. A previous publication from our laboratory (Sheridan, R. P. J. Chem. Inf. Model. 2013, 53, 2837–2850) suggested that QSAR methods such as random forest could be used to build error models by fitting unsigned prediction errors against DA metrics. The QSAR paradigm contains two useful techniques: descriptor importance can determine which DA metrics are most useful, and cross-validation can be used to tell which subset of DA metrics is sufficient to estimate the unsigned errors. Previously we studied 10 large, diverse data sets and seven DA metrics. For those data sets for which it is possible to build a significant error model from those seven metrics, only two metrics were sufficient to account for almost all of the information in the error model. These were TREE_SD (the variation of prediction among random forest trees) and PREDICTED (the predicted activity itself). In this paper we show that when data sets are less diverse, as for example in QSAR models of molecules in a single chemical series, these two DA metrics become less important in explaining prediction error, and the DA metric SIMILARITYNEAREST1 (the similarity of the molecule being predicted to the closest training set compound) becomes more important. Our recommendation is that when the mean pairwise similarity (measured with the Carhart AP descriptor and the Dice similarity index) within a QSAR training set is less than 0.5, one can use only TREE_SD, PREDICTED to form the error model, but otherwise one should use TREE_SD, PREDICTED, SIMILARITYNEAREST1.
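
The recommendation at the end of the abstract amounts to a simple decision rule, which can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: each molecule is assumed to be already encoded as a dictionary of descriptor counts (standing in for Carhart AP descriptors, which are not computed here), the count-based form of the Dice index is assumed, and the function names `dice_similarity`, `mean_pairwise_similarity`, and `select_da_metrics` are hypothetical.

```python
from itertools import combinations


def dice_similarity(a, b):
    """Count-based Dice index between two descriptor-count dictionaries.

    Assumed form: 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)).
    """
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, b[key]) for key, count in a.items() if key in b)
    return 2.0 * shared / total


def mean_pairwise_similarity(descriptors):
    """Average Dice similarity over all distinct pairs in the training set."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        raise ValueError("need at least two molecules")
    return sum(dice_similarity(a, b) for a, b in pairs) / len(pairs)


def select_da_metrics(descriptors, threshold=0.5):
    """Pick the DA metrics for the error model from training set diversity.

    Below the threshold (diverse set): TREE_SD and PREDICTED only.
    Otherwise (for example a single chemical series): add SIMILARITYNEAREST1.
    """
    metrics = ["TREE_SD", "PREDICTED"]
    if mean_pairwise_similarity(descriptors) >= threshold:
        metrics.append("SIMILARITYNEAREST1")
    return metrics


# Toy descriptor counts (made up for illustration, not real molecules).
diverse_set = [{"a": 1, "b": 1}, {"c": 2}, {"d": 1, "e": 1}]
series_set = [{"a": 2, "b": 1}, {"a": 2, "b": 1, "c": 1}, {"a": 2, "b": 2}]

print(select_da_metrics(diverse_set))
print(select_da_metrics(series_set))
```

With the toy data above, the first set shares no descriptors between molecules and so falls below the 0.5 threshold, while the second set mimics a closely related series and falls above it. In practice the descriptor dictionaries would come from a cheminformatics toolkit rather than being written by hand.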
