Variable selection after screening: with or without data splitting?
  • Authors: Xiaoyi Zhu; Yuhong Yang
  • Keywords: Model selection; Sparse regression; Variable screening; Prediction
  • Journal: Computational Statistics
  • Publication year: 2015
  • Publication date: March 2015
  • Volume: 30
  • Issue: 1
  • Pages: 191-203
  • Full-text size: 165 KB
  • References: 1. Breheny, P, Huang, J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5: pp. 232-253
    2. Bühlmann, P, Mandozzi, J (2014) High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Comput Stat 29: pp. 407-430
    3. Chen, L, Yang, Y (2010) Combining statistical procedures. In: Cai, T, Shen, X (eds) High-dimensional data analysis. Frontiers of Statistics. World Scientific Publishing, Singapore
    4. Clarke, B (2003) Comparing Bayes and non-Bayes model averaging when model approximation error cannot be ignored. J Mach Learn Res 4: pp. 683-712
    5. Fan, J, Li, R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96: pp. 1348-1360
    6. Fan, J, Lv, J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B 70: pp. 849-911
    7. Hoeting, J, Madigan, D, Raftery, A, Volinsky, C (1999) Bayesian model averaging: a tutorial (with discussion). Stat Sci 14: pp. 382-417
    8. Huang, J, Ma, S, Zhang, C (2008) Adaptive Lasso for sparse high-dimensional regression models. Stat Sin 18: pp. 1603-1618
    9. Leng, C, Wang, H (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70: pp. 849-911
    10. Meinshausen, N, Meier, L, Bühlmann, P (2009) p-values for high-dimensional regression. J Am Stat Assoc 104: pp. 1671-1681
    11. Scheetz, TE, Kim, K-YA, Swiderski, RE, Philp, AR, Braun, TA, Knudtson, KL, Dorrance, AM, DiBona, GF, Huang, J, Casavant, TL, Sheffield, VC, Stone, EM (2006) Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc Natl Acad Sci 103: pp. 14429-14434
    12. Tibshirani, R (1996) Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B 58: pp. 267-288
    13. Wasserman, L, Roeder, K (2009) High-dimensional variable selection. Ann Stat 37: pp. 2178-2201
    14. Yang, Y (2005) Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92: pp. 937-950
    15. Zhang, C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38: pp. 894-942
    16. Zhang, W, Xia, Y (2008) Discussion on “Sure independence screening for ultrahigh dimensional feature space”. J R Stat Soc Ser B 70: pp. 849-911
  • Journal category: Mathematics and Statistics
  • Journal subjects: Mathematics
    Statistics
    Probability and Statistics in Computer Science
    Probability Theory and Stochastic Processes
    Economic Theory
  • Publisher: Physica Verlag, An Imprint of Springer-Verlag GmbH
  • ISSN: 1613-9658
Abstract
High-dimensional data sets are now frequently encountered in many scientific fields. To select a sparse set of predictors that have predictive power and/or provide insight into which predictors really influence the response, a preliminary variable screening is typically done, often informally. Fan and Lv (J R Stat Soc Ser B 70:849-911, 2008) proposed sure independence screening (SIS) to reduce the dimension of the set of predictors from ultra-high to a moderate scale below the sample size. One may then apply a familiar variable selection technique. While this approach has become popular, the screening bias issue has been largely ignored. The screening bias may lead to the final selection of a number of predictors that have little or no value for prediction or explanation. In this paper we set out to examine this screening bias both theoretically and numerically, and compare the approach with an alternative that utilizes data splitting. The simulation results and real bioinformatics examples show that data splitting can significantly reduce the screening bias for variable selection and improve the prediction accuracy as well.
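
To make the two pipelines contrasted in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' procedure: screening is done by ranking absolute marginal correlations (the basic idea behind SIS), and the "familiar variable selection technique" is stood in for by a greedy forward search scored with BIC, chosen only because it needs nothing beyond NumPy. The function names, the half-and-half split, and the simulated data are all hypothetical.

```python
import numpy as np


def sis_screen(X, y, d):
    """Keep the d predictors with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    return np.sort(np.argsort(corr)[::-1][:d])


def _rss(X, y, cols):
    """Residual sum of squares of an OLS fit with intercept on the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def forward_bic(X, y, candidates):
    """Stand-in selector (an assumption): greedy forward selection scored by BIC."""
    n = len(y)
    selected = []
    remaining = [int(j) for j in candidates]
    best = n * np.log(_rss(X, y, selected) / n)
    while remaining and len(selected) < n - 2:
        scores = [
            n * np.log(_rss(X, y, selected + [j]) / n)
            + (len(selected) + 1) * np.log(n)
            for j in remaining
        ]
        k = int(np.argmin(scores))
        if scores[k] >= best:  # no candidate improves BIC: stop
            break
        best = scores[k]
        selected.append(remaining.pop(k))
    return sorted(selected)


def select_without_splitting(X, y, d):
    """Screen and select on the same observations (prone to screening bias)."""
    return forward_bic(X, y, sis_screen(X, y, d))


def select_with_splitting(X, y, d, rng):
    """Screen on one random half of the data, select on the other half."""
    n = len(y)
    perm = rng.permutation(n)
    first, second = perm[: n // 2], perm[n // 2:]
    kept = sis_screen(X[first], y[first], d)
    return forward_bic(X[second], y[second], kept)


if __name__ == "__main__":
    # Simulated data (hypothetical): 3 true predictors among 500.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 500))
    y = X[:, :3] @ np.array([3.0, 3.0, 3.0]) + rng.standard_normal(200)
    print("no split:", select_without_splitting(X, y, 20))
    print("split:   ", select_with_splitting(X, y, 20, rng))
```

In the no-split version, the predictors that survive screening were chosen because they looked correlated with the response in the very data later used for selection, which is the source of the screening bias the abstract describes. In the split version, the second half of the data had no part in choosing the candidates, at the cost of using fewer observations at each stage.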
