Stratified Over-Sampling Bagging Method for Random Forests on Imbalanced Data
  • Keywords: Imbalanced data; Classification; Stratified sampling; Random forests
  • Publication: Lecture Notes in Computer Science
  • Publication year: 2016
  • Volume: 9650
  • Issue: 1
  • Pages: 63-72
  • Full-text size: 442 KB
  • Authors and affiliations: He Zhao (16)
    Xiaojun Chen (17)
    Tung Nguyen (18)
    Joshua Zhexue Huang (17)
    Graham Williams (19)
    Hui Chen (16)

    16. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
    17. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
    18. Faculty of Computer Science and Engineering, Thuyloi University, Hanoi, Vietnam
    19. Australian National University, Canberra, Australia
  • Series title: Intelligence and Security Informatics
  • ISBN: 978-3-319-31863-9
  • Subject category: Computer Science
  • Subject topics: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
Abstract
Imbalanced data poses a major challenge for random forests (RF). Over-sampling, a commonly used sampling method for imbalanced data, increases the number of minority-class instances to balance the class distribution. However, if only more minority-class instances are sampled, the resulting sample data sets are often highly correlated, which reduces the generalizability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method that generates training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results. The small individual clusters are then grouped according to their entropies, and a set of training data sets is drawn from the groups of clusters by stratified sampling. Finally, these training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. We have conducted a series of experiments, and the results show that the proposed method is more effective than some existing sampling methods.
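
The steps in the abstract (cluster several times, group clusters by entropy, then draw balanced samples from the groups) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, how entropy groups are formed, or the sample sizes, so everything below is an assumption made for illustration. In particular, the cluster assignments are taken as given input, the problem is assumed to be binary, clusters are bucketed into equal-width entropy bins, and the helper names `cluster_entropy`, `group_clusters_by_entropy`, and `stratified_balanced_sample` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def group_clusters_by_entropy(cluster_ids, y, n_groups=2):
    """Bucket clusters into entropy strata; return one index list per stratum.

    Assumption: binary classes, so entropy lies in [0, 1] and equal-width
    bins over that range are used. The paper may group differently.
    """
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        members[c].append(i)
    groups = [[] for _ in range(n_groups)]
    for idx in members.values():
        h = cluster_entropy([y[i] for i in idx])
        g = min(int(h * n_groups), n_groups - 1)
        groups[g].extend(idx)
    return [g for g in groups if g]  # drop empty strata


def stratified_balanced_sample(groups, y, minority, n_per_class, rng):
    """Draw n_per_class indices for each class, spread across the strata.

    Sampling is with replacement, so the minority class is over-sampled.
    Assumption: both classes occur somewhere in the data.
    """
    sample = []
    for want_minority in (True, False):
        pools = [[i for i in g if (y[i] == minority) == want_minority]
                 for g in groups]
        pools = [p for p in pools if p]  # strata that contain this class
        for k in range(n_per_class):
            pool = pools[k % len(pools)]  # cycle through the strata
            sample.append(rng.choice(pool))
    return sample


# Toy data: 4 minority (label 1) and 16 majority (label 0) instances.
y = [1] * 4 + [0] * 16
# Two hypothetical clusterings of the same 20 instances.
clusterings = [[i % 4 for i in range(20)], [i // 5 for i in range(20)]]

rng = random.Random(0)
bags = [
    stratified_balanced_sample(group_clusters_by_entropy(c, y), y, 1, 8, rng)
    for c in clusterings
]
```

Each entry of `bags` is a list of instance indices holding equal numbers of minority and majority instances. In a full pipeline, each such bag would presumably be used to train one tree (or a subset of trees) of the forest, and drawing from different clusterings is what is meant to keep the bags diverse.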
