Machine Learning on Imbalanced Data Sets and Its Application to Geosciences Data Processing
Abstract
Classification is one of the key tasks of data mining and knowledge discovery. Traditional machine learning research on classification largely rests on three assumptions: (1) the goal is a high overall classification accuracy; (2) the classes in the data set are roughly balanced in size; (3) all misclassifications incur the same cost. On these assumptions, a large number of classification algorithms have been developed, such as decision trees, Bayesian classifiers, artificial neural networks, K-nearest neighbors, support vector machines, and genetic algorithms, and they have been widely applied in medical diagnosis, information retrieval, text classification, and many other fields. Real-world classification problems, however, are often class-imbalanced: one class may contain far more samples than the others. In such cases a classifier tends to assign every test sample to the majority class and to ignore the minority class, so the resulting classifier performs poorly on minority-class samples. The intrinsic characteristics of imbalanced data sets (absolute and relative rarity of minority-class data, data fragmentation, and noise) and the limitations of traditional classification algorithms (inappropriate evaluation criteria and inappropriate inductive bias) are the key obstacles to accurate and reliable classification of imbalanced data. Classification of imbalanced data sets has therefore become a new research focus in machine learning and pattern recognition and poses a major challenge to traditional classification algorithms.
     Current approaches to improving classification performance on imbalanced data sets operate at the data level or the algorithm level. At the data level, re-sampling, comprising over-sampling and under-sampling, changes the class distribution to reduce the degree of imbalance and thereby improve performance. At the algorithm level, existing methods such as cost-sensitive learning, support vector machines, one-class learning, and ensemble learning are modified by adjusting the cost function between classes, assigning different weights to different classes, changing the probability density, or moving the decision boundary, so that classification favors the minority class. Although these techniques improve minority-class performance to some degree, they still suffer from over-fitting or from loss of important majority-class information, which compromises the reliability of the classification results. How to raise minority-class performance without degrading overall performance, so that accurate predictions can be drawn from the classification of imbalanced data, therefore remains a topic worth further study.
     Targeting the three assumptions of traditional machine learning classification, this thesis presents a systematic study along two lines: improving the algorithms and validating their practical utility. It first discusses in detail the assessment methods and evaluation measures for classification of imbalanced data sets. At the data level it then makes two key improvements to existing re-sampling algorithms and applies the proposed algorithms to the classification preprocessing of geosciences data; at the algorithm level it organically fuses data-set reconstruction with algorithm modification based on minimal misclassification cost. The main work and conclusions are as follows:
     I. Classification performance evaluation and algorithm improvement for imbalanced data sets
     1. Classification performance evaluation for imbalanced data sets
     We discussed the soundness of the first basic assumption of traditional machine learning classification, namely whether high overall accuracy is a suitable objective for evaluating classification performance on imbalanced data sets. Correctly evaluating a classification system guides the selection of classification features and classifier parameters, so testing system performance is an important step. Many assessment methods and evaluation measures exist; a given classification method may favor particular measures, so any improvement to a method is an improvement with respect to some criterion. Building or designing more advanced algorithms for machine learning classification attracts much scholarly attention, yet evaluating the results is at least as important as improving the algorithms and is key to real progress in data mining. This thesis systematically discusses classical classification techniques and the common assessment methods and evaluation measures, analyzing and comparing numerical and graphical measures, and points out that some measures can be problematic when evaluating classification of imbalanced data sets, making it difficult to judge the results correctly. The thesis also examines several other composite numerical measures that can likewise be used to evaluate classification performance on imbalanced data.
     In fact, no single evaluation measure suits all classification problems, and blindly fixing on one measure as the criterion is not a good strategy. Which measure to use is a concrete question in classifier design that depends on the application context or the user's needs; only by choosing a measure appropriate to the situation can we judge an algorithm's classification performance correctly.
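As a minimal illustration of why overall accuracy misleads on imbalanced data, the following sketch (illustrative, not from the thesis) computes several of the measures discussed above from a binary confusion matrix:

```python
# Minimal sketch: per-class evaluation measures from a binary confusion matrix.
def imbalance_metrics(tp, fn, fp, tn):
    """Return overall accuracy, minority-class F-measure, and G-mean."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0       # minority-class recall
    precision = tp / (tp + fp) if tp + fp else 0.0    # minority-class precision
    specificity = tn / (tn + fp) if tn + fp else 0.0  # majority-class recall
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = (recall * specificity) ** 0.5
    return accuracy, f_measure, g_mean

# A classifier that labels everything as the majority class on a 1:99 data set:
acc, f1, gm = imbalance_metrics(tp=0, fn=1, fp=0, tn=99)
# accuracy is 0.99, yet F-measure and G-mean are both 0 -- exposing the
# weakness of overall accuracy as an evaluation measure on imbalanced data.
```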
     2. Re-sampling algorithms for imbalanced data sets
     Addressing the second basic assumption of traditional machine learning classification, that the classes in a data set are roughly balanced, we studied the classification of imbalanced data sets. The thesis proposes two hybrid re-sampling algorithms that combine over-sampling with under-sampling so that an imbalanced data set becomes roughly balanced before classification.
     The first is a hybrid re-sampling algorithm with automated adaptive selection of nearest neighbors (Automated Adaptive Selection of the Number of Nearest Neighbors of Hybrid Re-Sampling, ADSNNHRS). It has two parts. The over-sampling part addresses two weaknesses of SMOTE (Synthetic Minority Over-sampling Technique): the blindness of its synthetic-sample generation and its ability to replicate only numeric attributes. Based on the actual internal distribution of the instance set, it adaptively adjusts SMOTE's neighbor-selection strategy and, for data sets with mixed attribute types, generates new instances with different replication methods, thereby effectively controlling and improving the quality of the synthetic samples. The under-sampling part applies an improved neighborhood cleaning rule to the augmented instance set, removing redundant majority-class instances and noisy data on the class boundary. The proposed method thus combines the strengths of over-sampling and under-sampling: it emphasizes the positive class by adding minority-class samples through adaptive neighbor selection, and it moderately under-samples the majority class to reduce its size, bringing the two classes into relative balance. It can therefore handle imbalanced classification effectively and improve classifier performance.
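The interpolation step at the heart of SMOTE, which the over-sampling part above builds on, can be sketched as follows (an illustrative simplification, not the thesis' ADSNNHRS implementation: a fixed k rather than adaptive neighbor selection, and numeric attributes only):

```python
import random

def smote_sample(minority, k=2, seed=0):
    """SMOTE-style synthesis sketch: for each minority instance, interpolate
    a new numeric sample toward one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority neighbors by squared Euclidean distance
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        n = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment x -> n
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, n)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
new = smote_sample(minority)
# each synthetic point lies on a segment between two real minority samples
```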
     The second is a hybrid re-sampling algorithm based on Isomap dimensionality reduction (Hybrid Re-Sampling based on Isomap, HRS-Isomap), which combines nonlinear dimensionality reduction with hybrid re-sampling to reduce data imbalance. The thesis examines the two common families of dimensionality reduction: linear methods such as Principal Component Analysis (PCA) and Multidimensional Scaling (MDS), and nonlinear methods such as Isometric feature mapping (Isomap) and Locally Linear Embedding (LLE). Two classical methods were applied to the processing of geosciences data, where preprocessing before classification simplifies the model structure and improves overall predictive performance. On this basis, we note that SMOTE implicitly assumes that any point between two minority-class samples also belongs to the minority class, an assumption that need not hold in practice, especially when the data are not linearly separable. We therefore combine the nonlinear Isomap algorithm with hybrid re-sampling: Isomap first maps the original data into a lower-dimensional space where they are more nearly linearly separable; SMOTE then over-samples the minority class in that space; finally, neighborhood cleaning under-samples the result, yielding roughly balanced low-dimensional data. After nonlinear dimensionality reduction, classification performance improves considerably and every evaluation measure rises to some degree; with hybrid re-sampling applied after the reduction, the minority-class F-measure in particular improves markedly, and overall performance rises as well. Introducing the nonlinear Isomap method into the re-sampling of imbalanced data is thus effective, and Isomap's strong dimensionality reduction and ability to uncover the intrinsic structure of data offer a new way of approaching imbalanced classification.
     3. Cost-sensitive learning algorithms for imbalanced data sets
     We then addressed the third basic assumption, that all classification errors incur the same cost. Most studies concentrate either on imbalanced classification alone or on cost-sensitive learning alone, overlooking the fact that class imbalance and unequal misclassification costs usually occur together. This thesis therefore fuses two kinds of solutions within a cost-sensitive learning algorithm: the data set is first rebalanced by reconstructing the sample space of the two classes, and classification is then based on minimal misclassification cost rather than minimal error rate, with a larger cost assigned to the class of interest and smaller costs to the others, before the cost-sensitive learner is applied.
     When sample-space reconstruction makes the class distribution relatively balanced and a suitable cost factor is chosen, the cost-sensitive algorithm based on minimal misclassification cost clearly outperforms the other classification algorithms: minority-class performance rises sharply, and overall performance also improves to some degree.
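The decision rule of minimal expected misclassification cost described above can be sketched as follows (illustrative; the posterior probabilities and cost-matrix values are hypothetical):

```python
def min_cost_class(posterior, cost):
    """Pick the class with minimal expected misclassification cost.
    cost[i][j] is the cost of predicting class i when the truth is class j."""
    n = len(posterior)
    expected = [sum(cost[i][j] * posterior[j] for j in range(n)) for i in range(n)]
    return min(range(n), key=expected.__getitem__)

# class 0 = majority, class 1 = minority; missing a minority case costs 10x
cost = [[0, 10],   # predict 0: free if truth is 0, cost 10 if truth is 1
        [1, 0]]    # predict 1: cost 1 if truth is 0, free if truth is 1
# even when the minority posterior is only 0.2, minimal cost picks class 1
print(min_cost_class([0.8, 0.2], cost))  # -> 1
```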
     II. Application and analysis of imbalanced-data classification methods in geosciences
     The proposed re-sampling algorithm with adaptive neighbor selection was applied to rockburst hazard prediction. Rockburst statistics form a typical imbalanced data set, on which traditional data mining classifiers can hardly make accurate predictions; yet it is precisely the minority-class instances of the rockburst phenomenon that truly matter and for which high predictive accuracy is sought. Using the database of VCR stope rockburst instances built by the Academy of South Africa, we generated additional artificial minority-class instances as training data and ran simulation experiments; the predicted rockburst hazard states agreed completely with the actual situations. The proposed re-sampling scheme is therefore feasible for engineering rockburst prediction with imbalanced instance data, attains high predictive accuracy, and shows good prospects for engineering application. The method needs no complicated mathematical equations or computational models; its inputs are objectively available or easily measured, so it is simple to implement. It can also identify the main factors controlling rockburst occurrence, providing a scientific basis for the rational design and safe construction of deep mining projects.
     The main innovations of the thesis are as follows:
     1. Two hybrid re-sampling algorithms. To address the problems and the inaccurate assumption in the way the classical over-sampling algorithm SMOTE generates synthetic samples, we propose the adaptive-neighbor hybrid re-sampling algorithm ADSNNHRS and the Isomap-based hybrid re-sampling algorithm HRS-Isomap; both handle imbalanced classification effectively.
     2. A new cost-sensitive learning algorithm for imbalanced data sets. Recognizing that class imbalance and unequal misclassification costs can occur together, we organically fuse two different kinds of solutions, sample-space reconstruction and cost-sensitive learning based on minimal misclassification cost; the resulting classifier clearly outperforms the alternatives.
     3. Introduction of imbalanced-data methods into geosciences. Given that much geosciences data is uncertain, empirical, indirect, incomplete, and class-imbalanced, we apply dimensionality reduction flexibly in preprocessing high-dimensional geosciences data and introduce the concepts, models, and solutions of imbalanced-data machine learning into geosciences data analysis, providing a powerful set of tools for processing massive geosciences data and raising the automation and intelligence of its analysis.
Classification is an important task of data mining and knowledge discovery in databases. Conventional machine learning classification technologies assume that maximizing overall accuracy is the goal of classification, that the classes in the data set are roughly balanced in size, and that misclassification in any situation brings the same error cost. Based on such assumptions, a large number of classification algorithms, such as decision trees, Bayesian classification, artificial neural networks, K-nearest neighbors, support vector machines, genetic algorithms, and newly reported algorithms, have been developed and successfully applied to many fields such as medical diagnosis, information retrieval, and text classification. However, these assumptions often fail on the imbalanced data sets (IDS) of real problems, where one class may concentrate a large number of samples while the other classes have very few. Most classification algorithms pursue minimal error rate, ignore the differences between the costs of different types of misclassification error, and consequently yield poor predictive accuracy on the minority class. The major difficulties of IDS classification lie in the features of the data sets themselves (absolute and relative rarity of minority-class data, data fragmentation, noise, etc.) and the limitations of conventional classification algorithms (improper evaluation metrics and inappropriate inductive bias). Consequently, classification on IDS has become a hot topic of machine learning and pattern recognition, and it presents a great challenge for conventional classification algorithms.
     In the last decades, many efforts have been made to improve classification performance on the minority class. Two general approaches are currently available to tackle imbalanced classification problems. One operates at the data level and is known as data set reconstruction or re-sampling: by under-sampling the majority class, over-sampling the minority class, or combining the two, the degree of class imbalance is reduced and minority-class performance can be improved to a certain extent. The other operates at the algorithm level, modifying existing data mining algorithms or developing new ones such as Cost-Sensitive Learning (CSL), support vector machines, one-class classification, and ensemble learning methods; by revising the cost factor, setting different weights for different samples, changing the probability density function, or adjusting the decision border, one can also improve minority-class performance. However, although improvements have been achieved, problems such as loss of important majority-class information and over-fitting remain unsolved and decrease the reliability of the predicted results. Therefore, how to improve minority-class performance while preserving overall performance, and thus obtain accurate predictions from the classification results, is still a topic well worth further study.
     Centering on this topic and starting from the three basic assumptions, this thesis presents a deep and systematic investigation covering the development of several novel algorithms for IDS and the validation of their reliability. As a first step, the assessment methods and evaluation measures of classification performance were thoroughly discussed. At the data level, we then proposed two key improvements to the re-sampling of IDS based on the existing SMOTE over-sampling algorithm, and applied these techniques to the preprocessing of geosciences data sets to validate their reliability; at the algorithm level, we combined the re-sampling technique with a CSL technique based on minimal total misclassification cost to achieve better classification performance. The main efforts and conclusions of this thesis are listed below:
     1. Classification Performance Evaluation and Algorithm Development of IDS
     A) Assessment methods and evaluation measures of the classification performance of IDS
     Whether a high overall accuracy can serve as the evaluation measure of IDS classification was discussed first. Assessment methods and evaluation measures of classification performance play a critical role in guiding the design of classifiers. Many assessment methods and evaluation measures exist, each with its own advantages and disadvantages, so a modification of a classification algorithm is to some extent an improvement with respect to a particular criterion. Much effort has gone into designing and developing more advanced algorithms to solve classification problems; in fact, the assessment methods and evaluation measures are at least as important as the algorithms and are a key first stage of successful data mining. We systematically summarized the typical classification technologies, the general classification algorithms, and the assessment methods and evaluation measures for IDS. Several different types of performance measure, both numerical measures and visual measures of classifier performance, were analyzed and compared; the problems of these measures on IDS may lead to misunderstanding of classification results and even to wrong strategic decisions. Besides these, a series of composite numerical evaluation measures that can also serve to evaluate IDS classification performance were investigated.
     In general, there is no universal evaluation measure for all kinds of classification problems. A good strategy for identifying a proper evaluation measure depends largely on the specific application requirements; choosing an appropriate measure for the context at hand helps one judge an algorithm's classification performance correctly.
     B) Re-sampling algorithms of IDS
     We proposed two new hybrid re-sampling techniques based on improved SMOTE over-sampling. By combining the over-sampling and under-sampling technologies, the IDS is brought to approximate balance before classification.
     The first technique is the hybrid re-sampling method with automated adaptive selection of the number of nearest neighbors. The original SMOTE blindly adds new synthetic minority-class examples by randomly interpolating between pairs of close neighbors, and it cannot handle data sets with nominal features. In our over-sampling procedure, these two problems are solved by automated adaptive selection of nearest neighbors and adjustment of the neighbor-selection strategy, so the quality of the new samples can be well controlled. In the under-sampling procedure, an improved neighborhood cleaning rule (NCR) removes borderline majority-class examples and noisy or redundant data. The method in fact combines the improved SMOTE with NCR data cleaning: the motivation is not only to balance the training data but also to remove noisy examples lying on the wrong side of the decision border. Removing such examples can help find better-defined class clusters, allowing simpler models with better generalization capabilities, and therefore promises effective processing of IDS and considerably enhanced classifier performance.
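A simplified version of the neighborhood-cleaning idea, dropping majority-class examples whose local neighborhood is dominated by the minority class, might look like this (a sketch under simplifying assumptions, not the exact NCR variant used in the thesis):

```python
def ncr_clean(data, k=3):
    """Neighborhood-cleaning sketch: remove a majority-class example (label 0)
    when most of its k nearest neighbors carry the minority label (1), i.e. it
    sits on the wrong side of the decision border or is noise."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    kept = []
    for x, label in data:
        if label == 0:  # only majority-class examples are candidates for removal
            neighbors = sorted((d for d in data if d[0] is not x),
                               key=lambda d: dist2(x, d[0]))[:k]
            minority_votes = sum(1 for _, l in neighbors if l == 1)
            if minority_votes > k // 2:
                continue  # drop: a borderline or noisy majority example
        kept.append((x, label))
    return kept

data = [((0.0, 0.0), 0), ((0.1, 0.1), 0), ((0.2, 0.0), 0), ((0.0, 0.2), 0),
        ((5.0, 5.0), 1), ((5.1, 5.0), 1), ((5.0, 5.1), 1),
        ((5.05, 5.05), 0)]  # last: a noisy majority example in the minority cluster
cleaned = ncr_clean(data)
# the majority point embedded in the minority cluster is removed
```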
     The second technique is the Isomap-based hybrid re-sampling method, which reduces the degree of class imbalance by combining Isomap nonlinear dimensionality reduction with hybrid re-sampling. We first analyzed the most common linear (principal component analysis and multidimensional scaling) and nonlinear (isometric feature mapping and locally linear embedding) dimensionality reduction algorithms. Two of these were used to preprocess geosciences data and reduce the dimensionality of the feature space; the structure of the classification model was thus simplified and overall classification performance clearly improved. SMOTE over-samples the minority class under the strict assumption that the space between any two minority-class instances belongs to the minority class, which need not hold when the training data are not linearly separable. We therefore present a new re-sampling technique based on Isomap: the Isomap algorithm first maps the high-dimensional data into a low-dimensional space, where the data are more separable and can be over-sampled by SMOTE; the over-sampled set is then under-sampled by NCR, yielding balanced low-dimensional data sets. With this procedure the evaluation measures rise across the board and classification performance improves considerably, especially the F-measure of the minority class; in fact, both overall and minority-class performance improve simultaneously. Experimental results demonstrate that the Isomap-based hybrid re-sampling algorithm outperforms re-sampling alone. Isomap is thus an effective means of dimensionality reduction for re-sampling and provides a possible new solution to IDS classification.
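The geodesic-distance computation at the core of Isomap can be sketched as follows (illustrative; classical MDS on the resulting distance matrix would complete the embedding, and the sample points are hypothetical):

```python
import math

def geodesic_distances(points, k=2):
    """Core Isomap step: approximate geodesic distances by Euclidean edges in
    a k-nearest-neighbor graph, then all-pairs shortest paths (Floyd-Warshall)."""
    n = len(points)
    euclid = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    INF = float("inf")
    g = [[INF] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        for j in sorted(range(n), key=euclid[i].__getitem__)[1:k + 1]:
            g[i][j] = g[j][i] = euclid[i][j]  # connect i to its k nearest neighbors
    for m in range(n):  # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# points on a "tent"-shaped curve; k=1 links them into a chain here
pts = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
g = geodesic_distances(pts, k=1)
# geodesic distance g[0][4] follows the curve (4 * sqrt(2)), exceeding the
# straight-line distance math.dist(pts[0], pts[4]) = 4
```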
     C) CSL algorithm of IDS
     We first discussed the misclassification cost problem, centering on the third assumption of conventional machine learning. Most studies focus on IDS classification or on cost-sensitive learning systems alone; the fact that imbalanced class distribution and unequal misclassification costs usually occur together has been neglected. We therefore combined the re-sampling and CSL techniques to address the misclassification of IDS. On the one hand, re-sampling balances the data set by reconstructing both the majority and the minority class; on the other hand, classification is based on minimal misclassification cost rather than maximal accuracy, with a much higher misclassification cost assigned to the minority class than to the majority class, and cost-sensitive learning is then conducted for classification. Using an appropriate cost factor and balancing the data set through re-sampling, our CSL algorithm based on minimal misclassification cost performs much better than currently available classification techniques: not only is minority-class performance improved significantly, but overall performance is also enhanced to a certain extent.
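For the two-class case, the cost-sensitive decision rule above reduces to thresholding the minority-class posterior; a sketch under the standard simplifying assumption of zero cost for correct decisions (the cost values are hypothetical):

```python
def cost_threshold(c_fp, c_fn):
    """Optimal decision threshold on the minority-class posterior when errors
    have unequal costs: predicting the minority class is cheaper in expectation
    whenever P(minority | x) >= c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

# missing a minority case (false negative) is 9x as costly as a false alarm
t = cost_threshold(c_fp=1, c_fn=9)
# the threshold drops from the cost-blind 0.5 to 0.1, so far more borderline
# samples are assigned to the minority class
```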
     2. Application and Analysis of Our Classification Algorithm of IDS in Geosciences
     The re-sampling method with automated adaptive selection of the number of nearest neighbors was applied to rockburst hazard prediction. Rockburst statistics form a typical IDS, on which conventional classification methods can hardly give accurate predictions. In fact, it is the minority class rather than the majority class that we are concerned with, and high prediction accuracy is always desired. In this thesis, the VCR rockburst database provided by the Academy of South Africa was employed as a sample IDS for classification and prediction. With extra artificial minority-class samples added to expand the training set, simulation experiments yielded predictions exactly consistent with the actual situations. The re-sampling method and classification scheme we developed are thus feasible and reasonable for engineering applications of IDS. Our algorithms need no complicated mathematical equations or computational models, and the input data can be easily measured or obtained, so the method can be readily implemented to determine the controlling factors of rockburst occurrence. Such predictions can provide reasonable and sufficient guidance for the rational design and safe construction of deep mining engineering.
     The major innovations and contributions of this thesis are as follows:
     a) We developed two types of hybrid re-sampling algorithms. To address the problems and improper assumptions of the SMOTE algorithm, we proposed the hybrid re-sampling algorithm with automated adaptive selection of the number of nearest neighbors and the Isomap-based hybrid re-sampling algorithm, respectively. Both can effectively deal with IDS classification.
     b) We proposed a novel CSL algorithm for IDS. Addressing the fact that imbalanced class distribution and unequal misclassification costs usually occur together, we combined the re-sampling and CSL techniques to solve the misclassification problem of IDS. The combined algorithm integrates the advantages of both and thus performs much better than existing methods.
     c) We introduced IDS processing methods into the analysis of geosciences data. Given characteristics of geosciences data such as uncertainty, empiricism, indirectness, incompleteness, and imbalanced class distribution, we first employed dimensionality reduction to preprocess the data and then applied effective IDS classification methods to process huge amounts of geosciences data. Such an analytical scheme is a powerful tool for the automatic and intelligent analysis of geosciences data.
引文
[1]Probost,Foster.Machine Learning from Imbalanced Data Sets 101.Invited paper for the AAA I' 2000 Workshop on Imbalanced Data Sets.2000.
    [2]Weiss,G.M.Mining with Rarity:A Unifying Framework[J].ACM SIGKDD Explorations News letter.2004,6(1):7-19
    [3]Batista,G.E.,Prati,R.C.,Monard,M.C.A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data[J].ACM SIGKDD Explorations Newsletter.2004,6(1):20-29
    [4]Chawla,N.V.,Japkowicz,N.,Kotcz,A.Editorial:Special Issue on Learning from Imbalanced Data Sets[J].ACM SIGKDD Explorations Newsletter.2004,6(1):1-6
    [5]Japkowicz,N.The Class Imbalance Problem:A Systematic Study[J].Intelligent Data Analysis.2002,6(5):429-449
    [6]Akbani,R.,Kwek,S.,Japkowicz,N.Applying Support Vector Machines to Imbalanced Datasets[J].LECTURE NOTES IN COMPUTER SCIENCE.2004:39-50
    [7]Raskutti,B.,Kowalczyk,A.Extreme Re-Balancing for Svms:A Case Study[J].ACM SIGKDD Explorations Newsletter.2004,6(1):60-69
    [8]Wu,G.,Chang,E.Y.Class-Boundary Alignment for Imbalanced Dataset Learning.Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets.Washington.2003.
    [9]Ezawa,K.J.,Singh,M.,Norton,S.W.Learning Goal Oriented Bayesian Networks for Telecommunications Risk Management.1996.139-147
    [10]Zhang.J,Mani.I.Knn Approach to Unbalanced Data Distributions:A Case Study Involving Information Extraction.Proceedings of the ICML'03 Workshop on Learning from Imbalanced Data Sets.Washington.2003.
    [11]Liu,B.,Hsu,W.,Ma,Y.Mining Association Rules with Multiple Minimum Supports.1999.337-341
    [12]Andrew,K.C.High-Order Pattern Discovery from Discrete-Valued Data[J].IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1997:877-893
    [13]Phua,C.,Alahakoon,D.,Lee,V.Minority Report in Fraud Detection:Classification of Skewed Data[J].ACM SIGKDD Explorations Newsletter.2004,6(1):50-59
    [14]Fawcett,T.,Provost,F.Adaptive Fraud Detection[J].Data Mining and Knowledge Discovery.1997,1(3):291-316
    [15]Kubat,M.,Holte,R.C.,Matwin,S.Machine Learning for the Detection of Oil Spills in Satellite Radar Images[J].machine learning.1998,30(2):195-215
    [16]Zheng,Z.,Wu,X.,Srihari,R.Feature Selection for Text Categorization on Imbalanced Data[J].ACM SIGKDD Explorations Newsletter.2004,6(1):80-89
    [17]Cohen,G,Hilario,M.,Sax,H.,et al.Data Imbalance in Surveillance of Nosocomial Infections[J].LECTURE NOTES IN COMPUTER SCIENCE.2003:109-117
    [18]Chen,J.X.,Cheng,T.H.,Chan,A.L.F.,et al.An Application of Classification Analysis for Skewed Class Distribution in Therapeutic Drug Monitoring-the Case of Vancomycin.2004.35-39
    [19]Radivojac,P.,Korad,U.,Sivalingam,K.M.,et al.Learning from Class-Imbalanced Data in Wireless Sensor Networks.Proceeding of Vehicular Technology Conference.Orlando.2003.3030-3034
    [20]Japkowicz.N.Aaai Tech Report Ws-00-05.Proceedings of the AAAF2000 Workshop on Learning from Imbalanced Data Sets.2000.
    [21]NV,Chawla,N,Japkowicz,A,Kolcz.Proceedings of the ICML'2003 Workshop on Learning from Imbalanced Data Sets.2003.
    [22]Visa,S.,Ralescu,A.Issues in Mining Imbalanced Data Sets-a Review Paper.Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference.2005.16-17
    [23]Daskalaki,S.,Kopanas,I.,Avouris,N.Evaluation of Classifiers for an Uneven Class Distribution Problem[J].Applied Artificial Intelligence.2006,20(5):3 81-417
    [24]Estabrooks.A.A Combination Scheme for Inductive Learning from Imbalanced Data Sets[Master's Thesis].Dalhousie University.2000
    [25]Chawla,N.V.,Bowyer,K.W.,Hall,L.O.,et al.Smote:Synthetic Minority over-Sampling Technique[J]Journal of Artificial Intelligence Research.2002,16(3):321-357
    [26]Hui,HAN,Wen-yuan,WANG,Bing-huan,MAO.Borderline-Smote:A New over-Sampling Method in Imbalanced Data Sets Learning[J].LECTURE NOTES IN COMPUTER SCIENCE.2005:878-887
    [27]Hart,RE.The Condensed Nearest Neighbor Rule[J].IEEE Transactions on Information Theory.1968,14(3):515-516
    [28]Laurikkala,Jormalmproving Identification of Difficult Small Classes by Balancing Class Distribution.the 8th Conference on AI in Medicine in Europe:Artificial Intelligence Medicine 2001.
    [29]Kubat,M.,Matwin,S.Addressing the Curse of Imbalanced Training Sets:One-Sided Selection.Proceedings of the Fourteenth International Conference on Machine Learning.1997.179-186
    [30]Tomek,I.Two Modifications of Cnn[J].IEEE Transactions on Systems,Man and Cybernetics.1976,6(6):769-772
    [31]Elkan,C.The Foundations of Cost-Sensitive Leaming.Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01).Washington DC.2001.973-978
    [32]Friedman,J.H.,Olshen,R.A.,Stone,C.J.,et al.Classification and Regression Trees[M].American Statistical Association;The Film House,1986
    [33]Chan,P.,Stolfo,S.Toward Scalable Learning with Non-Uniform Class and Cost Distributions:A Case Study in Credit Card Fraud Detection.the 4th International Conference on Knowledge Discovery and Data Mining.New York.USA.1998.164-168
    [34]Provost,F.,Fawcett,T.,Kohavi,R.The Case against Accuracy Estimation for Comparing Induction Algorithms.the 5th International Conference on Machine Learning.Madison,USA.1998.445-453
    [35]Domingos,P.Metacost:A General Method for Making Classifiers Cost-Sensitive.the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99).New York,USA.1999.155-164
    [36]Domingos,P.Knowledge Acquisition from Examples Via Multiple Models.the 14th International Conference on Machine Learning.Nashville,USA.1997.98-106
    [37]Bruha,I.,Kockova,S.A Support for Decision-Making:Cost-Sensitive Learning System[J].Artificial Intelligence in Medicine.1994,6(1):67-82
    [38]Tumey,P.Cost-Sensitive Learning Bibliography[J].Institute for Information Technology,National Research Council,Ottawa,Canada.2000
    [39]Xuewen,CHEN,B,Gerlach,Casasentd.Pruning Support Vectors for Imbalanced Data Classification.2005 IEEE International Joint Conference on Neural Networks.2005.1883-1888
    [40]Brefeld,Ulf,Scheffer,Tobias.Auc Maximizing Support Vector Learning;the ICML 2005Workshop on ROC Analysis in Machine Learning.Bonn,Germany.2005.
    [41]Scholkopf,B.,Platt,J.C.,Shawe-Taylor,J.,et al.Estimating the Support of a High-Dimensional Distribution[J].Neural Computation.2001,13(7):1443-1471
    [42]Manevitz,L.M.,Yousef,M.One-Class Svms for Document Classification[J].The Journal of Machine Learning Research.2002,2:139-154
    [43]Alexander,SENF,Xue-Wen,CHEN,Anne,ZHANG.Comparison of One-Class Svm and Two-Class Svm for Fold Recognition.13th International Conference on Neural Information Processing Hong Kong 2006.
    [44]Cohen,G.,Hilado,M.,Pellegrini,C.One-Class Support Vector Machines with a Conformal Kernel.A Case Study in Handling Class Imbalance[J].LECTURE NOTES IN COMPUTER SCIENCE.2004:850-858
    [45]吴广潮,陈奇刚.不平衡数据集中的组合分类算法[J].计算机工程与设计.2007,28(23):5687-5690
    [46]罗兵,余光柱.不平衡类分布下多分类问题的提升算法[J].长江大学学报(自科版).2007,4(2):50-54
    [47]Freund,Y.,Schapire,R.E.A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting[J].Journal of Computer and System Sciences.1997,55(1):119-139
    [48]Fan,W.,Stolfo,S.J.,Zhang,J.,et al.Adacost:Misclassification Cost-Sensitive Boosting.the 16 th International Conference on Machine Learning(ICML'99).1999.97-105
    [49]Joshi,M.,Kumar,V.,Agarwal,R.Evaluating Boosting Algorithms to Classify Rare Classes:Comparison and Improvements.the 1st IEEE International Conference on Data Mining.2001.257-264
    [50]Sun,Y.,Kamel,M.S.,Wong,A.K.C.,et al.Cost-Sensitive Boosting for Classification of Imbalanced Data[J].Pattern Recognition.2007,40(12):3358-3378
    [51]李跃波,王丽珍.Aucboost算法处理不平衡分类问题[J].云南大学学报.2007,29(S2):313-318
    [52]Chawla,N.V.,Lazarevic,A.,Hall,L.O,et al.Smoteboost:Improving Prediction of the Minority Class in Boosting[J].LECTURE NOTES IN COMPUTER SCIENCE.2003:107-119
    [53]林果为.诊断试验的研究与评价[J].诊断学:理论与实践.2003,2(1):附录
    [54]王勇献,王正华,张振慧.蛋白质结构预测算法的评估[J].计算机工程与科学.2005,27(8):62-64
    [55]赵凤英,王崇骏,陈世福.用于不均衡数据集的挖掘方法[J].计算机科学.2007,34(9):139-141
    [56]林智勇,郝志峰,杨晓伟.不平衡数据分类的研究现状[J].计算机应用研究.2008,25(2):332-336
    [57]Rissanen J,M.Modeling by Shortest Data Description[J].Automatica.1978,14:465-471
    [58]Carbonell.J.Introduction:Paradigms for Machine Learning[J].Artificail Intelligence.1989,40(1):1-9
    [59]杨炳儒.知识工程与知识发现[M].北京:冶金工业出版社,2000
    [60]钟茂生.基于智能agent的个性化web浏览器研究与实现[硕士学位论文].江西师范大学.2003
    [61]Mitchell,Tom M.;曾华军,张银奎等.Machine Learning[M].北京:机械工业出版社,2003
    [62]Mjolsness,E.,DeCoste,D.Machine Learning for Science:State of the Art and Future Prospects[J].science.2001,293(14):2051-2055
    [63]Waibel,A.,Hanazawa,T.,Hinton,G.,et al.Phoneme Recognition Using Time-Delay Neural Networks[J].Acoustics,Speech,and Signal Processing[see also IEEE Transactions on Signal Processing],IEEE Transactions on.1989,37(3):328-339
    [64]Burge,C.,Karlin,S.Prediction of Complete Gene Structures in Human Genomic DNA[J].Journal of Molecular Biology.1997,268(1):78-94
    [65]Shoemaker,D.D.,Schadt,E.E.,Armour,C.D.,et al.Experimental Annotation of the Human Genome Using Microarray Technology[J].Nature.2001,409:922-927
    [66]FinE,G.R.,Spellman,P.T.,Sherlock,G.,et al.Comprehensive Identification of Cell Cycle-Regulated Genes of the Yeast Saccharomyces Cerevisiae by Microarray Hybridization[J].Molecular Biology of the Cell.1998,9(12):3273-3297
    [67]Eisen,M.B.,Spellman,P..T.,Brown,P.O.,et al.Cluster Analysis and Display of Genome-Wide Expression Patterns 1998,95(25):14863-14868
    [68]Gerald.Tesauro.Temporal Difference Learning and Td-Gammon[J].Communications of the ACM.1995,38(3)
    [69]Dean.Pomerleau.Alvinn:An Autonomous Land Vehicle in a Neural Network[J].Advances in Neural Information Processing Systems 1989,1
    [70]Gerald.Tesauro.Practical Issues in Temporal Difference Learning[J].machine learning.1992,8:257-277
    [71]Gerald.Tesauro.Temporal Difference Learning and Td-Gammon[J].Communications of the ACM.1995,38(3):58-68
    [72]王珏,周志华,周傲英.机器学习及其应用[M].北京:清华大学出版社,2006
    [73]Quinlan,J.R.C4.5:Programs for Machine Learning[M].Morgan Kaufmann,1993
    [74]Langley,P.,Iba,W.,Thompson,K.An Analysis of Bayesian Classifiers.the tenth national conference on artificial intelligence.1992.223-223
    [75]张宁,贾自艳,史忠植.使用Knn算法的文本分类[J].计算机工程.2005,31(8):171-172
    [76]Tenenbaum,J.B.,Silva,V.,Langford,J.C.A Global Geometric Framework for Nonlinear Dimensionality Reduction[J].Science.2000,290(22):2319-2323
    [77]Tong,S.,Koller,D.Support Vector Machine Active Learning with Applications to Text Classification[J].The Journal of Machine Learning Research.2002,2:45-66
    [78]Shahshahani,B.M.,Landgrebe,D.A.The Effect of Unlabeled Samples in Reducing the Small Sample Sizeproblem and Mitigating the Hughes Phenomenon[J].Geoscience and Remote Sensing,IEEE Transactions on.1994,32(5):1087-1095
    [79]Sutton,R.S.,Barto,A.G.Reinforcement Learning:An Introduction[M].MIT Press,1998
    [80]Weiss,G.M.Mining with Rarity:A Unifying Framework[J].Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining.2004,6(1):7-19
    [81]Weiss,Gray M,Provost,Foster.Learning When Training Data Are Costly:The Effect of Class Distribution on Tree Induction[J].Journal of Artificial Intelligence Research.2003,19(10):315-354
    [82]Zadrozny,Bianca,Elkan,Charles.Learning and Making Decisions When Costs and Probabilities Are Both Unknown.the 7th International Conference on Knowledge Discovery and DataMining.New York,US.2001.204-213
    [83]缪志敏.基于单分类器的数据不平衡问题研究[博士论文].中国人民解放军理工大学指挥自动化学院.2008
    [84]Maloof,M.Learning When Data Sets Are Imbalanced and When Costs Are Unequal and Unknown.Working Notes of the ICML'03 Workshop on Learning from Imbalanced Data Sets.Washington,DC.2003.
    [85]杨明,尹军梅,吉根林.不平衡数据分类方法综述[J].南京师范大学学报(工程技术版). 2008,8(4):7-12
    [86]马洪超,胡光道.地学数据融合技术综述[J].地质科技情报.1999,18(1):97-101
    [87]李军,周成虎.地学数据特征分析[J].地理科学.1999,19(2):158-162
    [88]Ian.H.Witten,Eibe.Frank.Data Mining:Practical Machine Learning Tools and Techniques with Java Implementations[M].Seattle,WA:Morgan Kaufmann,2000
    [89]孙德全.数据库的负载自动识别及自管理技术研究[硕士论文].中国石油大学.2007
    [90]Duda,R.O.,Hart,P.E.,Stork,D.G.Pattern Classification and Scene Analysis[M].Wiley New York,1973
    [91]Friedman,N.,Geiger,D.,Goldszmidt,M.Bayesian Network Classifiers[J].machine learning.1997,29(2):131-163
    [92]Domingos,P.,Pazzani,M.On the Optimality of the Simple Bayesian Classifier under Zero-One Loss[J].machine learning.1997,29(2):103-130
    [93]Kononenko,I.Semi-Naive Bayesian Classifier.Proceedings of European Conference on Artificial Intelligence.1991.206-219
    [94]Langley,P.,Sage,S.,Institute For The Study Of,Learning,et al.Induction of Selective Bayesian Classifiers.Seattle,WA.1994.339-406
    [95]Christopher,J.C.Burges.A Tutorial on Support Vector Machines for Pattern Recognition [J].Data Mining and Knowledge Discovery.1998,2(2):121-167
    [96]Platt,John C.Sequential Minimal Optimization:A Fast Algorithm for Training Support Vector Machines[J].Advances in Kernel Methods-Support Vector Learning.1999:185-208
    [97]Ferreira,C.Gene Expression Programming:A New Adaptive Algorithm for Solving Problems[J].ComplexSystems.2001,13(2):87-129
    [98]Breiman,L.Bagging Predictors[J].machine learning.1996,24(2):123-140
    [99]Freund,Yoav,Schapire,Robert E.Experiments with a New Boosting Algorithm.Machine Learning:Proceedings of the Thirteenth International Conference.Italy.1996.148-156
    [100]Lendasse,A.,Wertz,V.,Verleysen,M.Model Selection with Cross-Validations and Bootstraps-Application to Time Series Prediction with Rbfn Models[J].LECTURE NOTES IN COMPUTER SCIENCE.2003:573-580
    [101]Efron,B.,Tibshirani,R.J.An Introduction to the Bootstrap[J].Monographs on Statistics and Applied Probability.1993,57:1-177
    [102]Efron,B.Estimating the Error Rate of a Prediction Rule:Improvements on Cross-Validation[J].Journal of the American Statistical Association.1983,78(382):316-331
    [103]Jain,A.K.,Dubes,R.C.,Chen,C.C.Bootstrap Techniques for Error Estimation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence.1987,9(5):628-633
    [104]孙宜贵,李周芳,职为梅等.数据挖掘分类器性能度量相关问题的研究[J].山西电子技术.2006(5):79-82
    [105]Kohavi,R.A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.Proceedings of the 14th International Joint Conference on Artificial Intelligence.Montreal,Canada.1995.1137-1145
    [106]Pepe,M.S.Receiver Operating Characteristic Methodology[J].Journal of the American Statistical Association.2000,95(449):308-311
    [107]Fawcett,T.ROC Graphs:Notes and Practical Considerations for Researchers[J].Machine Learning.2004,31
    [108]Drummond,C.,Holte,R.C.Cost Curves:An Improved Method for Visualizing Classifier Performance[J].Machine Learning.2006,65(1):95-130
    [109]Van Rijsbergen,C.J.Information Retrieval[M].London:Butterworth-Heinemann Newton,MA,USA,1979
    [110]Van Rijsbergen,C.J.Information Retrieval[M].London:Butterworths,1979
    [111]冯夏庭.智能岩石力学导论[M].北京:科学出版社,2000
    [112]王钦军,薛林福.数据挖掘技术及其在地学中的应用[J].世界地质.2000,19(3):235-239
    [113]石陆魁.非线性维数约减算法中若干关键问题的研究[博士学位论文].天津大学.2005
    [114]Donoho,D.L.High-Dimensional Data Analysis:The Curses and Blessings of Dimensionality[J].Manuscript.2000
    [115]杨质敏.高维数据的降维方法研究及其应用[J].长沙大学学报.2003,17(2):58-61
    [116]Jolliffe,I.T.Principal Component Analysis[M].New York:Springer,2002
    [117]Cox,T.F.,Cox,M.A.A.Multidimensional Scaling[M].2nd ed.Chapman & Hall/CRC Press,2001
    [118]Belhumeur,P.N.,Hespanha,J.P.,Kriegman,D.J.Eigenfaces vs. Fisherfaces:Recognition Using Class Specific Linear Projection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence.1997,19(7):711-720
    [119]Friedman,J.H.,Tukey,J.W.A Projection Pursuit Algorithm for Exploratory Data Analysis[J].IEEE Transactions on Computers.1974,C-23(9):881-890
    [120]Comon,P.Independent Component Analysis,a New Concept[J].Signal Processing.1994,36(3):287-314
    [121]Roweis,S.T.,Saul,L.K.Nonlinear Dimensionality Reduction by Locally Linear Embedding[J].Science.2000,290(5500):2323-2326
    [122]Belkin,M,Niyogi,P.Laplacian Eigenmaps for Dimensionality Reduction and Data Representation[J].Neural Computation.2003,15(6):1373-1396
    [123]任若恩,王惠文.多元统计数据分析[M].北京:国防工业出版社,1997
    [124]王立强.基于多元统计图的高维数据降维方法及应用研究[硕士学位论文].燕山大学.2006
    [125]Johnson,R.A.,Wichern,D.W.著;陆璇译.实用多元统计分析[M].北京:清华大学出版社,2001
    [126]Weng,S.,Zhang,C.,Lin,Z.Exploring the Structure of Supervised Data by Discriminant Isometric Mapping[J].Pattern Recognition.2005,38(4):599-601
    [127]Balasubramanian,M.,Schwartz,E.L.,Tenenbaum,J.B.,et al.The Isomap Algorithm and Topological Stability[J].Science.2002,295(5552):7
    [128]冯夏庭,王泳嘉,卢世宗.边坡稳定性的神经网络估计[J].工程地质学报.1995,3(4):54-61
    [129]刘沐宇,冯夏庭.基于神经网络范例推理的边坡稳定性评价方法[J].岩土力学.2005,26(2):193-197
    [130]郭瑞清,木合塔尔·扎日,刘新喜.基于自适应神经元模糊推理系统的岩质边坡稳定性评价方法[J].岩石力学与工程学报.2006,25(S1):2785-2789
    [131]罗战友,杨晓军,龚晓南.基于支持向量机的边坡稳定性预测模型[J].岩石力学与工程学报.2005,24(1):144-148
    [132]余志雄,周创兵,李俊平等.基于v-SVR算法的边坡稳定性预测[J].岩石力学与工程学报.2005,24(14):3468-3475
    [133]赵胜利,吴雅琴,刘燕等.基于SOM-BP复合神经网络的边坡稳定性分析[J].河北农业大学学报.2007,30(3):105-108
    [134]薛新华,姚晓东.边坡稳定性预测的模糊神经网络模型[J].工程地质学报.2007,15(1):77-82
    [135]谷琼,蔡之华,朱莉等.一种基于PCA的GEP算法及在采煤工作面瓦斯涌出量预测中的应用[J].应用基础与工程科学学报.2007,15(4):569-577
    [136]卫海英.SPSS 10.0 for Windows在经济管理中的应用[M].北京:中国统计出版社,2001
    [137]http://www.gepsoft.com
    [138]Yamaoka,K.,Nakagawa,T.,Uno,T.Application of Akaike's Information Criterion (AIC) in the Evaluation of Linear Pharmacokinetic Equations[J].Journal of Pharmacokinetics and Biopharmaceutics.1978,6(2):165-175
    [139]Lama,R.D.,Bodziony,J.Management of Outburst in Underground Coal Mines[J].International Journal of Coal Geology.1998,35(1-4):83-115
    [140]伍爱友,肖红飞,王从陆等.煤与瓦斯突出控制因素加权灰色关联模型的建立与应用[J].煤炭学报.2005,30(1):58-62
    [141]郝吉生,倪小明.基于Windows平台的采掘工作面煤与瓦斯突出预测专家系统[J].煤炭学报.2005,30(2):141-145
    [142]南存全,冯夏庭.基于SVM的煤与瓦斯突出区域预测研究[J].岩石力学与工程学报.2005,24(2):263-267
    [143]郭德勇,范金志,马世志等.煤与瓦斯突出预测层次分析-模糊综合评判方法[J].北京科技大学学报.2007,29(7):660-664
    [144]张子戌,刘高峰,吕闰生等.基于模糊模式识别的煤与瓦斯突出区域预测[J].煤炭学报.2007,32(6):592-595
    [145]孙燕,杨胜强,王彬等.用灰关联分析和神经网络方法预测煤与瓦斯突出[J].中国安全生产科学技术.2008,4(3):14-17
    [146]张学工.关于统计学习理论与支持向量机[J].自动化学报.2000,26(1):32-42
    [147]郭德勇,李念友,裴大文等.煤与瓦斯突出预测灰色理论-神经网络方法[J].北京科技大学 学报.2007,29(4):354-357
    [148]吴强.基于神经网络的煤与瓦斯突出预测模型[J].中国安全科学学报.2001,11(4):69-72
    [149]Laurikkala,J.Improving Identification of Difficult Small Classes by Balancing Class Distribution.Proceedings of the 8th Conference on Artificial Intelligence in Medicine.2001
    [150]Han,H.,Wang,W.,Mao,B.Borderline-SMOTE:A New Over-Sampling Method in Imbalanced Data Sets Learning[J].Lecture Notes in Computer Science.2005,3644(1):878-887
    [151]杨智明,乔立岩,彭喜元.基于改进SMOTE的不平衡数据挖掘方法研究[J].电子学报.2007,35(B12):22-26
    [152]Laurikkala,J.Improving Identification of Difficult Small Classes by Balancing Class Distribution[J].Artificial Intelligence in Medicine.2001:63-66
    [153]Wilson,D.L.Asymptotic Properties of Nearest Neighbor Rules Using Edited Data[J].IEEE Transactions on Systems,Man,and Cybernetics.1972,2(3):408-421
    [154]Stanfill,C.,Waltz,D.Toward Memory-Based Reasoning[J].Communications of the ACM.1986,29(12):1213-1228
    [155]http://archive.ics.uci.edu/ml/datasets.html
    [156]张智星.MATLAB程序设计与应用[M].北京:清华大学出版社,2002
    [157]张航,黄攀.精通MATLAB 6[M].北京:清华大学出版社,2002
    [158]http://waldron.stanford.edu/~isomap/
    [159]葛启发,冯夏庭.基于AdaBoost组合学习方法的岩爆分类预测研究[J].岩土力学.2008,29(4):943-948
    [160]谭以安.模糊数学综合评判在地下洞室岩爆预测中的应用.第二届中国岩石力学与工程学会.1989.247-253
    [161]王元汉,李启光.岩爆预测的模糊数学综合评判方法[J].岩石力学与工程学报.1998,17(5):493-501
    [162]谢和平.岩爆的分形特征和机理[J].岩石力学与工程学报.1993,12(1):28-37
    [163]杨莹春,诸静.物元模型及其在岩爆分级预报中的应用[J].系统工程理论与实践.2001,21(8):125-129
    [164]邱道宏.括苍山高速公路隧道岩爆非线性预测研究[博士学位论文].吉林大学.2008
    [165]白明洲,王连俊.岩爆危险性预测的神经网络模型及应用研究[J].中国安全科学学报.2002,12(4):65-69
    [166]陈海军,聂德新.岩爆预测的人工神经网络模型[J].岩土工程学报.2002,24(2):229-232
    [167]唐宝庆,曹平.回归分析法在建立岩爆数学模型上的应用[J].数学理论及应用.2003,23(2):37-42
    [168]姜彤,黄志全,赵彦彦等.灰色系统最优归类模型在岩爆预测中的应用[J].华北水利水电学院学报.2003,24(2):37-40
    [169]冯夏庭,赵洪波.岩爆预测的支持向量机[J].东北大学学报:自然科学版.2002,23(1):57-59
    [170]赵洪波.岩爆分类的支持向量机方法[J].岩土力学.2005,26(4):642-644
    [171]孙海涛,丁德馨,杨挺拔.基于非线性混沌理论的岩爆预测方法展望[J].西部探矿工程.2005,17(11):9-10
    [172]王湘锋.锦屏二级水电站深埋特长引水隧洞岩爆模拟及预测研究[硕士学位论文].成都理工大学.2006
    [173]Habbema,J.D.F.,Hermans,J.,Van Der Burgt,A.T.Cases of Doubt in Allocation Problems[J].Biometrika.1974,61(2):313-324
    [174]Turney,P.Types of Cost in Inductive Concept Learning[J].Computing Research Repository.2002:15-21
    [175]Ciraco,M.,Rogalewski,M.,Weiss,G.Improving Classifier Utility by Altering the Misclassification Cost Ratio.the 1st International Workshop on Utility-based Data Mining.New York.2005.46-52
    [176]Ting,K.M.An Instance-Weighting Method to Induce Cost-Sensitive Trees[J].IEEE Transactions on Knowledge and Data Engineering.2002,14(3):659-665
    [177]Ting,K.M.A Comparative Study of Cost-Sensitive Boosting Algorithms.the 17th International Conference on Machine Learning.Stanford,USA.2000.983-990
    [178]Schapire,R.E.,Singer,Y.Improved Boosting Algorithms Using Confidence-Rated Predictions[J].Machine Learning.1999,37(3):297-336
    [179]Guo,H.,Viktor,H.L.Learning from Imbalanced Data Sets with Boosting and Data Generation:The DataBoost-IM Approach[J].ACM SIGKDD Explorations Newsletter.2004,6(1):30-39