Data mining via mathematical programming and machine learning.
Details
  • Author: Musicant, David R.
  • Degree: Doctoral
  • Year: 2000
  • Advisor: Mangasarian, Olvi L.
  • Institution: The University of Wisconsin
  • Subject: Computer Science; Artificial Intelligence
  • ISBN: 0599901535
  • CBH: 9983793
  • Country: USA
  • Language: English
  • FileSize: 4869317
  • Pages: 151
Abstract
This work explores solving large-scale data mining problems through the use of mathematical programming methods. In particular, algorithms are proposed for the support vector machine (SVM) classification problem, which consists of constructing a separating surface that can discriminate between points from one of two classes. An algorithm based on successive overrelaxation (SOR) is presented which can process very large datasets that need not reside in memory. Concepts from generalized SVMs are combined with SOR and with linear programming to find nonlinear separating surfaces. An “active set” strategy is used to generate a fast algorithm that consists of solving a finite number of linear equations of the order of the dimensionality of the original input space at each step. This ASVM active set algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. An implicit Lagrangian for the dual of an SVM is used to lead to the simple linearly convergent Lagrangian SVM (LSVM) algorithm. LSVM requires the inversion at the outset of a single (typically small) matrix, and the full algorithm is given in 11 lines of MATLAB code.
Support vector regression problems are considered as well. The problem of tolerant data fitting by a nonlinear surface is formulated as a linear program with fewer variables than that of other linear programming formulations. A generalization of the linear programming chunking algorithm for arbitrary kernels is implemented wherein chunking is performed on both data points and problem variables. The robust Huber M-estimator, a differentiable cost function that is quadratic for small errors and linear otherwise, is modeled exactly in the original primal space of the problem by an easily solvable convex quadratic program for both linear and nonlinear support vector estimators.
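To make the description of the Huber cost concrete, the following is a minimal Python sketch of the textbook Huber function: quadratic for residuals no larger than a threshold, linear beyond it. The function name `huber_loss` and the threshold parameter `delta` are illustrative assumptions and do not come from the thesis. The sketch only evaluates the cost; it does not reproduce the quadratic programming formulation mentioned in the abstract.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Textbook Huber cost; `delta` is a hypothetical threshold name."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2                # used where |residual| <= delta
    linear = delta * (r - 0.5 * delta)      # used where |residual| > delta
    # The two branches agree in value and slope at |residual| == delta,
    # which is what makes the cost differentiable.
    return np.where(r <= delta, quadratic, linear)

# Small residuals are penalized quadratically, large ones only linearly.
small = huber_loss(0.5)
large = huber_loss(3.0)
```

Because the penalty grows only linearly for large residuals, a few outliers influence the fit less than they would under a pure squared-error cost.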
Experiments show that the above classification and regression techniques perform strongly in accuracy, speed, and scalability on both real-world and synthetic datasets. In some cases, datasets on the order of millions of points were used. These results indicate that SVMs, typically used on smaller datasets, can be used to solve massive data mining problems.
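As an illustration of the LSVM idea summarized in the abstract (a single small matrix inverted at the outset, followed by a simple iteration), here is a hedged Python sketch of the linear-kernel case. It is not the thesis's 11-line MATLAB code. It assumes the commonly published LSVM formulation: the dual problem min ½uᵀQu − eᵀu over u ≥ 0 with Q = I/ν + HHᵀ and H = D[A −e], solved by the iteration u ← Q⁻¹(e + ((Qu − e) − αu)₊) with 0 < α < 2/ν. The function name `lsvm_fit`, the default parameter values, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def lsvm_fit(A, d, nu=1.0, max_iter=1000, tol=1e-6):
    """Sketch of an LSVM-style iteration; classifier is sign(x @ w - gamma)."""
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)          # labels, each +1 or -1
    m, n = A.shape
    e = np.ones(m)
    # H = D [A, -e], where D is the diagonal matrix of labels.
    H = d[:, None] * np.hstack([A, -e[:, None]])
    alpha = 1.9 / nu                        # assumed step, kept below 2 / nu
    # The only matrix inverted is (n+1) x (n+1), computed once at the outset.
    S = H @ np.linalg.inv(np.eye(n + 1) / nu + H.T @ H)

    def apply_q_inverse(z):
        # Sherman-Morrison-Woodbury identity for Q = I / nu + H H^T.
        return nu * (z - S @ (H.T @ z))

    u = apply_q_inverse(e)
    for _ in range(max_iter):
        q_u = u / nu + H @ (H.T @ u)        # Q u without forming the m x m Q
        z = e + np.maximum(q_u - e - alpha * u, 0.0)
        u_next = apply_q_inverse(z)
        converged = np.linalg.norm(u_next - u) <= tol
        u = u_next
        if converged:
            break
    w = A.T @ (d * u)
    gamma = -np.sum(d * u)
    return w, gamma

# Toy, linearly separable data (made up for this example).
A = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, gamma = lsvm_fit(A, d)
predictions = np.sign(A @ w - gamma)
```

The sketch never forms the m × m matrix Q explicitly; each iteration uses only matrix-vector products with H, which is why the per-iteration cost stays linear in the number of data points.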
