Data mining via mathematical programming and machine learning.

Details

Author: Musicant, David R.
Degree: Doctor
Year: 2000
Advisor: Mangasarian, Olvi L.
Institution: The University of Wisconsin
Subject: Computer Science; Artificial Intelligence
ISBN: 0599901535
CBH: 9983793
Country: USA
Language: English
FileSize: 4869317
Pages: 151

Abstract

This work explores solving large-scale data mining problems through the use of mathematical programming methods. In particular, algorithms are proposed for the support vector machine (SVM) classification problem, which consists of constructing a separating surface that discriminates between points belonging to one of two classes. An algorithm based on successive overrelaxation (SOR) is presented which can process very large datasets that need not reside in memory. Concepts from generalized SVMs are combined with SOR and with linear programming to find nonlinear separating surfaces. An “active set” strategy is used to generate a fast algorithm that consists of solving a finite number of linear equations of the order of the dimensionality of the original input space at each step. This active set algorithm (ASVM) requires no specialized quadratic or linear programming code, only a publicly available linear equation solver. An implicit Lagrangian for the dual of an SVM leads to the simple, linearly convergent Lagrangian SVM (LSVM) algorithm. LSVM requires the inversion at the outset of a single (typically small) matrix, and the full algorithm is given in 11 lines of MATLAB code.

Support vector regression problems are considered as well. The problem of tolerant data fitting by a nonlinear surface is formulated as a linear program with fewer variables than other linear programming formulations. A generalization of the linear programming chunking algorithm for arbitrary kernels is implemented wherein chunking is performed on both data points and problem variables. The robust Huber M-estimator, a differentiable cost function that is quadratic for small errors and linear otherwise, is modeled exactly in the original primal space of the problem by an easily solvable convex quadratic program for both linear and nonlinear support vector estimators.

Experiments show that the above classification and regression techniques perform strongly in accuracy, speed, and scalability on both real-world and synthetic datasets. In some cases, datasets on the order of millions of points were utilized. These results indicate that SVMs, typically used on smaller datasets, can be used to solve massive data mining problems.
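
The abstract describes LSVM only in outline, and the thesis's 11-line MATLAB listing is not part of this record. As a rough illustration of what such an iteration can look like, here is a minimal NumPy sketch of a linear LSVM in the form we understand it to take in the published literature: the dual variable u is updated by u ← Q⁻¹(e + ((Qu − e) − αu)₊), with Q = I/ν + HH′ and H = D[A −e], and the Sherman-Morrison-Woodbury identity reduces the work to inverting one (n+1) × (n+1) matrix at the outset. The function name `lsvm`, its parameter names and defaults, the step choice α = 1.9/ν, and the toy data are all assumptions made for this example, not taken from the thesis.

```python
import numpy as np


def lsvm(A, d, nu=1.0, itmax=100, tol=1e-5):
    """Sketch of a linear Lagrangian SVM (LSVM) iteration.

    A  : (m, n) array of data points, one per row
    d  : (m,) array of labels in {+1, -1}
    nu : positive weight on the classification error
    Returns (w, gamma, u); a point x is classified by sign(x @ w - gamma).
    """
    m, n = A.shape
    e = np.ones(m)
    # H = D [A  -e], where D is the diagonal matrix of labels.
    H = d[:, None] * np.hstack([A, -np.ones((m, 1))])
    alpha = 1.9 / nu  # assumed step size; must satisfy 0 < alpha < 2 / nu

    # Q = I/nu + H H' is m x m, but by Sherman-Morrison-Woodbury
    #   Q^{-1} z = nu * (z - H M^{-1} H' z),  with M = I/nu + H' H,
    # so only the small (n+1) x (n+1) matrix M is inverted, once.
    M_inv = np.linalg.inv(np.eye(n + 1) / nu + H.T @ H)

    def q_inv(z):
        return nu * (z - H @ (M_inv @ (H.T @ z)))

    u = q_inv(e)
    for _ in range(itmax):
        Qu = u / nu + H @ (H.T @ u)  # Q @ u without forming Q
        u_new = q_inv(e + np.maximum(Qu - e - alpha * u, 0.0))
        converged = np.linalg.norm(u_new - u) <= tol
        u = u_new
        if converged:
            break

    w = A.T @ (d * u)       # normal to the separating plane
    gamma = -np.sum(d * u)  # plane offset
    return w, gamma, u


# Hypothetical toy data: two well-separated clusters in the plane.
A = np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, gamma, u = lsvm(A, d)
predictions = np.sign(A @ w - gamma)
```

Because the m × m matrix Q is never formed, memory use in this sketch grows linearly with the number of points, which is in keeping with the abstract's emphasis on very large datasets.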
