Sluicebox: Semi-supervised learning for label prediction with concept evolution and tracking in non-stationary data streams
Abstract
The past few decades of statistical or machine learning and data mining research have produced a significant repertoire of algorithms that, when provided sufficient training data, can predict classification labels of newly observed data points. Recently, however, a new challenge has emerged: data arriving through continuous streams instead of the traditional static data sets. Machine learning or data mining in a streaming context encumbers algorithms that rely on assumptions found in the static data context. One of the most recognized challenges is the intractably large volume, attribute variety, and fast velocity of a data stream, which inhibits any algorithm that requires more than one pass over the data. In addition, as the stream progresses, features may be added, removed, or change in their range of possible values, which is known as feature evolution. The defining concepts for a label or class may also migrate over the span of the data stream. This variation in concept definitions is caused by the evolution of new concept classes and by underlying feature evolution. Novel classes that were not known a priori can also appear amid the data stream. Traditional algorithms often characterize unknown labels as errors and outliers, but in the dynamic streaming domain, a sufficiently dense cluster of outliers must be analyzed to discover emergent class concepts, and the concepts should be tracked as they are discovered. The SluiceBox method described in this dissertation aims to adapt to novel classes, feature evolution, and concept drift while predicting data instance labels and adhering to the constraints of continuous data streams. The research presented here details the challenges found in data stream mining with concept drift and evolution, and explores the theoretical requirements for detecting emerging novel classes.
It also presents a framework for data stream experimentation, built as an extension of the University of Waikato's Massive Online Analysis (MOA) framework, in which the theoretical observations are tested. Variations of the SluiceBox method are compared against other leading approaches using traditional synthetic data sets, a new benchmark framework, and real-world data sets to analyze their comparative accuracy and efficiency.
