Study of Data Stream Anomaly Detection System (数据流异常检测系统若干问题研究)


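English title: Study of Data Stream Anomaly Detection System
Author: 李人和 (Li Renhe)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 数据流; 数据质量; 数据流清洗; 数据流异常检测系统
English keywords: data stream; data quality; data stream cleansing; data stream anomaly detection system
Degree year: 2008
Advisor: 周傲英 (Zhou Aoying)
Discipline code: 081202
Degree-granting institution: Fudan University
Thesis submission date: 2008-04-25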

Abstract

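In recent years, with the continuing development of network technology, data streams, as a novel form of data transmission, have found increasingly wide use in daily life and have driven the shift from the traditional Database Management System (DBMS) to the Data Stream Management System (DSMS). Compared with static data, data streams are real-time, continuous, and unbounded, so the way they are analyzed differs considerably from existing data processing approaches. Because data stream analysis and processing techniques are applied widely across all sectors of the national economy, research on data stream problems has gradually become a focus of database research.
     In its daily operations and management, Shanghai Telecom needs to analyze traffic data from network ports at different levels so that anomalies in the system can be monitored and handled in real time and service quality can be improved. To meet this need, we propose a novel system, RealMon, which monitors traffic anomalies in the telecom network in real time. While designing and implementing RealMon, we found that the traffic data of different links are correlated with one another and that these data suffer from fairly serious data quality problems. To address these issues, in this thesis we first propose a method that finds data stream anomalies by analyzing changes in the correlations among a group of data streams, and, for the data quality problems, we design a model for cleaning data streams in real time. On this basis we design and implement RealMon, a system for real-time analysis and anomaly detection of telecom network traffic data. The system incorporates several mature data stream analysis algorithms and is highly practical. We verify its effectiveness through experiments in simulated and real environments.
     The contributions and innovations of this thesis are summarized as follows: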
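     1. Correlations among different data streams are widespread in network traffic analysis systems and securities trading systems. This thesis proposes a method that finds anomalies by analyzing changes in the correlations among different data streams. The method first transforms numeric data streams using Piecewise Aggregate Approximation, then uses an improved edit distance to measure how strongly the streams are correlated, and finally detects anomalies across multiple data streams by applying a threshold to the change in edit distance. Our experiments show that the algorithm is stable and efficient and detects anomalies in data streams accurately.
     2. We summarize the data quality problems common in data streams and, on that basis, propose a data stream cleaning framework in which we define the basic steps of data stream cleaning. The framework is readily extensible: modules can easily be updated within it to address new data quality problems. We also design methods that handle missing data and regularize the time attribute of the data in real time, and we verify their effectiveness through a series of experiments.
     3. We design and implement a novel data stream anomaly detection system, RealMon, which accurately detects anomalies in SNMP traffic data in telecom networks. Because SNMP data have many data quality problems, the design applies data stream cleaning techniques to clean the traffic data in real time. The system combines a data stream anomaly detection module with a data stream cleaning module, which is a first among existing research work, and it has considerable practical value. The system now successfully monitors anomalies in SNMP traffic data in real time in a simulated environment.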
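     In summary, this thesis studies anomaly detection in data streams. It proposes a data stream cleaning model and a method that finds anomalies across multiple data streams by analyzing the correlations among them, and, building on these results, designs and implements RealMon, a system for real-time analysis and anomaly detection of telecom traffic data. The results have both theoretical and practical value.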
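The real-time handling of missing data and irregular time attributes named in contribution 2 can be pictured with a small sketch. The abstract does not say how the thesis fills gaps or aligns timestamps, so everything below is an assumption made for illustration: the function name `clean_stream`, snapping each reading to the nearest point of a fixed polling grid, and filling empty grid points by linear interpolation once the next reading arrives.

```python
from typing import Iterable, Iterator, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def clean_stream(samples: Iterable[Sample], step: float) -> Iterator[Sample]:
    """Yield (grid_time, value) pairs on a regular grid of width `step`.

    Hypothetical cleaning rules (not taken from the thesis):
    - each raw timestamp is snapped to the nearest grid point;
    - a reading that lands on an already emitted grid point is dropped;
    - grid points with no reading are filled by linear interpolation
      between the readings on either side of the gap.
    """
    prev_slot = None
    prev_value = 0.0
    for ts, value in samples:
        slot = round(ts / step)
        if prev_slot is not None and slot <= prev_slot:
            continue  # duplicate or late reading
        if prev_slot is not None:
            gap = slot - prev_slot
            for k in range(1, gap):
                # Fill each missing grid point between the two readings.
                filled = prev_value + (k / gap) * (value - prev_value)
                yield ((prev_slot + k) * step, filled)
        yield (slot * step, value)
        prev_slot, prev_value = slot, value


if __name__ == "__main__":
    raw = [(0.4, 10.0), (59.0, 20.0), (62.0, 99.0), (181.5, 50.0)]
    for point in clean_stream(raw, step=60.0):
        print(point)
```

Because a gap can only be filled after the reading that ends it has arrived, this sketch emits filled points with a delay of at least one polling interval. A module that must answer immediately would need a forecasting rule instead.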
With the rapid development of information technology, the data stream, a novel form of data, has come into wide use in daily life. Traditional databases have long been used to store persistent data and to query those data offline. Over the past few years, however, a growing number of applications have begun to produce data in the form of sequences, and the online monitoring and analysis of data streams has attracted increasing attention in database research.
     Today, most ISPs face the challenge of managing huge amounts of network traffic data. In a telecom network, gathering and analyzing SNMP traffic data is one of the most important ways for administrators to manage network performance and to find and solve network problems. To meet this requirement, we present RealMon, a real-time stream monitoring system that aims to find anomalies among thousands of network links. While designing and implementing the system, we found that the data streams from the telecom network are correlated with one another and that the SNMP data contain many data quality problems. Therefore, in this thesis, we first put forward an algorithm that detects outliers based on changes in the correlation between streams, and then present a novel framework for cleaning data in real time. Building on these results, we demonstrate RealMon, which analyzes online the SNMP data gathered from heavily loaded routers. The major contributions of this thesis are:
     1. A novel algorithm is proposed that detects anomalies by continually monitoring changes in the correlation between streams. It uses Piecewise Aggregate Approximation to transform the raw data into character sequences and finds anomalies by calculating the edit distance between different streams. Extensive experiments verify the efficiency of the algorithm.
     2. An extensible data stream cleaning framework is designed after a survey of common data stream quality problems. The framework gains its extensibility from a modular design in which separate modules address different problems. Some typical data cleaning algorithms are also implemented in the framework.
     3. A data stream monitoring system, RealMon, is implemented to detect anomalies among thousands of network links. Several well-known data stream analysis algorithms are implemented in the system to monitor the large volume of SNMP (Simple Network Management Protocol) messages collected from routers in the telecom backbone network. Data cleansing algorithms are also integrated into the system to address the data quality problems in the SNMP messages. Experiments show that the system performs efficiently in a simulated environment.
     We believe this work is a good example of integrating theory with practice: it provides key solutions for anomaly detection and data cleansing, and it also implements a novel system that detects anomalies among thousands of network links. The results are of both theoretical and practical value for data stream research.
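The pipeline in contribution 1 (Piecewise Aggregate Approximation, conversion to characters, edit distance between streams, and a check on how that distance changes) can be sketched as follows. This is only an illustration under stated assumptions. The abstract does not give the thesis's character mapping or its improved edit distance, so the sketch substitutes a three-letter alphabet with SAX-style cut points at ±0.43, plain Levenshtein distance, and non-overlapping windows. All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev
from typing import List, Sequence

BREAKPOINTS = (-0.43, 0.43)  # assumed cut points for a 3-letter alphabet
ALPHABET = "abc"


def paa(values: Sequence[float], segments: int) -> List[float]:
    """Piecewise aggregate approximation: the mean of each equal-width
    segment (any tail that does not fill a segment is ignored)."""
    size = len(values) // segments
    return [mean(values[i * size:(i + 1) * size]) for i in range(segments)]


def to_symbols(window: Sequence[float], segments: int) -> str:
    """Z-normalize a window, reduce it with PAA, map each mean to a letter."""
    mu, sigma = mean(window), pstdev(window)
    normed = [(v - mu) / sigma if sigma > 0 else 0.0 for v in window]
    return "".join(ALPHABET[bisect_right(BREAKPOINTS, m)]
                   for m in paa(normed, segments))


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the thesis's improved one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correlation_changes(s1: Sequence[float], s2: Sequence[float],
                        window: int, segments: int, threshold: int) -> List[int]:
    """Return the indices of windows in which the edit distance between the
    two streams rose by more than `threshold` over the previous window."""
    flagged: List[int] = []
    last = None
    for w in range(min(len(s1), len(s2)) // window):
        lo, hi = w * window, (w + 1) * window
        d = edit_distance(to_symbols(s1[lo:hi], segments),
                          to_symbols(s2[lo:hi], segments))
        if last is not None and d - last > threshold:
            flagged.append(w)
        last = d
    return flagged


if __name__ == "__main__":
    pattern = [1, 1, 5, 5, 9, 9, 5, 5]
    link_a = pattern * 3
    # link_b tracks link_a at 10x scale, then moves the opposite way.
    link_b = [10 * v for v in pattern] * 2 + [90, 90, 50, 50, 10, 10, 50, 50]
    print(correlation_changes(link_a, link_b, window=8, segments=4, threshold=1))
```

Z-normalizing each window before the PAA step means two links carrying very different traffic volumes still produce the same string when they rise and fall together, so the edit distance reflects shape rather than scale. That is a design choice of this sketch and not a claim about the thesis.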

