Research on Content-Based Image Spam Filtering Techniques
Abstract
In recent years, spam filters that work on message content and combine machine learning, text classification, and information filtering have been widely deployed, but these methods have limitations. In particular, they cannot handle spam whose content is delivered as an image. As the volume of image spam keeps growing, identifying and filtering it has become an urgent problem for the IT industry and for mail service providers.
     This paper reviews the state of research on spam filtering, including the definition of spam, the harm it causes, and the strengths and weaknesses of the current mainstream filtering techniques. It then discusses image feature extraction, the key problem in filtering spam images, and systematically examines visual features such as color, texture, and shape together with the methods used to extract them.
     To address the particular characteristics of image spam, the paper analyzes spam sending behavior and message content and exploits the fact that spam is sent in bulk, repeatedly, and with highly similar content. On that basis it proposes a spam image filtering method based on image similarity detection. The method compares each mail image with a set of known spam image samples: it extracts low-level visual features of the image (color, texture, and shape), computes the similarity between the new image and the spam samples from the combined features, and uses that similarity to decide whether the image is a spam image. Related issues and key techniques, including image similarity measurement and feature normalization, are also discussed.
     Experimental results show that the similarity-based method filters spam images effectively. The work is a useful exploration of image spam filtering and offers theoretical support for designing better anti-spam solutions, so it has both theoretical and practical value.
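The similarity test described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the feature definitions, the similarity measure, or the decision rule, so everything concrete here is an assumption. The sketch uses a quantized RGB histogram compared by histogram intersection for color, a crude gray-level roughness value as a texture stand-in, and omits shape features entirely. Scaling each per-feature similarity to [0, 1] before weighting is a simple stand-in for feature normalization. The function names, the weights 0.7 and 0.3, and the threshold 0.9 are all hypothetical.

```python
# Illustrative sketch only: the feature definitions, weights, and threshold
# below are assumptions, not the ones used in the paper.

def color_histogram(image, bins=4):
    """Quantized RGB histogram, normalized so its entries sum to 1."""
    hist = [0.0] * (bins ** 3)
    total = 0
    for row in image:
        for r, g, b in row:
            # Map each 0..255 channel value to a bin index in 0..bins-1.
            rb, gb, bb = r * bins // 256, g * bins // 256, b * bins // 256
            hist[(rb * bins + gb) * bins + bb] += 1
            total += 1
    return [h / total for h in hist] if total else hist


def texture_roughness(image):
    """Crude texture stand-in: mean gray-level change between
    horizontally adjacent pixels, scaled to [0, 1]."""
    diff_sum, pairs = 0, 0
    for row in image:
        grays = [(r + g + b) // 3 for r, g, b in row]
        for left, right in zip(grays, grays[1:]):
            diff_sum += abs(right - left)
            pairs += 1
    return diff_sum / (pairs * 255) if pairs else 0.0


def extract_features(image):
    """Collect the per-image features used for comparison."""
    return {"color": color_histogram(image), "texture": texture_roughness(image)}


def combined_similarity(fa, fb, w_color=0.7, w_texture=0.3):
    """Weighted sum of per-feature similarities, each already in [0, 1]."""
    # Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones.
    color_sim = sum(min(a, b) for a, b in zip(fa["color"], fb["color"]))
    texture_sim = 1.0 - abs(fa["texture"] - fb["texture"])
    return w_color * color_sim + w_texture * texture_sim


def is_spam_image(image, spam_samples, threshold=0.9):
    """Flag the image if it is similar enough to any known spam sample."""
    features = extract_features(image)
    return any(
        combined_similarity(features, extract_features(sample)) >= threshold
        for sample in spam_samples
    )


# Tiny synthetic images: rows of (r, g, b) tuples.
spam_sample = [[(255, 0, 0)] * 4 for _ in range(4)]
near_duplicate = [row[:] for row in spam_sample]
near_duplicate[0][0] = (250, 5, 5)  # slightly perturbed copy of the sample
unrelated = [[(0, 0, 255)] * 4 for _ in range(4)]

flag_near_duplicate = is_spam_image(near_duplicate, [spam_sample])
flag_unrelated = is_spam_image(unrelated, [spam_sample])
```

In this toy setup the perturbed copy of the sample is flagged and the unrelated image is not. A real filter would precompute the sample features once instead of re-extracting them for every incoming image, and would need weights and a threshold tuned on labeled data.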