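Study on Image Annotation Based on Web Training Data

Original title (Chinese): 基于互联网数据集的图像标注技术研究
Author: Jia Jimin (荚济民)
Degree level: Doctoral
Discipline: Signal and Information Processing
Keywords: image annotation ; web image annotation ; image annotation refinement ; personalized annotation recommendation
Degree year: 2009
Advisor: Yu Nenghai (俞能海)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China (中国科学技术大学)
Submission date: 2009-05-02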

Abstract

With the spread of digital devices and the rapid development of the Internet, Web image resources have become increasingly abundant. Because Web data is diverse, complex and irregular, finding the images a user wants quickly and accurately in such a massive collection is a very challenging task. One important way to address this problem is automatic image annotation, which builds a link between the low-level visual content of an image and its high-level semantics so that images can be indexed by their annotation words. In recent years the rise of photo sharing communities such as Flickr has given image annotation new life in the Web 2.0 environment. Automatic image annotation is also widely used in personal album management, medical image retrieval, trademark retrieval and face recognition.
     Because the number of images is so large, manual annotation is too costly to meet practical needs. In terms of the training data used, automatic image annotation has gone through two stages. The first is annotation on limited datasets, which applies conventional machine learning and object recognition techniques to link low-level features with high-level semantics; examples include classifier-based methods, cross-media relevance methods, translation-model methods and latent-variable generative models. The second is annotation based on Web datasets, which focuses more on the annotation framework and its efficiency, exploits the abundant resources of the Internet and greatly enlarges the training set. The second approach better fits the practical needs of annotation in the Web environment and has been an active research topic in recent years.
     This dissertation studies several key problems in image annotation based on Web datasets. Its main results and contributions are as follows:
     We discuss the importance of building an annotation dictionary from Web resources, study how to select a suitable set of annotation words from the vast vocabulary of the Web, and analyze the conditions that words in the dictionary should satisfy. Based on the statistical properties of words in photo sharing communities, we propose a random walk method for modeling word importance: the importance of a word is measured from users' annotation history and from the relationships among words, and the dictionary is built by ranking words by importance. In addition, using the rich semantic resources that photo sharing communities provide for annotation words, we disambiguate the initial keywords of Web images. By finding a suitable semantic class in the sharing community for the image to be annotated, the effect of the semantic gap is reduced and the learned annotations become more semantically consistent.
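     The random walk idea can be illustrated with a minimal sketch. It is not the dissertation's model: it assumes a PageRank-style walk over a tag co-occurrence graph, with the restart distribution taken from how often users applied each word. The function names, the damping value and the toy counts are all hypothetical.

```python
def word_importance(cooccur, usage, damping=0.85, iters=200, tol=1e-12):
    """Score words by a PageRank-style random walk (illustrative assumption).

    cooccur: {word: {other_word: co-occurrence count}}
    usage:   {word: number of times users applied the word}
    """
    words = sorted(usage)
    total = float(sum(usage.values()))
    # Restart distribution: words that users apply more often get more mass.
    prior = {w: usage[w] / total for w in words}
    score = dict(prior)
    for _ in range(iters):
        nxt = {w: (1.0 - damping) * prior[w] for w in words}
        for w in words:
            # Keep only neighbours that are themselves candidate words.
            nbrs = {v: c for v, c in cooccur.get(w, {}).items() if v in prior}
            out = float(sum(nbrs.values()))
            if out == 0.0:
                # A word with no neighbours hands its mass back to the prior.
                for v in words:
                    nxt[v] += damping * score[w] * prior[v]
            else:
                for v, c in nbrs.items():
                    nxt[v] += damping * score[w] * c / out
        change = sum(abs(nxt[w] - score[w]) for w in words)
        score = nxt
        if change < tol:
            break
    return score


def build_dictionary(cooccur, usage, size):
    """Return the `size` most important words, ties broken alphabetically."""
    score = word_importance(cooccur, usage)
    return sorted(score, key=lambda w: (-score[w], w))[:size]


# Hypothetical toy statistics, not taken from any real photo sharing site.
toy_usage = {"beach": 50, "sea": 40, "sand": 30, "img_0001": 5}
toy_cooccur = {
    "beach": {"sea": 20, "sand": 15},
    "sea": {"beach": 20, "sand": 10},
    "sand": {"beach": 15, "sea": 10},
    "img_0001": {},
}
print(build_dictionary(toy_cooccur, toy_usage, size=3))
```

     In this sketch a word that is used but never co-occurs with other words, such as a camera-generated file name, receives no mass from its neighbours and therefore sinks toward the bottom of the ranking.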
     We propose an image annotation framework based on multimodal mutual reinforcement. Given a single image, initial annotations are first obtained with a basic annotation model such as CMRM or CRM. Then, within a random walk annotation refinement framework, the annotation word correlation graph and the image content correlation graph reinforce each other, and the correlations at the steady state are used for refinement. This better preserves the link between image content and the final annotations while keeping the annotations semantically consistent. Since the text of the Web page hosting an image provides rich semantic information, we also propose mutual reinforcement between Web page document similarity and the named entities in the page text, which yields a better representation of document similarity.
     We propose a joint annotation framework for personal (family) albums based on Web datasets. Unlike single-image annotation, it exploits the correlations among images within an album to annotate multiple images jointly. The images in the album are first clustered, initial annotations for each cluster are learned from Web data, and the initial results are then refined in a semi-supervised learning framework that considers visual content correlation, annotation correlation and temporal correlation together.
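     The refinement step can be pictured with a small sketch. The abstract does not specify the semi-supervised framework, so this assumes a standard graph label propagation update, F <- alpha * S * F + (1 - alpha) * Y, on a normalized affinity matrix that is a weighted sum of visual, annotation and temporal affinities. The weights, matrices and label names below are hypothetical.

```python
import math


def combine_affinities(matrices, weights):
    """Weighted sum of same-sized affinity matrices, with a zero diagonal."""
    n = len(matrices[0])
    fused = [[sum(wt * m[i][j] for m, wt in zip(matrices, weights))
              for j in range(n)] for i in range(n)]
    for i in range(n):
        fused[i][i] = 0.0
    return fused


def propagate(affinity, initial, alpha=0.5, iters=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on photo clusters.

    affinity: symmetric n x n matrix between clusters
    initial:  n x k matrix Y of initial label confidences per cluster
    """
    n, k = len(affinity), len(initial[0])
    degree = [sum(row) for row in affinity]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    # Symmetric normalization S = D^(-1/2) W D^(-1/2).
    S = [[affinity[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(n)]
         for i in range(n)]
    F = [row[:] for row in initial]
    for _ in range(iters):
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1.0 - alpha) * initial[i][c] for c in range(k)]
             for i in range(n)]
    return F


# Hypothetical album with four clusters; 0 and 1 are close, as are 2 and 3.
visual = [[0, .9, .1, .1], [.9, 0, .1, .1], [.1, .1, 0, .8], [.1, .1, .8, 0]]
tags = [[0, .5, 0, 0], [.5, 0, 0, 0], [0, 0, 0, .5], [0, 0, .5, 0]]
time = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
labels = ["beach", "birthday"]
# Only clusters 0 and 3 received an initial annotation from Web data.
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

W = combine_affinities([visual, tags, time], weights=(0.5, 0.3, 0.2))
F = propagate(W, Y)
for idx, row in enumerate(F):
    print(idx, labels[row.index(max(row))])
```

     Here the two clusters without an initial annotation inherit the label of the cluster they are closest to in appearance, tags and time, which is the intuition behind annotating an album jointly rather than image by image.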
     We propose P-DCMRM, a personalized annotation recommendation model based on cross-media relevance. The model jointly considers the visual content space, the annotation word space and the user space. P-DCMRM addresses the neglect of image visual content in existing annotation recommendation systems and extends DCMRM with the user space. Model estimation combines the global statistics of the training set with the local statistics of each user's own space. For an uploaded image, the system can automatically recommend different annotation words to different users according to their annotation history and interests.
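     The combination of global and user-local statistics can be sketched as follows. This is not the P-DCMRM estimator, which the abstract does not spell out. It assumes a simple linear interpolation between a user's own tag distribution and the global one, multiplied by a content-based score for each word. The function names, the mixing weight `lam` and the toy data are hypothetical.

```python
from collections import Counter


def tag_distribution(tag_lists, vocab, smoothing=1.0):
    """Additively smoothed tag distribution over `vocab`."""
    counts = Counter(t for tags in tag_lists for t in tags if t in vocab)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def recommend(visual_score, global_tags, user_tags, lam=0.6, top_k=3):
    """Rank words by content score times a mixed user/global tag prior.

    visual_score: {word: score from any content-based annotation model}
    global_tags:  tag lists from the whole training set
    user_tags:    tag lists from this user's annotation history
    lam:          weight of the user's own statistics (assumed value)
    """
    vocab = sorted(visual_score)
    p_global = tag_distribution(global_tags, vocab)
    p_user = tag_distribution(user_tags, vocab)
    score = {w: visual_score[w] * (lam * p_user[w] + (1.0 - lam) * p_global[w])
             for w in vocab}
    return sorted(vocab, key=lambda w: (-score[w], w))[:top_k]


# Hypothetical data: the same image uploaded by two users.
visual = {"dog": 0.4, "puppy": 0.4, "cat": 0.2}
everyone = [["dog"], ["dog", "cat"], ["dog"], ["puppy"]]
user_a = [["puppy"], ["puppy"], ["puppy", "dog"]]
user_b = [["dog"], ["dog"]]
print(recommend(visual, everyone, user_a, top_k=1))
print(recommend(visual, everyone, user_b, top_k=1))
```

     With identical visual scores, the two users receive different top suggestions because their annotation histories differ, while the global term keeps words a user has never used from dropping to zero.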
