Research on Biological Sequence Classification Based on Machine Learning Methods (基于机器学习方法的生物序列分类研究)


English title: Research on Biological Sequence Classification Based on Machine Learning Methods
Author: Yang Yang (杨旸)
Degree level: Doctoral
Discipline: Computer Software and Theory
Keywords: Bioinformatics ; Biological Sequence ; Feature Extraction ; Pattern Classification ; Min-Max Modular Network ; Support Vector Machines ; Task Decomposition ; Protein Subcellular Localization ; Non-Coding RNA
Degree year: 2009
Advisor: Bao-Liang Lu (吕宝粮)
Discipline code: 081202
Degree-granting institution: Shanghai Jiao Tong University (上海交通大学)
Thesis submission date: 2009-06-01

Abstract

Over the past few decades, machine learning methods have gained strong momentum in bioinformatics and have become an important means of solving biological problems. Gene recognition, the recognition of functional sites and signals on DNA sequences, and protein sequence feature analysis all rely on machine learning and pattern recognition techniques. This thesis focuses on two key problems in pattern recognition, feature extraction and pattern classification, and uses them to analyze and classify biological sequences (both protein and nucleic acid sequences). The work addresses protein subcellular localization, protein homology search, prediction of proteins secreted by the bacterial type III secretion system, and prediction of novel non-coding RNAs.
     The major contributions of the thesis are as follows.
     1) Inspired by word segmentation techniques in Chinese natural language processing, we propose a new feature extraction method for protein sequences. We select statistically significant subsequences from amino acid sequences to build a dictionary, segment each amino acid sequence into non-overlapping words, and take the frequency of each word as the features of the protein. Compared with the traditional amino acid k-mer frequency method, the proposed method yields feature vectors with lower dimensionality and higher accuracy. We applied it to protein subcellular localization and to homologous protein detection (protein family classification) and obtained good results.
     2) The effector proteins secreted by the bacterial type III secretion system have low sequence similarity and unstable spatial structures. To account for this, we used, for the first time, protein secondary structure, solvent accessibility, and amino acid composition information to predict unknown effectors. Cross-validation on the Pseudomonas genome gave high accuracy. We then ran predictions on the genomes of four Rhizobium strains and, in combination with promoter pattern matching, obtained a number of new type III secretion effectors.
     3) To handle class imbalance and multiple localization in protein subcellular localization, we used min-max modular support vector machines to solve this multi-label, imbalanced problem. Compared with traditional support vector machines, the modular classifier improved both total accuracy and class-average accuracy. It also greatly reduced training time, which makes it suitable for large-scale data sets.
     4) We proposed a new task decomposition method for min-max modular support vector machines based on biological domain knowledge, namely taxonomy and Gene Ontology annotation. Compared with random decomposition and other decomposition methods, it gives more stable performance and higher accuracy.
     5) Using a comparative genomics approach, we extracted intergenic regions from the whole-genome sequences of multiple plants and obtained, through sequence alignment, segments conserved across the intergenic regions of several plants. We ran predictions on these segments and, after a series of screening steps, obtained 21 new non-coding RNAs grouped into 16 families. The expression of these new ncRNAs was verified by wet-bench experiments.
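
The segmentation-based feature extraction in contribution 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how statistical significance is tested or which segmentation algorithm is used, so this sketch assumes a plain occurrence-count threshold for the dictionary and greedy forward maximum matching for segmentation. The function names and the `max_len` and `min_count` parameters are hypothetical.

```python
from collections import Counter


def build_dictionary(sequences, max_len=3, min_count=2):
    """Collect substrings of length 2..max_len seen at least min_count times.

    Assumption: a raw count threshold stands in for the statistical
    significance test, which the abstract does not specify.
    """
    counts = Counter()
    for seq in sequences:
        for n in range(2, max_len + 1):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    return {word for word, c in counts.items() if c >= min_count}


def segment(seq, dictionary, max_len=3):
    """Split seq into non-overlapping words by greedy forward maximum matching.

    At each position the longest dictionary word is taken; a single residue
    is the fallback, so the words always concatenate back to seq.
    """
    words = []
    i = 0
    while i < len(seq):
        for n in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + n]
            if n == 1 or piece in dictionary:
                words.append(piece)
                i += n
                break
    return words


def word_frequencies(seq, dictionary, max_len=3):
    """Return the relative frequency of each word in the segmented sequence."""
    words = segment(seq, dictionary, max_len)
    counts = Counter(words)
    total = len(words)
    return {word: c / total for word, c in counts.items()}


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    training = ["MKVLAAGMKV", "AAGMKVLL"]
    dictionary = build_dictionary(training)
    print(segment("MKVLAAG", dictionary))
    print(word_frequencies("MKVLAAG", dictionary))
```

Because the feature space contains only the selected dictionary words plus the 20 single residues, it can be much smaller than the 20^k dimensions of a full k-mer frequency vector, which is the dimensionality advantage the abstract refers to.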

