Research on Internet Text Clustering and Retrieval Techniques
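Abstract
With the rapid development of Internet technology, the volume of text information on the network grows day by day, and people urgently need to obtain information from the Internet more efficiently. Text mining techniques are used to discover knowledge in text data and attempt to address the current problem of information overload.
     Text is the semantic carrier of natural language. By introducing relevant natural language processing techniques to mine the semantic features of text in depth, the accuracy and efficiency of text mining algorithms can be improved. This thesis mainly studies the application of natural language processing techniques to problems in text clustering and information retrieval systems. For text data mining tasks in the search engine and Internet environment, it proposes a series of methods based on natural language processing techniques to improve the effectiveness of text clustering algorithms and the relevance of query results to queries in information retrieval systems. The main contents of the thesis cover the following four aspects.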
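     This thesis first proposes a semantic feature dimensionality reduction method for text clustering based on natural language processing techniques. Text clustering is an unsupervised data mining method and, compared with supervised text classification algorithms, usually lacks effective methods for feature selection. As a result, the influence of different features on the clustering result cannot be controlled effectively; when the dimensionality is too high, the accuracy of the clustering result is easily affected by noisy features. This thesis proposes a feature reduction method based on lexical analysis that clusters on the nominal words extracted from the text, which effectively reduces the dimensionality of the features in the text collection while preserving their discriminative power. Because nouns exhibit synonymy, the same meaning can be expressed by different words, which affects the measurement of text similarity. By using a semantic knowledge dictionary to expand words to their categories, this thesis reduces the synonymy of features to some extent, further reducing the feature dimensionality while improving clustering accuracy. Experiments show that feature reduction based on lexical analysis and semantic dictionary expansion significantly reduces the size of the text feature space while effectively improving the accuracy of the clustering result.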
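     Given the shortcomings of the linear result lists of search engines, clustering the search results is a more effective way to present them. The document set targeted by search result clustering consists of the summary descriptions (snippets) of the search results. Although these snippets are clear in their information, they are short; when clustering such a text collection, common document similarity algorithms often fail to yield accurate results because of the sparsity of the feature space. This thesis introduces tolerance rough set techniques and uses word co-occurrence information across documents to semantically expand the original result snippets. The correlation between the expanded documents is strengthened, which avoids the loss of clustering accuracy caused by a sparse feature space. As for the clustering algorithm, this thesis proposes a new label-based clustering algorithm built on word relatedness computation, which transforms search result clustering into a word sense disambiguation problem for the query terms over the search result set. The algorithm generates labels that are more clearly descriptive and more discriminative, and the results corresponding to each label are more consistent in content. Experiments show that the proposed search result clustering algorithm can effectively uncover the different senses that a user query takes in the search results, helping users quickly locate the document set they need.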
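     Text clustering algorithms usually adopt the vector space model to represent text formally, in which the features are assumed to be unrelated to one another. For text, this assumption discards much valuable information that could effectively measure the similarity between documents, and therefore lowers clustering accuracy. Compared with independent single-word features, sets of words that occur frequently across different documents better reflect how similar the documents are. This thesis uses closed frequent wordsets under contextual constraints to measure the similarity between documents, which better captures the deep latent semantic connections between them. Frequent itemset mining is a classic data mining technique for association analysis. With modifications, this thesis applies frequent itemset mining to text collections to mine frequent wordsets in the document set, and imposes different contextual distance constraints on the discovered wordsets so that the frequent patterns better preserve semantic consistency and effectively reflect the characteristics that distinguish text from structured data. Experiments show that a text clustering algorithm based on this new similarity measure produces more accurate clustering results.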
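     Relevance ranking of search results is one of the important research topics in information retrieval. Unlike traditional text data, web pages usually carry a large amount of noise unrelated to the topic, which seriously affects the relevance of query results. This thesis therefore first purifies web pages using page parsing and content extraction techniques based on content units, reducing the influence of content-irrelevant information on retrieval relevance. At present, the relevance computation of the vast majority of information retrieval systems is built on the full text. However, the full text of a web page is often inconsistent in what it expresses and contains content unrelated to the topic, which also affects the relevance of query results to some extent. This thesis proposes to improve retrieval quality by computing the relevance between the user query and an automatic summary of the purified web page. Compared with the full text, the summary is the core content extracted from the document; it is concise, accurate, and clear, and better reflects the topic of the document. Experiments show that, compared with the full text, summary-based retrieval achieves better accuracy in relevance ranking.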
With the rapid development of the Internet, the volume of text-based information grows rapidly day by day, and people urgently need to access that information effectively. Text mining tries to solve this problem of “information overload”.
     Text is the semantic representation of natural language, so adopting natural language processing (NLP) techniques in the text mining process to handle the semantic features of text can be expected to improve text mining algorithms. This thesis focuses on applications of NLP techniques in text clustering and information retrieval. For text mining tasks in the Web and search engine environment, it proposes a series of NLP-based methods to improve the quality of text clustering algorithms and the relevance of search results to the user’s query in Web-based information retrieval systems. The major contents of this thesis comprise the following four parts.
     First, this thesis proposes an NLP-based semantic feature reduction method for text clustering. Unlike supervised text categorization, text clustering is an unsupervised data mining method, and few effective feature reduction methods exist for it. The effect of different kinds of features on the quality of clustering results is therefore hard to control. If the dimension of the feature space is too large, the accuracy of the clustering results is easily affected by noise features. This thesis proposes a feature reduction method based on lexical analysis that selects noun-related features, which significantly reduces the dimension of the feature space while preserving most of its discriminative power. Many nouns are synonymous, with different words sharing the same meaning, which causes inaccuracy in the document similarity measure. To solve this problem, this thesis uses a semantic dictionary to map each remaining feature to its higher-level semantic category, leading to a smaller feature space while improving the accuracy of the clustering results.
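     The two reduction steps described above (keep noun features, then map each noun to a semantic category) can be illustrated with a minimal sketch. It assumes the text has already been part-of-speech tagged, that noun tags begin with "N", and that a toy dictionary `CATEGORY_OF` stands in for the semantic dictionary; these names and the tag convention are hypothetical and are not taken from the thesis.

```python
from collections import Counter

# Hypothetical toy stand-in for a semantic dictionary: noun -> category label.
CATEGORY_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "doctor": "medical_person",
    "physician": "medical_person",
}


def reduce_features(tagged_tokens, category_of=CATEGORY_OF):
    """Keep noun tokens only and replace each by its category when one is known."""
    features = Counter()
    for word, pos in tagged_tokens:
        if not pos.startswith("N"):  # assumed convention: noun tags start with "N"
            continue
        # Nouns missing from the dictionary are kept as their own feature.
        features[category_of.get(word, word)] += 1
    return features


# A pre-tagged toy document; a real pipeline would obtain tags from a lexical analyzer.
doc = [
    ("the", "DT"), ("doctor", "NN"), ("drove", "VBD"), ("a", "DT"), ("car", "NN"),
    ("and", "CC"), ("the", "DT"), ("physician", "NN"), ("bought", "VBD"),
    ("an", "DT"), ("automobile", "NN"),
]
features = reduce_features(doc)
```

     In this toy document the four distinct nouns collapse into two category features, which is the kind of reduction in dimensionality and synonymy the paragraph describes.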
     To address the deficiencies of the ranked result lists returned by search engines, clustering the search results is a more suitable way to present them. The content of a search result is simple and concise, but short. Similarity measures on such short texts usually give poor results because of the sparseness of the feature space. This thesis uses tolerance rough sets to extend the original feature space to its semantic upper approximation based on word co-occurrences. In the new feature space, the latent similarity between documents is strengthened. This thesis also presents a new label-based search result clustering algorithm built on the correlation between words, which transforms the problem of search result clustering into query sense disambiguation. This method generates more descriptive and discriminative labels for each cluster while keeping documents in the same cluster consistent in content. Experiments show that this clustering method helps users find the different senses of their queries in the search results and easily locate the subset of results that matches their information needs.
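     The co-occurrence-based expansion can be sketched with the commonly used tolerance rough set formulation, in which the tolerance class of a term contains the terms that co-occur with it in at least `theta` documents, and the upper approximation of a document contains every term whose tolerance class overlaps the document. The exact variant used in the thesis is not stated in this abstract, so the function names, the threshold, and the toy snippets below are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to itself plus the terms co-occurring with it in >= theta documents."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """All terms whose tolerance class shares at least one term with the document."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


snippets = [
    ["jaguar", "car", "dealer"],
    ["jaguar", "car", "price"],
    ["jaguar", "cat", "habitat"],
    ["car", "price", "dealer"],
]
classes = tolerance_classes(snippets, theta=2)
expanded = upper_approximation(["jaguar", "cat", "habitat"], classes)
```

     The expanded snippet gains terms that frequently co-occur with its own terms, so two short snippets that share no literal word can still overlap after expansion.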
     The vector space model (VSM) is usually adopted as the text representation in text clustering, where the features are assumed to be independent. This assumption loses much useful information when measuring similarity between documents. Compared with single independent features, frequent wordsets that occur in many documents indicate the similarity between documents more strongly. This thesis measures the similarity between documents using closed frequent wordsets under contextual constraints, a more suitable feature unit for reflecting the latent relations between documents. Frequent itemset mining is a data mining technique used for association analysis in structured transaction databases. In this thesis, it is modified for text clustering and constrained with different contextual proximities to make the wordsets more semantically consistent. The experimental results show that a clustering algorithm based on this new document similarity measure produces more accurate clustering results.
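     A heavily simplified sketch of the idea follows. It considers only two-word sets, treats a pair as occurring in a document when its words appear within `window` tokens of each other, keeps pairs supported by at least `min_support` documents, and compares documents by the Jaccard overlap of their frequent pairs. The thesis mines closed frequent wordsets of arbitrary size; the restriction to pairs, the Jaccard measure, and all names and parameters here are assumptions made to keep the example short.

```python
from collections import Counter


def windowed_pairs(tokens, window):
    """Unordered pairs of distinct words at most `window` token positions apart."""
    pairs = set()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:
                pairs.add(frozenset((a, b)))
    return pairs


def frequent_pairs(docs, window, min_support):
    """For each document, the proximity-constrained pairs found in >= min_support documents."""
    per_doc = [windowed_pairs(d, window) for d in docs]
    support = Counter(p for pairs in per_doc for p in pairs)
    frequent = {p for p, n in support.items() if n >= min_support}
    return [pairs & frequent for pairs in per_doc]


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


docs = [
    "text clustering uses frequent word sets".split(),
    "frequent word sets improve text clustering".split(),
    "image segmentation uses normalized cuts".split(),
]
sets = frequent_pairs(docs, window=2, min_support=2)
sim_01 = jaccard(sets[0], sets[1])
sim_02 = jaccard(sets[0], sets[2])
```

     The proximity window plays the role of the contextual constraint: words that merely appear somewhere in the same document do not form a wordset unless they occur close together.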
     Ranking search results by relevance is an important topic in information retrieval. Unlike traditional text documents, Web pages contain a lot of noise, which strongly affects the relevance of results. In this thesis, Web pages are therefore first purified through page analysis and content extraction based on the concept of a content unit, which reduces the impact of noise at the structural level of Web pages. Most information retrieval systems base their relevance computation on full-text analysis, but Web pages also contain inconsistent, topic-irrelevant content that can degrade the relevance of results. This thesis proposes a summarization-based method for improving relevance, which computes the relevance between the query and a summary instead of the full text. A summary is the core of the full-text document and represents its topic more consistently; it is concise, accurate, and clear. Experiments show that summarization-based relevance computation leads to more accurate relevance ranking of search results.
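     Scoring a query against a summary rather than the full text can be sketched as below. The summarizer is only a placeholder that keeps the sentences with the highest average in-document term frequency, and relevance is plain term-frequency cosine similarity; neither is claimed to be the summarization or ranking method of the thesis, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def summarize(text, k=1):
    """Placeholder summarizer: the k sentences with the highest average term frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokens(text))

    def score(sentence):
        words = tokens(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return " ".join(sorted(sentences, key=score, reverse=True)[:k])


def rank_by_summary(query, pages, k=1):
    """Page indices ordered by query-to-summary cosine similarity, best first."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(summarize(p, k)))), i) for i, p in enumerate(pages)]
    return [i for _, i in sorted(scored, key=lambda item: (-item[0], item[1]))]


pages = [
    "Cheap phones are on sale. Phones ship free. Clustering is mentioned once.",
    "Text clustering groups similar documents. Clustering of documents uses "
    "similarity. Visit our shop for cheap phones.",
]
ranking = rank_by_summary("clustering documents", pages)
```

     In this toy case the first page mentions the query term only in an off-topic sentence, which the summary leaves out, so the page contributes no score from that passing mention.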
引文
1 G. Zhang, G. Zhang, Q. Yang, et al. Evolution of the Internet and its Cores[J].New Journal of Physics, 2008, 10(12):http://www.iop.org/EJ/refs/1367--2630/10/12/123027.
    2 R. Hausser. Foundations of Computational Linguistics: Man–machine Communi-cation in Natural Language[J]. Computational Linguistics, 2000, 26(3):449–455.
    3 R. Mitkov. The Oxford Handbook of Computational Linguistics[J]. ComputationalLinguistics, 2004, 30(1):103–106.
    4 P. Jackson, I. Moulinier. Natural Language Processing for Online Applications:Text Retrieval, Extraction and Categorization[M]. John Benjamins PublishingCompany, 2007.
    5 R. Feldman, J. Sanger. The Text Mining Handbook[M]. Cambridge UniversityPress, 2006.
    6 M. Berry, M. Castellanos. Survey of Text Mining II: Clustering, Classification, andRetrieval[J]. Springer, 2007.
    7郭萌,王珏.数据挖掘与数据库知识发现:综述[J].模式识别与人工智能,1998, 11(3):292–299.
    8 M.-S. Chen, J. Han, P. S. Yu. Data Mining: An Overview from a Database Perspec-tive[J]. IEEE Transactions on Knowledge and data Engineering, 1996, 8(6):866–883.
    9 R. Feldman, I. Dagan. Knowledge Discovery in Textual Databases[C]Proceedingsof the First International Conference on Knowledge Discovery and Data Mining(KDD-95). Montreal, Canada: AAAI Press, Menlo Park, 1995:112–117.
    10 A. Jain, R. Dubes. Algorithms for Clustering Data[M]. Prentice-Hall, 1988.
    11 P. Sneath, R. Sokal. Numerical Taxonomy[M]. Springer, 1973.
    12 A. Strehl, J. Ghosh, R. Mooney. Impact of Similarity Measures on Web-pageClustering[C]Proc. AAAI Workshop on AI for Web Search (AAAI 2000), Austin.2000:58–64.
    13 G. Salton, M. McGill. Introduction to Modern Information Retrieval[M]. McGraw-Hill, Inc. New York, NY, USA, 1986.
    14 R. Forster. Document Clustering in Large German Corpora Using Natural Lan-guage Processing[D]University of Zurich, 2006:23–25.
    15 H. Schu¨tze, C. Silverstein. Projections for Efficient Document Clustering[C]ACMSIGIR Forum. 1997, 31:74–81.
    16 L. Kaufman, P. Rousseeuw. Finding Groups in Data: An Introduction to ClusterAnalysis[M]. John Wiley & Sons, New York, 1990.
    17 S. Chu, J. Roddick, J. Pan. An Incremental Multi-centroid, Multi-run SamplingScheme for K-medoids-based Algortihms-extended Report[C]Proceedings of theThird International Conference on Data Mining Methods and Databases, Data Min-ing. 2002, 3:553–562.
    18 S. Guha, R. Rastogi, K. Shim. Cure: An Efficient Clustering Algorithm for LargeDatabases[C]SIGMOD’98: Proceedings of the 1998 ACM SIGMOD internationalconference on Management of data. New York, NY, USA: ACM, 1998:73–84.
    19 S. Guha, R. Rastogi, K. Shim. Techniques for Clustering Massive Data Sets[J].Clustering and Information Retrieval, 2003:35–82.
    20 P. Bellot, M. El-Be`ze. Clustering by Means of Unsupervised Decision Trees Or Hi-erarchical and K-means-like Algorithm[C]Proceedings of 6th International Confer-ence‘Recherche d′Information Assiste′e par Ordinateur’(RIAO’00), Paris, France.2000:344–363.
    21 D. Cutting, D. Karger, J. Pedersen, et al. Scatter/gather: A Cluster-based Approachto Browsing Large Document Collections[C]Proceedings of the 15th annual in-ternational ACM SIGIR conference on Research and development in informationretrieval. 1992:318–329.
    22 Y. Peng, G. Kou, Z. Chen, et al. Recent Trends in Data Mining (DM): DocumentClustering of DM Publications[C]International Conference on Service Systems andService Management. 2006:1653–1659.
    23 G. Kowalski. Information Retrieval Systems: Theory and Implementation[M].Kluwer Academic Publishers, 1997.
    24 O. Zamir, O. Etzioni, O. Madani, et al. Fast and Intuitive Clustering of Web Docu-ments[C]Proceedings of the 3rd International Conference on Knowledge Discoveryand Data Mining. 1997:287–290.
    25 H. Zeng, Q. He, Z. Chen, et al. Learning to Cluster Web Search Re-sults[C]Proceedings of the 27th annual international ACM SIGIR conference onresearch and development in information retrieval. 2004:210–217.
    26 D. Koller, M. Sahami. Hierarchically Classifying Documents Using VeryFew Words[C]MACHINE LEARNING-INTERNATIONAL WORKSHOP THENCONFERENCE-. 1997:170–178.
    27 J.-R. Wen, J.-Y. Nie, H.-J. Zhang. Clustering User Queries of a Search En-gine[C]WWW’01: Proceedings of the 10th international conference on WorldWide Web. New York, NY, USA: ACM, 2001:162–168.
    28 Y. Fang, S. Parthasarathy, F. Schwartz. Using Clustering to Boost Text Classifica-tion[C]Proceedings of the IEEE ICDM Workshop on Text Mining (TextDM’01),Maebashi City, Japan. 2001.
    29 P. Cimiano, A. Hotho, S. Staab. Comparing Conceptual, Divisive and Agglom-erative Clustering for Learning Taxonomies from Text[C]Proceedings of the 16thEuropean Conference on Artificial Intelligence. 2004.
    30 A. Hamzah, A. Susanto, F. Soesianto, et al. Concept-Based Text Document Clus-tering[C]Proceedings of the International Conference on Electrical Engineering andInformatics Institut Teknologi Bandung, Indonesia. 2007:210–213.
    31 A. Kuhn, S. Ducasse, T. G′?rba. Semantic Clustering: Identifying Topics in SourceCode[J]. Inf. Softw. Technol., 2007, 49(3):230–243.
    32 K. Yin. Inferring Informed Clustering Problems with Minimum Description LengthPrinciple[D]University at Albany, 2007.
    33 R. Duda, P. Hart. Pattern Classification and Scene Analysis.[M]. New York, 1973.
    34 D. Greene, P. Cunningham. Producing Accurate Interpretable Clusters from High-dimensional Data[J]. Lecture notes in computer science, 2005, 3721:486–494.
    35 G. Hamerly, C. Elkan. Alternatives to the K-means Algorithm That Find BetterClusterings[C]Proceedings of the eleventh international conference on Informationand knowledge management. 2002:600–607.
    36 Y. Zhao, G. Karypis. Soft Clustering Criterion Functions for Partitional DocumentClustering: A Summary of Results[C]CIKM’04: Proceedings of the thirteenthACM international conference on Information and knowledge management. NewYork, NY, USA: ACM, 2004:246–247.
    37 J. Ton, R. Gonzalez. Pattern Recognition Principles[M]. Addison-Wesley, 1974.
    38 B. Everitt, S. Landau, M. Leese. Cluster Analysis[M]. 4nd ed., Wiley Publishing,2009.
    39 P. Willett. Recent Trends in Hierarchic Document Clustering: A Critical Review[J].Inf. Process. Manage., 1988, 24(5):577–597.
    40 Y. Zhao, G. Karypis. Criterion Functions for Document Clustering[R]. Tech. rep.,2005.
    41 J. Hartigan, M. Wong. A K-means Clustering Algorithm[J]. Applied Statistics,1979, 28(1):100–108.
    42 I. Dhillon, D. Modha. Concept Decompositions for Large Sparse Text Data UsingClustering[J]. Machine Learning, 2001, 42(1):143–175.
    43 Y. Zhao, G. Karypis, U. Fayyad. Hierarchical Clustering Algorithms for DocumentDatasets[J]. Data Mining and Knowledge Discovery, 2005, 10(2):141–168.
    44 J. Han, M. Kamber. Data Mining: Concepts and Techniques[M]. 2 ed., MorganKaufmann, 2006.
    45 H. Xiong, J. Wu, J. Chen. K-means Clustering Versus Validation Measures: A DataDistribution Perspective[C]Proceedings of the 12th ACM SIGKDD internationalconference on Knowledge discovery and data mining. 2006:779–784.
    46 B. Larsen, C. Aone. Fast and Effective Text Mining Using Linear-time DocumentClustering[C]Proceedings of the fifth ACM SIGKDD international conference onKnowledge discovery and data mining. 1999:16–22.
    47 P. Bradley, U. Fayyad. Refining Initial Points for K-means Cluster-ing[C]Proceedings of the Fifteenth International Conference on Machine Learning.1998:91–99.
    48 D. Arthur, S. Vassilvitskii. k-means++: The Advantages of Careful Seed-ing[C]Proceedings of the eighteenth annual ACM-SIAM symposium on Discretealgorithms. 2007:1027–1035.
    49 S. Zhong. Efficient Online Spherical K-means Clustering[C]2005 IEEE Interna-tional Joint Conference on Neural Networks, 2005. IJCNN’05. 2005, 5:3180–3185.
    50 B. Scho¨lkopf, J. Weston, E. Eskin, et al. A Kernel Approach for Learning fromAlmost Orthogonal Patterns[C]Proceedings of the 13th European Conference onMachine Learning. 2002:511–528.
    51 R. Zhang, A. Rudnicky. A Large Scale Clustering Scheme for Kernel K-means[C]International Conference On Pattern Recognition. 2002, 16:289–292.
    52 X. Liu, Y. Gong, W. Xu, et al. Document Clustering with Cluster Refinementand Model Selection Capabilities[C]Proceedings of the 25th annual internationalACM SIGIR conference on Research and development in information retrieval.New York, NY, USA: ACM, 2002:191–198.
    53 C. Ordonez, E. Omiecinski. FREM: Fast and Robust Em Clustering for Large DataSets[C]Proceedings of the eleventh international conference on Information andknowledge management. 2002:590–599.
    54 A. Banerjee, J. Ghosh. Frequency Sensitive Competitive Learning for Clustering onHigh-dimensional Hyperspheres[C]Proc. IEEE Int. Joint Conf. Neural Networks.2002:1590–1595.
    55 S. Zhong, J. Ghosh. Scalable, Balanced Model-based Clustering[C]Proc. 3rd SIAMInt. Conf. Data Mining. 2003:71–82.
    56 Teuvo Kohonen. Self-organizing Maps, 3rd Edition[M]. Berlin: Springer, 2001.
    57 S. Kaski, T. Honkela, K. Lagus, et al. WEBSOM–self-organizing Maps of Docu-ment Collections[J]. Neurocomputing, 1998, 21(1-3):101–117.
    58 D. Pullwitt. Integrating Contextual Information to Enhance Som-based Text Docu-ment Clustering[J]. Neural Networks, 2002, 15(8-9):1099–1106.
    59 J. Bakus, M. Hussin, M. Kamel. A Som-based Document Clustering UsingPhrases[C]Neural Information Processing, 2002. ICONIP’02. Proceedings of the9th International Conference on. 2002, 5:2212–2216.
    60 J. Henderson, P. Merlo, I. Petroff, et al. Using Nlp to Efficiently Visualize TextCollections with Soms[C]Database and Expert Systems Applications, 2002. Pro-ceedings. 13th International Workshop on. 2002:210–214.
    61 J. Henderson, P. Merlo, I. Petroff, et al. Using Syntactic Analysis to Increase Ef-ficiency in Visualizing Text Collections[C]Proceedings of the 19th internationalconference on Computational linguistics-Volume 1. 2002:1–7.
    62 J. Marques de Sa. Pattern Recognition: Concepts, Methods and Applications, 2001.
    63 M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Tech-niques[C]KDD workshop on text mining. 2000.
    64 C. Ding, X. He, H. Zha, et al. A Min-max Cut Algorithm for Graph Partitioningand Data Clustering[C]Proceedings IEEE International Conference on Data Min-ing, 2001. ICDM 2001. 2001:107–114.
    65 X. He, C. Ding, H. Zha, et al. Automatic Topic Identification Using WebpageClustering[C]First IEEE International Conference on Data Mining (ICDM’01). SanJose, CA. 2001:195–202.
    66 I. Dhillon. Co-clustering Documents and Words Using Bipartite Spectral GraphPartitioning[C]Proceedings of the seventh ACM SIGKDD international conferenceon Knowledge discovery and data mining. 2001:269–274.
    67 J. Shi, J. Malik. Normalized Cuts and Image Segmentation[J]. IEEE Transactionson pattern analysis and machine intelligence, 2000, 22(8):888–905.
    68 A. Ng, M. Jordan, Y. Weiss. On Spectral Clustering: Analysis and an Algo-rithm[C]MIT; 1998, 2002, 2:849–856.
    69 U. von Luxburg. A Tutorial on Spectral Clustering[J]. Statistics and Computing,2007, 17(4):395–416.
    70 W. Donath, A. Hoffman. Lower Bounds for the Partitioning of Graphs[J]. IBMJournal of Research and Development, 1973, 17(5):420–425.
    71 L. Bao, S. Tang, J. Li, et al. Document Clustering Based on Spectral Clustering andNon-negative Matrix Factorization[C]IEA/AIE. 2008:149–158.
    72 E. McCreight. A Space-economical Suffix Tree Construction Algorithm[J]. Journalof the ACM (JACM), 1976, 23(2):262–272.
    73 O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstra-tion[C]Proceedings of the 21st annual international ACM SIGIR conference onResearch and development in information retrieval. 1998:46–54.
    74 K. Hammouda, M. Kamel. Efficient Phrase-based Document Indexing for WebDocument Clustering[J]. IEEE Transactions on Knowledge and Data Engineering,2004, 16(10):1279–1296.
    75 S. Zu Eissen, B. Stein, M. Potthast. The Suffix Tree Document Model Revis-ited[C]Proceedings of the 5th International Conference on Knowledge Manage-ment. 2005:596–603.
    76 H. Chim, X. Deng. Efficient Phrase-based Document Similarity for Clustering[J].IEEE Transactions on Knowledge and Data Engineering, 2008, 20(9):1217–1229.
    77 B. J. Frey, D. Dueck. Clustering by Passing Messages between Data Points[J].Science, 2007, 315:972–976. http://www.psi.toronto.edu/affinitypropagation.
    78陆俭明,徐波,孙茂松.中文信息处理若干重要问题[M].北京:科学出版社,2003.
    79 C. Manning, H. Schu¨tze. Foundations of Statistical Natural Language Process-ing[M]. MIT Press, 1999.
    80 D. Petrelli, M. Beaulieu, M. Sanderson, et al. Observing Users, Designing Clar-ity: A Case Study on the User-centered Design of a Cross-language InformationRetrieval System[J]. Journal of the American Society for Information Science andTechnology, 2004, 55(10):923–934.
    81 L. F. Rau, P. S. Jacobs. Creating Segmented Databases from Free Text for TextRetrieval[C]Proceedings of the 14th annual international ACM SIGIR conferenceon Research and development in information retrieval. 1991:337–346.
    82 M. L. Mauldin. Retrieval Performance in Ferret a Conceptual Information RetrievalSystem[C]Proceedings of the 14th annual international ACM SIGIR conference onResearch and development in information retrieval. 1991:347–355.
    83 P. Jacob, L. Rau. Natural Language Techniques for Intelligent Information Re-trieval[C]Proceedings of the 11th annual international ACM SIGIR conference onResearch and development in information retrieval. 1988:85–99.
    84 A. Ram. Interest-based Information Filtering and Extraction in Natural LanguageUnderstanding Systems[C]Bellcore Workshop on High-Performance InformationFiltering, Morristown, NJ. 1991.
    85 E. Wendlandt, J. Driscoll. Incorporating a Semantic Analysis Into a Document Re-trieval Strategy[C]Proceedings of the 14th annual international ACM SIGIR con-ference on Research and development in information retrieval. 1991:270–279.
    86 T. Strzalkowski. Natural Language Processing in Large-scale Text RetrievalTasks[C]First Text Retrieval Conference (Trec-1): Proceedings. 1993:173.
    87 T. Strzalkowski, J. Carballo. Recent Developments in Natural Language Text Re-trieval[J]. NIST SPECIAL PUBLICATION SP, 1994:123–123.
    88 G. Salton, A. Wong, C. Yang. A Vector Space Model for Automatic Indexing[J].Communications of the ACM, 1975, 18(11):613–620.
    89 S. G., B. C. Term Weighting Approaches in Automatic Text Retrieval[J]. Informa-tion Processing and Management, 1988, 24(5):513–523.
    90 C. Aggarwal, P. Yu. Finding Generalized Projected Clusters in High DimensionalSpaces[C]Proceedings of the 2000 ACM SIGMOD international conference onManagement of data. 2000:70–81.
    91 Y. Yang, J. Pedersen. A Comparative Study on Feature Selection in Text Catego-rization[C]International Workshop Conference Machine Learning. 1997:412–420.
    92 S. Ru¨ger, S. Gauch. Feature Reduction for Document Clustering and Classifica-tion[M]. Imperial College of Science, Technology and Medicine, Department ofComputing (2000), 2000.
    93 T. Mitchell. Machine Learning[M]. McGraw Hill, 1997.
    94 K. Church, P. Hanks. Word Association Norms, Mutual Information, and Lexicog-raphy[J]. Computational linguistics, 1990, 16(1):22–29.
    95 T. Liu, S. Liu, Z. Chen, et al. An Evaluation on Feature Selection for Text Clus-tering[C]Proceedings of the Twentieth International Conference (ICML 2003), Au-gust 21-24, 2003, Washington, DC, USA. 2003:488–495.
    96 Y. Li, C. Luo, S. Chung. Text Clustering with Feature Selection by Using StatisticalData[J]. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(5):641–652.
    97 L. Rigutini, M. Maggini. A Semi-supervised Document Clustering AlgorithmBased on Em[C]Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACMInternational Conference on. 2005:200–206.
    98 N. Wyse, R. Dubes, A. Jain. A Critical Evaluation of Intrinsic Dimensionality Al-gorithms[C]Pattern Recognition in Practice: Proceedings of an International Work-shop Held in Amsterdam, May 21-23, 1980. 1980:415–425.
    99 J. Lin, D. Gunopulos. Dimensionality Reduction by Random Projection and LatentSemantic Indexing[C]Proceedings of the Text Mining Workshop, at the 3rd SIAMInternational Conference on Data Mining. 2003.
    100 S. Kaski. Dimensionality Reduction by Random Mapping: Fast Similarity Compu-tation for Clustering[C]Proceedings of IJCNN. 1998, 98:413–418.
    101 E. Bingham, H. Mannila. Random Projection in Dimensionality Reduction: Ap-plications to Image and Text Data[C]Proceedings of the seventh ACM SIGKDDinternational conference on Knowledge discovery and data mining. 2001:245–250.
    102 B. Tang, X. Luo, M. Heywood, et al. A Comparative Study of Dimension ReductionTechniques for Document Clustering[R]. Tech. rep., Technical Report CS-2004-14,Faculty of Computer Science, Dalhousie University, 2004.
    103 S. Dumais, G. Furnas, T. Landauer, et al. Using Latent Semantic Analysis to Im-prove Access to Textual Information[C]Proceedings of the SIGCHI conference onHuman factors in computing systems. 1988:281–285.
    104 S. Deerwester, S. Dumais, G. Furnas, et al. Indexing by Latent Semantic Analy-sis[J]. Journal of the American society for information science, 1990, 41(6):391–407.
    105 I. Jolliffe. Principal Component Analysis. 1986[M]. Springer, New York, 1986.
    106 T. Kolenda, L. Hansen, S. Sigurdsson. Independent Components in Text[J]. Ad-vances in Independent Component Analysis, 2000:235–256.
    107 N. Slonim, N. Tishby. Document Clustering Using Word Clusters via the Informa-tion Bottleneck Method[C]Proceedings of the 23rd annual international ACM SI-GIR conference on Research and development in information retrieval. 2000:208–215.
    108 I. Fodor. A Survey of Dimension Reduction Techniques[J].LLNL technical report, June 2002, UCRL-ID-148494. URL:http://www.llnl.gov/CASC/sapphire/pubs.html, 2002.
    109 M. Dash, H. Liu. Feature Selection for Clustering[J]. Lecture notes in computerscience, 2000:110–121.
    110 M. Maggini, L. Rigutini, M. Turchi. Pseudo-Supervised Clustering for Text Doc-uments[C]IEEE/WIC/ACM International Conference on Web Intelligence, 2004.WI 2004. Proceedings. 2004:363–369.
    111 L. Liu, J. Kang, J. Yu, et al. A Comparative Study on Unsupervised Feature Se-lection Methods for Text Clustering[C]Natural Language Processing and Knowl-edge Engineering, 2005. IEEE NLP-KE’05. Proceedings of 2005 IEEE Interna-tional Conference on:597–601.
    112 J. Caron. Experiments with Isa Scoring: Optimal Rank and Basis[J]. Computationalinformation retrieval, 2001:157–169.
    113 M. Hasan, Y. Masumoto. Document Clustering: Before and after the SingularValue Decomposition[J]. Sapporo, Japan, Information Processing Society of Japan(IPSJ-TR: 99-NL-134.) pp, 1999:47–55.
    114 D. Kanejiya, A. Kumar, S. Prasad. Statistical Language Modeling with PerformanceBenchmarks Using Various Levels of Syntactic-semantic Information[C]COLING’04: Proceedings of the 20th international conference on Computational Linguis-tics. 2004:1161–1170.
    115 P. Kanerva, J. Kristofersson, A. Holst. Random Indexing of Text Samples for LatentSemantic Analysis[C]Proceedings of the 22nd Annual Conference of the CognitiveScience Society. 2000, 1036.
    116 A. Hotho, S. Staab, G. Stumme. Wordnet Improves Text Document Cluster-ing[C]Proc. of the SIGIR 2003 Semantic Web Workshop. 2003:541–544.
    117 J. Sedding, D. Kazakov. Wordnet-based Text Document Clustering[J]. ROMAND,2004:104–125.
    118 S. Bloehdorn, P. Cimiano, A. Hotho, et al. An Ontology-based Framework for TextMining[J]. GLDV-Journal for Computational Linguistics and Language Technol-ogy, 2005, 20:87–112.
    119 A. Moschitti, R. Basili. Complex Linguistic Features for Text Classification: AComprehensive Study[J]. Lecture Notes in Computer Science, 2004, 2997:181–196.
    120 M. F. Porter. An Algorithm for Suffix Stripping[J]. Program, 1980, 14(3):130–137.
    121 N. Chinchor, P. Robinson. MUC-7 Named Entity Task Definition (version3.5)[C]Proceedings of the Seventh Message Understanding Conference. 1998.
    122姜维.统计中文词法分析及其强化学习机制的研究[D].哈尔滨:哈尔滨工业大学,2007.
    123 R. Basili, A. Moschitti, M. T. Pazienza. Language Sensitive Text Classifica-tion[C]Proceedings of RIAO 2000, 6th International Conference on Computer-Assisted Information Retrieval. CID, 2000:331–343.
    124 S. Yu, H. Duan, X. Zhu, et al. Specification for Corpus Processing at Peking Uni-versity: Word Segmentation, Pos Tagging and Phonetic Notation[J]. Journal ofChinese Language and Computing, 2003, 13(2):121–158.
    125刘远超,王晓龙,徐志明,等.文档聚类综述[J].中文信息学报, 2006, 20(3):55–62.
    126 F. Debole, F. Sebastiani. An Analysis of the Relative Hardness of Reuters-21578Subsets[J]. Journal of the American Society for Information Science and technol-ogy, 2005, 56(6):584–596.
    127 K. Lang. Newsweeder: Learning to Filter Netnews[C]In Proceedings of the TwelfthInternational Conference on Machine Learning. 1995:331–339.
    128董振东,董强.知网和汉语研究[J].当代语言学, 2001, 3(1):32–44.
    129刘群,李素建.基于《知网》的词汇语义相似度计算[J]. Computational Linguis-tics and Chinese Language Processing, 2002, 7(2):59–76.
    130梅家驹,竺一鸣,高蕴琦,等.同义词词林[M].上海辞书出版社, 1983.
    131 M. Hearst, J. Pedersen. Reexamining the Cluster Hypothesis: Scatter/gather onRetrieval Results[C]Proceedings of the 19th annual international ACM SIGIR con-ference on Research and development in information retrieval. 1996:76–84.
    132 A. Leuski. Evaluating Document Clustering for Interactive Information Re-trieval[C]Proceedings of the tenth international conference on Information andknowledge management. 2001:33–40.
    133 T. Ho, N. Nguyen. Nonhierarchical Document Clustering Based on a ToleranceRough Set Model[J]. International Journal of Intelligent Systems, 2002, 17(2):199–212.
    134 O. Zamir, O. Etzioni. Grouper: a Dynamic Clustering Interface to Web Search Re-sults[J]. Computer Networks-the International Journal of Computer and Telecom-munications Networkin, 1999, 31(11):1361–1374.
    135 O. Zamir. Clustering Web Documents: A Phrase-based Method for GroupingSearch Engine Results[D]University of Washington, 1999.
    136 D. Weiss. A Clustering Interface for Web Search Results in Polish and English[J].Master’s Thesis, Poznan University of Technology, Poznan, Poland, 2001.
    137 S. Osinski, J. Stefanowski, D. Weiss. Lingo: Search Results Clustering AlgorithmBased on Singular Value Decomposition[C]Intelligent Information Processing AndWeb Mining: Proceedings of the International IIS: IIPWM’04 Conference Held inZakopane, Poland. 2004:359–360.
    138 C. Ngo, H. Nguyen. A Tolerance Rough Set Approach to Clustering Web SearchResults[J]. Lecture notes in computer science, 2004:515–517.
    139 D. Zhang, Y. Dong. Semantic, Hierarchical, Online Clustering of Web Search Re-sults[J]. Lecture notes in computer science, 2004:69–78.
    140 K. Kummamuru, R. Lotlikar, S. Roy, et al. A Hierarchical MonotheticDocument Clustering Algorithm for Summarization and Browsing Search Re-sults[C]Proceedings of the 13th international conference on World Wide Web.2004:658–665.
    141 M. Ohta, H. Narita, S. Ohno. Overlapping Clustering Method Using Local andGlobal Importance of Feature Terms at Ntcir-4 Web Task[J]. Working Notes ofNTCIR (NII-NACSIS Test Collection for IR Systems)-4 Vol. Supl, 2004, 1:37–44.
    142 D. Crabtree, P. Andreae, X. Gao. Query Directed Web Page Cluster-ing[C]Proceedings of the 2006 IEEE/WIC/ACM International Conference on WebIntelligence. 2006:202–210.
    143 Z. Pawlak. Rough Sets[J]. International Journal of Parallel Programming, 1982,11(5):341–356.
    144 Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data[M]. KluwerAcademic Print on Demand, 1991.
    145 J. Komorowski, Z. Pawlak, L. Polkowski, et al. Rough Sets: A Tutorial[J]. Roughfuzzy hybridization: A new trend in decision-making, 1999:3–98.
    146 K. Funakoshi, T. Ho. A Rough Set Approach to Information Retrieval[J]. RoughSets in Knowledge Discovery, 1998:166–184.
    147 V. Raghavan, R. Sharma. A Framework Anda Prototype for Intelligent Organizationof Information[J]. Canadian Journal of Information Science, 1986, 11:88–101.
    148 P. Srinivasan. The Importance of Rough Approximation for Information Retrieval[J]. International Journal of Man-Machine Studies, 1991, 34(5):657–671.
    149 C. Ngo, H. Nguyen. A Method of Web Search Result Clustering Based on Rough Sets[C] Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on. 2005:673–679.
    150 A. Skowron, J. Stepaniuk. Generalized Approximation Spaces[J]. Soft Computing, Simulation Councils, San Diego, 1995:18–21.
    151 R. Slowinski, D. Vanderpooten. Similarity Relation as a Basis for Rough Approximations[J]. ICS Research Report, 1995, 53:95–113.
    152 S. Kawasaki, N. Nguyen, T. Ho. Hierarchical Document Clustering Based on Tolerance Rough Set Model[J]. Lecture Notes in Computer Science, 2000:458–463.
    153 P. Lingras. Rough Set Clustering for Web Mining[C] Proceedings of the 2002 IEEE International Conference on Fuzzy Systems. 2002, 2.
    154 A. An, Y. Huang, X. Huang, et al. Feature Selection with Rough Sets for Web Page Classification[J]. Lecture Notes in Computer Science, 2004, 3135:1–13.
    155 P. Kumar, P. Krishna, R. Bapi, et al. Rough Clustering of Sequential Data[J]. Data & Knowledge Engineering, 2007, 63(2):183–199.
    156 G. Miller, W. Charles. Contextual Correlates of Semantic Similarity[J]. Language and Cognitive Processes, 1991, 6(1):1–28.
    157 S. Dumais, T. Landauer. A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge[J]. Psychological Review, 1997, 104:211–240.
    158 G. Miller, R. Beckwith, C. Fellbaum, et al. Introduction to WordNet: An On-line Lexical Database[J]. International Journal of Lexicography, 1990, 3(4):235–244.
    159 P. Pantel, D. Lin. Discovering Word Senses from Text[C] Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002:613–619.
    160 T. Dunning. Accurate Methods for the Statistics of Surprise and Coincidence[J]. Computational Linguistics, 1993, 19(1):61–74.
    161 R. L. Cilibrasi, P. M. Vitanyi. The Google Similarity Distance[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(3):370–383.
    162 T. Pedersen, A. Kulkarni. Discovering Identities in Web Contexts with Unsupervised Clustering[C] Proceedings of the IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data. Hyderabad, India. 2007:23–30.
    163 R. Lindsey, V. Veksler, A. Grintsvayg, et al. Be Wary of What Your Computer Reads: The Effects of Corpus Selection on Measuring Semantic Relatedness[C] Proceedings of the Eighth International Conference on Cognitive Modeling, Oxford, UK. 2007:279–284.
    164 A. Jain, M. Murty, P. Flynn. Data Clustering: A Review[J]. ACM Computing Surveys, 1999, 31(3):264–323.
    165 A. Leuski, J. Allan. Improving Interactive Retrieval by Combining Ranked Lists and Clustering[C] Proceedings of RIAO. 2000:665–681.
    166 A. Leouski, W. Croft, M. U. A. D. O. C. SCIENCE. An Evaluation of Techniques for Clustering Search Results[M]. Citeseer, 2005.
    167 X.-H. Phan. CRFTagger: CRF English POS Tagger[C] http://crftagger.sourceforge.net/. 2006.
    168 M. H. Dunham. Data Mining: Introductory and Advanced Topics[M]. Prentice Hall PTR Upper Saddle River, NJ, USA, 2002.
    169 Z. Shi. Intelligent Agents and Applications[M]. Science Press, 2000. (in Chinese)
    170 R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases[C] Proceedings of the 20th International Conference on Very Large Data Bases. 1994:487–499.
    171 R. Agrawal, T. Imieliński, A. Swami. Mining Association Rules between Sets of Items in Large Databases[C] SIGMOD’93: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1993:207–216.
    172 F. Beil, M. Ester, X. Xu. Frequent Term-based Text Clustering[C] Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002:436–442.
    173 B. Fung, K. Wang, M. Ester. Hierarchical Document Clustering Using Frequent Itemsets[C] Proceedings of the SIAM International Conference on Data Mining. 2003.
    174 Y. Li, S. Chung, J. Holt. Text Document Clustering Based on Frequent Word Meaning Sequences[J]. Data & Knowledge Engineering, 2008, 64(1):381–404.
    175 N. Pasquier, Y. Bastide, R. Taouil, et al. Discovering Frequent Closed Itemsets for Association Rules[J]. Lecture Notes in Computer Science, 1999:398–416.
    176 R. Bayardo Jr. Efficiently Mining Long Patterns from Databases[C] Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. 1998:85–93.
    177 J. Han, J. Pei, Y. Yin, et al. Mining Frequent Patterns without Candidate Generation[C] Proc. of the ACM SIGMOD International Conference on Management of Data. Dallas, USA. 2000:1–12.
    178 J. Pei, J. Han, R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets[C] Proc. 2000 ACM-SIGMOD Int. Workshop Data Mining and Knowledge Discovery. 2000.
    179 J. Wang, J. Han, J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets[C] Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003:236–245.
    180 G. Grahne, J. Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets[C] Proceedings of the ICDM Workshop on Frequent Itemset Mining Implementations. 2003.
    181 Z. Zhang, J. Chen, X. Li. A Preprocessing Framework and Approach for Web Applications[J]. Journal of Web Engineering, 2004, 2:176–192.
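    182 M. Yu, T. Chen, H. Xu. Research and Design of a Block-based Web Page Information Parser[J]. Journal of Computer Applications, 2005, 25(4):974–976. (in Chinese)
    183 L. Li, J. Wang, H. Bai, et al. Research and Implementation of an FFT-based Web Page Main Text Extraction Algorithm[J]. Computer Engineering and Applications, 2007, 43(30):148–151. (in Chinese)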
    184 S. Gupta, G. Kaiser, D. Neistadt, et al. DOM-based Content Extraction of HTML Documents[C] Proceedings of the 12th International Conference on World Wide Web. 2003:207–214.
    185 S. Gupta, H. Becker, G. Kaiser, et al. Verifying Genre-based Clustering Approach to Content Extraction[C] Proceedings of the 15th International Conference on World Wide Web. 2006:875–876.
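    186 Q. Wang, S. Tang, D. Yang, et al. DOM-based Automatic Extraction of Topic Information from Web Pages[J]. Journal of Computer Research and Development, 2004, 41(10):1786–1792. (in Chinese)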
    187 S. Yu, D. Cai, J. Wen, et al. Improving Pseudo-relevance Feedback in Web Information Retrieval Using Web Page Segmentation[C] Proceedings of the 12th International Conference on World Wide Web. 2003:11–18.
    188 R. Song, H. Liu, J. Wen, et al. Learning Important Models for Web Page Blocks Based on Layout and Content Analysis[J]. ACM SIGKDD Explorations Newsletter, 2004, 6(2):14–23.
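    189 X. Han, K. Liu, J. Zhao. Discovering Main Content Blocks of Web Pages Based on Layout and Linguistic Features[J]. Journal of Chinese Information Processing, 2008, 22(1):15–21. (in Chinese)
    190 J. Wang. Web Page Parsing and Content Extraction Based on Content Units[D]. Harbin: Harbin Institute of Technology, 2008:51–59. (in Chinese)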
    191 L. Shangjian. A Composite Approach to Language/Encoding Detection[C] 19th International Unicode Conference. San Jose California: State University. 2001.
    192 S. Brin, L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7):107–117.
    193 J. Kleinberg. Authoritative Sources in a Hyperlinked Environment[J]. Journal of the ACM, 1999, 46(5):604–632.
    194 T. Haveliwala. Topic-sensitive PageRank: A Context-sensitive Ranking Algorithm for Web Search[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(4):784–796.
    195 K. Bharat, G. Mihaila. Hilltop: A Search Engine Based on Expert Documents[C] Proc. of the 9th International WWW Conference (Poster). 2000.
    196 R. Lempel, S. Moran. The Stochastic Approach for Link-structure Analysis (SALSA) and the TKC Effect[J]. Computer Networks, 2000, 33(6):387–401.
    197 R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern Information Retrieval[M]. Addison-Wesley Harlow, England, 1999.
    198 J. Ponte, W. Croft. A Language Modeling Approach to Information Retrieval[C] Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998:275–281.
    199 H. Turtle, W. Croft. Evaluation of an Inference Network-based Retrieval Model[J]. ACM Transactions on Information Systems (TOIS), 1991, 9(3):187–222.
    200 T. Strohman, D. Metzler, H. Turtle, et al. Indri: A Language-model Based Search Engine for Complex Queries (Extended Version)[J]. 2005.
    201 D. Metzler, W. Croft. Combining the Language Model and Inference Network Approaches to Retrieval[J]. Information Processing and Management, 2004, 40(5):735–750.
    202 H. P. Luhn. An Experiment in Auto-abstracting[C] Proceedings of International Conference on Scientific Information, Washington, DC.
    203 I. Mani, M. Maybury. Advances in Automatic Text Summarization[M]. MIT Press, 1999.
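    204 Y. Chen. Research on Clustering-based Automatic Summarization Methods[D]. Harbin: Harbin Institute of Technology, 2006. (in Chinese)
    205 Q. Chen. Research on Statistical Language Models Based on Rough Sets[D]. Harbin: Harbin Institute of Technology, 2003. (in Chinese)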
    206 L. Zhou, D. Zhang. NLPIR: A Theoretical Framework for Applying Natural Language Processing to Information Retrieval[J]. Journal of the American Society for Information Science and Technology, 2003, 54(2):115–123.
    207 A. Tombros. Advantages of Query Biased Summaries in Information Retrieval[C] Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998:2–10.
    208 T. Sakai, K. Sparck-Jones. Generic Summaries for Indexing in Information Retrieval[C] Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001:190–198.
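    209 T. Liu, K. Wang. Research on Automatic Summarization Based on Multi-level Dependency Structure of Discourse[J]. Journal of Computer Research and Development, 1999, 36(4):479–488. (in Chinese)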
    210 J. Reynar. An Automatic Method of Finding Topic Boundaries[C] Annual Meeting of the Association for Computational Linguistics. 1994, 32:331–333.
    211 C. Buckley, E. Voorhees. Evaluating Evaluation Measure Stability[C] Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2000:33–40.
