用户名: 密码: 验证码:
A Probabilistic Model for Clustering Text Documents with Multiple Fields
详细信息    查看全文
  • 作者:Shanfeng Zhu ; Ichigaku Takigawa ; Shuqin Zhang ; Hiroshi Mamitsuka
  • 刊名:Lecture Notes in Computer Science
  • 出版年:2007
  • 出版时间:2007
  • 年:2007
  • 卷:4425
  • 期:1
  • 页码:331-342
  • 全文大小:425 KB
  • 刊物类别:Computer Science
  • 刊物主题:Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • 出版者:Springer Berlin / Heidelberg
  • ISSN:1611-3349
文摘
We address the problem of clustering documents with multiple fields, such as scientific literature with the distinct fields: title, abstract, keywords, main text and references. By taking into consideration of the distinct word distributions of each field, we propose a new probabilistic model, Field Independent Clustering Model (FICM), for clustering documents with multiple fields. The benefits of FICM come not only from integrating the discrimination abilities of each field but also from the power of selecting the most suitable component probabilistic model for each field. We examined the performance of FICM on the problem of clustering biomedical documents with three fields (title, abstract and MeSH). From the genomics track data of TREC 2004 and TREC 2005, we randomly generated 60 datasets where the number of classes in each dataset ranged from 3 to 12. By applying the appropriate configuration of generative models for each field, FICM outperformed a classical multinomial model in 59 out of the total 60 datasets, of which 47 were statistically significant at the 95% level, and FICM also outperformed a multivariate Bernoulli model in 52 out of the total 60 datasets, of which 36 were statistically significant at the 95% level.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700