Indexing Text Documents for Fast Evaluation of Regular Expressions.
Details
  • Author: Chen, Ting
  • Degree: Doctor
  • Year: 2012
  • Advisor and committee: Doan, AnHai (advisor); Cai, Jin-Yi (committee member); Naughton, Jeffrey (committee member); Patel, Jignesh (committee member); Re, Christopher (committee member); Doan, AnHai (committee member)
  • Institution: The University of Wisconsin
  • Department:Computer Sciences.
  • ISBN:9781267587015
  • CBH:3524345
  • Country:USA
  • Language: English
  • FileSize:1018530
  • Pages:106
Abstract
Fast regular expression (regex) evaluation over text documents is a fundamental operation in numerous text-centric applications, such as information extraction, search, data mining, exploratory data analysis, and business intelligence. To support this operation, current work builds an inverted index for the k-grams in the documents. Given a regex R, the work analyzes R to infer k-grams that must be present in R, then uses the inverted index to quickly locate documents that are likely to match R. In this dissertation we significantly advance the above state of the art. First, we develop a new method to build a k-gram inverted index that takes far less time, works even when the set of k-grams considered does not fit into memory, and can handle the so-called "zero document" cases. Our index is also "space aware", in that it can make use of extra disk space, if any. Second, we index not just k-grams, but also distance-based d-grams and transformed versions of the documents. Third, we show how to analyze a given regex at a much deeper level (than possible in current work) to derive more properties that we can then use to query the index. Taken together, these advances significantly reduce regex evaluation time, as demonstrated with extensive experiments over two real-world data sets. Finally, we develop a novel incremental index update method that greatly improves the index update efficiency. Our index update method can be used in applications that operate on dynamic text corpora.
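
The baseline approach that the abstract attributes to current work (a k-gram inverted index used to shortlist documents before running the regex) might be sketched roughly as follows. This is only an illustration and not the dissertation's method: the function names `kgrams`, `build_index`, and `search` are hypothetical, the index is held entirely in memory, and the required literal is supplied by hand rather than inferred by analyzing the regex.

```python
import re
from collections import defaultdict


def kgrams(text, k):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_index(docs, k):
    """Map each k-gram to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for gram in kgrams(text, k):
            index[gram].add(doc_id)
    return index


def search(docs, index, k, pattern, required_literal):
    """Shortlist documents with the index, then confirm with the real regex.

    Assumption: required_literal is a substring that every match of
    pattern must contain. It is given by the caller here; deriving it
    automatically from the regex is outside this sketch.
    """
    grams = kgrams(required_literal, k)
    if grams:
        # A matching document must contain every k-gram of the literal.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Literal shorter than k: the index cannot filter, so scan everything.
        candidates = set(range(len(docs)))
    regex = re.compile(pattern)
    return sorted(d for d in candidates if regex.search(docs[d]))


docs = ["error code 404", "status ok", "fatal error 500", "errand boy"]
index = build_index(docs, 3)
hits = search(docs, index, 3, r"error \d+", "error ")
```

In this toy corpus the index narrows four documents down to two candidates (those containing every 3-gram of "error "), and the regex then runs only on those two. The shortlist is a superset of the true matches, which is why the final regex check is still needed.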
