Indexing Text Documents for Fast Evaluation of Regular Expressions.

Details

Author: Chen, Ting
Degree: Doctor
Year: 2012
Advisor: Doan, AnHai (advisor, committee member); Cai, Jin-Yi (committee member); Naughton, Jeffrey (committee member); Patel, Jignesh (committee member); Re, Christopher (committee member)
Institution: The University of Wisconsin
Department: Computer Sciences
ISBN: 9781267587015
CBH: 3524345
Country: USA
Language: English
FileSize: 1018530
Pages: 106

Abstract

Fast regular expression (regex) evaluation over text documents is a fundamental operation in numerous text-centric applications, such as information extraction, search, data mining, exploratory data analysis, and business intelligence. To support this operation, current work builds an inverted index for the k-grams in the documents. Given a regex R, the work analyzes R to infer k-grams that must be present in R, then uses the inverted index to quickly locate documents that are likely to match R. In this dissertation we significantly advance the above state of the art. First, we develop a new method to build a k-gram inverted index that takes far less time, works even when the set of k-grams considered does not fit into memory, and can handle the so-called "zero document" cases. Our index is also "space aware", in that it can make use of extra disk space, if any. Second, we index not just k-grams, but also distance-based d-grams and transformed versions of the documents. Third, we show how to analyze a given regex at a much deeper level (than possible in current work) to derive more properties that we can then use to query the index. Taken together, these advances significantly reduce regex evaluation time, as demonstrated with extensive experiments over two real-world data sets. Finally, we develop a novel incremental index update method that greatly improves the index update efficiency. Our index update method can be used in applications that operate on dynamic text corpora.
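
The baseline approach the abstract attributes to current work (a k-gram inverted index used to filter documents before running the regex) can be illustrated with a minimal Python sketch. This is not the dissertation's method: the function names, the choice of k = 3, and the sample documents are all hypothetical, and the required k-grams are supplied by hand rather than inferred from the regex, since the regex analysis step is not described in this record.

```python
import re
from collections import defaultdict


def build_kgram_index(docs, k=3):
    """Map each k-gram to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text) - k + 1):
            index[text[i:i + k]].add(doc_id)
    return index


def candidate_docs(index, required_kgrams, num_docs):
    """Intersect posting lists; with no required k-grams, every document is a candidate."""
    candidates = set(range(num_docs))
    for gram in required_kgrams:
        candidates &= index.get(gram, set())
    return candidates


def regex_search(docs, index, pattern, required_kgrams):
    """Filter with the index, then verify each surviving candidate with the real regex."""
    compiled = re.compile(pattern)
    hits = candidate_docs(index, required_kgrams, len(docs))
    return sorted(d for d in hits if compiled.search(docs[d]))


# Hypothetical sample corpus.
docs = [
    "contact: alice@example.com",
    "no address here",
    "mail bob@example.org today",
    "see example.com for details",
]
index = build_kgram_index(docs, k=3)

# Hand-derived assumption: every match of this pattern contains "@ex" and "exa".
matches = regex_search(docs, index, r"\w+@example\.(com|org)", ["@ex", "exa"])
print(matches)  # [0, 2]
```

The index only narrows the candidate set; the regex itself is still run on every surviving document, so the filter needs to be sound (never drop a true match) but not exact.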
