文摘
Fast regular expression regex) evaluation over text documents is a fundamental operation in numerous text-centric applications,such as information extraction,search,data mining,exploratory data analysis,and business intelligence. To support this operation,current work builds an inverted index for the k-grams in the documents. Given a regex R,the work analyzes R to infer k-grams that must be present in R,then uses the inverted index to quickly locate documents that are likely to match R. In this dissertation we significantly advance the above state of the art. First,we develop a new method to build k-gram inverted index that takes far less time,works even when the set of k-grams considered does not fit into memory,and can handle the so-called "zero document" cases. Our index is also "space aware",in that it can make use of extra disk space,if any. Second,we index not just k-grams,but also distance based d-grams and transformed versions of the documents. Third,we show how to analyze a given regex at a much deeper level than possible in current work) to derive more properties that we can then use to query the index. Taken together,these advances significantly reduce regex evaluation time,as demonstrated with extensive experiments over two real-world data sets. Finally,we develop a novel incremental index update method that greatly improves the index update efficiency. Our index update method can be used in applications that operate on dynamic text corpora.