Geographical localization of web domains and organization addresses recognition by employing natural language processing, Pattern Matching and clustering
Abstract
The World Wide Web keeps growing at an increasing rate, and the resources available online represent a large source of knowledge for various business and research interests. For instance, in recent years increasing attention has been focused on retrieving information about the geographical location of places and entities, much of which is contained in web pages and documents. However, such resources come in a wide variety of generally unstructured formats, which makes it hard for end users to find the information they need. The automatic annotation and comprehension of toponyms, location names and addresses (at different resolution and granularity levels) can deliver significant benefits to the whole web community by improving search engine filtering capabilities and intelligent data mining systems. The present paper addresses the problem of gathering geographical information from unstructured text in web pages and documents. Specifically, the proposed method aims at extracting the geographical location (at street number resolution) of commercial companies and services by annotating geo-related information from their web domains. The annotation process is based on Natural Language Processing (NLP) techniques for text comprehension, and relies on Pattern Matching and Hierarchical Cluster Analysis for recognizing and disambiguating geographical entities. Geotagging performance has been assessed by evaluating the Precision, Recall and F-Measure of the system output (represented as semantic RDF triples) against both a geo-annotated reference database and a semantic Smart City repository.
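To make the pattern matching and evaluation steps concrete, the following is a minimal sketch and not the authors' system. It assumes a simplified English-style address format ("number, capitalized street name, street type"), and the names `ADDRESS_RE`, `extract_addresses` and `precision_recall_f1` are hypothetical. The NLP annotation and hierarchical clustering stages are left out, and the measures are computed over sets of exact-match address strings, which may differ from the matching criterion used in the paper.

```python
import re

# Hypothetical pattern (an assumption, not taken from the paper):
# house number, one or more capitalized words, then a street type.
ADDRESS_RE = re.compile(
    r"\b(\d{1,5})\s+((?:[A-Z][a-z]+\s)+)(Street|St|Avenue|Ave|Road|Rd)\b"
)


def extract_addresses(text):
    """Return the set of address strings matched in the text."""
    return {
        f"{m.group(1)} {m.group(2).strip()} {m.group(3)}"
        for m in ADDRESS_RE.finditer(text)
    }


def precision_recall_f1(predicted, reference):
    """Compute Precision, Recall and F-Measure over two sets of strings."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    page_text = "Visit us at 12 Baker Street or 221 Oxford Road."
    reference = {"12 Baker Street", "5 High Street"}  # illustrative data only
    found = extract_addresses(page_text)
    print(found, precision_recall_f1(found, reference))
```

In a fuller pipeline, candidate addresses would typically be disambiguated (for example by clustering their candidate coordinates) before being compared with the reference database.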