Identification of Multi-Focal Questions in Question and Answer Reports


Authors: Mona Mohamed Zaki Ali (18) (19)
Goran Nenadic (18)
Babis Theodoulidis (20)
Keywords: Question Classification; Question Analysis; Content Analysis; Data Quality; Text Mining; Data Mining; Machine Learning; Rule-based Methods
Publication: Lecture Notes in Computer Science
Year: 2014
Volume: 8455
Issue: 1
Pages: 126-137
Author affiliations:
18. School of Computer Science, The University of Manchester, Manchester, UK
19. Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt
20. Manchester Business School, The University of Manchester, Manchester, UK
ISSN: 1611-3349

Abstract

A significant amount of business and scientific data is collected via question and answer reports. However, these reports often suffer from various data quality issues. In many cases, questionnaires contain questions that require multiple answers, which we argue is a potential source of poor-quality answers. This paper introduces multi-focal questions and proposes a model for identifying them. The model consists of three phases: question pre-processing, feature engineering, and question classification. We use six types of features: lexical/surface features; Part-of-Speech; readability; question structure, wording and placement features; question response type and format features; and question focus. A comparative study of three machine learning algorithms (Bayes Net, Decision Tree and Support Vector Machine) is performed on a dataset of 150 questions obtained from the Carbon Disclosure Project, achieving an accuracy of 91%.
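
The three phases named in the abstract (pre-processing, feature engineering, classification) can be illustrated with a minimal Python sketch. This is not the authors' model: the function names, the word list, the handful of surface counts, and the hand-written decision rule are all assumptions chosen for illustration, and the rule stands in for the trained classifiers (Bayes Net, Decision Tree, Support Vector Machine) that the paper compares.

```python
import re

# Hypothetical word list; the paper's actual feature definitions are not given here.
WH_WORDS = {"what", "which", "who", "whom", "whose", "when", "where", "why", "how"}


def preprocess(question: str) -> list:
    """Phase 1 (assumed): lowercase the text and keep word tokens and '?' marks."""
    return re.findall(r"[a-z']+|\?", question.lower())


def extract_features(tokens: list) -> dict:
    """Phase 2 (assumed): a few lexical/surface counts, far fewer than the paper's six feature types."""
    return {
        "n_tokens": sum(1 for t in tokens if t != "?"),
        "n_question_marks": tokens.count("?"),
        "n_wh_words": sum(1 for t in tokens if t in WH_WORDS),
    }


def is_multi_focal(features: dict) -> bool:
    """Phase 3 (assumed): a hand-written rule standing in for a trained classifier."""
    return features["n_question_marks"] > 1 or features["n_wh_words"] > 1


def classify(question: str) -> bool:
    """Run the three phases in order and return True for a suspected multi-focal question."""
    return is_multi_focal(extract_features(preprocess(question)))


if __name__ == "__main__":
    for q in [
        "What is your total emissions figure?",
        "What are your emission targets, and how do you monitor progress?",
    ]:
        print(classify(q), "-", q)
```

A rule this simple will misfire on many real questions; in a setting like the one the abstract describes, the feature dictionary would instead be passed to a classifier trained on labelled questions.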
