Application of the distributed document representation in the authorship attribution task for small corpora

Authors: Juan-Pablo Posadas-Durán ; Helena Gómez-Adorno ; Grigori Sidorov…
Keywords: Distributed representation ; Authorship attribution ; Author identification ; Embeddings ; Word embeddings ; Stylometry ; Machine learning ; SVM ; Scarce training data
Journal: Soft Computing
Year: 2017
Publication date: February 2017
Volume: 21
Issue: 3
Pages: 627-639
Journal category: Engineering
Journal subjects: Computational Intelligence; Artificial Intelligence (incl. Robotics); Mathematical Logic and Foundations; Control, Robotics, Mechatronics
Publisher: Springer Berlin Heidelberg
ISSN: 1433-7479

Abstract

Distributed word representation in a vector space (word embeddings) is a novel technique that represents each word in terms of the elements in its neighborhood. Distributed representations can be extended to larger language structures such as phrases, sentences, paragraphs, and documents. Their capacity to encode the semantic information of texts and to handle high-dimensional datasets explains why they are widely used in natural language processing tasks such as text summarization, sentiment analysis, and syntactic parsing. In this paper, we propose using the distributed representation at the document level to solve the authorship attribution task. The proposed method learns distributed vector representations at the document level and then uses an SVM classifier to perform automatic authorship attribution. We also propose using word n-grams (instead of words) as the input data type for learning the distributed representation model. We conducted experiments on six datasets used in state-of-the-art works, and for the majority of the datasets we obtained comparable or better results. Our best results were obtained using the combination of words and word n-grams as the input data types. Training data are relatively scarce, but this did not affect the distributed representation.
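
The abstract mentions using word n-grams, alone or combined with single words, as the input tokens for the representation model. A minimal Python sketch of how such input tokens might be built is given here. It is an illustration only, not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the underscore used to join words are all assumptions.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return contiguous word n-grams, each joined into one token.

    Joining with "_" is an assumption made for this sketch.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(text: str, max_n: int = 2) -> List[str]:
    """Combine single words with word n-grams up to max_n.

    Tokenization here is plain lowercasing and whitespace splitting,
    which is a simplification chosen for the example.
    """
    tokens = text.lower().split()
    features: List[str] = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features
```

Each document's token list could then be passed to a document-embedding model, and the resulting vectors to an SVM classifier, in place of the plain word sequence.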
