Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms

Authors: Jinyan Li ; Simon Fong ; Sabah Mohammed ; Jinan Fiaidhi
Keywords: Imbalanced biological data ; Medical classification ; Swarm algorithm ; Parameter optimization
Journal: The Journal of Supercomputing
Publication year: 2016
Publication date: October 2016
Volume: 72
Issue: 10
Pages: 3708-3728
Full-text size: 1,708 KB
Journal category: Computer Science
Journal subjects: Programming Languages, Compilers and Interpreters ; Processor Architectures ; Computer Science, general
Publisher: Springer Netherlands
ISSN: 1573-0484

Abstract

Classification, a popular supervised machine learning method, has many applications in computational biology, where data samples are automatically categorized into predefined labels with the aid of data mining. The training samples often contain very few instances of interest (e.g., medical anomalies, rare diseases in a population, and unusual syndromes) but many normal instances. Such an imbalanced ratio of data distributions among the target labels hampers the efficacy of classification algorithms, because the induced model has not been trained with a sufficient number of instances of the interesting label(s) and is instead overwhelmed by ordinary training records. Traditional remedies attempt to rebalance the data distributions of the target classes by artificially inflating the interesting instances, reducing the majority of common instances, or combining both. Though the fundamental concept is effective, there is no clear guideline on how to strike a balance between fabricating rare samples and reducing normal ones so as to maximize classification accuracy. In this paper, an optimization model using different swarm strategies (the Bat-inspired algorithm and PSO) is proposed for adaptively balancing the increase/decrease of the class distribution, depending on the properties of the biological datasets. The optimization is also extended to achieve the highest possible accuracy and Kappa statistic at the same time. The optimization model is tested on five imbalanced medical datasets, which are sourced from lung surgery logs and virtual screening of bioassay data. Computer simulation results show that the proposed optimization model outperforms other class balancing methods in medical data classification.
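
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the data are synthetic one-dimensional samples, minority inflation is done by random duplication (standing in for whatever synthetic oversampling the paper uses), the classifier is a simple threshold rule, the fitness is the sum of accuracy and Cohen's Kappa, and only a basic PSO is shown (the Bat-inspired variant is omitted). All function names and parameter ranges are hypothetical.

```python
import random

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return 0.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def rebalance(data, over, under, rng):
    """data: list of (x, label), label 1 = rare class.
    over: extra minority copies as a multiple of the minority size (assumed >= 0).
    under: fraction of majority samples kept (assumed 0 < under <= 1)."""
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    extra = [rng.choice(minority) for _ in range(int(over * len(minority)))]
    kept = rng.sample(majority, max(1, int(under * len(majority))))
    return minority + extra + kept

def fit_stump(train):
    """Pick the threshold t with the fewest training errors for 'predict 1 if x > t'."""
    candidates = sorted(x for x, _ in train)
    return min(candidates,
               key=lambda t: sum((x > t) != (lbl == 1) for x, lbl in train))

def make_data(rng, n_major, n_minor):
    """Synthetic stand-in for an imbalanced dataset (not the paper's data)."""
    return ([(rng.gauss(0.0, 1.0), 0) for _ in range(n_major)] +
            [(rng.gauss(2.0, 1.0), 1) for _ in range(n_minor)])

def make_fitness(train, valid):
    def fitness(params):
        over, under = params
        # Fixed seed so the same parameters always give the same score.
        balanced = rebalance(train, over, under, random.Random(0))
        t = fit_stump(balanced)
        y_true = [lbl for _, lbl in valid]
        y_pred = [int(x > t) for x, _ in valid]
        acc = sum(a == b for a, b in zip(y_true, y_pred)) / len(valid)
        # Assumption: combine the two objectives by simple addition.
        return acc + cohen_kappa(y_true, y_pred)
    return fitness

def pso(fitness, bounds, rng, n_particles=8, n_iter=15, w=0.7, c1=1.5, c2=1.5):
    """Basic global-best PSO that maximizes `fitness` inside box `bounds`."""
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clip the new position to the allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

data_rng = random.Random(42)
train = make_data(data_rng, 95, 5)   # 95 normal vs 5 rare training samples
valid = make_data(data_rng, 95, 5)
fitness = make_fitness(train, valid)
bounds = [(0.0, 20.0), (0.1, 1.0)]   # hypothetical ranges for (over, under)
best, best_fit = pso(fitness, bounds, random.Random(1))
print("over=%.2f under=%.2f fitness=%.3f" % (best[0], best[1], best_fit))
```

Including Kappa in the fitness matters on imbalanced data: a model that labels every sample as normal can reach high accuracy while its Kappa is zero, so accuracy alone would not reward finding the rare class.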
