
Over-sampling ensemble algorithm based on margin theory

計(jì)算機(jī)應(yīng)用 2019年5期

張宗堂 陳喆 戴衛國

Abstract: To address the problem that traditional ensemble algorithms are not suitable for imbalanced data classification, an over-sampling AdaBoost algorithm based on margin theory (MOSBoost) is proposed. Firstly, the margins of the original samples are obtained by pre-training. Then, the minority-class samples are heuristically duplicated according to their margin ranking, forming a new balanced sample set. Finally, the balanced sample set is used to train AdaBoost, yielding the final ensemble classifier. Experiments were carried out on UCI data sets, and the F-measure and G-mean criteria were used to evaluate four algorithms: MOSBoost, AdaBoost, Random Over-Sampling AdaBoost (ROSBoost) and Random Under-Sampling AdaBoost (RDSBoost). The experimental results show that MOSBoost outperforms the other three algorithms; compared with AdaBoost, MOSBoost improves F-measure and G-mean by 8.4% and 6.2% respectively.

Key words: imbalanced data; margin theory; over-sampling method; ensemble classifier; machine learning

CLC number: TP181

Document code: A

0 Introduction

In recent years, imbalanced data classification has become a hot topic in machine learning. It arises widely in real-world applications such as e-mail filtering [1], image classification [2], software defect prediction [3], medical diagnosis [4] and gene data analysis [5]. In a binary classification problem, the number of majority-class samples in imbalanced data far exceeds that of minority-class samples. Traditional classification methods aim at overall classification accuracy and ignore the imbalance between classes, which lowers the classification accuracy on the minority class; yet minority-class samples are often the more valuable ones, so the cost of misclassifying them is high.

Methods for handling imbalanced data fall roughly into the algorithm level and the data level. Algorithm-level methods construct new algorithms or modify existing ones so that they are biased towards the minority class; data-level methods mainly use resampling to obtain a balanced sample set, which is then classified with existing classifiers. Resampling methods, including under-sampling and over-sampling, are simple in form and do not affect classifier design, and have therefore been studied extensively. According to the strategy adopted, resampling can be further divided into random sampling and heuristic sampling: random sampling simply deletes or adds samples at random without using any information in the data, whereas heuristic sampling exploits the internal characteristics of the data. Typical heuristic under-sampling methods such as Tomek links [6], one-sided selection [7] and the neighborhood cleaning rule [8] overcome the tendency of random under-sampling to discard useful information and improve performance to some extent. The most representative heuristic over-sampling method is SMOTE (Synthetic Minority Over-sampling TEchnique) [9] and its improved variants [10-12]. The basic assumption of SMOTE is that the convex set spanned by neighboring data points of the same class also belongs to that class. Heuristic resampling methods essentially screen samples under some criterion and thus depend heavily on the data set; however, imbalanced data sets often exhibit within-class imbalance, small disjuncts and high noise, which makes the criterion hard to satisfy and degrades performance. On the surface this is a mismatch between the data set and the criterion; fundamentally, these methods lack a theoretical foundation and generalize poorly.
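
To make the interpolation assumption behind SMOTE concrete, the minimal Python sketch below (NumPy only; the function name and the parameter k are illustrative choices, not taken from the cited work) generates one synthetic minority sample on the line segment between a minority sample and one of its k nearest minority-class neighbours:

import numpy as np

def smote_like_sample(X_min, k=5, rng=np.random.default_rng(0)):
    # X_min: (n, d) array containing only minority-class samples, with n > k.
    i = rng.integers(len(X_min))              # pick a minority sample at random
    x = X_min[i]
    dist = np.linalg.norm(X_min - x, axis=1)  # distances to all minority samples
    neighbours = np.argsort(dist)[1:k + 1]    # its k nearest minority neighbours (index 0 is x itself)
    j = rng.choice(neighbours)                # choose one neighbour at random
    gap = rng.random()                        # interpolation coefficient in [0, 1)
    return x + gap * (X_min[j] - x)           # synthetic point on the segment from x to the neighbour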

AdaBoost is a classical ensemble classification algorithm with wide applications in machine learning [13-15]. Because AdaBoost minimizes the overall classification error and ignores the imbalance between classes, it is not suitable for imbalanced data classification. Margin theory is an important theoretical foundation of AdaBoost and successfully explains phenomena such as AdaBoost's resistance to overfitting. Starting from margin theory, this paper defines minority-class margins and majority-class margins, screens the minority-class samples by the sign of their margins, and heuristically duplicates the minority-class samples with positive margins to form a new balanced sample set. Training AdaBoost on this set yields the MOSBoost algorithm, which improves classification performance on imbalanced data.
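
The margin definition and the exact duplication rule are developed in the following sections; purely as a reading aid, here is a minimal Python sketch of the three steps just described (pre-train, screen minority samples by margin sign, duplicate to balance). The function name margin_oversample, the use of scikit-learn's AdaBoostClassifier for pre-training, and the choice to cycle through the positive-margin minority samples in descending margin order are assumptions made for illustration, not the paper's exact procedure.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def margin_oversample(X, y, minority_label):
    # Step 1: pre-train an ensemble to obtain a margin for every sample.
    pre = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
    score = pre.decision_function(X)              # > 0 means the vote favours pre.classes_[1]
    y_signed = np.where(y == pre.classes_[1], 1, -1)
    margin = y_signed * score                     # positive iff the pre-trained vote is correct
    # Step 2: keep only minority samples with positive margin, largest margin first.
    min_idx = np.where(y == minority_label)[0]
    candidates = min_idx[margin[min_idx] > 0]
    candidates = candidates[np.argsort(-margin[candidates])]
    # Step 3: copy them cyclically until both classes have the same number of samples.
    need = int(np.sum(y != minority_label)) - len(min_idx)
    if need <= 0 or len(candidates) == 0:
        return X, y
    copies = np.tile(candidates, int(np.ceil(need / len(candidates))))[:need]
    return np.vstack([X, X[copies]]), np.concatenate([y, y[copies]])

The balanced arrays returned here would then be passed to a fresh AdaBoost training run, mirroring the final step described above.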

1 Related work

1.1 AdaBoost algorithm

AdaBoost takes a training set {(x1, y1), (x2, y2), …, (xN, yN)} as input, where xi is a sample and yi is its class label; for binary classification, yi ∈ {-1, 1}. A given base learning algorithm is then run repeatedly over rounds t = 1, 2, …, T. Dt(i) denotes the weight of the i-th training sample in round t. The task of the base learning algorithm is to produce, under the weight distribution Dt, a base classifier ht that minimizes the classification error. Once ht is trained, AdaBoost chooses a parameter αt ∈ R that measures the performance of ht, and then updates the weight distribution Dt. The final ensemble classifier F is the weighted output of the T base classifiers. The procedure is shown in Algorithm 1.
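
Algorithm 1 appears later in the paper; as a compact illustration of the procedure just described, the following Python sketch implements textbook binary AdaBoost with decision stumps as base classifiers and the standard choice αt = 0.5·ln((1-εt)/εt). These specifics are the generic algorithm, not necessarily the paper's exact listing; X and y are assumed to be NumPy arrays with y in {-1, +1}.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50):
    # y must hold numeric labels in {-1, +1}.
    N = len(y)
    D = np.full(N, 1.0 / N)                       # D_1(i) = 1/N: start from uniform weights
    classifiers, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t measures how good h_t is
        D = D * np.exp(-alpha * y * pred)         # raise the weights of misclassified samples
        D /= D.sum()                              # renormalise D_{t+1} to a distribution
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    # Final classifier F: sign of the alpha-weighted vote of the T base classifiers.
    votes = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(votes)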

References

[1] DAI H L. Class imbalance learning via a fuzzy total margin based support vector machine[J]. Applied Soft Computing, 2015, 31(C): 172-184.

[2] 譚潔帆,朱焱,陳同孝,等.基于卷積神經網絡和代價敏感的不平衡圖像分類方法[J].計算機應用,2018,38(7):1862-1865,1871.(TAN J F, ZHU Y, CHEN T X, et al. Imbalanced image classification approach based on convolution neural network and cost-sensitivity[J]. Journal of Computer Applications, 2018, 38(7): 1862-1865, 1871.)

[3] WANG S, YAO X. Using class imbalance learning for software defect prediction[J]. IEEE Transactions on Reliability, 2013, 62(2): 434-443.

[4] OZCIFT A, GULTEN A. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms[J]. Computer Methods and Programs in Biomedicine, 2011, 104(3): 443-451.

[5] YU H, NI J, ZHAO J. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data[J]. Neurocomputing, 2013, 101: 309-318.

[6] TOMEK I. Two modifications of CNN[J]. IEEE Transactions on Systems, Man and Cybernetics, 1976, SMC-6(11): 769-772.

[7] KUBAT M, MATWIN S. Addressing the curse of imbalanced training sets: one-sided selection[C]// Proceedings of the 14th International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1997: 179-186.

[8] LAURIKKALA J. Improving identification of difficult small classes by balancing class distribution[C]// Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe. Berlin: Springer, 2001: 63-66.

[9] CHAWLA N, BOWYER K, HALL L, et al. SMOTE: synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.

[10] RIVERA W A. Noise reduction a priori synthetic over-sampling for class imbalanced data sets[J]. Information Sciences, 2017, 408(C): 146-161.

[11] MA L, FAN S. CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests[J]. BMC Bioinformatics, 2017, 18(1): 169.

[12] BOROWSKA K, STEPANIUK J. Imbalanced data classification: a novel resampling approach combining versatile improved SMOTE and rough sets[C]// CISIM 2016: IFIP International Conference on Computer Information Systems and Industrial Management. Berlin: Springer, 2016: 31-42.

[13] BAIG M M, AWAIS M M, EL-ALFY E S M. AdaBoost-based artificial neural network learning[J]. Neurocomputing, 2017, 248(C): 120-126.

[14] MINZ A, MAHOBIYA C. MR image classification using Adaboost for brain tumor type[C]// Proceedings of the 2017 IEEE 7th International Advance Computing Conference. Washington, DC: IEEE Computer Society, 2017:701-705.

[15] 王軍,費凱,程勇.基于改進的Adaboost-BP模型在降水中的預測[J].計算機應用,2017,37(9):2689-2693.(WANG J, FEI K, CHENG Y. Prediction of rainfall based on improved Adaboost-BP model[J]. Journal of Computer Applications, 2017, 37(9): 2689-2693.)

[16] SCHAPIRE R E, FREUND Y, BARTLETT P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods[J]. Annals of Statistics, 1998, 26(5): 1651-1686.

[17] GAO W, ZHOU Z H. On the doubt about margin explanation of boosting[J]. Artificial Intelligence, 2013,203:1-18.

[18] BACHE K, LICHMAN M. UCI repository of machine learning databases[DB/OL].[2018-06-20].http://www.ics.uci.edu/~mlearn/MLRepository.html.

[19] van HULSE J, KHOSHGOFTAAR T M, NAPOLITANO A. Experimental perspectives on learning from imbalanced data[C]// Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007: 935-942.

[20] LIU N, WEI L W, AUNG Z. Handling class imbalance in customer behavior prediction[C]// Proceedings of the 2014 International Conference on Collaboration Technologies and Systems. Piscataway, NJ: IEEE, 2014: 100-103.
