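YU Bo, LI Haifeng, MA Lin
(1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150080, China; 2 Software College, Harbin University of Science and Technology, Harbin 150040, China)
Abstract: The support vector machine (SVM) is biased when the training samples are imbalanced, so the upper bound on the misclassification rate of the minority class is higher than that of the majority class. This paper reviews the current approaches to this problem and points out their shortcomings, then analyzes how imbalanced samples bias the SVM recognition rate. Taking global sample information into account, three methods are proposed that use the spatial distribution distance information of all samples. Experiments on UCI data sets show that MSEDR-SVM (Mean Sample Euclidean Distance Ratio-SVM) effectively increases the F-value of the minority class, thereby easing the limitation that the standard SVM builds its classification hyperplane from the support vectors alone.
Keywords: SVM; imbalanced sample distribution; MSEDR-SVM
CLC number: TP309 Document code: A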
Support Vector Machine based on the sample spatial distance
YU Bo1,2, LI Haifeng1, MA Lin1
(1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150080, China; 2 Software College, Harbin University of Science and Technology, Harbin 150040, China)
Abstract: When a support vector machine (SVM) is trained on an imbalanced sample distribution, it is biased: the upper bound on the misclassification rate of the minority class exceeds that of the majority class. Existing solutions to this problem are analyzed and summarized, and their shortcomings are pointed out. The paper then analyzes how an imbalanced sample distribution biases classification accuracy. Considering the information in the whole sample set, three methods based on the spatial distribution distance of the samples are proposed. Experiments on UCI data sets verify that the new classifier MSEDR-SVM (Mean Sample Euclidean Distance Ratio-SVM) effectively increases the F-value of the minority class. The method eases the limitation that the standard support vector machine relies only on support vectors to construct the classification hyperplane.
Key words: Support Vector Machine; imbalanced sample distribution; MSEDR-SVM
0 Introduction
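In 1995, Vapnik first introduced the support vector machine (SVM) [1]. The SVM is built on the VC dimension of statistical learning theory and on the principle of structural risk minimization; it generalizes well and can therefore cope with small samples, nonlinearity, and the curse of dimensionality. SVMs are widely applied, for example in speech recognition [2], EEG recognition [3], disease detection [4-5], and fault detection [6-7]. Across these applications, the SVM usually performs well when the classes contain roughly equal numbers of samples. In practice, however, the class distribution is asymmetric in most cases, and correctly recognizing the minority class (e.g., disease or fault) is especially important. For example, misclassifying a cancer patient as healthy costs more than misdiagnosing a healthy person. For applications that depend on the recognition accuracy of the minority class, the recognition problem for imbalanced sample distributions is therefore of clear practical importance. On an imbalanced sample distribution, however, the standard SVM is biased, so the minority class suffers a higher misclassification rate. Several solutions for SVMs on imbalanced sample distributions have been proposed; they are discussed below.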
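The first category of methods for addressing the bias resamples the training samples. The main idea is to turn the imbalanced sample distribution into a balanced one. Two strategies are available:
1) Over-sampling: add synthetic samples to the minority class. Chawla et al. proposed SMOTE (Synthetic Minority Over-sampling Technique) [8], which inserts synthetic samples between each minority-class sample and its nearest neighbors, thereby increasing the number of minority-class samples. Wu et al. proposed generating new samples through a genetic crossover operation to offset the adverse effect of the imbalance [9]. However, it is hard to guarantee that the synthetic samples added by over-sampling follow the same distribution as the original samples. As a result, these methods merely add redundant samples and may cause overfitting.
2) Under-sampling: reduce the number of majority-class samples. Kubat et al. proposed one-sided selection, which removes noisy and redundant majority-class samples [10] and thereby narrows the gap between the two classes. Under-sampling, however, sacrifices the randomness of the samples and discards information that was originally valuable. Because the original sample information is not fully used, the orientation of the classification hyperplane may change.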
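The two resampling strategies above can be sketched as follows. This is a minimal illustration and not the implementation of [8] or [10]: `smote_like` interpolates between a randomly chosen minority sample and one of its `k` nearest minority neighbors, and `random_undersample` uses plain random selection rather than one-sided selection. The function names, the toy data, and the parameters `k` and `seed` are assumptions made for the example.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (plain random
    under-sampling, not the one-sided selection of [10])."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Hypothetical toy data: 4 minority points, 20 majority points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]

new_points = smote_like(minority, n_new=6, k=2)
kept = random_undersample(majority, n_keep=len(minority))
```

Every synthetic point lies on a segment between two existing minority samples, which illustrates the concern raised above: interpolation adds no information beyond what the original minority samples already carry.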
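The second category of methods modifies the standard support vector machine. Veropoulos et al. improved the standard SVM by assigning a different penalty factor to each of the two classes [11]. To counter the imbalance, Chew et al. proposed setting the two penalty factors inversely proportional to the class sizes, which reduces the influence of the imbalanced distribution on the classifier [12]. Reference [12] proposes a new method that sets the parameter C of the standard SVM by combining weight balancing and sampling balancing [10]; it performs well in the extreme case where only minority-class samples are present. These improvements mainly act on the classification parameters and do not fundamentally resolve the bias of the SVM. Liu proposed the mean distance ratio method (MDR: Method of Average Distance Ratio) [13]. Although MDR further improves the SVM, its main limitation is that it considers only the spatial distance from the support vectors to the hyperplane and ignores the interior points (non-support vectors).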
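As a rough illustration of class-dependent penalty factors, the sketch below computes two penalties whose ratio is the inverse of the class-size ratio, so that slack on minority-class samples is penalized more heavily. The overall scaling (`c * total / (2 * n)`) is an assumption chosen for the example; the text only states that the penalties are inversely proportional to the class sizes. The function name and the sample counts are hypothetical.

```python
def class_penalties(n_minority, n_majority, c=1.0):
    """Return (C_minority, C_majority) with
    C_minority / C_majority = n_majority / n_minority.
    The normalisation by 2 * n is an illustrative choice."""
    total = n_minority + n_majority
    c_minority = c * total / (2 * n_minority)
    c_majority = c * total / (2 * n_majority)
    return c_minority, c_majority


# Hypothetical data set: 10 minority samples, 90 majority samples.
c_min, c_maj = class_penalties(n_minority=10, n_majority=90)
```

With these counts the minority penalty is nine times the majority penalty. This changes the cost of errors but, as noted above, the hyperplane is still determined by the support vectors alone.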
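In summary, none of the above methods makes full use of the distribution information of all samples; they construct the classification hyperplane from the support vectors alone. For imbalanced sample distributions, the SVM needs to be improved and a more effective method is required. To this end, this paper proposes a support vector machine based on the spatial distribution information of the samples, which considers the distance from every sample to the hyperplane in Euclidean space. Based on the Mean Sample Euclidean Distance Ratio (MSEDR), a new classifier, MSEDR-SVM, is presented. Finally, an experimental comparison with several classifiers demonstrates the feasibility and effectiveness of MSEDR-SVM.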
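To make the idea of a distance ratio over all samples concrete, the sketch below computes the mean Euclidean distance from each class to a given hyperplane w·x + b = 0 and takes the ratio of the two means. The exact MSEDR definition and how it enters the SVM are not given in this section, so this is only one plausible reading; the function names, the hyperplane, and the toy points are assumptions.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def mean_distance_ratio(minority, majority, w, b):
    """Ratio of the mean minority distance to the mean majority distance,
    computed over ALL samples, not only the support vectors."""
    mean_min = sum(distance_to_hyperplane(x, w, b) for x in minority) / len(minority)
    mean_maj = sum(distance_to_hyperplane(x, w, b) for x in majority) / len(majority)
    return mean_min / mean_maj


# Hypothetical hyperplane x1 = 0 and toy samples on either side of it.
w, b = (1.0, 0.0), 0.0
minority = [(1.0, 0.0), (3.0, 0.0)]
majority = [(-1.0, 0.0), (-1.0, 2.0), (-1.0, -2.0), (-1.0, 5.0)]

ratio = mean_distance_ratio(minority, majority, w, b)
```

A ratio above 1 means that the minority samples sit, on average, farther from the hyperplane than the majority samples. Unlike a quantity computed from support vectors only, every interior point contributes to both means.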
4 Conclusion
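This paper proposes an SVM classification method based on the spatial distribution information of the samples. The method makes the distance from the minority-class samples to the hyperplane larger than that of the majority class, which lowers the upper bound on the minority-class misclassification rate; it makes full use of the spatial distribution distance information of the samples and neither adds nor removes sample points. The experimental results show that, of the three proposed methods, MSEDR-SVM is the most effective on imbalanced sample distributions. Information about an imbalanced sample distribution also includes the dispersion and the trend of the distribution; future work will address these aspects.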
References
[1] VAPNIK V N. The nature of statistical learning theory[M]. New York: Springer, 1995.
[2] GEORGOULAS G, GEORGOPOULOS V C, STYLIOS C D. Speech sound classification and detection of articulation disorders with support vector machines and wavelets[C]//Conf Proc IEEE in Medicine and Biology Society. New York, USA: IEEE, 2006: 2199-2202.
[3] LI S, ZHOU W, YUAN Q, et al. Feature extraction and recognition of ictal EEG using EMD and SVM[J]. Computers in Biology and Medicine, 2013,43(7): 807-816.
[4] LIU Y, ZHOU W, YUAN Q, et al. Automatic seizure detection using wavelet transform and SVM in long-term intracranial EEG[J].IEEE Trans Neural Syst Rehabil Eng, 2012, 20(6):749-755.
[5] LI B, MENG M Q. Tumor recognition in wireless capsule endoscopy images using textural features and SVM-based feature selection[J].IEEE Trans Inf Technol Biomed,2012,16(3):323-329.
[6] ZHANG Y X, CHENG Z F, XU Z P, et al. Application of optimized parameters SVM based on photoacoustic spectroscopy method in fault diagnosis of power transformer[J]. Spectroscopy & Spectral Analysis, 2015,35(1):10-13.
[7] SANTOS P, VILLA L F, et al. An SVM-based solution for fault detection in wind turbines[J]. Sensors (Basel), 2015, 15(3): 5627-5648.
[8] CHAWLA N V, BOWYER K W, HALL L O, et al. SMOTE: Synthetic minority over-sampling technique[J]. Journal of Artificial Intelligence Research, 2002, 16: 321-357.
[9] WU H X, PENG Y, PENG X Y. A new support vector machine method for unbalanced data treatment[J].Chinese Journal of Electronics, 2006,34: 2395-2398.
[10] KUBAT M, MATWIN S. Addressing the curse of imbalanced training sets: One-sided selection[C]//Proc. 14th International Conference on Machine Learning. Nashville, TN, USA: ICML, 1997: 179-186.
[11] VEROPOULOS K, CAMPBELL C, CRISTIANINI N. Controlling the sensitivity of Support Vector Machines[C]//International Joint Conference on AI. Stockholm, Sweden: IJCAI, 1999: 55-60.
[12] CHEW H G, CRISP D J, BOGNER R E, et al. Target detection in radar imagery using Support Vector Machines with training size biasing[J]. Southern Medical Journal, 2000, 90(10): 959-963.
[13] LIU W H. Study of Support Vector Machine algorithms on unbalanced dataset[D].Qingdao:Shandong University of Science and Technology, 2010.
[14] FRANK A, ASUNCION A. UCI Machine Learning Repository[EB/OL]. [2010-06-13]. http://archive.ics.uci.edu/ml.
[15] CHANG C C, LIN C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 389-396.