吳疆 董婷 蔣平
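Abstract:
The semi-supervised learning method Laplace support vector machine (LapSVM) is applied to predict protein structural classes. First, seven amino acid physicochemical property parameters serve as a substitution model to convert each protein sequence into a numeric sequence. The auto covariance transform (Auto Cross Covariance, AC) describes the interactions between amino acid residues separated by a given interval and transforms the numeric sequences into vectors of uniform length, which form the feature space of the samples. Then 20, 50, 80, 110, 140 and 170 samples are randomly selected from the dataset as unlabelled samples to build the training sets, and the one-against-all decomposition strategy and the leave-one-out method are used to evaluate the predictive ability of the LapSVM model. The classifier predicts the class of protein samples with an accuracy of 94.12%, which is clearly competitive with the 90.69% accuracy of the standard support vector machine (SVM). The experimental results confirm that the distribution information of unlabelled samples, used as weak rules, can effectively improve the predictive performance of the classifier. The work also offers a novel idea: applying a semi-supervised method to a fully supervised learning problem, with a smaller optimization scale and better predictive ability.
Keywords:
semi-supervised learning; protein structural class; Laplace support vector machine; auto covariance transform
CLC number: TP 391
Document code: A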
Protein Structural Class Prediction Using Laplace Support
Vector Machine Based on a Semi-supervised Method
WU Jiang¹, DONG Ting¹, JIANG Ping¹,²
(1. Department of Information Engineering, Yulin University, Yulin, Shaanxi 719000, China;
2. School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi 710071, China)
Abstract:
This study predicts protein structural classes using the Laplace support vector machine (LapSVM), a novel semi-supervised learning method. First, seven amino acid physicochemical properties taken from the literature were used to convert each protein sequence into a numeric sequence, and auto covariance (AC) was used to transform these numeric sequences into feature vectors of uniform length, which are suitable for training models. AC captures neighboring effects and the interactions between residues a given distance apart in the protein sequence. Second, 20, 50, 80, 110, 140 and 170 samples were randomly selected as unlabelled samples to construct the training datasets, and the “one-against-all” strategy and the leave-one-out method were employed to estimate performance. LapSVM achieved a prediction accuracy of 94.12%, which compares favorably with the 90.69% obtained by the standard support vector machine (SVM). The experimental results show that the distribution information of unlabelled samples, supplied as weak rules, can effectively improve prediction performance. The work also suggests a novel idea: applying a semi-supervised method to a supervised learning problem, which leads to a smaller optimization scale and higher prediction accuracy.
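The AC step described above can be illustrated with a minimal sketch. It assumes a commonly used definition of auto covariance, in which each property is centred on its mean over the sequence and products of residues `lag` positions apart are averaged; the abstract does not state the exact formula, so this may differ from the one used in the paper. The names `ac_transform`, `encode` and `TOY_SCALE` are hypothetical, and the two-property scale holds placeholder numbers, not real physicochemical values.

```python
def ac_transform(numeric_seq, max_lag):
    """Map an (n residues x p properties) numeric sequence to p * max_lag features.

    Assumed definition (not taken from the paper):
    AC(lag, j) = 1/(n - lag) * sum_i (x[i][j] - mean_j) * (x[i + lag][j] - mean_j)
    """
    n = len(numeric_seq)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    p = len(numeric_seq[0])
    # Mean of each property over the whole sequence.
    means = [sum(row[j] for row in numeric_seq) / n for j in range(p)]
    features = []
    for j in range(p):
        for lag in range(1, max_lag + 1):
            total = sum(
                (numeric_seq[i][j] - means[j]) * (numeric_seq[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features


# Placeholder scale for illustration only: NOT real physicochemical data.
TOY_SCALE = {"A": (1.0, 0.0), "C": (2.0, 1.0), "D": (3.0, 0.0), "E": (4.0, 1.0)}


def encode(sequence, scale=TOY_SCALE):
    """Replace each residue letter by its tuple of property values."""
    return [scale[residue] for residue in sequence]


short_vec = ac_transform(encode("ACDE"), max_lag=2)
long_vec = ac_transform(encode("ACDEACDEAC"), max_lag=2)
# Sequences of different length yield vectors of the same length (p * max_lag).
print(len(short_vec), len(long_vec))
```

With seven properties, as in the study, the same function would return a vector of length 7 × `max_lag` for every protein, whatever its sequence length.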
Key words:
semi-supervised learning; protein structural class; Laplace support vector machine; auto covariance
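The evaluation protocol named in the abstract, a one-against-all decomposition scored by leave-one-out, can be sketched as follows. This shows the protocol only: the binary scorer is a nearest-centroid stand-in, not LapSVM, and it does not use unlabelled samples. All function names, class labels and data points are hypothetical.

```python
def centroid_score(train_x, train_y, x):
    """Stand-in binary scorer (NOT LapSVM): larger means closer to the positive class."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    pos = centroid([r for r, y in zip(train_x, train_y) if y == 1])
    neg = centroid([r for r, y in zip(train_x, train_y) if y == 0])
    return sqdist(x, neg) - sqdist(x, pos)


def one_vs_all_predict(train_x, train_labels, x, scorer=centroid_score):
    """Train one binary problem per class and return the class with the top score."""
    scores = {}
    for c in sorted(set(train_labels)):
        binary = [1 if y == c else 0 for y in train_labels]
        scores[c] = scorer(train_x, binary, x)
    return max(scores, key=scores.get)


def leave_one_out_accuracy(xs, labels):
    """Hold out each sample in turn, train on the rest, and count correct predictions."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if one_vs_all_predict(train_x, train_y, xs[i]) == labels[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy data: three well-separated clusters standing in for classes.
xs = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
      [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
      [0.0, 5.0], [0.2, 5.1], [0.1, 4.8]]
labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
accuracy = leave_one_out_accuracy(xs, labels)
print(accuracy)
```

To follow the study more closely, `centroid_score` would be replaced by a LapSVM decision function trained on the labelled samples together with the randomly selected unlabelled ones. The leave-one-out loop itself would stay the same.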