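Moisture quantitative analysis with small sample set of maize grain in filling stage based on near infrared spectroscopy
Wang Xue1,2, Ma Tiemin2,3, Yang Tao1※, Song Ping1, Xie Qiuju2, Chen Zhengguang2
(1. College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; 2. College of Electrical and Information, Heilongjiang Bayi Agricultural University, Daqing 163319, China; 3. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China)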
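The moisture content of maize grain during the filling stage is an important index in variety testing and breeding. To save samples while measuring filling-stage maize moisture quickly and accurately, this paper applies near infrared spectroscopy and proposes the Bootstrap-SPXY-PLS model: a partial least squares (PLS) quantitative moisture model built on a sample optimization method that combines the Bootstrap algorithm with sample set partitioning based on joint x-y distances (SPXY) under small-sample conditions. The results show that model performance is stable when the number of Bootstrap resamplings is 500 and the sample size is at least 10, and that the number of resamplings required decreases as the sample size increases. For sample sizes of 10 and 50, the full-spectrum Bootstrap-SPXY-PLS model gave mean root-mean-square errors of prediction (RMSEP) of 0.38% and 0.40%, correlation coefficients of prediction (rp) of 0.975 1 and 0.968 5, and coefficients of determination (R²) of 0.999 9 and 0.993 6, respectively. After wavelength variable selection by competitive adaptive reweighted sampling (CARS), the CARS-Bootstrap-SPXY-PLS model gave mean RMSEP of 0.36% and 0.35%, rp of 0.973 6 and 0.975 0, and R² of 0.924 5 and 0.918 0, respectively. Both the full-spectrum Bootstrap-SPXY-PLS model and the CARS-Bootstrap-SPXY-PLS model therefore have stable predictive ability and offer a stable, efficient method for measuring seed moisture at the filling stage in maize breeding.
near infrared spectroscopy; moisture; models; quantitative analysis; small sample set; maize grain in filling stage; Bootstrap resampling; sample optimized selection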
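Near infrared spectroscopy and its analytical techniques are increasingly applied to qualitative quality analysis [1-3] and rapid component determination [4-7] in agriculture, food, industry and other fields, with sample sizes typically between 100 and 200 [8-11]. In maize breeding, objective constraints such as the planting area available for new varieties, the number of maize plants that can be grown per square meter and the number of effective experimental ears limit both the number of samples that can be taken and the sampling cost when measuring filling-stage maize moisture. However, the filling stage is a key period for variety changes and for breeding and variety testing, and the traditional oven-drying method takes only the middle 150-250 kernels for the 100-kernel-mass moisture measurement [12], so it requires a large number of samples. Developing a small-sample, high-efficiency moisture measurement method is therefore one of the problems that urgently needs solving in maize breeding.
In near infrared spectral analysis, sample size is key to algorithm performance and predictive ability. In general, the fewer the samples, the less effective the model validation, so finding the critical value of a small sample size is very important in applications. The Bootstrap algorithm, proposed by Efron in 1979 [13], has been widely used to improve sample testing methods in chemometrics. In recent years many researchers have proposed using Bootstrap resampling for data analysis under small-sample conditions. Most consider the Bootstrap method reliable for testing small-sample data [14]. Compared with traditional methods it reduces the quantity propagated and the uncertainty [15]; it can be used for normality testing of samples [16], and it can also be applied regardless of whether the data are normally distributed [17] and without preprocessing the data [18-19]. Some researchers hold that the non-parametric Bootstrap should be used for raw data that deviate from a normal distribution [20], whereas for normally or approximately normally distributed raw data the parametric Bootstrap can replace the non-parametric one. In particular, the non-parametric Bootstrap test is more effective than other tests when samples are few, and the sample size should generally be at least 10 [21].
Chen Zhao et al. [22] proposed Bagging and Boosting methods that combine Bootstrap with partial least squares, and showed that Bootstrap improved the predictive ability of near infrared quantitative models. Xiao et al. [23], studying spectral quantitative models with few samples for components related to the asphalt penetration index, found that a model combining Bootstrap with SVM performed well. Lodder et al. [24] created the Bootstrap Pattern Selection method for sample selection and applied it to protein determination, with R² stable at 0.988. The existing literature therefore indicates that Bootstrap has advantages for analysis and detection under small-sample conditions and imposes no strict requirement on the distribution of the raw data, so it can be applied readily in analytical models.
The aim of this study is to construct a near infrared sample optimization method using Bootstrap and sample set partitioning based on joint x-y distances (SPXY), to establish a quantitative moisture model for filling-stage maize that is suited to small samples, and to analyze how the number of resamplings and the critical sample size affect the model. This should raise the efficiency and lower the cost of measuring filling-stage maize moisture and provide a new method for that measurement, which both supports maize breeding and variety testing research and offers a new approach to quantitative near infrared analysis under small-sample conditions.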
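Samples were collected at the maize experimental base of Heilongjiang Bayi Agricultural University; the variety was "Xianyu 335". Filling-stage samples were collected from 21 August to 2 October 2016, once every 7 d. Because temperatures were high, samples were moved to the laboratory immediately after picking and stored at low temperature to limit moisture loss. Chemical testing and spectral acquisition were completed in the shortest possible time to minimize external influences on the prediction model. The data in this paper come from the samples taken on 11 September 2016; the kernels were ground to powder for spectral acquisition. Spectra were acquired for 200 samples. After outliers were removed, the remaining 156 samples were split at a ratio of 3:1 into a calibration set of 118 samples and a prediction set of 38 samples.
Spectra were acquired with a WQF-600N FT-NIR Fourier transform spectrometer (Beijing Ruili) over the range 4 000-10 000 cm⁻¹. Each sample was scanned 32 times, and the final spectrum was the average of the 32 scans.
For moisture determination, mass was measured with a JJ224BC balance (G&G Electronic Balance Co., Ltd., USA) with a precision of 0.1 mg. The drying equipment was a WH-71 electric constant-temperature drying oven (Tianjin Taisite Instrument Co., Ltd.). The two-stage drying method was used: fixation at 105 ℃, then a constant 85 ℃ until the 100-kernel mass no longer changed. Moisture content is the water content per unit mass of maize grain, i.e. moisture content = (100-kernel fresh mass (g) - 100-kernel dry mass (g)) / 100-kernel fresh mass (g) × 100%.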
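The moisture formula above is plain arithmetic, and a minimal Python sketch may make it concrete. The function name `moisture_content` and the example masses are hypothetical; they are not values from the study.

```python
def moisture_content(fresh_mass_g: float, dry_mass_g: float) -> float:
    """Wet-basis moisture content (%) from 100-kernel fresh and dry masses in grams."""
    if fresh_mass_g <= 0:
        raise ValueError("fresh mass must be positive")
    if not 0 <= dry_mass_g <= fresh_mass_g:
        raise ValueError("dry mass must lie between 0 and the fresh mass")
    return (fresh_mass_g - dry_mass_g) / fresh_mass_g * 100.0


# Hypothetical masses, for illustration only (not data from the study).
example = moisture_content(fresh_mass_g=40.0, dry_mass_g=24.0)
print(example)  # 16 g of water in 40 g of fresh kernels -> 40.0
```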
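Algorithms were implemented in Matlab 2015b (8.6.0) and RStudio 3.4.1: the Bootstrap algorithm in RStudio, and the sample partitioning and wavelength variable selection algorithms in Matlab 2015b. Spectral preprocessing and related calculations were carried out in The Unscrambler X 10.3.
1.3.1 Bootstrap resampling algorithm
To improve model accuracy and robustness, the Bootstrap algorithm repeatedly simulates the original small-sample data set by resampling, building a new data set that meets the needs of analysis and modeling. If the data were simply simulated iteratively, differences among the samples in the new data set would be hard to guarantee. Therefore, before each resampling round, the samples drawn in the previous round are merged with the original samples, and the mean of the new sample set is computed as the weight for the next round, which increases sample diversity and model robustness. The Bootstrap procedure used in this study is as follows:
1) Define the original sample set as $X = (X_1, X_2, \ldots, X_n)$, where $X_i$ is the $i$-th spectral sample entering the Bootstrap resampling algorithm. Let the maximum number of Bootstrap resamplings be $B$;
3) Following equation (1), resample the candidate sample set according to the current sample weights; the data set drawn in this way is $X^* = (X_1^*, X_2^*, \ldots, X_n^*)$;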
*+(1)
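4) Merge the newly drawn sample set $X^*$ with $X$, and recompute the sampling weights as in step 2);
5) Repeat steps 3) and 4) $B$ times to form the new sample set, whose size is $n \times B$.
For $n$ samples, $B$ iterations of resampling yield $n \times B$ resampled samples. Sample partitioning and modeling are then applied to this set to obtain samples suitable for building the model, and thus a valid prediction from a small sample.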
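The resampling loop can be sketched roughly as below. This is a simplified illustration, not the authors' R implementation: it assumes equal draw weights in every round, so the weight update of steps 2) and 4) is deliberately left out. The function name `bootstrap_resample` and the toy data are hypothetical. The sketch does show how a set of n samples and a given number of rounds produce n times that many resampled samples.

```python
import random


def bootstrap_resample(samples, n_rounds, seed=0):
    """Pool n_rounds bootstrap replicates, each of size len(samples).

    Assumption: uniform draw weights in every round. The weighting scheme
    of the procedure described in the text is NOT reproduced here.
    """
    rng = random.Random(seed)
    n = len(samples)
    pooled = []
    for _ in range(n_rounds):
        # One replicate: n draws with replacement from the original set.
        pooled.extend(rng.choices(samples, k=n))
    return pooled


# Hypothetical toy "spectra": 5 samples with 3 wavelength points each.
toy_spectra = [
    [0.41, 0.52, 0.47],
    [0.40, 0.55, 0.49],
    [0.43, 0.51, 0.46],
    [0.39, 0.53, 0.48],
    [0.42, 0.54, 0.50],
]
pooled = bootstrap_resample(toy_spectra, n_rounds=500)
print(len(pooled))  # 5 samples x 500 rounds = 2500
```

The count matches the figure given later in the text for five samples resampled 500 times.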
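1.3.2 SPXY sample selection algorithm
Sample set partitioning based on joint x-y distances (SPXY) has the advantages of ensuring that all samples are covered, improving the model's predictive ability and speeding up calibration [25]. SPXY computes the distance between samples while taking both the spectral vector and the concentration vector into account, so it is well suited to partitioning sample sets for prediction models.
Here $d_x(p, q)$ is the distance between spectral samples $p$ and $q$, $d_y(p, q)$ is the distance between their concentration values, and $N$ is the number of samples. For partitioning, $d_x$ and $d_y$ are each divided by their maximum value, which gives the joint x-y distance between every pair of samples; this distance is used as the weight for sample selection [26].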
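In the usual SPXY formulation, the joint distance is d_xy(p, q) = d_x(p, q) / max d_x + d_y(p, q) / max d_y, and samples are picked one at a time so that each new sample is as far as possible from those already chosen. The Python sketch below follows that formulation under stated assumptions: Euclidean distance for the spectra, absolute difference for the reference values, and a hypothetical function name `spxy_select`. It is not the authors' Matlab code.

```python
import math


def spxy_select(X, y, n_select):
    """Return indices of n_select samples chosen by max-min joint x-y distance."""
    n = len(X)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise distances in spectral space (x) and reference-value space (y).
    dx = [[math.dist(X[p], X[q]) for q in range(n)] for p in range(n)]
    dy = [[abs(y[p] - y[q]) for q in range(n)] for p in range(n)]
    max_dx = max(max(row) for row in dx) or 1.0  # guard against all-equal data
    max_dy = max(max(row) for row in dy) or 1.0
    # Joint distance: each part is scaled by its own maximum before summing.
    dxy = [[dx[p][q] / max_dx + dy[p][q] / max_dy for q in range(n)]
           for p in range(n)]
    # Start from the two samples that are farthest apart.
    first, second = max(
        ((p, q) for p in range(n) for q in range(p + 1, n)),
        key=lambda pair: dxy[pair[0]][pair[1]],
    )
    selected = [first, second]
    while len(selected) < n_select:
        remaining = [i for i in range(n) if i not in selected]
        # Next pick: the sample whose nearest selected neighbour is farthest.
        selected.append(max(remaining,
                            key=lambda i: min(dxy[i][j] for j in selected)))
    return selected


# Hypothetical one-variable "spectra" and reference values.
X_toy = [[0.0], [1.0], [2.0], [10.0]]
y_toy = [0.0, 1.0, 2.0, 10.0]
print(spxy_select(X_toy, y_toy, 3))  # [0, 3, 2]
```

The two extremes are taken first, and the third pick is the sample farthest from both, which is why index 2 is preferred over index 1.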
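1.3.3 Bootstrap-SPXY sample optimization and selection method
The sample set produced by the Bootstrap algorithm is called the "resampled" set. Building a prediction model directly on the resampled set would burden model construction, and the distributional stability of the resampled set also affects the model's predictive ability. The Bootstrap-SPXY method optimizes the selection from the resampled set and checks its distribution, so that the spectral range and the distribution of the modeling samples stay consistent while the samples are optimized. Fig. 1 shows the Bootstrap-SPXY sample optimization workflow.
Fig. 1 Bootstrap-SPXY sample optimization workflow
The small sample set is first preprocessed and passed to the Bootstrap algorithm, which outputs the resampled set. SPXY sample selection is then run, and a histogram of the sample distribution is drawn and compared with that of the original samples. If the distributions agree poorly, the Bootstrap step is repeated; if they agree, sample optimization is complete and the modeling sample set is formed.
In this study, the coefficient of determination R², the root-mean-square error of calibration (RMSEC), the root-mean-square error of cross-validation (RMSECV), the root-mean-square error of prediction (RMSEP) and the correlation coefficient of prediction rp are used to describe the model's predictive ability. A good model has small RMSEC, RMSECV and RMSEP and large R² and rp. RMSECV and RMSEP should be roughly equal: if RMSECV is much larger than RMSEP, the validation samples are poorly representative; if RMSECV is much smaller than RMSEP, the calibration samples are poorly representative and the model is under-fitted or over-fitted.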
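These evaluation metrics can be computed as in the minimal Python sketch below. It assumes the plain definitions (mean squared error with an n denominator, Pearson correlation, and R² as one minus the ratio of residual to total sum of squares); commercial chemometrics packages may apply degrees-of-freedom corrections that this sketch omits. RMSEC, RMSECV and RMSEP share the same formula and differ only in which set of reference and predicted values is passed in. The function names and data are hypothetical.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(reference, predicted)) / n)


def pearson_r(reference, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_r * var_p)


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_r = sum(reference) / len(reference)
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_r) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical moisture values in percent (not data from the study).
reference = [30.0, 32.0, 34.0, 36.0]
predicted = [30.5, 31.5, 34.5, 35.5]
print(rmse(reference, predicted))  # every error is 0.5, so RMSE = 0.5
print(round(pearson_r(reference, predicted), 4))
print(round(r_squared(reference, predicted), 4))
```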
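The spectral region in which water molecules are most active lies at 4 500-6 900 cm⁻¹, and the raw spectra are visibly noisy in this band. The spectra were therefore preprocessed before modeling. Fig. 2 shows the spectra of the 156 samples after Savitzky-Golay smoothing with a window width of 13 [27-28].
Fig. 2 Spectra of the 156 samples after preprocessing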
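A 13-point Savitzky-Golay smoother can be sketched in a few lines of Python. The text gives only the window width, so the quadratic polynomial order used here is an assumption, as are the function name `savgol_smooth`, the choice to leave the first and last six points unsmoothed, and the toy signal. For a quadratic fit the window weights have a closed form, which for 13 points reduces to (25 - i²) / 143 for offsets i from -6 to 6.

```python
def savgol_smooth(signal, half_width=6):
    """Quadratic Savitzky-Golay smoothing with a window of 2*half_width + 1 points.

    Assumptions: polynomial order 2 (not stated in the text); edge points
    that lack a full window are returned unchanged.
    """
    m = half_width
    norm = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    # Closed-form smoothing weights for a quadratic (or cubic) local fit.
    coeffs = [3 * (3 * m * m + 3 * m - 1 - 5 * i * i) / norm
              for i in range(-m, m + 1)]
    smoothed = list(signal)
    for k in range(m, len(signal) - m):
        window = signal[k - m:k + m + 1]
        smoothed[k] = sum(c * v for c, v in zip(coeffs, window))
    return smoothed


# Hypothetical noisy "spectrum": a linear trend plus alternating noise.
raw = [0.5 + 0.01 * k + (0.02 if k % 2 else -0.02) for k in range(40)]
smoothed = savgol_smooth(raw)
print(round(raw[20], 4), round(smoothed[20], 4))
```

Where SciPy is available, `scipy.signal.savgol_filter(raw, window_length=13, polyorder=2)` performs the same kind of smoothing and also handles the edges.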
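2.2.1 Bootstrap-SPXY sample optimization analysis
First, SPXY was used to partition several sample sets from the original data, with 50, 20, 10 and 5 samples, denoted X’_fifty, X’_twenty, X’_ten and X’_five. The sets are nested: adding the 5 samples of ten+ to X’_five gives the 10-sample set X’_ten; adding the 10 samples of twenty+ to X’_ten gives X’_twenty; and adding the 30 samples of fifty+ to X’_twenty gives X’_fifty. These are the sample sets required for the Bootstrap experiments. Fig. 3 shows the partition result at 6 900 cm⁻¹.
Fig. 3 SPXY sample partition result at 6 900 cm⁻¹
The sets X’_fifty, X’_twenty, X’_ten and X’_five were each transposed to give the sets X_fifty, X_twenty, X_ten and X_five to be processed by the Bootstrap algorithm, and resampling was carried out as described in Section 1.3.1. To study the stability of the Bootstrap-SPXY-PLS model, the number of resamplings was set to 100, 200, …, 800. Resampling X_five 500 times and transposing gives 2 500 spectral samples; resampling X_fifty 500 times and transposing gives 25 000. Because this is a very large number of samples, and to keep the model stable, the resampled spectral set was reduced to 2 000 samples by SPXY-based optimized selection.
Histograms were drawn to check the reliability and distribution of the partitioned and resampled samples, using the water-sensitive 6 900 cm⁻¹ band. Fig. 4 shows the sample frequency distribution at 6 900 cm⁻¹ at different stages: Fig. 4a for the original spectral samples, Fig. 4b for the 50-sample set X’_fifty partitioned by SPXY, and Fig. 4c for the 50 samples after 500 Bootstrap resamplings and re-partitioning. The three distributions are broadly the same, with a and b the most alike. In all three stages the absorbance with the highest frequency falls between 45 and 50. The distribution in Fig. 4c is slightly better than that of the original data, which indicates that modeling samples built by Bootstrap resampling lose none of the distributional features of the samples and can even compensate for shortcomings in the original distribution. These 2 000 samples can therefore serve as the sample set for building the prediction model.
Fig. 4 Histograms of sample frequency distribution at 6 900 cm⁻¹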
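2.2.2 Analysis of the full-spectrum Bootstrap-SPXY-PLS model
The modeling set obtained above (2 000 spectral samples) was divided into 10 subsets using the PLS cross-validation partitioning. A PLS model was built on each subset, giving 10 Bootstrap-SPXY-PLS sub-models. Each sub-model was used to predict the prediction set, giving 10 sets of predictions, and their mean was taken as the final prediction. Table 1 lists the evaluation results of the NIR prediction models built with Bootstrap-SPXY-PLS at different sample sizes with 500 resamplings, and with PLS under different preprocessing methods. All models have small mean RMSEC, RMSECV and RMSEP and large mean rp, so all have some predictive ability [29]. However, the R² values of PLS alone and of the PLS models with different preprocessing are all below 0.7, and their RMSEP-mean and RMSECV values differ appreciably, which indicates that direct PLS modeling and PLS after simple preprocessing predict relatively poorly. For Bootstrap-SPXY-PLS, RMSECV and RMSEP-mean are roughly equal at every sample size and essentially equal once the sample size is at least 10, with R² above 0.98 in every case; this is clearly better than PLS with or without preprocessing. Bootstrap-SPXY-PLS can therefore be used to build a small-sample near infrared quantitative prediction model from at least 10 samples.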
Table 1 NIRS moisture content prediction model of maize grain in filling stage
Note: PLS, partial least squares; SG, Savitzky-Golay smoothing; MSC, multiplicative scatter correction; SNV, standard normal variate.
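The sub-model averaging described in Section 2.2.2 (split the modeling set into 10 subsets, fit one model per subset, average the 10 predictions) can be illustrated with the Python sketch below. To stay self-contained it swaps in a one-variable ordinary least-squares line for each PLS sub-model, so it shows only the averaging scheme, not PLS itself. The interleaved split, the function names `fit_line` and `ensemble_predict`, and the data are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = a + b*x (a stand-in for a PLS sub-model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def ensemble_predict(xs, ys, x_new, n_subsets=10):
    """Fit one sub-model per subset and return the mean of their predictions."""
    all_predictions = []
    for k in range(n_subsets):
        # Interleaved split: subset k takes every n_subsets-th sample.
        intercept, slope = fit_line(xs[k::n_subsets], ys[k::n_subsets])
        all_predictions.append([intercept + slope * x for x in x_new])
    # Average the sub-model predictions point by point.
    return [sum(column) / n_subsets for column in zip(*all_predictions)]


# Hypothetical noise-free data following y = 2 + 3x.
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x for x in xs]
final = ensemble_predict(xs, ys, x_new=[1.0, 2.0])
print([round(v, 6) for v in final])  # close to [5.0, 8.0]
```

With a real spectral matrix, each `fit_line` call would be replaced by a PLS fit, for example scikit-learn's `PLSRegression`, while the averaging step stays the same.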
2.2.3 Stability evaluation of the full-spectrum model
The effect of sample size and number of resamplings on model stability was evaluated from the trends of the correlation coefficient of prediction and the RMSEP. Fig. 5a shows the rp of the Bootstrap-SPXY-PLS model for different sample sizes and numbers of resamplings. rp stays within a good range throughout, fluctuating overall between 0.971 5 and 0.976 0; the minimum occurs at a sample size of 50 with 240 resamplings. Beyond 500 resamplings the fluctuation is clearly smaller than before 500, ranging between 0.973 0 and 0.975 5. rp varies least at a sample size of 5, with a standard deviation of 0.000 144 beyond 500 resamplings; for sample sizes of 10 and 20 the standard deviation beyond 500 is below 0.000 22; and rp varies most at a sample size of 50, with a standard deviation of 0.000 644 beyond 500. Nevertheless, the maximum rp values at sample sizes of 10, 20 and 50 all exceed the maximum rp at a sample size of 5. The rp trend therefore shows only that the model has predictive ability beyond 500 resamplings; it cannot establish the critical sample size for a stable model.
Fig. 5b shows the RMSEP of the Bootstrap-SPXY-PLS model for different sample sizes and numbers of resamplings. Fluctuation again decreases beyond 500 resamplings, with the overall range narrowing from 0.34%-0.51% to 0.35%-0.41%. At a sample size of 5, RMSEP fluctuates most between about 100 and 400 resamplings, with a maximum of 0.51% and a standard deviation above 0.05%; beyond 500 resamplings it is comparatively stable, with a standard deviation below 0.01%. As the sample size increases to 10 and 20, the RMSEP fluctuation flattens out as the number of resamplings grows, and the larger the sample size, the fewer resamplings are needed for it to settle. At a sample size of 50, the standard deviations of RMSEP beyond 250 resamplings and beyond 500 resamplings are both below 0.011%.
Overall, once the number of resamplings reaches 500, the predictive ability of the Bootstrap-SPXY-PLS model stabilizes at every sample size. Because RMSEP fluctuates considerably at a sample size of 5, a minimum critical sample size of 10 is recommended for modeling.
Fig. 5 Trends of rp and RMSEP for the Bootstrap-SPXY-PLS model
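To further optimize the Bootstrap-SPXY-PLS model, competitive adaptive reweighted sampling (CARS) was used to select wavelength bands from the original spectra [30-31]. For CARS band selection, the number of Monte Carlo simulations was set to 500, and selection was run with 10-fold and with 5-fold cross-validation. With 10-fold cross-validation the minimum RMSECV was 0.65%, with 115 variables selected; with 5-fold cross-validation the minimum RMSECV was 0.51%, with 149 variables selected. Fig. 6 shows the 5-fold result. The curves in Fig. 6c are the coefficient paths of the 1 201 spectral band variables. The regression coefficient paths give the best variable subset at Monte Carlo sampling run 19, where RMSECV is also at its minimum, as shown in Fig. 6b. The 149 variables shown in Fig. 6a were therefore chosen as the final variable subset.
Note: Each line in Fig. 6c records the regression coefficient of one spectral band variable over the sampling runs.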
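For readers unfamiliar with CARS, one of its ingredients is an exponentially decreasing function that sets how many variables survive each sampling run, starting from all variables in the first run and ending at two in the last. The Python sketch below implements that schedule as it is commonly described for CARS; it is an assumption that the implementation used in the study follows the same form. It covers only this first elimination stage, so it is not expected to reproduce the 149-variable subset reported above, which also depends on the adaptive reweighted sampling step and on cross-validation. The function name `cars_retained_counts` is hypothetical.

```python
import math


def cars_retained_counts(n_variables, n_runs):
    """Number of variables kept after each run by the exponential schedule.

    ratio_i = a * exp(-k * i), with a and k chosen so that the ratio is 1 in
    the first run and 2 / n_variables in the last run.
    """
    a = (n_variables / 2) ** (1 / (n_runs - 1))
    k = math.log(n_variables / 2) / (n_runs - 1)
    return [round(n_variables * a * math.exp(-k * i))
            for i in range(1, n_runs + 1)]


# 1 201 band variables and 500 runs, as stated in the text.
counts = cars_retained_counts(n_variables=1201, n_runs=500)
print(counts[0], counts[-1])  # 1201 2
```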
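Using the selected variables, Bootstrap-SPXY-PLS models were built for sample sizes of 10, 20 and 50 with 500 resamplings. Their R² values were 0.924 5, 0.901 0 and 0.918 0, slightly below those of the full-spectrum model. To further check stability, the mean RMSEP and rp were computed for the three sample sizes: 0.36% and 0.973 6, 0.36% and 0.953 4, and 0.35% and 0.975 0, respectively. Compared with the full-spectrum model, RMSEP decreases slightly as the sample size increases; rp is generally lower than for the full-spectrum Bootstrap-SPXY-PLS model but follows the same trend, dipping slightly at a sample size of 20. Table 2 gives, for both models under the different small-sample conditions, the predicted values and chemical reference values, the mean deviation between predicted and reference values over the prediction set, and the model run times. The mean deviation of the full-spectrum Bootstrap-SPXY-PLS model is smaller than that of the CARS-Bootstrap-SPXY-PLS model across sample sizes, and is fairly stable. The lower accuracy of the CARS-Bootstrap-SPXY-PLS model may stem from the inherent instability of the CARS algorithm: only the single variable subset with minimum RMSECV was used in the comparison, and variable selection may also discard useful spectral information [32]. The advantage of the CARS-Bootstrap-SPXY-PLS model is that the selected variables are 1/8 of the full-spectrum variable set and the run time is less than 30% of that before selection, which improves the execution efficiency of the Bootstrap-SPXY-PLS model to some extent.
Table 2 Chemical reference values and predicted values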
1) A sample optimization method combining the Bootstrap algorithm with sample set partitioning based on joint x-y distances (SPXY) under small-sample conditions was proposed and combined with partial least squares (PLS) to build the Bootstrap-SPXY-PLS quantitative moisture model. Stable prediction models were obtained with 500 resamplings whenever the sample size was at least 10. For sample sizes of 10, 20 and 50, the R² values of the prediction models were 0.999 9, 0.989 8 and 0.993 6, respectively.
2) The full-spectrum Bootstrap-SPXY-PLS model was compared with several PLS-based models. With a Bootstrap sample size of 10 or more and 500 resamplings, its R² exceeded those of the PLS model, the PLS model with Savitzky-Golay smoothing (SG) and multiplicative scatter correction (MSC), the PLS model with SG and standard normal variate (SNV), and the PLS model with SG, MSC and SNV.
3) The full-spectrum Bootstrap-SPXY-PLS model was compared with the CARS-Bootstrap-SPXY-PLS model. After variable selection the model still maintained fairly stable RMSEP and rp, and R² was above 0.90 in every case.
The Bootstrap-SPXY-PLS model proposed here for quantitative near infrared moisture analysis of filling-stage maize under small-sample conditions can therefore provide a stable, efficient method for measuring seed moisture content at the filling stage in maize breeding.
[1] Wen Tao, Zheng Lizhang, Gong Zhongliang, et al. Rapid identification of geographical origin of camellia oil based on near infrared spectroscopy technology[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(16): 293-299. (in Chinese with English abstract)
[2] Liang Pei-shih, Haff Ronald P, Hua Sui Sheng T, et al. Nondestructive detection of zebra chip disease in potatoes using near-infrared spectroscopy[J]. Biosystems Engineering, 2018, 166(2): 161-169.
[3] Diago M P, Fernández-Novales J, Gutiérrez S, et al. Development and validation of a new methodology to assess the vineyard water status by on-the-go near infrared spectroscopy[J]. Frontiers in Plant Science, 2018, 1(9): 1-13.
[4] Li Qianqian, Tian Kuangda, Li Zuhong, et al. Model of total nitrogen and total sugar in tobacco optimizing after uninformative variable elimination[J]. Chinese Journal of Analytical Chemistry, 2013, 41(6): 917-921. (in Chinese with English abstract)
[5] Jia Shengyao, Li Hongyang, Wang Yanjie, et al. Recursive variable selection to update near-infrared spectroscopy model for the determination of soil nitrogen and organic carbon[J]. Geoderma, 2016, 268(4): 92-99.
[6] Chen Yiyun, Qi Tianci, Huang Yingjing, et al. Optimization method of calibration dataset for VIS-NIR spectral inversion model of soil organic matter content[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2017, 33(6): 107-114. (in Chinese with English abstract)
[7] Sun Xudong, Dong Xiaoling. Improved partial least squares regression for rapid determination of reducing sugar of potato flours by near infrared spectroscopy and variable selection method[J]. Journal of Food Measurement & Characterization, 2015, 9(1): 95-103.
[8] Liu Ke, Chen Xiaojing, Li Limin, et al. A consensus successive projections algorithm-multiple linear regression method for analyzing near infrared spectra[J]. Analytica Chimica Acta, 2015, 858(1): 16-23.
[9] Zhu Liwei, Ma Wenguang, Hu Jin, et al. Advances of NIR spectroscopy technology applied in seed quality detection[J]. Spectroscopy & Spectral Analysis, 2015, 35(2): 346-349. (in Chinese with English abstract)
[10] Peng Yankun, Zhao Fang, Li Long, et al. Discrimination of heat-damaged tomato seeds based on near-infrared spectroscopy and PCA-SVM method[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(5): 159-165. (in Chinese with English abstract)
[11] Guo Tingting, Xu Li, Liu Jin, et al. Study on discrimination method of maize seed viability based on near-infrared spectroscopy[J]. Spectroscopy & Spectral Analysis, 2013, 33(6): 1501-1505. (in Chinese with English abstract)
[12] Liu Siqi, Zhong Xuemei, Li Fenghai, et al. Comparisons of grain filling and dehydration rates in 4 representative maize varieties in northeast provinces[J]. Seed, 2015, 34(12): 69-72. (in Chinese with English abstract)
[13] Efron B. Bootstrap methods: another look at the jackknife[J]. Annals of Statistics, 1979, 7(1): 1-26.
[14] Krebsbach C M. Bootstrapping with Small Samples in Structural Equation Modeling: Goodness of Fit and Confidence Intervals[D]. Rhode Island, USA, University of Rhode Island, 2014.
[15] Amalnerkar E, Lee T H, Lim W. Bootstrap guided information criterion for reliability analysis using small sample size information[C]// World Congress of Structural and Multidisciplinary Optimisation. Springer, Cham, 2017: 326-333.
[16] Wang Yanqing, Zhou Weihu, Dong Dengfeng, et al. Estimation of random vibration signals with small samples using bootstrap maximum entropy method[J]. Measurement, 2017, 105(7): 45-55.
[17] Coskun A, Ceyhan E, Inal T C, et al. The comparison of parametric and nonparametric bootstrap methods for reference interval computation in small sample size groups[J]. Accreditation & Quality Assurance, 2013, 18(1): 51-60.
[18] Heathcote A, Brown S, Wagenmakers E J, et al. Distribution-free tests of stochastic dominance for small samples[J]. Journal of Mathematical Psychology, 2010, 54(5): 454-463.
[19] Vojta A, Shek-Vugrovečki A, Radin L, et al. Hematological and biochemical reference intervals in Dalmatian pramenka sheep estimated from reduced sample size by bootstrap resampling[J]. Veterinarski Arhiv, 2011, 81(1): 25-33.
[20] Neto E C. Speeding up non-parametric bootstrap computations for statistics based on sample moments in small/moderate sample size applications[J]. Plos One, 2015, 10(6): e0131333.
[21] Dwivedi A K, Mallawaarachchi I, Alvarado L A. Analysis of small sample size studies using nonparametric bootstrap test with pooled resampling method[J]. Statistics in Medicine, 2017, 36(14): 2187-2205.
[22] Chen Zhao, Wu Zhisheng, Shi Xinyuan, et al. A study on model performance for ethanol precipitation process of Lonicera japonica by NIR based on Bagging-PLS and Boosting-PLS algorithm[J]. Chinese Journal of Analytical Chemistry, 2014, 42(11): 1679-1686. (in Chinese with English abstract)
[23] Xiao Ma, Zhao Zhong, Xiong Shanhai. Spectrum quantitative analysis based on bootstrap-SVM model with small sample set[J]. Spectroscopy & Spectral Analysis, 2016, 36(5): 1571-1575.
[24] Lodder R, Moses J, Buice R G. Determination of protein crosslinking with bootstrap pattern selection and near-infrared spectrophotometry[J]. CPS: analchem/0008002, 2000(8): 1-5.
[25] Wen Tao, Hong Tiansheng, Li Lijun, et al. Optimization analysis and establishment of spectra detection model of fatty acid contents for mould paddies[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(1): 193-199. (in Chinese with English abstract)
[26] Li Jiangbo, Guo Zhiming, Huang Wenqian, et al. Near-infrared spectra combining with CARS and SPA algorithms to screen the variables and samples for quantitatively determining the soluble solids content in strawberry[J]. Spectroscopy & Spectral Analysis, 2015, 35(2): 372-378. (in Chinese with English abstract)
[27] Zhao Anxin, Tang Xiaojun, Zhang Zhonghua, et al. Optimizing Savitzky-Golay parameters and its smoothing pretreatment for FTIR gas spectra[J]. Spectroscopy & Spectral Analysis, 2016, 36(5): 1340-1344. (in Chinese with English abstract)
[28] Cai Jianhua, Hu Weiwen, Wang Xianchun. Near-infrared spectrum detection of fish oil eicosapentaenoic acid content based on combinational filtering[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(1): 312-317. (in Chinese with English abstract)
[29] Feng Yanchun, Zhang Qi, Hu Changqin. Study on the selection of parameters for evaluating drug NIR universal quantitative models[J]. Spectroscopy & Spectral Analysis, 2016, 36(8): 2447-2454. (in Chinese with English abstract)
[30] Song Xiangzhong, Tang Guo, Zhang Luda, et al. Research advance of variable selection algorithms in near infrared spectroscopy analysis[J]. Spectroscopy & Spectral Analysis, 2017, 37(4): 1048-1052. (in Chinese with English abstract)
[31] Cai Lianghong, Ding Jianli. Wavelet transformation coupled with CARS algorithm improving prediction accuracy of soil moisture content based on hyperspectral reflectance[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2017, 33(16): 144-151. (in Chinese with English abstract)
[32] Bin Jun, Fan Wei, Zhou Jiheng, et al. Application of intelligent optimization algorithms to wavelength selection of near-infrared spectroscopy[J]. Spectroscopy & Spectral Analysis, 2017, 37(1): 95-102. (in Chinese with English abstract)
Moisture quantitative analysis with small sample set of maize grain in filling stage based on near infrared spectroscopy
Wang Xue1,2, Ma Tiemin2,3, Yang Tao1※, Song Ping1, Xie Qiuju2, Chen Zhengguang2
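(1. College of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China; 2. College of Electrical and Information, Heilongjiang Bayi Agricultural University, Daqing 163319, China; 3. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China)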
Near infrared spectroscopy (NIRS) and its analytical techniques are increasingly used for rapid quantitative and qualitative analysis in agriculture, food, industry and other fields. The sample size in most studies is between 100 and 200. In maize breeding, the number of samples available for measuring grain moisture at the filling stage, and the cost of sampling, are limited by the planting area of new varieties, the number of maize plants per square meter, the number of effective experimental ears and other conditions. However, the filling stage is a critical period for variety changes and breeding tests. The traditional oven-drying method takes 150-250 grains for each moisture measurement, which requires a large number of samples. An efficient moisture measurement method that works with a small sample size is therefore urgently needed in maize breeding. In NIRS research, the size of the sample set is a key factor in the performance and predictive ability of an algorithm. In general, the smaller the sample set, the less effective the model, so it is important to find a critical size for small sample sets in practical applications. In recent years, Bootstrap-based data analysis methods for small sample sets have been proposed, and most are considered reliable for validating small-sample data. To reduce the sample size and measure the moisture content of maize grain at the filling stage quickly and accurately, a quantitative moisture model based on optimized sample selection and the partial least squares (PLS) algorithm was built using NIRS. The optimized sample selection combines the Bootstrap resampling strategy with sample set partitioning based on joint x-y distances (SPXY).
The models were evaluated by the correlation coefficient of prediction (rp) and the root-mean-square error of prediction (RMSEP) for different numbers of resamplings and different sample sizes. Firstly, the full spectrum and the wavelength-selected spectrum were resampled 100-800 times at sample sizes of 5, 10, 20 and 50 using the Bootstrap algorithm. Secondly, SPXY was applied to the resampled set to select an optimized modeling sample set. Thirdly, the modeling sample set was divided into multiple subsets, a PLS sub-model was built on each subset, and multiple predicted values were obtained by regression with the PLS sub-models. Finally, the predicted moisture of maize grain at the filling stage was obtained as the weighted mean of the multiple predicted values. A model with stable performance was obtained when the number of Bootstrap resamplings was 500 and the sample size was at least 10, and the number of resamplings required decreased as the sample size increased. For sample sizes of 10 and 50, the mean RMSEP values of the full-spectrum Bootstrap-SPXY-PLS model were 0.38% and 0.40%, the correlation coefficients of prediction were 0.975 1 and 0.968 5, and the determination coefficients (R²) of the calibration were 0.999 9 and 0.993 6, respectively; the mean RMSEP values of the CARS-Bootstrap-SPXY-PLS model were 0.36% and 0.35%, the correlation coefficients of prediction were 0.973 6 and 0.975 0, and the R² values were 0.924 5 and 0.918 0, respectively. Therefore, both the full-spectrum Bootstrap-SPXY-PLS model and the CARS-Bootstrap-SPXY-PLS model have good predictive ability and can provide a new, stable and efficient method for determining maize grain moisture at the filling stage in the breeding process. This is helpful for maize breeding research, and also provides a new idea for quantitative analysis of NIR spectra with small sample sets.
near infrared spectroscopy; water; models; quantitative analysis; small sample set; maize grain in filling stage; bootstrap resample; sample optimized selection
Wang Xue, Ma Tiemin, Yang Tao, Song Ping, Xie Qiuju, Chen Zhengguang. Moisture quantitative analysis with small sample set of maize grain in filling stage based on near infrared spectroscopy[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2018, 34(13): 203-210. (in Chinese with English abstract) doi:10.11975/j.issn.1002-6819.2018.13.024 http://www.tcsae.org
2018-02-25
2018-05-19
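National Natural Science Foundation of China, Young Scientists Fund (31701318); Heilongjiang Bayi Agricultural University intramural research cultivation project (XZR2016-09).
Wang Xue, from Shenyang, Liaoning, lecturer and PhD candidate, works on near infrared spectral analysis and its applications in agriculture. Email: mtmwx@163.com
Yang Tao, PhD, professor and doctoral supervisor, mainly engaged in teaching and research on the application of computer technology in agriculture. Email: 328748306@qq.com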
10.11975/j.issn.1002-6819.2018.13.024
S24
A
1002-6819(2018)-13-0203-08