李思卓 周蘭江 周楓 郭劍毅
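Abstract: Word alignment has long been a fundamental problem in natural language processing. To achieve automatic Chinese-Lao bilingual word alignment, this paper first analyzes characteristics of Lao such as the inverted order of modifiers and head words and differences in structure and position. From this analysis, a set of Chinese-Lao bilingual features is selected and fused, and feature functions are constructed for them. The model parameters are trained with the minimum error rate algorithm within a log-linear model framework. IBM Model 3 serves as the baseline, and feature functions are added step by step for comparison with it. Experiments show that the method effectively improves the quality of Chinese-Lao bilingual word alignment.
Keywords: Chinese-Lao bilingual word alignment; feature function; minimum error rate algorithm; log-linear model; IBM Model 3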
DOI: 10.11907/rjdk.172624
CLC number: TP312
Document code: A    Article ID: 1672-7800(2018)004-0009-04
Abstract: Word alignment is one of the basic problems in natural language processing. To realize automatic Lao-Chinese bilingual word alignment, this paper analyzes features of Lao such as the inverted order of modifiers and head words in sentences and differences in structure and position. Based on this analysis, we select a set of Lao-Chinese bilingual features, fuse them, construct feature functions, and train the model parameters with the minimum error rate algorithm under a log-linear model framework, taking IBM Model 3 as the baseline. In the experiments, feature functions are added to the alignment model step by step and compared against the baseline model. Experimental results show that this method effectively improves the quality of Lao-Chinese bilingual word alignment.
Key Words: Lao-Chinese bilingual word alignment; feature function; minimum error rate algorithm; log-linear model; IBM Model 3
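0 Introduction
Bilingual word alignment was proposed by Brown et al. as an implicit step of machine translation. Building on the five IBM models, Och et al. developed the open-source word alignment tool GIZA++. Blunsom et al. introduced binary decisions and improved the search procedure of the algorithm on the basis of conditional random fields. Liu et al. took a new approach: they handled word alignment with a log-linear model in which the grammatical characteristics of a given language pair are converted into feature models, thereby improving alignment quality.
This paper analyzes the linguistic characteristics of Chinese and Lao in detail. To achieve automatic Chinese-Lao word alignment, it takes Chinese as the reference, summarizes the characteristics of Lao, fuses them, and constructs feature functions. Building on IBM Model 3, it proposes a word alignment algorithm that fuses several feature functions based on Lao dependency syntax. Experiments show that the method effectively improves the quality of Chinese-Lao bilingual word alignment.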
1 Log-Linear Model
This paper uses IBM Model 3 as the base feature function and, on top of it, incrementally adds feature functions designed for the linguistic characteristics of Lao so that their effects can be compared.
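The following is a minimal Python sketch of how a log-linear alignment model combines feature functions. It assumes the usual form in which a candidate alignment is scored by the weighted sum of its feature values and the probability is the normalized exponential of that score. The feature names, weights, and values are hypothetical placeholders, not figures from this paper.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m."""
    return sum(weights[name] * value for name, value in features.items())

def loglinear_probs(candidates, weights):
    """Normalize exp(score) over a finite set of candidate alignments."""
    scores = [loglinear_score(f, weights) for f in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: two IBM Model 3 directions plus one Lao-specific feature.
weights = {"ibm3_zh_lo": 1.0, "ibm3_lo_zh": 1.0, "attr_inversion": 0.5}

# Hypothetical feature values for two candidate alignments of one sentence pair.
candidates = [
    {"ibm3_zh_lo": -4.0, "ibm3_lo_zh": -5.0, "attr_inversion": 2.0},
    {"ibm3_zh_lo": -4.5, "ibm3_lo_zh": -5.5, "attr_inversion": 0.0},
]

probs = loglinear_probs(candidates, weights)
best = max(range(len(candidates)), key=lambda i: probs[i])
```

Adding a new feature function only requires a new key in both dictionaries, which is what makes the incremental comparison against the IBM Model 3 baseline straightforward.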
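2 Chinese-Lao Word Alignment Feature Functions
2.1 IBM model
In this paper, the IBM Model 3 feature models for the two alignment directions are treated as separate features: either Chinese or Lao can serve as the source language, with the other as the target language.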
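2.2 Chinese-Lao word alignment feature functions
2.2.1 Lao-Chinese attributive inversion function
Compared with Chinese, a salient feature of Lao is that a modifier is usually placed after the head word it modifies. That is, the constituent order of a Chinese sentence is (attributive) subject + predicate + (attributive) object, whereas in Lao it is subject (attributive) + predicate + object (attributive). For example, the Chinese sentence "他父親開新車" ("His father drives a new car") corresponds in Lao to "(father)(he)(drive)(car)(new)". As the example shows, whether the head word is the subject or the object, the attributive that modifies it follows the head word. This paper calls this phenomenon postposed modifier spans. Accordingly, each Lao sentence is divided into two blocks: the first consists of head words realized by nouns, labeled Nd; the second consists of modifiers realized by adjectives, labeled Ad.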
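The inversion pattern described above can be illustrated with a small Python sketch. It is only an assumption about how such a feature might be counted, not the feature function defined in the paper: the function name, the representation of alignment links as (Chinese index, Lao index) pairs, and the list of Chinese modifier-head pairs are all hypothetical.

```python
def attributive_inversion_count(links, zh_modifier_head_pairs):
    """Count Chinese (modifier, head) pairs whose aligned Lao words appear
    in the opposite order, i.e. the Lao modifier follows the Lao head."""
    zh_to_lo = {}
    for zh_idx, lo_idx in links:
        zh_to_lo.setdefault(zh_idx, []).append(lo_idx)

    count = 0
    for mod_idx, head_idx in zh_modifier_head_pairs:
        if mod_idx not in zh_to_lo or head_idx not in zh_to_lo:
            continue  # unaligned words contribute nothing
        # Inverted when every Lao word of the modifier comes after the head's.
        if min(zh_to_lo[mod_idx]) > max(zh_to_lo[head_idx]):
            count += 1
    return count

# Glosses for the example sentence "His father drives a new car".
zh_glosses = ["he", "father", "drive", "new", "car"]   # Chinese word order
lo_glosses = ["father", "he", "drive", "car", "new"]   # Lao word order

# Hypothetical gold links (Chinese index, Lao index).
links = {(0, 1), (1, 0), (2, 2), (3, 4), (4, 3)}

# In Chinese the modifier precedes its head: he -> father, new -> car.
zh_pairs = [(0, 1), (3, 4)]

inverted = attributive_inversion_count(links, zh_pairs)  # 2 for this example
```

A count like this could be used directly as a feature value in a log-linear model, so that alignments consistent with the Lao head-modifier order score higher.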
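2.2.2 Chinese-Lao adverbial final-position function
When the source language is Chinese, this model can be added to the log-linear framework built on IBM Model 3. Its feature function is given by Eq. (16):
Note in particular that this feature function is unidirectional: the source language is Chinese and the target language is Lao.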
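3 Parameter Training and Search
3.1 Model parameter training
3.2 Search procedure
Using a stack-based search method, and given the weights λ of the alignment feature functions, this paper searches the M-dimensional word alignment space for the bilingual word alignment with the highest probability.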
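The general shape of such a search can be sketched in Python as a simplified beam search that grows an alignment one link at a time. This is an assumption for illustration only and is not the paper's exact stack-based procedure: the function `beam_search`, the `beam_size` parameter, and the toy scoring table are hypothetical, and the scoring function stands in for the weighted feature sum.

```python
from itertools import product

def beam_search(n_src, n_tgt, score, beam_size=4):
    """Grow alignments one link at a time, keeping the best few hypotheses."""
    all_links = list(product(range(n_src), range(n_tgt)))
    empty = frozenset()
    best, best_score = empty, score(empty)
    beam = [empty]
    while beam:
        expanded = {}
        for hyp in beam:
            base = score(hyp)
            for link in all_links:
                if link in hyp:
                    continue
                new = hyp | {link}
                s = score(new)
                if s > base:  # keep only links that improve the score
                    expanded[new] = s
        # Keep the highest-scoring hypotheses for the next round.
        ranked = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)
        beam = [hyp for hyp, _ in ranked[:beam_size]]
        if ranked and ranked[0][1] > best_score:
            best, best_score = ranked[0]
    return best, best_score

# Hypothetical link scores; links missing from the table are penalized.
link_weight = {(0, 1): 2.0, (1, 0): 1.5, (2, 2): 1.0}

def toy_score(alignment):
    return sum(link_weight.get(link, -1.0) for link in alignment)

best, best_score = beam_search(3, 3, toy_score)
```

The loop stops when no remaining link improves any hypothesis in the beam, so the number of rounds is bounded by the number of possible links.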
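4 Experiments and Analysis of Results
To verify the feasibility of the proposed word alignment method, experiments were carried out on manually aligned Chinese-Lao parallel sentence pairs. IBM Model 3 is the baseline, and word alignment results were obtained on the experimental corpus. The development, test, and training sets used in the experiments are shown in Table 1.
ICTCLAS (Zhang et al., 2003) is used to segment and tag the Chinese sentences in the development and test sets, and Lao is segmented and tagged with the Southeast Asian language information processing platform [14]. 500 sentence pairs from each of the development and test sets were manually aligned and used to optimize the model parameters and the gain threshold.
The experiments take IBM Model 3 as the point of comparison. To better show how each feature function constrains Chinese-Lao word alignment, the three classes of feature models above are further subdivided by part of speech into several feature functions, and the feature functions defined earlier are added step by step on top of the IBM Model 3 feature function. The results are shown in Table 2. All settings use the same Chinese-Lao bilingual corpus. "IBM(both directions)" means that the Chinese-Lao word alignment framework uses only the IBM Model 3 translation models as feature functions; "+DCL" denotes the Chinese-Lao adverbial final-position model; "+USCL" denotes the Chinese-Lao numeral correspondence model; "+UDCL" denotes the Chinese-Lao numeral inversion model; and "+PCL(ADJ)" means that the Chinese-to-Lao attributive inversion model is added on top of these.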
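Let A be the set of alignment links to be evaluated. The manual alignment is divided into two sets: the sure links S and the possible links P. The alignment error rate (AER) is computed as follows:
AER(S, P; A) = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)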
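The AER computation can be sketched in a few lines of Python. The sketch assumes the standard definition of Och and Ney [7], in which sure links also count as possible links. The function name and the example link sets are hypothetical.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as part of P."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links are also possible links
    if not a and not s:
        return 0.0  # nothing to align and nothing predicted
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Hypothetical links as (Chinese index, Lao index) pairs.
sure = {(0, 1), (1, 0), (2, 2)}
possible = {(3, 4)}
predicted = {(0, 1), (1, 0), (2, 2), (3, 3)}

aer = alignment_error_rate(predicted, sure, possible)
perfect = alignment_error_rate(sure, sure, possible)
```

Lower values are better: a prediction that reproduces exactly the sure links has an AER of 0, while the spurious link (3, 3) in the example above raises it.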
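Table 2 shows that, on a Lao-Chinese bilingual corpus of the same size, progressively adding the feature functions above yields clearly better alignment than the model that uses only IBM Model 3 as its feature function. This indicates that the postposed-modifier-span and sentence-backbone correspondence features play an important role in Lao-Chinese word alignment.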
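5 Conclusion
Within the log-linear model framework, this paper designs alignment feature functions for the linguistic characteristics of Lao, incorporates Lao-specific statistical properties into the word alignment model, and trains the model parameters with the minimum error rate algorithm. Taking IBM Model 3 as the baseline, it proposes a word alignment algorithm that fuses several Lao dependency-syntax feature functions on top of the log-linear model. In the experiments, feature functions are added to the alignment model step by step and compared against the baseline. The results show that feature functions designed for Lao syntax clearly improve Chinese-Lao bilingual word alignment. Future work will add more syntactic features and dependency structures to the model to further improve Chinese-Lao word alignment.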
References:
[1] SHEMTOV H.Text alignment in a tool for translating revised documents[C].Proc of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, Netherlands,1993:449-453.
[2] WANG X Z, HE Y L, WANG D D. Non-naive Bayesian classifiers for classification problems with continuous attributes[J]. IEEE Transactions on Cybernetics,2014,44(1):21-39.
[3] RILEY D, GILDEA D. Improving the IBM alignment models using variational Bayes[C].Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. Association for Computational Linguistics,2012:306-310.
[4] CHERRY C, FOSTER G. Batch tuning strategies for statistical machine translation[C].Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics,2012:427-436.
[5] TANG J, GENTZLER E. Globalisation, networks and translation: a Chinese perspective[J]. Perspectives: Studies in Translatology,2009,16(3-4):169-182.
[6] BROWN P F, PIETRA V J D, PIETRA S A D, et al. The mathematics of statistical machine translation: parameter estimation[J]. Computational Linguistics,1993,19(2):263-311.
[7] OCH F J, NEY H. A systematic comparison of various statistical alignment models[J]. Computational Linguistics,2003,29(1):19-51.
[8] BLUNSOM P, COHN T. Discriminative word alignment with conditional random fields[C].Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics,2006:65-72.
[9] TUFIŞ D, ION R, CEAUŞU A, et al. Combined word alignments[C].Proceedings of the ACL Workshop on Building and Using Parallel Texts. Association for Computational Linguistics,2005:107-110.
[10] LIU Y, LIU Q, LIN S. Discriminative word alignment by linear modeling[J]. Computational Linguistics, 2010,36(3):303-339.
(Responsible editor: 杜能鋼)