YAN Hong (嚴紅), CHEN Xingshu (陳興蜀), WANG Wenxian (王文賢), WANG Haizhou (王海舟), YIN Mingyong (殷明勇)
Abstract: Existing research on French Named Entity Recognition (NER) has two limitations: machine learning models rely mostly on the character-level morphological features of words, while multilingual general-purpose NER models rely on the semantic features carried by character and word embeddings. Neither considers semantic, character-morphological and grammatical features together. To address this shortcoming, a deep neural network model for French NER, CGCfr, was designed. First, word embeddings, character embeddings and grammatical feature vectors were extracted from the text. Then, a Convolutional Neural Network (CNN) extracted the character features of each word from its character embedding sequence. Finally, a Bidirectional Gated Recurrent Unit network (BiGRU) followed by a Conditional Random Field (CRF) classifier labeled the named entities in French text from the word embeddings, character features and grammatical feature vectors. In the experiments, CGCfr reached an F1 score of 82.16% on the test set, which is 5.67, 1.79 and 1.06 percentage points higher than the machine learning model NERC-fr and the multilingual neural network models LSTM-CRF (Long Short-Term Memory network with CRF) and Char attention, respectively. The experimental results show that CGCfr, which fuses the three kinds of features, outperforms the other models.
Key words: Named Entity Recognition (NER); French; deep neural network; Natural Language Processing (NLP); sequence labeling
CLC number: TP391.1
Document code: A
0 Introduction
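Named Entity Recognition (NER) is the process of identifying, in text, the names or symbols of specific types of things [1]. It extracts the more meaningful person, organization and location names, so that downstream natural language processing tasks can use these entities to obtain the information they need. With globalization, information exchange between countries has become increasingly frequent. Compared with Chinese-language information, foreign-language information has a greater influence on how other countries view China, which has given rise to multilingual public opinion analysis. French is relatively influential among non-English languages, and French text is one of the important targets of multilingual public opinion analysis. As a basic task of French text analysis, French NER is therefore of considerable importance.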
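Research dedicated to French NER is relatively scarce. Early work was mainly based on rules and dictionaries [2]; later work typically fed hand-selected features into machine learning models to identify the named entities in text [3-7]. Azpeitia et al. [3] proposed the NERC-fr model, which uses the maximum entropy method to recognize French named entities. Its features include word suffixes, character windows, neighboring words, word prefixes, word length and whether the first letter is capitalized. The method achieved good results, but most of these features describe the morphological structure of words rather than their meaning, and the lack of semantic features may limit the recognition accuracy of the model.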
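The kind of hand-selected surface features listed above can be illustrated with a short sketch. This is not the NERC-fr implementation: the function name `handcrafted_features`, the affix length of 3 and the boundary markers `<BOS>` and `<EOS>` are assumptions made for illustration, and the character-window feature is left out because this section does not define it.

```python
from typing import Dict, List, Union

Feature = Union[str, int, bool]


def handcrafted_features(tokens: List[str], i: int, affix_len: int = 3) -> Dict[str, Feature]:
    """Surface-form features for tokens[i] (affix_len is an assumed setting)."""
    word = tokens[i]
    return {
        "prefix": word[:affix_len],             # word prefix
        "suffix": word[-affix_len:],            # word suffix
        "length": len(word),                    # word length
        "is_capitalized": word[:1].isupper(),   # first letter capitalized?
        # Neighboring words, with hypothetical sentence-boundary markers.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }


if __name__ == "__main__":
    sentence = ["Le", "président", "Emmanuel", "Macron", "visite", "Lyon"]
    print(handcrafted_features(sentence, 3))
```

Every feature here is derived from the spelling or position of the word, which is why such a feature set carries little semantic information.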
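In recent years, deep neural networks have achieved good results in natural language processing. Hammerton [8] applied the Long Short-Term Memory network (LSTM) to English NER. Rei et al. [9] proposed the multilingual Char attention model, which uses an attention mechanism to fuse word embeddings and character embeddings and feeds the result as features into a Bidirectional Long Short-Term Memory network (BiLSTM) to obtain named entities by sequence labeling. Lample et al. [10] proposed the LSTM-CRF model, a BiLSTM followed by a Conditional Random Field (CRF); it is also multilingual and uses character and word embeddings as features to recognize English named entities. When LSTM-CRF is applied to French, however, its performance falls well short of its performance on English. A possible reason is that it does not use the grammatical features of the language, given that French grammar is considerably more complex than English grammar.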
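To take semantic, character-morphological and grammatical features into account together and extract French named entities more accurately, this paper designs the CGCfr model. The model uses word embeddings to represent the semantic features of the words in the text, uses a Convolutional Neural Network (CNN) to extract the character-morphological features of each word from its character embeddings, and adds pre-extracted French grammatical features. The three are concatenated and fed into a composite network that combines a bidirectional Gated Recurrent Unit (GRU) network (BiGRU) with a conditional random field to recognize French named entities. Experiments measure the contribution of each kind of feature, and comparisons with other models show that CGCfr, which fuses all three, has the advantage. In addition, this paper contributes a French dataset containing 1005 articles and 29016 entities, which enlarges the data available for French NER so that later studies are less constrained by a lack of data.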
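The input side of such a model can be sketched in plain Python: a one-dimensional convolution with max-over-time pooling turns the character embeddings of a word into a fixed-length character feature vector, which is then concatenated with the word embedding and a grammatical feature vector. This is a minimal sketch under stated assumptions, not the CGCfr implementation. The dimensions, the randomly initialized weights standing in for trained parameters, the ReLU activation, the names `char_cnn` and `token_vector`, and the use of a one-hot part-of-speech vector as the grammatical feature are all hypothetical. The BiGRU and CRF layers that would consume these vectors are not shown.

```python
import random
from typing import Dict, List

Vector = List[float]
rng = random.Random(0)  # fixed seed so the toy weights are reproducible

CHAR_DIM, WORD_DIM, N_FILTERS, WIDTH = 4, 6, 5, 3     # assumed toy sizes
POS_TAGS = ["NOUN", "PROPN", "VERB", "DET", "OTHER"]  # hypothetical grammar tag set


def rand_vec(dim: int) -> Vector:
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Randomly initialized stand-ins for trained parameters.
FILTERS = [[rand_vec(CHAR_DIM) for _ in range(WIDTH)] for _ in range(N_FILTERS)]
char_table: Dict[str, Vector] = {}
word_table: Dict[str, Vector] = {}


def lookup(table: Dict[str, Vector], key: str, dim: int) -> Vector:
    """Return the embedding for key, creating a random one on first use."""
    if key not in table:
        table[key] = rand_vec(dim)
    return table[key]


def char_cnn(word: str) -> Vector:
    """1-D convolution over character embeddings, ReLU, then max-over-time pooling."""
    embs = [lookup(char_table, c, CHAR_DIM) for c in word]
    # Zero-pad short words so that at least one full window exists.
    embs += [[0.0] * CHAR_DIM] * max(0, WIDTH - len(embs))
    pooled = []
    for filt in FILTERS:
        acts = []
        for start in range(len(embs) - WIDTH + 1):
            window = embs[start:start + WIDTH]
            act = sum(w * x for wv, xv in zip(filt, window) for w, x in zip(wv, xv))
            acts.append(max(0.0, act))
        pooled.append(max(acts))  # one pooled value per filter
    return pooled


def token_vector(word: str, pos: str) -> Vector:
    """Concatenate word embedding, character features and grammar features."""
    word_emb = lookup(word_table, word.lower(), WORD_DIM)
    grammar = [1.0 if pos == tag else 0.0 for tag in POS_TAGS]
    return word_emb + char_cnn(word) + grammar


if __name__ == "__main__":
    for w, p in [("Emmanuel", "PROPN"), ("visite", "VERB"), ("Lyon", "PROPN")]:
        print(w, len(token_vector(w, p)))
```

Because pooling takes the maximum over all window positions, words of any length map to a character feature vector of the same size, which is what makes the concatenation with the other two vectors possible.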
4 Conclusion
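This paper designed CGCfr, a deep neural network model for French named entity recognition, and built a French NER dataset. CGCfr takes the word embeddings of the words in French text as semantic features, extracts the morphological features of each word from its character embedding sequence, and combines these with grammatical features to recognize named entities. This broadens the variety of features used in traditional statistical machine learning methods and enriches what they capture, and it avoids the neglect of French grammar in multilingual general-purpose methods. Experiments compared the contribution of each kind of feature in the model and verified their effectiveness. CGCfr was also compared with the maximum entropy model NERC-fr and with the multilingual general-purpose models Char attention and LSTM-CRF. The results show that CGCfr improves the F1 score over all three, which verifies the effectiveness of fusing the three kinds of features for French NER and further raises the recognition rate for French named entities.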
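However, the model also has shortcomings. First, its recognition rate for organization names in French text lags well behind that for the other two entity types; the model does not perform well on entity types whose surface form varies widely. Second, compared with the high NER accuracy achieved for English, French NER still has considerable room for improvement.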
References
[1] NADEAU D, SEKINE S. A survey of named entity recognition and classification[J]. Lingvisticae Investigationes, 2007, 30(1): 3-26.
[2] WOLINSKI F, VICHOT F, DILLET B. Automatic processing of proper names in texts[C]// Proceedings of the 7th Conference on European Chapter of the Association for Computational Linguistics. San Francisco, CA: Morgan Kaufmann Publishers, 1995: 23-30.
[3] AZPEITIA A, CUADROS M, GAINES S, et al. NERC-fr: supervised named entity recognition for French[C]// TSD 2014: Proceedings of the 2014 International Conference on Text, Speech and Dialogue. Berlin: Springer, 2014: 158-165.
[4] POIBEAU T. The multilingual named entity recognition framework[C]// Proceedings of the 10th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2003: 155-158.
[5] PETASIS G, VICHOT F, WOLINSKI F, et al. Using machine learning to maintain rule-based named-entity recognition and classification systems[C]// Proceedings of the 39th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2001: 426-433.
[6] WU D, NGAI G, CARPUAT M. A stacked, voted, stacked model for named entity recognition[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 200-203.
[7] NOTHMAN J, RINGLAND N, RADFORD W, et al. Learning multilingual named entity recognition from Wikipedia[J]. Artificial Intelligence, 2013, 194: 151-175.
[8] HAMMERTON J. Named entity recognition with long short-term memory[C]// Proceedings of the 7th Conference on Natural Language Learning at HLT. Stroudsburg, PA: Association for Computational Linguistics, 2003: 172-175.
[9] REI M, CRICHTON G, PYYSALO S. Attending to characters in neural sequence labeling models[J/OL]. arXiv Preprint, 2016, 2016: arXiv:1611.04361[2016-11-14]. https://arxiv.org/abs/1611.04361.
[10] LAMPLE G, BALLESTEROS M, SUBRAMANIAN S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 260-270.
[11] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1188-1196.
[12] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1532-1543.
[13] SANTOS C D, ZADROZNY B. Learning character-level representations for part-of-speech tagging[C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1818-1826.
[14] CHO K, van MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1724-1734.
[15] SANG E F, VEENSTRA J. Representing text chunks[C]// Proceedings of the 9th Conference on European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 1999: 173-179.