邱澤宇 屈丹 張連海
Abstract: To address the problem that the Griffin-Lim algorithm used for phase recovery in end-to-end speech synthesis systems yields speech of low fidelity with obvious processing artifacts, an end-to-end speech synthesis method based on the WaveNet network architecture was proposed. Built on the sequence-to-sequence (Seq2Seq) structure, the input text was first converted into one-hot vectors; an attention mechanism was then introduced to obtain the Mel spectrogram; finally, a WaveNet back-end network reconstructed the phase information of the speech signal, inverting the Mel-spectrum features into time-domain waveform samples. Experiments were conducted on English and Chinese using the LJSpeech-1.0 and THchs-30 corpora as test data. The results show Mean Opinion Scores (MOS) of 3.31 and 3.02 respectively, outperforming both the end-to-end speech synthesis system using the Griffin-Lim algorithm and the parametric speech synthesis system in terms of naturalness.
關(guān)鍵詞:語音合成;端到端;Seq2Seq;GriffinLim算法;WaveNet
CLC number: TN912.33
Document code: A
Abstract: The Griffin-Lim algorithm is widely used for phase estimation in end-to-end speech synthesis, but it tends to produce obviously artificial speech with low fidelity. To address this problem, an end-to-end speech synthesis system based on the WaveNet network architecture was proposed. Based on the Seq2Seq (Sequence-to-Sequence) structure, the input text was first converted into a one-hot vector; then an attention mechanism was introduced to obtain a Mel spectrogram; finally, a WaveNet network was used to reconstruct the phase information and generate time-domain waveform samples from the Mel spectrogram features. For English and Chinese, the proposed method achieves a Mean Opinion Score (MOS) of 3.31 on the LJSpeech-1.0 corpus and 3.02 on the THchs-30 corpus, outperforming end-to-end systems based on the Griffin-Lim algorithm as well as parametric systems in terms of naturalness.
0 Introduction
Speech synthesis, also known as Text-To-Speech (TTS), is the technology by which a computer analyzes arbitrary text and converts it into fluent speech. As one of the core technologies of human-computer speech interaction systems [1], speech synthesis is an important direction in speech processing, and its application value is attracting growing attention.
The dominant technology in speech synthesis has kept evolving over time. Concatenative speech synthesis, which splices together pre-recorded speech waveform segments, remains one of the commonly used approaches in the field [2-5]. Constrained by the contents of the corpus, this approach places heavy demands on the optimization of the concatenation algorithm and the tuning of storage configurations, and it is of no use for speakers or text content outside the corpus.
As statistical parametric speech synthesis matured, it was gradually adopted for speech synthesis [6]. Its basic idea is to decompose the input training speech into parameters, model the acoustic parameters, and build a parametric training model to form a model library; guided by this library, the speech parameters of the text to be synthesized are then predicted and fed into a vocoder to synthesize the target speech. This approach removes the many artifacts at concatenation boundaries found in concatenative synthesis. However, systems built this way require extensive domain expertise and are therefore difficult to design; moreover, the required modules are usually trained separately, so errors from each module accumulate, and the generated speech is often muffled and unnatural compared with human speech.
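To make this analysis/synthesis idea concrete, the sketch below (Python) uses the open-source WORLD vocoder, via the pyworld package, to decompose a recording into acoustic parameters and resynthesize a waveform from them. It is only an illustration of the general parametric pipeline under assumed settings; WORLD is not necessarily the vocoder referred to in this paper, and in a full statistical parametric system the parameters would be predicted from text by a trained acoustic model rather than copied from a reference recording.

import numpy as np
import soundfile as sf
import pyworld as pw

# Analysis: decompose speech into vocoder parameters.
x, fs = sf.read("reference.wav")      # hypothetical mono speech recording
x = x.astype(np.float64)              # WORLD expects float64 samples
f0, sp, ap = pw.wav2world(x, fs)      # fundamental frequency, spectral envelope, aperiodicity

# Synthesis: reconstruct a time-domain waveform from the parameters.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("resynthesized.wav", y, fs)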
With the rapid development of artificial intelligence, speech synthesis has gained new technical support. Deep learning can unify the internal modules into a single model that directly connects input and output, reducing the heavily engineered parametric models built on domain-specific knowledge; this is known as "end-to-end" learning. Designing an end-to-end speech synthesis system that can be trained on annotated (text, speech) pairs brings several advantages. First, such a system can be conditioned on various attributes, such as different speakers, different languages, or high-level features such as semantics. Second, a single model is more robust than a multi-stage model subject to error accumulation.
End-to-end speech synthesis has attracted extensive research in recent years. WaveNet [7] is a powerful speech generation model that performs well in TTS, but its sample-level autoregressive nature makes it slow, and it requires a complex front-end text analysis system, so it is not an end-to-end speech synthesis system. Deep Voice [8] replaces every module of the traditional TTS pipeline with a neural network, but each of its modules is trained separately, which makes it difficult to convert the system into an end-to-end one. Char2Wav [9] is an independently developed end-to-end model that can be trained on character data, but it requires traditional vocoder parameters as an intermediate feature representation and cannot directly predict output spectral features. Tacotron [10] is a Seq2Seq (Sequence-to-Sequence) architecture that generates magnitude spectrograms from character sequences; it trains a single neural network from the input data alone to replace the linguistic and acoustic feature generation modules, estimates the phase with the Griffin-Lim algorithm [11], and synthesizes speech via the inverse short-time Fourier transform, thereby simplifying the traditional speech synthesis pipeline. However, the Griffin-Lim algorithm introduces characteristic artifacts and the synthesized speech has low fidelity, so it needs to be replaced by a neural network architecture.
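For reference, the Griffin-Lim phase estimation that Tacotron uses as its waveform converter can be sketched as follows (Python, using librosa). The magnitude spectrogram, STFT settings, and iteration count are illustrative assumptions rather than the configuration of any particular system; the explicit loop is shown only to make the iteration visible and mirrors librosa's built-in implementation.

import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=None)               # hypothetical input file
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))    # keep magnitude, discard phase

# Built-in implementation of the iterative algorithm.
y_gl = librosa.griffinlim(S, n_iter=60, n_fft=1024, hop_length=256)

# Equivalent explicit iteration: alternate between imposing the known magnitude
# and re-estimating a consistent phase via an ISTFT/STFT round trip.
angles = np.exp(2j * np.pi * np.random.rand(*S.shape))     # random initial phase
for _ in range(60):
    inverse = librosa.istft(S * angles, hop_length=256)
    rebuilt = librosa.stft(inverse, n_fft=1024, hop_length=256)
    angles = np.exp(1j * np.angle(rebuilt))
y_manual = librosa.istft(S * angles, hop_length=256)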
To address the low naturalness of the speech recovered by the Griffin-Lim algorithm in current end-to-end systems, this paper proposes an end-to-end speech synthesis method based on the WaveNet network architecture: an attention-based Seq2Seq architecture serves as the feature prediction network that converts the input text into a Mel spectrogram, which is combined with the WaveNet architecture to achieve multilingual speech synthesis.
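As a concrete illustration of the intermediate representation, the sketch below (Python, librosa) computes the kind of Mel spectrogram that the feature prediction network is trained to output and that the WaveNet vocoder takes as conditioning input. The settings (22.05 kHz sampling, 1024-point FFT, 256-sample hop, 80 Mel bands, log compression) are common assumed values, not necessarily the exact configuration used in this paper.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=22050)   # hypothetical training utterance
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,       # analysis window / FFT size
    hop_length=256,   # frame shift in samples
    n_mels=80,        # number of Mel filterbank channels
)
log_mel = np.log(np.clip(mel, 1e-5, None))        # dynamic-range compression

# log_mel has shape (80, number_of_frames): the feature prediction network is
# trained to generate frames like these, and the vocoder converts them to audio.
print(log_mel.shape)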
4 Conclusion
This paper presented an end-to-end speech synthesis system: a feature prediction network is first trained with an attention-based Seq2Seq model to obtain the Mel spectrogram of the speech to be synthesized, and the WaveNet architecture is then used to recover the lost phase information and synthesize the speech. In the experiments, the system using the WaveNet architecture outperformed the system using the Griffin-Lim algorithm as the waveform converter. Performance improved as the number of training steps increased and stabilized after about 200k iterations. By adjusting the character representation, synthesis in different languages can be achieved. Because the feature representation and prosodic structure of Chinese are more complex, the naturalness of the synthesized Chinese speech is lower than that of English.
The Seq2Seq architecture used in these experiments consists mainly of combinations of RNNs. Future work will examine the effect of other network combinations on synthesis quality; revising the WaveNet network structure to speed up convergence is also a topic worth investigating.
References
[1] FUNG P, SCHULTZ T. Multilingual spoken language processing [J]. IEEE Signal Processing Magazine, 2008, 25(3):89-97.
[2] HUNT A J, BLACK A W. Unit selection in a concatenative speech synthesis system using a large speech database[C]// Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1996: 373-376.
[3] CAMPBELL N, BLACK A W. Prosody and the selection of source units for concatenative synthesis [M]// Progress in Speech Synthesis. New York: Springer, 1997: 279-292.
[4] ZEN H, SENIOR A, SCHUSTER M. Statistical parametric speech synthesis using deep neural networks [C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 7962-7966.
[5] TOKUDA K, NANKAKU Y, TODA T, et al. Speech synthesis based on hidden Markov models[J]. Proceedings of the IEEE, 2013, 101(5): 1234-1252.
[6] ZEN H, TOKUDA K, BLACK A W. Statistical parametric speech synthesis [J]. Speech Communication, 2009, 51(11):1039-1064.
[7] OORD A V D, DIELEMAN S, ZEN H, et al. WaveNet: a generative model for raw audio [J/OL]. arXiv Preprint, 2016, 2016: arXiv:1609.03499 (2016-09-12) [2016-09-19]. https://arxiv.org/abs/1609.03499.
[8] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: real-time neural text-to-speech [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1702.07825 (2017-02-25) [2017-03-07]. https://arxiv.org/abs/1702.07825.
[9] SOTELO J, MEHRI S, KUMAR K, et al. Char2Wav: end-to-end speech synthesis [EB/OL]. [2018-06-20]. http://mila.umontreal.ca/wp-content/uploads/2017/02/end-end-speech.pdf
[10] WANG Y, SKERRY-RYAN R, STANTON D, et al. Tacotron: towards end-to-end speech synthesis [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1703.10135 (2017-03-29) [2017-04-06]. https://arxiv.org/abs/1703.10135.
[11] GRIFFIN D, LIM J S. Signal estimation from modified short-time Fourier transform [J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243.
[12] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al. Attention-based models for speech recognition [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 577-585.
[13] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4945-4949.
[14] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016:4960-4964.
[15] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015:3156-3164.
[16] VINYALS O, KAISER L, KOO T, et al. Grammar as a foreign language [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 2773-2781.
[17] LEE J, CHO K, HOFMANN T. Fully character-level neural machine translation without explicit segmentation [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1610.03017 (2016-10-10) [2017-05-13]. https://arxiv.org/abs/1610.03017.
[18] SRIVASTAVA R K, GREFF K, SCHMIDHUBER J. Highway networks [J/OL]. arXiv Preprint, 2015, 2015: arXiv:1505.00387 (2015-03-03) [2015-11-03]. https://arxiv.org/abs/1505.00387.
[19] ERRO D, SAINZ I, NAVAS E, et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis [J]. IEEE Journal of Selected Topics in Signal Processing, 2014, 8(2):184-194.
[20] AOKI N. Development of a rule-based speech synthesis system for the Japanese language using a MELP vocoder [C]// Proceedings of the 2000 10th European Signal Processing Conference. Piscataway, NJ: IEEE, 2000: 1-4.
[21] GUNDUZHAN E, MOMTAHAN K. Linear prediction based packet loss concealment algorithm for PCM coded speech [J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(8): 778-785.