By
A new wave of startups is using deep learning to build synthetic voice actors for digital assistants, video-game characters, and corporate videos.
The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”
Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.
WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute for Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video-game characters.
Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.
AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.
Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like—including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
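As a rough illustration of this two-stage idea, here is a minimal Python sketch. It is not WellSaid Labs’ actual architecture; the `AcousticPlan`, `stage_one`, and `stage_two` names, the flat pitch contour, and the silent waveform are invented stand-ins for the real neural models.

```python
# A conceptual sketch of a two-stage text-to-speech pipeline, assuming
# (hypothetically) that stage 1 maps text to a coarse acoustic plan
# (accent, pitch, timbre) and stage 2 renders fine detail into audio.
from dataclasses import dataclass
from typing import List


@dataclass
class AcousticPlan:
    """Coarse description of how the speech should sound."""
    phonemes: List[str]          # simplified: one "phoneme" per letter
    pitch_contour: List[float]   # rough per-phoneme pitch targets (Hz)
    timbre_id: int               # which licensed voice's timbre to use


def stage_one(text: str, timbre_id: int = 0) -> AcousticPlan:
    """Stand-in for the first model: text -> broad strokes of the speech."""
    phonemes = list(text.lower().replace(" ", ""))
    # A real model would predict pitch from context; we fake a flat contour.
    pitch_contour = [220.0 for _ in phonemes]
    return AcousticPlan(phonemes, pitch_contour, timbre_id)


def stage_two(plan: AcousticPlan, sample_rate: int = 16000) -> List[float]:
    """Stand-in for the second model: plan -> waveform with fine detail."""
    samples: List[float] = []
    for _ in plan.phonemes:
        # 10 ms of (here, silent) audio per phoneme; a real vocoder would
        # synthesize breaths, resonance, and room acoustics.
        samples.extend(0.0 for _ in range(sample_rate // 100))
    return samples


audio = stage_two(stage_one("Hello there"))
print(len(audio))  # number of samples in the rendered clip
```

Splitting the pipeline this way mirrors the article’s description: one model decides *what* the performance should sound like, and a second model decides *how* to render it as audio, so each can be trained and tuned separately.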
Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.
AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology—a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.
“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”
Whereas companies used to have to hire different voice actors for different markets—the Northeast versus Southern US, or France versus Mexico—some voice AI firms can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it’s already working with clients to launch such personalized audio ads on Spotify and Pandora.
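The beer-ad scenario above boils down to template substitution before the script reaches the voice engine. A toy sketch, with invented pub names and no real ad platform attached:

```python
# Illustrative only: pick ad copy based on the listener's city, then hand
# the finished line to a text-to-speech engine. The pub names are made up.
def personalize_ad(template: str, city: str) -> str:
    pubs = {
        "New York": "O'Leary's on 5th",
        "Toronto": "The Maple Taproom",
    }
    # Fall back to a generic phrase for cities we have no partner pub in.
    pub = pubs.get(city, "your local pub")
    return template.format(pub=pub)


line = "Stop by {pub} for a cold one tonight."
print(personalize_ad(line, "Toronto"))
# Stop by The Maple Taproom for a cold one tonight.
```

Because the synthetic voice renders whatever text it is given, swapping the words per listener costs nothing extra, which is what makes this kind of personalization impractical with pre-recorded human reads.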
But there are limitations to how far AI can go. It’s still difficult to maintain the realism of a voice over the long stretches of time that might be required for an audiobook or podcast. And there’s little ability to control an AI voice’s performance in the same way a director can guide a human performer.
In other words, human voice actors aren’t going away just yet. Expressive, creative, and long-form projects are still best done by humans. And for every synthetic voice made by these companies, a voice actor also needs to supply the original training data.
For VocaliD’s Patel, the point of AI voices is ultimately not to replicate human performance or to automate away existing voice-over work. Instead, the promise is that they could open up entirely new possibilities. What if in the future, she says, synthetic voices could be used to rapidly adapt online educational materials to different audiences? “If you’re trying to reach, let’s say, an inner-city group of kids, wouldn’t it be great if that voice actually sounded like it was from their community?” ■