鄔峻
18世紀(jì)中葉以來(lái),人類歷史上先后發(fā)生過(guò)3次工業(yè)革命。第一次工業(yè)革命開(kāi)創(chuàng)了“蒸汽時(shí)代”(1760—1840年),標(biāo)志著農(nóng)耕文明向工業(yè)文明的過(guò)渡,是人類發(fā)展史上的第一個(gè)偉大奇跡;第二次工業(yè)革命開(kāi)啟了“電氣時(shí)代”(1840—1950年),使得電力、鋼鐵、鐵路、化工、汽車等重工業(yè)興起,石油成為新能源,促進(jìn)了交通迅速發(fā)展以及世界各國(guó)之間更頻繁地交流,重塑了全球國(guó)際政治經(jīng)濟(jì)格局;兩次世界大戰(zhàn)之后開(kāi)始的第三次工業(yè)革命,更是開(kāi)創(chuàng)了“信息時(shí)代”(1950年至今),全球信息和資源交流變得更為便捷,大多數(shù)國(guó)家和地區(qū)都被卷入全球一體化和高度信息化進(jìn)程中,人類文明達(dá)到空前發(fā)達(dá)的高度。
2016年1月20日,世界經(jīng)濟(jì)論壇在瑞士達(dá)沃斯召開(kāi)了以“第四次工業(yè)革命”為主題的年會(huì),正式宣告了將徹底改變世界發(fā)展進(jìn)程的第四次工業(yè)革命的到來(lái)。論壇創(chuàng)始人、執(zhí)行主席施瓦布(Schwab)教授在其《第四次工業(yè)革命》(The Fourth Industrial Revolution)中詳細(xì)闡述了可植入技術(shù)、基因排序、物聯(lián)網(wǎng)(IoT)、3D打印、無(wú)人駕駛、人工智能、機(jī)器人、量子計(jì)算(quantum computing)、區(qū)塊鏈、大數(shù)據(jù)、智慧城市等技術(shù)變革對(duì)“智能時(shí)代”人類社會(huì)的深刻影響。這次工業(yè)革命中大數(shù)據(jù)將逐步取代石油成為第一資源,其發(fā)展速度、范圍和程度將遠(yuǎn)遠(yuǎn)超過(guò)前3次工業(yè)革命,并將改寫(xiě)人類命運(yùn)以及沖擊幾乎所有傳統(tǒng)行業(yè)的發(fā)展[1],建筑、景觀與城市的發(fā)展也不例外。
第四次工業(yè)革命帶來(lái)的數(shù)據(jù)爆炸正在改變我們的生活和人類未來(lái)。過(guò)去十幾年,無(wú)論是數(shù)據(jù)的總量、種類、實(shí)時(shí)性還是變化速度都在呈現(xiàn)幾何級(jí)別的遞增[2]。截至2013年全世界電子數(shù)據(jù)已經(jīng)達(dá)到460億兆字節(jié),相當(dāng)于約400萬(wàn)億份傳統(tǒng)印刷本報(bào)告,它們拼接后的長(zhǎng)度可以從地球一直鋪墊到冥王星。而僅僅在過(guò)去2年里,我們創(chuàng)造的數(shù)據(jù)量就占人類已創(chuàng)造數(shù)據(jù)總量的90%[3]。大數(shù)據(jù)已被視為21世紀(jì)國(guó)家創(chuàng)新競(jìng)爭(zhēng)的重要戰(zhàn)略資源,并成為發(fā)達(dá)國(guó)家爭(zhēng)相搶占的下一輪科技創(chuàng)新的前沿陣地[4]。
雖然數(shù)據(jù)大井噴帶來(lái)了符合“新摩爾定律”的數(shù)據(jù)大爆炸,2020年全世界產(chǎn)生的數(shù)據(jù)總量將是2009年數(shù)據(jù)總量的44倍[5],但是由于缺乏應(yīng)對(duì)數(shù)據(jù)大爆炸的新型研究范式,世界上僅有不到1%的信息能夠被分析并轉(zhuǎn)化為新知識(shí)[6]。Anderson指出這種數(shù)據(jù)爆炸不僅是數(shù)量上的激增,更是在復(fù)雜度、類別、變化速度和準(zhǔn)確度上的激增與相互混合。他將這種混合型爆炸定義為“數(shù)據(jù)洪水”(data deluge),認(rèn)為在出現(xiàn)新的研究范式以前,“數(shù)據(jù)洪水”將成為制約現(xiàn)有所有學(xué)科領(lǐng)域科研發(fā)展的瓶頸[7]1。在城市和建筑設(shè)計(jì)領(lǐng)域,盡管我們?cè)缫呀?jīng)生活在麻省理工學(xué)院米歇爾教授(William J. Mitchell)生前預(yù)言的“字節(jié)城市”(city of bits)里[8],但是卻遠(yuǎn)遠(yuǎn)沒(méi)有賦予城市研究與設(shè)計(jì)“字節(jié)的超能量”(power of bits)[9]。我們必須開(kāi)發(fā)新型研究范式以應(yīng)對(duì)“數(shù)據(jù)洪水”的強(qiáng)烈沖擊。
常規(guī)研究方法與傳統(tǒng)范式越來(lái)越捉襟見(jiàn)肘,這迫使一些科學(xué)家探索適合于大數(shù)據(jù)和人工智能的新型研究范式。Bell、Hey和Szalay預(yù)警道,所有研究領(lǐng)域都必須面對(duì)越來(lái)越多的數(shù)據(jù)挑戰(zhàn),“數(shù)據(jù)洪水”的處理和分析對(duì)所有研究科學(xué)家至關(guān)重要且任務(wù)繁重。Bell、Hey和Szalay進(jìn)而提出了應(yīng)對(duì)“數(shù)據(jù)洪水”的“第四范式”。他們一致認(rèn)為:至少自17世紀(jì)牛頓運(yùn)動(dòng)定律出現(xiàn)以來(lái),科學(xué)家們已經(jīng)認(rèn)識(shí)到實(shí)驗(yàn)和理論科學(xué)是理解自然的2種基本研究范式。近幾十年來(lái),計(jì)算機(jī)模擬已成為必不可少的第三范式:一種科學(xué)家難以通過(guò)以往理論和實(shí)驗(yàn)探索進(jìn)行研究的新標(biāo)準(zhǔn)工具。而現(xiàn)在隨著模擬和實(shí)驗(yàn)產(chǎn)生了越來(lái)越多的數(shù)據(jù),第四種范式正在出現(xiàn),這就是執(zhí)行數(shù)據(jù)密集型科學(xué)分析所需的最新AI技術(shù)[10]。
Halevy等指出,“數(shù)據(jù)洪水”表明傳統(tǒng)人工智能中的“知識(shí)瓶頸”,即如何最大化提取有限系統(tǒng)中的無(wú)限知識(shí)的問(wèn)題,將可以通過(guò)在許多學(xué)科中引入第四范式得到解決。第四范式將運(yùn)用大數(shù)據(jù)和新興機(jī)器學(xué)習(xí)的方法,而不再純粹依靠傳統(tǒng)理論研究的數(shù)學(xué)建模、經(jīng)驗(yàn)觀察和復(fù)雜計(jì)算[11]。
美國(guó)硅谷科學(xué)家Gray、Hey、Tansley和Tolle等總結(jié)出數(shù)據(jù)密集型“第四范式”區(qū)別于以前科研范式的一些主要特征(圖1)[12]16-19:1)大數(shù)據(jù)的探索將整合現(xiàn)有理論、實(shí)驗(yàn)和模擬;2)大數(shù)據(jù)可以由不同IoT設(shè)備捕捉或由模擬器產(chǎn)生;3)大數(shù)據(jù)由大型并行計(jì)算系統(tǒng)和復(fù)雜編程處理來(lái)發(fā)現(xiàn)隱藏在大數(shù)據(jù)中的寶貴信息(新知識(shí));4)科學(xué)家通過(guò)數(shù)據(jù)管理和統(tǒng)計(jì)學(xué)來(lái)分析數(shù)據(jù)庫(kù),并處理大批量研究文件,以獲取發(fā)現(xiàn)新知識(shí)的新途徑。
第四范式與傳統(tǒng)范式在研究目的和途徑上的差異主要表現(xiàn)在以下方面。傳統(tǒng)范式最初或多或少?gòu)摹盀槭裁础保╳hy)和“如何”(how)之類的問(wèn)題開(kāi)始理論構(gòu)建,后來(lái)在“什么”(what)類問(wèn)題的實(shí)驗(yàn)觀測(cè)中得到驗(yàn)證。但是,第四范式的作用相反,它僅從數(shù)據(jù)密集型“什么”類問(wèn)題的數(shù)據(jù)調(diào)查開(kāi)始,然后使用各種算法來(lái)發(fā)現(xiàn)大數(shù)據(jù)中隱藏的新知識(shí)和規(guī)律,反過(guò)來(lái)生成揭示“如何”和“為什么”類問(wèn)題的新理論。Anderson在其《理論的終結(jié):數(shù)據(jù)洪水使科學(xué)方法過(guò)時(shí)了?》(“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete?”)一文中指出:首先,第四范式并不急于從煩瑣的實(shí)驗(yàn)和模擬,或嚴(yán)格的定義、推理和假設(shè)開(kāi)始理論構(gòu)建;相反,它從大型復(fù)雜數(shù)據(jù)集的收集和分析開(kāi)始[7]1。其次,隱藏在這些龐大、復(fù)雜和交織的數(shù)據(jù)集中的寶貴知識(shí)很難處理,通常無(wú)法使用傳統(tǒng)的科學(xué)研究范式完成知識(shí)發(fā)現(xiàn)[12]16-19。
在國(guó)內(nèi)外,使用第四范式進(jìn)行城市和風(fēng)景園林研究的探索仍處于起步階段,其方法和目標(biāo)多種多樣。由于篇幅所限,本研究直接使用了一些開(kāi)放數(shù)據(jù),并將其與荷蘭政府關(guān)于宜居性的調(diào)查結(jié)果相結(jié)合,重點(diǎn)是將“宜居性”(livability)作為機(jī)器學(xué)習(xí)的預(yù)測(cè)目標(biāo),以引入系統(tǒng)的數(shù)據(jù)密集型研究方法論證第四范式在城市和風(fēng)景園林研究中的可行性和實(shí)用前景。
宜居性是衡量與評(píng)價(jià)城市與景觀環(huán)境可居性與舒適性的重要指標(biāo),近年來(lái)更是成為智慧城市開(kāi)發(fā)的重要切入點(diǎn),澳大利亞-新西蘭智慧城市委員會(huì)(The Council of Smart Cities of Australia/New Zealand)執(zhí)行主席Adam Beck將智慧城市定義為:“智慧城市將利用高科技和大數(shù)據(jù)加強(qiáng)城市宜居性、可操作性和可持續(xù)性?!盵13]
“Eudaimonia”(宜居性)在西方最初由亞里士多德提出,意味著生活和發(fā)展得很好。長(zhǎng)期以來(lái),關(guān)于宜居性并沒(méi)有統(tǒng)一的定義,它在不同城市發(fā)展階段、不同地區(qū)和不同學(xué)科領(lǐng)域有多樣化的含義與運(yùn)用,這導(dǎo)致了宜居性概念上的混亂和可操作性上的難度。盡管宜居性缺乏統(tǒng)一的認(rèn)知和可量化的度量系統(tǒng),多年來(lái),經(jīng)典理論研究嘗試從經(jīng)濟(jì)、社會(huì)、政治、地理和環(huán)境等維度來(lái)探索宜居性的相關(guān)指標(biāo)。
Balsas強(qiáng)調(diào)了經(jīng)濟(jì)因素對(duì)宜居性的決定性作用,他認(rèn)為較高的就業(yè)率、人口中不同階層的購(gòu)買力、經(jīng)濟(jì)發(fā)展、享受教育和就業(yè)的機(jī)會(huì)、生活水準(zhǔn)是決定宜居性的基礎(chǔ)[14]103。Litman也指出人均GDP對(duì)宜居性的重要影響。同時(shí),居民對(duì)于交通、教育、公共健康設(shè)施的可及性和經(jīng)濟(jì)購(gòu)買力也應(yīng)被視為衡量宜居性的重要指標(biāo)[15]。Veenhoven的研究也發(fā)現(xiàn)GDP發(fā)展程度、經(jīng)濟(jì)的社會(huì)需求和居民購(gòu)買力對(duì)于評(píng)價(jià)宜居性起到重要作用[16]2-3。
Mankiw對(duì)以經(jīng)濟(jì)作為單一指標(biāo)評(píng)價(jià)宜居性提出批評(píng),他認(rèn)為僅僅用人均GDP維度來(lái)衡量宜居性是不夠的[17]。Rojas建議其他維度也必須納入考慮范圍,例如政治、社會(huì)、地理、文化和環(huán)境的因素[18]。Veenhoven增加了政治自由度、文化氛圍、社會(huì)環(huán)境與安全性作為宜居性評(píng)價(jià)指標(biāo)[16]7-8。
Van Vliet的研究表明,社會(huì)融合度、環(huán)境清潔度、安全性、就業(yè)率以及諸如教育和醫(yī)療保健等基礎(chǔ)設(shè)施的可及性對(duì)城市宜居性具有直接影響[19]。Balsas還承認(rèn),除了經(jīng)濟(jì)以外,諸如完備的基礎(chǔ)設(shè)施、充足的公園設(shè)施、社區(qū)感和公眾參與度等因素也對(duì)城市宜居性的提升發(fā)揮了積極作用[14]103。
盡管上述學(xué)者提出了評(píng)估宜居性的政治、社會(huì)、經(jīng)濟(jì)和環(huán)境因素,但他們并沒(méi)有給出評(píng)估城市宜居性的具體指標(biāo)建議。Goertz試圖通過(guò)整體性的3層方法系統(tǒng)地評(píng)估宜居性,從而整合上述要素,并為每個(gè)子系統(tǒng)提出相關(guān)的評(píng)估指標(biāo)。第一層界定了宜居類型,第二層構(gòu)建了因素框架,而第三層則定義了具體指標(biāo)變量[20](表1)。
表1 根據(jù)Goertz三層結(jié)構(gòu)系統(tǒng)總結(jié)的宏觀宜居性因素框架與相關(guān)指標(biāo)變量Tab. 1 The framework of macro livability factors and related measurement indexes based on Goertz’s three-tiers system
通過(guò)對(duì)宜居性經(jīng)典研究方法進(jìn)行總結(jié)和整合,Goertz在宏觀層面對(duì)宜居性進(jìn)行了定性研究,其中一些因素為未來(lái)的定量研究指明了方向。但是,他未能在中觀層面上呈現(xiàn)不同城市系統(tǒng)中相應(yīng)的可控變量。作為回應(yīng),Sofeska提出了一個(gè)中觀層面的城市系統(tǒng)評(píng)價(jià)因素框架[21],包括安全和犯罪率、政治和經(jīng)濟(jì)穩(wěn)定性、公眾寬容度和商業(yè)條件;有效的政策、獲得商品和服務(wù)的機(jī)會(huì)、高國(guó)民收入和低個(gè)人風(fēng)險(xiǎn);教育、保健和醫(yī)療水平、人口組成、壽命和出生率;環(huán)境和娛樂(lè)設(shè)施、氣候環(huán)境、自然區(qū)域的可及性;公共交通和國(guó)際化。Sofeska特別強(qiáng)調(diào)了建筑質(zhì)量、城市設(shè)計(jì)和基礎(chǔ)設(shè)施有效性對(duì)中觀層面城市系統(tǒng)宜居性的影響。
超越宏觀的政治、經(jīng)濟(jì)、社會(huì)、環(huán)境體系和中觀的城市體系,Giap等認(rèn)為“宜居性”更應(yīng)該是一個(gè)微觀的地域性概念。在定性研究微觀的社區(qū)尺度的宜居性時(shí),他更強(qiáng)調(diào)城市生活質(zhì)量與城市物質(zhì)環(huán)境質(zhì)量對(duì)宜居性的微觀影響,并將綠色基礎(chǔ)設(shè)施作為一個(gè)重要的指標(biāo)[22]。因此,Giap等提出在社區(qū)單元尺度上定性研究宜居性的重要性。
但是無(wú)論Goertz、Sofeska還是Giap都未建立關(guān)于社區(qū)宜居性的定量評(píng)價(jià)指標(biāo)體系和對(duì)應(yīng)的預(yù)測(cè)方法。目前對(duì)宜居性作定量分級(jí)調(diào)查主要來(lái)自兩套宏觀系統(tǒng):基于六大指標(biāo)體系的經(jīng)濟(jì)學(xué)人智庫(kù)(Economist Intelligence Unit, EIU)宜居指標(biāo)與基于十大指標(biāo)體系的Mercer生活質(zhì)量調(diào)查體系(Mercer LLC)。不過(guò)這兩套體系僅提供了主要基于經(jīng)濟(jì)指標(biāo)的不同城市之間宜居性的宏觀分級(jí)比較工具,在社區(qū)微觀層面進(jìn)行量化研究和預(yù)測(cè)時(shí)并不具備可操作性[23]。
在荷蘭語(yǔ)大辭典中,“l(fā)eefbaarheid”(宜居性)的一般定義如下:“適合居住或與之共存”(荷蘭語(yǔ):geschikt om erin of ermee te kunnen leven)。因此,荷蘭語(yǔ)境的宜居性實(shí)際是關(guān)于主體(有機(jī)體,個(gè)人或社區(qū))與客體環(huán)境之間適宜和互動(dòng)關(guān)系的陳述[24]。
1969年,Groot將宜居性描述為對(duì)獲得合理收入和享受合理生活的客觀社會(huì)保障,充分滿足對(duì)商品和服務(wù)需求的社會(huì)主觀認(rèn)識(shí)。在這個(gè)偏經(jīng)濟(jì)和社會(huì)目標(biāo)的定義中,對(duì)客觀和主觀宜居性的劃分隱含其中??陀^保障涉及實(shí)際可記錄的客觀情況(勞動(dòng)力市場(chǎng)、設(shè)施、住房質(zhì)量等);主觀意識(shí)涉及人們體驗(yàn)實(shí)際情況的主觀方式[25]。在20世紀(jì)70年代,宜居性進(jìn)入了地區(qū)政治的視角。人們意識(shí)到宜居性的中心不應(yīng)該是建筑物,而是人;物質(zhì)生活不僅在數(shù)量,更在質(zhì)量。當(dāng)時(shí)的鹿特丹市議員Vermeulen將這一社會(huì)概念的轉(zhuǎn)換描述為:“你可以用你住房的磚塊數(shù)量算出你房子的大小,但卻無(wú)法知道它的宜居性?!?2002年,荷蘭社會(huì)和文化規(guī)劃辦公室(The Social and Cultural Planning Office of The Netherlands)對(duì)宜居性給出了以下描述:物質(zhì)空間、社會(huì)質(zhì)量、社區(qū)特征和環(huán)境安全性之間的相互作用。2005年Van Dorst總結(jié)了宜居性的3個(gè)視角:顯著的宜居性是人與環(huán)境的最佳匹配;以人為本體驗(yàn)環(huán)境的宜居性;從可定義的生活環(huán)境去推斷宜居性[26]。
荷蘭政府自1998年以來(lái)設(shè)計(jì)并分發(fā)了大量調(diào)查問(wèn)卷,定期對(duì)全國(guó)部分地區(qū)進(jìn)行宜居性調(diào)查和統(tǒng)計(jì)。調(diào)查問(wèn)卷的內(nèi)容主要包括:基于居住環(huán)境、綠色與文體設(shè)施、公共空間基礎(chǔ)設(shè)施、社會(huì)環(huán)境、安全性等方面的滿意度,對(duì)宜居性從極低到極高(1~9)打分。
同時(shí), 荷蘭住建部(Ministerie van Volkshuisvesting, Ruimtelijke Ordening en Milieu, VROM)委托國(guó)立公共健康和環(huán)境研究院(Rijksinstituut voor Volksgezondheid en Milieu, RIVM)以及荷蘭建筑環(huán)境研究院(Het Research Instituut Gebouwde Omgeving, RIGO)依據(jù)調(diào)查問(wèn)卷結(jié)果進(jìn)行深入分析。他們首先對(duì)過(guò)去150年內(nèi)建筑、城市規(guī)劃、社會(huì)學(xué)、經(jīng)濟(jì)學(xué)角度關(guān)于宜居性的研究進(jìn)行了相關(guān)文獻(xiàn)梳理,發(fā)現(xiàn)對(duì)于宜居性的定義和研究長(zhǎng)期以來(lái)存在廣泛的不同定義甚至分歧。他們通過(guò)文獻(xiàn)研究認(rèn)為:如果想在關(guān)于宜居性的研究領(lǐng)域取得突破,那么就必須建立一個(gè)超越當(dāng)前文獻(xiàn)中學(xué)科差異的關(guān)于宜居性的多學(xué)科融合的理論框架。為此,他們提出了與客觀環(huán)境相對(duì)應(yīng)的宜居性、感知和行為的主觀評(píng)估系統(tǒng)。該系統(tǒng)包括:研究環(huán)境和人的各方面如何影響對(duì)生活環(huán)境宜居性的感知;縱向研究宜居性的交叉特征;對(duì)宜居性決定性因素進(jìn)行跨文化比較,旨在根據(jù)時(shí)間、地點(diǎn)和文化確定普遍要素、基本需求和相對(duì)要素。
從調(diào)查問(wèn)卷結(jié)果來(lái)看,該研究認(rèn)為生活質(zhì)量是連接人的主觀評(píng)價(jià)和客觀環(huán)境的一個(gè)研究切入點(diǎn)。而在評(píng)價(jià)生活質(zhì)量時(shí),社區(qū)尺度的環(huán)境、經(jīng)濟(jì)和社會(huì)質(zhì)量的因子選擇是至關(guān)重要的。他們列出50個(gè)因子作為評(píng)價(jià)荷蘭社區(qū)宜居性的指標(biāo),并將這些因子分為居住條件、公共空間、環(huán)境基礎(chǔ)設(shè)施、人口構(gòu)成、社會(huì)條件、安全性等幾個(gè)領(lǐng)域(圖2)。他們提出盡快利用大數(shù)據(jù)和開(kāi)發(fā)高級(jí)AI預(yù)測(cè)工具支持城市建設(shè)、決策和制定規(guī)劃的緊迫性[27]。
為了響應(yīng)后工業(yè)社會(huì)的到來(lái),Battey在 20世紀(jì)90年代率先提出了“智慧城市”的概念。由于當(dāng)時(shí)的大數(shù)據(jù)還處于初期階段,Battey只強(qiáng)調(diào)了互聯(lián)網(wǎng)技術(shù)在增強(qiáng)信息交流和城市競(jìng)爭(zhēng)力中的重要性[28]。鑒于智慧城市的內(nèi)涵太廣泛,并且涉及整個(gè)城市系統(tǒng),因此智慧城市很難獲得統(tǒng)一的認(rèn)同。目前,一個(gè)由6個(gè)子系統(tǒng)構(gòu)成的智慧城市框架逐步被許多學(xué)者接受,其中智慧公民、智慧環(huán)境和智慧生活是3個(gè)重要的環(huán)節(jié)[29]。這符合荷蘭宜居性理論研究關(guān)鍵結(jié)論中關(guān)于社區(qū)居民、主觀宜居性和客觀環(huán)境質(zhì)量間互動(dòng)關(guān)系的描述。
如RIVM和RIGO研究中心的結(jié)論所示,經(jīng)典分析研究結(jié)果和問(wèn)卷調(diào)查結(jié)果基本吻合。但是他們認(rèn)為所有使用傳統(tǒng)范式的研究都存在一定局限性,無(wú)論它們來(lái)自問(wèn)卷、觀察、系統(tǒng)理論、數(shù)學(xué)模型還是統(tǒng)計(jì)方法,在方法論創(chuàng)新上都沒(méi)有太大的區(qū)別。因此,除了傳統(tǒng)研究之外,有必要探索利用大數(shù)據(jù)的新方法并開(kāi)發(fā)先進(jìn)的AI工具,這也將為宜居性評(píng)估創(chuàng)造新條件。第四范式帶來(lái)的上述大數(shù)據(jù)的挑戰(zhàn)和機(jī)遇恰好為這種轉(zhuǎn)變提供了機(jī)會(huì)。這項(xiàng)研究的目的就是開(kāi)發(fā)一種基于機(jī)器學(xué)習(xí)的新型數(shù)據(jù)密集型AI工具箱,以監(jiān)測(cè)荷蘭人居環(huán)境中的宜居性。
新工具箱旨在最大限度地從所有開(kāi)源數(shù)據(jù)中提取數(shù)據(jù)并開(kāi)發(fā)相關(guān)變量,然后它將通過(guò)高級(jí)數(shù)據(jù)工程和數(shù)據(jù)庫(kù)技術(shù)完成轉(zhuǎn)換、集成和存儲(chǔ)數(shù)據(jù)。此后,宜居性問(wèn)卷結(jié)果獲取的宜居性等級(jí)將作為機(jī)器學(xué)習(xí)中的預(yù)測(cè)目標(biāo)。在數(shù)據(jù)倉(cāng)庫(kù)中,將來(lái)自調(diào)查問(wèn)卷的歷史宜居性等級(jí)與同一歷史時(shí)期內(nèi)最相關(guān)的變量進(jìn)行集成,以建立機(jī)器學(xué)習(xí)的預(yù)測(cè)模型。先進(jìn)的AI算法能夠根據(jù)最相關(guān)變量的新的數(shù)據(jù)輸入來(lái)預(yù)測(cè)未來(lái)的宜居性。同時(shí),可以將其與傳統(tǒng)范式得出的結(jié)論進(jìn)行比較,以確定通過(guò)第四范式發(fā)現(xiàn)新知識(shí)的有效性和優(yōu)勢(shì)。新的輸入可以支持機(jī)器學(xué)習(xí)的再訓(xùn)練,以改善模型。圖3總結(jié)了基于大數(shù)據(jù)的關(guān)鍵研究框架。
基于大數(shù)據(jù)爆炸及相伴產(chǎn)生的第四范式,這種新的預(yù)測(cè)工具箱將不依賴于現(xiàn)有常用研究范式,而是首先搜尋可用的開(kāi)源大數(shù)據(jù)、再通過(guò)數(shù)據(jù)密集型的機(jī)器學(xué)習(xí)來(lái)研究這些數(shù)據(jù),并構(gòu)建算法和預(yù)測(cè)模型。就可以在提供相應(yīng)參數(shù)的情況下對(duì)任何社區(qū)的宜居性進(jìn)行科學(xué)預(yù)測(cè)甚至提前干預(yù),為智慧城市的宜居性評(píng)價(jià)和規(guī)劃打下基礎(chǔ)。
社區(qū)宜居性等級(jí)的歷史記錄是通過(guò)RIVM問(wèn)卷和RIGO研究獲得的,而這些社區(qū)的人口、經(jīng)濟(jì)、社會(huì)和環(huán)境領(lǐng)域的所有可用變量均來(lái)自同一時(shí)期荷蘭中央統(tǒng)計(jì)局(Het Centraal Bureau voor de Statistiek, CBS)和其他開(kāi)源數(shù)據(jù)的數(shù)據(jù)集。這2個(gè)數(shù)據(jù)集可以通過(guò)它們的郵政編碼相互連接,派生出的數(shù)據(jù)用于形成可能的機(jī)器學(xué)習(xí)數(shù)據(jù)集。由于數(shù)據(jù)來(lái)源不同、格式不同、規(guī)模大且雜亂無(wú)章,并且具有不同的實(shí)時(shí)更新頻率,因此它們符合大數(shù)據(jù)的最典型特征,即“四個(gè)V”:數(shù)量(volume)、種類(variety)、準(zhǔn)確性(veracity)和速度(velocity)[30]。首先必須執(zhí)行必要的數(shù)據(jù)工程學(xué)流程,以滿足機(jī)器學(xué)習(xí)對(duì)數(shù)據(jù)質(zhì)量的基本要求。
量子力學(xué)的先驅(qū),維爾納·海森堡指出,“必須記住,我們觀察到的不是自然本身,而是暴露在我們質(zhì)疑方法下的自然”。傳統(tǒng)范式研究帶來(lái)的主觀認(rèn)知局限性是顯而易見(jiàn)的。而人工智能和大數(shù)據(jù)的涌現(xiàn),無(wú)疑提供了第四范式這樣一個(gè)更加客觀的認(rèn)知方法論,來(lái)分析隱藏在繁雜多變的數(shù)據(jù)后面的神秘自然規(guī)律。因而,Wolkenhauer將數(shù)據(jù)工程總結(jié)為認(rèn)知科學(xué)和系統(tǒng)科學(xué)的完美結(jié)合,并稱之為知識(shí)工程中數(shù)據(jù)和模型相互匹配的最佳實(shí)踐(圖4)[31]。
根據(jù)Cuesta的數(shù)據(jù)工程學(xué)工藝[32],在預(yù)處理繁復(fù)數(shù)據(jù)集合時(shí),我們可以通過(guò)數(shù)據(jù)流程管理、數(shù)據(jù)庫(kù)設(shè)計(jì)、數(shù)據(jù)平臺(tái)架構(gòu)、數(shù)據(jù)管道構(gòu)建、數(shù)據(jù)計(jì)算機(jī)語(yǔ)言腳本編程等關(guān)鍵流程來(lái)實(shí)現(xiàn)數(shù)據(jù)的獲取、轉(zhuǎn)換、清理、建模和存儲(chǔ)。這個(gè)復(fù)雜過(guò)程可以用最典型的ETL(Extract-Transform-Load)流程簡(jiǎn)述(圖5)。
在通過(guò)上述復(fù)雜過(guò)程對(duì)所有不同來(lái)源的數(shù)據(jù)進(jìn)行清理和規(guī)范化處理之后,應(yīng)為數(shù)據(jù)倉(cāng)庫(kù)(data warehouse, DWH)的構(gòu)建設(shè)計(jì)合適的數(shù)據(jù)模型。數(shù)據(jù)應(yīng)定期存儲(chǔ)在數(shù)據(jù)倉(cāng)庫(kù)中,以利于深入分析并及時(shí)進(jìn)行機(jī)器學(xué)習(xí)?;诋?dāng)前數(shù)據(jù)環(huán)境,并以宜居性為核心預(yù)測(cè)目標(biāo),設(shè)計(jì)了星形數(shù)據(jù)模型。在此關(guān)系數(shù)據(jù)模型中,在中心構(gòu)建記錄每個(gè)社區(qū)宜居性級(jí)別的事實(shí)表,并通過(guò)主鍵和外鍵將其與各個(gè)域的維度表相關(guān)聯(lián)。該數(shù)據(jù)模型的設(shè)計(jì)參考了上述針對(duì)宜居因子分類框架的經(jīng)典研究方法的相關(guān)結(jié)果(表1,圖2)。6個(gè)主要維度分別是人口維度、社會(huì)維度、經(jīng)濟(jì)維度、住房維度、基礎(chǔ)設(shè)施維度、土地利用和環(huán)境維度(圖6)。
實(shí)際上,通過(guò)前數(shù)據(jù)工程收集和建模得到的源數(shù)據(jù)通常高度混亂,整體質(zhì)量低下,不適合直接用于機(jī)器學(xué)習(xí)。需要進(jìn)行廣泛的數(shù)據(jù)清理,以符合機(jī)器學(xué)習(xí)對(duì)數(shù)據(jù)質(zhì)量的基本要求。盡管它絕對(duì)不是機(jī)器學(xué)習(xí)最動(dòng)人的部分,但它代表了每個(gè)專業(yè)數(shù)據(jù)科學(xué)家必須面對(duì)的流程之一。此外,數(shù)據(jù)清理是一項(xiàng)艱巨而煩瑣的任務(wù),需要占用數(shù)據(jù)科學(xué)家50%~80%的精力。眾所周知,“更好的數(shù)據(jù)集往往勝過(guò)更智能的算法”。換句話說(shuō),即使運(yùn)用簡(jiǎn)單的算法,正確清理過(guò)的數(shù)據(jù)集也能提供最深刻的見(jiàn)解。當(dāng)數(shù)據(jù)“燃料”中存在大量雜質(zhì)時(shí),即使是最佳算法(即“機(jī)器”)也無(wú)濟(jì)于事。
鑒于數(shù)據(jù)清理工作如此重要,我們首先必須明白什么是合格的數(shù)據(jù)。一般而言,合格數(shù)據(jù)應(yīng)該至少具備以下質(zhì)量標(biāo)準(zhǔn)(圖7)[33]。1)有效性。數(shù)據(jù)必須滿足業(yè)務(wù)規(guī)則定義的有效約束或度量程度的有效范圍,包括滿足數(shù)據(jù)范圍、數(shù)據(jù)唯一性、有效值、跨字段驗(yàn)證等約束。2)準(zhǔn)確性。與測(cè)量值或標(biāo)準(zhǔn)以及真實(shí)值、唯一性和不可重復(fù)性的符合程度。為了驗(yàn)證準(zhǔn)確性,有時(shí)必須通過(guò)訪問(wèn)外部附加數(shù)據(jù)源來(lái)確認(rèn)數(shù)值的真實(shí)性。3)完整性。數(shù)據(jù)的缺省或者缺失以及各范圍的數(shù)據(jù)值的完整分布對(duì)機(jī)器學(xué)習(xí)的結(jié)果將產(chǎn)生相應(yīng)的影響。如果系統(tǒng)要求某些字段不應(yīng)為空,則可以指定一個(gè)表示“未知”或“缺失”的值加以標(biāo)記,但僅僅提供默認(rèn)值并不意味著數(shù)據(jù)已具備完整性。4)一致性。指的是一套度量在整個(gè)系統(tǒng)中的等效程度。當(dāng)數(shù)據(jù)集中的2個(gè)數(shù)據(jù)項(xiàng)相互矛盾時(shí),就會(huì)發(fā)生不一致。數(shù)據(jù)的一致性包括內(nèi)容、格式、單位等。
顯然,不同類型的數(shù)據(jù)需要不同類型的清理算法。在評(píng)估了適用于機(jī)器學(xué)習(xí)的宜居性數(shù)據(jù)集的特征和初始質(zhì)量之后,應(yīng)使用以下方法清理所提議的數(shù)據(jù)。1)刪除不需要的或無(wú)關(guān)的觀察結(jié)果。2)修正結(jié)構(gòu)錯(cuò)誤。在測(cè)量、數(shù)據(jù)傳輸或其他“不良內(nèi)部管理”過(guò)程中會(huì)出現(xiàn)結(jié)構(gòu)錯(cuò)誤。3)檢查數(shù)據(jù)的標(biāo)簽錯(cuò)誤,即對(duì)具有相同含義的不同標(biāo)簽進(jìn)行統(tǒng)一處理。4)過(guò)濾不需要的離群值。離群值可能會(huì)在某些類型的模型中引起問(wèn)題。如果有充分的理由刪除或替換了異常值,則學(xué)習(xí)模型應(yīng)表現(xiàn)得更好。但是,這項(xiàng)工作應(yīng)謹(jǐn)慎進(jìn)行。5)刪除重復(fù)數(shù)據(jù),以避免機(jī)器學(xué)習(xí)的過(guò)度擬合。6)處理丟失的數(shù)據(jù)是機(jī)器學(xué)習(xí)中一個(gè)具有挑戰(zhàn)性的問(wèn)題。由于大多數(shù)現(xiàn)有算法不接受缺失值,因此必須通過(guò)“數(shù)據(jù)插補(bǔ)”技術(shù)進(jìn)行調(diào)整,例如:刪除具有缺失值的行;用“0”、平均值或中位數(shù)替換丟失的數(shù)字標(biāo)值。缺失值也可以使用基于特殊算法的其他非缺失值的變量來(lái)估算。實(shí)際的具體處理應(yīng)根據(jù)值的實(shí)際含義和應(yīng)用場(chǎng)景確定。
通過(guò)以上專業(yè)操作,數(shù)據(jù)中重復(fù)或者相同社區(qū)不同名字的社區(qū)單元被合并,同樣社區(qū)的不同數(shù)字標(biāo)簽經(jīng)過(guò)自動(dòng)對(duì)比與整合,錯(cuò)誤的標(biāo)簽和數(shù)據(jù)被校正,一些異常值經(jīng)再度確認(rèn)后將被刪除,對(duì)重復(fù)數(shù)據(jù)進(jìn)行去重復(fù)處理。缺失的數(shù)據(jù)按不同情況進(jìn)行相關(guān)“數(shù)據(jù)插補(bǔ)”處理后,使數(shù)據(jù)基本達(dá)到機(jī)器學(xué)習(xí)的要求。
作為數(shù)據(jù)預(yù)處理的重要組成部分,特征工程有助于構(gòu)建后續(xù)的機(jī)器學(xué)習(xí)模型以及知識(shí)發(fā)現(xiàn)的關(guān)鍵窗口。特征工程是一種搜索相關(guān)特征的過(guò)程,該特征可以利用AI算法和專業(yè)知識(shí)來(lái)最大化機(jī)器學(xué)習(xí)的效率,還用作機(jī)器學(xué)習(xí)應(yīng)用程序的基礎(chǔ)。但是提取特征非常困難且耗時(shí),并且該過(guò)程需要大量的專業(yè)知識(shí)。斯坦福大學(xué)教授Andrew Ng指出:“應(yīng)用型機(jī)器學(xué)習(xí)的主要任務(wù)是特征工程學(xué)?!盵34]
在這個(gè)宜居性機(jī)器學(xué)習(xí)模型中,首先分別研究了各維度表內(nèi)部的特征,以獲得局部特征的排名;接下來(lái)對(duì)全部維度表進(jìn)行整體研究,以獲取全局特征的排名。通過(guò)局部特征排名可以了解各維度特征對(duì)宜居性的局部影響,從而有助于發(fā)現(xiàn)隱藏在大數(shù)據(jù)中的新知識(shí)。全局特征則被用于為機(jī)器學(xué)習(xí)模型構(gòu)建特定的算法,以達(dá)到最高的預(yù)測(cè)精度。
通過(guò)機(jī)器學(xué)習(xí)發(fā)現(xiàn):人口維度、社會(huì)維度、經(jīng)濟(jì)維度、住房維度、基礎(chǔ)設(shè)施維度以及土地利用與環(huán)境維度的局部特征對(duì)宜居性的局部影響權(quán)重排序各不相同(圖8)?,F(xiàn)有數(shù)據(jù)顯示,土地利用與環(huán)境因素對(duì)宜居性的影響是最不均衡的。
在第一組維度(人口維度)中,宜居性影響因子權(quán)重排名最靠前的依次是社區(qū)婚姻狀態(tài)、人口密度、25~44歲人口比率、帶小孩家庭數(shù)量等人口特征。提示只有一定的人口密度才能形成宜居性,而青壯年人口、婚姻穩(wěn)定狀態(tài)、有孩子家庭數(shù)量對(duì)社區(qū)宜居性具有良性作用。
在第二組維度(社會(huì)維度)中,宜居性影響因子權(quán)重排名顯示,獲得社會(huì)救濟(jì)的人口比例,來(lái)自摩洛哥、土耳其和蘇里南等地的非西方移民比例,以及社區(qū)犯罪率等指標(biāo)對(duì)社區(qū)宜居性具有較大的負(fù)面影響。
在第三組維度(經(jīng)濟(jì)維度)中,宜居性影響因子權(quán)重排名最靠前的依次是家庭平均年收入、每家擁有汽車平均數(shù)、家庭購(gòu)買力等方面的指標(biāo)。
在第四組維度(住房維度)中,宜居性影響因子權(quán)重排名最靠前的依次是政府廉租房比率、小區(qū)內(nèi)買房自居家庭數(shù)、新建住宅數(shù)、房屋空置率等方面的指標(biāo)。
在第五組維度(基礎(chǔ)設(shè)施維度)中,社區(qū)內(nèi)超市數(shù)量、學(xué)校托幼機(jī)構(gòu)數(shù)量、醫(yī)療健康機(jī)構(gòu)數(shù)量、健身設(shè)施數(shù)量、餐飲娛樂(lè)設(shè)施數(shù)量以及這些設(shè)施的可及性對(duì)社區(qū)宜居性影響較大。
在第六組維度(土地利用與環(huán)境維度)中,城市化程度、公園、綠道、水體、交通設(shè)施以及到不同土地功能區(qū)的可達(dá)性對(duì)社區(qū)宜居性影響較大。
全局特征組已被應(yīng)用來(lái)建立用于構(gòu)建機(jī)器學(xué)習(xí)模型的特定算法,以獲得最高的預(yù)測(cè)精度。在按全局變量影響因子排序的140個(gè)可收集變量中,只有第二組維度(社會(huì)維度)、第三組維度(經(jīng)濟(jì)維度)和第四組維度(住房維度)的變量出現(xiàn)在最具影響力因子前20名中。這表明,總體而言,社會(huì)、經(jīng)濟(jì)和住房方面對(duì)社區(qū)的宜居性具有更大的影響。通過(guò)對(duì)具體排名簡(jiǎn)化整合,以下綜合因素出現(xiàn)在最具影響力因子前10名中,是影響社區(qū)宜居性的最具決定性因素:獲得社會(huì)救濟(jì)的人口比例、非西方移民比例、政府廉租住房比率、購(gòu)買住房的平均市場(chǎng)價(jià)格、高收入人口比率、固定收入住戶數(shù)量、新住房數(shù)量、總體犯罪率、戶年均天然氣消耗量以及戶年均電力消耗量(圖9)。
完成上述必要的數(shù)據(jù)掃描和研究工作后,我們將進(jìn)入機(jī)器學(xué)習(xí)的最核心階段:開(kāi)發(fā)算法并優(yōu)化模型。這個(gè)階段的工作重點(diǎn)是獲得最佳的預(yù)測(cè)結(jié)果。
因?yàn)楸仨毾葮?biāo)注數(shù)據(jù)以便指定預(yù)測(cè)目標(biāo)從而進(jìn)行學(xué)習(xí),所以本機(jī)器學(xué)習(xí)實(shí)際上為監(jiān)督式機(jī)器學(xué)習(xí)(supervised machine learning)。在進(jìn)行機(jī)器學(xué)習(xí)前通過(guò)必要的數(shù)據(jù)掃描和研究,得到一個(gè)初步的數(shù)據(jù)評(píng)估結(jié)論。根據(jù)上述數(shù)據(jù)清理工程原理進(jìn)行大量的數(shù)據(jù)清理,得到滿足機(jī)器學(xué)習(xí)基本標(biāo)準(zhǔn)的數(shù)據(jù)。再通過(guò)上述數(shù)據(jù)特征工程得到綜合簡(jiǎn)化后的10個(gè)最強(qiáng)影響因子作為預(yù)測(cè)因子參與后續(xù)機(jī)器學(xué)習(xí),以便生成算法和建模。
然后將上述數(shù)據(jù)集劃分為訓(xùn)練數(shù)據(jù)集和測(cè)試數(shù)據(jù)集,劃分比例為7∶3。將第一組數(shù)據(jù)集(訓(xùn)練數(shù)據(jù))輸入機(jī)器學(xué)習(xí)算法得到訓(xùn)練模型,對(duì)訓(xùn)練模型進(jìn)行打分。將第二組數(shù)據(jù)集(測(cè)試數(shù)據(jù))輸入訓(xùn)練模型進(jìn)行比較和評(píng)估。本機(jī)器學(xué)習(xí)目標(biāo)是宜居性的不同等級(jí),屬于多分類問(wèn)題機(jī)器學(xué)習(xí),擬采取兩組常用決策林算法進(jìn)行比選優(yōu)化:多類決策叢林算法和多類決策森林算法。這2種通用算法的工作原理都是構(gòu)建多個(gè)決策樹(shù),然后對(duì)最常見(jiàn)的輸出類進(jìn)行投票。投票是一種聚合形式,在這種形式中,分類決策林中的每個(gè)樹(shù)都輸出標(biāo)簽的非規(guī)范化頻率直方圖。聚合過(guò)程對(duì)這些直方圖求和并對(duì)結(jié)果進(jìn)行規(guī)范化處理,以獲取每個(gè)標(biāo)簽的“概率”。預(yù)測(cè)置信度較高的樹(shù)在最終決定系綜時(shí)具有更大的權(quán)重。通常,決策林是非參數(shù)模型,這意味著它們支持具有不同分布的數(shù)據(jù)。在每個(gè)樹(shù)中,為每個(gè)類運(yùn)行一系列簡(jiǎn)單測(cè)試,從而增加樹(shù)結(jié)構(gòu)的級(jí)別,直到到達(dá)葉節(jié)點(diǎn)(決策)能最佳滿足預(yù)測(cè)目標(biāo)。
該機(jī)器學(xué)習(xí)的決策森林分類器由決策樹(shù)的集合組成。通常,與單個(gè)決策樹(shù)相比,集成模型提供了更好的覆蓋范圍和準(zhǔn)確性。通過(guò)后臺(tái)代碼將算法部署到云計(jì)算環(huán)境中運(yùn)行,生成的該機(jī)器學(xué)習(xí)的具體工作流程如圖10所示。在部署到云端之后,仍然需要在后臺(tái)定期使用新輸入的數(shù)據(jù)來(lái)改進(jìn)算法,換言之,通過(guò)重新訓(xùn)練來(lái)更新和完善數(shù)據(jù)模型和算法。圖11顯示了機(jī)器學(xué)習(xí)的整個(gè)生命周期。
從機(jī)器學(xué)習(xí)的后臺(tái)中提取出兩組算法的混淆矩陣(圖12)。與多類決策叢林算法相比,多類決策森林算法具有更好的性能。多類決策叢林算法的主要錯(cuò)誤是,易將宜居性1~2級(jí)高估為3~4級(jí),且對(duì)5~9級(jí)的預(yù)測(cè)不夠準(zhǔn)確。在多類決策森林算法中很少發(fā)生類似的錯(cuò)誤,因此它具備更好的總體性能。另外,多類決策叢林算法總體預(yù)測(cè)準(zhǔn)確率為76%,而多類決策森林算法總體預(yù)測(cè)準(zhǔn)確率為96%,高于前者,所以決定在云端生產(chǎn)環(huán)境中部署多類決策森林算法(表2)。
表2 兩種不同機(jī)器學(xué)習(xí)算法的預(yù)測(cè)性能比較Tab. 2 A comparison of predictive performances of two different machine learning algorithms
對(duì)全荷蘭人居環(huán)境宜居性進(jìn)行反復(fù)機(jī)器學(xué)習(xí)和預(yù)測(cè)后,可在全國(guó)地圖上進(jìn)行可視化和監(jiān)測(cè)(圖13)。其中顏色越偏綠的區(qū)域宜居性越高,越偏紅的地方宜居性越低。該圖顯示預(yù)測(cè)宜居性高低分布全國(guó)相對(duì)均衡,在鏈型城市帶(Randstad)和靠近德國(guó)東部地區(qū)的高宜居性區(qū)域相對(duì)比較集中。宜居性比較低的相對(duì)集中在近年圍海造地形成的新省份弗萊福蘭(Flevoland),可能是人口密度較低以及配套基礎(chǔ)設(shè)施比較滯后造成的。
此外,該預(yù)測(cè)工具還能夠?qū)χ杏^層面的城市群和微觀層面的社區(qū)進(jìn)行深入的研究和預(yù)測(cè)。如大鹿特丹地區(qū)和海牙地區(qū)的宜居性預(yù)測(cè)結(jié)果表明(圖14),一些老城區(qū)的市區(qū)宜居性不高,而郊區(qū)的宜居性通常較高。特別是鹿特丹和代爾夫特交界處的北郊,以及海牙西北部的沿海地區(qū),相對(duì)宜居且人口密集。
第四范式荷蘭智慧城市宜居性預(yù)測(cè)研究的主要結(jié)果表明,基于可用大數(shù)據(jù)和必要的數(shù)據(jù)工程,由人工智能算法可直接推演得到最影響人居環(huán)境宜居性的十大主導(dǎo)要素,簡(jiǎn)化后總結(jié)為:獲得社會(huì)救濟(jì)的人口比例、非西方移民比例、政府廉租住房比率、購(gòu)買住房的平均市場(chǎng)價(jià)格、高收入人口比率、固定收入住戶數(shù)量、新住房數(shù)量、總體犯罪率、戶年均天然氣消耗量以及戶年均電力消耗量。此外,可以通過(guò)輸入最新數(shù)據(jù)集和機(jī)器再學(xué)習(xí)改進(jìn)模型來(lái)更新變量,以執(zhí)行環(huán)境宜居性的實(shí)時(shí)預(yù)測(cè)。
這項(xiàng)研究的結(jié)果可以應(yīng)用于宜居性分析的4個(gè)不同階段(圖15):宜居性描述研究、宜居性診斷研究、宜居性預(yù)測(cè)研究、宜居性預(yù)視研究。因此,它可以根據(jù)實(shí)時(shí)更新的大數(shù)據(jù)和經(jīng)過(guò)重新訓(xùn)練的算法,對(duì)人類住區(qū)中的宜居性進(jìn)行監(jiān)測(cè)和早期干預(yù)。
將根據(jù)第四范式進(jìn)行的本研究與前述傳統(tǒng)范式的研究進(jìn)行比較,發(fā)現(xiàn)不需要依賴傳統(tǒng)人工智能系統(tǒng)的專家體系(expert system)或者專業(yè)研究人員的長(zhǎng)期大量研究積累就能得到一些最有效的知識(shí)發(fā)現(xiàn)和高精度的預(yù)測(cè)模型。通過(guò)機(jī)器學(xué)習(xí)得到的宜居性研究結(jié)論,無(wú)論在主導(dǎo)要素、局域特征還是全局特征上,基本與RIVM與RIGO宜居性的相關(guān)定性研究結(jié)果相互吻合。此外,本研究能夠以定量的方式對(duì)預(yù)測(cè)中最具決定性的因素進(jìn)行快速排名,從而使科學(xué)研究更加高效、快捷,并且實(shí)現(xiàn)實(shí)時(shí)數(shù)據(jù)更新和預(yù)測(cè)。
本研究可用數(shù)據(jù)集中的土地利用與環(huán)境簇群依然偏少,導(dǎo)致其錐型圖比較尖銳。這個(gè)不足之處需要在將來(lái)通過(guò)收集更多土地利用與環(huán)境相關(guān)變量執(zhí)行強(qiáng)化學(xué)習(xí),進(jìn)行更多的知識(shí)發(fā)現(xiàn),拓寬預(yù)測(cè)模型的觀察視野。
另外,在收集和處理實(shí)時(shí)大量數(shù)據(jù)時(shí)需要更強(qiáng)大的數(shù)據(jù)收集與處理能力、計(jì)算能力和更復(fù)雜的運(yùn)算環(huán)境。通過(guò)最新的5G、物聯(lián)網(wǎng)和量子計(jì)算等新科技,將來(lái)研究可以收集更多、更復(fù)雜甚至非結(jié)構(gòu)性實(shí)時(shí)數(shù)據(jù)擴(kuò)展當(dāng)前研究,從而具備更廣闊的智慧城市運(yùn)用前景。
圖表來(lái)源:
圖1引自參考文獻(xiàn)[12];圖2引自參考文獻(xiàn)[27];圖3、5、6、8~14由作者繪制;圖4引自參考文獻(xiàn)[31];圖7引自參考文獻(xiàn)[33];圖15由作者根據(jù)Gartner概念繪制;表1~2由作者繪制。
(編輯/王一蘭)
WU Jun
1 Research Background
1.1 The Fourth Industrial Revolution: Entering a New Age of Human-Orientated Artificial Intelligence
Since the mid-18th century, mankind has gone through three industrial revolutions. The first industrial revolution ushered in the “steam age” (1760—1840), which marked our transition from agriculture to industry, and represented the first great miracle in the history of human development. The second industrial revolution launched the “electric age” (1840—1950), which led to the rise of heavy industries such as electricity, steel, railways, chemicals and automobiles, with oil becoming a new energy source. It promoted the rapid development of transport and more frequent exchanges among countries around the world, and reshaped the global political and economic landscape. The third industrial revolution, which began after the two world wars, initiated the “information age” (1950—present). As it becomes easier to carry out global exchanges of information and resources, most countries and regions are involved in the process of globalization and informatization, and human civilization has reached an unprecedented level of development.
On 20 January 2016, the World Economic Forum held its annual meeting in Davos, Switzerland, with a focus on the theme “The Fourth Industrial Revolution”, and officially heralded the arrival of the fourth industrial revolution that shall bring about a radical change to the global development process. In his book The Fourth Industrial Revolution, Professor Schwab, founder and executive chairman of the World Economic Forum, elaborated on the profound impact of new technologies, such as implantable technology, genetic sequencing, Internet of Things (IoT), 3D printing, autonomous vehicles, artificial intelligence, robotics, quantum computing, blockchain, big data, and smart cities, on human society in the “Age of Intelligence”. In this industrial revolution, big data will gradually replace oil as the first resource. The pace, scope, and extent of its development shall far exceed those of the previous three industrial revolutions. It shall rewrite the fate of mankind and make a great impact on the development of almost all traditional industries, including architecture, landscape and urban development[1].
1.2 The Fourth Paradigm: The Urgency of Exploring the New Paradigm of“Data-Intensive” Research
Our future and our lives are being changed by the data explosion brought about by the fourth industrial revolution. Over the past decade or more, there has been exponential growth in the total volume, variety, veracity, and velocity of data[2]. By 2013, the amount of electronic data generated worldwide had reached 46 billion terabytes, equivalent to about 400 trillion traditional printed reports, which, laid end to end, could stretch from Earth to Pluto. In the past two years alone, we have generated 90% of all the data ever created by mankind[3]. Regarded as an important strategic resource in nations that are competing for innovation in the 21st century, big data is at the forefront of the next round of scientific and technological innovation that developed countries are scrambling to seize[4].
Despite the “New Moore’s Law” data explosion caused by the big data blowout, and the worldwide generation of 44 times more data in 2020 than in 2009[5], less than 1% of the information in the world can be analyzed and translated into new knowledge[6], due to the lack of a new research paradigm to respond to the data explosion. Anderson pointed out that apart from triggering a surge in quantity, the data explosion also represents a mixture and explosion of data in terms of volume, variety, veracity, and velocity. He defined this hybrid explosion as a “data deluge” and argued that the “data deluge” would become a bottleneck for scientific research in all existing disciplines before the emergence of a new research paradigm[7]1. In the field of urban and architectural design, although we are already living in the “city of bits” predicted by the late MIT professor William J. Mitchell[8], we are still a distance away from the acquisition of the “power of bits” in urban research and design[9]. We must develop a new research paradigm to counter the strong impact of the data deluge.
The conventional research methods and traditional paradigms are becoming increasingly inadequate. This has compelled some scientists to explore new research paradigms that are suitable for big data and artificial intelligence. Bell, Hey and Szalay warned that all fields of research shall have to face increasing data challenges, and the handling and analysis of the “data deluge” are becoming increasingly onerous and vital for all researchers and scientists. As such, Bell, Hey and Szalay proposed the “fourth paradigm” to deal with the “data deluge”. They concurred that, since Newton’s Laws of Motion in the 17th century or earlier, scientists have recognized that experimental and theoretical sciences are two basic research paradigms for the understanding of nature. In recent decades, computer simulation has become an essential third paradigm and a new standard tool for scientists to explore hard-to-reach areas of theoretical research and experimentation. Nowadays, as an increasing amount of data is generated by simulations and experiments, the fourth paradigm is emerging, and this shall be the latest AI technology required to perform data-intensive scientific research[10].
Halevy et al. pointed out that the “data deluge” highlighted the “knowledge bottleneck” in traditional AI. This indicates that the question of how to maximize the extraction of infinite knowledge with limited systems shall be solved and applied in many disciplines through big data and emerging machine learning algorithms brought forth by the fourth paradigm. This is in contrast with traditional paradigms that rely solely on pure mathematical modeling, empirical observation, and complex computation[11].
The Silicon Valley scientists Gray, Hey, Tansley and Tolle have summed up the key features that distinguish the data-intensive “fourth paradigm” from earlier scientific paradigms, as follows (Fig. 1)[12]16-19: 1) Exploration of big data shall lead to an integration of existing theories, experiments, and simulations; 2) Big data can be captured by different IoT devices or generated by simulators; 3) Big data is processed by large parallel computing systems and complex programming to discover valuable information/new knowledge hidden in the big data; 4) Scientists shall obtain novel methodologies to acquire new knowledge through data management and statistical analysis of databases and mass research documents.
The differences between the fourth paradigm and the traditional paradigms in research purposes and approaches are mainly reflected in the following aspects. The traditional paradigms started more or less with theoretical construction from “why” and “how” questions, which was later verified in experimental observations of “what” questions. However, the fourth paradigm works in the opposite direction. It starts with data-intensive surveys of “what” questions, and then uses various algorithms to discover new knowledge and laws hidden in big data, which in turn generate new theories that reveal the “how” and “why”. In his article “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete?”, Anderson pointed out that, firstly, the fourth paradigm is not in a hurry to start theoretical construction from tedious experiments and simulations, or strict definitions, inferences, and assumptions. Instead, it starts with the collection and analysis of large and complex datasets[7]1. Secondly, the valuable knowledge hidden in these huge, complex and intertwined datasets is hard to process and usually cannot be discovered with the traditional scientific research paradigms[12]16-19.
The exploration of urban and landscape research using the fourth paradigm is still in its infancy, both at home and abroad, with a myriad of methods and objectives available. Due to space limitations, this research directly uses various open data integrated with the results of the governmental survey of livability in The Netherlands, and focuses on “livability” as the predictive goal in machine learning, so as to introduce a systematic data-intensive research method and demonstrate the feasibility and prospects of the fourth paradigm in urban and landscape research.
2 Research Status and Objectives
2.1 Summary of the Relevant International Researches on Urban Livability with Traditional Paradigms
As an important indicator used to measure and assess the level of comfort and habitability in urban and landscape environments, livability has become a key point for the development of smart cities in recent years. Adam Beck, executive director of the Smart Cities Council of Australia/New Zealand, defines a smart city as follows: “The smart city is one that uses technology, data and intelligent design to enhance a city’s livability, workability and sustainability.”[13]
“Eudaimonia” (livability) was first proposed in the West by Aristotle, who defined it as “doing and living well”. For a long time, there has been no unified definition of livability. Instead, there are different implications and applications in different stages of urban development, in different regions, and in different disciplines. This has led to confusion and difficulty in implementing the concept of livability. Although livability lacks a unified and quantifiable measurement system, classical theoretical studies have attempted to explore relevant indicators of livability from the economic, social, political, geographical, and environmental dimensions over the last decades.
Balsas emphasized the decisive role of economic factors in livability. He argued that the foundation of livability is made up of factors such as high employment rates, affordability from a diverse population, economic development, living standards, and accessibility to education and employment[14]103. Litman also pointed out the significant impact of per capita GDP on livability. In addition, the accessibility of residents to transport, education, and public health facilities, as well as their economic power to afford such services, should be recognized as important indicators of livability[15]. Veenhoven’s research also found that the level of GDP development, social needs of the economy, and the purchasing power of residents were vital to the assessment of livability[16]2-3.
Mankiw criticized the use of economy as the only indicator to assess livability and argued that it was inadequate to measure livability solely using per capita GDP[17]. Rojas suggested that other dimensions, such as political, social, geographical, cultural, and environmental factors, should also be taken into account[18]. Furthermore, Veenhoven recommended political freedom, cultural atmosphere, social climate, and safety as additional indicators for the assessment of livability[16]7-8.
Van Vliet’s research showed that social integration, environmental cleanliness, safety, employment rate, and accessibility of infrastructures such as education and medical care had a direct impact on urban livability[19]. Balsas also conceded that factors other than the economy, such as complete infrastructure, adequate park facilities, community spirit, and public participation also played a positive role in urban livability[14]103.
Although the aforementioned scholars had proposed the political, social, economic and environmental foundations to evaluate livability, they have not recommended concrete indexes for the appraisal of urban livability. Goertz has tried to integrate the above elements by systematically evaluating livability with a holistic three-tiers approach and proposing relevant evaluation indicators for each sub-system. The first tier is composed of the types of livability, the second tier includes the framework of factors, while the third tier shows the variables of indicators[20] (Tab. 1).
Tab. 1 The framework of macro livability factors and related measurement indexes based on Goertz’s three-tiers system
Through his summary and integration of classical livability research methods, Goertz has carried out qualitative research on livability at the macro-level, and some of the factors have indicated the direction for future quantitative research. However, he has failed to present the corresponding controllable variables in different urban systems at the meso-level. In response, Sofeska has proposed a framework of evaluation factors in urban systems at the meso-level[21], including security and crime rates, political and economic stability, public tolerance, and business conditions; effective policies, access to goods and services, high national income and low personal risks; education, health and medical levels, demographic composition, longevity and birth rates; environment and recreation facilities, climatic environment, accessibility of natural areas; public transport, and internationalization. In particular, Sofeska has emphasized the impact of building quality, urban design, and effectiveness of infrastructure on the livability of urban systems at the meso-level.
Apart from the nation’s political, economic, social, and environmental systems at the macro-level, as well as urban systems at the meso-level, Giap et al. believed that “livability” should be a community concept at the micro-level. While carrying out qualitative research on livability at the micro-community scale, they emphasized the micro impacts of the quality of urban life and the city’s physical environment on livability, and regarded green infrastructure as a significant indicator[22]. Therefore, Giap et al. stressed the importance of qualitative research on livability at the community scale.
However, neither Goertz, Sofeska nor Giap has established a quantitative evaluation index system or a corresponding method to predict community livability. At present, the quantitative classification survey of livability is mainly derived from two macro systems: the Economist Intelligence Unit (EIU)’s Livability Index, which is based on six major indicators; and Mercer’s Quality of Living Survey, which is based on ten indicators. Nevertheless, these two systems provide only macro-level tools to compare livability among different cities and are mainly based on economic indicators, hence they are not feasible for quantitative research and prediction at the micro-level of communities[23].
2.2 Summary of the Researches on Urban Livability in The Netherlands
In the Dutch dictionary, “leefbaarheid” (livability) is generally defined as: “suitable for living or coexistence” (“geschikt om erin of ermee te kunnen leven”). Therefore, livability in the Dutch context is actually a statement on the appropriate and interactive relationship between the subjects (living beings, individuals or communities) and the environment as the object[24].
In 1969, Groot described livability as the objective social security with the means to obtain a reasonable income and enjoy a reasonable life, thus fulfilling the social subjective understanding of demands for goods and services. This definition, which is in favor of economic and social objectives, implies the distinction between objective and subjective livability. The former involves tangible objective information (labor market, facilities, housing quality, etc.), while the latter relates to the subjective ways in which the actual situation is experienced by people[25]. Livability was brought into regional politics in the 1970s, when people realized that livability should not be centered on buildings but on people. It was more important to consider the quality and not simply the quantity of our material lives. Vermeulen, a city councilor of Rotterdam at the time, described the conversion of this social concept as, “You can figure out the size of your house with the number of bricks, but you can’t know its livability.” In 2002, the Social and Cultural Planning Office of The Netherlands provided the following description of livability: interactions between the physical space, social quality, community characteristics, and environmental security. In 2005, Van Dorst summarized three perspectives on livability: remarkable livability is the optimal match between human and environment, the livability of the environment should be experienced in a people-oriented approach, and livability should be inferred from a definable living environment[26].
Since 1998, the Dutch government has designed and distributed a large number of questionnaires to conduct livability surveys and gather statistics in different parts of the country on a regular basis. The questionnaires mainly comprise livability scores (1-9, from extremely low to extremely high), which reflect the people’s satisfaction with the living environment, green spaces, cultural and sports facilities, public space infrastructure, social environment, security, etc.
Simultaneously, the Dutch Ministry of Housing, Spatial Planning and the Environment (Ministerie van Volkshuisvesting, Ruimtelijke Ordening en Milieu, VROM) commissioned the Dutch Institute for Public Health and Environment (Rijksinstituut voor Volksgezondheid en Milieu, RIVM) and the RIGO Research Centre to conduct in-depth analysis based on the results of the questionnaires. After reviewing the literature on livability over the past 150 years from the perspectives of architecture, urban planning, sociology, and economics, RIVM and RIGO found that a wide range of differences and even divergences in the definition and research of livability had been present for a long time. Having studied relevant literature, the institutes believed that it was necessary to establish a theoretical framework for the multi-disciplinary integration of livability beyond the disciplinary differences in the current literature, so as to achieve breakthroughs in the research of livability. To this end, they proposed a system for the subjective assessment of livability, perception, and behavior corresponding to the objective environment. They recommended studies on how aspects of the environment and of people influence the perception of the livability of the living environment, longitudinal studies on the inter-disciplinary features of livability, as well as cross-cultural comparison of the decisive factors in the assessment of livability, so as to identify the universal elements, basic needs, and relative elements based on time, location, and culture.
Based on the results of the survey, this research identified the quality of life as a research hub to bridge subjective evaluation and the objective environment. The quality of the environment, economy, and society at the community level is crucial in the evaluation of the quality of life. A total of 50 factors have been listed as indicators that can be used to evaluate the livability of Dutch communities. These factors have been divided into several clusters, such as living conditions, public spaces, environmental infrastructure, population composition, social conditions, and security (Fig. 2). They stressed the urgency of using big data and developing advanced AI forecasting tools to support urban construction, decision-making, and planning[27].
2.3 Research Objective and Framework
In response to the advent of the post-industrial society, Battey was the first to propose the concept of “smart city” in the 1990s. As big data was in its early stages at that time, Battey had only stressed the importance of Internet technology in the enhancement of information exchange and the competitiveness of cities[28]. Because the connotations of the smart city are so extensive and involve the entire urban system, it is difficult for the concept to gain unified recognition. At present, a smart city framework with six sub-systems has been gradually accepted by many scholars, in which the smart citizen, smart environment, and smart life represent three important elements[29]. This conforms to the interactive relations among community residents, subjective livability, and objective environmental quality in the key conclusions of the theoretical research on livability in The Netherlands.
As illustrated in the conclusions of the RIVM and the RIGO Research Centre, the results of the classical analytical studies are largely consistent with the questionnaire results. However, there are limitations in all studies using the traditional paradigms, and in terms of methodological innovation it makes little difference whether they are based on questionnaires, observations, systems theory, mathematical models, or statistical methods. Therefore, beyond the traditional studies, it is necessary to explore a new methodology using big data and to develop advanced AI tools, which will also create new conditions for livability evaluation. The aforesaid big data challenges and opportunities arising from the fourth paradigm happen to provide an opportunity for this transition. The objective of this research is to develop such a novel data-intensive toolbox based on machine learning to monitor and even predict livability in the Dutch settlement environment.
The new toolbox aims to maximize the extraction of all open-source data and develop the relevant variables; after that, it will transform, integrate, and store the data through advanced data engineering and database technologies. Thereafter, the livability grades obtained from the livability questionnaires shall serve as the target to be predicted in machine learning. In the data warehouse, the historical livability grades from the questionnaires are integrated with the most relevant variables from the same historical period to build predictive models for machine learning. The developed AI algorithms are able to predict future livability based on new inputs of the most relevant variables. Consequently, the results can be compared with conclusions drawn from the traditional paradigms to identify the effectiveness and advantages of the new knowledge discovered by the fourth paradigm. The new inputs can support retraining in machine learning to improve the model as well. The big-data-based key research framework is summarized in Fig. 3.
3 Research Method and Process
Due to the big data explosion and the resulting fourth paradigm, the new prediction toolbox shall no longer rely on the existing and widely-used research paradigms. Instead, it shall first search for available open-source big data. After that, it will investigate the data via data-intensive machine learning and further build algorithms and prediction models. Thereafter, it may predict and even intervene in the livability of any community with the corresponding parameters in a scientific manner, thus laying a solid foundation for the evaluation and planning of livability in the smart city.
3.1 Data Engineering and Preliminary Data Modeling
The historical grades of community livability were obtained from the RIVM and RIGO questionnaires, while all available variables on the population, economic, social, and environmental fields of these communities were obtained from the Dutch Central Bureau of Statistics (Het Centraal Bureau voor de Statistiek, CBS) and other open-source data from the same period. These two datasets are inner-joined with each other by their corresponding postcodes. The derived data were used to form possible machine learning datasets. As the data came from different sources, in different formats, were large and disorganized, and had varying frequencies of real-time updates, they match the most typical features of big data, namely the “four V’s”: volume, variety, veracity, and velocity[30]. The necessary data engineering process must be carried out to meet the basic requirements for the data quality of machine learning.
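As a minimal illustration of this joining step, the sketch below merges a hypothetical table of historical livability grades with CBS-style neighbourhood variables on a shared postcode key using pandas; the file names and column names are assumptions for illustration only, not the actual data sources.

```python
import pandas as pd

# Hypothetical extracts: livability grades per postcode (from the questionnaire data)
# and neighbourhood variables per postcode (from CBS-style open data).
grades = pd.read_csv("livability_grades.csv")          # columns: postcode, year, livability_grade
variables = pd.read_csv("cbs_neighbourhood_vars.csv")  # columns: postcode, year, population_density, ...

# Inner join on postcode and year, keeping only neighbourhoods present in both
# sources; the result is the candidate machine learning dataset described above.
ml_dataset = grades.merge(variables, on=["postcode", "year"], how="inner")
print(ml_dataset.shape)
```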
Werner Karl Heisenberg, a pioneer of quantum mechanics, pointed out that, “It must be remembered that what we observe is not nature itself, but nature exposed to our questioning methods”. Consequently, the subjective cognitive limitations of traditional paradigm research are evident. The emergence of AI and big data undoubtedly lends a more objective cognitive methodology, namely the fourth paradigm, to analyze the mysterious natural laws hidden behind dynamic and complex data. Wolkenhauer summed up data engineering as the perfect combination of cognitive science and systems science, and called it the best practice for the matching of data and models in knowledge engineering (Fig. 4)[31].
According to Cuesta’s data engineering process[32], data acquisition, conversion, cleansing, modeling, and storage can be achieved through key steps such as data flow management, database design, data platform architecture, data pipeline construction, and data scripting in computer languages during the preprocessing of complex data sets. This complex process can be described using the most typical ETL (Extract-Transform-Load) process (Fig. 5).
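A minimal sketch of such an ETL step, assuming a placeholder CSV source, simple normalization rules, and a file-based SQLite warehouse, might look as follows; it is illustrative only and does not reproduce the actual pipeline.

```python
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: read one raw open-data source (placeholder CSV)."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: normalize column names and the postcode key before loading."""
    df = df.rename(columns=str.lower)
    df["postcode"] = df["postcode"].astype(str).str.strip().str.upper()
    return df

def load(df: pd.DataFrame, table: str, db: str = "dwh.sqlite") -> None:
    """Load: write the cleaned table into the (file-based) data warehouse."""
    with sqlite3.connect(db) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

load(transform(extract("raw_cbs_extract.csv")), table="dim_population")
```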
After data from all the different sources had been cleansed and normalized through the above complex processes, a suitable data model should be designed for the construction of the data warehouse (DWH). Data should be regularly stored in the data warehouse to facilitate in-depth analysis and timely machine learning. A star-schema data model was designed based on the current data circumstances and with livability as the core predictive goal. In this relational data model, a fact table recording the livability grade of each community sits at the center and is associated with the dimensional tables of the various domains through primary and foreign keys. This data model has been designed with reference to the results of the aforesaid classical research based on the framework for livability factor classification (Tab. 1, Fig. 2). The six major dimensions are the population dimension, social dimension, economic dimension, housing dimension, service facility dimension, and land use and environment dimension (Fig. 6).
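The star schema described above could be declared along the following lines; the table and column names are illustrative assumptions, and only two of the six dimension tables are shown.

```python
import sqlite3

# Illustrative star schema: a central fact table holding the livability grade per
# neighbourhood, linked by foreign keys to dimension tables (two of six shown).
schema = """
CREATE TABLE IF NOT EXISTS dim_population (
    population_id      INTEGER PRIMARY KEY,
    postcode           TEXT,
    population_density REAL,
    share_age_25_44    REAL
);
CREATE TABLE IF NOT EXISTS dim_housing (
    housing_id           INTEGER PRIMARY KEY,
    postcode             TEXT,
    social_housing_ratio REAL,
    new_dwellings        INTEGER
);
CREATE TABLE IF NOT EXISTS fact_livability (
    fact_id          INTEGER PRIMARY KEY,
    postcode         TEXT,
    year             INTEGER,
    livability_grade INTEGER,  -- the 1-9 questionnaire grade (prediction target)
    population_id    INTEGER REFERENCES dim_population(population_id),
    housing_id       INTEGER REFERENCES dim_housing(housing_id)
);
"""

with sqlite3.connect("dwh.sqlite") as conn:
    conn.executescript(schema)
```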
3.2 Process of Data Cleansing
In practice, source data collected and modeled through pre-data engineering are often highly disorganized, low in overall quality, and unsuitable for direct machine learning. Extensive data cleansing is required to meet the data quality input requirements for machine learning. While it is definitely not the “sexiest” part of machine learning, it represents one of the required courses for each professional data scientist. Furthermore, data cleansing can be a tough and exhausting “task” that takes up 50%-80% of the energy of a data scientist. It is known that “better data sets tend to outperform smarter algorithms”. In other words, a properly cleansed data set will deliver the deepest insight even with simple algorithms. When a large amount of impurities is present in your data “fuel”, even the best algorithm, i.e. the “machine”, is of no help.
Given the importance of data cleansing, it is vital to understand the criteria of qualified data first. In general, qualified data should meet at least the following quality standards (Fig. 7)[33]. 1) Validity: The data must conform to valid constraints defined by the business rules or to the effective range of measures. They include constraints conforming to the data range, data uniqueness constraints, valid value constraints, and cross-field validation. 2) Accuracy: Degree of conformity to the measurements or standard, and to the true value, uniqueness, and non-repetition. In general, it is sometimes necessary to verify values by accessing external data sources that contain true values as reference. 3) Integrity: Default or missing data and the complete distribution of data values across the ranges shall have a corresponding impact on the results of machine learning. If the system insists that certain fields should not be empty, it is possible to specify a value that represents “unknown” or “missing”. Merely providing the default value does not mean that the data is complete. 4) Consistency: The equivalence of a set of measurements throughout a system. Inconsistency occurs when two data items in a data set are contradictory. Data consistency includes consistency in data content, format, and unit.
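In code, such quality standards can be screened with a few simple checks, as in the sketch below; the column names and the specific rules (grade range, postcode/year uniqueness, ratio bounds) are assumptions chosen purely for illustration.

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> dict:
    """Run simple validity, integrity, and consistency checks on the (assumed) ML table."""
    return {
        # Validity: grades must fall inside the questionnaire's 1-9 range.
        "grade_in_range": bool(df["livability_grade"].between(1, 9).all()),
        # Validity / uniqueness: one record per neighbourhood and year.
        "postcode_year_unique": not df.duplicated(subset=["postcode", "year"]).any(),
        # Integrity: share of missing values per column.
        "missing_share": df.isna().mean().round(3).to_dict(),
        # Consistency: ratio-type variables expressed as fractions must stay within [0, 1].
        "ratios_consistent": bool(df["social_housing_ratio"].between(0, 1).all()),
    }

print(check_quality(ml_dataset))  # ml_dataset: the joined table assembled earlier
```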
It is apparent that different types of data require different types of cleansing algorithms. After evaluating the characteristics and initial quality of this livability data set for machine learning, the proposed data shall be cleansed using the following methods: 1) Delete unwanted or unrelated observation results. 2) Fix structural errors. Structural errors emerge during measurement, data transmission, or other “poor internal management” processes. 3) Check label errors of the data, i.e. initiate unified treatment of different labels with the same meanings. 4) Filter unwanted outliers. Outliers may cause problems in certain types of models. The learning model shall perform better if outliers have been deleted or replaced for good reason. However, this should be carried out with caution. 5) Carry out data deduplication to avoid the overfitting of machine learning. 6) Handle missing data, which is a challenging issue in machine learning. As most existing algorithms do not accept missing values, they have to be imputed through “data imputation” techniques such as: deleting the rows with missing values; replacing the missing numeric value with 0, the average, or the median. Missing values can also be estimated from other non-missing variables using special algorithms. The actual treatment shall be determined according to the actual meanings and application scenarios of the values.
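The cleansing steps listed above might be expressed, in simplified form, as follows; the column names, the 3-standard-deviation outlier rule, and the median imputation are illustrative choices rather than the study's actual parameters.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """A simplified cleansing pass mirroring steps 3)-6) above (illustrative only)."""
    # 5) Remove exact duplicates to avoid over-fitting on repeated rows.
    df = df.drop_duplicates().copy()

    # 3) Unify different labels with the same meaning (hypothetical mapping).
    df["urbanisation"] = df["urbanisation"].replace({"very high": "high", "v. high": "high"})

    # 4) Filter unwanted outliers with a simple 3-standard-deviation rule.
    income = df["avg_household_income"]
    df = df[(income - income.mean()).abs() <= 3 * income.std()]

    # 6) Impute remaining missing numeric values with the column median.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df

ml_dataset = clean(ml_dataset)
```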
Application of the aforementioned operations merges community units that are duplicated or that refer to the same community under different names. Different digital labels of the same community are automatically compared and integrated, erroneous labels and data are corrected, some outliers are deleted after reconfirmation, and repeated data are removed. After missing data have been processed with the corresponding “data imputation” techniques, the data basically meet the requirements of machine learning.
3.3 Feature Engineering
As an important part of data preprocessing, feature engineering helps to build the subsequent machine learning model and serves as a key window for knowledge discovery. Feature engineering is a process of searching for relevant features that maximize the effectiveness of machine learning with AI algorithms and expertise. It also serves as the basis for machine learning application programs. However, it is very difficult and time-consuming to extract features, and the process requires a lot of expertise. Stanford professor Andrew Ng pointed out that, “‘Applied machine learning’ is basically feature engineering.”[34]
In this machine learning model for livability, the features within each dimensional table were first investigated to obtain the rankings of local features. This was followed by an investigation across all dimensional tables to acquire the ranking of global features. The local feature rankings allow the understanding of the impact of local features on livability, thus facilitating the discovery of new knowledge hidden in big data. The global features are applied to construct specific algorithms for the machine learning model so as to achieve the highest prediction accuracy.
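One simple way to obtain such local and global rankings is to fit a tree ensemble and read off its impurity-based feature importances, as sketched below with scikit-learn; this is an assumed stand-in for the study's actual procedure, and the column names are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(df, feature_cols, target="livability_grade"):
    """Fit a forest on the given columns and return the features sorted by importance."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(df[feature_cols], df[target])
    return pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)

# Local ranking within one dimension table, e.g. the housing dimension (columns assumed):
housing_rank = rank_features(ml_dataset, ["social_housing_ratio", "owner_occupied", "new_dwellings", "vacancy_rate"])

# Global ranking across all candidate variables:
all_vars = [c for c in ml_dataset.columns if c not in ("postcode", "year", "livability_grade")]
global_rank = rank_features(ml_dataset, all_vars)
print(global_rank.head(10))  # the ten most influential factors
```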
Machine learning revealed that the local features of the demographic, social, economic, housing, service facility, and land use & environmental dimensions carry very different weights in their local influence on livability (Fig. 8). According to the available data, the impact of the land use & environmental dimension appears the most uneven.
In the first set of dimensions (demographic dimension), the most heavily weighted livability factors were the community marital status, the population density, the ratio of people aged 25-44 years, and the number of families with children (in this order). This suggests that a certain population density is needed for livability to form, while the young and middle-aged population, stable marital status, and the number of families with children wield a positive effect on the livability of the community.
In the second set of dimensions (social dimension), the ranking of the most heavily weighted livability factors suggested that the proportion of the population receiving social relief, the ratio of non-western immigrants from Morocco, Turkey and Suriname, as well as community crime rates wield greater negative effects on the livability of the community.
In the third set of dimensions (economic dimension), the most heavily weighted livability factors were the average annual household income, the average number of cars per household, and the household’s purchasing power (in this order).
In the fourth set of dimensions (housing dimension), the most heavily weighted livability factors were the ratio of governmental social housing, the number of owner-occupied homes, the number of new housing units, and the housing vacancy rate (in this order).
In the fifth set of dimensions (facilities dimension), factors such as the number of supermarkets in the community, the number of schools and childcare institutions, the number of health care institutions, the number of fitness facilities, the number of catering and entertainment facilities, and the accessibility of these facilities have a greater impact on the livability of the community.
In the sixth set of dimensions (land use dimension), the level of urbanization, the total amounts of parks, green corridors, water bodies and transportation facilities, and the accessibility to different land-use functional areas have a greater impact on the livability of the community.
The global feature group has been applied to establish specific algorithms for the construction of the machine learning model, striving for the highest prediction accuracy. Among the 140 collectible variables ranked by global impact factors, only variables from the second set of dimensions (social dimension), the third set of dimensions (economic dimension), and the fourth set of dimensions (housing dimension) appear in the top 20 most influential factors. This suggests that, as a whole, the social, economic, and housing dimensions have a greater impact on the livability of the community. After simplification and integration of the specific ranking, the following factors appear in the top 10 in this order, representing the most decisive factors affecting the livability of the community: the ratio of the population receiving social relief, the ratio of non-western immigrants, the ratio of government low-rent housing, the average market price of purchased housing, the ratio of high-income people, the number of households with fixed incomes, the number of new houses, the overall crime rate, the annual household consumption of natural gas, and the annual household consumption of electricity (Fig. 9).
After completing the necessary workflows stated above, we shall move on to the core stage of machine learning to develop the algorithms and optimize the models, so as to obtain the best prediction results.
4 Research Results
4.1 Key Processes of Machine Learning
Because the data must be labeled to specify the prediction target before machine learning, this research chose supervised machine learning. Prior to training, a preliminary data evaluation was obtained through the necessary data scanning and examination. Following the aforementioned principles of data cleansing engineering, large-scale data cleansing was carried out to obtain data that met the basic standard for machine learning. Following the aforementioned principles of data feature engineering, the top 10 important factors were obtained and used as predictors in the subsequent algorithm modeling.
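A minimal sketch of this labeling and cleansing step follows; the binning of a continuous livability score into nine grades and all column names are assumptions used only to illustrate how the supervised target and the top 10 predictors might be prepared.

```python
# Sketch only: basic cleansing and labeling of the supervised target (assumed columns).
import pandas as pd

raw = pd.read_csv("raw_community_data.csv")

# Basic cleansing: drop duplicates and records missing the key indicator
clean = raw.drop_duplicates().dropna(subset=["livability_score"])

# Label the target: discretize the continuous score into grades 1-9 (assumed scheme)
clean["livability_grade"] = pd.cut(clean["livability_score"],
                                   bins=9, labels=range(1, 10)).astype(int)

# Keep the top 10 predictors identified by feature engineering (names assumed)
predictors = ["social_relief_ratio", "non_western_immigrant_ratio",
              "social_housing_ratio", "avg_house_price", "high_income_ratio",
              "fixed_income_households", "new_houses", "crime_rate",
              "gas_consumption", "electricity_consumption"]
dataset = clean[predictors + ["livability_grade", "community_code"]]
```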
Subsequently, the dataset was split into a training set and a test set at a ratio of 7:3. The training set was fed into the machine learning algorithm to obtain the trained model and its corresponding score, and the test set was then fed into the trained model for comparison and evaluation. This is a multi-class classification problem whose goal is to predict the different grades of livability. Two commonly used decision forest algorithms were selected for comparison and optimization: Multiclass Decision Jungle and Multiclass Decision Forest. Both generic algorithms work by building multiple decision trees and then voting on the most common output category. The voting process serves as a form of aggregation: each tree in the classification decision forest outputs a non-normalized frequency histogram of the labels, the aggregation process sums these histograms, and the results are normalized to obtain the "probability" of each label. Trees with higher prediction confidence carry greater weight in the final ensemble decision. In general, decision forests are non-parametric models, which means that they support data with different distributions. Within each tree, a series of simple tests is performed, deepening the tree structure until a leaf node (the decision) is reached that best matches the prediction target.
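The 7:3 split and the training of the two ensembles can be sketched as follows. The paper's Multiclass Decision Forest and Multiclass Decision Jungle are cloud back-end implementations; here RandomForestClassifier and ExtraTreesClassifier are used purely as open-source analogues, reusing the dataset and predictor list assumed in the previous sketch.

```python
# Sketch only: 7:3 split and two tree-ensemble classifiers as stand-ins for the paper's algorithms.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X = dataset[predictors]
y = dataset["livability_grade"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

forest = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)
jungle_like = ExtraTreesClassifier(n_estimators=300, random_state=42).fit(X_train, y_train)

# Each tree votes; the ensemble averages per-class frequencies into
# normalized "probabilities" for every livability grade
grade_probabilities = forest.predict_proba(X_test)
```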
The decision forest classifier in this machine learning is an ensemble of decision trees. In general, ensemble models provide better coverage and accuracy than a single decision tree. The specific workflow of this machine learning, with its back-end code running in the cloud, is shown in Fig. 10. After deployment to the cloud, the algorithm still needs to be improved regularly with newly input back-end data; in other words, the data model and algorithm are updated through retraining. The entire life cycle of the machine learning is shown in Fig. 11.
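The retraining step of this life cycle can be sketched as a small routine that appends newly collected records, refits the model, and re-saves it for deployment. The file names, the helper function, and the use of joblib are assumptions for illustration, not the authors' deployment code; the imports and predictor list come from the previous sketches.

```python
# Sketch only: periodic retraining on existing plus newly collected data.
import pandas as pd
import joblib

def retrain(existing_csv: str, new_csv: str, model_path: str):
    data = pd.concat([pd.read_csv(existing_csv), pd.read_csv(new_csv)],
                     ignore_index=True)
    X, y = data[predictors], data["livability_grade"]
    model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X, y)
    joblib.dump(model, model_path)          # persist the refreshed model for redeployment
    return model
```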
4.2 Primary Results of Machine Learning
The confusion matrices of the two algorithms can be extracted from the machine learning back end (Fig. 12). The Multiclass Decision Forest algorithm performed better than the Multiclass Decision Jungle algorithm. The main errors of the Multiclass Decision Jungle were that some communities of livability grades 1–2 were overrated as grades 3–4; it also had lower accuracy in predicting livability grades 4–9. Such errors seldom occurred with the Multiclass Decision Forest, which thus provided better overall performance. In addition, the results showed an overall prediction accuracy of 76% for the Multiclass Decision Jungle and 96% for the Multiclass Decision Forest. As the latter was higher, the Multiclass Decision Forest algorithm was deployed in the production environment in the cloud (Tab. 2).
Tab. 2 A comparison of predictive performances of two different machine learning algorithms
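The kind of comparison reported in Fig. 12 and Tab. 2 can be reproduced in a minimal sketch using the two stand-in ensembles trained above; the confusion matrix and overall accuracy are standard scikit-learn metrics, not the cloud back end used by the authors.

```python
# Sketch only: confusion matrices and overall accuracy for the two trained ensembles.
from sklearn.metrics import confusion_matrix, accuracy_score

for name, model in [("Decision Forest (analogue)", forest),
                    ("Decision Jungle (stand-in)", jungle_like)]:
    pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, pred))          # rows: true grade, columns: predicted grade
    print("overall accuracy:", round(accuracy_score(y_test, pred), 2))
```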
After retraining the machine learning model and predicting the livability of human settlements throughout the Netherlands, livability could be visualized and monitored on a national map (Fig. 13). The darker green areas are more livable, while the red areas are less livable. The figure shows a relatively balanced distribution of the predicted highly livable regions across the country, with relatively dense concentrations in the Randstad megalopolis and in some livable regions in the east near the German border. The regions with relatively low livability are concentrated in the new province of Flevoland, which was formed by land reclamation in recent decades, possibly as a result of low population density and weaker infrastructure and services.
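A national map of this kind can be sketched with geopandas. The shapefile of community boundaries, the join key "community_code", and the red-to-green colormap are assumptions intended only to mimic the legend of Fig. 13, not the authors' visualization toolchain.

```python
# Sketch only: choropleth of predicted livability grades per community (assumed files and keys).
import geopandas as gpd
import pandas as pd

communities = gpd.read_file("nl_communities.shp")          # community boundary polygons (assumed file)
predictions = pd.DataFrame({
    "community_code": dataset["community_code"],
    "predicted_grade": forest.predict(dataset[predictors]),
})
mapped = communities.merge(predictions, on="community_code")

ax = mapped.plot(column="predicted_grade", cmap="RdYlGn",   # red = less livable, green = more livable
                 legend=True, figsize=(8, 10))
ax.set_title("Predicted livability grade per community")
```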
In addition, this prediction tool can perform in-depth research and local prediction on urban clusters at the meso level and on community blocks at the micro level. Fig. 14 shows the prediction of livability in the Greater Rotterdam Area and the Greater Hague Area. The results indicate that livability in the downtown areas of some old cities is not high, while that of the suburbs is generally higher. In particular, the northern suburbs of Rotterdam and the junction with Delft, as well as the coastal areas of northwest The Hague, form relatively dense concentrations of highly livable communities.
5 Conclusion and Prospect
The primary results of this research on predicting the livability of human settlements by machine learning were generated under the fourth paradigm. They show that the AI algorithm was able to deduce directly, from the available data sources and the necessary data engineering, the top 10 factors affecting the livability of human settlements. These 10 factors were simplified as: the ratio of the population receiving social relief, the ratio of non-Western immigrants, the ratio of government low-rent housing, the average market price of owner-occupied housing, the ratio of high-income people, the number of households with fixed incomes, the number of new houses, the overall crime rate, the annual consumption of natural gas, and the annual consumption of electricity. Furthermore, the variables can be updated with the latest datasets, and the machine learning model improved by retraining can perform live predictions of environmental livability.
The results of this research can be applied in the four stages of livability analysis (Fig. 15), leading to the monitoring, diagnosis, prediction, and early intervention of livability in human settlements based on timely updated big data and retrained algorithms.
Comparing this research, which was based on the fourth paradigm, with the aforementioned traditional paradigms, we find that effective knowledge discovery and high-accuracy prediction models could be obtained without relying heavily on traditional AI expert systems or long-term studies by professional researchers. The dominant factors identified were basically consistent with the relevant qualitative research on livability by RIVM and RIGO, whether locally or globally focused. Furthermore, the research was able to rank the most decisive factors quantitatively for forecasting, making scientific research more efficient, faster, more foreseeable, and more driven by live data.
In this research, the land-use cluster in the available datasets was relatively small, resulting in a sharp cone diagram. This shortcoming should be overcome in future research by collecting more land-use-related variables, carrying out further training, and broadening the observational horizon of the prediction model.
In addition, a greater amount of data, greater processing capability, greater computing power, and a more complex computing environment will be required to collect and process large amounts of real-time data. With new technologies such as 5G, IoT, and quantum computing, we will be able to gather even more complex and unstructured real-time data to expand the current research, giving it broader prospects for applications in the Smart City.
Sources of Figures and Tables:
Fig. 1 © reference [12]; Fig. 2 © reference [27]; Fig. 3, 5, 6, 8–14 © Wu Jun; Fig. 4 © reference [31]; Fig. 7 © reference [33]; Fig. 15 was drawn by Wu Jun according to Gartner concepts; Tab. 1–2 © Wu Jun.