HU Xufei, XU Yunfeng
Abstract: To study the application of network representation learning to link prediction in social networks, this paper proposes BDLINE, a link prediction model based on backbone degree and network encoding. The model builds on the network representation learning algorithm LINE by incorporating the backbone degree algorithm: backbone weights are added to the first-order and second-order proximities, the network is encoded into a multi-dimensional vector space, and the parameters are tuned to their optimal values. Experiments were conducted on two real-world datasets, with multiple runs for each algorithm model. The results show that BDLINE outperforms the other network representation learning algorithms on link prediction, achieving higher AUC scores and better predictions. The proposed method thus makes it convenient to extract network feature information, copes better with the randomness of social networks in link prediction, and provides a useful reference for predicting the relevance and effectiveness of network nodes in social networks.
Keywords: computer network; network representation learning; link prediction; social network; similarity
CLC number: TP391   Document code: A
The real world contains a large number of complex social networks, and representing network data effectively and reasonably is an important challenge in current research. Network representation learning emerged to meet this challenge and is a fundamental method for handling large-scale social network problems [1-4], with a wide range of application scenarios such as visualization [5], node classification [6], and link prediction [7]. Network representation learning is a network encoding process: according to the structural role each vertex plays in the network, it maps the vertices into a multi-dimensional space through unsupervised learning. Intuitively, vertices with similar topological structure [8] in the network are represented by vectors with a closer relationship; this similarity between vectors is usually measured by the cosine distance or the Euclidean distance [9].
In recent years, network representation learning has become a focus of the academic community, and link prediction in social networks in particular has attracted wide attention from scholars at home and abroad. Link prediction can identify links in an evolving social network that may exist but have not yet been established, thereby providing information to users. Real-world networks contain a very large number of vertices, so computing link probabilities with algorithms that consider the global structure is very difficult. Early work focused on direct similarity measures between vertex pairs [10]. Recent network representation learning algorithms, including node2vec [11], DeepWalk [12], MMB [13], and LINE [14], can learn the latent structural features of each vertex and represent the network in a multi-dimensional space. The network representation learning algorithm proposed in this paper, BDLINE, incorporates the backbone degree [15] into LINE, readjusts the network parameter model, and improves the similarity of the network in the vector space. Finally, the algorithms are compared in link prediction experiments; the results show that the proposed algorithm achieves the highest AUC score and the best prediction performance.
1 Link Prediction Based on Backbone Degree and Network Encoding
To process the data in a social network, the network must first be clearly defined. Define an information network G=(V,E), where V is the set of vertices and E is the set of edges between vertices; an edge e=(u,v)∈E denotes an edge from u to v. The BDLINE model is built on LINE, whose algorithm uses first-order and second-order proximities to encode large-scale network information.
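As a minimal sketch of the two proximities that LINE defines (the published LINE formulation, not the BDLINE implementation itself; BDLINE additionally scales these by backbone weights, which are omitted here), the first-order joint probability of two vertices is a sigmoid of the dot product of their embeddings, and the second-order conditional probability is a softmax over context vectors:

```python
import numpy as np

def first_order_proximity(u_i, u_j):
    """LINE first-order joint probability p1(v_i, v_j) = sigmoid(u_i . u_j)."""
    return 1.0 / (1.0 + np.exp(-np.dot(u_i, u_j)))

def second_order_proximity(u_i, context_vectors, j):
    """LINE second-order conditional probability p2(v_j | v_i):
    a softmax of u_i against the context vectors of all vertices."""
    scores = context_vectors @ u_i
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return exp_scores[j] / exp_scores.sum()
```

In practice LINE optimizes these objectives with negative sampling rather than computing the full softmax, which is what makes it feasible on large networks.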
After all the edges have been evaluated, the AUC value is obtained. The AUC value should be at least greater than 0.5 and at most 1; the higher the AUC value, the more accurate the algorithm's predictions and the better its performance.
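The AUC evaluation described above can be sketched with the standard sampling-based definition used in link prediction: repeatedly compare the score of a random test edge against a random nonexistent edge, and count wins and ties. This is a generic illustration, not necessarily the paper's exact implementation; `test_scores` and `nonexistent_scores` are hypothetical score lists produced by some similarity model.

```python
import random

def auc_score(test_scores, nonexistent_scores, n_samples=10000, seed=42):
    """Sampling-based link-prediction AUC = (n' + 0.5 * n'') / n, where over
    n comparisons the test edge scores higher n' times and ties n'' times."""
    rng = random.Random(seed)
    wins = ties = 0
    for _ in range(n_samples):
        s_test = rng.choice(test_scores)
        s_none = rng.choice(nonexistent_scores)
        if s_test > s_none:
            wins += 1
        elif s_test == s_none:
            ties += 1
    return (wins + 0.5 * ties) / n_samples
```

A model that always ranks true edges above nonexistent ones reaches 1.0, while a random scorer converges to 0.5, matching the bounds stated above.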
2.3 Experimental Environment and Parameters
Experimental environment: Linux operating system; all algorithm models were implemented in Python; the proposed BDLINE and LINE use the TensorFlow machine learning framework.
Parameter settings: for BDLINE, the embedding size is 128, which performed best in our experiments; the number of negative samples is 5, epoch = 10, the number of threads is 8, and the learning rate lr = 0.001.
2.4 Experimental Results
The proposed BDLINE model was tested on the two datasets. To present the prediction performance more clearly, the proportion of the real dataset used for training was increased step by step, with the edges selected at random each time. Each algorithm model was tested 11 times, and the test results were averaged. All experimental results are shown in Table 2 and Table 3.
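The random train/test split described above can be sketched as follows; the function name and interface are illustrative, not taken from the paper's code:

```python
import random

def split_edges(edges, train_ratio, seed=0):
    """Randomly partition an edge list into training and test sets,
    keeping `train_ratio` of the edges for training."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    k = int(len(shuffled) * train_ratio)
    return shuffled[:k], shuffled[k:]
```

Running the split with a different seed for each of the 11 repetitions and averaging the resulting AUC values reduces the variance introduced by the random edge selection.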
As can be seen, when only 15% of the edges are kept for training, the AUC values are low and all methods perform poorly, since most vertices are isolated and carry little information. As the proportion of the training set grows and more edges are used for training, the AUC values rise. A vertical comparison shows that BDLINE improves the AUC value over the other algorithms, which demonstrates its effectiveness in the link prediction task and verifies its ability to model the relationships between vertices accurately. By introducing the backbone degree, the learned network awareness improves considerably over LINE: a given vertex should play different roles when interacting with different vertices, which benefits the link prediction task.
3 Conclusions
1) To address the problems of link prediction in existing social networks, a link prediction model based on backbone degree and network encoding, BDLINE, is proposed, introducing the backbone degree into the LINE algorithm for the first time. Experiments show that BDLINE produces more accurate predictions than many existing network representation learning algorithms.
2) The BDLINE network representation learning algorithm can learn a high-quality encoding of the network structure, which helps to estimate the relationships between vertices accurately and improves inter-vertex similarity, providing a useful reference for predicting the relevance and effectiveness of network nodes in social networks.
3) Although the experiments achieved good results, shortcomings remain: for large-scale complex networks with multiple attributes, the processing time complexity is high, and the test results are too random and not stable enough; these issues will be optimized in future work. In future research, we will try adversarial learning mechanisms to handle networks with complex attributes, and use deep learning techniques to further refine the backbone degree weights.
References:
[1] AHMED A, SHERVASHIDZE N, NARAYANAMURTHY S, et al. Distributed large-scale natural graph factorization[C]// Proceedings of the 22nd International Conference on World Wide Web. New York: ACM, 2013: 37-48.
[2] LEVY O, GOLDBERG Y. Neural word embedding as implicit matrix factorization[J]. Advances in Neural Information Processing Systems, 2014(3): 2177-2185.
[3] LE Q, MIKOLOV T. Distributed representations of sentences and documents[C]// Proceedings of the 31st International Conference on International Conference on Machine Learning. Beijing: [s.n.], 2014: 1188-1196.
[4] CHANG Shiyu, HAN Wei, TANG Jiliang, et al. Heterogeneous network embedding via deep architectures[C]// Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2015: 119-128.
[5] TANG Jian, LIU Jingzhou, ZHANG Ming, et al. Visualizing large-scale and high-dimensional data[C]// Proceedings of the 25th International Conference on World Wide Web. [S.l.: s.n.], 2016. DOI: 10.1145/2872427.2883041.
[6] BHAGAT S, CORMODE G, MUTHUKRISHNAN S. Node classification in social networks[M]// Social Network Data Analytics. [S.l.]: Springer, 2011: 115-148.
[7] LIBEN-NOWELL D, KLEINBERG J. The Link-Prediction Problem for Social Networks[M]. [S.l.]: John Wiley & Sons, 2007.
[8] TANAY B, KANDEMIR M B. Topological structure of fuzzy soft sets[J]. Computers & Mathematics with Applications, 2011, 61(10): 2952-2957.
[9] TU Cunchao, LIU Han, LIU Zhiyuan, et al. CANE: Context-aware network embedding for relation modeling[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver: Association for Computational Linguistics, 2017: 1722-1731.
[10] LÜ Linyuan, ZHOU Tao. Link prediction in complex networks: A survey[J]. Physica A: Statistical Mechanics and its Applications, 2011, 390(6): 1150-1170.
[11] GROVER A, LESKOVEC J. node2vec: Scalable feature learning for networks[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 855-864.
[12] PEROZZI B, AL-RFOU R, SKIENA S. DeepWalk: Online learning of social representations[C]// Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014: 701-710.
[13] AIROLDI E M, BLEI D M, FIENBERG S E, et al. Mixed membership stochastic blockmodels[J]. The Journal of Machine Learning Research, 2008, 9: 1981-2014.
[14] TANG Jian, QU Meng, WANG Mingzhe, et al. LINE: Large-scale information network embedding[C]// Proceedings of the 24th International Conference on World Wide Web. Florence: [s.n.], 2015: 1067-1077.
[15] XU Yunfeng, XU Hua, ZHANG Dongwen. A novel disjoint community detection algorithm for social networks based on backbone degree and expansion[J]. Expert Systems with Applications, 2015, 42(21): 8349-8360.
[16] TANG Jie, ZHANG Jing, YAO Limin, et al. ArnetMiner: Extraction and mining of academic social networks[C]// Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2008: 990-998.