熊富蕊 桑應(yīng)朋
摘要:主要針對關(guān)聯(lián)規(guī)則中的Apriori算法在執(zhí)行過程中產(chǎn)生大量候選集和重復(fù)掃描數(shù)據(jù)庫會使串行算法的時(shí)間復(fù)雜度高和執(zhí)行率低的缺點(diǎn),提出了一種基于MapReduce的并行關(guān)聯(lián)規(guī)則挖掘算法,對傳統(tǒng)的關(guān)聯(lián)規(guī)則算法進(jìn)行并行優(yōu)化。在Hadoop平臺上進(jìn)行了單機(jī)測試和集群測試。分析得出基于MapReduce的關(guān)聯(lián)規(guī)則算法克服了串行算法產(chǎn)生大量候選集和重復(fù)掃描數(shù)據(jù)庫的兩大缺點(diǎn),從而提高了執(zhí)行效率。此外,針對目前數(shù)據(jù)隱私泄露的嚴(yán)重現(xiàn)象,在進(jìn)行并行化的數(shù)據(jù)挖掘之前,使用加隨機(jī)數(shù)的方法來保護(hù)數(shù)據(jù)的隱私,并驗(yàn)證了保護(hù)數(shù)據(jù)在關(guān)聯(lián)規(guī)則挖掘中保留了高可用性。
關(guān)鍵字:MapReduce;Apriori;Hadoop;隱私保護(hù)
中圖分類號:TP391 文獻(xiàn)標(biāo)識碼 A文章編號:2095-2163(2015)06-
Abstract:The main disadvantage of Apriori algorithm for association rule mining is that it produces large amounts of candidate sets and database scannings during execution, making the serial algorithm having high time complexity and low implementation rate. This paper proposes a new algorithm based on MapReduce,which optimizes the traditional association rule algorithm in a parallel way.Simulations based on the hadoop platform are performed on one single machine and clusters. The results demonstrate that our new algorithm based on MapReduce overcomes the disadvantage of the serial algorithm.Whats more, considering the serious phenomenon of data privacy leaking, the paper uses randomization to protect data privacy before they are mined, and shows the randomized data keep a high utiltity for association rule mining.
Keywords: MapReduce;Apriori;Hadoop;Privacy_Preserving
0 引言
隨著各行各業(yè)的快速發(fā)展,大量的數(shù)據(jù)開始出現(xiàn)和累積。然而,如何從這些數(shù)據(jù)中,提取出所需要的有用信息,則已成為時(shí)下研究關(guān)注、且矚目的一個(gè)焦點(diǎn)問題。作為一個(gè)分析工具,數(shù)據(jù)挖掘可以從大量數(shù)據(jù)集中發(fā)現(xiàn)有趣、有用的信息?,F(xiàn)如今數(shù)據(jù)挖掘的技術(shù)已經(jīng)開始用于商業(yè)用途,藉此提高商業(yè)價(jià)值。數(shù)據(jù)挖掘主要分為三大領(lǐng)域:分類分析、聚類分析和關(guān)聯(lián)規(guī)則分析。尤其是,關(guān)聯(lián)規(guī)則分析已經(jīng)獲得了數(shù)據(jù)挖掘中比較重要的領(lǐng)域地址,具體實(shí)現(xiàn)主要分為兩步:發(fā)現(xiàn)頻繁項(xiàng)目集和生成關(guān)聯(lián)規(guī)則。關(guān)聯(lián)規(guī)則中的一種基本算法就是Apriori算法。該算法在執(zhí)行過程中會產(chǎn)生大量的候選集,并且還要多次重復(fù)掃描數(shù)據(jù)庫。由此可知,隨著數(shù)據(jù)的逐漸增加,有如Apriori這樣的傳統(tǒng)挖掘算法已經(jīng)不能快速有效地分析獲取有用的信息?;谝陨媳尘八觯疚奶岢隽艘环N新的基于MapReduce的并行關(guān)聯(lián)規(guī)則的算法。此外,隨著挖掘技術(shù)的不斷進(jìn)步,使得一些敏感、有用的信息相繼公開,這就增加了原始數(shù)據(jù)的風(fēng)險(xiǎn)性。因此,基于隱私保護(hù)[1-2]的需求,本文即在數(shù)據(jù)挖掘前使用了添加隨機(jī)數(shù)的方法,從而實(shí)現(xiàn)對數(shù)據(jù)隱私的保護(hù)。
1 國內(nèi)外研究現(xiàn)狀
現(xiàn)在國內(nèi)外已經(jīng)應(yīng)用很多方法解決關(guān)聯(lián)規(guī)則挖掘算法[3-6]。隨著大量數(shù)據(jù)的產(chǎn)生,各種并行關(guān)聯(lián)規(guī)則算法也隨即陸續(xù)提出。例如:CD (Count Distribution) CaD(Candidate Distribution) and DD (Data Distribution)[7-8],這些算法可以運(yùn)用于云計(jì)算環(huán)境。但是算法卻都具有缺乏同步性的缺點(diǎn)。
文獻(xiàn)[9]針對Apriori算法產(chǎn)生大量候選項(xiàng)集的缺點(diǎn),提出了一種頻繁模式算法(FP),是一種不用生成候選項(xiàng)目集,實(shí)現(xiàn)Apriori的算法??梢钥焖贅?gòu)建挖掘模型,是一種高效的關(guān)聯(lián)規(guī)則算法。
文獻(xiàn)[10-13]提出了基于MapReduce的Apriori并行算法,利用Hadoop平臺的多個(gè)節(jié)點(diǎn)并行運(yùn)算的Apriori算法??梢杂行У貙?shí)施并行化,但仍會產(chǎn)生大量候選項(xiàng)集。
2 基礎(chǔ)理論概述
2.1關(guān)聯(lián)規(guī)則挖掘和Apriori算法
設(shè)I={i1,i2…,im}是項(xiàng)的集合。給定一個(gè)交易數(shù)據(jù)庫D,其中每個(gè)事務(wù)T是I的非空子集,即,每一個(gè)交易都與一個(gè)唯一的標(biāo)識符TID(Transaction ID)對應(yīng)。關(guān)聯(lián)規(guī)則在D中的支持度(support)是D中事務(wù)同時(shí)包含A、B的百分比,即P(A∪B);置信度(confidence)是D中事務(wù)已經(jīng)包含A的情況下,包含B的百分比,即P(B|A)。如果滿足最小支持度閾值和最小置信度閾值,則認(rèn)為關(guān)聯(lián)規(guī)則是有趣的[14]。關(guān)聯(lián)規(guī)則的挖掘分為兩個(gè)步驟:
(1)找出所有滿足最小支持度的頻繁項(xiàng)集;
(2)由頻繁項(xiàng)集產(chǎn)生滿足最少支持度和最少置信度的強(qiáng)關(guān)聯(lián)規(guī)則。
Apriori算法是最具影響的一類關(guān)聯(lián)規(guī)則挖掘算法,使用一種逐層搜索的迭代方法。其具體實(shí)現(xiàn)過程為:首先,找出1-項(xiàng)集的集合(L1),用L1去找頻繁2項(xiàng)集L2。依次下去,直至不能找到頻繁k項(xiàng)集。每尋獲一個(gè)LK都需要掃描一次數(shù)據(jù)庫。所以,需要多次重復(fù)掃描數(shù)據(jù)庫導(dǎo)致時(shí)間復(fù)雜度較高和執(zhí)行效率降低。偽代碼如下[15]:
輸入:交易數(shù)據(jù)庫D,最小支持度min_sup。
輸出:頻繁集L
L1=find_frequent_1_itemset(D);//產(chǎn)生1-頻繁集
for(k=2;Lk-1!=?;k++){
Ck=apriori_gen(Lk-1,min_sup);//產(chǎn)生k-候選集
for each transaction t in D{
Ct=subset(Ck,t);//獲得事務(wù)t包含的候選項(xiàng)集
for each candidate c in Ct
c.count++; }
Lk={c∈Ck|c.count>=min_sup}; }
L=?kLk;
Procedure apriori_gen(Lk-1:frequent(k-1)-itemset;min_sup:support)
Foreach itemset l1 in Lk-1
Foreach itemset l2 in Lk-1
If( (l1 [1]=l2 [1])&&( l1 [2]=l2 [2])&& …&&(l1 [k-1]then{ c = l1 link l2 // 連接步:產(chǎn)生候選
if has_infrequent_subset(c, Lk-1) then
delete c; // 剪枝步:刪除非頻繁候選
else add c to Ck; }
Return Ck;
Procedure has_infrequent_subset(c:candidatek-itemset;
Lk-1 :frequent(k-1)-itemsets)
For each (k-1)-subset s of c
If s?Lk-1 then
Return true;
Return false;
2.2Hadoop平臺
Hadoop是一個(gè)開發(fā)和運(yùn)行處理大規(guī)模數(shù)據(jù)的軟件平臺,是Appach的一個(gè)基于java語言推出的開源軟件框架,可以實(shí)現(xiàn)在大量計(jì)算機(jī)組成的集群中對海量數(shù)據(jù)進(jìn)行分布式計(jì)算[16]。Hadoop的框架最核心的設(shè)計(jì)就是:HDFS和MapReduce。在此,對其分述如下。
HDFS(Hadoop Distributed File System)是一個(gè)分布式文件系統(tǒng),由一個(gè)名稱節(jié)點(diǎn)(NameNode)和N個(gè)數(shù)據(jù)節(jié)點(diǎn)(DataNode)組成,每個(gè)節(jié)點(diǎn)均是一臺普通的計(jì)算機(jī)。NameNode 是一個(gè)通常在 HDFS 實(shí)例中的單獨(dú)機(jī)器上運(yùn)行的軟件,具體負(fù)責(zé)管理文件系統(tǒng)名稱空間和控制外部客戶機(jī)的訪問。NameNode 決定是否將文件映射到 DataNode 上的復(fù)制塊上。Hadoop 集群包含一個(gè) NameNode 和大量 DataNode。DataNode 響應(yīng)來自 HDFS 客戶機(jī)的讀寫請求。同時(shí)還響應(yīng)來自 NameNode 的創(chuàng)建、刪除和復(fù)制塊的命令[17]。
MapReduce是一種并行編程模型,其基于HDFS拓展實(shí)現(xiàn),可用于大規(guī)模數(shù)據(jù)集的并行計(jì)算。MapReduce編程模型的原理是:MapReduce 以函數(shù)方式提供了 Map 和 Reduce 來進(jìn)行分布式計(jì)算。Map 相對獨(dú)立且并行運(yùn)行,對存儲系統(tǒng)中的文件按行處理,處理后產(chǎn)生鍵值(key/value)對。Reduce則以 Map 的輸出作為輸入,相同鍵值的記錄匯聚到同一 reduce,再對這組記錄進(jìn)行操作,相應(yīng)將產(chǎn)生新的key/value對集合[18]。
3 基于Mapruduce的隱私保護(hù)的關(guān)聯(lián)規(guī)則挖掘算法
該算法的思想是:首先通過加隨機(jī)數(shù)的方式對原始的數(shù)據(jù)D進(jìn)行保護(hù),得到變換后的數(shù)據(jù)D1。然后使用mapreduce的分布式編程原理,將原始數(shù)據(jù)集按行劃分為多個(gè)小數(shù)據(jù)集,分布到三個(gè)節(jié)點(diǎn)上。稍后每個(gè)節(jié)點(diǎn)對輸入的數(shù)據(jù)塊分配一個(gè)map,通過map函數(shù)執(zhí)行,并采用一種改進(jìn)的Apriori算法計(jì)算滿足最小支持度的候選項(xiàng)目集。最后使用reduce函數(shù)獲得頻繁項(xiàng)目集,再將得到的數(shù)據(jù)集存放在HDFS上。主要實(shí)現(xiàn)步驟為:
(1)在每個(gè)節(jié)點(diǎn)上對原始數(shù)據(jù)D添加隨機(jī)數(shù)變成D1。
(2)第一階段:
在每個(gè)節(jié)點(diǎn)上用map函數(shù)對數(shù)據(jù)進(jìn)行操作,分割的整個(gè)數(shù)據(jù)集作為輸入,產(chǎn)生鍵值對
Struct hashnode{
Map mapnode;
Boolean isleafnode;
List items;
}
在datanode上用reduce函數(shù)對map輸出結(jié)果進(jìn)行整合。將結(jié)果返回給namenode節(jié)點(diǎn)。Reduce函數(shù)輸出結(jié)果為key/value對,key代表頻繁1項(xiàng)集元素,value為出現(xiàn)該鍵值的次數(shù)。并且將結(jié)果存儲在HDFS中。生成1頻繁項(xiàng)目集。
(2)第二階段:直接從HDFS中取出數(shù)據(jù),此時(shí)的數(shù)據(jù)已經(jīng)不是最初數(shù)據(jù)庫中的原始數(shù)據(jù),而是前一階段產(chǎn)生的1頻繁項(xiàng)目集,這樣就大大的縮減了候選項(xiàng)集的數(shù)目。此后再通過map/reduce函數(shù)重復(fù)以上的步驟,依次找到2頻繁項(xiàng)目集、k頻繁項(xiàng)目集。
(1)-(3)偽代碼如下:
Map task:
輸入:改變后分割數(shù)據(jù)D2
輸出:對,key代表頻繁項(xiàng)目集的元素
map(object, D2)
C=G-Apriori(D2)
Foreach itemset item in C
Out(item,one);
End foreach
End map
End
Reduce task
輸入:
參考文獻(xiàn):
[1]ZHU Jianming, ZHANG Ning, LI Zhanyu Li. A new privacy preserving association rule mining algorithm based on hybrid partial hiding strategy[J]. Bulgarian Acadmeny Of Sciences,2013,13(special issue):41-50.
[2]AGRAWAL D, AGGARWAL C C. On the design and quantification of privacy preserving data mining algorithms[C]// PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of databases,New York:ACM,2002:247-255.
[3]TOIVONEN H.Sampling large databases for association rules[C]// Intl.Conf.on Very Large Databases,Bombay,India:ACM,1996:134–145.
[4]SAVASERE A ,OMIECINSKI E , SHAMKANT B. Et al. An efficient algorithm for Mining Association Rules in Large Databases[C]// Proceedings of the 21th International Conference on Very Large Data Bases, San Francisco:Morgan Kaufmann Publishers Inc,1995 :432-444.
[5]FANG Min , SHIVAKUMAR N ,GARCIA-MOLINA H ,et al. Ullman, computing iceberg queries efficiently[C]//Proceedings of the 24rd International Conference on Very Large Data Bases,New York:Morgan Kaufmann,1998:299-310.
[6]QURESHI Z,BANSAL J, BANSAL S. A survey on association rule mining in Cloud Computing[J].International Journal of Emerging Technology and Advanced Engineering,2013,3(4):2250-2459.
[7]ASHRAFI M Z, TANIAR D, SMITH K. ODAM: An optimized distributed association rule mining algorithm[J]. Distributed Systems Online IEEE, 2004, 5(3):1.
[8]LI L, ZHANG M. The strategy of mining association rule based on Cloud Computing[J]. IEEE, 2011, 52(1391):475-478.
[9] SANJEEV R, GUPTA P. Implementing improved algorithm
over APRIORI data mining association rule algorithm[J].IJCST,2012,3(1):2250-2459.
[10] CHEN SY, LI Jiahong Li, LIN Kechung Lin, et al.
Using MapReduce framework for mining association rules[C]//
Springer Science+Business Media Dordrecht,TaiWan:NSC,2013:723-731.
[11] RIONDATO M, DEBRABANT J A, FONSECA R, et al. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce[C]// Proceedings of the 21st ACM international conference on Information and knowledge management,New York:ACM, 2012:85-94.
[12] YANG Xinyue, LIU Zhen, FU Yan. MapReduce as a programming model for 1ssociation rules algorithm on hadoop[C]// 3rd International Conference on Information Sciences and Interaction Sciences. Chengdu: IEEE,2010:99-102.
[13]LI N, ZENG L, HE Q, et al. Parallel implementation of Apriori Algorithm based on MapReduce[C]// Proceedings of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and ParallelDistributed Computing. USA:IEEE Computer Society,2012:236-241.
[14]LIN Xueyan. MR-Apriori:Association rules algorithm based on MapReduce[A]. IEEE Beijing Section.Proceedings of 2014 IEEE 5th International Conference on Software Engineering and Service Science[C].Beijing:IEEE Beijing Section,2014:4.
[15]ZHANG CS, LI ZY, ZHENG DS. An improved algorithm for Apriori. Fourth[C]// Proceedings of the 1st International Workshop on Education Technology and Computer Science, ETCS 2009,v1,Wuhan:IEEE,2009:995-998.
[16]孫趙旭,謝曉蘭,周國清,等. 基于Hadoop的Apriori算法與實(shí)現(xiàn)[J]. 桂林理工大學(xué)學(xué)報(bào),2014(3):584-588.
[17]郝曉飛,譚躍生,王靜宇. Hadoop平臺上Apriori算法并行化研究與實(shí)現(xiàn)[J]. 計(jì)算機(jī)與現(xiàn)代化,2013(3):1-4+8.
[18]LIN M Y, LEE P Y, HSUEH S C. Apriori-based frequent itemset mining algorithms on MapReduce[C]// Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication,New York:ACM,2012:20-22.