Shengmei Luo, Qing He, Lixia Liu, Xiang Ao, Ning Li, Fuzhen Zhuang
(1. Pre-Research Department of ZTE, Nanjing 210012, China;
2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
3. Graduate University of Chinese Academy of Sciences, Beijing 100190, China)
Abstract: Traditional machine-learning algorithms struggle to handle the exceedingly large amount of data generated by the internet. In real-world applications, there is an urgent need for machine-learning algorithms that can handle large-scale, high-dimensional text data. Cloud computing involves the delivery of computing and storage as a service to a heterogeneous community of recipients, and it has recently aroused much interest in industry and academia. Most previous work on cloud platforms focuses only on parallel algorithms for structured data. In this paper, we focus on the parallel implementation of web-mining algorithms and develop a parallel web-mining system comprising four subsystems: parallel web crawler; parallel text extract, transform and load (ETL) and modeling; parallel text mining; and application. The complete system enables a variety of real-world web-mining applications on massive data.
Keywords: web mining; large scale; high volume; high dimension; cloud computing

This work is supported by the National Natural Science Foundation of China (No. 61175052, 60975039, 61203297, 60933004, 61035003) and the National High-tech R&D Program of China (863 Program) (No. 2012AA011003).
Hadoop 1.x is a large-scale data-processing platform for cloud computing. At the core of Hadoop is MapReduce, which provides users with a new parallel programming model for the distributed environment [1]. It allows users to benefit from the advanced features of distributed computing without having to write any code to coordinate tasks in the distributed environment. Recently, many machine-learning algorithms have been parallelized with MapReduce [2]-[8]. Chu et al. developed a broadly applicable parallel programming method that can easily be applied to many different learning algorithms [2]. They showed that algorithms fitting the statistical query model can be written in "summation form," which allows them to be easily parallelized on multicore computers. He et al. proposed several parallel classification algorithms for structured data, including k-nearest neighbors, naive Bayes, and decision tree [3]. Experimental results show the efficiency of these parallel methods for handling large data sets. Zhao et al. proposed a parallel k-means clustering algorithm based on MapReduce that is scalable and can efficiently process large data sets on commercially available hardware [4]. He et al. provided a parallel incremental learning algorithm for ESVM (PIESVM) that can solve large-scale and online problems.
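To make the "summation form" idea concrete, the following is a minimal, hypothetical sketch (not taken from [2]) of computing a global mean from per-partition partial sums; this is exactly the kind of aggregation that MapReduce parallelizes naturally.

import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "summation form": each worker computes a
// partial (sum, count) over its own data partition; the partials are then
// combined to obtain the global mean. In MapReduce, mappers would emit the
// partials and a single reducer would combine them.
public class SummationFormExample {
    static double[] partial(List<Double> partition) {
        double sum = 0.0;
        for (double x : partition) {
            sum += x;
        }
        return new double[] {sum, partition.size()};
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0, 3.0),   // data held by worker 1
                Arrays.asList(4.0, 5.0));       // data held by worker 2
        double sum = 0.0, count = 0.0;
        for (List<Double> p : partitions) {     // "reduce" step: add up the partials
            double[] sc = partial(p);
            sum += sc[0];
            count += sc[1];
        }
        System.out.println("mean = " + (sum / count)); // prints 3.0
    }
}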
Previous works mainly focus on the parallel implementation of machine-learning algorithms for structured data. However, there is a large amount of unstructured data on the internet and only a few parallel text-mining algorithms [9]-[10]. Elsayed et al. proposed a MapReduce algorithm for computing pairwise document similarity in large document collections [10]. Experiments on a collection of approximately 900,000 newswire articles show that the algorithm's running time and space grow linearly with the number of articles processed. Zhang et al. introduced a novel probabilistic generative model, MicroBlog-Latent Dirichlet Allocation (MB-LDA), which takes both contactor relevance and document relevance into consideration in order to improve topic mining in microblogs. They also developed a distributed MB-LDA in the MapReduce framework to process large-scale microblogs with high scalability. However, existing learning algorithms are still impractical for web mining simply because of the explosion in the amount of information on the internet. In this work, we concentrate on various web-mining algorithms and propose parallel designs based on MapReduce. Finally, we develop a parallel web-mining system that includes web crawling, webpage parsing, text data preprocessing, and text data mining.
In section 2, we introduce techniques related to our proposal. In section 3, we introduce the system architecture and its implementation. In section 4, we give experimental results. Section 5 concludes the paper.
Hadoop allows programmers to easily develop and run applications that process massive data. The core of Hadoop mainly consists of the Hadoop distributed file system (HDFS), MapReduce, and HBase, which can be used as a data source for a MapReduce job.
HDFS is inspired by the Google File System (GFS), which uses large-scale clusters to store large amounts of data [11]. Despite being similar to GFS, HDFS has some differences. First, the structure of HDFS is designed for high fault tolerance. Furthermore, HDFS can be deployed on low-cost hardware. A third advantage of HDFS is that it provides high throughput for data-accessing applications. Therefore, HDFS is suitable for applications with large data sets. Fig. 1 shows the HDFS architecture. Data files are stored as blocks of the same size on clusters of DataNodes, and each block has replicas to ensure fault tolerance. The NameNode keeps track of all the file blocks and periodically receives heartbeat and block-report messages from the DataNodes.
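For reference, a minimal sketch of writing and reading a file through the HDFS Java API is shown below; the path and contents are invented for illustration, and the cluster address is assumed to come from the standard configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS client sketch: the NameNode address is taken from the cluster
// configuration (core-site.xml); block placement and replication are handled by HDFS.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/example.txt");   // hypothetical path
        FSDataOutputStream out = fs.create(path);   // write a small file
        out.writeUTF("hello HDFS");
        out.close();

        FSDataInputStream in = fs.open(path);       // read it back
        System.out.println(in.readUTF());
        in.close();
    }
}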
MapReduce provides a convenient programming model and an associated implementation for processing and generating large data sets using two user-specified functions: map and reduce.
The main characteristic of MapReduce is that it provides a simple but powerful interface for automatic parallelization and distribution of large-scale computation. A simple word-count task illustrates the main idea of MapReduce:
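A minimal sketch of this word-count task in Hadoop's Java MapReduce API, with illustrative class names (the driver code that submits the job is omitted):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: the mapper emits (word, 1) for every token of its input split;
// the reducer sums the counts for each word.
public class WordCount {
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                     // add up the partial counts
            }
            context.write(key, new IntWritable(sum));
        }
    }
}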
In the map function, we parse the input line into words and emit each word with a count of one; in the reduce function, the counts for each word are summed to obtain its total frequency.
HBase is an important Apache Hadoop-based project modeled on Google's Bigtable [12]. It builds a distributed, fault-tolerant, scalable database on top of HDFS and offers random, real-time read/write access to data. Each HBase table is stored as a multidimensional sparse map with rows and columns, and each cell has a timestamp. HBase has its own Java client API, and tables in HBase can be used both as an input source and an output target for MapReduce jobs through TableInputFormat and TableOutputFormat. All access to tables is by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate.
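As an illustration of this integration, a MapReduce job might read an HBase table through TableInputFormat roughly as follows; the table name ("pages"), column family ("meta"), qualifier ("hdfsPath"), and output path are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reads rows of a hypothetical "pages" table and emits the stored HDFS path of
// each page, showing how TableInputFormat wires an HBase table into MapReduce.
public class HBaseInputExample {
    static class PageMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] path = value.getValue(Bytes.toBytes("meta"), Bytes.toBytes("hdfsPath"));
            if (path != null) {
                context.write(new Text(row.get()), new Text(path));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-input-example");
        job.setJarByClass(HBaseInputExample.class);
        TableMapReduceUtil.initTableMapperJob(
                "pages", new Scan(), PageMapper.class, Text.class, Text.class, job);
        job.setNumReduceTasks(0);                                  // map-only job
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase-example-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}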
A table is made up of regions. Each region is defined by a startKey and an endKey and may live on a different node. A region is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added to tables on the fly, with only the parent column families being fixed in the schema. Each cell is tagged with a column family and a column name, so programs can always identify what type of data a given cell contains.
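For example, writing and reading back a single cell with the pre-1.0 HBase Java client API of that era might look like the following sketch; the table, row key, and column names are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one cell (column family "meta", qualifier "title") and reads it back.
public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "pages");          // hypothetical table

        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("title"), Bytes.toBytes("Example page"));
        table.put(put);

        Get get = new Get(Bytes.toBytes("row-0001"));
        Result result = table.get(get);
        byte[] title = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("title"));
        System.out.println(Bytes.toString(title));

        table.close();
    }
}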
▲Figure 1. HDFS architecture.
Our system is designed to mine web information in parallel, and we develop parallel algorithms based on MapReduce for this purpose. Fig. 2 shows the system architecture.

▲Figure 2. System architecture of the parallel web-mining system based on a cloud platform.
The proposed system includes four subsystems: parallel web crawler; parallel text extract, transform and load (ETL) and modeling; parallel text mining; and application subsystems.
The parallel text data collection subsystem crawls web pages from the internet and extracts their content. Here, we focus on extracting text information.
3.1.1 Parallel Web Crawler
Intelligent Spider (HISpider) is based on Hadoop and is designed to browse the internet in a methodical, automated way. It supports site mode, keywords mode, and breakpoint transmission mode. In site mode, a user-specified URL list is taken as the seed, and the seed pages and the hyperlinks within them are downloaded hierarchically according to their link structure. In keywords mode, the module queries Baidu.com, Bing.com, and Sogou.com with user-specified keywords stored in a keywords list file and then intelligently integrates the returned pages to generate an initial URL list. Keywords mode is similar to site mode in that, once the initial seed URLs have been constructed, the seed pages and relevant hyperlinks are downloaded hierarchically according to their link structure. Breakpoint transmission mode is used to complete a download task that has been unexpectedly terminated. In HISpider, update is an optional feature that regularly checks all downloaded data and re-extracts pages that have been updated on the server side.
Using MapReduce, we assign URLs to multiple mapper instances, and web pages are downloaded in parallel during the map stage. Fig. 3 shows the workflow of HISpider. First, HISpider acquires an initial URL list according to the chosen extraction mode. It then starts a timer that determines when to begin an update, if the update strategy has been enabled by the user. When a download is running, HISpider downloads in parallel all URLs (web pages and documents) in the incomplete URL list handled by the spider map. In that map, if the current iteration is less than the extraction depth, HISpider extracts all hyperlinks at the current iteration to form the incomplete URL list for the next round. After downloading all pages in the current iteration, HISpider puts the metadata of the downloaded pages and documents into an HBase table, which is used by the other modules. When an update is running, the update map scans in parallel all record files generated by the download jobs, checks whether each URL needs to be updated, and re-downloads updated pages to overwrite the original versions. The pseudocode of the spider map and the update map is shown in algorithms 1 and 2.

▲Figure 3. Parallel intelligent spider module.
Algorithm 1: Spider map(key, value)
Input:
key: offset in bytes; value: URL address.
Output:
key': hyperlinks in this URL address; value': null.
1: parse value to an array valueArray;
2: URL ← valueArray[0];
3: original last modified time ρ ← valueArray[1];
4: get file type β of URL;
5: if β is pdf or doc then
6: extract text from document and store in HDFS;
7: else
8: get charset of webpage and convert to UTF-8;
9: check re-directed page and go to target page;
10: get all contents of webpage;
11: output downloaded page's content into HDFS;
12: output downloaded page's URL into HDFS;
13: if iteration round λ is less than the extraction depth then
14: key' ← hyperlinks in current URL;
15: output key' as an uncompleted URL into HDFS;
16: end if;
17: end if.

Algorithm 2: Update map(key, value)
Input:
key: offset in bytes; value: downloaded URL records with last modified time.
Output:
key': updated URL and its updated last modified time; value': null.
1: parse value to an array valueArray;
2: URL ← valueArray[0];
3: original last modified time ρ ← valueArray[1];
4: get last modified time γ of URL;
5: if ρ < γ then
6: get file type β of URL;
7: if β is pdf or doc then
8: extract text from document and store in HDFS.

3.1.2 Webpage Parsing
Webpage parsing involves extracting the text of a web page. After parsing, web pages can be indexed for quick retrieval. Parsing involves identifying the character set of the HTML file, grasping its main structure, and extracting text according to that structure. Different websites often use different character sets, and these need to be identified and converted into a uniform encoding for subsequent operations. The main structure of a web page contains the title, keywords, labels, pictures, and hyperlinks; we extract text from this main structure.

Our parallel webpage-parsing algorithm and its process are based on Hadoop and HBase (Fig. 4). The parsing algorithm can be easily implemented with MapReduce; therefore, we omit the detailed pseudocode here. Data is stored in HDFS and HBase: web page URLs, used for unique identification, are stored in HBase, and the extracted text is stored in HDFS.

▲Figure 4. Parallel parsing algorithm.

3.2 Parallel Text ETL and Modeling Subsystem
The parallel text ETL and modeling subsystem contains four modules: Chinese word segmentation, word count and modeling, indexing, and feature selection. For each module, we develop several parallel algorithms. The feature-selection algorithms include TFIDF, information gain, mutual information, and the chi-square test. Here, we only detail the implementation of Chinese word segmentation and term frequency-inverse document frequency (TFIDF).

3.2.1 Chinese Word Segmentation
Unlike English and other European languages, written Chinese has no delimiter to mark the beginnings and endings of words in a sentence. Therefore, word segmentation is the first task in processing information written in Chinese. The task is challenging because it is often difficult to define what constitutes a word in Chinese. There have been quite a few approaches, which can be roughly categorized as rule-based approaches (based on linguistics) and statistical approaches (based on a corpus and machine learning). ICTCLAS is a unified framework based on a hierarchical hidden Markov model that integrates Chinese word segmentation, part-of-speech tagging, disambiguation, and unknown-word recognition [13].
Our parallel MapReduce algorithm for Chinese word segmentation uses an open-source implementation of ICTCLAS. Fig. 5 shows our parallel Chinese word segmentation algorithm based on the Hadoop platform, HBase, and MapReduce. First, we upload the dictionary files from the master node to HDFS so that the slave nodes can load the dictionary in a distributed manner before calling the word-segmentation package. Then, we configure and run a MapReduce job. The job has an input HBase table, which stores the HDFS locations and IDs of the files to be processed, and an output HBase table, which stores the HDFS locations and IDs of the files after they have been processed. A mapper class consists of three functions: setup(), map(), and cleanup(). Under the MapReduce framework, the functions of the mapper class are executed in a distributed manner. First, setup() is executed to load the dictionary files from HDFS to a temporary path on each slave node. Second, map() parses a record of the input HBase table to obtain the file to be processed and then calls the ICTCLAS package to split the sentences of the file into words. Finally, cleanup() deletes the dictionary files under the temporary path on each slave node. The pseudocode of the map() function of our algorithm is shown in algorithm 3.

▲Figure 5. Flowchart of parallel Chinese word segmentation algorithm.

Algorithm 3: CWS_map(key, value)
Input:
key: row key of the current row being processed; value: the value of the row.
Output:
key': the same as the input key; put: HDFS position and ID number of the segmentation result file.
1: parse a record of the input HBase table to obtain the ID number and HDFS path of the file to be processed;
2: read in the file to be processed;
3: call the ictclas4j package to split the sentences of the file into words;
4: write the segmentation results to an HDFS file specified by the user;
5: take key as key';
6: take the ID number and HDFS position of the segmentation result file as two columns of put;
7: write a record (key', put) into the output HBase table.

3.2.2 Term Frequency-Inverse Document Frequency
TFIDF is a numerical statistic that reflects the importance of a word to a document in a collection or corpus [14]. It is often used as a weighting factor in information retrieval and text mining. The TFIDF value increases proportionally with the number of times a word appears in a document but is offset by the frequency of the word in the corpus, because some words are generally more common than others. TFIDF undervalues terms that frequently appear in documents belonging to the same class and gives greater weight to terms that represent the characteristics of the documents of a class.

Term frequency (TF) is the number of times a term occurs in a document and is written tf(t, d), where t is the term and d is the document. Inverse document frequency (IDF) measures whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient:

idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )    (1)

where |D| is the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears. If the term is not in the corpus, a division by zero occurs; therefore, it is common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|. Then, the TFIDF is calculated as

tfidf(t, d, D) = tf(t, d) × idf(t, D).    (2)

According to the tf(t, d) matrix, we design parallel MapReduce algorithms to calculate the df and idf of each term, respectively. Fig. 6 shows the calculation of tfidf(t, d, D).
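Before turning to the parallel version, a minimal single-machine sketch of formulas (1) and (2) is given below as a reference point; the toy corpus is invented for illustration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Computes tf(t, d), idf(t, D) with the 1 + |{d : t in d}| adjustment, and
// tfidf(t, d, D) for one document of a toy corpus of tokenized documents.
public class TfidfSketch {
    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("cloud", "computing", "cloud"),
                Arrays.asList("web", "mining"),
                Arrays.asList("cloud", "mining"));

        // document frequency of each term
        Map<String, Integer> df = new HashMap<String, Integer>();
        for (List<String> doc : corpus) {
            for (String t : new HashSet<String>(doc)) {
                df.put(t, df.containsKey(t) ? df.get(t) + 1 : 1);
            }
        }

        int n = corpus.size();
        List<String> doc0 = corpus.get(0);
        for (String t : new HashSet<String>(doc0)) {
            int tf = 0;
            for (String w : doc0) {
                if (w.equals(t)) tf++;                            // tf(t, d)
            }
            double idf = Math.log((double) n / (1 + df.get(t)));  // formula (1), adjusted
            System.out.println(t + "\ttfidf = " + tf * idf);      // formula (2)
        }
    }
}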
There are two main MapReduce jobs. The first involves calculating the df of each term and selecting features according to the DF threshold; the second involves calculating the idf and tfidf of each term and selecting features according to the TFIDF threshold. Finally, a reduced document-term matrix is obtained that removes most of the redundant information and keeps most of the feature information. The two MapReduce jobs of the parallel TFIDF algorithm can be implemented as follows.

▲Figure 6. Parallel TFIDF algorithm.

The first MapReduce job is described in algorithms 4 and 5. In algorithm 4, step 1 removes the document id from the input text value, and steps 2 to 5 output all the terms with a value of 1. In algorithm 5, steps 2 to 4 calculate the summation of the values, and steps 5 to 8 output all the terms whose summation is greater than or equal to the DF threshold.

Algorithm 4: DFMapper
Input:
key: offset in bytes;
value: a text containing the document id and a vector of all the <term, count> pairs.
Output:
key': a text for each term;
value': an integer one.
1: split the input text value to get the document id and all the <term, count> pairs;
2: for each <term, count> pair
3: set key' as term, value' as 1;
4: output <key', value'>;
5: endfor.

Algorithm 5: DFReducer
Input:
key: a term text;
value: a vector of integer ones whose length is the number of documents in which the term occurs.
Output:
key': the same as key;
value': the sum of the integer ones in value.
1: initialize sum as zero and DFThreshold as the threshold of DF;
2: for each integer v in value
3: sum += v;
4: endfor;
5: if (sum ≥ DFThreshold)
6: set key' as key, value' as sum;
7: output <key', value'>;
8: endif.

The second MapReduce job is described in algorithm 6. Steps 1 and 2 are preparation and involve parameter reading and preprocessing of the input data. Steps 3 to 8 calculate the tfidf of each term and select features according to the TFIDF threshold. Steps 9 to 10 output the tfidf vector of each document.

Algorithm 6: TFIDFMapper
Input:
key: the offset in bytes;
value: a text containing the document id and a vector of all the <term, count> pairs.
Output:
key': a text of the document id;
value': a vector of all the terms with their tfidf values.
1: read the df values from the temporary HDFS path, set N as the total number of documents, and set TFIDFThreshold as the threshold of TFIDF;
2: split the input text value to get the document id and all the <term, count> pairs;
3: for each <term, count> pair
4: calculate tfidf according to (2);
5: if (tfidf ≥ TFIDFThreshold)
6: value'.append(<term, tfidf>);
7: endif;
8: endfor;
9: set key' as the document id;
10: output <key', value'>.

3.3 Parallel Text Mining Subsystem
The parallel text mining subsystem is the core of our web-mining system and is closely related to the applications. The subsystem has eight modules: full-text retrieval, text classification, text association, sentiment analysis, semantic analysis, text clustering, text abstraction, and topic discovery. In total, we have developed nineteen parallel algorithms. Here, we only detail the implementation of two parallel algorithms.
3.3.1 Co-Occurrence Analysis
Co-occurrence analysis (CoocAnalysis) is a statistical method used in text mining [15]. Generally speaking, it uses statistical theory to analyze the co-occurrence distribution of text knowledge units and mines the potential associations between these units. Recently, CoocAnalysis has become more important in knowledge mining and discovery. The co-occurrence weights between terms Tj and Tk are given by (3) and (4): (3) gives the co-occurrence weight from term Tj to Tk, and (4) gives the co-occurrence weight from term Tk to Tj, so the entire co-occurrence weight matrix of all terms is non-symmetric. In (3) and (4), d_jki is the co-weight of terms Tj and Tk in document i and is defined by (5), where
• tf_jki is the smaller of the occurrence counts of Tj and Tk in document i
• df_jk is the number of documents in which terms Tj and Tk occur together
• N is the number of documents.
In the CoocAnalysis algorithm, in order to penalize very common terms, each term is multiplied by a weight factor similar to the inverse document frequency; the weight factor is given by (6).

From (6), it can be concluded that the main purpose of the CoocAnalysis algorithm is to compute statistics on term frequencies, so it is well suited to parallelization. We have designed parallel MapReduce algorithms for CoocAnalysis, and the process is shown in Fig. 7.

▲Figure 7. Parallel CoocAnalysis algorithm.

There are two main MapReduce jobs. The first involves calculating the document frequency and weight factor of each term. The second involves calculating the co-weight of each term pair in each document and the co-occurrence weight between terms. The two MapReduce jobs of the parallel CoocAnalysis algorithm can be implemented as follows.

The first MapReduce job of the parallel CoocAnalysis algorithm is described in algorithms 7 and 8. In algorithm 7, step 1 obtains the document id and the <term, count> pairs, and steps 2 to 5 output the document id for each term.

Algorithm 7: InvIndexWFMapper
Input:
key: the offset in bytes;
value: a text containing the document id and a vector of all the <term, count> pairs.
Output:
key': a text for each term;
value': the document id for each term.
1: split the input text value to get the document id and all the <term, count> pairs;
2: for each <term, count> pair
3: set key' as term, value' as document id;
4: output <key', value'>;
5: endfor.

Algorithm 8: InvIndexWFReducer
Input:
key: a text for each term;
value: a vector of the document ids in which the term key occurs.
Output: two kinds of <key', value'> pairs.
For the first kind:
key': the same as key;
value': a text of all the document ids in value.
For the second kind:
key': the string "WF" appended to key;
value': the weight factor of the term.
1: initialize sum as zero, and set N as the total number of documents;
2: for each document id v in value
3: value'.append(v);
4: sum += 1;
5: endfor;
6: set key' as key;
7: output <key', value'>;
8: key' = "WF" + key;
9: calculate value' according to (6);
10: output <key', value'>.

The second MapReduce job of the parallel CoocAnalysis algorithm is described in algorithms 9 and 10. In algorithm 9, steps 1 and 2 are preparation and include parameter reading and preprocessing of the input data. Steps 3 to 17 calculate the tfidf value of each term as well as the co-occurrence TFIDF value of each term pair, which are then output. In algorithm 10, step 1 is preparation, and steps 2 to 4 calculate the co-occurrence weight between terms, which is then output.

Algorithm 9: CoocMapper
Input:
key: offset in bytes;
value: a text containing the document id and a vector of all the <term, count> pairs.
Output:
key': a text of a term pair;
value': the co-occurrence TFIDF value of each term pair.
1: read in the output of the previous MapReduce job, and initialize TFIDF as a list to save the tfidf of each single term;
2: split the input text value to get the document id and all the <term, count> pairs;
3: for each <term, count> pair
4: calculate the tfidf value of the single term in the pair;
5: TFIDF.add(tfidf);
6: initialize docIDList1 to save all document ids in which the term term occurs;
7: put all the document ids in which the term term occurs into docIDList1, and sort it;
8: for each of the other terms term'
9: initialize docIDList2 to save all document ids in which the term term' occurs;
10: put all the document ids in which the term term' occurs into docIDList2, and sort it;
11: according to docIDList1 and docIDList2, calculate the co-occurrence number of term and term';
12: calculate the co-occurrence TFIDF value of the term pair according to (5), and set value' as it;
13: set key' as <term, term'>;
14: output <key', value'>;
15: endfor;
16: endfor;
17: output TFIDF to a temporary HDFS path.

Algorithm 10: CoocReducer
Input:
key: a text of a term pair;
value: the co-occurrence TFIDF value of each term pair.
Output:
key': a text of a term pair;
value': the co-occurrence weight between the terms.
1: read in TFIDF from the temporary HDFS path;
2: calculate value' according to (3)-(4);
3: construct key' according to key;
4: output <key', value'>.

3.3.2 Semantic Analysis Based on PLSA
In machine learning, semantic analysis of a corpus is achieved by building structures that approximate concepts from a large set of documents. This generally does not involve prior semantic understanding of the documents. Latent semantic analysis (LSA) is a technique in natural language processing used to analyze relationships between a set of documents and terms in order to produce a set of concepts related to the documents and terms. Probabilistic latent semantic analysis (PLSA) is a popular topic-modeling technique for exploring document collections [16]. A parallel PLSA algorithm is described in [17]. We design a parallel semantic-analysis algorithm (SPLSA) based on parallel PLSA.

We aim to find related words given some index words. In a corpus, the PLSA algorithm can find the topics and the probability with which each word belongs to each topic. After that, we build the word-topic-probability relationship; that is, each word belongs to a topic with a certain probability. Algorithm 11 shows the pseudocode.

Algorithm 11: Map(key, value)
Input:
key: offset in bytes;
value: word-topics-probabilities.
Output:
the word-topic-probability file.
1: read value into array wordtopic[];
2: max = -1; // records the maximum probability with which the word belongs to a topic
3: topic = -1; // records the index of the topic to which the word belongs
4: for (i = 1; i < wordtopic.length; i++)
5: if (wordtopic[i] > max)
6: max = wordtopic[i];
7: topic = i - 1;
8: endif;
9: endfor;
10: write wordtopic[0]-topic-max; // the word-topic-probability relationship
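For illustration, the core of Algorithm 11 (choosing, for each word, the topic with the maximum probability) could be written as the following sketch; the whitespace-separated input format is an assumption.

// Given a line "word p(topic0) p(topic1) ... p(topicK-1)", pick the topic with
// the maximum probability, mirroring steps 2 to 10 of Algorithm 11.
public class WordTopicArgmax {
    static String argmaxLine(String line) {
        String[] wordtopic = line.trim().split("\\s+");
        double max = -1.0;   // maximum probability seen so far
        int topic = -1;      // index of the corresponding topic
        for (int i = 1; i < wordtopic.length; i++) {
            double p = Double.parseDouble(wordtopic[i]);
            if (p > max) {
                max = p;
                topic = i - 1;
            }
        }
        // the word-topic-probability record
        return wordtopic[0] + "\t" + topic + "\t" + max;
    }

    public static void main(String[] args) {
        System.out.println(argmaxLine("cloud 0.10 0.75 0.15"));  // cloud  1  0.75
    }
}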
According to this relationship, we can find all words that belong to the same topic. The words with the top-n maximum probabilities within the same topic are related; the parameter n is given by the user. The parallel analysis process is described in Fig. 8: first, we read the index words and the word-topic file; second, we find the topics and build the word-topic-probability relationship during the map phase; finally, we determine the related words from the relationship and output them.

▲Figure 8. Flowchart of parallel SPLSA algorithm.

4 Experiments
In this work, we focus on designing parallel web-mining algorithms and therefore only test the efficiency of the parallel algorithms to verify their correct parallel implementation.

4.1 Experiment Preparation
Fig. 9 shows the simulation interface, here used to set the parameters of the parallel CoocAnalysis algorithm. The main class and jar package can be set, and it is also convenient to set the data input path (-i), the output path (-o), the total number of documents (-d), and the number of reducers (-r).

▲Figure 9. Simulation interface.

In these experiments, we only describe the efficiency of the web crawler, webpage parsing, TFIDF, and CoocAnalysis in terms of speedup.

The data set is from the Yahoo! News website, and we use the proposed parallel web crawler to crawl the web pages. To test the web crawler, we construct a complete URL list with 20,000 URLs, and the experiments are conducted on computing systems with different numbers of nodes. The size of the URL list is only 800 kB, so to maximize parallelization we modify the block size rather than adopt the default value of 64 MB.

The parallel system is a cluster of four computer nodes running Linux, and each node has four 2.8 GHz cores and 4 GB of memory. The MapReduce system is configured with Hadoop 0.17.0 and Java 1.6.0_22 for all experiments. We perform the experiments on computing systems with 1, 2, and 4 nodes.

We use the popular evaluation metric speedup [4] to validate our algorithms. It is defined as speedup(m) = T1 / Tm, where T1 is the running time of an algorithm on a single node and Tm is its running time on m nodes. When the value of speedup is m, we obtain linear speedup. In practice, however, linear speedup cannot be reached because of task-allocation imbalance and the communication cost between computer nodes.

4.2 Results
Fig. 10(a)-(d) shows the results. We find that all the parallel algorithms have very good speedup properties. Specifically, all the algorithms except CoocAnalysis have a speedup value higher than 3 when there are only four nodes. Furthermore, low-complexity algorithms such as webpage parsing and TFIDF have high speedup values.

▲Figure 10. The speedup of the parallel algorithms.

5 Conclusion
In this paper, we have developed a parallel web-mining system that includes more than forty machine-learning algorithms for text mining. Using a cloud platform and MapReduce, we analyze the mechanism of each algorithm and carefully design its parallel implementation.

Acknowledgements
This work is also supported by the ZTE research fund for the Parallel Web Mining project. We would also like to thank Wenjuan Luo, Tianfeng Shang, Changying Du, Xin Jin, Zhi Dong, YunLong Ma, Qun Wang, Shuo Han, Xinyu Wu, and Xiaofeng Geng for their contributions to this paper.