My junior labmate's graduation project needed Mahout's random forest support, but the Mahout 0.11.1 installed on our lab cluster differs from her tutorial in that area, so under time pressure we decided to drop back to an older release. Cloudera's CDH 5.3 ships Mahout 0.9, so that became the target version. However, the official Mahout 0.9 binaries are only compatible with Hadoop 1.x; to run Mahout 0.9 on our lab's Hadoop 2.6 cluster it has to be recompiled from source (reference: the official Mahout build guide, https://mahout.apache.org/developers/buildingmahout.html).
For convenience, a prebuilt Mahout 0.9 that works with Hadoop 2.x is available for download here: http://pan.baidu.com/s/1eRRBP2U
If you would rather build it yourself, follow the steps below.
Step 1: Patch the Mahout 0.9 source
Official Mahout downloads: http://archive.apache.org/dist/mahout/
Out of the box, Mahout 0.9 only supports Hadoop 1.x. The fix is tracked in https://issues.apache.org/jira/browse/MAHOUT-1329; it mainly consists of modifying the pom files to change Mahout's Hadoop dependencies.
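Conceptually, the patch adds a hadoop2 build profile to the poms that swaps in Hadoop 2 artifacts. The fragment below only illustrates that idea and is not the literal patch contents (the artifact names and structure here are assumptions; see the actual 1329-3.patch for the real changes):

```xml
<!-- Illustrative sketch of a hadoop2 profile; not the literal patch. -->
<profile>
  <id>hadoop2</id>
  <activation>
    <property>
      <name>hadoop2</name>
    </property>
  </activation>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop2.version}</version>
    </dependency>
  </dependencies>
</profile>
```

Activating such a profile with -Dhadoop2 is exactly what the build command in Step 2 does.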
Download the patch 1329-3.patch from https://issues.apache.org/jira/secure/attachment/12630146/1329-3.patch and copy it to the server.
Download mahout-distribution-0.9-src.tar.gz and extract the source on the server.
From the source root directory, apply the patch:
patch -p0 < ../1329-3.patch
The output should look like:
patching file core/pom.xml
patching file integration/pom.xml
patching file pom.xml
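If you are unsure what -p0 means or want to check a patch before applying it, here is a tiny self-contained demonstration (the file and patch below are made up for illustration; they are not part of the Mahout source):

```shell
# Set up a tiny "source tree" with one file.
mkdir -p demo/core
printf 'hadoop1\n' > demo/core/dep.txt

# A unified diff whose paths carry no leading directory component to
# strip, hence -p0. It must be applied from the source root (here we
# use -d demo instead of cd'ing into it).
cat > 1329-demo.patch <<'EOF'
--- core/dep.txt
+++ core/dep.txt
@@ -1 +1 @@
-hadoop1
+hadoop2
EOF

patch -d demo -p0 --dry-run < 1329-demo.patch  # verify first, changes nothing
patch -d demo -p0 < 1329-demo.patch            # prints: patching file core/dep.txt
cat demo/core/dep.txt                          # prints: hadoop2
```

The --dry-run pass is a cheap way to confirm the patch applies cleanly before touching the real tree.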
Step 2: Compile Mahout 0.9 against Hadoop 2.x
Run the following (Maven must already be installed; see the appendix at the end of this article for Maven setup — you can also adjust -Dhadoop2.version to match your cluster's Hadoop version):
mvn package -Prelease -Dhadoop2 -Dhadoop2.version=2.2.0 -DskipTests=true
Then comes a long wait...
until finally:
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................. SUCCESS [  1.680 s]
[INFO] Apache Mahout ...................................... SUCCESS [  2.056 s]
[INFO] Mahout Math ........................................ SUCCESS [ 24.012 s]
[INFO] Mahout Core ........................................ SUCCESS [ 32.697 s]
[INFO] Mahout Integration ................................. SUCCESS [  7.977 s]
[INFO] Mahout Examples .................................... SUCCESS [ 20.199 s]
[INFO] Mahout Release Package ............................. SUCCESS [ 34.697 s]
[INFO] Mahout Math/Scala wrappers ......................... SUCCESS [  4.728 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:08 min
[INFO] Finished at: 2016-03-31T14:56:44+08:00
[INFO] Final Memory: 87M/1320M
[INFO] ------------------------------------------------------------------------
This output means the build succeeded.
The built binary packages are under distribution/target in the source tree; mahout-distribution-0.9.tar.gz there is the deployment package.
Step 3: Install and deploy Mahout 0.9
Take the freshly built mahout-distribution-0.9.tar.gz deployment package, extract it, and move it to the desired directory.
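Deployment really is just extract-and-move. As a self-contained sketch (using a throwaway stand-in tarball, since the real one comes out of Step 2; directory names here are arbitrary):

```shell
# Build a stand-in tarball with the same layout as the real package.
mkdir -p stage/mahout-distribution-0.9/bin
echo 'demo' > stage/mahout-distribution-0.9/bin/mahout
tar -zcf mahout-distribution-0.9.tar.gz -C stage mahout-distribution-0.9

# Extract into the target directory (stand-in for e.g. /usr/local/hadoop-2.6.0).
mkdir -p deploy
tar -zxf mahout-distribution-0.9.tar.gz -C deploy
ls deploy/mahout-distribution-0.9/bin   # prints: mahout
```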
Configure the environment variables: vi /etc/profile and add:
# set mahout environment
export MAHOUT_HOME=/usr/local/hadoop-2.6.0/mahout-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$CLASSPATH:$MAHOUT_HOME/lib
Save and exit, then run source /etc/profile to make the changes take effect immediately.
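A quick sanity check that the variables landed where expected (the paths are the ones used above and may differ on your machine):

```shell
# Reproduce the /etc/profile additions and verify them.
export MAHOUT_HOME=/usr/local/hadoop-2.6.0/mahout-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$CLASSPATH:$MAHOUT_HOME/lib

echo "$MAHOUT_HOME"                 # prints: /usr/local/hadoop-2.6.0/mahout-0.9
case ":$PATH:" in
  *":$MAHOUT_HOME/bin:"*) echo "PATH OK" ;;
  *)                      echo "PATH is missing $MAHOUT_HOME/bin" ;;
esac
```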
Configure Mahout itself: vi mahout-distribution-0.9/bin/mahout and add:
MAHOUT_JAVA_HOME=/usr/local/jdk1.7.0_03
To run in pseudo-distributed (local) mode, also add:
MAHOUT_LOCAL=true
If MAHOUT_LOCAL is left unset, Mahout defaults to running on Hadoop.
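The switch behaves as an on/off flag: any non-empty value means local execution, otherwise the script falls back to Hadoop. A simplified stand-in for that check (illustrative only; the real bin/mahout script does much more):

```shell
# Hypothetical reduction of the MAHOUT_LOCAL check in bin/mahout.
run_mode() {
  if [ -n "$MAHOUT_LOCAL" ]; then
    echo "running locally (MAHOUT_LOCAL is set)"
  else
    echo "running on hadoop (MAHOUT_LOCAL is not set)"
  fi
}

MAHOUT_LOCAL=""
run_mode        # prints: running on hadoop (MAHOUT_LOCAL is not set)

MAHOUT_LOCAL=true
run_mode        # prints: running locally (MAHOUT_LOCAL is set)
```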
Run mahout --help to check whether the installation works:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/local/hadoop/hadoop-2.2.0/bin/hadoop and HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.2.0/conf
MAHOUT-JOB: /usr/lib/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Unknown program '--help' chosen.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
Output like the above means the installation succeeded: note "Running on hadoop" near the top, followed by the list of programs Mahout supports.
Appendix: Maven setup
Official Maven download page: http://maven.apache.org/download.cgi
Pick a suitable version, download it, upload it to the server, then extract and copy it to the target directory.
Configure the environment variables:
# set maven environment
export MAVEN_HOME=/usr/local/maven-3.3.9
export PATH=$MAVEN_HOME/bin:$PATH
Run source /etc/profile to make the configuration take effect, then test with mvn -v.
You should see something like:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /usr/local/maven-3.3.9
Java version: 1.7.0_79, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-573.el6.x86_64", arch: "amd64", family: "unix"
which means the installation succeeded.
Give Maven a local repository: edit conf/settings.xml, find the commented-out <localRepository> element (around line 53), uncomment it, and set it to:
/usr/data/maven/maven-dependcies
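For reference, after uncommenting, the element in settings.xml should read as follows (the directory is the one chosen above; any writable path works):

```xml
<localRepository>/usr/data/maven/maven-dependcies</localRepository>
```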
Since Maven's default repositories sit on overseas servers that may be unreachable, configure a domestic mirror first. In conf/settings.xml under the Maven directory, add a new <mirror> entry inside <mirrors></mirrors>, leaving the existing content alone:
<mirror>
  <id>nexus-osc</id>
  <mirrorOf>*</mirrorOf>
  <name>Nexus osc</name>
  <url>http://maven.oschina.net/content/groups/public/</url>
</mirror>
Similarly, add a new <profile> entry inside <profiles></profiles>:
<profile>
  <id>jdk-1.7</id>
  <activation>
    <jdk>1.7</jdk>
  </activation>
  <repositories>
    <repository>
      <id>nexus</id>
      <name>local private nexus</name>
      <url>http://maven.oschina.net/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>nexus</id>
      <name>local private nexus</name>
      <url>http://maven.oschina.net/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </pluginRepository>
  </pluginRepositories>
</profile>
At this point the configuration is complete.