My junior labmate's graduation project needed Mahout's random forest support, but the Mahout 0.11.1 installed on our lab cluster differs from her tutorial in that area, so under time pressure we decided to drop back to an older release. Cloudera's CDH 5.3 ships Mahout 0.9, so that became the target version. However, the official Mahout 0.9 binaries are only compatible with Hadoop 1.x; to run Mahout 0.9 on our lab's Hadoop 2.6 cluster it has to be recompiled from source (reference: the official Mahout build guide, https://mahout.apache.org/developers/buildingmahout.html).
For convenience, a prebuilt Mahout 0.9 that works with Hadoop 2.x is available for download here: http://pan.baidu.com/s/1eRRBP2U
If you would rather build it yourself, follow the steps below.
Step 1: Patch the Mahout 0.9 source
Official Mahout downloads: http://archive.apache.org/dist/mahout/
Out of the box, Mahout 0.9 only supports Hadoop 1.x. The fix is tracked in https://issues.apache.org/jira/browse/MAHOUT-1329; it mainly consists of modifying the pom files to change Mahout's Hadoop dependencies.
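Conceptually, the patch adds a hadoop2 build profile to the poms that swaps in Hadoop 2 artifacts. The fragment below only illustrates that idea and is not the literal patch contents (the artifact names and structure here are assumptions; see the actual 1329-3.patch for the real changes):

```xml
<!-- Illustrative sketch of a hadoop2 profile; not the literal patch. -->
<profile>
  <id>hadoop2</id>
  <activation>
    <property>
      <name>hadoop2</name>
    </property>
  </activation>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop2.version}</version>
    </dependency>
  </dependencies>
</profile>
```

Activating such a profile with -Dhadoop2 is exactly what the build command in Step 2 does.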
Download the patch 1329-3.patch from https://issues.apache.org/jira/secure/attachment/12630146/1329-3.patch and copy it to the server.
Download mahout-distribution-0.9-src.tar.gz and extract the source on the server.
From the source root directory, apply the patch:
patch -p0 < ../1329-3.patch
The output should look like:
patching file core/pom.xml
patching file integration/pom.xml
patching file pom.xml
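If you are unsure what -p0 means or want to check a patch before applying it, here is a tiny self-contained demonstration (the file and patch below are made up for illustration; they are not part of the Mahout source):

```shell
# Set up a tiny "source tree" with one file.
mkdir -p demo/core
printf 'hadoop1\n' > demo/core/dep.txt

# A unified diff whose paths carry no leading directory component to
# strip, hence -p0. It must be applied from the source root (here we
# use -d demo instead of cd'ing into it).
cat > 1329-demo.patch <<'EOF'
--- core/dep.txt
+++ core/dep.txt
@@ -1 +1 @@
-hadoop1
+hadoop2
EOF

patch -d demo -p0 --dry-run < 1329-demo.patch  # verify first, changes nothing
patch -d demo -p0 < 1329-demo.patch            # prints: patching file core/dep.txt
cat demo/core/dep.txt                          # prints: hadoop2
```

The --dry-run pass is a cheap way to confirm the patch applies cleanly before touching the real tree.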
Step 2: Compile Mahout 0.9 against Hadoop 2.x
Run the following (Maven must already be installed; see the appendix at the end of this article for Maven setup — you can also adjust -Dhadoop2.version to match your cluster's Hadoop version):
mvn package -Prelease -Dhadoop2 -Dhadoop2.version=2.2.0 -DskipTests=true
Then comes a long wait...
until finally:
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ................................. SUCCESS [  1.680 s]
[INFO] Apache Mahout ...................................... SUCCESS [  2.056 s]
[INFO] Mahout Math ........................................ SUCCESS [ 24.012 s]
[INFO] Mahout Core ........................................ SUCCESS [ 32.697 s]
[INFO] Mahout Integration ................................. SUCCESS [  7.977 s]
[INFO] Mahout Examples .................................... SUCCESS [ 20.199 s]
[INFO] Mahout Release Package ............................. SUCCESS [ 34.697 s]
[INFO] Mahout Math/Scala wrappers ......................... SUCCESS [  4.728 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:08 min
[INFO] Finished at: 2016-03-31T14:56:44+08:00
[INFO] Final Memory: 87M/1320M
[INFO] ------------------------------------------------------------------------
This output means the build succeeded.
The built binary packages are under distribution/target in the source tree; mahout-distribution-0.9.tar.gz there is the deployment package.
Step 3: Install and deploy Mahout 0.9
Take the freshly built mahout-distribution-0.9.tar.gz deployment package, extract it, and move it to the desired directory.
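Deployment really is just extract-and-move. As a self-contained sketch (using a throwaway stand-in tarball, since the real one comes out of Step 2; directory names here are arbitrary):

```shell
# Build a stand-in tarball with the same layout as the real package.
mkdir -p stage/mahout-distribution-0.9/bin
echo 'demo' > stage/mahout-distribution-0.9/bin/mahout
tar -zcf mahout-distribution-0.9.tar.gz -C stage mahout-distribution-0.9

# Extract into the target directory (stand-in for e.g. /usr/local/hadoop-2.6.0).
mkdir -p deploy
tar -zxf mahout-distribution-0.9.tar.gz -C deploy
ls deploy/mahout-distribution-0.9/bin   # prints: mahout
```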
Configure the environment variables: vi /etc/profile and add:
# set mahout environment
export MAHOUT_HOME=/usr/local/hadoop-2.6.0/mahout-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$CLASSPATH:$MAHOUT_HOME/lib
Save and exit, then run source /etc/profile to make the changes take effect immediately.
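A quick sanity check that the variables landed where expected (the paths are the ones used above and may differ on your machine):

```shell
# Reproduce the /etc/profile additions and verify them.
export MAHOUT_HOME=/usr/local/hadoop-2.6.0/mahout-0.9
export PATH=$PATH:$MAHOUT_HOME/bin
export CLASSPATH=.:$CLASSPATH:$MAHOUT_HOME/lib

echo "$MAHOUT_HOME"                 # prints: /usr/local/hadoop-2.6.0/mahout-0.9
case ":$PATH:" in
  *":$MAHOUT_HOME/bin:"*) echo "PATH OK" ;;
  *)                      echo "PATH is missing $MAHOUT_HOME/bin" ;;
esac
```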
Configure Mahout itself: vi mahout-distribution-0.9/bin/mahout and add:
MAHOUT_JAVA_HOME=/usr/local/jdk1.7.0_03
To run in pseudo-distributed (local) mode, also add:
MAHOUT_LOCAL=true
If MAHOUT_LOCAL is left unset, Mahout defaults to running on Hadoop.
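The switch behaves as an on/off flag: any non-empty value means local execution, otherwise the script falls back to Hadoop. A simplified stand-in for that check (illustrative only; the real bin/mahout script does much more):

```shell
# Hypothetical reduction of the MAHOUT_LOCAL check in bin/mahout.
run_mode() {
  if [ -n "$MAHOUT_LOCAL" ]; then
    echo "running locally (MAHOUT_LOCAL is set)"
  else
    echo "running on hadoop (MAHOUT_LOCAL is not set)"
  fi
}

MAHOUT_LOCAL=""
run_mode        # prints: running on hadoop (MAHOUT_LOCAL is not set)

MAHOUT_LOCAL=true
run_mode        # prints: running locally (MAHOUT_LOCAL is set)
```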
Run mahout --help to check whether the installation works:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/local/hadoop/hadoop-2.2.0/bin/hadoop and HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.2.0/conf
MAHOUT-JOB: /usr/lib/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Unknown program '--help' chosen.
Valid program names are:
  arff.vector: : Generate Vectors from an ARFF file or directory
  baumwelch: : Baum-Welch algorithm for unsupervised HMM training
  canopy: : Canopy clustering
  cat: : Print a file or resource as the logistic regression models would see it
  cleansvd: : Cleanup and verification of SVD output
  clusterdump: : Dump cluster output to text
  clusterpp: : Groups Clustering Output In Clusters
  cmdump: : Dump confusion matrix in HTML or text formats
  concatmatrices: : Concatenates 2 matrices of same cardinality into a single matrix
  cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
  cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
  evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
  fkmeans: : Fuzzy K-means clustering
  hmmpredict: : Generate random sequence of observations by given HMM
  itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
  kmeans: : K-means clustering
  lucene.vector: : Generate Vectors from a Lucene index
  lucene2seq: : Generate Text SequenceFiles from a Lucene index
  matrixdump: : Dump matrix in CSV format
  matrixmult: : Take the product of two matrices
  parallelALS: : ALS-WR factorization of a rating matrix
  qualcluster: : Runs clustering experiments and summarizes results in a CSV
  recommendfactorized: : Compute recommendations using the factorization of a rating matrix
  recommenditembased: : Compute recommendations using item-based collaborative filtering
  regexconverter: : Convert text files on a per line basis based on regular expressions
  resplit: : Splits a set of SequenceFiles into a number of equal splits
  rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
  runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
  runlogistic: : Run a logistic regression model against CSV data
  seq2encoded: : Encoded Sparse Vector generation from Text sequence files
  seq2sparse: : Sparse Vector generation from Text sequence files
  seqdirectory: : Generate sequence files (of Text) from a directory
  seqdumper: : Generic Sequence File dumper
  seqmailarchives: : Creates SequenceFile from a directory containing gzipped mail archives
  seqwiki: : Wikipedia xml dump to sequence file
  spectralkmeans: : Spectral k-means clustering
  split: : Split Input data into test and train sets
  splitDataset: : split a rating dataset into training and probe parts
  ssvd: : Stochastic SVD
  streamingkmeans: : Streaming k-means clustering
  svd: : Lanczos Singular Value Decomposition
  testnb: : Test the Vector-based Bayes classifier
  trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
  trainlogistic: : Train a logistic regression using stochastic gradient descent
  trainnb: : Train the Vector-based Bayes classifier
  transpose: : Take the transpose of a matrix
  validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
  vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  vectordump: : Dump vectors from a sequence file to text
  viterbi: : Viterbi decoding of hidden states from given output states sequence
Output like the above means the installation succeeded: note "Running on hadoop" near the top, followed by the list of programs Mahout supports.
Appendix: Maven setup
Official Maven download page: http://maven.apache.org/download.cgi
Pick a suitable version, download it, upload it to the server, then extract and copy it to the target directory.
Configure the environment variables:
# set maven environment
export MAVEN_HOME=/usr/local/maven-3.3.9
export PATH=$MAVEN_HOME/bin:$PATH
Run source /etc/profile to make the configuration take effect, then test with mvn -v.
You should see something like:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /usr/local/maven-3.3.9
Java version: 1.7.0_79, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-573.el6.x86_64", arch: "amd64", family: "unix"
which means the installation succeeded.
Give Maven a local repository: edit conf/settings.xml, find the commented-out <localRepository> element (around line 53), uncomment it, and set it to:
/usr/data/maven/maven-dependcies
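For reference, after uncommenting, the element in settings.xml should read as follows (the directory is the one chosen above; any writable path works):

```xml
<localRepository>/usr/data/maven/maven-dependcies</localRepository>
```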
Since Maven's default repositories sit on overseas servers that may be unreachable, configure a domestic mirror first. In conf/settings.xml under the Maven directory, add a new <mirror> entry inside <mirrors></mirrors>, leaving the existing content alone:
<mirror>
  <id>nexus-osc</id>
  <mirrorOf>*</mirrorOf>
  <name>Nexus osc</name>
  <url>http://maven.oschina.net/content/groups/public/</url>
</mirror>
Similarly, add a new <profile> entry inside <profiles></profiles>:
<profile>
  <id>jdk-1.7</id>
  <activation>
    <jdk>1.7</jdk>
  </activation>
  <repositories>
    <repository>
      <id>nexus</id>
      <name>local private nexus</name>
      <url>http://maven.oschina.net/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>nexus</id>
      <name>local private nexus</name>
      <url>http://maven.oschina.net/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </pluginRepository>
  </pluginRepositories>
</profile>
At this point the configuration is complete.