Mahout integrates many common machine learning algorithms, which makes it convenient for anyone who wants to do research in data mining. It is based on Java, and a lot needs to be done before you can make it work. At the very least you will need the JDK, Eclipse, Hadoop and Mahout, but I strongly recommend installing everything listed below for a better setup.
I JDK
II mysql
III Tomcat
IV Eclipse and MyEclipse
V Maven
VI Hadoop and Mahout
VII Test
VIII k-means Algorithm Test
I JDK
sudo gedit /etc/profile
#set java environment
export JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
Reboot
Test:
vim hello.java
public class hello{
public static void main(String args[]){
System.out.println("Hello World!");
}
}
javac hello.java
java hello
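If javac or java is not recognized after the reboot, it helps to ask the shell where each tool resolves from. The snippet below is only a small sketch of that check; the helper name check_tool is made up for this example and is not part of the JDK.

```shell
#!/bin/sh
# check_tool is a hypothetical helper: it reports whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
        return 0
    else
        echo "$1 not found on PATH"
        return 1
    fi
}

# Check the two tools used in the test above. A "not found" line suggests that
# /etc/profile was not re-read or that JAVA_HOME points to the wrong directory.
for tool in java javac; do
    check_tool "$tool" || true
done
```

A "not found" result usually just means the PATH line in /etc/profile needs another look.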
II mysql
sudo apt-get install mysql-server mysql-client
And test:
sudo netstat -tap | grep mysql
A graphical tool is recommended. Search for mysql-admin in Synaptic and install it.
III Tomcat
http://mirror.bjtu.edu.cn/apache/tomcat/tomcat-7/v7.0.40/bin/
apache-tomcat-7.0.40.tar.gz
Add this:
JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
JAVA_OPTS="-server -Xms512m -Xmx1024m -XX:PermSize=600M -XX:MaxPermSize=600m -Dcom.sun.management.jmxremote"
In front of:
cygwin=false
os400=false
darwin=false
case "`uname`" in
CYGWIN*) cygwin=true;;
OS400*) os400=true;;
Darwin*) darwin=true;;
Add:
JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
to the end.
Then start Tomcat and type localhost:8080 in your browser.
IV Eclipse and MyEclipse
http://www.eclipse.org/downloads/
I chose the first one
MyEclipse:
Modify default jdk:
sudo update-alternatives --install "/usr/bin/java" "java" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/java" 300
sudo update-alternatives --install "/usr/bin/javac" "javac" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/javac" 300
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/home/lethic/Documents/Softwares/jdk1.7.0_21/bin/javaws" 300
sudo update-alternatives --config java
sudo update-alternatives --config javac
sudo update-alternatives --config javaws
Download:
http://www.myeclipseide.com/module-htmlpages-display-pid-4.html
Build a shortcut for MyEclipse
lethic@lethic:~/Documents/Softwares$ sudo chown -R root:root MyEclispse
lethic@lethic:~/Documents/Softwares$ sudo chmod -R +r MyEclispse
lethic@lethic:~/Documents/Softwares$ cd 'MyEclispse/MyEclipse 10/'
lethic@lethic:~/Documents/Softwares/MyEclispse/MyEclipse 10$ sudo chown -R root:root myeclipse
lethic@lethic:~/Documents/Softwares/MyEclispse/MyEclipse 10$ sudo chmod -R +r myeclipse
sudo gedit /usr/bin/MyEclipse
#!/bin/sh
export MYECLIPSE_HOME="/home/lethic/Documents/Softwares/MyEclispse/MyEclipse 10/myeclipse"
"$MYECLIPSE_HOME/myeclipse" "$@"
sudo chmod 755 /usr/bin/MyEclipse
sudo chmod -R 777 /home/lethic/Documents/Softwares/MyEclispse
sudo gedit /usr/share/applications/MyEclipse.desktop
[Desktop Entry]
Encoding=UTF-8
Name=MyEclipse 10
Comment=IDE for JavaEE
Exec=/home/lethic/Documents/Softwares/MyEclispse/MyEclipse\ 10/myeclipse
Icon=/home/lethic/Documents/Softwares/MyEclispse/MyEclipse\ 10/icon.xpm
Terminal=false
Type=Application
Categories=GNOME;Application;Development;
StartupNotify=true
Then initialize it:
'/usr/MyEclipse/MyEclipse 10/myeclipse' -clean
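Because the MyEclipse directory name contains a space, any script that expands the path has to quote it, otherwise the shell splits it into two words. The sketch below only counts words to show the difference; count_args is a made-up helper and nothing is launched.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints how many words it received.
count_args() {
    echo "$#"
}

# Same path as in the launcher script above (adjust to your own location).
MYECLIPSE_HOME="/home/lethic/Documents/Softwares/MyEclispse/MyEclipse 10/myeclipse"

# Unquoted: the shell splits the path at the space, so the command name is wrong.
count_args $MYECLIPSE_HOME/myeclipse

# Quoted: the whole path stays one word.
count_args "$MYECLIPSE_HOME/myeclipse"
```

The first call prints 2 and the second prints 1, which is why the path should be quoted wherever it is expanded.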
V Maven
Apache Maven 3.0.5
http://maven.apache.org/docs/3.0.5/release-notes.html
tar -xvzf apache-maven-3.0.5-bin.tar.gz
#create a link for it to make it easy to upgrade
ln -s apache-maven-3.0.5 apache-maven
#reboot and test
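The point of the link is that an upgrade only needs the link to be repointed. Here is a minimal sketch of that idea in a throwaway directory; apache-maven-NEW is a placeholder name for some later release, not a real version, and no real Maven install is touched.

```shell
#!/bin/sh
# Work in a temporary directory so nothing on the system is changed.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Stand-ins for two unpacked releases (apache-maven-NEW is hypothetical).
mkdir apache-maven-3.0.5 apache-maven-NEW

# Initial link, as in the step above.
ln -s apache-maven-3.0.5 apache-maven
echo "before upgrade: $(readlink apache-maven)"

# Upgrade: repoint the link. -f replaces the old link, and -n stops ln from
# following the old link into its target directory.
ln -sfn apache-maven-NEW apache-maven
echo "after upgrade: $(readlink apache-maven)"
```

If a variable such as MAVEN_HOME referred to the link instead of the versioned directory, it would not need to change after an upgrade.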
VI Hadoop and Mahout
Hadoop:
http://mirror.bit.edu.cn/apache/hadoop/common/stable/
hadoop-1.1.2.tar.gz
tar zxvf hadoop-1.1.2.tar.gz
Mahout:
http://mirror.bit.edu.cn/apache/mahout/0.6/
tar zxvf mahout-distribution-0.6.tar.gz
Add this to /etc/profile:
export HADOOP_HOME=/home/lethic/Documents/Softwares/hadoop-1.1.2
export HADOOP_CONF_DIR=/home/lethic/Documents/Softwares/hadoop-1.1.2/conf
export MAHOUT_HOME=/home/lethic/Documents/Softwares/mahout-distribution-0.6
export PATH=$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
Then refresh the profile again:
source /etc/profile
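After sourcing the profile, a typo in one of these paths is easy to miss. A small sketch of a sanity check follows; check_home is a made-up helper name, and it only tests that each variable is set and names an existing directory.

```shell
#!/bin/sh
# check_home is a hypothetical helper: given a variable NAME, it reports
# whether that variable is set and points to an existing directory.
check_home() {
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name=$value is not a directory"
        return 1
    fi
    echo "$name=$value looks fine"
}

# Check the three variables added to /etc/profile above.
for var in HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
    check_home "$var" || true
done
```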
VII Test
I modified my /etc/profile again, and in the end the part I added looks like this:
umask 022
#set java environment
#JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JAVA_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21
export JRE_HOME=/home/lethic/Documents/Softwares/jdk1.7.0_21/jre
#export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export GTK_IM_MODULE=ibus
export XMODIFIERS="@im=ibus"
export QT_IM_MODULE=ibus
export MAVEN_HOME=/home/lethic/Documents/Softwares/apache-maven-3.0.5
export HADOOP_HOME=/home/lethic/Documents/Softwares/hadoop-1.1.2
export HADOOP_CONF_DIR=/home/lethic/Documents/Softwares/hadoop-1.1.2/conf
export MAHOUT_HOME=/home/lethic/Documents/Softwares/mahout-distribution-0.6
export PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$MAHOUT_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export HADOOP_HOME_WARN_SUPPRESS=1
NOTE that every /home/lethic/Documents/Softwares/ should be changed to your own path.
TEST:
Java:
javac
Remember to add this to /etc/profile, or Hadoop will show a warning:
export HADOOP_HOME_WARN_SUPPRESS=1
Hadoop:
Mahout:
It says: MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath
I do not think this is an error, because if you look at the mahout script, it contains:
if [ "$MAHOUT_LOCAL" != "" ]; then
echo "MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath."
else
echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
CLASSPATH=${CLASSPATH}:$HADOOP_CONF_DIR
fi
This means that whenever MAHOUT_LOCAL is empty, it will echo “MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.”.
And notice that:
# MAHOUT_LOCAL set to anything other than an empty string to force
# mahout to run locally even if
# HADOOP_CONF_DIR and HADOOP_HOME are set
This means that if you want to run Mahout on Hadoop rather than locally, you should leave MAHOUT_LOCAL as an empty string.
So we can conclude that whenever we run Mahout on Hadoop, it will always echo “MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.”, which is not an error.
All of the above is only my opinion and it may be wrong, because I am still a beginner. But at least everything still works well and I have not met any problems since then.
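To see both branches for yourself, the test can be reproduced in isolation. This is only a sketch that mirrors the quoted fragment; describe_mode is a made-up function name and this is not the real mahout script.

```shell
#!/bin/sh
# describe_mode is a hypothetical wrapper around the same test as the
# fragment quoted above; it only prints the message, it changes nothing.
describe_mode() {
    if [ "$MAHOUT_LOCAL" != "" ]; then
        echo "MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath."
    else
        echo "MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath."
    fi
}

# Empty (or unset): the Hadoop configuration is added to the classpath.
MAHOUT_LOCAL=""
describe_mode

# Any non-empty value: Mahout is forced to run locally.
MAHOUT_LOCAL=true
describe_mode
```

The first call prints the "is not set" message and the second prints the "is set" message, so the message seen during the test is simply the Hadoop branch being taken.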
VIII k-means Algorithm Test
Test k-means:
Download the data:
http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
And copy it to $MAHOUT_HOME
Get the Hadoop started:
$HADOOP_HOME/bin/start-all.sh
Then import the data into ‘testdata’ (NOTE that the name ‘testdata’ should not be changed; according to what I read on the Internet, the example program only looks for a directory with this name):
$HADOOP_HOME/bin/hadoop fs -mkdir testdata
$HADOOP_HOME/bin/hadoop fs -put $MAHOUT_HOME/synthetic_control.data testdata
Run the k-means algorithm:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-examples-0.6-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
It will take a few minutes.
To see the results:
$HADOOP_HOME/bin/hadoop fs -lsr output
$HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
$ cd $MAHOUT_HOME/examples/output
$ ls
And if you see:
clusteredPoints clusters-0 clusters-1 clusters-10 clusters-2 clusters-3 clusters-4
clusters-5 clusters-6 clusters-7 clusters-8 clusters-9 data
Your Mahout is properly installed. 🙂
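Instead of comparing the listing by eye, the check can be scripted. The sketch below assumes the output was fetched to $MAHOUT_HOME/examples/output as above; has_kmeans_output is a made-up helper, and it only looks for a few of the entries listed, not all of them.

```shell
#!/bin/sh
# has_kmeans_output is a hypothetical helper: it checks a local copy of the
# job output for some of the entries shown in the listing above.
has_kmeans_output() {
    dir=$1
    for entry in clusteredPoints data clusters-0 clusters-10; do
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $entry"
            return 1
        fi
    done
    echo "all expected entries present in $dir"
}

# The path is an assumption: the directory fetched with 'hadoop fs -get'.
has_kmeans_output "${MAHOUT_HOME:-.}/examples/output" || true
```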