Hadoop

Related articles

Apache Spark

Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

Installation

Install the hadoop package, which is available in the Arch User Repository (AUR).
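If you do not use an AUR helper, one common way to build and install an AUR package by hand is to clone its repository and run makepkg (a generic sketch of the usual AUR workflow):

$ git clone https://aur.archlinux.org/hadoop.git
$ cd hadoop
$ makepkg -si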

Configuration

By default, hadoop is already configured for pseudo-distributed operation. Some environment variables are set in /etc/profile.d/hadoop.sh with values that differ from a traditional Hadoop installation.

Variable          Value                Description                             Permission
HADOOP_CONF_DIR   /etc/hadoop          Where configuration files are stored.   Read
HADOOP_LOG_DIR    /tmp/hadoop/log      Where log files are stored.             Read and Write
HADOOP_SLAVES     /etc/hadoop/slaves   File naming remote slave hosts.         Read
HADOOP_PID_DIR    /tmp/hadoop/run      Where PID files are stored.             Read and Write
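To confirm that these variables are visible in your current shell, you can source the profile script and print them (an optional sanity check):

$ source /etc/profile.d/hadoop.sh
$ echo "$HADOOP_CONF_DIR $HADOOP_LOG_DIR $HADOOP_PID_DIR"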

You should also set up the following files correctly:

/etc/hosts
/etc/hostname 
/etc/locale.conf
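For a single-node setup it is usually enough that the hostname resolves locally and a locale is set. A minimal sketch, assuming a hypothetical hostname archhadoop and the en_US.UTF-8 locale:

/etc/hostname
archhadoop

/etc/hosts
127.0.0.1   localhost archhadoop
::1         localhost archhadoop

/etc/locale.conf
LANG=en_US.UTF-8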

You need to set JAVA_HOME in /etc/hadoop/hadoop-env.sh, because Hadoop does not detect the location of the Java installation on Arch Linux by itself:

/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/
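The exact JVM directory depends on which JDK package is installed; you can check what is available under /usr/lib/jvm and that the path you pick works before editing the file (the path below matches the example above):

$ ls /usr/lib/jvm/
$ /usr/lib/jvm/java-7-openjdk/bin/java -version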

Single Node Setup

Note: This section is based on the Hadoop Official Documentation

Standalone Operation

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

$ export HADOOP_CONF_DIR=/usr/lib/hadoop/orig_etc/hadoop/
$ mkdir input
$ cp /etc/hadoop/*.xml input
$ hadoop jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*

Pseudo-Distributed Operation

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

By default, Hadoop will run as the user root. You can change the user in /etc/conf.d/hadoop:

HADOOP_USERNAME="<your user name>"

Set up passphraseless ssh

Make sure sshd is enabled and started, e.g. with systemctl enable sshd and systemctl start sshd. Now check that you can connect to localhost without a passphrase:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
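If key authentication still fails, overly permissive modes on ~/.ssh are a common cause, since sshd's default StrictModes setting rejects them; tightening the permissions does no harm:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys2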

Also make sure this line is commented out in /etc/ssh/sshd_config:

/etc/ssh/sshd_config
#AuthorizedKeysFile .ssh/authorized_keys
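If you had to change /etc/ssh/sshd_config, restart the daemon so the new configuration is loaded:

# systemctl restart sshd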

Execution

Format a new distributed filesystem:

$ hadoop namenode -format

Start the hadoop daemons:

# systemctl start hadoop-datanode
# systemctl start hadoop-jobtracker
# systemctl start hadoop-namenode
# systemctl start hadoop-secondarynamenode
# systemctl start hadoop-tasktracker
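To verify that the daemons came up, query systemd for their status or list the running Java processes with jps, which ships with the JDK:

# systemctl status hadoop-namenode hadoop-datanode
$ jps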


The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory (defaults to /var/log/hadoop).
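When a daemon fails to start, its log file in that directory is the first place to look; the exact file names vary with the daemon and the user it runs as, for example:

$ ls /var/log/hadoop
$ tail -n 50 /var/log/hadoop/*.log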

Browse the web interface for the NameNode and the JobTracker; by default they are available at:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Copy the input files into the distributed filesystem:

$ hadoop fs -put /etc/hadoop input
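You can list the uploaded files to confirm the copy succeeded (optional check):

$ hadoop fs -ls input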

Run some of the examples provided:

$ hadoop jar /usr/lib/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local filesystem and examine them:

$ hadoop fs -get output output
$ cat output/*

or

View the output files on the distributed filesystem:

$ hadoop fs -cat output/*
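If you want to re-run the example, remove the output directory on the distributed filesystem first, since MapReduce refuses to write into an existing output directory:

$ hadoop fs -rm -r output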

When you're done, stop the daemons with:

# systemctl stop hadoop-datanode
# systemctl stop hadoop-jobtracker
# systemctl stop hadoop-namenode
# systemctl stop hadoop-secondarynamenode
# systemctl stop hadoop-tasktracker