Hadoop

    Related articles

    • Apache Spark

    Apache Hadoop (http://hadoop.apache.org) is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.

    Installation

    Install the hadoop package, which is available in the Arch User Repository (AUR).
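
    If you build the package manually rather than with an AUR helper, one typical sequence is the following; the repository URL simply follows the usual AUR naming scheme:

    $ git clone https://aur.archlinux.org/hadoop.git
    $ cd hadoop
    $ makepkg -si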

    Configuration

    By default, the hadoop package is already configured for pseudo-distributed operation. Some environment variables are set in /etc/profile.d/hadoop.sh with values that differ from a traditional Hadoop installation.

    ENV               Value               Description                             Permission
    HADOOP_CONF_DIR   /etc/hadoop         Where configuration files are stored.   Read
    HADOOP_LOG_DIR    /tmp/hadoop/log     Where log files are stored.             Read and Write
    HADOOP_SLAVES     /etc/hadoop/slaves  File naming remote slave hosts.         Read
    HADOOP_PID_DIR    /tmp/hadoop/run     Where pid files are stored.             Read and Write
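
    The exact contents of /etc/profile.d/hadoop.sh depend on the package version, but based on the table above it is expected to export roughly the following (a sketch, not the literal file):

    export HADOOP_CONF_DIR=/etc/hadoop
    export HADOOP_LOG_DIR=/tmp/hadoop/log
    export HADOOP_SLAVES=/etc/hadoop/slaves
    export HADOOP_PID_DIR=/tmp/hadoop/run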

    You should also set up the following files correctly; a single-node example follows the list.

    /etc/hosts
    /etc/hostname 
    /etc/locale.conf
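
    For example, on a single-node machine the three files could look like the following; the hostname hadoopnode and the locale are only placeholders:

    /etc/hostname
    hadoopnode

    /etc/hosts
    127.0.0.1   localhost
    ::1         localhost
    127.0.1.1   hadoopnode.localdomain hadoopnode

    /etc/locale.conf
    LANG=en_US.UTF-8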
    

    JAVA_HOME should be set automatically by jdk7-openjdk via /etc/profile.d/jre.sh. Check JAVA_HOME:

    $ echo $JAVA_HOME
    

    If it does not print anything, log out and log back in to see the change.
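
    If JAVA_HOME is still empty after logging back in, you can export it manually for the current session. The path below assumes the usual install location of jdk7-openjdk on Arch and may differ on your system:

    $ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk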

    Single Node Setup

    This article or section is out of date.

    Reason: This section of the article is based on documentation for an earlier version of Hadoop. (Discuss in Talk:Hadoop)
    Note: This section is based on the official Hadoop documentation.

    Standalone Operation

    By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

    The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

    $ export HADOOP_CONF_DIR=/usr/lib/hadoop/orig_conf
    $ mkdir input
    $ cp /etc/hadoop/*.xml input
    $ hadoop jar /usr/lib/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
    $ cat output/*
    

    For the current 2.5.x release of hadoop included in the AUR:

    $ export HADOOP_CONF_DIR=/usr/lib/hadoop/orig_etc/hadoop/
    $ mkdir input
    $ cp /etc/hadoop/*.xml input
    $ hadoop jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar grep input output 'dfs[a-z.]+'
    $ cat output/*
    

    Pseudo-Distributed Operation

    Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

    By default, Hadoop will run as the user root. You can change the user in /etc/conf.d/hadoop:

    HADOOP_USERNAME="<your user name>"
    

    Set up passphraseless ssh

    Make sure sshd is enabled and started, for example with systemctl enable sshd and systemctl start sshd. Now check that you can connect to localhost without a passphrase:

    $ ssh localhost
    

    If you cannot ssh to localhost without a passphrase, execute the following commands:

    $ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
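
    With StrictModes enabled (the sshd default), keys are ignored when ~/.ssh or the authorized_keys file is writable by anyone other than the owner; if the login still prompts for a password, tightening the permissions may help:

    $ chmod 700 ~/.ssh
    $ chmod 600 ~/.ssh/authorized_keys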
    

    Also make sure the following line is commented out in /etc/ssh/sshd_config, so that sshd falls back to its default AuthorizedKeysFile (which includes .ssh/authorized_keys):

    /etc/ssh/sshd_config
    #AuthorizedKeysFile .ssh/authorized_keys

    Execution

    Format a new distributed-filesystem:

    $ hadoop namenode -format
    

    Start the hadoop daemons:

    # systemctl start hadoop-datanode
    # systemctl start hadoop-jobtracker
    # systemctl start hadoop-namenode
    # systemctl start hadoop-secondarynamenode
    # systemctl start hadoop-tasktracker
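
    If the daemons should also come up after a reboot, the same units can be enabled, mirroring the sshd example above:

    # systemctl enable hadoop-datanode hadoop-jobtracker hadoop-namenode hadoop-secondarynamenode hadoop-tasktracker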
    


    The hadoop daemon log output is written to the ${HADOOP_LOG_DIR} directory, which /etc/profile.d/hadoop.sh sets to /tmp/hadoop/log (rather than the traditional default of /var/log/hadoop).

    Browse the web interface for the NameNode and the JobTracker; by default they are available at:

    NameNode - http://localhost:50070/
    JobTracker - http://localhost:50030/

    Copy the input files into the distributed filesystem:

    $ hadoop fs -put /etc/hadoop input
    

    Run some of the examples provided:

    $ hadoop jar /usr/lib/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
    

    Examine the output files:

    Copy the output files from the distributed filesystem to the local filesystem and examine them:

    $ hadoop fs -get output output
    $ cat output/*
    

    or

    View the output files on the distributed filesystem:

    $ hadoop fs -cat output/*
    

    When you're done, stop the daemons with:

    # systemctl stop hadoop-datanode
    # systemctl stop hadoop-jobtracker
    # systemctl stop hadoop-namenode
    # systemctl stop hadoop-secondarynamenode
    # systemctl stop hadoop-tasktracker