Most Useful Hadoop Commands

This guide is not meant to be comprehensive, and it skips simple, very frequently used commands such as hadoop fs -ls that closely mirror their Linux shell counterparts. If you need a comprehensive guide, you can always check the Hadoop commands manual. Instead, I am listing some more “special” commands that I use frequently to manage a Hadoop cluster (HDFS, MapReduce, ZooKeeper, etc.) over the long term and to detect and fix problems. This post is a work in progress: I intend to update it with new commands based on how often I use them on my clusters. Feel free to add a comment with commands you find useful for administrative tasks.

HDFS

  1. Set the replication factor recursively for all existing files (here to 1; -w waits until replication completes, which can take a long time)
    hadoop fs -setrep -R -w 1 /
  2. Create a report to check HDFS health
    hadoop dfsadmin -report
  3. Check HDFS file system
    hadoop fsck /
  4. Run the cluster balancer to make sure blocks are distributed evenly across the datanodes
    sudo -u hdfs hdfs balancer
  5. Refresh the node lists after adding or removing a datanode (the first command is for the namenode, the second for the jobtracker)
    hadoop dfsadmin -refreshNodes
    sudo -u mapred hadoop mradmin -refreshNodes
  6. Leave safe mode when the Hadoop master (namenode) has entered it (often because there is not enough disk space to support your desired replication factor)
    hadoop dfsadmin -safemode leave
  7. Display the datanodes that store a particular file with name “filename”.
    hadoop fsck /file-path/filename -files -locations -blocks

    Sample output:

     FSCK started by root (auth:SIMPLE) from /10.2.31.174 for path /spark-1.2.1-bin-2.3.0-mr1-cdh5.1.2.tgz at Wed May 27 17:31:15 PDT 2015
     /spark-1.2.1-bin-2.3.0-mr1-cdh5.1.2.tgz 186192499 bytes, 2 block(s): OK
     0. BP-323016323-10.2.31.174-1424856295335:blk_1073799439_60141 len=134217728 repl=3 [ip1:50010, ip2:50010, ip3:50010]
     1. BP-323016323-10.2.31.174-1424856295335:blk_1073799440_60142 len=51974771 repl=3 [ip1:50010, ip2:50010, ip3:50010]

     Status: HEALTHY
     Total size: 186192499 B
     Total dirs: 0
     Total files: 1
     Total symlinks: 0
     Total blocks (validated): 2 (avg. block size 93096249 B)
     Minimally replicated blocks: 2 (100.0 %)
     Over-replicated blocks: 0 (0.0 %)
     Under-replicated blocks: 0 (0.0 %)
     Mis-replicated blocks: 0 (0.0 %)
     Default replication factor: 3
     Average block replication: 3.0
     Corrupt blocks: 0
     Missing replicas: 0 (0.0 %)
     Number of data-nodes: 6
     Number of racks: 1
     FSCK ended at Wed May 27 17:31:15 PDT 2015 in 1 milliseconds
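
When a file has many blocks, I find it handy to boil the fsck output down to the distinct datanodes that hold it. The following is only a minimal sketch: the function name list_block_datanodes is my own, and it assumes the block lines look like the sample above (a bracketed, comma-separated list of host:port entries). Other Hadoop versions may print the locations differently, so treat the parsing as an assumption to adjust.

```shell
# list_block_datanodes (hypothetical helper): reads the output of
#   hadoop fsck <path> -files -locations -blocks
# on stdin and prints each distinct datanode address once.
# Assumes block lines end with a list such as [ip1:50010, ip2:50010].
list_block_datanodes() {
  grep -E 'blk_[0-9]+' |     # keep only the per-block lines
    grep -oE '\[[^]]*\]' |   # extract the bracketed location list
    tr -d '[] ' |            # drop brackets and spaces
    tr ',' '\n' |            # one address per line
    sort -u                  # de-duplicate
}

# Example usage (path is a placeholder):
#   hadoop fsck /file-path/filename -files -locations -blocks | list_block_datanodes
```

For the sample output above this would print ip1:50010, ip2:50010 and ip3:50010, one per line.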

MapReduce

  1. List active map-reduce jobs
    hadoop job -list
  2. Kill a job (use the job ID shown by hadoop job -list)
    hadoop job -kill job-id
  3. Get the state of a jobtracker (active or standby)
    sudo -u mapred hadoop mrhaadmin -getServiceState jt1

    – where jt1 is the ID of one of the jobtrackers as configured in your mapred-site.xml file.
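
With more than one jobtracker configured, the state check can be wrapped in a small loop. This is a rough sketch rather than anything official: print_jt_states is a name I made up, and jt2 in the usage comment is a hypothetical second jobtracker ID. It assumes the command prints the state (for example active or standby) on standard output and exits non-zero on failure.

```shell
# print_jt_states (hypothetical helper): prints "<id> <state>" for every
# jobtracker ID passed as an argument. If the state query fails, the
# state is reported as "unknown".
print_jt_states() {
  for jt in "$@"; do
    state=$(sudo -u mapred hadoop mrhaadmin -getServiceState "$jt" 2>/dev/null) || state=unknown
    printf '%s %s\n' "$jt" "$state"
  done
}

# Example usage (jt1 and jt2 stand for the IDs in your mapred-site.xml):
#   print_jt_states jt1 jt2
```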

ZooKeeper

  1. Initialize the High Availability state in ZooKeeper
    hdfs zkfc -formatZK
  2. Check the mode (for example leader or follower) of a ZooKeeper server; run it on each server:
    echo srvr | nc localhost 2181 | grep Mode
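
To check the whole ensemble from one machine instead of logging in to each server, the same srvr query can be looped over a list of hosts. Again, just a sketch under assumptions: zk_modes is my own function name, the host names in the usage comment are placeholders, and every server is assumed to listen on the client port 2181 used above.

```shell
# zk_modes (hypothetical helper): prints "<host> <mode>" for each ZooKeeper
# host passed as an argument, using the srvr command on port 2181.
# Hosts that do not answer with a "Mode:" line are reported as "unreachable".
zk_modes() {
  for host in "$@"; do
    mode=$(echo srvr | nc "$host" 2181 2>/dev/null | awk -F': ' '/^Mode/ {print $2}')
    printf '%s %s\n' "$host" "${mode:-unreachable}"
  done
}

# Example usage (host names are placeholders):
#   zk_modes zk1.example.com zk2.example.com zk3.example.com
```

A healthy ensemble should show exactly one leader, with the remaining servers reporting follower.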