HBase InputFormat/OutputFormat for Hadoop Streaming

What is this?

InputFormat and OutputFormat classes that let Hadoop Streaming jobs use HBase tables as MapReduce input and output.

Usage

debian:~% hadoop dfs -mkdir dummy_input
debian:~% hadoop jar hadoop-streaming.jar \
              -input dummy_input \
              -output output \
              -mapper /bin/cat \
              -inputformat org.childtv.hadoop.hbase.mapred.JSONTableInputFormat \
              -jobconf map.input.table=scores \
              -jobconf map.input.columns=course:
debian:~% hadoop dfs -cat output/*
Dan     {"course:math":"87","course:art":"97"}
Dana    {"course:math":"100","course:art":"80"}
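
Replacing /bin/cat with a script lets the mapper work on the decoded columns. The following is a minimal sketch in Python, assuming the row key and the JSON document are separated by a tab (the usual Hadoop Streaming convention). The names parse_row and map_lines are illustrative and not part of this project.

```python
import json


def parse_row(line):
    """Split one streaming input line into (row_key, columns).

    Assumes "<row key><TAB><JSON object>", as in the sample output above.
    """
    row_key, payload = line.rstrip("\n").split("\t", 1)
    return row_key, json.loads(payload)


def map_lines(lines):
    """Yield "<row key><TAB><sum of all column values>" for each input line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row_key, columns = parse_row(line)
        total = sum(int(value) for value in columns.values())
        yield "%s\t%d" % (row_key, total)
```

A mapper script would typically end with a loop such as "for out in map_lines(sys.stdin): print(out)" and be passed to Hadoop Streaming with -mapper.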

Setup

  1. Set up Hadoop 0.17.2 and HBase 0.2.1
  2. Download hadoop-hbase-streaming.jar from the repository
  3. Edit $HADOOP_HOME/conf/hadoop-env.sh and add the downloaded jar to HADOOP_CLASSPATH
  4. Run Hadoop Streaming with the selected format and jobconf options

Supported Options

-jobconf map.input.table=
Input table name for the Map step
-jobconf map.input.columns=
Column names to scan. Separate multiple columns with whitespace
-jobconf map.input.binary=
Optional. If true, input column names and cell values are Base64 encoded
-jobconf map.input.timestamp=
Optional. If true, timestamps are included in the input
-jobconf reduce.output.table=
Output table name
-jobconf reduce.output.binary=
Set to true when the column names and cell values are Base64 encoded

InputFormats

org.childtv.hadoop.hbase.mapred.JSONTableInputFormat
Dan     {"course:math":"87","course:art":"97"}

-inputformat=json -jobconf map.input.timestamp=true

Dan     {"course:math":{"value":"87","timestamp":"1226501804191"},"course:art":{"value":"97","timestamp":"1226501810087"}}
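
With timestamps enabled, each column maps to an object rather than a plain string. A minimal Python sketch of consuming that shape, assuming a tab between row key and JSON; the function name latest_column is illustrative only:

```python
import json


def latest_column(line):
    """Return (row_key, column_name, value) for the most recently written cell.

    Assumes each column maps to {"value": ..., "timestamp": ...} as in the
    sample above, with the timestamp given as a string of digits.
    """
    row_key, payload = line.rstrip("\n").split("\t", 1)
    columns = json.loads(payload)
    # Compare timestamps numerically, not as strings.
    name = max(columns, key=lambda col: int(columns[col]["timestamp"]))
    return row_key, name, columns[name]["value"]
```

For the sample row above, this picks course:art, whose timestamp is the larger of the two.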

org.childtv.hadoop.hbase.mapred.XMLTableInputFormat

Uses the same format as the REST API response for GET /[table_name]/row/[row_key]/

Dan     <?xml version="1.0" encoding="UTF-8"?><row><column><name>course:art</name><value>97</value></column><column><name>course:math</name><value>87</value></column></row>

The output also matches the REST API when the option -jobconf map.input.binary=true is added: column names and values are Base64 encoded.

Dan     <?xml version="1.0" encoding="UTF-8"?><row><column><name>Y291cnNlOmFydA==</name><value>OTc=</value></column><column><name>Y291cnNlOm1hdGg=</name><value>ODc=</value></column></row>
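
A mapper has to undo the Base64 encoding itself. The sketch below uses only the Python standard library and assumes a tab between row key and XML, and that the decoded bytes are UTF-8 text (truly binary cells would need different handling). The name decode_row is illustrative, not part of this project.

```python
import base64
import xml.etree.ElementTree as ET


def decode_row(line, binary=True):
    """Parse one XML input line into (row_key, {column_name: value}).

    With binary=True, column names and values are Base64 decoded and
    interpreted as UTF-8 text (an assumption of this sketch).
    """
    row_key, payload = line.rstrip("\n").split("\t", 1)
    # Parse from bytes so the encoding declaration in the XML is honoured.
    root = ET.fromstring(payload.encode("utf-8"))
    columns = {}
    for column in root.findall("column"):
        name = column.findtext("name")
        value = column.findtext("value")
        if binary:
            name = base64.b64decode(name).decode("utf-8")
            value = base64.b64decode(value).decode("utf-8")
        columns[name] = value
    return row_key, columns
```

Applied to the Base64 sample above, this yields the same column names and values as the plain XML sample (course:art is 97, course:math is 87).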

org.childtv.hadoop.hbase.mapred.ListTableInputFormat

Emits only the cell values, separated by whitespace.

Dan     97 87

You can change the separator.
-inputformat=list -jobconf map.input.value.separator=,

Dan     97,87
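
Because this format carries no column names, the consumer only has to split the value part. A minimal Python sketch, assuming a tab between row key and values; parse_list_row is an illustrative name, and its separator argument should match whatever was passed as map.input.value.separator:

```python
def parse_list_row(line, separator=None):
    """Split one list-format line into (row_key, [values]).

    separator=None splits on any run of whitespace (the default format);
    pass "," or another string when a custom separator is configured.
    """
    row_key, payload = line.rstrip("\n").split("\t", 1)
    return row_key, payload.split(separator)
```

Note that a value containing the separator character would be split incorrectly, so a separator that cannot occur in the data is the safer choice.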

Custom format

Implement a subclass of org.childtv.hadoop.hbase.mapred.TextTableInputFormat

OutputFormats

See the comments in the source code.