HBase InputFormat/OutputFormat for Hadoop Streaming
What is this?
InputFormat and OutputFormat classes that let Hadoop Streaming jobs use HBase tables as the input and output of MapReduce.
Usage
    debian:~% hadoop dfs -mkdir dammy_input
    debian:~% hadoop jar hadoop-streaming.jar \
        -input dammy_input \
        -output output \
        -mapper /bin/cat \
        -inputformat org.childtv.hadoop.hbase.mapred.JSONTableInputFormat \
        -jobconf map.input.table=scores \
        -jobconf map.input.columns=course:
    debian:~% hadoop dfs -cat output/*
    Dan {"course:math":"87","course:art":"97"}
    Dana {"course:math":"100","course:art":"80"}
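The /bin/cat mapper in the example simply echoes each row. As a minimal sketch of a mapper that does some work, the following Python script parses each input line into a row key and a JSON object and emits the sum of the cell values. It assumes that the row key and the JSON document are separated by whitespace (normally a tab in Hadoop Streaming), that row keys contain no whitespace, and that every cell value is an integer string as in the scores table. The function names are hypothetical and are not part of this library.

```python
#!/usr/bin/env python3
"""Hypothetical streaming mapper: sums the cell values of each HBase row."""
import json
import sys


def parse_row(line):
    """Split one input line into (row_key, cells).

    Assumes the row key contains no whitespace and is followed by a JSON object.
    """
    row_key, payload = line.rstrip("\n").split(None, 1)
    return row_key, json.loads(payload)


def total_score(cells):
    """Sum the cell values, assuming every value is an integer string."""
    return sum(int(value) for value in cells.values())


def main(stream=sys.stdin):
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        row_key, cells = parse_row(line)
        # Emit key<TAB>value, the default Hadoop Streaming record layout.
        print(f"{row_key}\t{total_score(cells)}")


if __name__ == "__main__":
    main()
```

Under these assumptions, the row for Dan (87 and 97) would be emitted with the total 184.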
Settings
Supported Options
- -jobconf map.input.table=
  - Input table name for the Map step.
- -jobconf map.input.columns=
  - Column names to scan. Separate multiple columns with whitespace.
- -jobconf map.input.binary=
  - Optional. If true, input column names and cell values are Base64 encoded.
- -jobconf map.input.timestamp=
  - Optional. If true, timestamps are added to the input.
- -jobconf reduce.output.table=
  - Output table name.
- -jobconf reduce.output.binary=
  - Set to true when column names and cell values are Base64 encoded.
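To show how the input options fit together, here is a small Python sketch that assembles the argument list for a streaming job that reads from a table. The helper name build_streaming_args and its parameters are hypothetical; it uses only the flags and -jobconf keys listed in this document, and the jar name and directory names are copied from the Usage example.

```python
def build_streaming_args(table, columns, input_format, mapper="/bin/cat",
                         binary=False, timestamp=False,
                         input_dir="dammy_input", output_dir="output"):
    """Return the argv list for a streaming job reading from an HBase table."""
    args = [
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", input_dir,
        "-output", output_dir,
        "-mapper", mapper,
        "-inputformat", input_format,
        "-jobconf", f"map.input.table={table}",
        # Multiple columns are joined with whitespace, as the option list says.
        "-jobconf", "map.input.columns=" + " ".join(columns),
    ]
    if binary:
        args += ["-jobconf", "map.input.binary=true"]
    if timestamp:
        args += ["-jobconf", "map.input.timestamp=true"]
    return args
```

Passing the resulting list to subprocess.run(args, check=True) avoids shell quoting problems when map.input.columns contains several whitespace-separated names.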
InputFormats
org.childtv.hadoop.hbase.mapred.JSONTableInputFormat
Dan {"course:math":"87","course:art":"97"}
When timestamps are enabled, each cell becomes an object with value and timestamp fields:
-inputformat=json -jobconf map.input.timestamp=true
Dan {"course:math":{"value":"87","timestamp":"1226501804191"},"course:art":{"value":"97","timestamp":"1226501810087"}}
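A mapper that has to work with or without map.input.timestamp can normalize both JSON shapes into one structure. The sketch below is one way to do that in Python; the function name parse_cells is hypothetical, and it assumes the payload has exactly the two shapes shown in the examples for this format.

```python
import json


def parse_cells(payload):
    """Return {column: (value, timestamp_or_None)} from a JSON row payload."""
    cells = {}
    for column, cell in json.loads(payload).items():
        if isinstance(cell, dict):
            # Timestamped form: {"value": "...", "timestamp": "..."}
            cells[column] = (cell["value"], int(cell["timestamp"]))
        else:
            # Plain form: the cell is just the value string.
            cells[column] = (cell, None)
    return cells
```

The timestamp is converted to an integer here for easier comparison; the value is left as a string because the table may hold non-numeric data.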
org.childtv.hadoop.hbase.mapred.XMLTableInputFormat
Each row uses the same format as the REST API response for GET /[table_name]/row/[row_key]/
Dan <?xml version="1.0" encoding="UTF-8"?><row><column><name>course:art</name><value>97</value></column><column><name>course:math</name><value>87</value></column></row>
With the option -jobconf map.input.binary=true, column names and values are Base64 encoded, which also matches the REST API output.
Dan <?xml version="1.0" encoding="UTF-8"?><row><column><name>Y291cnNlOmFydA==</name><value>OTc=</value></column><column><name>Y291cnNlOm1hdGg=</name><value>ODc=</value></column></row>
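As an illustration of consuming this format, the following Python sketch parses the XML payload of one row into a dictionary and optionally decodes the Base64 form. The function name parse_xml_row is hypothetical. Decoding the bytes as UTF-8 is an assumption that suits the text values in the example; truly binary cells would need to stay as bytes.

```python
import base64
import xml.etree.ElementTree as ET


def parse_xml_row(payload, binary=False):
    """Return {column_name: value} from the XML payload of one row.

    Set binary=True when the job ran with map.input.binary=true.
    """
    # Parse from bytes so the XML encoding declaration is honored.
    root = ET.fromstring(payload.encode("utf-8"))
    cells = {}
    for column in root.findall("column"):
        name = column.findtext("name")
        value = column.findtext("value")
        if binary:
            # Assumption: the decoded bytes are UTF-8 text.
            name = base64.b64decode(name).decode("utf-8")
            value = base64.b64decode(value).decode("utf-8")
        cells[name] = value
    return cells
```

Because the XML itself contains spaces, the row key has to be separated from the payload by splitting the line only once, for example with line.split(None, 1).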
org.childtv.hadoop.hbase.mapred.ListTableInputFormat
Outputs only the cell values, separated by whitespace.
Dan 97 87
You can change the separator with map.input.value.separator.
-inputformat=list -jobconf map.input.value.separator=,
Dan 97,87
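A reader for the list format only needs to split the line twice: once to separate the row key, and once to separate the values. The Python sketch below assumes the row key contains no whitespace and is followed by whitespace; the function name parse_list_row is hypothetical.

```python
def parse_list_row(line, separator=None):
    """Split a list-format line into (row_key, [values]).

    separator=None splits the values on whitespace (the default format);
    pass the value of map.input.value.separator if the job set one.
    """
    row_key, payload = line.rstrip("\n").split(None, 1)
    return row_key, payload.split(separator)
```

Since this format carries no column names, the consumer has to know which column each position refers to.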
Custom format
To define your own format, implement a subclass of org.childtv.hadoop.hbase.mapred.TextTableInputFormat
OutputFormats
See the comments in the source code.