Custom InputFormat in MapReduce.

The function of an InputFormat is to define how data is read from a file into the Mapper class. Another important function of an InputFormat is to divide the input into splits that make up the inputs to the user-defined map tasks. Instances of the InputSplit interface encapsulate these splits.

The default input format in MapReduce is TextInputFormat. TextInputFormat divides files into splits strictly by byte offsets, then reads individual lines from each split as record inputs to the Mapper. The RecordReader associated with TextInputFormat must be robust enough to handle the fact that splits do not necessarily fall neatly on line-ending boundaries. In fact, the RecordReader reads past the theoretical end of a split to the end of a line to complete a record, and the reader for the next split in the file skips ahead to the first full line before it begins processing. All RecordReader implementations must use similar logic to ensure that they do not miss records that span InputSplit boundaries.
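That boundary rule can be sketched in plain Java, with no Hadoop dependencies; the file contents and split offsets below are made up purely for illustration:

```java
// Illustrative sketch of TextInputFormat's boundary rule: a reader for a
// split skips a partial first line (unless the split starts at offset 0)
// and reads past the split's end to finish its last line.
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {
    // Return the lines a reader for the byte range [start, end) would emit.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        // A split that does not begin at offset 0 skips to the next line
        // start; the previous split's reader owns the partial line.
        if (start != 0) {
            while (pos < data.length() && data.charAt(pos - 1) != '\n') pos++;
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);     // may cross the split end
            if (nl == -1) nl = data.length();
            lines.add(data.substring(pos, nl));
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\n";
        // A byte-offset split boundary at 8 lands in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));   // → [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // → [charlie]
    }
}
```

The first reader finishes "bravo" even though the split ends mid-line, and the second reader skips the fragment and starts at "charlie", so every line is read exactly once.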

If a requirement calls for a custom input format, we should subclass FileInputFormat and override its getRecordReader() method, which returns an instance of RecordReader. This works well when the InputFormat reads data from files. When the input comes from an HBase table instead, a special InputFormat called TableInputFormat is provided.
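As a minimal sketch, using the classic org.apache.hadoop.mapred API that getRecordReader() belongs to (MyInputFormat and MyRecordReader are hypothetical names, not Hadoop classes):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical custom format: subclass FileInputFormat and hand back a
// RecordReader that knows how to parse our records out of a split.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // MyRecordReader is a placeholder for your own RecordReader
        // implementation, which must handle records that span split
        // boundaries as described above.
        return new MyRecordReader(job, (FileSplit) split);
    }
}
```

Note that in the newer org.apache.hadoop.mapreduce API, the equivalent hook to override is createRecordReader() rather than getRecordReader().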

More articles by Anshuman Anshuman