Custom InputFormat in MapReduce.

The function of an InputFormat is to define how data is read from a file into the Mapper class. Another important function of an InputFormat is to divide the input into splits that make up the inputs to the user-defined map tasks. Instances of the InputSplit interface encapsulate these splits.

The default input format in MapReduce is TextInputFormat. TextInputFormat divides files into splits strictly by byte offsets, then reads individual lines from each split as record inputs to the Mapper. The RecordReader associated with TextInputFormat must be robust enough to handle the fact that splits do not necessarily fall neatly on line-ending boundaries. In fact, the RecordReader reads past the theoretical end of a split to the end of a line to complete a record, and the reader for the next split in the file skips ahead to the first full line before it begins processing. All RecordReader implementations must use similar logic to ensure that they do not miss records that span InputSplit boundaries.
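That boundary rule can be sketched in plain Java, with no Hadoop dependencies; the file contents and split offsets below are made up purely for illustration:

```java
// Illustrative sketch of TextInputFormat's boundary rule: a reader for a
// split skips a partial first line (unless the split starts at offset 0)
// and reads past the split's end to finish its last line.
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {
    // Return the lines a reader for the byte range [start, end) would emit.
    static List<String> readSplit(String data, int start, int end) {
        int pos = start;
        // A split that does not begin at offset 0 skips to the next line
        // start; the previous split's reader owns the partial line.
        if (start != 0) {
            while (pos < data.length() && data.charAt(pos - 1) != '\n') pos++;
        }
        List<String> lines = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);     // may cross the split end
            if (nl == -1) nl = data.length();
            lines.add(data.substring(pos, nl));
            pos = nl + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\n";
        // A byte-offset split boundary at 8 lands in the middle of "bravo".
        System.out.println(readSplit(data, 0, 8));   // → [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // → [charlie]
    }
}
```

The first reader finishes "bravo" even though the split ends mid-line, and the second reader skips the fragment and starts at "charlie", so every line is read exactly once.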

If a requirement calls for a custom input format, we should subclass FileInputFormat and override its getRecordReader() method, which returns an instance of RecordReader. This works well when the InputFormat reads data from files. When the input comes from an HBase table instead, a special InputFormat called TableInputFormat is provided.
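As a minimal sketch, using the classic org.apache.hadoop.mapred API that getRecordReader() belongs to (MyInputFormat and MyRecordReader are hypothetical names, not Hadoop classes):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical custom format: subclass FileInputFormat and hand back a
// RecordReader that knows how to parse our records out of a split.
public class MyInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        // MyRecordReader is a placeholder for your own RecordReader
        // implementation, which must handle records that span split
        // boundaries as described above.
        return new MyRecordReader(job, (FileSplit) split);
    }
}
```

Note that in the newer org.apache.hadoop.mapreduce API, the equivalent hook to override is createRecordReader() rather than getRecordReader().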

More articles by Anshuman Anshuman