Word Count Program using R, Spark, Map-reduce, Pig, Hive, Python

Word Count is the classic introductory example for almost any data-processing framework: it is simple enough to write in a few lines, yet it exercises the full read-transform-aggregate pipeline, which makes it a convenient way to compare programming models.

Word Count program reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
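The map and reduce steps described above can be simulated in a few lines of plain Python. This is only an illustration of the model (the function names `map_phase` and `reduce_phase` are made up for this sketch, not part of any framework):

```python
from itertools import groupby

def map_phase(lines):
    # each mapper emits a (word, 1) pair for every word in its line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # the framework sorts the pairs by key before reducing;
    # each reducer then sums the counts for one word
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(map_phase(lines))))
```

The `sorted` call stands in for Hadoop's shuffle phase, which groups all pairs with the same key onto the same reducer.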

Of the options below, Spark is arguably the most convenient for Word Count: the whole job fits in three lines of code, and performance is typically good because intermediate data can be cached in memory.

Word Count using Spark:

val f = sc.textFile(inputPath)
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).saveAsTextFile(outputPath)

Word Count using Pig:

input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';

Word Count using Map-Reduce (Java):


public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}


public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}


public class WordCount {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Word Count using Hive:

drop table if exists doc;

create table doc(
text string
) row format delimited fields terminated by '\n' stored as textfile;

load data local inpath '/home/Words' overwrite into table doc;

SELECT word, COUNT(*) FROM doc
LATERAL VIEW explode(split(text, ' ')) words AS word
GROUP BY word;

Word Count using Map-Reduce (Python):

mapper.py:

#!/usr/bin/env python3

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we emit here becomes the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))


reducer.py:

#!/usr/bin/env python3

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
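The hand-rolled IF-switch in the reducer can also be expressed with `itertools.groupby`, which relies on the same sorted-input guarantee. Here is a sketch of an alternative reducer (the function name `reduce_sorted` is invented for this example; it is not the original code):

```python
#!/usr/bin/env python3
import sys
from itertools import groupby

def reduce_sorted(lines):
    # parse "word\tcount" lines that Hadoop has already sorted by key
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby collapses consecutive pairs sharing the same word,
    # which is exactly what the sorted shuffle output guarantees
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print('%s\t%s' % (word, total))
```

Note that `groupby` only merges *adjacent* equal keys, so this version, like the original, is only correct on sorted input.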

Word Count Program using R:


Create a script wc_mapper.R:

#!/usr/bin/env Rscript

library('stringi')
stdin <- file('stdin', open='r')

while (length(x <- readLines(con=stdin, n=1024L)) > 0) {
  x <- unlist(stri_extract_all_words(x))
  xt <- table(x)
  words <- names(xt)
  counts <- as.integer(xt)
  cat(stri_paste(words, counts, sep='\t'), sep='\n')
}
Create a source file wc_reducer.cpp:
#include <iostream>
#include <string>
#include <cstdlib>

using namespace std;

int main()
{
    string line;
    string last_word = "";
    int last_count = 0;

    while (getline(cin, line))
    {
        size_t found = line.find_first_of("\t");
        if (found != string::npos)
        {
            string key = line.substr(0, found);
            string value = line.substr(found + 1);
            int valuei = atoi(value.c_str());
            //cerr << "key=" << key << " value=" << value << endl;
            if (key != last_word)
            {
                if (last_word != "") cout << last_word << "\t" << last_count << endl;

                last_word = key;
                last_count = valuei;
            }
            else
                last_count += valuei;
        }
    }
    if (last_word != "") cout << last_word << "\t" << last_count << endl;

    return 0;
}
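Note that, unlike the Python mapper, the R mapper pre-aggregates counts within each 1024-line chunk, so the reducer receives partial counts rather than bare 1s. Any reducer that sums values per key handles both cases, because addition is associative. A minimal Python sketch of that merge step (the `merge_partial_counts` helper is invented for illustration, not part of the pipeline above):

```python
from collections import defaultdict

def merge_partial_counts(pairs):
    # sum partial (word, count) pairs, as the streaming reducer does;
    # works whether the mapper emitted bare 1s or chunk subtotals
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# partial counts as two mapper chunks might emit them
chunk1 = [("spark", 2), ("hive", 1)]
chunk2 = [("spark", 1), ("pig", 3)]
print(merge_partial_counts(chunk1 + chunk2))
```

This per-chunk pre-aggregation plays the same role as a combiner in classic MapReduce: it shrinks the data shuffled to the reducers.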
