Word Count Program using R, Spark, Map-reduce, Pig, Hive, Python

Word Count is the classic introductory example for almost any data-processing framework: it is simple enough to write in a few lines, yet it exercises the full read-transform-aggregate pipeline, which makes it a convenient way to compare programming models.

Word Count program reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
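The map and reduce steps described above can be simulated in a few lines of plain Python. This is only an illustration of the model (the function names `map_phase` and `reduce_phase` are made up for this sketch, not part of any framework):

```python
from itertools import groupby

def map_phase(lines):
    # each mapper emits a (word, 1) pair for every word in its line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # the framework sorts the pairs by key before reducing;
    # each reducer then sums the counts for one word
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(map_phase(lines))))
```

The `sorted` call stands in for Hadoop's shuffle phase, which groups all pairs with the same key onto the same reducer.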

Of the options below, Spark is arguably the most convenient for Word Count: the whole job fits in three lines of code, and performance is typically good because intermediate data can be cached in memory.

Word Count using Spark:

val f = sc.textFile(inputPath)
val w = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
w.reduceByKey(_ + _).saveAsTextFile(outputPath)

Word Count using Pig:

input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';

Word Count using Map-Reduce (Java):


public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}


public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}


public class WordCount {

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Word Count using Hive:

drop table if exists doc;

create table doc(
text string
) row format delimited fields terminated by '\n' stored as textfile;

load data local inpath '/home/Words' overwrite into table doc;

SELECT word, COUNT(*) FROM doc
LATERAL VIEW explode(split(text, ' ')) words AS word
GROUP BY word;

Word Count using Map-Reduce (Python):

mapper.py:

#!/usr/bin/env python3

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we emit here becomes the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))


reducer.py:

#!/usr/bin/env python3

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
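The hand-rolled IF-switch in the reducer can also be expressed with `itertools.groupby`, which relies on the same sorted-input guarantee. Here is a sketch of an alternative reducer (the function name `reduce_sorted` is invented for this example; it is not the original code):

```python
#!/usr/bin/env python3
import sys
from itertools import groupby

def reduce_sorted(lines):
    # parse "word\tcount" lines that Hadoop has already sorted by key
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby collapses consecutive pairs sharing the same word,
    # which is exactly what the sorted shuffle output guarantees
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print('%s\t%s' % (word, total))
```

Note that `groupby` only merges *adjacent* equal keys, so this version, like the original, is only correct on sorted input.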

Word Count Program using R:


Create a script wc_mapper.R:

#!/usr/bin/env Rscript

library('stringi')
stdin <- file('stdin', open='r')

while (length(x <- readLines(con=stdin, n=1024L)) > 0) {
  x <- unlist(stri_extract_all_words(x))
  xt <- table(x)
  words <- names(xt)
  counts <- as.integer(xt)
  cat(stri_paste(words, counts, sep='\t'), sep='\n')
}
Create a source file wc_reducer.cpp:
#include <iostream>
#include <string>
#include <cstdlib>

using namespace std;

int main()
{
    string line;
    string last_word = "";
    int last_count = 0;

    while (getline(cin, line))
    {
        size_t found = line.find_first_of("\t");
        if (found != string::npos)
        {
            string key = line.substr(0, found);
            string value = line.substr(found + 1);
            int valuei = atoi(value.c_str());
            //cerr << "key=" << key << " value=" << value << endl;
            if (key != last_word)
            {
                if (last_word != "") cout << last_word << "\t" << last_count << endl;

                last_word = key;
                last_count = valuei;
            }
            else
                last_count += valuei;
        }
    }
    if (last_word != "") cout << last_word << "\t" << last_count << endl;

    return 0;
}
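Note that, unlike the Python mapper, the R mapper pre-aggregates counts within each 1024-line chunk, so the reducer receives partial counts rather than bare 1s. Any reducer that sums values per key handles both cases, because addition is associative. A minimal Python sketch of that merge step (the `merge_partial_counts` helper is invented for illustration, not part of the pipeline above):

```python
from collections import defaultdict

def merge_partial_counts(pairs):
    # sum partial (word, count) pairs, as the streaming reducer does;
    # works whether the mapper emitted bare 1s or chunk subtotals
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# partial counts as two mapper chunks might emit them
chunk1 = [("spark", 2), ("hive", 1)]
chunk2 = [("spark", 1), ("pig", 3)]
print(merge_partial_counts(chunk1 + chunk2))
```

This per-chunk pre-aggregation plays the same role as a combiner in classic MapReduce: it shrinks the data shuffled to the reducers.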
