How to write a MapReduce job in R?

I am new to R. I know how to write a MapReduce job in Java, and I want to try the same thing in R. Can anyone provide some sample code, and is there a fixed format for MapReduce in R?

Please suggest links other than this one: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial

Any code examples would be very helpful.

1 answer

If you want to implement MapReduce (using Hadoop) in a language other than Java, you use a feature called Hadoop streaming. The data is fed to the mapper via STDIN (readLines()) and written back to Hadoop via STDOUT (cat()). Hadoop sorts the mapper output by key, passes it to the reducer via STDIN (readLines()) again, and the reducer emits its results via STDOUT (cat()).
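To make the STDIN/STDOUT contract concrete, here is a minimal sketch of a word-count style mapper that can be tried without Hadoop. The function name run_mapper and the sample input are hypothetical, and a textConnection stands in for file("stdin", "r"), which a real streaming job would use.

```r
# Hypothetical mapper logic wrapped in a function so it can run locally.
# In a real streaming job `con` would be file("stdin", "r").
run_mapper <- function(con) {
  while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    for (word in strsplit(line, " +")[[1]]) {
      # emit one "key <TAB> value" record per word
      if (nchar(word) > 0) cat(word, "\t", 1, "\n", sep = "")
    }
  }
}

# Made-up input lines, read through a text connection instead of STDIN
input <- textConnection(c("a b a", "b c"))
mapper_output <- capture.output(run_mapper(input))
close(input)

print(mapper_output)
```

Each element of mapper_output is one tab-separated record, which is the shape Hadoop streaming expects to receive from a mapper on STDOUT.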

The following code is taken from an article I wrote about writing a MapReduce job using R for Hadoop. The code counts 2-grams, but I would say it is simple enough to see what is going on with MapReduce.

# map.R

library(stringdist, quietly=TRUE)

input <- file("stdin", "r")

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   # skip empty lines instead of aborting the whole input
   # more sophisticated defensive code makes sense here
   if(nchar(line) == 0) next

   fields <- unlist(strsplit(line, "\t"))

   # extract 2-grams
   d <- qgrams(tolower(fields[4]), q=2)

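   # seq_len() avoids iterating when no 2-grams were found
   for(i in seq_len(ncol(d))) {
     # language / 2-gram / count
     # sep="" keeps stray spaces out of the tab-separated fields
     cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n", sep="")
   }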
}

close(input)
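
For readers without the stringdist package, the following base R sketch approximates what the mapper's qgrams(..., q=2) call computes for a single string. The helper name count_2grams and the sample word are hypothetical, and the ordering of the result may differ from what stringdist returns.

```r
# Hypothetical helper: count the 2-grams of one string using base R only.
count_2grams <- function(s) {
  s <- tolower(s)
  n <- nchar(s)
  # strings shorter than 2 characters contain no 2-grams
  if (n < 2) return(table(character(0)))
  # slide a window of width 2 over the string
  grams <- substring(s, 1:(n - 1), 2:n)
  table(grams)
}

counts <- count_2grams("Banana")
print(counts)
```

The mapper then writes one line per 2-gram, with the language as the first field and the count as the last.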

-

# reduce.R

input <- file("stdin", "r")

# initialize variables that keep
# track of the state

is_first_line <- TRUE

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   line <- unlist(strsplit(line, "\t"))
   # current line belongs to the same
   # key pair as the previous line
   if(!is_first_line &&
      prev_lang == line[1] &&
      prev_2gram == line[2]) {
        count <- count + as.integer(line[3])
   }
   # current line either starts a
   # new key pair or is the first line
   else {
     # new key pair - so output the result
     # of the previous key pair
     if(!is_first_line) {
       # language / 2-gram / count
       cat(prev_lang, "\t", prev_2gram, "\t", count, "\n", sep="")
     }
     # initialize state trackers
     # (named count so base::sum is not shadowed)
     prev_lang <- line[1]
     prev_2gram <- line[2]
     count <- as.integer(line[3])
     is_first_line <- FALSE
   }
}

# the final record (skipped if there was no input at all)
if(!is_first_line) {
  cat(prev_lang, "\t", prev_2gram, "\t", count, "\n", sep="")
}

close(input)
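
The reducer only works because equal keys arrive on adjacent lines. The sketch below simulates that locally in plain R, assuming the lines reach the reducer sorted so that equal language/2-gram pairs sit next to each other. The sample records and the function name reduce_sorted are hypothetical, and sort() merely stands in for the sorting Hadoop does between the two phases.

```r
# Hypothetical mapper output: language <TAB> 2-gram <TAB> count, in arbitrary order
mapped <- c("en\tth\t1", "de\tch\t2", "en\the\t1", "en\tth\t3", "de\tch\t1")

# Stand-in for Hadoop's sort step; method = "radix" sorts in the C locale
shuffled <- sort(mapped, method = "radix")

# Same "compare with the previous key" idea as reduce.R,
# but collecting results in a vector instead of printing them
reduce_sorted <- function(lines) {
  out <- character(0)
  prev_key <- NULL
  total <- 0L
  for (l in lines) {
    f <- strsplit(l, "\t")[[1]]
    key <- paste(f[1], f[2], sep = "\t")
    # key changed: emit the finished group and start a new one
    if (!is.null(prev_key) && key != prev_key) {
      out <- c(out, paste(prev_key, total, sep = "\t"))
      total <- 0L
    }
    prev_key <- key
    total <- total + as.integer(f[3])
  }
  # emit the last group, if there was any input
  if (!is.null(prev_key)) out <- c(out, paste(prev_key, total, sep = "\t"))
  out
}

reduced <- reduce_sorted(shuffled)
print(reduced)
```

Running reduce_sorted on the unsorted vector mapped instead would emit the en/th pair twice, which shows why the reducer depends on sorted input.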

http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/
