This is a report on SparkR using R Markdown.

## Spark Installation

First, go to the Spark downloads page to download the appropriate Spark release. In this tutorial we use Spark 1.5.2 with the package type that is pre-built for Hadoop 2.6 and later. For the download type, choose direct download.

After unpacking the package, place it in your desired directory; then you can proceed to the next steps.
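If you prefer to script the download, the same steps can be done from within R. A minimal sketch, assuming the Apache archive mirror URL and a home-directory install; adapt the paths to your machine:

spark_url <- "http://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz"  # assumed mirror URL
download.file(spark_url, destfile = "spark-1.5.2-bin-hadoop2.6.tgz")
untar("spark-1.5.2-bin-hadoop2.6.tgz", exdir = path.expand("~"))  # unpacks into your home directory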

## Load SparkR in RStudio

We need to set the SPARK_HOME environment variable and add Spark's bundled R package to the library paths:

Sys.setenv(SPARK_HOME = "/Users/yuancalvin/spark-1.5.2")  # point SPARK_HOME at the unpacked installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))  # make the bundled SparkR package visible to library()

Now we can load the SparkR library; note that it masks several functions from stats and base:

library(SparkR)
## 
## Attaching package: 'SparkR'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, na.omit
## 
## The following objects are masked from 'package:base':
## 
##     intersect, rbind, sample, subset, summary, table, transform
Finally, we initialize a local Spark context; sc is the handle we use in the rest of the report.

sc <- sparkR.init(master="local")
## Launching java with spark-submit command /Users/yuancalvin/spark-1.5.2/bin/spark-submit   sparkr-shell /var/folders/r1/l2wpmynd0vl54pzd1bf_00sh0000gn/T//RtmpnDOpE2/backend_portad6417712da6
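
sparkR.init() also accepts an application name and Spark configuration through its sparkEnvir argument. Below is a minimal sketch of a customized context; the two-core master, app name, and memory setting are illustrative assumptions, not values used in this report:

sc <- sparkR.init(master = "local[2]",        # assumed: two local cores
                  appName = "SparkR-report",  # assumed: illustrative app name
                  sparkEnvir = list(spark.driver.memory = "1g"))  # assumed memory setting

When you are done, sparkR.stop() shuts the context down.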

## SparkR Basics

Let’s first look at some simple operations using SparkR. Unlike a regular R object, a file loaded into Spark cannot be inspected directly with View(); we have to ask Spark for its contents.

We first read the text file into myFile with textFile() and use take() to look at the first 10 lines of README.md. The RDD API is not exported in SparkR 1.5.2, which is why it is accessed with the ::: operator:

myFile <- SparkR:::textFile(sc, "/Users/yuancalvin/spark-1.5.2/README.md")  # an RDD with one element per line
take(myFile, 10)  # fetch the first 10 lines back to the driver
## [[1]]
## [1] "# Apache Spark"
## 
## [[2]]
## [1] ""
## 
## [[3]]
## [1] "Spark is a fast and general cluster computing system for Big Data. It provides"
## 
## [[4]]
## [1] "high-level APIs in Scala, Java, Python, and R, and an optimized engine that"
## 
## [[5]]
## [1] "supports general computation graphs for data analysis. It also supports a"
## 
## [[6]]
## [1] "rich set of higher-level tools including Spark SQL for SQL and DataFrames,"
## 
## [[7]]
## [1] "MLlib for machine learning, GraphX for graph processing,"
## 
## [[8]]
## [1] "and Spark Streaming for stream processing."
## 
## [[9]]
## [1] ""
## 
## [[10]]
## [1] "<http://spark.apache.org/>"
Next we run the classic word count: flatMap() splits each line into words, lapply() maps every word to a (word, 1) pair, and reduceByKey() adds up the counts per word; the remaining lines are plain R that shape the collected result into a data frame.

words <- SparkR:::flatMap(myFile, function(line) {strsplit(line, " ")[[1]]})  # one element per word
wordCount <- SparkR:::lapply(words, function(word) {list(word, 1)})  # (word, 1) pairs
counts <- SparkR:::reduceByKey(wordCount, "+" , 2)  # sum the counts per word over 2 partitions
output <- collect(counts)  # bring the (word, count) pairs back to the driver
k <- lapply(output, unlist)  # flatten each pair into a character vector
m <- data.frame(t(data.frame(k)))  # stack the pairs into a two-column data frame
head(m)
##                                      X1 X2
## c..Thriftserver....1..     Thriftserver  1
## c..Alternatively.....1.. Alternatively,  1
## c.....Specifying....1..    ["Specifying  1
## c..guide.....1..                 guide,  1
## c..variable....1..             variable  1
## c..engine....1..                 engine  1
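
Since m is now an ordinary data frame, base R can rank the words. A minimal sketch, assuming the default column names X1 (word) and X2 (count) seen above; the counts come back as a factor, so they need converting first:

m$X2 <- as.numeric(as.character(m$X2))     # counts were stored as a factor; make them numeric
head(m[order(m$X2, decreasing = TRUE), ])  # the most frequent words in README.md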