We will run both linear and logistic regressions using SparkR, on the small cats dataset
from the MASS package. Each entry records a cat's sex, body weight, and heart weight. First, let's take a quick look at what the dataset is like.
library("MASS")
data(cats)
head(cats)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
summary(cats)
##  Sex         Bwt             Hwt
##  F:47   Min.   :2.000   Min.   : 6.30
##  M:97   1st Qu.:2.300   1st Qu.: 8.95
##         Median :2.700   Median :10.10
##         Mean   :2.724   Mean   :10.63
##         3rd Qu.:3.025   3rd Qu.:12.12
##         Max.   :3.900   Max.   :20.50
Now we want to look at the correlation between body weight and heart weight. Intuitively, a larger heart weight should go with a larger body weight. Let's draw a scatter-plot to build some intuition; it confirms this.
plot(cats$Hwt, cats$Bwt, xlab = "Heart weight", ylab = "Body weight", main = "Body weight vs Heart weight")
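The plot suggests a strong positive relationship. To put a number on it, we can compute the Pearson correlation with base R's cor (a quick check, output omitted here):
cor(cats$Bwt, cats$Hwt) # Pearson correlation between body weight and heart weight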
Now we need to initialize SparkR. We will also need a sqlContext to convert a local R data frame into a Spark DataFrame. In an interactive SparkR shell, these two are already created for you.
Sys.setenv(SPARK_HOME= "/Users/yuancalvin/spark-1.5.2") # Set this to where Spark is installed
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
##
## Attaching package: 'SparkR'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following objects are masked from 'package:stats':
##
## filter, na.omit
##
## The following objects are masked from 'package:base':
##
## intersect, rbind, sample, subset, summary, table, transform
sc <- sparkR.init(master="local")
## Launching java with spark-submit command /Users/yuancalvin/spark-1.5.2/bin/spark-submit sparkr-shell /var/folders/r1/l2wpmynd0vl54pzd1bf_00sh0000gn/T//Rtmp47u97N/backend_port409f1cbc129d
sqlContext <- sparkRSQL.init(sc)
A DataFrame is essentially Spark's version of an R data frame, with better optimization. Since SparkR's logistic regression doesn't support multi-class classification yet, we need to transform M to 1 and F to 0 in the Sex column.
toConvert <- function(x) {
  # Map the two factor levels to numeric labels: 'F' -> 0, 'M' -> 1
  if (x == 'F') {
    0
  } else {
    1
  }
}
cats$Sex <- unlist(lapply(cats$Sex, toConvert))
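Equivalently, since Sex has only the two levels F and M, the conversion can be written as a single vectorized call (an alternative to the lapply above, not an extra step):
cats$Sex <- ifelse(cats$Sex == "F", 0, 1) # same result as the loop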
# Split the data into training and test sets
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(100)
trainIndex <- createDataPartition(cats$Sex, p = .7,
                                  list = FALSE,
                                  times = 1)
head(trainIndex)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 6
## [4,] 8
## [5,] 11
## [6,] 12
train <- cats[trainIndex, ]
test <- cats[-trainIndex, ]
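Before converting to Spark, a quick sanity check that the split really is roughly 70/30 of the 144 rows:
nrow(train) # roughly 70% of the 144 cats
nrow(test)  # the remaining ~30%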
# Convert the local R data frames into SparkR DataFrames
trainDF <- SparkR::createDataFrame(sqlContext, train)
testDF <- SparkR::createDataFrame(sqlContext, test)
head(trainDF)
## Sex Bwt Hwt
## 1 0 2.0 7.4
## 2 0 2.0 9.5
## 3 0 2.1 7.6
## 4 0 2.1 8.2
## 5 0 2.1 8.7
## 6 0 2.1 9.8
Using SparkR::glm is almost identical to using base R's glm, except that summary() is not yet supported for the fitted model, and ggplot is currently not available for Spark DataFrames.
lin_spark <- SparkR::glm(Bwt ~ Hwt, data = trainDF, family = "gaussian")
predictions <- SparkR::predict(lin_spark, testDF)
showDF(SparkR::select(predictions, "Bwt", "prediction"))
## +---+------------------+
## |Bwt| prediction|
## +---+------------------+
## |2.0| 2.145840343434948|
## |2.1|2.1783893223262427|
## |2.1| 2.19466381177189|
## |2.1| 2.324859727337069|
## |2.1| 2.357408706228364|
## |2.1|2.3899576851196587|
## |2.2|2.1621148328805955|
## |2.2|2.5852515584674274|
## |2.3| 2.650349516250017|
## |2.4| 2.438781153456601|
## |2.4| 2.666624005695664|
## |2.0| 2.064467896206711|
## |2.1| 2.650349516250017|
## |2.4|2.2923107484457743|
## |2.4| 2.487604621793543|
## |2.5|2.7968199212608433|
## |2.5|3.0734862418368487|
## |2.5|3.0734862418368487|
## |2.6| 2.25976176955448|
## |2.6| 2.357408706228364|
## +---+------------------+
## only showing top 20 rows
# Note: abline(lin_spark) would not work here; abline() expects a base R fit, not a SparkR model.
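Since summary() is unavailable, one way to gauge the fit is to collect the predictions back to the driver and compute the RMSE locally (a minimal sketch; collect() is fine here because the test set is tiny):
local_preds <- SparkR::collect(SparkR::select(predictions, "Bwt", "prediction"))
sqrt(mean((local_preds$Bwt - local_preds$prediction)^2)) # test-set RMSE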
Now we will run a logistic regression with the sex of each cat as the dependent variable.
model_log <- SparkR::glm(Sex ~ Bwt + Hwt, data= trainDF, family = "binomial")
predictions <- SparkR::predict(model_log, testDF)
showDF(SparkR::select(predictions, "Sex", "prediction"))
## +---+----------+
## |Sex|prediction|
## +---+----------+
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |1.0| 0.0|
## |1.0| 0.0|
## |1.0| 1.0|
## |1.0| 0.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## +---+----------+
## only showing top 20 rows
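To put a number on the classification performance, we can again collect the predictions locally and compute the test-set accuracy (a minimal sketch):
local_preds <- SparkR::collect(SparkR::select(predictions, "Sex", "prediction"))
mean(local_preds$Sex == local_preds$prediction) # fraction of cats classified correctly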