We will run both linear and logistic regressions using SparkR, on the small cats dataset
from the MASS package. Each entry records a cat's sex, body weight, and heart weight. First, let's take a quick look at what the dataset is like.
library("MASS")
data(cats)
head(cats)
## Sex Bwt Hwt
## 1 F 2.0 7.0
## 2 F 2.0 7.4
## 3 F 2.0 9.5
## 4 F 2.1 7.2
## 5 F 2.1 7.3
## 6 F 2.1 7.6
summary(cats)
##  Sex         Bwt             Hwt
##  F:47   Min.   :2.000   Min.   : 6.30
##  M:97   1st Qu.:2.300   1st Qu.: 8.95
##         Median :2.700   Median :10.10
##         Mean   :2.724   Mean   :10.63
##         3rd Qu.:3.025   3rd Qu.:12.12
##         Max.   :3.900   Max.   :20.50
Now we want to look at the correlation between body weight and heart weight. Intuitively, a larger heart weight should go with a larger body weight. Let's draw a scatter-plot to build some intuition; it confirms this.
plot(cats$Hwt, cats$Bwt, xlab = "Heart weight", ylab = "Body weight", main = "Body weight vs Heart weight")
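The plot suggests a strong positive relationship. To put a number on it, we can compute the Pearson correlation with base R's cor (a quick check, output omitted here):
cor(cats$Bwt, cats$Hwt) # Pearson correlation between body weight and heart weight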
Now we need to initialize SparkR. We will also need a sqlContext to convert a local R data frame into a Spark DataFrame. In an interactive SparkR shell, these two are already created for you.
Sys.setenv(SPARK_HOME= "/Users/yuancalvin/spark-1.5.2") # Set this to where Spark is installed
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
##
## Attaching package: 'SparkR'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following objects are masked from 'package:stats':
##
## filter, na.omit
##
## The following objects are masked from 'package:base':
##
## intersect, rbind, sample, subset, summary, table, transform
sc <- sparkR.init(master="local")
## Launching java with spark-submit command /Users/yuancalvin/spark-1.5.2/bin/spark-submit sparkr-shell /var/folders/r1/l2wpmynd0vl54pzd1bf_00sh0000gn/T//Rtmp47u97N/backend_port409f1cbc129d
sqlContext <- sparkRSQL.init(sc)
A DataFrame is essentially Spark's version of an R data frame, with better optimization. Since SparkR's logistic regression doesn't support multi-class classification yet, we need to transform M to 1 and F to 0 in the Sex column.
toConvert <- function(x) {
  # Map the two factor levels to numeric labels: 'F' -> 0, 'M' -> 1
  if (x == 'F') {
    0
  } else {
    1
  }
}
cats$Sex <- unlist(lapply(cats$Sex, toConvert))
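Equivalently, since Sex has only the two levels F and M, the conversion can be written as a single vectorized call (an alternative to the lapply above, not an extra step):
cats$Sex <- ifelse(cats$Sex == "F", 0, 1) # same result as the loop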
# Split the data into training and test sets
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(100)
trainIndex <- createDataPartition(cats$Sex, p = .7,
                                  list = FALSE,
                                  times = 1)
head(trainIndex)
## Resample1
## [1,] 2
## [2,] 3
## [3,] 6
## [4,] 8
## [5,] 11
## [6,] 12
train <- cats[trainIndex, ]
test <- cats[-trainIndex, ]
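Before converting to Spark, a quick sanity check that the split really is roughly 70/30 of the 144 rows:
nrow(train) # roughly 70% of the 144 cats
nrow(test)  # the remaining ~30%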
# Convert the local R data frames into SparkR DataFrames
trainDF <- SparkR::createDataFrame(sqlContext, train)
testDF <- SparkR::createDataFrame(sqlContext, test)
head(trainDF)
## Sex Bwt Hwt
## 1 0 2.0 7.4
## 2 0 2.0 9.5
## 3 0 2.1 7.6
## 4 0 2.1 8.2
## 5 0 2.1 8.7
## 6 0 2.1 9.8
Using SparkR::glm is almost identical to using base R's glm, except that summary() is not yet supported for the fitted model, and ggplot is currently not available for Spark DataFrames.
lin_spark <- SparkR::glm(Bwt ~ Hwt, data = trainDF, family = "gaussian")
predictions <- SparkR::predict(lin_spark, testDF)
showDF(SparkR::select(predictions, "Bwt", "prediction"))
## +---+------------------+
## |Bwt| prediction|
## +---+------------------+
## |2.0| 2.145840343434948|
## |2.1|2.1783893223262427|
## |2.1| 2.19466381177189|
## |2.1| 2.324859727337069|
## |2.1| 2.357408706228364|
## |2.1|2.3899576851196587|
## |2.2|2.1621148328805955|
## |2.2|2.5852515584674274|
## |2.3| 2.650349516250017|
## |2.4| 2.438781153456601|
## |2.4| 2.666624005695664|
## |2.0| 2.064467896206711|
## |2.1| 2.650349516250017|
## |2.4|2.2923107484457743|
## |2.4| 2.487604621793543|
## |2.5|2.7968199212608433|
## |2.5|3.0734862418368487|
## |2.5|3.0734862418368487|
## |2.6| 2.25976176955448|
## |2.6| 2.357408706228364|
## +---+------------------+
## only showing top 20 rows
# Note: abline(lin_spark) would not work here; abline() expects a base R fit, not a SparkR model.
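Since summary() is unavailable, one way to gauge the fit is to collect the predictions back to the driver and compute the RMSE locally (a minimal sketch; collect() is fine here because the test set is tiny):
local_preds <- SparkR::collect(SparkR::select(predictions, "Bwt", "prediction"))
sqrt(mean((local_preds$Bwt - local_preds$prediction)^2)) # test-set RMSE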
Now we will run a logistic regression with the sex of each cat as the dependent variable.
model_log <- SparkR::glm(Sex ~ Bwt + Hwt, data= trainDF, family = "binomial")
predictions <- SparkR::predict(model_log, testDF)
showDF(SparkR::select(predictions, "Sex", "prediction"))
## +---+----------+
## |Sex|prediction|
## +---+----------+
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |0.0| 0.0|
## |1.0| 0.0|
## |1.0| 0.0|
## |1.0| 1.0|
## |1.0| 0.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## |1.0| 1.0|
## +---+----------+
## only showing top 20 rows
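To put a number on the classification performance, we can again collect the predictions locally and compute the test-set accuracy (a minimal sketch):
local_preds <- SparkR::collect(SparkR::select(predictions, "Sex", "prediction"))
mean(local_preds$Sex == local_preds$prediction) # fraction of cats classified correctly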