Hand labeling texts by topic, tone, or other variables has contributed to many areas of the social sciences. However, this "manual coding" process is highly time consuming, and machine learning techniques can save much of that time. RTextTools was originally designed for social scientists as a start-to-finish product usable within a few steps, while more advanced users can also employ it for fast prototyping. RTextTools ships with nine algorithms: SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks, and maximum entropy.
# install.packages("RTextTools")
library('RTextTools')
We will use the USCongress data set; one can also load data sets in other formats using read_data() and preprocess them with the tools from tm. Our data set contains labeled bills from the United States Congress. We are primarily interested in two variables: major, a manually labeled topic code corresponding to the subject of the bill, and text, the exact content of the bill.
data("USCongress")
str(USCongress)
## 'data.frame': 4449 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cong : int 107 107 107 107 107 107 107 107 107 107 ...
## $ billnum : int 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 ...
## $ h_or_sen: Factor w/ 2 levels "HR","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ major : int 18 18 18 18 5 21 15 18 18 18 ...
## $ text : Factor w/ 4295 levels "-- Private Bill; For the relief of Alfonso Quezada-Bonilla.",..: 4270 4269 4273 4158 3267 3521 4175 4284 4246 4285 ...
When building the document-term matrix, we set the stemWords option to TRUE, which collapses related word forms into a single stem; for example, "happy" and "happiness" reduce to the same stem because they share a root. We also remove sparse terms and numbers to keep the matrix small.
my_matrix <- create_matrix(USCongress$text, language = "english", stemWords = TRUE,
removeSparseTerms = .999, removeNumbers = TRUE)
Next, we bundle the matrix and the labels into a matrix_container, specifying which documents belong to the training and test sets, via:
my_container <- create_container(my_matrix, USCongress$major, trainSize = 1:4000,
testSize = 4001:4449, virgin = FALSE)
From this point on, the container can be passed to every subsequent training and classification function.
my_svm <- train_model(my_container, algorithm = "SVM")
# my_boost <- train_model(my_container, algorithm = "BOOSTING")
svm_predictions <- classify_model(my_container, model = my_svm)
# boost_predictions <- classify_model(my_container, my_boost)
Finally, we call create_analytics() to learn more about the models' performance; it returns summaries by label, by algorithm, and by document, along with an ensemble summary.
analytics <- create_analytics(my_container, cbind(svm_predictions))
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1 0.77
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.6780 0.6815 0.6700
A good way to improve accuracy is to run multiple algorithms at once and accept a label only when the number of algorithms agreeing on it exceeds a threshold.
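The agreement rule can be sketched in base R; the predictions below are toy values for illustration, and in practice RTextTools' own train_models(), classify_models(), and create_ensembleSummary() compute this for you:

```r
# Hypothetical per-algorithm topic-code predictions for four documents
preds <- data.frame(
  svm    = c("18", "18", "5", "21"),
  maxent = c("18", "15", "5", "21"),
  glmnet = c("18", "18", "5", "15"),
  stringsAsFactors = FALSE
)

# Keep a label only when at least `threshold` algorithms agree on it
threshold <- 2
consensus <- unname(apply(preds, 1, function(p) {
  tab <- sort(table(p), decreasing = TRUE)  # vote counts, most common first
  if (tab[1] >= threshold) names(tab)[1] else NA_character_
}))
consensus
```

Raising the threshold makes the accepted labels more reliable but leaves more documents without a consensus label, which is exactly the coverage-versus-recall trade-off reported in the ensemble summary above.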
We can also estimate out-of-sample accuracy with five-fold cross-validation:
SVM <- cross_validate(my_container, 5, "SVM")
## Fold 1 Out of Sample Accuracy = 0.7254005
## Fold 2 Out of Sample Accuracy = 0.7191539
## Fold 3 Out of Sample Accuracy = 0.7404995
## Fold 4 Out of Sample Accuracy = 0.7195531
## Fold 5 Out of Sample Accuracy = 0.7345815