Hand labeling texts by topic, tone, or other variables has contributed to many areas of the social sciences. However, this "manual coding" process is highly time consuming, and machine learning techniques can save much of that time. RTextTools was originally designed for social scientists as a start-to-finish product usable within a few steps, while more advanced users can also employ it for fast prototyping. RTextTools ships with nine algorithms: SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks, and maximum entropy.
# install.packages("RTextTools")
library('RTextTools')
We will use the USCongress data set; one can also load data sets in other formats using read_data() and preprocess them with the tools from tm. Our data set contains labeled bills from the United States Congress. We are primarily interested in two variables: major, a manually labeled topic code corresponding to the subject of the bill, and text, the exact content of the bill.
data("USCongress")
str(USCongress)
## 'data.frame': 4449 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ cong : int 107 107 107 107 107 107 107 107 107 107 ...
## $ billnum : int 4499 4500 4501 4502 4503 4504 4505 4506 4507 4508 ...
## $ h_or_sen: Factor w/ 2 levels "HR","S": 1 1 1 1 1 1 1 1 1 1 ...
## $ major : int 18 18 18 18 5 21 15 18 18 18 ...
## $ text : Factor w/ 4295 levels "-- Private Bill; For the relief of Alfonso Quezada-Bonilla.",..: 4270 4269 4273 4158 3267 3521 4175 4284 4246 4285 ...
When building the document-term matrix, we set the stemWords option to TRUE, which collapses related word forms into a single stem; for example, "happy" and "happiness" reduce to the same stem because they share a root. We also remove sparse terms and numbers to keep the matrix small.
my_matrix <- create_matrix(USCongress$text, language = "english", stemWords = TRUE,
removeSparseTerms = .999, removeNumbers = TRUE)
Next, we bundle the matrix and the labels into a matrix_container, specifying which documents belong to the training and test sets, via:
my_container <- create_container(my_matrix, USCongress$major, trainSize = 1:4000,
testSize = 4001:4449, virgin = FALSE)
From this point on, the container can be passed to every subsequent training and classification function.
my_svm <- train_model(my_container, algorithm = "SVM")
# my_boost <- train_model(my_container, algorithm = "BOOSTING")
svm_predictions <- classify_model(my_container, model = my_svm)
# boost_predictions <- classify_model(my_container, my_boost)
Finally, we call create_analytics() to learn more about the models' performance; it returns summaries by label, by algorithm, and by document, along with an ensemble summary.
analytics <- create_analytics(my_container, cbind(svm_predictions))
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1 0.77
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.6780 0.6815 0.6700
A good way to improve accuracy is to run multiple algorithms at once and accept a label only when the number of algorithms agreeing on it exceeds a threshold.
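The agreement rule can be sketched in base R; the predictions below are toy values for illustration, and in practice RTextTools' own train_models(), classify_models(), and create_ensembleSummary() compute this for you:

```r
# Hypothetical per-algorithm topic-code predictions for four documents
preds <- data.frame(
  svm    = c("18", "18", "5", "21"),
  maxent = c("18", "15", "5", "21"),
  glmnet = c("18", "18", "5", "15"),
  stringsAsFactors = FALSE
)

# Keep a label only when at least `threshold` algorithms agree on it
threshold <- 2
consensus <- unname(apply(preds, 1, function(p) {
  tab <- sort(table(p), decreasing = TRUE)  # vote counts, most common first
  if (tab[1] >= threshold) names(tab)[1] else NA_character_
}))
consensus
```

Raising the threshold makes the accepted labels more reliable but leaves more documents without a consensus label, which is exactly the coverage-versus-recall trade-off reported in the ensemble summary above.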
We can also estimate out-of-sample accuracy with five-fold cross-validation:
SVM <- cross_validate(my_container, 5, "SVM")
## Fold 1 Out of Sample Accuracy = 0.7254005
## Fold 2 Out of Sample Accuracy = 0.7191539
## Fold 3 Out of Sample Accuracy = 0.7404995
## Fold 4 Out of Sample Accuracy = 0.7195531
## Fold 5 Out of Sample Accuracy = 0.7345815