In this article, we walk through the usual workflow of the text mining framework in R using the tm and wordcloud packages. We will analyze word frequencies across several text files, visualize the distribution of the most frequent words, and eventually create a word cloud from the words shared across documents.
In this tutorial, we will be using works by Shakespeare. To get a better view of the Corpus and the word count matrix, we split the original text file into three parts, which you can download as 1.txt, 2.txt and 3.txt.
Alternatively, you can just run the following commands to download and store the files in a folder called data:
library(RCurl)
dir.create("data")
setwd("data")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
write(getURL(url), file = "1.txt")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
write(getURL(url), file = "2.txt")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
write(getURL(url), file = "3.txt")
setwd("..")
Corpus is the main structure that tm uses for storing and manipulating text documents. There are two types: VCorpus (Volatile Corpus) and PCorpus (Permanent Corpus). The main difference between the two implementations is that the former holds the documents as R objects in memory, whereas the latter works with documents stored outside the R environment.
To create a Corpus object, we need to provide a source, and there are three types of sources: DirSource, VectorSource and DataFrameSource. We will use DirSource to import the three text files we just downloaded, since DirSource is the source designed for reading files from the local file system.
library(tm)
shakespeare <- VCorpus(DirSource("data", encoding = "UTF-8"))
shakespeare
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
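For comparison, here is a minimal sketch of how a PCorpus could be built from the same directory. It keeps the documents in a database on disk instead of in memory; the database name and the filehash backend below are just example choices.
library(filehash)
shakespearePerm <- PCorpus(DirSource("data", encoding = "UTF-8"),
                           dbControl = list(dbName = "shakespeare.db", dbType = "DB1"))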
If we want to create a Corpus object directly from text stored in a character vector, it is easy to do via VectorSource.
twoPhrases <- c("Phrase one ", "Phrase two")
simpleCorpus <- VCorpus(VectorSource(twoPhrases))
simpleCorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
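The third source type, DataFrameSource, builds a Corpus from a data frame; in recent versions of tm the data frame is expected to have doc_id and text columns. A small sketch (the phrases below are just placeholders):
df <- data.frame(doc_id = c("doc_1", "doc_2"),
                 text = c("Phrase one", "Phrase two"),
                 stringsAsFactors = FALSE)
dfCorpus <- VCorpus(DataFrameSource(df))
dfCorpus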
Sometimes we want to write the Corpus back to disk so it can be used by other tools. Simply use
writeCorpus(shakespeare)
which writes each document in the Corpus to its own text file on the hard disk.
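If you want to control where the exported files go and how they are named, writeCorpus() also accepts path and filenames arguments; for example (the "export" folder and the file names below are arbitrary choices):
dir.create("export")
writeCorpus(shakespeare, path = "export",
            filenames = paste0("shakespeare_", seq_along(shakespeare), ".txt"))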
Next, let us look at the metadata of the Corpus. Each Corpus has its own metadata, and each document in the Corpus has metadata as well.
meta(shakespeare[[1]])
## author : character(0)
## datetimestamp: 2016-04-25 13:39:24
## description : character(0)
## heading : character(0)
## id : 1.txt
## language : en
## origin : character(0)
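The metadata fields can also be filled in with the meta() replacement function. For instance, assuming we want to record an author and description for the first document (example values only):
meta(shakespeare[[1]], "author") <- "William Shakespeare"
meta(shakespeare[[1]], "description") <- "Part one of the collected works"
meta(shakespeare[[1]])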
We can view a single document and get an overview of the whole Corpus using
shakespeare[[1]]
summary(shakespeare)
There are also other useful methods available, such as tmUpdate(), which checks for new files that are not yet in the Corpus and adds them, and inspect(), which gives a more detailed overview than summary().
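For instance, inspect() on the small two-phrase corpus from above prints the corpus summary together with the content of every document (with the long Shakespeare files the output would be rather lengthy):
inspect(simpleCorpus)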
Before we analyze the text, we clean it up: strip extra whitespace, remove punctuation, and remove stopwords, i.e., very common words such as "the" and "be". Stemming reduces derived words to their root form, for example turning "says" and "said" into "say", since they essentially have the same meaning. We do the cleaning with the help of tm_map(), which applies a function across all the documents.
library(SnowballC)
# Remove whitespace
shakespeare <- tm_map(shakespeare, stripWhitespace, lazy=TRUE)
# Stemming
shakespeare <- tm_map(shakespeare, stemDocument, lazy=TRUE)
# Remove punctuation
shakespeare <- tm_map(shakespeare, removePunctuation, lazy=TRUE)
# Remove stopwords
tm_map(shakespeare, content_transformer(removeWords), stopwords("english"), lazy=TRUE)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
# Case conversion
tm_map(shakespeare, content_transformer(tolower), lazy = TRUE)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
I think the tm package requires us to pass lazy=TRUE explicitly here; otherwise we get an "all scheduled cores encountered errors in user code" warning.
Now we can build a DocumentTermMatrix, which records how many times each term occurs in each document, from the Corpus we have.
dtm <- DocumentTermMatrix(shakespeare)
highFreqTerms <- findFreqTerms(dtm, 25, Inf)
summary(highFreqTerms)
## Length Class Mode
## 4197 character character
highFreqTerms[1:10]
## [1] "19901993" "aaron" "abat" "abbess" "abbey" "abhor"
## [7] "abhorson" "abid" "abil" "abject"
loves_assocs <- findAssocs(dtm, "love", 0.95)
freq <- sort(colSums(as.matrix(dtm)),decreasing=TRUE)
library(wordcloud)
set.seed(555)
wordcloud(names(freq), freq, min.freq=2500, max.words = 100, colors=brewer.pal(8, "Dark2"))
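To see the numbers behind the cloud, you can also peek at the sorted counts and at the association scores computed above, for example:
head(freq, 10)          # the ten most frequent (stemmed) terms
head(loves_assocs$love) # terms most strongly correlated with "love"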
W. Shakespeare, The Complete Works of William Shakespeare (1994). Gutenberg EBook: http://www.gutenberg.org/ebooks/100
Feinerer, Ingo. 2015. "Introduction to the tm Package: Text Mining in R." http://www.dainf.ct.utfpr.edu.br/~kaestner/Mineracao/RDataMining/tm.pdf
Meyer, David, Kurt Hornik, and Ingo Feinerer. 2008. "Text Mining Infrastructure in R." Journal of Statistical Software 25 (5): 1-54.