In this article, we present the usual workflow of the text mining framework in R using the tm and wordcloud packages. We will analyze word frequencies from different text files, eventually build a word cloud out of the words shared across documents, and visualize the distribution of the frequent words.
In this tutorial, we will be using works by Shakespeare. In order to get a better view of the Corpus and the word count matrix, we divide the original text file into three parts, which you can download as 1.txt, 2.txt and 3.txt.
Alternatively, you can just run the following commands to download the files and store them in a folder called data:
library(RCurl)
dir.create("data")
setwd("data")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/1.txt"
write(getURL(url), file = "1.txt")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/2.txt"
write(getURL(url), file = "2.txt")
url <- "https://raw.githubusercontent.com/angerhang/statsTutorial/master/src/textMining/data/3.txt"
write(getURL(url), file = "3.txt")
setwd("..")
Corpus is the main structure that tm uses for storing and manipulating text documents. There are two types: VCorpus (Volatile Corpus) and PCorpus (Permanent Corpus). The main difference between the two implementations is that the former holds the documents as R objects in memory, whereas the latter deals with documents that are stored outside the R environment. To construct a Corpus object, we need to provide a source, and there are three types of such sources: DirSource, VectorSource and DataframeSource. We will use DirSource to import the three text files that we just downloaded, since DirSource is the way to import files directly from the user's file system.
library(tm)
shakespeare <- VCorpus(DirSource("data", encoding = "UTF-8"))
shakespeare
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
If you just want to create a simple Corpus object from a character vector, it is easy to do via VectorSource.
twoPhrases <- c("Phrase one ", "Phrase two")
simpleCorpus <- VCorpus(VectorSource(twoPhrases))
simpleCorpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
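The third source type, DataframeSource, reads documents from a data frame. Here is a rough sketch, assuming a recent version of tm in which the data frame is expected to have a doc_id and a text column:
# Build a corpus from a data frame with doc_id and text columns
df <- data.frame(doc_id = c("doc1", "doc2"),
                 text = c("Phrase one", "Phrase two"),
                 stringsAsFactors = FALSE)
dfCorpus <- VCorpus(DataframeSource(df))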
Sometimes we want to save the Corpus to disk so that it can be used by other tools. Simply use
writeCorpus(shakespeare)
which writes each document in the Corpus to a separate file on the hard disk.
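writeCorpus() also accepts a path argument if you prefer not to clutter the working directory; a small sketch (the out folder name is just an example):
# Write each document of the Corpus into its own file inside "out"
dir.create("out")
writeCorpus(shakespeare, path = "out")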
Both the Corpus and the individual documents carry metadata: each Corpus has its own metadata, and each document in the Corpus also has one.
meta(shakespeare[[1]])
## author : character(0)
## datetimestamp: 2016-04-25 13:39:24
## description : character(0)
## heading : character(0)
## id : 1.txt
## language : en
## origin : character(0)
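Metadata fields can also be modified with the meta() replacement function; for example (a small illustration that is not part of the original analysis):
# Fill in the author field of the first document's metadata
meta(shakespeare[[1]], "author") <- "William Shakespeare"
meta(shakespeare[[1]])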
We can look at a single document in the Corpus and get an overview of the whole collection using
shakespeare[[1]]
summary(shakespeare)
There are also other useful methods available, such as tmUpdate(), which checks for new files that are not yet in the Corpus and adds them, and inspect(), which gives a more detailed overview than summary().
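For instance, a quick look at the first document with inspect() (a small illustration, not part of the original code):
# Show a detailed view of the first document in the Corpus
inspect(shakespeare[1])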
Before the analysis, we usually clean up the Corpus: strip extra whitespace, stem the documents, remove punctuation, remove stopwords such as the and be, and convert everything to lower case. Stemming reduces derived words to their root form, for instance turning says and said into say, because they essentially carry the same meaning. We do the cleaning with the help of tm_map, which applies a function across all the documents.
library(SnowballC)
# Remove whitespace
shakespeare <- tm_map(shakespeare, stripWhitespace, lazy=TRUE)
# Stemming
shakespeare <- tm_map(shakespeare, stemDocument, lazy=TRUE)
# Remove punctuation
shakespeare <- tm_map(shakespeare, removePunctuation, lazy=TRUE)
# Remove stopwords
shakespeare <- tm_map(shakespeare, content_transformer(removeWords), stopwords("english"), lazy=TRUE)
# Case conversion
shakespeare <- tm_map(shakespeare, content_transformer(tolower), lazy=TRUE)
Note that the tm package seems to require us to set lazy=TRUE explicitly here; otherwise we get an "all scheduled cores encountered errors in user code" warning.
Now we are ready to do some analysis on the Corpus we have. First we build a document-term matrix and look at the terms that appear frequently across the documents.
dtm <- DocumentTermMatrix(shakespeare)
highFreqTerms <- findFreqTerms(dtm, 25, Inf)
summary(highFreqTerms)
## Length Class Mode
## 4197 character character
highFreqTerms[1:10]
## [1] "19901993" "aaron" "abat" "abbess" "abbey" "abhor"
## [7] "abhorson" "abid" "abil" "abject"
# Terms that correlate with "love" across documents (correlation >= 0.95)
loves_assocs <- findAssocs(dtm, "love", 0.95)
# Overall term frequencies, sorted from most to least frequent
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
library(wordcloud)
set.seed(555)
wordcloud(names(freq), freq, min.freq=2500, max.words = 100, colors=brewer.pal(8, "Dark2"))
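Besides the word cloud, we can also look at the distribution of the most frequent words directly; here is a small sketch using base R's barplot(), not part of the original code:
# Bar plot of the 20 most frequent terms in the corpus
barplot(freq[1:20], las = 2, col = "steelblue", main = "Most frequent terms")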
W. Shakespeare, The Complete Works of William Shakespeare (1994). Project Gutenberg EBook: http://www.gutenberg.org/ebooks/100
Feinerer, Ingo. 2015. "Introduction to the tm Package: Text Mining in R."
Meyer, David, Kurt Hornik, and Ingo Feinerer. 2008. "Text Mining Infrastructure in R." Journal of Statistical Software 25 (5).