More than 80% of today's data is stored as text, and most of it is unstructured.
Text mining, also referred to as text data mining or text analytics, aims to extract high-quality information from text. It usually involves retrieving the data from its source, structuring the input text, categorizing data, clustering text, sentiment analysis, document summarization, and entity relation modeling. In this series of tutorials, we will introduce a handful of text mining packages in R to demonstrate popular techniques and the text mining infrastructure. First, we provide an overview of the available packages.
Package Name | Description |
---|---|
tm | A framework for text mining applications, very good at manipulating data. |
wordnet | An interface to WordNet, a large lexical database of English. |
textir | Tools for analysis of sentiment in text. |
RTextTools | Automatic text classification via supervised learning. |
wordcloud | Various word clouds. |
lsa | Latent semantic analysis for latent features or topics. |
openNLP | An interface to OpenNLP, a collection of natural language processing tools. |
twitteR | A tool that provides access to most of the Twitter API. |
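As a quick check before starting, the sketch below tests which of the packages in the table are already installed. It uses the names as published on CRAN (`twitteR` and `lsa`); whether every package is still available from CRAN today is an assumption, so the install line is left commented out.

```r
# Package names as published on CRAN; availability may have changed over time.
pkgs <- c("tm", "wordnet", "textir", "RTextTools", "wordcloud",
          "lsa", "openNLP", "twitteR")

# requireNamespace() returns TRUE when a package is installed and loadable,
# without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to try installing from CRAN
}
```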
As illustrated in the diagram above, there are four stages in text mining: unstructured data, an organized repository, structured data, and analysis results.
`twitteR` will allow us to retrieve tweets for analysis, and `tm.plugin.mail` will give us mail handling functionality. Structured representations can then be built with the `tm` package. The alternative is to use plain text character sequences for string-kernel-like methods.
Nowadays we face more and more data, but we can use distributed computing frameworks to tackle those challenges. If you are interested, you can refer to two other tutorials on how to use R with Hadoop and Spark to deal with large data sets.
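To make the four stages concrete, here is a minimal sketch that walks a toy collection through them using only base R, so it runs without installing anything. The documents, the `tokenize` helper, and the variable names are all made up for illustration; the `tm` package offers richer equivalents of each step, which are not shown here.

```r
# Stage 1: unstructured data (a hypothetical toy collection of raw strings)
raw_docs <- c(
  doc1 = "Text mining extracts information from text.",
  doc2 = "R has many packages for text mining!",
  doc3 = "Structured data makes analysis easier."
)

# Stage 2: organized repository (cleaned, tokenized documents in a named list)
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)        # replace non-letters with spaces
  tokens <- strsplit(x, " +")[[1]]    # split on runs of spaces
  tokens[nzchar(tokens)]              # drop empty tokens
}
repository <- lapply(raw_docs, tokenize)

# Stage 3: structured data (document-term matrix of raw counts)
vocabulary <- sort(unique(unlist(repository)))
dtm <- t(vapply(
  repository,
  function(tokens) tabulate(match(tokens, vocabulary), nbins = length(vocabulary)),
  integer(length(vocabulary))
))
colnames(dtm) <- vocabulary

# Stage 4: analysis results (total frequency of each term across documents)
term_freq <- sort(colSums(dtm), decreasing = TRUE)
print(head(term_freq, 3))
```

Each row of `dtm` is a document and each column a term, which is the kind of structured form that later analysis steps such as clustering or classification typically start from.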
In the following tutorials, we will work through a series of text mining tasks using some of the packages above, to gain a better understanding of what text mining can do and how to do it in R.
Other helpful literature for text mining
`tm` Package: Text Mining in R
Meyer, David, Kurt Hornik, and Ingo Feinerer. 2008. “Text Mining Infrastructure in R.” Journal of Statistical Software 25 (5). American Statistical Association: 1–54.
Wild, Fridolin. 2015. “CRAN Task View: Natural Language Processing.” https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
Zhao, Yanchang. 2015. “R and Data Mining: Examples and Case Studies.”