Webmining

Webmining with Tm Plugin Webmining

The internet is rife with useful information that we would like to analyze. In the previous tutorials we used twitteR to retrieve text from social media; in this tutorial we will use tm.plugin.webmining to get hold of more web sources, e.g., XML, HTML and JSON. In addition, the webmining plugin supports content retrieval from various news sites, such as Google News, Yahoo News and more. Generally speaking, the retrieval occurs in two separate steps (sketched in code right after this list):

  • Download the metadata.
  • Download the source content.
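
Roughly, the source object holds the downloaded feed metadata and WebCorpus then fetches the full content behind each item. A minimal sketch of how the two steps map onto the package's functions (using the same Google News query that reappears later in this tutorial; the variable names are just examples):
library(tm.plugin.webmining)
# Step 1: query the feed and download the metadata for each item.
src <- GoogleNewsSource('London')
# Step 2: download the actual source content behind each feed item.
londonNews <- WebCorpus(src)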

Installation

install.packages('tm')
install.packages('tm.plugin.webmining')

tm.plugin.webmining depends on the tm package, and it uses RCurl for retrieval and XML for extraction.
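
These dependencies are normally pulled in automatically by install.packages; if they are missing, they can be installed explicitly (a minimal sketch, assuming a CRAN mirror is already configured):
# Install the retrieval and parsing dependencies explicitly if needed.
install.packages(c('RCurl', 'XML'))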

Intro

library(tm)
## Loading required package: NLP
library(tm.plugin.webmining)
## 
## Attaching package: 'tm.plugin.webmining'
## The following object is masked from 'package:base':
## 
##     parse
After loading these two packages, all the methods we need become available. Similar to the tm package, the retrieved sources can be stored in a Corpus so that they can be analyzed using the other functions in tm. You can get a quick glimpse of what tm does in the Introduction to the Text Mining Package. There are several ready-made methods that allow us to get content from the web.
londonNews <- WebCorpus(GoogleNewsSource('London'))
class(londonNews)
## [1] "WebCorpus" "VCorpus"   "Corpus"
As you can see, WebCorpus is derived from Corpus, with additional methods and metadata.
londonNews
## <<WebCorpus>>
## Metadata:  corpus specific: 3, document level (indexed): 0
## Content:  documents: 30
meta(londonNews[[1]])
##   author       : character(0)
##   datetimestamp: 2016-05-14 21:22:55
##   description  : Daily MailMiddle England invaded by London's most feared criminal drug lordsDaily MailPapers prepared for the Mayor of London's office as recently as February reveal there are now a staggering 83 London gangs operating outside the capital. It is so rife that gangs from 19 of the 32 boroughs are involved, with those from Hackney, Brent, ...
##   heading      : Middle England invaded by London's most feared criminal drug lords - Daily Mail
##   id           : tag:news.google.com,2005:cluster=http://www.dailymail.co.uk/news/article-3590650/The-drug-lords-Middle-England.html
##   language     : character(0)
##   origin       : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHk0GHqvOczo5290O7ApHIvQ7-DdA&clid=c3a7d30bb8a4878e06b80cf16b898331&ei=NkI4V-CJHIbO1Qb-pYZA&url=http://www.dailymail.co.uk/news/article-3590650/The-drug-lords-Middle-England.html
Each document in the Corpus has useful information such as author, datetimestamp and description.
londonNews[[1]]$content
## [1] "182\nshares\nThe drug lords of Middle England: London's most feared criminal gangs invade England's green and pleasant shires \nTunbridge Wells has fallen victim to a meticulously planned and chilling expansion of the London drug trade that has so far gone barely reported \nDrive for fresh sales territory uses business cards and travelling salesmen\nIt boasts the sort of ‘introductory offers’ familiar to any regular shopper\nPublished: 21:21 GMT, 14 May 2016 | Updated: 08:37 GMT, 15 May 2016\n182 shares\n"

One can also use GoogleFinanceSource, NYTimesSource, YahooFinanceSource, YahooInplaySource and YahooNewsSource for content retrieval.
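
They all follow the same pattern of wrapping a source in WebCorpus; for example, news items for a single stock could be fetched from Yahoo! Finance as follows (the ticker 'MSFT' is just an arbitrary example):
# Build a corpus of Yahoo! Finance news items for one stock symbol.
msftNews <- WebCorpus(YahooFinanceSource('MSFT'))
msftNews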

Update Corpus

Because each query only downloads 20-100 feed items, that is often not enough for text mining purposes. We can use corpus.update to continuously update our query; it is efficiently implemented: it first checks the metadata to see whether a document has already been downloaded, and then downloads the actual content accordingly.
londonNews <- corpus.update(londonNews)
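Running the update repeatedly keeps the corpus growing; a minimal sketch that reports how many new documents each call added (the variable names are just examples):
before <- length(londonNews)
londonNews <- corpus.update(londonNews)
# Newly retrieved documents are appended to the existing corpus.
cat('New documents added:', length(londonNews) - before, '\n')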

Retrieve from HTML

With the above procedure, one already gains access to an ample amount of information, but extracting the content from a plain HTML page is also important.
climateArc <- extractContentDOM("http://www.economist.com/news/science-and-technology/21698641-new-way-understand-behaviour-ice-sheets-good-vibrations",0.5,FALSE)
There is also a simple trimWhiteSpaces method which transforms our text in much the same way as tm's stripWhitespace.
climateArc <- trimWhiteSpaces(climateArc)
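The extracted text is a plain character vector, so it can be wrapped into a regular tm corpus and cleaned with the usual transformations (a minimal sketch; the lower-casing and punctuation removal steps are only examples):
# Wrap the extracted article text in a corpus and apply standard tm clean-up.
climateCorpus <- VCorpus(VectorSource(climateArc))
climateCorpus <- tm_map(climateCorpus, content_transformer(tolower))
climateCorpus <- tm_map(climateCorpus, removePunctuation)
climateCorpus <- tm_map(climateCorpus, stripWhitespace)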

Final remarks

tm.plugin.webmining is fairly easy to use, yet it gives us access to an abundant source of information, since we can find almost anything on the internet. After retrieval, we can make use of the methods in tm so that our data can be properly handled, as the short sketch below illustrates.
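For instance, a document-term matrix can be built straight from the WebCorpus and inspected with the usual tm helpers (the frequency threshold of 10 is arbitrary):
# Build a document-term matrix from the news corpus and list frequent terms.
dtm <- DocumentTermMatrix(londonNews)
findFreqTerms(dtm, lowfreq = 10)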

References

Annau, M. (2015, May 11). Package ‘tm.plugin.webmining’. Retrieved May 15, 2016, from https://cran.r-project.org/web/packages/tm.plugin.webmining/tm.plugin.webmining.pdf