The internet is rife with useful information that we would like to analyze. In the previous tutorials, we used twitteR to retrieve text from social media; in this tutorial, we will use tm.plugin.webmining to get hold of more web sources, e.g., XML, HTML and JSON. In addition, the webmining plugin supports content retrieval from various news sites, namely Google News, Yahoo News and more. Generally speaking, the retrieval occurs in two separate steps:
install.packages('tm')
install.packages('tm.plugin.webmining')
tm.plugin.webmining apparently depends on the tm package, RCurl for retrieval and XML for extraction.
library(tm)
library(tm.plugin.webmining)
With the tm package, the retrieved sources can be stored in a Corpus so that they can be analyzed using other functions in tm. For a quick look at what tm does, see the Introduction to Text Mining Package. There are several ready-made methods that allow us to get content from the web.
londonNews <- WebCorpus(GoogleNewsSource('London'))
class(londonNews)
## [1] "WebCorpus" "VCorpus" "Corpus"
WebCorpus is derived from Corpus, with additional methods and metadata.
londonNews
## <<WebCorpus>>
## Metadata: corpus specific: 3, document level (indexed): 0
## Content: documents: 30
meta(londonNews[[1]])
## author : character(0)
## datetimestamp: 2016-05-14 21:22:55
## description : Daily MailMiddle England invaded by London's most feared criminal drug lordsDaily MailPapers prepared for the Mayor of London's office as recently as February reveal there are now a staggering 83 London gangs operating outside the capital. It is so rife that gangs from 19 of the 32 boroughs are involved, with those from Hackney, Brent, ...
## heading : Middle England invaded by London's most feared criminal drug lords - Daily Mail
## id : tag:news.google.com,2005:cluster=http://www.dailymail.co.uk/news/article-3590650/The-drug-lords-Middle-England.html
## language : character(0)
## origin : http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHk0GHqvOczo5290O7ApHIvQ7-DdA&clid=c3a7d30bb8a4878e06b80cf16b898331&ei=NkI4V-CJHIbO1Qb-pYZA&url=http://www.dailymail.co.uk/news/article-3590650/The-drug-lords-Middle-England.html
Each document in the Corpus carries useful metadata such as author, datetimestamp and description.
londonNews[[1]]$content
## [1] "182\nshares\nThe drug lords of Middle England: London's most feared criminal gangs invade England's green and pleasant shires \nTunbridge Wells has fallen victim to a meticulously planned and chilling expansion of the London drug trade that has so far gone barely reported \nDrive for fresh sales territory uses business cards and travelling salesmen\nIt boasts the sort of ‘introductory offers’ familiar to any regular shopper\nPublished: 21:21 GMT, 14 May 2016 | Updated: 08:37 GMT, 15 May 2016\n182 shares\n"
One can also use GoogleFinanceSource, NYTimesSource, YahooFinanceSource, YahooInplaySource and YahooNewsSource for content retrieval.
The plugin also provides corpus.update, which refreshes the corpus for our query and has been implemented efficiently. It first checks the metadata to see whether a document has already been downloaded, and then downloads the actual content only for the documents that are missing.
londonNews <- corpus.update(londonNews)
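The "check the metadata first, then download" idea can be illustrated with a minimal base R sketch. This is not the plugin's actual implementation; the ids, the feed data frame and the fetch_content() helper are all hypothetical, and no network access takes place.

```r
# Hypothetical ids already stored in the corpus.
stored_ids <- c("tag:example,2005:a", "tag:example,2005:b")

# Hypothetical metadata from a freshly polled feed.
feed <- data.frame(
  id      = c("tag:example,2005:b", "tag:example,2005:c", "tag:example,2005:d"),
  heading = c("Old story", "New story 1", "New story 2"),
  stringsAsFactors = FALSE
)

# Step 1: compare metadata only, which is cheap.
is_new   <- !(feed$id %in% stored_ids)
to_fetch <- feed[is_new, ]

# Step 2: only unseen items need their full content retrieved.
# fetch_content() is a stand-in for the real download and extraction.
fetch_content <- function(id) paste("content for", id)
new_docs <- vapply(to_fetch$id, fetch_content, character(1))

# The corpus now knows about the old and the new documents.
updated_ids <- c(stored_ids, to_fetch$id)
```

Skipping already known ids is what keeps repeated updates cheap: the expensive content download happens at most once per document.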
climateArc <- extractContentDOM("http://www.economist.com/news/science-and-technology/21698641-new-way-understand-behaviour-ice-sheets-good-vibrations",0.5,FALSE)
The plugin also offers a trimWhiteSpaces method, which cleans up our text in a similar way to stripWhitespace in tm.
climateArc <- trimWhiteSpaces(climateArc)
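To get a feel for what this kind of whitespace cleanup does, here is a rough base R equivalent. It is only an approximation written for illustration, not the code used by either package, and the sample string is made up.

```r
# A made-up snippet with newlines and repeated spaces, as scraped text often has.
raw_text <- "182\nshares\n  The drug lords   of Middle England \n"

# Collapse every run of whitespace (spaces, tabs, newlines) into a single space.
collapsed <- gsub("[[:space:]]+", " ", raw_text)

# Drop the leading and trailing whitespace that remains.
cleaned <- trimws(collapsed)

print(cleaned)
```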
Using tm.plugin.webmining is fairly easy, yet it gives us an abundant source of information, since we can find almost anything on the internet. After the retrieval, we can make use of the methods in tm so that our data can be properly handled.
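As a small sketch of that last step, the snippet below runs a few standard tm transformations on an in-memory corpus. The two sample sentences are made up so that the example works without any network access; with a real WebCorpus the same tm_map calls should apply, though that is an assumption rather than something tested here.

```r
library(tm)

# Made-up stand-ins for retrieved article text.
texts <- c("London gangs expand  into the shires.",
           "New way to understand   the behaviour of ice sheets.")
corpus <- VCorpus(VectorSource(texts))

# Typical cleanup: lower-case, drop punctuation, collapse extra whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Build a document-term matrix for further analysis.
dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))
```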