getURL, getURLContent (RCurl package)

getURL(url), getURLContent(url)
The getURL and getURLContent functions from the RCurl package retrieve the source of a webpage, which is especially useful when collecting pages for data processing (i.e. scraping). The getURLContent function is somewhat more general, since it can also handle binary content by checking the content type of the response, but for ordinary text pages the getURL function is usually sufficient.
  • url – A character string of a URL.
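
A retrieval can fail for reasons outside the script's control (no connection, a bad address), so it can help to wrap the call. The following is a minimal sketch, not part of RCurl: safe_fetch is a hypothetical helper name, and its fetcher argument is an assumption added so the function can be tried without network access.

```r
# Hypothetical helper (not part of RCurl): fetch a page, returning NA on failure.
# `fetcher` defaults to RCurl::getURL; because default arguments are evaluated
# lazily, RCurl is only needed when no other fetcher is supplied.
safe_fetch <- function(url, fetcher = RCurl::getURL) {
  stopifnot(is.character(url), length(url) == 1)
  tryCatch(
    fetcher(url),
    error = function(e) {
      # Report the problem but let the calling script continue
      warning("Could not retrieve ", url, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Usage (requires the RCurl package and a network connection):
# html <- safe_fetch("http://www.example.com/")
# nchar(html)
```

Returning NA instead of stopping is a design choice that suits loops over many URLs, where one failed page should not abort the whole run.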

Example. Note that eBay's User Agreement indicates that scrapers should not be used on their website, so this example is for illustrative purposes only. Below, an eBay search results page is retrieved, and the "vip" substring is used to split the page into pieces, where each piece represents information for a particular auction. From there, the strsplit, tail, regexpr, and substr functions are used to extract auction titles and auction IDs.
The RCurl package was built under a later R version than the one I am currently using, which explains the warning generated when the package is loaded.
> # install.packages("RCurl")
> library(RCurl)
Loading required package: bitops
Warning message:
package 'RCurl' was built under R version 2.13.2 
> 
> # Ebay search: big bang theory season 4
> URL  <- "http://www.ebay.com/ctg/Big-Bang-Theory-Complete-Fourth-Season-DVD-2011-3-Disc-Set-/103149230?LH_Auction=1&_dmpt=US_DVD_HD_DVD_Blu_ray&_pcategid=617&_pcatid=1&_refkw=big+bang+theory+season+4&_trkparms=65%253A12%257C66%253A4%257C39%253A1%257C72%253A5841&_trksid=p3286.c0.m14"
> html <- getURLContent(URL)
> hold <- strsplit(html, "vip")[[1]]
> titles <- rep(NA, length(hold)-1)
> IDs    <- rep(NA, length(hold)-1)
> for(i in 2:length(hold)){
+   t1  <- strsplit(hold[i-1], "href=\"")[[1]]
+   t2  <- tail(t1, 1)
+   t3  <- regexpr("[0-9]{12}", t2)
+   t4  <- t3 + attr(t3, "match.length")-1
+   t5  <- substr(t2, t3, t4)
+   IDs[i-1]    <- as.numeric(t5)
+   titles[i-1] <- strsplit(hold[i], '"')[[1]][3]
+ }
> length(titles)
[1] 31
> length(IDs)
[1] 31
> 
> titles[15:20]
[1] "The Big Bang Theory: The Complete Fourth Season (DVD,
2011, 3-Disc Set)"
[2] "The Big Bang Theory: The Complete Fourth Season (DVD,
2011, 3-Disc Set)"
[3] "Big Bang Theory 4th Season"                                             
[4] "The Big Bang Theory: The Complete Fourth Season (DVD,
2011, 3-Disc Set)"
[5] "The Big Bang Theory: The Complete Fourth Season (DVD,
2011, 3-Disc Set)"
[6] "The Big Bang Theory: Complete Fourth Season DVD - NEW"
> 
> IDs
 [1] 220960551768 170787388552 320854069325 170787385659
 [5] 280831496183 170787388159 200716845129 370587970391
 [9] 110829125749 260962596359 280829416858 200717387825
[13] 200714525842 380412250744 180821350347 220956412284
[17] 200717780371 220957096328 320854463678 230748559062
[21] 180826493861 370587618205 200716281607 150763546285
[25] 220958674091 360435832284 230749607001 310381400163
[29] 180826493980 310381483526 251002332304
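
The extraction steps in the loop above can be tried without touching any website. The sketch below applies the same split-and-search idea to a small made-up HTML string. The extract_items name, the sample page, and the example.com addresses are all assumptions for illustration, and regmatches is swapped in for the regexpr plus substr arithmetic. IDs are kept as character strings here, since they are identifiers rather than quantities.

```r
# Made-up page fragment that mimics the layout assumed by the loop above:
# each item link carries class="vip", preceded by an href ending in a
# 12-digit ID and followed by a title attribute.
page <- paste0(
  '<div><a href="http://example.com/itm/111111111111" class="vip" ',
  'title="First item">x</a></div>',
  '<div><a href="http://example.com/itm/222222222222" class="vip" ',
  'title="Second item">y</a></div>'
)

# Hypothetical helper: split on the marker, then read the ID from the text
# before each marker and the title from the text after it.
extract_items <- function(html, marker = "vip") {
  pieces <- strsplit(html, marker, fixed = TRUE)[[1]]
  n      <- max(length(pieces) - 1, 0)   # one item per marker occurrence
  ids    <- rep(NA_character_, n)
  titles <- rep(NA_character_, n)
  for (i in seq_len(n)) {
    # Text after the last href=" that precedes the marker
    before <- utils::tail(strsplit(pieces[i], 'href="', fixed = TRUE)[[1]], 1)
    # First run of 12 digits in that text, if there is one
    m <- regmatches(before, regexpr("[0-9]{12}", before))
    if (length(m) == 1) ids[i] <- m
    # Third quote-delimited field after the marker holds the title
    titles[i] <- strsplit(pieces[i + 1], '"', fixed = TRUE)[[1]][3]
  }
  data.frame(id = ids, title = titles, stringsAsFactors = FALSE)
}

items <- extract_items(page)
print(items)
```

Using seq_len(n) rather than 2:length(hold) means the loop is simply skipped when the marker never appears, instead of counting backwards from 2 to 1.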
