lapply

lapply(X, FUN, …)
Apply a function to each item in a list. The returned object is a list of the same length, where each item is the output of FUN for the corresponding item in the original list.
  • X – A list (a vector also works; it is coerced to a list).
  • FUN – A function.
  • … – Additional arguments, if necessary, to pass to FUN.
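As a minimal sketch of how the three arguments fit together, the snippet below applies mean to each item of a small list and passes na.rm through the additional arguments. The list `scores` and its values are made up for illustration.

```r
# A small list of numeric vectors (made-up data)
scores <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# Apply mean() to each item; na.rm = TRUE is handed on to mean() via "..."
means <- lapply(scores, mean, na.rm = TRUE)

# The result is a list with one item per input item, with names preserved
str(means)
```

Because the result is always a list, unlist is a common follow-up when every output is a single value.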

Example. lapply is used below to help clean up a list of file names. I use the "[" (subset) function, but the comments give an equivalent anonymous function that may be easier to think about at first.
This example provides a website scraper for the February 2012 code folder on this website (RFunction.com). We can identify this folder by going to a "Download the Code" link in a February post and cutting the file name off the URL. (Most sites keep an index file that prevents this directory listing from showing up, but I’ve left my files open to facilitate this sort of scraping.) If such a listing were not available, there are often other ways to scrape sufficient data from a website to build your own list of files.
> library(RCurl)
Loading required package: bitops
> 
> #______ Retrieve Index Of Feb 2012 Code ______#
> html  <- getURL("http://rfunction.com/code/1202/")
> temp  <- strsplit(html, "<li><a href=\"")[[1]]
> files <- strsplit(temp, "\"")
> files <- lapply(files, "[", 1)
> # If you didn't know the "[" trick:
> # files <- lapply(files, function(x){ x[1] })
> files <- unlist(files)
> 
> #______ Review Contents ______#
> files
 [1] "<!DOCTYPE HTML PUBLIC " "/code/"                
 [3] "120201.R"               "120202.R"              
 [5] "120203.R"               "120204.R"              
 [7] "120205.R"               "120206.R"              
 [9] "120207.R"               "120208.R"              
[11] "120209.R"               "120210.R"              
[13] "120211.R"               "120212.R"              
[15] "120213.R"               "120214.R"              
[17] "120215.R"               "120216.R"              
[19] "120217.R"               "120218.R"              
[21] "120219.R"               "120220.R"              
[23] "120221.R"               "120222.R"              
[25] "120223.R"               "120224.R"              
[27] "120225-tip.R"           "120225.R"              
[29] "120226.R"               "120227.R"              
[31] "120228.R"               "120229.R"              
[33] "BarackObamaTweets.txt"  "data1.txt"             
[35] "par-120208.pdf"        
> files <- files[-(1:2)]
> 
> #______ Download All Files ______#
> baseURL <- "http://rfunction.com/code/1202/"
> dir.create("1202", showWarnings=FALSE) # Destination folder must exist
> for(i in seq_along(files)){
+   URL <- paste(baseURL, files[i], sep="")
+   download.file(URL, paste("1202", files[i], sep="/"), quiet=TRUE)
+   Sys.sleep(2) # Give target server a break
+ }
> list.files("1202")
 [1] "120201.R"              "120202.R"             
 [3] "120203.R"              "120204.R"             
 [5] "120205.R"              "120206.R"             
 [7] "120207.R"              "120208.R"             
 [9] "120209.R"              "120210.R"             
[11] "120211.R"              "120212.R"             
[13] "120213.R"              "120214.R"             
[15] "120215.R"              "120216.R"             
[17] "120217.R"              "120218.R"             
[19] "120219.R"              "120220.R"             
[21] "120221.R"              "120222.R"             
[23] "120223.R"              "120224.R"             
[25] "120225-tip.R"          "120225.R"             
[27] "120226.R"              "120227.R"             
[29] "120228.R"              "120229.R"             
[31] "BarackObamaTweets.txt" "data1.txt"            
[33] "par-120208.pdf"   
Functions used above with active RFunction.com pages: getURL, strsplit, unlist, download.file, Sys.sleep, list.files.
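To try the parsing steps without a network connection, here is a small sketch that runs the same strsplit and lapply calls on a hard-coded string. The HTML below is a made-up stand-in for what getURL returns, not the real page, and it has only one leading chunk to drop (the real listing also had a "/code/" link).

```r
# Made-up stand-in for the text returned by getURL()
html <- paste0(
  "<html><ul>",
  "<li><a href=\"120201.R\">120201.R</a></li>",
  "<li><a href=\"data1.txt\">data1.txt</a></li>",
  "</ul></html>"
)

# Split at the start of each link; each piece after the first begins with a file name
temp <- strsplit(html, "<li><a href=\"", fixed = TRUE)[[1]]

# Split each piece at double quotes; the result is a list of character vectors
pieces <- strsplit(temp, "\"", fixed = TRUE)

# Keep the first element of each vector: "[" is called on each item with argument 1
first <- lapply(pieces, "[", 1)

# Same result with an explicit anonymous function
first_alt <- lapply(pieces, function(x) x[1])

# Flatten to a character vector and drop the leading non-link chunk
files <- unlist(first)[-1]
print(files)
```

Passing "[" works because subsetting is itself a function in R, so the 1 travels through the additional arguments just like any other argument to FUN.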
Tip. Check terms of use on websites before scraping their pages. You may scrape files on RFunction.com in the site’s main "code" folder, so long as there is a minimum of 2 seconds between each query. (Also, note that while scraping may be permitted on some websites, it does not change the copyright for that content, i.e. don’t repost pages or content without explicit permission.)
