Introduction
RSelenium is an R client for the Selenium Remote WebDriver API, a project focused on automating web browsers. Its main applications are unit and regression testing of web apps and web pages across a range of browser/OS combinations (in the R environment, specifically Shiny applications). It is also useful for scraping complex, interactive and dynamic websites where other tools like rvest fall short.
Install Docker
Currently, the best way to run a Selenium server is inside a Docker container, so Docker has to be installed first. The following instructions have been tried and tested on Ubuntu 16.04 LTS only.
- Install the dependencies first:

        sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

- Set up the Docker repository for apt:

        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
        sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

- Update and upgrade apt:

        sudo apt-get update
        sudo apt-get upgrade

- Make sure that Docker is going to be installed from the Docker repository:

        apt-cache policy docker-ce

- Install Docker CE:

        sudo apt-get install docker-ce

- Check the status:

        sudo systemctl status docker

- Verify the installation:

        sudo docker run hello-world

- Pull the Selenium image:

        sudo docker pull selenium/standalone-firefox

- Start a "simple" container:

        sudo docker run -d -p 4445:4444 selenium/standalone-firefox

- Start a container with a host directory mapped to a container directory:

        sudo docker run -d -p 4445:4444 -p 5901:5900 -v /path/in/host:/home/seluser/Downloads selenium/standalone-firefox

- Start a container with a predefined label and a host directory mapped to a container directory:

        sudo docker run -d -l this_is_a_label -p 4445:4444 -p 5901:5900 -v /path/in/host:/home/seluser/Downloads selenium/standalone-firefox

- See running containers:

        sudo docker ps

- Stop a container named cont_name:

        sudo docker stop cont_name

- Stop every running container:

        sudo docker stop $(sudo docker ps -q)

- Stop a container labelled this_is_a_label:

        sudo docker stop $(sudo docker ps -q -f "label=this_is_a_label")

- From inside R, it is useful to know how to run sudo commands passing credentials:

        system('cat /path/to/file/cred | sudo -S ...')

  where cred is the name of a file that contains only the Linux password for the current R user, who must obviously be in the sudoers list for the command to work.
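One caveat with the `docker stop $(docker ps -q ...)` pattern above: when no container matches, the command substitution is empty and `docker stop` fails because it requires at least one argument. The following is a minimal sketch of a guard around that case. The function name `stop_labelled` and the `DOCKER` variable are hypothetical helpers introduced here for illustration; they are not part of Docker.

```shell
# Docker command prefix (assumption: sudo is needed, as in the commands above).
# Override with DOCKER=docker if your user can run docker directly.
DOCKER="${DOCKER:-sudo docker}"

# stop_labelled LABEL
# Stop every running container carrying LABEL.
# Returns 0 without calling "docker stop" when nothing matches.
stop_labelled() {
  ids=$($DOCKER ps -q -f "label=$1") || return 1
  if [ -z "$ids" ]; then
    echo "no running container with label $1"
    return 0
  fi
  # $ids is deliberately unquoted so that each container ID
  # becomes a separate argument to "docker stop"
  $DOCKER stop $ids
}
```

With the function defined, `stop_labelled this_is_a_label` could replace the last stop command in the list, and it stays quiet when the container has already gone.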
Install RSelenium
As of July 2018, RSelenium and some of its dependencies have been temporarily dropped from CRAN. Install it as follows:

    library(devtools)
    install_github('johndharrison/binman')
    install_github('johndharrison/wdman')
    install_github('ropensci/RSelenium')
Typical web driving/scraping script
- Load packages:

        lapply(c('data.table', 'RSelenium', 'rvest'), require, character.only = TRUE)

- Start the Docker container with the options required for downloads. Note that get_sudo is not defined in this document; it is assumed to be a string holding the sudo/docker command prefix, built with the credentials approach described earlier:

        system(paste(get_sudo, 'run -d -l this_is_a_label -p 4445:4444 -p 5901:5900 -v /path/to/data:/home/seluser/Downloads selenium/standalone-firefox'))

- Wait until the Docker container is actually up and running:

        while (length(system(paste(get_sudo, 'ps -q -f "label=this_is_a_label"'), intern = TRUE)) == 0) Sys.sleep(0.5)

- Set the Firefox profile so that downloaded files are saved in the desired folder on the host:

        ffx_prof <- makeFirefoxProfile(list(
            browser.download.dir = '/home/seluser/Downloads',
            browser.download.folderList = 2L,
            browser.download.manager.showWhenStarting = FALSE,
            browser.helperApps.neverAsk.saveToDisk = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        ))

- Create the remote driver, the client that connects to the Selenium server, using the above config:

        remDr <- remoteDriver(
            extraCapabilities = ffx_prof,
            remoteServerAddr = 'localhost',
            port = 4445L,
            browserName = 'firefox'
        )

- Open a browser session and navigate to a page:

        remDr$open()
        remDr$navigate("http://www.google.com/ncr")

- Execute a JS script to enable or show an element that cannot be acted upon while it is disabled or hidden. The first JS command inspects the current property value, the second one sets it:

        # inspect the current value of the property
        scrpt <- "return document.getElementById('elem_id').someproperty;"
        remDr$executeScript(scrpt, args = list("dummy"))
        # set the property to the desired value
        scrpt <- "document.getElementById('elem_id').someproperty = somevalue;"
        remDr$executeScript(scrpt, args = list("dummy"))
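The script above waits only until the container is listed by Docker, which does not by itself guarantee that the Selenium server inside it already accepts connections. A download triggered from the browser also takes some time to finish. Both waits can be expressed with one small polling helper, sketched below in shell so that it can be called from R through `system()`. The names `retry` and `downloads_finished` are hypothetical. The second one relies on the assumption that Firefox keeps a `.part` file in the download folder while a download is in progress.

```shell
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, giving up after ATTEMPTS tries.
# DELAY is passed to sleep; fractional values need GNU sleep.
retry() {
  retry_left=$1
  retry_delay=$2
  shift 2
  while [ "$retry_left" -gt 0 ]; do
    "$@" && return 0
    retry_left=$((retry_left - 1))
    sleep "$retry_delay"
  done
  return 1
}

# downloads_finished DIR
# Succeeds when DIR holds at least one file and no ".part" temporary file.
downloads_finished() {
  [ -n "$(ls -A "$1" 2>/dev/null)" ] || return 1
  for f in "$1"/*.part; do
    # if the glob matched nothing, $f is the literal pattern and -e is false
    [ -e "$f" ] && return 1
  done
  return 0
}
```

For example, `retry 60 0.5 curl -fsS http://localhost:4445/wd/hub/status` would poll the server before `remDr$open()` is called, assuming the image exposes the WebDriver status endpoint under `/wd/hub/status`. Likewise, `retry 120 1 downloads_finished /path/to/data` would wait for a download to land in the mapped host folder.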
Other Selenium Commands
```
remDr$getTitle()
```