lvalnegri

R-Selenium.md

Jul 6th, 2018

Introduction

RSelenium is an R client for the Selenium Remote WebDriver API, a project focused on automating web browsers. Its main applications are unit and regression testing of web apps and web pages across a range of browser/OS combinations (in the R environment, specifically Shiny applications), but it is also useful for scraping complex interactive and dynamic websites where other tools such as rvest fall short.

Install Docker

Currently, the best way to run a Selenium server is inside a Docker container, so we first have to install Docker. The following instructions have been tried and tested on Ubuntu 16.04 LTS only.

  • Install the dependencies first:
    sudo apt-get install \
      apt-transport-https \
      ca-certificates \
      curl \
      software-properties-common
  • Set up the docker repository for apt:
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
  • Update / Upgrade apt:
    sudo apt-get update 
    sudo apt-get upgrade
  • Check that Docker will be installed from the Docker repository rather than the default Ubuntu one:
    apt-cache policy docker-ce
  • Install Docker CE:
    sudo apt-get install docker-ce
  • Check the status:
    sudo systemctl status docker
  • Verify the installation
    sudo docker run hello-world
  • pull Selenium image:
    sudo docker pull selenium/standalone-firefox
  • start "simple" container:
    sudo docker run -d -p 4445:4444 selenium/standalone-firefox
  • start a container with a host-to-container mapped directory:
    sudo docker run -d -p 4445:4444 -p 5901:5900 -v /path/in/host:/home/seluser/Downloads selenium/standalone-firefox
  • start a labelled container with a host-to-container mapped directory:
    sudo docker run -d -l this_is_a_label -p 4445:4444 -p 5901:5900 -v /path/in/host:/home/seluser/Downloads selenium/standalone-firefox
  • see running containers:
    sudo docker ps
  • stop a container named cont_name:
    sudo docker stop cont_name
  • stop every running container:
    sudo docker stop $(sudo docker ps -q)
  • stop a container labelled this_is_a_label:
    sudo docker stop $(sudo docker ps -q -f "label=this_is_a_label")
  • from inside R, it is useful to know how to run sudo commands passing credentials:
    system('cat /path/to/file/cred | sudo -S ...')

    where cred is the name of a file that contains only the Linux password for the current R user, who must obviously be in the sudoers list for the command to work.
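
The start and stop commands above can be wrapped in two small shell functions so that the label, ports and volume mapping live in one place. This is only a sketch: the function names and the DOCKER variable are hypothetical, not part of the original notes. DOCKER defaults to `sudo docker`, and setting it to `echo` gives a dry run that prints the arguments instead of executing them.

```shell
# Hypothetical helpers wrapping the docker commands listed above.
# DOCKER is an assumed variable: the command used to invoke docker.
DOCKER=${DOCKER:-sudo docker}

# start_selenium LABEL HOST_DIR
# Starts a labelled Firefox container whose download folder is mapped to HOST_DIR.
start_selenium() {
    ss_label=$1
    ss_host_dir=$2
    $DOCKER run -d -l "$ss_label" -p 4445:4444 -p 5901:5900 \
        -v "$ss_host_dir:/home/seluser/Downloads" selenium/standalone-firefox
}

# stop_selenium LABEL
# Stops every running container carrying LABEL; does nothing if none is found.
stop_selenium() {
    ss_ids=$($DOCKER ps -q -f "label=$1")
    [ -n "$ss_ids" ] || return 0
    # $ss_ids is left unquoted on purpose: one word per container id.
    $DOCKER stop $ss_ids
}
```

For example, `start_selenium this_is_a_label /path/in/host` followed later by `stop_selenium this_is_a_label` reproduces the labelled start and stop steps from the list.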

Install RSelenium

As of July 2018, RSelenium and some of its dependencies have been temporarily dropped from CRAN. Install it as follows:

library(devtools)
install_github('johndharrison/binman')
install_github('johndharrison/wdman')
install_github('ropensci/RSelenium')

Typical web driving/scraping script

  • load packages

    lapply(c('data.table', 'RSelenium', 'rvest'), require, character.only = TRUE)
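  • start the Docker container with the options required for downloads (get_sudo is not defined in these notes; it presumably holds the credential-piping sudo prefix from the previous section, followed by docker):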

    system(paste(get_sudo, 'run -d -l this_is_a_label -p 4445:4444 -p 5901:5900 -v /path/to/data:/home/seluser/Downloads selenium/standalone-firefox'))
  • wait until the container is actually up and running:

    while( length( system(paste(get_sudo, 'ps -q -f "label=this_is_a_label"'), intern = TRUE) ) == 0 )  Sys.sleep(0.5)
  • set up a Firefox profile that saves downloaded files to the container folder mapped to the desired folder on the host:

    ffx_prof <- makeFirefoxProfile(list(
        browser.download.dir = '/home/seluser/Downloads',
        browser.download.folderList = 2L,
        browser.download.manager.showWhenStarting = FALSE,
        browser.helperApps.neverAsk.saveToDisk = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    ))
  • connect to the Selenium server using the above profile:

    remDr <- remoteDriver(
                extraCapabilities = ffx_prof, 
                remoteServerAddr = 'localhost', port = 4445L, 
                browserName = 'firefox'
    )
  • open a browser session and navigate to a page:

    remDr$open()
    remDr$navigate("http://www.google.com/ncr")
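  • execute a JS script to enable or show an element that is otherwise disabled or hidden and so cannot be acted upon:

    # read the current value of the property
    scrpt <- "return document.getElementById('elem_id').someproperty;"
    remDr$executeScript(scrpt, args = list("dummy"))
    # set the property to a new value
    scrpt <- "document.getElementById('elem_id').someproperty = somevalue;"
    remDr$executeScript(scrpt, args = list("dummy"))

    The first script only reads the property value, which is useful for inspecting it before the second script changes it.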
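
The wait step in the script above only confirms that the container is listed by docker ps; the Selenium server inside it may need a moment longer before it accepts connections. A minimal shell sketch of a generic poll-with-timeout helper follows, which could be called from R through system(). The function name wait_until and the status URL in the usage comment are assumptions, not part of the original notes; the /wd/hub/status path matches Selenium 3 standalone servers and may differ in other versions.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND once per second until it succeeds.
# Returns 0 on success, 1 if COMMAND still fails after TIMEOUT_SECONDS.
wait_until() {
    wu_timeout=$1
    shift
    wu_waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$wu_waited" -ge "$wu_timeout" ]; then
            return 1
        fi
        sleep 1
        wu_waited=$((wu_waited + 1))
    done
    return 0
}

# Usage (assumed endpoint, host port 4445 as mapped in the docker run command):
#   wait_until 30 curl -sf http://localhost:4445/wd/hub/status
```

Unlike the open-ended while loop, this gives up after the timeout, so a container that never becomes ready does not hang the script.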

Other Selenium Commands

```
remDr$getTitle()
```

Resources
