---
title: "Analysing big data with Microsoft R"
output: html_notebook
---

# Microsoft R Server and R Client

R Server is a standalone product intended for production architectures. R Client is a desktop version that allows data scientists to build workflows and optionally push them to a production server.
R Client does not support chunked data, so all operations will attempt to read the entire dataset into memory first.

Important packages:

- **RevoScaleR** for most data science and data preparation steps
- **mrsdeploy** for interacting with remote R Servers

## Working with a remote server
A remote server can be accessed on port 12800 by default using the `remoteLogin` function. Alternatively, use `remoteLoginAAD` if using Azure Active Directory.

```r
remoteLogin("http://rsvr2.westeruope.cloudapp.azure.com:12800",
            username = "admin",
            password = "Pa55w.rdPa55w.rd")
```
Executing on a remote Microsoft R Server:

- Transferring objects between sessions:
+ `putLocalObject()` transfers an R object from the local workspace to the server workspace.
+ `putLocalFile()` uploads a file from the local machine and writes it to the working directory of the remote R session.
- The `remoteLogin` function enables you to connect to a remote R Server and start an interactive session. You can use the `remoteExecute` function to run a non-interactive block of code remotely.

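The remote workflow above can be sketched as follows. This is a sketch only: it assumes the mrsdeploy package and a reachable R Server, and the endpoint, credentials, and the `airDelays` object are all hypothetical.

```r
# Sketch: log in, push a local object and file, run code remotely.
# Endpoint, credentials, and 'airDelays' are hypothetical.
library(mrsdeploy)
remoteLogin("http://rsvr.example.com:12800",
            username = "admin", password = "<password>",
            session = TRUE, commandline = FALSE)
putLocalObject(c("airDelays"))   # copy a local R object to the server workspace
putLocalFile("2000.csv")         # upload a file to the remote working directory
res <- remoteExecute("mean(airDelays$Delay, na.rm = TRUE)")  # non-interactive block
remoteLogout()
```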
## The ScaleR functions

- Process data that is too big to fit into memory.
- Work on chunked data.
- XDF (external data frame) file format.
- Function names start with rx, for data manipulation or analysis functions, or Rx, for class constructors for specific data sources or compute contexts.
- ScaleR functions need two pieces of information: 1) where the computation should take place (the compute context: local, server, cluster, database), and 2) what data to use (the data source: text file, XDF, database connection).

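A minimal sketch of those two pieces of information; the file name is illustrative, and RevoScaleR is only available with Microsoft R Server/Client:

```r
# Sketch: every ScaleR call combines a compute context and a data source.
library(RevoScaleR)
rxSetComputeContext(RxLocalSeq())        # 1) where: run locally, sequentially
flightsCsv <- RxTextData("2000.csv")     # 2) what: a delimited text data source
rxSummary(~ ArrDelay, data = flightsCsv) # processed chunk by chunk, not all in memory
```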
# Exploring Big Data

- Not all data sources are available in every compute context.
- Use **RxOdbcData** or **RxSqlServerData** to connect to a SQL Server.

```r
# Import airport data from the Airports table in the AirlineData database, and save it as an XDF file
conString <- "Server=LON-SQLR;Database=AirlineData;Trusted_Connection=TRUE"
airportData <- RxSqlServerData(connectionString = conString, table = "Airports")
```
Important functions:

- **rxImport**. Used to import a non-XDF source to XDF. If you do not specify an outFile, it will return a data frame and store the data in memory.
+ Use colClasses to override the type of columns, useful for indicating factors.
+ Use the transforms argument if you want to make a transformation while reading in the data.

```r
# Transform the data - create a combined Delay column, filter all cancelled flights, and discard FlightNum, TailNum, and CancellationCode
# Test import and transform over a small sample first
flightDataSampleXDF <- rxImport(inData = "\\\\LON-RSVR\\Data\\2000.csv",
    outFile = "\\\\LON-RSVR\\Data\\Sample.xdf",
    overwrite = TRUE, append = "none", colClasses = flightDataColumns,
    transforms = list(
        Delay = ArrDelay + DepDelay +
            ifelse(is.na(CarrierDelay), 0, CarrierDelay) +
            ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
            ifelse(is.na(NASDelay), 0, NASDelay) +
            ifelse(is.na(SecurityDelay), 0, SecurityDelay) +
            ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
        MonthName = factor(month.name[as.numeric(Month)], levels = month.name)),
    rowSelection = (Cancelled == 0),
    varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
    numRows = 1000
)
```

Examining and modifying the structure of an XDF object:

- **rxGetInfo**. Returns metadata about the XDF file itself.
- **rxGetVarInfo**. Returns metadata describing each variable.
- **rxSetInfo**. Sets the description of the XDF file.
- **rxSetVarInfo**. Used to change the names of variables and the levels of factors.

Changing the labels for a factor:
```r
varInfo <- rxGetVarInfo(xdfSource)
varInfo$Cancelled$levels <- c("No", "Yes")
rxSetVarInfo(varInfo, xdfSource)
```

rxCube

- Produces output in a format that is ideal for graphs.
- Takes transforms as an argument, for instance to convert characters to factors.

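A hedged sketch of the rxCube-to-plot workflow; `flightsXDF`, `Delay`, and `MonthName` are assumed names based on the earlier import example:

```r
# Sketch: cube the mean Delay by MonthName, then convert for plotting.
delayCube <- rxCube(Delay ~ MonthName, data = flightsXDF, means = TRUE)
delayDF   <- rxResultsDF(delayCube)           # rxLinePlot needs a data frame
rxLinePlot(Delay ~ MonthName, data = delayDF)
```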
# Visualising Big Data

ggplot2 (for in-memory data):

- facet_grid
- facet_wrap
- Use the gridExtra package to arrange different plots alongside each other: *grid.arrange(plot1, plot2, ncol = 2)*

RevoScaleR includes plots to use for big data:

- rxLinePlot
+ Use the type argument to specify the chart type; the default is line. Use "p" for a scatter plot.
+ The input data must be a data frame, so if you have data returned from rxCube, for instance, you have to convert it to a data frame using **rxResultsDF**.
- rxHistogram
+ The variable we are going to analyse is on the right-hand side of ~
```r
## Create a distribution of departure delay, one histogram for each month
rxHistogram(~ DepDelay | Month, data = flights_c)
```


# Processing Big Data

Transformation:

- Transformations can be permanent or transient.
+ Depends on how often the transformation needs to be used.
- **rxDataStep**
+ Used to implement transformations on XDF objects.
+ If you are adding a categorical variable, specify the levels and labels explicitly to avoid inconsistencies between chunks; prepare the known range of values up front.
+ Specify external packages in the transformPackages argument.
+ Use .rxGet and .rxSet to pass information between chunks.
+ Use rowsPerRead to override the chunk size.
+ A block is called a chunk when the block is read into memory.
+ Use transformFunc when the transformation is complex (e.g. a total sum across all chunks).
+ transformVars is used to pass in the variables needed by the transformFunc.
+ *transformFunc*: an R function whose first argument and return value are named R lists with equal-length vector elements. The output list can contain modifications to the input elements as well as newly named elements. R lists, instead of R data frames, are used to minimize function execution timings, implying that row name information will not be available during calculations.
+ *transformFunc*: can be quite complex, especially when your transformation depends on values from previous chunks.
+ *transformVars*: a character vector selecting the names of the variables to be represented as list elements for the input to transformFunc.

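A sketch of a transformFunc, assuming an XDF source `flightsXDF` with a `Delay` column; the grand mean and standard deviation are made-up values injected via transformObjects:

```r
# Sketch: standardize Delay chunk by chunk using precomputed statistics.
standardizeDelay <- function(dataList) {
  # dataList is a named list of equal-length vectors, one per transformVar
  dataList$StdDelay <- (dataList$Delay - delayMean) / delaySD
  dataList
}
rxDataStep(inData = flightsXDF, outFile = "flightsStd.xdf", overwrite = TRUE,
           transformFunc = standardizeDelay,
           transformVars = c("Delay"),
           transformObjects = list(delayMean = 12.5, delaySD = 31.2))
```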
Do **not** sort big data! If you have to, use rxSort.

rxMerge

- Data needs to be sorted first, which makes joins expensive.
- union: append one dataset to another vertically (rows).
- oneToOne: append one dataset to another horizontally (columns).
- If joining datasets across factor levels, ensure that the variables in both datasets have the same factor levels. If not, use **rxFactors** on one or both datasets.
- The joining column names must be the same in the two datasets.

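The rules above might look like this in practice; the dataset and column names are assumptions:

```r
# Sketch: join flight rows to airport details on a shared Origin column.
# rxMerge sorts both inputs on matchVars first (autoSort = TRUE), which is
# why big-data joins are expensive.
joinedXDF <- rxMerge(inData1 = flightsXDF, inData2 = airportsXDF,
                     outFile = "joined.xdf", overwrite = TRUE,
                     type = "inner", matchVars = "Origin", autoSort = TRUE)
# type = "union" appends rows; type = "oneToOne" appends columns.
```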
# Parallelising Analysis Operations

- Using the RxLocalParallel compute context
- High-Performance Analytics (HPA)
- High-Performance Computing (HPC)
- **rxExec**: Intended for use in parallel environments such as clusters; the workhorse function.

rxExec
- Enables you to run arbitrary R code in parallel on distributed computing resources.

The doRSR package enables you to use rxExec as a back end for the %dopar% syntax.

RevoPemaR

- Parallel external-memory algorithms.
- For really big datasets that cannot fit in memory.
- Which functions do you need to set up in a Pema class?
- Which classes are important?
- outputTrace for debugging.
- outputTraceLevel.

### High-Performance Computing (HPC) problems

- Little or no effort is required to split up the processing into tasks that can be run in parallel. In other words, the tasks are not dependent on each other, so they can be easily separated to run on different nodes.
- *foreach* package: used for parallelizing loops.
- The *RevoScaleR* package includes functions that enable you to do HPC and embarrassingly parallel computations, in addition to HPA operations. Just like the rx* functions, these functions make use of the "write once, deploy anywhere" model, where you can write your code and check that it works locally before deploying it to a more powerful remote server by simply changing the compute context in which it runs.


rxExec:

- Runs an arbitrary R function on the nodes or cores of a cluster.
- Use the rxExec function to perform traditional HPC tasks by executing a function in parallel across the nodes of a cluster or the cores of a remote server. Unlike the HPA ScaleR functions, you need to control how the computational tasks are distributed, and you are responsible for any aggregation and final processing of results.
- Exposes its raw power in a parallel environment, such as a cluster.
- Use the **RxLocalParallel** compute context to distribute computations if you do not have a cluster available.
- The **RxLocalParallel** compute context utilizes the doParallel back end for HPC computations.
- You can only use the **RxLocalParallel** compute context with **rxExec**; it is ignored by the other ScaleR HPA functions, which handle parallel and distributed computing in their own way.
- Only well suited to parallel execution such as:
+ Embarrassingly parallel tasks where individual subtasks are not dependent upon each other.
+ These include mathematical simulations,
+ bootstrap replicates of relatively small models,
+ image processing, brute-force searches,
+ growing trees in a random forest algorithm,
+ and almost any situation where you would use the lapply function on a single-core computer.
- The default compute context of **rxExec** is **RxLocalSeq**. This enables only sequential processing.
- Change the compute context to **RxLocalParallel** to parallelize.
+ To change to **RxLocalParallel**, you first create an RxLocalParallel object to use with rxExec, and then set this as the main compute context:

```r
parallelContext <- RxLocalParallel()
rxSetComputeContext(parallelContext)
```
- There are two primary use cases for running rxExec:
+ To run a function multiple times, collecting the results in a list.
+ As a parallel lapply-type function that operates on each object of a list, vector, or similarly iterable object.
- *timesToRun* argument: specifies the number of times to run the function.
- *taskChunkSize* argument: specifies the number of tasks that should be allocated to each node.
+ For example, if you set timesToRun to 5000 and you have a five-node cluster, you can set the *taskChunkSize* to 1000 to force each node to perform 1,000 iterations of the task rather than letting the master node decide.
- *elemArgs* argument: The rxExec function applies the function f to each element specified in elemArgs. The number of times to run is then given by the length of the elemArgs vector.
- *rxElemArg*: By default, the same argument value is passed to each of the nodes or cores. If instead a vector or list of argument values is wrapped in a call to rxElemArg, a distinct argument value is passed to each node or core.
- The **foreach** package provides a popular way to perform parallel processing in base R.
+ The **doRSR** package provides a parallel back end for the %dopar% operator in foreach, built on top of rxExec.
- **wait = FALSE**: By default, compute contexts are waiting (or blocking).
+ You might prefer to send the job out to the cluster and continue working in your local R session.
+ To do this, you can specify a compute context as non-waiting (or non-blocking), which returns control of the local session after the remote session has been started.
+ You cannot define a local compute context (RxLocalSeq or RxLocalParallel) as non-waiting.
+ Use **rxGetJobStatus** to check on the progress of the job.
+ Use **rxGetResults** to retrieve the results of the completed job.
- You might have a cluster where each node has several cores.
+ You could then run an independent analysis on each node, with the HPA functions making use of the available cores on their assigned node. To do this, set the **elemType** argument in the rxExec function to "nodes".

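The two primary use cases can be sketched like this; the simulation function is a toy example, and RxLocalParallel stands in for a real cluster:

```r
# Sketch: rxExec in the RxLocalParallel compute context.
rxSetComputeContext(RxLocalParallel())
rollDice <- function() sum(sample(1:6, size = 2, replace = TRUE))  # toy task
games   <- rxExec(rollDice, timesToRun = 1000)       # use case 1: run n times
squares <- rxExec(function(x) x^2, rxElemArg(1:10))  # use case 2: parallel lapply
rxSetComputeContext(RxLocalSeq())                    # back to sequential
```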
# RevoScaleR Functions for Spark on Hadoop

- RxSpark (recommended): a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.
- RxHadoopMR (deprecated): a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or on a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.
- On the Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.


### Parallel External Memory Algorithms (PEMAs)

- They can work with chunked data too big to fit in memory, and these chunks can be processed in parallel. The results are then combined and processed, either at the end of each iteration or at the very end of the computation.
- **RevoPemaR** package.
- In the PEMA model, a master node splits the data into chunks that can be processed in parallel and distributes the processing to multiple client nodes. Each client node can operate on its data in isolation, and sends its results back to the master node, which is responsible for combining the results from the individual nodes before returning the final result.

Constructing a PEMA class:

- initialize: sets the initial field values.
- processData: controls how each chunk is processed and aggregates results for all chunks processed by a single node.
- updateResults: collects the results from the other nodes together in one place.
- processResults: calculates the final result.
- getVarsToUse: specifies the names of the variables to use; optional.
- The processData method is called once for each chunk of data a node is assigned by the master node, so the same node is likely to be used many times. Make sure that you accumulate (add to) the results of each run of the processData method in the fields of the object.
- Important: Don't assume that the updateResults method will always run. In a single-node environment, or if the master node does not distribute the work, the aggregated results will be available in the only running instance of the PEMA object, so there is no need for the master node to call updateResults.
- Debugging information:
+ Use **outputTrace**.
+ Set the *traceLevel* field in the PEMA object.
+ Specify a value for outputTraceLevel when calling outputTrace.
+ Trace messages are displayed if traceLevel >= outputTraceLevel.

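The methods above fit together roughly as in this sketch, modeled on the PemaMean example from the RevoPemaR documentation; it assumes the RevoPemaR package from Microsoft R, and the field names are illustrative:

```r
# Sketch: a PEMA class that computes a mean across chunks and nodes.
library(RevoPemaR)
PemaMean <- setPemaClass("PemaMean",
  baseClass = getPemaBaseClass(),
  fields = list(sum = "numeric", totalObs = "numeric",
                mean = "numeric", varName = "character"),
  methods = list(
    initialize = function(varName = "", ...) {
      callSuper(...)
      varName <<- varName; sum <<- 0; totalObs <<- 0; mean <<- 0
    },
    processData = function(dataList) {       # runs once per chunk on a node
      sum <<- sum + sum(as.numeric(dataList[[varName]]), na.rm = TRUE)
      totalObs <<- totalObs + length(dataList[[varName]])
      invisible(NULL)
    },
    updateResults = function(pemaMeanObj) {  # master combines node results
      sum <<- sum + pemaMeanObj$sum
      totalObs <<- totalObs + pemaMeanObj$totalObs
      invisible(NULL)
    },
    processResults = function() {            # final computation
      if (totalObs > 0) mean <<- sum / totalObs
      mean
    },
    getVarsToUse = function() varName
  ))
# Run with pemaCompute(PemaMean(varName = "ArrDelay"), data = flightsXDF)
```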
# Creating and Evaluating Regression Models

## Clustering big data

- No labeled response data (unsupervised learning).
- Attempts to split data into natural groups.
- Useful for reducing datasets into subsets of similar data; dimension reduction.

K-means clustering
(Setosa, a website exploring the concept in a visual way)

- You have to determine the centers and the number of clusters.
- There is no real measure of "accuracy" in a cluster analysis.
- Calculate the ratio of "between-cluster sums of squares" to "total sums of squares": clust$betweenss / clust$totss. The higher the better, but **not** 100%, which indicates too many groups (one for each point).
- Usually works on numeric data.
- **rxKmeans**
- Variables on different scales can bias the distance calculation; variation in one variable can "swamp" that of the others.
+ You can z-transform (0 mean and 1 standard deviation) the data used by rxKmeans to avoid that problem, e.g. with ***scale()***.
+ This ensures the variables contribute their variation equally.
- Time-consuming for large datasets.
- Best to run models for several values of k over a representative sample of your data.
+ Take the centers from the best model and feed them to the centers argument of the k-means algorithm for the entire dataset.
- A .rxCluster column is added to the data when running rxKmeans, containing the predictions (groupings).

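The scaling point can be demonstrated with base R's kmeans, which rxKmeans mirrors for chunked data; the synthetic income/age data here is made up:

```r
# Runnable base-R illustration of why scaling matters for k-means.
set.seed(42)
raw <- cbind(income = rnorm(300, mean = 50000, sd = 15000),  # large scale
             age    = rnorm(300, mean = 40,    sd = 10))     # small scale
fit <- kmeans(scale(raw), centers = 3, nstart = 20)  # z-transform first
ratio <- fit$betweenss / fit$totss  # higher is tighter, but 1.0 would mean
                                    # one cluster per point (overfitting)
```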
Regressions

```r
mod1 <- rxLinMod(y ~ x, data = mydata,
                 covCoef = TRUE, dropFirst = TRUE)
```
- Test that the errors are normally distributed.
- GLMs relax the assumptions of linear regression.
- Use summary to examine the model.
- Use **rxPredict** to run predictions.
+ Adds a prediction column to the data frame.

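A sketch of scoring with rxPredict, assuming `mod1` from the snippet above and a hypothetical data frame `newData` containing an `x` column:

```r
# Sketch: rxPredict appends a prediction column (named <response>_Pred).
scored <- rxPredict(modelObject = mod1, data = newData,
                    writeModelVars = TRUE)  # keep model variables alongside
head(scored)  # contains y_Pred plus the model variables
```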
Logistic regression:

- A variant of linear regression where the response variable is binary categorical.
- rxLogit
- The receiver operating characteristic (ROC) curve gives visual information on the predictive power of the model.
- **rxRocCurve**
+ Plots the true positive rate against the false positive rate.
+ If the curve is a straight line (45-degree line), then the model is no better than random guessing.

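Putting rxLogit and rxRocCurve together; all dataset and column names below are assumptions:

```r
# Sketch: fit a binary-response model, score it, then plot the ROC curve.
logitModel <- rxLogit(Cancelled ~ DepDelay + MonthName, data = flightsXDF)
scoredXDF  <- rxPredict(logitModel, data = flightsXDF,
                        outData = "scored.xdf", overwrite = TRUE,
                        writeModelVars = TRUE)
rxRocCurve(actualVarName = "Cancelled",
           predVarNames  = "Cancelled_Pred", data = scoredXDF)
```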
GLM:

- Takes a link function and a family as input.
- **rxGLM**

Cube regressions:

- The linear model and GLM functions can use cubes to run large regression models.
- Use **rxCube** to create cross-classified factors.

# Creating and Evaluating Partitioning Models

# Processing Big Data in SQL Server and Hadoop