---
title: "Analysing big data with Microsoft R"
output: html_notebook
---

# Microsoft R Server and R Client

R Server is a standalone product intended for production architectures. R Client is a desktop version that allows data scientists to build workflows and optionally push them to a production server.
R Client does not support chunked data, so all operations will attempt to read the entire dataset into memory first.

Important packages:

- **RevoScaleR** for most data science and data preparation steps
- **mrsdeploy** for interacting with remote R Servers

## Working with a remote server
A remote server can be accessed on port 12800 by default using the `remoteLogin` function. Alternatively, use `remoteLoginAAD` if using Azure Active Directory.

```r
remoteLogin("http://rsvr2.westeruope.cloudapp.azure.com:12800",
            username = "admin",
            password = "Pa55w.rdPa55w.rd")
```
Executing on a remote Microsoft R Server:

- Transferring objects between sessions:
+ `putLocalObject()` transfers an R object from the local workspace to the server workspace.
+ `putLocalFile()` uploads a file from the local machine and writes it to the working directory of the remote R session.
- The `remoteLogin` function enables you to connect to a remote R Server and start an interactive session. You can use the `remoteExecute` function to run a non-interactive block of code remotely.

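The remote workflow above can be sketched as follows. This is a sketch only: it assumes the mrsdeploy package and a reachable R Server, and the endpoint, credentials, and the `airDelays` object are all hypothetical.

```r
# Sketch: log in, push a local object and file, run code remotely.
# Endpoint, credentials, and 'airDelays' are hypothetical.
library(mrsdeploy)
remoteLogin("http://rsvr.example.com:12800",
            username = "admin", password = "<password>",
            session = TRUE, commandline = FALSE)
putLocalObject(c("airDelays"))   # copy a local R object to the server workspace
putLocalFile("2000.csv")         # upload a file to the remote working directory
res <- remoteExecute("mean(airDelays$Delay, na.rm = TRUE)")  # non-interactive block
remoteLogout()
```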
## The ScaleR functions

- Process data that is too big to fit into memory.
- Work on chunked data.
- XDF (external data frame) file format.
- Function names start with rx, for data manipulation or analysis functions, or Rx, for class constructors for specific data sources or compute contexts.
- ScaleR functions need two pieces of information: 1) where the computation should take place (the compute context: local, server, cluster, database), and 2) what data to use (the data source: text file, XDF, database connection).

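A minimal sketch of those two pieces of information; the file name is illustrative, and RevoScaleR is only available with Microsoft R Server/Client:

```r
# Sketch: every ScaleR call combines a compute context and a data source.
library(RevoScaleR)
rxSetComputeContext(RxLocalSeq())        # 1) where: run locally, sequentially
flightsCsv <- RxTextData("2000.csv")     # 2) what: a delimited text data source
rxSummary(~ ArrDelay, data = flightsCsv) # processed chunk by chunk, not all in memory
```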
# Exploring Big Data

- Not all data sources are available in every compute context.
- Use **RxOdbcData** or **RxSqlServerData** to connect to a SQL Server.

```r
# Import airport data from the Airports table in the AirlineData database, and save it as an XDF file
conString <- "Server=LON-SQLR;Database=AirlineData;Trusted_Connection=TRUE"
airportData <- RxSqlServerData(connectionString = conString, table = "Airports")
```
Important functions:

- **rxImport**. Used to import a non-XDF source to XDF. If you do not specify an outFile, it will return a data frame and store the data in memory.
+ Use colClasses to override the type of columns, useful for indicating factors.
+ Use the transforms argument if you want to make a transformation while reading in the data.

```r
# Transform the data - create a combined Delay column, filter all cancelled flights, and discard FlightNum, TailNum, and CancellationCode
# Test import and transform over a small sample first
flightDataSampleXDF <- rxImport(inData = "\\\\LON-RSVR\\Data\\2000.csv",
    outFile = "\\\\LON-RSVR\\Data\\Sample.xdf",
    overwrite = TRUE, append = "none", colClasses = flightDataColumns,
    transforms = list(
        Delay = ArrDelay + DepDelay +
            ifelse(is.na(CarrierDelay), 0, CarrierDelay) +
            ifelse(is.na(WeatherDelay), 0, WeatherDelay) +
            ifelse(is.na(NASDelay), 0, NASDelay) +
            ifelse(is.na(SecurityDelay), 0, SecurityDelay) +
            ifelse(is.na(LateAircraftDelay), 0, LateAircraftDelay),
        MonthName = factor(month.name[as.numeric(Month)], levels = month.name)),
    rowSelection = (Cancelled == 0),
    varsToDrop = c("FlightNum", "TailNum", "CancellationCode"),
    numRows = 1000
)
```

Examining and modifying the structure of an XDF object:

- **rxGetInfo**. Returns metadata about the XDF file itself.
- **rxGetVarInfo**. Returns metadata describing each variable.
- **rxSetInfo**. Sets the description of the XDF file.
- **rxSetVarInfo**. Used to change the names of variables and the levels of factors.

Changing the labels for a factor:
```r
varInfo <- rxGetVarInfo(xdfSource)
varInfo$Cancelled$levels <- c("No", "Yes")
rxSetVarInfo(varInfo, xdfSource)
```

rxCube

- Produces output in a format that is ideal for graphs.
- Takes transforms as an argument, for instance to convert characters to factors.

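A hedged sketch of the rxCube-to-plot workflow; `flightsXDF`, `Delay`, and `MonthName` are assumed names based on the earlier import example:

```r
# Sketch: cube the mean Delay by MonthName, then convert for plotting.
delayCube <- rxCube(Delay ~ MonthName, data = flightsXDF, means = TRUE)
delayDF   <- rxResultsDF(delayCube)           # rxLinePlot needs a data frame
rxLinePlot(Delay ~ MonthName, data = delayDF)
```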
# Visualising Big Data

ggplot2 (for in-memory data):

- facet_grid
- facet_wrap
- Use the gridExtra package to arrange different plots alongside each other: *grid.arrange(plot1, plot2, ncol = 2)*

RevoScaleR includes plots to use for big data:

- rxLinePlot
+ Use the type argument to specify the chart type; the default is line. Use "p" for a scatter plot.
+ The input data must be a data frame, so if you have data returned from rxCube, for instance, you have to convert it to a data frame using **rxResultsDF**.
- rxHistogram
+ The variable we are going to analyse is on the right-hand side of ~
```r
## Create a distribution of departure delay, one histogram for each month
rxHistogram(~ DepDelay | Month, data = flights_c)
```


# Processing Big Data

Transformation:

- Transformations can be permanent or transient.
+ Depends on how often the transformation needs to be used.
- **rxDataStep**
+ Used to implement transformations on XDF objects.
+ If you are adding a categorical variable, specify the levels and labels explicitly to avoid inconsistencies between chunks; prepare the known range of values up front.
+ Specify external packages in the transformPackages argument.
+ Use .rxGet and .rxSet to pass information between chunks.
+ Use rowsPerRead to override the chunk size.
+ A block is called a chunk when the block is read into memory.
+ Use transformFunc when the transformation is complex (e.g. a total sum across all chunks).
+ transformVars is used to pass in the variables needed by the transformFunc.
+ *transformFunc*: an R function whose first argument and return value are named R lists with equal-length vector elements. The output list can contain modifications to the input elements as well as newly named elements. R lists, instead of R data frames, are used to minimize function execution timings, implying that row name information will not be available during calculations.
+ *transformFunc*: can be quite complex, especially when your transformation depends on values from previous chunks.
+ *transformVars*: a character vector selecting the names of the variables to be represented as list elements for the input to transformFunc.

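A sketch of a transformFunc, assuming an XDF source `flightsXDF` with a `Delay` column; the grand mean and standard deviation are made-up values injected via transformObjects:

```r
# Sketch: standardize Delay chunk by chunk using precomputed statistics.
standardizeDelay <- function(dataList) {
  # dataList is a named list of equal-length vectors, one per transformVar
  dataList$StdDelay <- (dataList$Delay - delayMean) / delaySD
  dataList
}
rxDataStep(inData = flightsXDF, outFile = "flightsStd.xdf", overwrite = TRUE,
           transformFunc = standardizeDelay,
           transformVars = c("Delay"),
           transformObjects = list(delayMean = 12.5, delaySD = 31.2))
```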
Do **not** sort big data! If you have to, use rxSort.

rxMerge

- Data needs to be sorted first, which makes joins expensive.
- union: append one dataset to another vertically (rows).
- oneToOne: append one dataset to another horizontally (columns).
- If joining datasets across factor levels, ensure that the variables in both datasets have the same factor levels. If not, use **rxFactors** on one or both datasets.
- The joining column names must be the same in the two datasets.

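The rules above might look like this in practice; the dataset and column names are assumptions:

```r
# Sketch: join flight rows to airport details on a shared Origin column.
# rxMerge sorts both inputs on matchVars first (autoSort = TRUE), which is
# why big-data joins are expensive.
joinedXDF <- rxMerge(inData1 = flightsXDF, inData2 = airportsXDF,
                     outFile = "joined.xdf", overwrite = TRUE,
                     type = "inner", matchVars = "Origin", autoSort = TRUE)
# type = "union" appends rows; type = "oneToOne" appends columns.
```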
# Parallelising Analysis Operations

- Using the RxLocalParallel compute context
- High-Performance Analytics (HPA)
- High-Performance Computing (HPC)
- **rxExec**: Intended for use in parallel environments such as clusters; the workhorse function.

rxExec
- Enables you to run arbitrary R code in parallel on distributed computing resources.

The doRSR package enables you to use rxExec as a back end for the %dopar% syntax.

RevoPemaR

- Parallel external-memory algorithms.
- For really big datasets that cannot fit in memory.
- Which functions do you need to set up in a Pema class?
- Which classes are important?
- outputTrace for debugging.
- outputTraceLevel.

### High-Performance Computing (HPC) problems

- Little or no effort is required to split up the processing into tasks that can be run in parallel. In other words, the tasks are not dependent on each other, so they can be easily separated to run on different nodes.
- *foreach* package: used for parallelizing loops.
- The *RevoScaleR* package includes functions that enable you to do HPC and embarrassingly parallel computations, in addition to HPA operations. Just like the rx* functions, these functions make use of the "write once, deploy anywhere" model, where you can write your code and check that it works locally before deploying it to a more powerful remote server by simply changing the compute context in which it runs.


rxExec:

- Runs an arbitrary R function on the nodes or cores of a cluster.
- Use the rxExec function to perform traditional HPC tasks by executing a function in parallel across the nodes of a cluster or the cores of a remote server. Unlike the HPA ScaleR functions, you need to control how the computational tasks are distributed, and you are responsible for any aggregation and final processing of results.
- Exposes its raw power in a parallel environment, such as a cluster.
- Use the **RxLocalParallel** compute context to distribute computations if you do not have a cluster available.
- The **RxLocalParallel** compute context utilizes the doParallel back end for HPC computations.
- You can only use the **RxLocalParallel** compute context with **rxExec**; it is ignored by the other ScaleR HPA functions, which handle parallel and distributed computing in their own way.
- Only well suited to parallel execution such as:
+ Embarrassingly parallel tasks where individual subtasks are not dependent upon each other.
+ These include mathematical simulations,
+ bootstrap replicates of relatively small models,
+ image processing, brute-force searches,
+ growing trees in a random forest algorithm,
+ and almost any situation where you would use the lapply function on a single-core computer.
- The default compute context of **rxExec** is **RxLocalSeq**. This enables only sequential processing.
- Change the compute context to **RxLocalParallel** to parallelize.
+ To change to **RxLocalParallel**, you first create an RxLocalParallel object to use with rxExec, and then set this as the main compute context:

```r
parallelContext <- RxLocalParallel()
rxSetComputeContext(parallelContext)
```
- There are two primary use cases for running rxExec:
+ To run a function multiple times, collecting the results in a list.
+ As a parallel lapply-type function that operates on each object of a list, vector, or similarly iterable object.
- *timesToRun* argument: specifies the number of times to run the function.
- *taskChunkSize* argument: specifies the number of tasks that should be allocated to each node.
+ For example, if you set timesToRun to 5000 and you have a five-node cluster, you can set the *taskChunkSize* to 1000 to force each node to perform 1,000 iterations of the task rather than letting the master node decide.
- *elemArgs* argument: The rxExec function applies the function f to each element specified in elemArgs. The number of times to run is then given by the length of the elemArgs vector.
- *rxElemArg*: By default, the same argument value is passed to each of the nodes or cores. If instead a vector or list of argument values is wrapped in a call to rxElemArg, a distinct argument value is passed to each node or core.
- The **foreach** package provides a popular way to perform parallel processing in base R.
+ The **doRSR** package provides a parallel back end for the %dopar% operator in foreach, built on top of rxExec.
- **wait = FALSE**: By default, compute contexts are waiting (or blocking).
+ You might prefer to send the job out to the cluster and continue working in your local R session.
+ To do this, you can specify a compute context as non-waiting (or non-blocking), which returns control of the local session after the remote session has been started.
+ You cannot define a local compute context (RxLocalSeq or RxLocalParallel) as non-waiting.
+ Use **rxGetJobStatus** to check on the progress of the job.
+ Use **rxGetResults** to retrieve the results of the completed job.
- You might have a cluster where each node has several cores.
+ You could then run an independent analysis on each node, with the HPA functions making use of the available cores on their assigned node. To do this, set the **elemType** argument in the rxExec function to "nodes".

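The two primary use cases can be sketched like this; the simulation function is a toy example, and RxLocalParallel stands in for a real cluster:

```r
# Sketch: rxExec in the RxLocalParallel compute context.
rxSetComputeContext(RxLocalParallel())
rollDice <- function() sum(sample(1:6, size = 2, replace = TRUE))  # toy task
games   <- rxExec(rollDice, timesToRun = 1000)       # use case 1: run n times
squares <- rxExec(function(x) x^2, rxElemArg(1:10))  # use case 2: parallel lapply
rxSetComputeContext(RxLocalSeq())                    # back to sequential
```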
# RevoScaleR Functions for Spark on Hadoop

- RxSpark (recommended): a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.
- RxHadoopMR (deprecated): a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or on a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.
- On the Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.


### Parallel External Memory Algorithms (PEMAs)

- They can work with chunked data too big to fit in memory, and these chunks can be processed in parallel. The results are then combined and processed, either at the end of each iteration or at the very end of the computation.
- **RevoPemaR** package.
- In the PEMA model, a master node splits the data into chunks that can be processed in parallel and distributes the processing to multiple client nodes. Each client node can operate on its data in isolation, and sends its results back to the master node, which is responsible for combining the results from the individual nodes before returning the final result.

Constructing a PEMA class:

- initialize: sets the initial field values.
- processData: controls how each chunk is processed and aggregates results for all chunks processed by a single node.
- updateResults: collects the results from the other nodes together in one place.
- processResults: calculates the final result.
- getVarsToUse: specifies the names of the variables to use; optional.
- The processData method is called once for each chunk of data a node is assigned by the master node, so the same node is likely to be used many times. Make sure that you accumulate (add to) the results of each run of the processData method in the fields of the object.
- Important: Don't assume that the updateResults method will always run. In a single-node environment, or if the master node does not distribute the work, the aggregated results will be available in the only running instance of the PEMA object, so there is no need for the master node to call updateResults.
- Debugging information:
+ Use **outputTrace**.
+ Set the *traceLevel* field in the PEMA object.
+ Specify a value for outputTraceLevel when calling outputTrace.
+ Trace messages are displayed if traceLevel >= outputTraceLevel.

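The methods above fit together roughly as in this sketch, modeled on the PemaMean example from the RevoPemaR documentation; it assumes the RevoPemaR package from Microsoft R, and the field names are illustrative:

```r
# Sketch: a PEMA class that computes a mean across chunks and nodes.
library(RevoPemaR)
PemaMean <- setPemaClass("PemaMean",
  baseClass = getPemaBaseClass(),
  fields = list(sum = "numeric", totalObs = "numeric",
                mean = "numeric", varName = "character"),
  methods = list(
    initialize = function(varName = "", ...) {
      callSuper(...)
      varName <<- varName; sum <<- 0; totalObs <<- 0; mean <<- 0
    },
    processData = function(dataList) {       # runs once per chunk on a node
      sum <<- sum + sum(as.numeric(dataList[[varName]]), na.rm = TRUE)
      totalObs <<- totalObs + length(dataList[[varName]])
      invisible(NULL)
    },
    updateResults = function(pemaMeanObj) {  # master combines node results
      sum <<- sum + pemaMeanObj$sum
      totalObs <<- totalObs + pemaMeanObj$totalObs
      invisible(NULL)
    },
    processResults = function() {            # final computation
      if (totalObs > 0) mean <<- sum / totalObs
      mean
    },
    getVarsToUse = function() varName
  ))
# Run with pemaCompute(PemaMean(varName = "ArrDelay"), data = flightsXDF)
```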
# Creating and Evaluating Regression Models

## Clustering big data

- No labeled response data (unsupervised learning).
- Attempts to split data into natural groups.
- Useful for reducing datasets into subsets of similar data; dimension reduction.

K-means clustering
(Setosa, a website exploring the concept in a visual way)

- You have to determine the centers and the number of clusters.
- There is no real measure of "accuracy" in a cluster analysis.
- Calculate the ratio of "between-cluster sums of squares" to "total sums of squares": clust$betweenss / clust$totss. The higher the better, but **not** 100%, which indicates too many groups (one for each point).
- Usually works on numeric data.
- **rxKmeans**
- Variables on different scales can bias the distance calculation; variation in one variable can "swamp" that of the others.
+ You can z-transform (0 mean and 1 standard deviation) the data used by rxKmeans to avoid that problem, e.g. with ***scale()***.
+ This ensures the variables contribute their variation equally.
- Time-consuming for large datasets.
- Best to run models for several values of k over a representative sample of your data.
+ Take the centers from the best model and feed them to the centers argument of the k-means algorithm for the entire dataset.
- A .rxCluster column is added to the data when running rxKmeans, containing the predictions (groupings).

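The scaling point can be demonstrated with base R's kmeans, which rxKmeans mirrors for chunked data; the synthetic income/age data here is made up:

```r
# Runnable base-R illustration of why scaling matters for k-means.
set.seed(42)
raw <- cbind(income = rnorm(300, mean = 50000, sd = 15000),  # large scale
             age    = rnorm(300, mean = 40,    sd = 10))     # small scale
fit <- kmeans(scale(raw), centers = 3, nstart = 20)  # z-transform first
ratio <- fit$betweenss / fit$totss  # higher is tighter, but 1.0 would mean
                                    # one cluster per point (overfitting)
```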
Regressions

```r
mod1 <- rxLinMod(y ~ x, data = mydata,
                 covCoef = TRUE, dropFirst = TRUE)
```
- Test that the errors are normally distributed.
- GLMs relax the assumptions of linear regression.
- Use summary to examine the model.
- Use **rxPredict** to run predictions.
+ Adds a prediction column to the data frame.

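A sketch of scoring with rxPredict, assuming `mod1` from the snippet above and a hypothetical data frame `newData` containing an `x` column:

```r
# Sketch: rxPredict appends a prediction column (named <response>_Pred).
scored <- rxPredict(modelObject = mod1, data = newData,
                    writeModelVars = TRUE)  # keep model variables alongside
head(scored)  # contains y_Pred plus the model variables
```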
Logistic regression:

- A variant of linear regression where the response variable is binary categorical.
- rxLogit
- The receiver operating characteristic (ROC) curve gives visual information on the predictive power of the model.
- **rxRocCurve**
+ Plots the true positive rate against the false positive rate.
+ If the curve is a straight line (45-degree line), then the model is no better than random guessing.

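Putting rxLogit and rxRocCurve together; all dataset and column names below are assumptions:

```r
# Sketch: fit a binary-response model, score it, then plot the ROC curve.
logitModel <- rxLogit(Cancelled ~ DepDelay + MonthName, data = flightsXDF)
scoredXDF  <- rxPredict(logitModel, data = flightsXDF,
                        outData = "scored.xdf", overwrite = TRUE,
                        writeModelVars = TRUE)
rxRocCurve(actualVarName = "Cancelled",
           predVarNames  = "Cancelled_Pred", data = scoredXDF)
```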
GLM:

- Takes a link function and a family as input.
- **rxGLM**

Cube regressions:

- The linear model and GLM functions can use cubes to run large regression models.
- Use **rxCube** to create cross-classified factors.

# Creating and Evaluating Partitioning Models

# Processing Big Data in SQL Server and Hadoop