
Basics

  • At its most basic, R can be used as a normal calculator. It can perform the following operations:

    • arithmetic:

      • addition: x + y
      • subtraction: x - y
      • multiplication: x * y
      • division: x / y
      • exponentiation: x^y, or x**y, raises the number to its left to the power of the number to its right
      • modulo: x %% y, returns the remainder of the division of the number on its left by the number on its right
      • integer division: x %/% y
    • comparison:

      • x == y
      • x != y
      • x < y
      • x <= y
      • x > y
      • x >= y
    • logical:

      • NOT: !x
      • OR: x | y, element-wise on vectors, or x || y, when we want the result to be a single value (scalar) even if applied to vectors
      • AND: x & y, element-wise on vectors, or x && y, when we want the result to be a scalar
    • rounding:

      • round(x, digits = 0) rounds the argument to the specified number of decimal places (default is 0). Try out ?round in the console for variations of round() and ways to change the number of digits to round to.
      • floor(x) returns the largest integer(s) not greater than the corresponding elements of x
      • ceiling(x) returns the smallest integer(s) not less than the corresponding elements of x
      • trunc(x) returns the integer(s) formed by truncating the values in x toward 0 (trunc(x) equals floor(x) for positive x and ceiling(x) for negative x)
      • signif(x, digits = 6) rounds the argument to the specified number of significant digits (default is 6).
    • mathematical:

      • absolute value: abs(x)
      • sign: sign(x)
      • square root: sqrt(x)
      • and many others ...
    • statistical:

      • sum(x),
      • mean(x),
      • sd(x),
      • var(x)
      • and many others ...
    • ...

      • diff(x) returns a vector of length(x) - 1 containing the differences between consecutive elements of x
      • difftime(x)
      • diffinv(x)

    Other special operators specific to R are:

    • : to create sequences
    • [ and [[ to index data structures
    • $ and @ to select elements or slots in data structures
    • x %in% y to look for elements
    • %any% any user-defined infix operator, i.e. a function whose name is enclosed between percent signs
    • ~ to specify relations in a model, with . an additional argument that represents all the remaining features
    • :: and :::
  • Commands don't need to be terminated with any special character

  • R is case-sensitive

  • Spaces are mostly ignored by the interpreter, but consistent use of spacing around operators and after commas makes code far more readable

  • Comments are single-line only and are introduced by the hash sign #. It can be placed anywhere on a line, and everything after it on the same line is ignored by the interpreter. Adding comments to the code is extremely important to make sure that your future self and others can understand what your code is about.

  • Help can be found using one of the following commands:

    • help.start()
    • ?topic
    • help(topic)
    • ??topic
    • help.search('topic')
    • apropos('topic')
    • demo('pkg_name')
    • example('topic')
  • When referring to paths, single backslashes \ are not admitted, since R treats \ as an escape character. In Windows, it's possible to use a double backslash \\ (or a forward slash /) instead.

Workspace / Environment

The workspace is the current R working environment available to users to store objects, and includes any user-defined objects.

  • getwd() return the working directory, which is the place where R looks by default (when not told otherwise) to retrieve or save any data

  • setwd('pathname') change the working directory to pathname. Note that R sees \ as an escape character, so in Windows the path needs to be inserted using a double backslash \\, or a forward slash / commonly found on UNIX systems.

  • ls() list all the objects in the current workspace

  • load('myfile.RData') # load a workspace into the current session

  • save(object_list, file = 'myfile.RData') # save specific objects to a file

  • save.image() # save the workspace to the file .RData in the cwd

  • rm(x) remove the specified object x from the workspace

  • rm(list = ls()) remove all objects from the workspace

  • gc() performs a garbage collection

  • options() view current option settings

    • options(digits = 3) set number of digits to print on output
    • options(opt_name = value) # set the option opt_name to the given value
  • history() display the last (25) commands

  • history(max.show = Inf) display all previous commands

  • savehistory(file = 'myfile') default is ".Rhistory" in working dir

  • loadhistory(file = 'myfile') default is ".Rhistory" from working dir

  • source('filename')

  • tempfile()

  • q()
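
A minimal session sketch tying these commands together (the directory, object, and file names are hypothetical):

setwd('~/projects/demo')       # hypothetical working directory
x <- rnorm(100)                # create an object in the workspace
ls()                           # "x"
save(x, file = 'demo.RData')   # save it to disk
rm(x)                          # remove it from the workspace
load('demo.RData')             # bring it back from the file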

R System

  • .libPaths()
  • R.Version()
  • getRversion()

File system

  • file.exists('fname') tests whether a file exists
  • file.create('fname') creates a new empty file
  • file.path(...) builds a path from components, using the separator appropriate for the platform
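
A quick sketch (the file names are hypothetical):

p <- file.path('data', 'results.csv')   # 'data/results.csv' on any platform
file.exists(p)                          # FALSE if not yet created
file.create('notes.txt')                # creates an empty file in the cwd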

Packages

An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. It's worth noting again that package names, like pretty much everything else in R, are case sensitive.

The main source for installing packages is the official CRAN repository, but two other common sources are GitHub and Bioconductor.

To install a package from:

  • CRAN use the command:

    install.packages('pkg_name')

    where the quotation marks are required. To install multiple packages, first store the quoted names in a vector, then call the previous function on it:

    pkg_list <- c('pkg1_name', 'pkg2_name', ...) 
    install.packages(pkg_list)

    To look for a suitable CRAN package to solve a specific problem, the best way is using one of the CRAN Task Views.

  • GitHub first install the devtools package from CRAN as above, and then use the install_github function in the format:

    devtools::install_github('user/repository')

    Note that devtools often needs some extra non-R software on the system to compile packages correctly -- more specifically, Rtools on Windows OS and Xcode on OSX.

  • Bioconductor has its own installer; in current R versions it is provided by the BiocManager package, as in BiocManager::install('pkg_name')

In order to use a package's functionality during an R session, you need to do one of two things:

  • load the entire package into your R session with one of the commands library('pkg_name') or require('pkg_name'), and then call any function embedded in the package using its name alone
  • call the desired function prefixed by the package name, like this: pkg_name::fun_name().

To see which packages are currently loaded into the environment run the function search(), while library() (without arguments) retrieves the names of all the packages installed on the machine.

R doesn't check beforehand if a package is already installed: if you ask to install a package, it simply does it. If you have a list of packages to use, and want to install only the ones you actually miss, first filter them against the list of installed ones, as in the sketch below:
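
A minimal sketch of this pattern (the package names are just examples):

pkgs <- c('data.table', 'ggplot2', 'dplyr')
pkgs.missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(pkgs.missing) > 0) install.packages(pkgs.missing)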

To install an entire existing library on a new machine (only for CRAN packages, though; for more complex cases see [here]()):

  • export the list of installed packages on an already working machine:

    pkgs.installed <- as.data.frame(installed.packages())
    pkgs.installed <- pkgs.installed[is.na(pkgs.installed$Priority),]
    write.csv(pkgs.installed[, c('Package', 'Version', 'Depends', 'Imports')], 'pkgs.installed.csv', row.names = FALSE)
  • import the list on another machine to install the corresponding packages:

    pkgs.installed <- read.csv('pkgs.installed.csv')
    install.packages(as.character(pkgs.installed$Package))
  • install.packages('pkgname')

  • search() shows the packages that are currently attached to the session

  • library(pkgname) makes the complete functionalities in the specified package available in the current environment

  • require(pkgname) equivalent to library, except that it returns FALSE with a warning, instead of throwing an error, when the package is not installed

  • pkgname::funname

  • vignette()

  • vignette('pkgname')

  • browseVignettes()

  • browseVignettes('pkgname')

  • detach('package:pkgname', unload = TRUE)

  • remove.packages('pkgname')

  • update.packages() update all outdated packages (note that its first argument is a library path, not a package name)

  • installed.packages() list all the packages that are installed in the local system

  • available.packages() list all packages stored on CRAN that are available to be installed on the local system

The above commands are related to packages officially deployed to CRAN. It's also possible to install packages stored on [GitHub]() using the devtools package:

devtools::install_github('username/reponame')

Core vs Base packages

The following is the list of all base packages:

  • base
  • compiler
  • datasets
  • graphics
  • grDevices
  • grid
  • methods
  • parallel
  • splines
  • stats
  • stats4
  • tcltk
  • tools
  • translations
  • utils

The following is the list of all recommended packages for the current 3.4.4 version, which together with the above base list forms what is called core R:

  • boot
  • class
  • cluster
  • codetools
  • foreign
  • KernSmooth
  • lattice
  • MASS
  • Matrix
  • mgcv
  • nlme
  • nnet
  • rpart
  • spatial
  • survival

datasets

Once R is started, there are lots of example data sets available within R itself and within loaded packages. You can list the data sets by their names and then load a data set into memory to be used in your statistical analysis.

  • data() list all the data sets contained in all the packages currently loaded into the local system
  • data(dtsname) load the dataset called dtsname
  • data(package = .packages(all.available = TRUE)) list all the data sets in all the packages installed on the local system

Some notable datasets are: iris and mtcars from base R, diamonds from the ggplot2 package, nasa and storms from the dplyr package, nycflights from the nycflights13 package

Constants

  • A character constant is any string delimited by single quotes (apostrophes) ', double quotes (quotation mark) " or backticks (backquotes or grave accent) `. They can be used interchangeably, but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes.

  • Escape sequences inside character constants are started using the backslash character \ as escape character (an escape character is a character which invokes an alternative interpretation on subsequent characters in a character sequence). The only admissible escape sequences in R are the following:

    • \n newline
    • \r carriage return
    • \t tab
    • \b backspace
    • \\ backslash
    • \' single quote
    • \" double quote
    • \` backtick
    • \a alert (bell)
    • \f form feed
    • \v vertical tab
    • \nnn character with given octal code (1, 2 or 3 digits)
    • \xnn character with given hex code (1 or 2 hex digits)
    • \unnnn Unicode character with given code (1--4 hex digits)
    • \Unnnnnnnn Unicode character with given code (1--8 hex digits)

Variables

Assignment of values to variables can be done in a few ways:

var <- value 
value -> var 
var <- value -> var 
assign('var', value)

Note that even if the usual equal sign = is recognized, its use for assignment is highly discouraged: = should only be used to assign values to arguments in a function call (or to parameters during definition). Moreover, = has lower precedence than <-, so they should not be mixed together in the same command.

Identifiers for variables consist of a sequence of letters, digits, the period . and the underscore _. They must not start with a digit nor underscore, nor with a period followed by a digit. The following reserved words are not valid identifiers:

  • TRUE
  • FALSE
  • NaN
  • NULL
  • Inf
  • if
  • else
  • repeat
  • while
  • for
  • in
  • next
  • break
  • function
  • NA
  • NA_integer_
  • NA_real_
  • NA_complex_
  • NA_character_
  • ..., ..1, ..2, ...

To display the content of an object to the console, just type its name at the prompt (or pass it to print()).

  • ls() list all objects
  • rm(x) remove the object named x
  • rm(list = ls()) remove ALL objects in the current environment

Data Types

Variables

A basic concept in all programming languages is the variable. It allows the coder to store a value, or more generally an object, so that it can be accessed later using its name. The great advantages of doing calculations with variables are reusability and abstraction.

In R it is possible to use either the statement x <- obj or the statement assign('x', obj) to assign the object obj to the variable named x. If x already exists, its old value is overwritten with the new value obj. Note that R is case sensitive! So x and X are considered two different variables.

To print out the value of a variable, it suffices to write its name, if working from the console, or to call print on it, if running from a script. Notice that R does not print the value of a variable to the console when assigning it. To assign a value and simultaneously print it, surround the assignment with parentheses, as in (x <- obj).

Variables can be of different nature or type, according to the nature of the object they store. To know more about their data type, here are some functions that can help:

  • class(x)
  • typeof(x)
  • mode(x)
  • functions that test whether an object is of a given type, returning TRUE or FALSE:
    • is.type(x), e.g. is.numeric(x), is.character(x)

It is often necessary to change the way that variables store their object(s), something called coercing or casting:

  • as.type(x), e.g. as.numeric(x), as.character(x)
    Only certain coercions are allowed though, and some of them, even if possible, lead to loss of information. All integer, numeric, and logical values are coercible to character.

Specific types of value:

  • NA is.na(). It should be noted that the basic NA is of type logical. There are also typed NAs for the other core types: NA_integer_, NA_real_, NA_character_, NA_complex_
  • NULL is.null()
  • Inf is.infinite(), for example: Inf/n, where n is any finite number
  • NaN is.nan(), for example: 0/0 and Inf/Inf

Testing and resolving missing values:

  • anyNA(x) returns TRUE if x contains at least one NA value
  • na.omit(x)
  • na.exclude(x)
  • na.fail(x) fails even if one element of x is NA
  • na.pass(x) ignores any NA value in x
  • na.rm = TRUE some functions accept this argument; it is usually set to FALSE by default, so the function will return NA unless na.rm is set to TRUE (see the sketch below)
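
A quick sketch of these behaviours:

x <- c(1, 2, NA, 4)
anyNA(x)                # TRUE
mean(x)                 # NA: the default na.rm = FALSE propagates the NA
mean(x, na.rm = TRUE)   # 2.333333
na.omit(x)              # 1 2 4, with an attribute recording the dropped position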

Character

Any type of text surrounded by single or double quotation marks indicates that the variable is of type character.

  • tolower(x)
  • toupper(x)

Numeric

Integer

When R encounters a number, it automatically assumes that it's a numeric, whatever the value. To force R to store an integer value as integer type, you have to append L to the number.
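
For example:

x <- 5          # numeric (double), even though the value is whole
y <- 5L         # integer, thanks to the L suffix
class(x)        # "numeric"
class(y)        # "integer"
is.integer(x)   # FALSE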

Logical

Under the hood, the logical values TRUE and FALSE are coded respectively as 1 and 0. Therefore:

  • as.logical(0) returns FALSE, and as.numeric(FALSE) returns 0
  • as.numeric(TRUE) returns 1, but as.logical(x) returns TRUE for every x != 0.

Factors

R provides a convenient and efficient way to store values for categorical variables, that takes on a limited number of values, herein called levels.
To create a factor in R, use the factor(x) function, where x is the vector to be converted into a factor. This simple way to define a factor lets R order the levels alphabetically, implicitly using sort(unique(x)). A different order can be specified by passing a convenient vector through the levels argument.
If not specified otherwise, the above order is not actually meaningful, and R returns NA (with a warning) if you try to apply relational operators. But if the order itself has a true meaning, in the sense that the underlying variable is at least ordinal, it's possible to set the ordered argument to TRUE. In this case it's also good practice to specify the correct levels. To force an order on an existing unordered factor, it is possible to use the ordered function, as in ordered(f, levels = c(...)).

The general form of the function is:

factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)

Once a factor is defined, the unique values of the original variable are mapped to integers, and the factor is stored internally as an integer vector, with a levels attribute keeping the mapping:

  • levels(f) lists all unique values taken on by f (NAs are ignored though)
  • levels(f) <- v renames the levels of f to a different set of values. Note that it must hold that length(v) == length(levels(f)), and that the elements in v are associated with the levels by their corresponding positions.
  • as.numeric(f) lists the integer codes associated with the values taken on by f
  • summary(f) now returns a frequency distribution of the underlying variable
  • plot(f) now returns a barplot of the underlying variable

The factor type can be used not only to store and manage categorical variables, but also numerical discrete variables, directly, and even continuous ones, once they have been discretized, a result that can be easily achieved using the cut function:

  • f <- cut(x, breaks = n) x is grouped into n evenly spaced buckets
  • f <- cut(x, breaks = c(x1, x2, ..., xn)) x is grouped into n-1 bins using the n specified limit values.
    The above process can also be used to group the values of a discrete variable that assumes too many values, to more easily analyze them.

Often, during the analysis, we encounter factors that have some levels with very low counts. To simplify the analysis, it often helps to drop such levels. In R, this requires two steps:

  • first filtering out any rows with the levels that have very low counts,
  • then removing these levels from the factor variable with droplevels.
    This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don't exist in a dataset.
    A similar behaviour happens when subsetting a factor, R removes the instances but leaves the level behind! In this case, though, we can directly add drop = TRUE to the subsetting operation to have the empty level(s) deleted as well as their values.

R's default behavior when creating data frames is to convert all characters into factors, often causing headaches to users trying to figure out why their character columns are not working properly... To turn off this behavior, simply add the argument
stringsAsFactors = FALSE to the data.frame call. By the way, data.table does NOT convert characters to factors automatically.

To assign labels to the levels, according to the actual values in the data: fct <- factor(fct, levels = values, labels = labels)
To change the labels of the levels, according to the order they are already stored in: levels(fct) <- labels
To reorder the levels by decreasing frequency:
fct <- factor(fct, levels = names(sort(table(fct), decreasing = TRUE)))
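
A short sketch of typical factor handling:

x <- c('low', 'high', 'medium', 'low', 'high')
f <- factor(x, levels = c('low', 'medium', 'high'), ordered = TRUE)
levels(f)       # "low" "medium" "high"
as.numeric(f)   # 1 3 2 1 3: the underlying integer codes
summary(f)      # frequency distribution of the levels
f2 <- f[f != 'medium', drop = TRUE]   # subset AND drop the now-empty level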

Dates and Times

There are two main classes to represent dates and times in R:

  • Date for calendar dates, with the standard format being the ISO yyyy-mm-dd, but R automatically recognizes yyyy/mm/dd as well
  • POSIXct/POSIXlt for date, time and timezone. The standard format in this case is yyyy-mm-dd hh:mm:ss
    Under the hood, both the above classes are simple numerical values: the number of days, for the Date class, or the number of seconds, for the POSIX objects, since the 1st of January 1970. The 1st of January 1970 is the common origin for representing times and dates in a wide range of programming languages. There is no particular reason for this; it is a simple convention. Of course, it's also possible to create dates and times before 1970; the corresponding numerical values are simply negative in this case.

  • unclass(x) returns the number of days, if x is Date, or seconds, if x is POSIXct, since '1970-01-01'
  • date() returns the current date and time as a character string
  • format(x, format = '') prints x according to the specified format codes (see ?strptime)
  • weekdays(x) returns the day(s) of the week for every element in x
  • months(x) the month(s)
  • quarters(x) the quarter(s) (in the form Qn)
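
For example:

d <- as.Date('2016-01-29')
unclass(d)                       # 16829: days since 1970-01-01
weekdays(d)                      # "Friday"
quarters(d)                      # "Q1"
format(d, format = '%d %B %Y')   # "29 January 2016" (locale dependent)
now <- as.POSIXct('2016-01-29 10:30:00', tz = 'UTC')
unclass(now)                     # seconds since 1970-01-01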

Minor data types

  • complex

  • raw

Data Structures

  • str(x) displays information about the internal structure of the object x

  • head(x, n = 6) list the first n elements of the object x

  • tail(x, n = 6) list the last n elements of the object x

Vector

A set of values of the same type.

The usual way to create a vector by hand is to combine, concatenate, or collect its elements using the function c. Note that if the elements are of different types, they are all coerced to the most flexible type among them (logical < integer < numeric < character).

Vectorization

Recycling
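
Both ideas in a minimal sketch: most R operations work element-wise on whole vectors (vectorization), and when two vectors of different lengths meet, the shorter one is repeated to match the longer (recycling):

x <- 1:6
x * 2                  # vectorization: every element is doubled
x + c(10, 20)          # recycling: c(10, 20) is repeated three times
x + c(10, 20, 30, 40)  # recycling with a warning: 6 is not a multiple of 4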

Sequences

  • start:stop ≡ c(start, start + 1, ..., stop - 1, stop): the colon operator is the easiest way to create a sequence of integers

  • seq()

    • seq_along()
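
For example:

1:5                           # 1 2 3 4 5
seq(0, 1, by = 0.25)          # 0.00 0.25 0.50 0.75 1.00
seq(0, 10, length.out = 5)    # 0.0 2.5 5.0 7.5 10.0
seq_along(c('a', 'b', 'c'))   # 1 2 3, a safe index sequence for loops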

Matrix

A matrix is simply a vector with a dimension attribute attached. The dimension can be applied using the dim command:

dim(x) <- c(nrows, ncols)

or the matrix command:

X <- matrix(x, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) 

where nrow and ncol specify the dimensions of the matrix; in practice only one of them is needed, the other being inferred from the length of x.
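
For example:

x <- 1:6
dim(x) <- c(2, 3)                         # x is now a 2x3 matrix, filled by column
X <- matrix(1:6, nrow = 2, byrow = TRUE)  # same values, filled by row instead
dim(X)    # 2 3
t(X)      # the transpose: a 3x2 matrix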

Array

Dataframe

A horizontal or column binding of named vectors of the same length.

List

Data I/O

  • In most cases, empty cells, or placeholders, are treated as missing values, which are automatically converted by R to the special value NA.

  • Any character column is actually converted into a factor, unless stringsAsFactors = FALSE.

  • read.table(fname, sep = '', header = FALSE, dec = '.')
    This is the main command for reading a text file in table format, while creating a dataframe out of it. It is the most general function of the family, with several convenience wrappers:

    • read.csv() reads a CSV (Comma Separated Values) file, implies header = TRUE, sep = ',' and dec = '.'
    • read.csv2() reads a CSV file, implies header = TRUE, sep = ';' and dec = ','
    • read.delim() reads a TSV (Tab Separated Values) file, implies header = TRUE, sep = '\t' and dec = '.'
    • read.delim2() reads a TSV file, implies header = TRUE, sep = '\t' and dec = ','
    • read.fwf()
  • write.table(x, file, row.names = TRUE, col.names = TRUE)

  • Datasets originating from other software, like SPSS, SAS and Minitab, can be loaded after they have been converted into one of the above formats. Alternatively, it's possible to use the foreign package:

    foreign::read.spss(fname, use.value.labels = TRUE, to.data.frame = TRUE)
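
A minimal sketch of a typical round trip (the file names are hypothetical):

dts <- read.csv('sales.csv', stringsAsFactors = FALSE)   # hypothetical input file
str(dts)                                                 # inspect the result
write.csv(dts, 'sales_copy.csv', row.names = FALSE)      # write it back out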

Programming

Functions

fun.name <- function(arg.req, arg.opt = val.default){
 code
 return(object)
}

A function returns the object specified in the return call, if present; otherwise, the result of the last line executed. There can be multiple return calls in a function's body. A complete toy example follows the list of function parts below.

There are three main parts to a function:

  • arguments formals(fun.name): the user inputs that the function works on. They can be the data that the function manipulates, or options that affect the calculation.
    Functions can have multiple arguments, separated by commas (each followed by a single space, by style convention).
    Some or all the arguments could be optional, if a default value is set for them using an equal sign surrounded by spaces.
    Arguments can be specified by position or by name. By style convention, the first argument, if required, is not named. However, beyond the first couple of arguments you should always use matching by name: it makes the code much easier for everyone to read. This is particularly important if the argument is optional, because it has a default. When overriding a default value, it's good practice to use the name.

    • Every time a function is called, it's given a new fresh and clean environment, first populated with the arguments.
    • Once the function has ended its job, all its environment is destroyed.
    • If a variable present in the code is neither defined inside the function nor passed in as an argument, R looks for it outside of the function environment, starting one level up, all the way to the global environment. If it's not found in the global environment either, R throws an error
    • Any variable defined inside a function, does not exist outside of its scope.
    • A variable defined inside a function can have the same name as a variable defined outside of it, but they are completely different objects with different scopes, and the inner one takes precedence over the outer one inside the function.
  • body body(fun.name): the code that actually performs the manipulation.

  • scope environment(fun.name).
    Scoping is the process of how R looks for a variable's value when given a name. It is usually partitioned as global vs local.
    Every variable defined inside a function is bound to live only inside that function. If you try to access it outside of the scope of that function, you will get an error because it does not exist!
    If a variable is not found inside the function, but it exists in the global environment, it is passed into the function by value, meaning that the function can use (and even modify) its local copy, but the original value outside the function remains unchanged.

  • arguments required vs optional.
    Optional arguments are ones that don't have to be set by the user, either because they are given a default value, or because the function can infer them from the other data. Even though they don't have to be set, they often provide extra flexibility.

  • arguments vs parameters

  • In R, functions are objects just like any other object. In particular, they can be arguments to other functions!

  • documentation: help(fun.name), ?fun.name

  • nested functions.
    It's often useful to use the result of one function directly in another one, without having to create an intermediate variable.

  • anonymous function

  • Function names should be chosen as descriptive of the action taken, and because of that should always be expressed as verbs.

  • Argument names should be expressed as nouns and should avoid masking the names of core R objects.
    Aside from giving your arguments good names, you should put some thought into what order your arguments are in and if they should have defaults or not. Arguments are often one of two types:

    • data arguments, that supply the data to compute on
    • detail arguments, that control the details of how the computation is done.
      Generally, data arguments should come first, while detail arguments should go on the end, and usually should have default values.
  • If you need to display messages about the code, it is bad practice to use cat or print, which are designed just to display output. The official R way to supply simple diagnostic information is the message function. Its unnamed arguments are pasted together with no separator (and no need for a newline at the end) and by default are printed to the screen.

  • Notice that there are three kinds of functions in R:

    • most of the functions are called closures
    • language constructs are known as special functions
    • a few important functions are known as builtin functions, they use a special evaluation mechanism to make them go faster.
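
The toy example promised above, illustrating a default argument, local scope, and the implicit return of the last expression (the function name is hypothetical):

rescale <- function(x, na.rm = TRUE) {
    rng <- range(x, na.rm = na.rm)    # rng exists only inside the function
    (x - rng[1]) / (rng[2] - rng[1])  # last expression: the returned value
}
rescale(c(0, 5, 10))   # 0.0 0.5 1.0
rescale(c(1, NA, 3))   # the na.rm default keeps the call working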

Colloquials

  • stopifnot(condition)
  • stop('message', call. = FALSE)
    Using stop() inside an if statement that checks the condition, instead of the simpler but often obscure stopifnot(), allows you to specify a more informative error message (call. = FALSE suppresses the function call from the error output). Writing good error messages is an important part of writing a good function: the error should tell the user what should be true, not what is false
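
A sketch of this pattern (the function name is hypothetical):

sd_ratio <- function(x, y) {
    if (!is.numeric(x) || !is.numeric(y)) {
        stop('x and y must both be numeric vectors', call. = FALSE)
    }
    sd(x) / sd(y)
}
# sd_ratio(letters, 1:10) now fails with the informative message above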

Side Effects

Side effects describe what happens when running a function that alters the state of your R session. If foo() is a function with no side effects (a.k.a. pure), then when we run x <- foo(), the only change we expect is that the variable x now has a new value. No other variables in the global environment should be changed or created, no output should be printed, no plots displayed, no files saved, no options changed. We know exactly the changes to the state of the session just by reading the call to the function.
Of course functions with side effects are crucial for data analysis. You need to be aware of them, and deliberate in their usage. It's ok to use them if the side effect is desired, but don't surprise users with unexpected side effects.

Robustness

A desired class of functions is the set of so-called pure functions, characterized by the following good traits: their output depends only on their inputs (the same input always yields the same output), and they have no side effects.

Unstable types

A function is called type-inconsistent when the type or class of its output depends on the type of its input. This unpredictable behaviour is a sign that you shouldn't rely on type-inconsistent functions inside your own functions, but use only alternative functions that are type consistent!

Typical examples of such functions are the following:

  • [, that when applied to a dataframe can return either a vector or a dataframe, depending on whether one or more columns are selected. A simple way to overcome this problem is adding a drop = FALSE argument to the call (see the sketch after this list).
  • sapply, whose output structure depends on what the applied function returns. The easiest way to avoid the situation is using vapply instead, or better any of the map functions from the purrr package.
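
For example:

df <- data.frame(a = 1:3, b = 4:6)
class(df[, 1])                 # "integer": a single column collapses to a vector
class(df[, 1, drop = FALSE])   # "data.frame": type-consistent subsetting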

Most of the time, switching to stable functions means that we have to accept failures, and possibly write some informative error messages to return, or additional conditionals to decide which action to undertake in case of the above failures.

Non-Standard Evaluation (NSE)

To avoid the problems caused by non-standard evaluation functions, you could avoid using them. In our example, we could achieve the same results by using standard subsetting (for example, the standard bracket subsetting [ instead of dplyr::filter()). But if you do need to use non-standard evaluation functions, it's up to you to provide protection against the problem cases. That means you need to know what the problem cases are, to check for them, and to fail explicitly.

Hidden Arguments

A classic example of a hidden dependence is the stringsAsFactors argument to the read.csv() function (and a few other data frame functions). That's because if the argument is not specified, it inherits its value from getOption("stringsAsFactors"), a global option that a user may change. In general, you want to avoid having the return value of your own functions depend on any global options. That way, you and others can reason about your functions without needing to know the current state of the options.

It is, however, okay to have side effects of a function depend on global options. For example, the print() function uses getOption("digits") as the default for the digits argument. This gives users some control over how results are displayed, but doesn't change the underlying computation.

Conditionals

IF

  • if(condition) { code when TRUE }

  • if(condition) { code when TRUE } else { code when FALSE }

  • if(condition) { code when TRUE } else if(condition2) { code when condition2 TRUE } else { code when ALL previous FALSE }

  • break

  • ifelse(condition, code when TRUE, code when FALSE)
    It is vectorized!
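
A quick sketch contrasting the scalar if with the vectorized ifelse:

x <- c(-2, 0, 3)
if (x[1] < 0) 'negative' else 'non-negative'   # tests a single value only
ifelse(x < 0, 'negative', 'non-negative')      # applied element-wise to x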

Switch
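
A minimal sketch of how switch dispatches on a string (the names are hypothetical):

centre <- function(x, type) {
    switch(type,
           mean   = mean(x),
           median = median(x),
           stop('unknown type: ', type))
}
centre(c(1, 2, 10), 'median')   # 2
centre(c(1, 2, 10), 'mean')     # 4.333333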

Repetitions or Loops

Repeat

It's a basic infinite repetition of the inner code, the only way to exit being a break statement:

# the code in this loop runs at least once 
repeat {
    code
    if(condition) break
}
# the code in this loop could possibly never run
repeat {
    if(condition) break
    code
}

While

  • break
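
A minimal sketch:

i <- 1
while (i <= 3) {
    print(i)     # prints 1, 2, 3
    i <- i + 1   # without this update the loop would never end
}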

For

When you know how many times you want to repeat an action, a for loop is a good option. The idea of the for loop is that you are stepping through a set or a sequence, one element at a time, and performing an action at each step along the way. Moreover, using nested loops it is possible to loop over multidimensional objects, like matrices:
for(elem in set){ code }

for(index in seq){ code }

for(i in seq1){ for(j in seq2){ code } }

  • next: skips only the current iteration, and continues the loop
  • break: breaks out of the for loop completely
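
For example:

for (i in 1:5) {
    if (i == 2) next    # skip this iteration only
    if (i == 4) break   # leave the loop entirely
    print(i)            # prints 1 and 3
}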

Apply Family

lapply() is possibly the best known, and most used, vectorized version of the classic for loop, and is often preferred (and simpler) in the R world because of its readability and efficiency. As a matter of fact, it's just one member of an entire family of functions that help build repetitions of functions.

  • apply(x, MARGIN, FUN, ...)

  • lapply(X, FUN, ...)
    lapply, short for list apply, is usually the first and arguably the most commonly used function of R's wide apply family. In general, lapply(x, funname, ...) takes a vector or list x, and applies funname to each of its members. If FUN requires additional arguments, you pass them after you've specified X and FUN.

The output of lapply is always a list, the same length as x, where each element is the result of applying funname on the corresponding element of x.

The function can be one of R's built-in functions, but it can also be a function written by the user. This self-written function can be defined beforehand, or can be inserted directly as an anonymous function, as follows:
lapply(v, function(x) { code block })

Because dataframes are essentially lists under the hood, with the columns of the dataframe being the elements of the underlying list, calling an lapply with a dataframe as argument would apply the specified function to each column of the data frame.

  • sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
    Many of the operations performed by lapply on the specified vector or list will each generate a list containing elements of the same type and length. These lists could just as well be represented by a vector, or a matrix. That's exactly what sapply, standing for simplified lapply, does: it performs first lapply, and then checks whether the result can be correctly represented as an array, which is the case when the lengths of the elements in the output are all the same. If the simplification is not possible, the output is identical to the output the parent lapply would have generated, without any warning. Hence, sapply is not a robust tool, because the output structure tends to depend on the inputs, and there's no way to enforce a different behaviour. The usual suggestion is that it's a great tool for interactive use, but not such a safe tool to be included in functions or, worse, in production.

Another neat feature of sapply is that, if it can gather sufficient information from the input, be it a named vector or list, it tries to meaningfully name the elements of the output structure. To avoid this behaviour, add the argument USE.NAMES = FALSE.

  • vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
    With vapply the user can define the exact structure of the output, so as to avoid the unpredictable behaviour typically incurred when using sapply. The additional argument FUN.VALUE expects a template for the return value of the function FUN, such as datatype(n), where datatype can be integer, numeric, character or logical, and n is the length of the result. When the structure of the output of the function does not correspond to the template, vapply throws an error reporting the misalignment between expected and actual output. vapply can then be considered a more robust and safer version of sapply, and converting all sapply calls to vapply versions is therefore good practice. There's really no reason to use sapply: if the output that lapply would generate can be simplified to an array, you'll want to use vapply, specifying the structure of the expected output, to do this securely. If simplification is not possible, simply stick to lapply (see the sketch after this list).

  • tapply(X, INDEX, FUN) applies FUN to groups of X defined by the factor INDEX

  • mapply(FUN, ...) a multivariate version of sapply, iterating in parallel over several vectors or lists

  • rapply a recursive version of lapply

  • eapply applies FUN over the values stored in an environment
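
The sketch promised above, comparing the three main members of the family:

x <- list(a = c(1, 2, 3), b = c(4, 5, 6))
lapply(x, sum)                           # always a list: $a 6, $b 15
sapply(x, sum)                           # silently simplified to a named vector
vapply(x, sum, FUN.VALUE = numeric(1))   # same result, but the output type is enforced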

Debugging

  • tryCatch
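
A minimal sketch of tryCatch, with handlers for both warnings and errors (the function name is hypothetical):

safe_log <- function(x) {
    tryCatch(
        log(x),
        warning = function(w) { message('caught a warning: ', conditionMessage(w)); NA },
        error   = function(e) { message('caught an error: ', conditionMessage(e)); NA }
    )
}
safe_log(10)    # 2.302585
safe_log(-1)    # NA, after the 'NaNs produced' warning is caught
safe_log('a')   # NA, after the non-numeric error is caught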

Functional vs Object Oriented

Object-oriented programming (OOP) is very powerful, but not appropriate for every data analysis workflow. OOP is usually suitable for workflows where you can describe exhaustively a limited number of objects.

One of the principles of OOP is that functions can behave differently for different kinds of object inputs. The summary() function is a good example of this. Since different types of variable need to be summarized in different ways, the output that is displayed to you varies depending upon what you pass into it.

R has many OOP frameworks, some of which are better than others. Knowing how to use S3 is a fundamental R skill. R6 and ReferenceClasses are powerful OOP frameworks, with the former being much simpler. S4 is useful for working with Bioconductor. Don't bother with the others.

S3
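
A minimal S3 sketch: a constructor that sets a class attribute, a generic, and two methods (all names are hypothetical):

circle <- function(r) structure(list(r = r), class = 'circle')
area <- function(shape, ...) UseMethod('area')        # the generic
area.circle <- function(shape, ...) pi * shape$r^2    # its method for 'circle'
print.circle <- function(x, ...) cat('circle of radius', x$r, '\n')

c1 <- circle(2)
print(c1)   # dispatches to print.circle
area(c1)    # 12.56637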

R6

ReferenceClasses

Descriptive Statistics

Summaries

Univariate

  • frequency distribution
    table(x)

  • Location or Position

  • Spread or variability

  • Skewness

  • Kurtosis

Bivariate

  • association: crosstabs

  • correlation

Charts

Univariate

  • barplot, for categorical variables

  • histogram

  • boxplot

Bivariate

  • highlight table

  • scatterplot

    plot(
       V1, V2, 
       main = "plot title", xlab = "x-axis label", ylab  ="y-axis label", 
       col = 'colour', type = '', pch = shape, cex = size
    )
    # add more points
    points(V1, V2, ...)
    # add fit lines
    abline(lm(V1~V2, data = dts), ...)      # regression line (y~x) 
    lines(lowess(V1, V2), ...)              # lowess line (x,y)
    # add legend
    legend(posx, posy, legend = c(), pch = c(), pt.cex = c(), col = c() )
    # add text
    text(posx, posy, 'text')

    In the case of legends and text, it's possible to use the function locator(n): once the chart has been drawn, R stops n times for the user to click over the position(s) where the legend and text labels should be drawn.

    More parameters can be set using the par function before running any plot command. In particular, par lets you create a grid to present more than one chart at a time:

    par(mfrow = c(nrows, ncols)) # fills the grid by rows
    par(mfcol = c(nrows, ncols)) # fills the grid by cols

    There are also different options when looking at additional packages:

    car::scatterplot(
            mpg ~ wt | cyl, data=mtcars, 
            xlab="Weight of Car", ylab="Miles Per Gallon", main="Enhanced Scatter Plot", labels=row.names(mtcars),
            reg.line = lm, smooth = TRUE, spread = TRUE, boxplots = 'xy', span = 0.5
    )

Multivariate

  • scatterplot matrix

Probability

Basics

Independence and Conditional Probability

The conditional probability of an event A, given that it's known that another event B has happened, is defined as

P(A|B) = P(A & B) / P(B)

Two processes, or events, are called independent if the outcome of one of them doesn't affect the outcome of the other.
In formulas, A and B are independent if P(A|B) = P(A).
It's easy to prove that, when the above holds, it is also true that P(A & B) = P(A) * P(B)
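
A quick simulation sketch of the product rule, using two independent dice:

set.seed(1)
x <- sample(1:6, 1e5, replace = TRUE)   # first die
y <- sample(1:6, 1e5, replace = TRUE)   # second die, independent of the first
mean(x == 6 & y == 6)                   # P(A & B), close to 1/36
mean(x == 6) * mean(y == 6)             # P(A) * P(B): nearly identical, as expected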

Simulation with R

There are four main different ways in R to get values related to probability distributions. The user needs to concatenate one of the following four prefixes:

  • r generates n random numbers (if length(n) > 1, the length of n, instead of its value, is taken to be the number required)
  • d computes the probability density function f(x), or p.d.f., where x is a vector of quantiles.
    For a discrete RV, d is simply the probability that X is exactly equal to x.
  • p calculates the (cumulative) distribution function F(q), or c.d.f., where q is a vector of quantiles
  • q returns the quantile function Q(p), where p is a vector of probabilities. As the quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function, Q is also the inverse c.d.f., i.e. Q = F^(-1)

followed by the acronym of the desired distribution:

| suffix   | distribution                   | parameters               | type       | mean              | variance                 |
|----------|--------------------------------|--------------------------|------------|-------------------|--------------------------|
| binom    | Binomial                       | size, prob               | discrete   | size * prob       | size * prob * (1 - prob) |
| pois     | Poisson                        | lambda                   | discrete   | lambda            | lambda                   |
| geom     | Geometric                      | prob                     | discrete   | (1 - prob) / prob | (1 - prob) / prob^2      |
| hyper    | Hypergeometric                 | m, n, k                  | discrete   |                   |                          |
| nbinom   | Negative Binomial              | size, prob               | discrete   |                   |                          |
| unif     | Uniform                        | min = 0, max = 1         | continuous | (min + max) / 2   | (max - min)^2 / 12       |
| norm     | Normal                         | mean = 0, sd = 1         | continuous | mean              | sd^2                     |
| t        | Student t                      | df                       | continuous |                   |                          |
| chisq    | Chi-Square                     | df                       | continuous | df                | 2 * df                   |
| f        | F                              | df1, df2                 | continuous |                   |                          |
| exp      | Exponential                    | rate = 1                 | continuous | 1 / rate          | 1 / rate^2               |
| gamma    | Gamma                          | shape, rate = 1          | continuous | shape / rate      | shape / rate^2           |
| beta     | Beta                           | shape1, shape2           | continuous |                   |                          |
| cauchy   | Cauchy                         | location = 0, scale = 1  | continuous | undefined         | undefined                |
| logis    | Logistic                       | location = 0, scale = 1  | continuous |                   |                          |
| lnorm    | Log Normal                     | meanlog = 0, sdlog = 1   | continuous |                   |                          |
| weibull  | Weibull                        | shape, scale = 1         | continuous |                   |                          |
| tukey    | Studentized Range              | nmeans, df               | continuous |                   |                          |
| wilcox   | Wilcoxon Rank Sum Statistic    | m, n                     | discrete   |                   |                          |
| signrank | Wilcoxon Signed Rank Statistic | n                        | discrete   |                   |                          |

Additional arguments are:

  • log = FALSE, for d
  • log.p = FALSE for p and q
  • lower.tail = TRUE for p and q
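
For example, with the Binomial and Normal families:

dbinom(3, size = 10, prob = 0.5)   # P[X == 3] for B(10, 0.5): 0.1171875
pbinom(3, size = 10, prob = 0.5)   # P[X <= 3]: 0.171875
qnorm(0.975)                       # 1.959964, the 97.5% quantile of N(0, 1)
pnorm(1.959964)                    # ~0.975: p and q are inverses
rnorm(3, mean = 10, sd = 2)        # three random draws from N(10, 2)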

Binomial B(n, p)

Model for the number x = 0,1,...,n of successes in n trials, where n is the total number of trials, and p is the probability of success in each trial. The probability function for B(n, p) is given by:

p(x) = choose(n, x) p^x (1-p)^(n-x), (x = 0, …, n), 

where the binomial coefficients can be computed in R using the function choose(n, x).

P[X == x; n, p] = dbinom(x, n, p) ~ mean(rbinom(bignum, n, p) == x)
P[X <= x; n, p] = pbinom(x, n, p) ~ mean(rbinom(bignum, n, p) <= x)
P[X >= x; n, p] = 1 - pbinom(x - 1, n, p) ~ mean(rbinom(bignum, n, p) >= x)

E[X] = n * p ~ mean(rbinom(bignum, n, p))
V[X] = n * p * (1 - p) ~ var(rbinom(bignum, n, p))
E[a + m * X] = a + m * E[X]   ==> mean(a + m * rbinom(bignum, n, p)) ~ a + m * mean(rbinom(bignum, n, p)) 
V[a + m * X] = m^2 * V[X] ==> var(a + m * rbinom(bignum, n, p)) ~ m^2 * var(rbinom(bignum, n, p))
E[X + Y] = E[X] + E[Y]   
    ==> mean(rbinom(bignum, nx, px) + rbinom(bignum, ny, py)) ~ mean(rbinom(bignum, nx, px)) + mean(rbinom(bignum, ny, py)) 
V[X + Y] = V[X] + V[Y] + 2 * cov(X, Y)
    ==> X, Y ind: var(rbinom(bignum, nx, px) + rbinom(bignum, ny, py)) ~ var(rbinom(bignum, nx, px)) + var(rbinom(bignum, ny, py))
X, Y ind => cov(X, Y) = 0

X, Y ind => P[X = x or Y = y] = P[X = x] + P[Y = y] - P[X = x, Y = y]
    = dbinom(x, nx, px) + dbinom(y, ny, py) - dbinom(x, nx, px) * dbinom(y, ny, py) 
    ~ mean( rbinom(bignum, nx, px) == x | rbinom(bignum, ny, py) == y) 
X, Y ind => P[X = x, Y = y] = P[X = x] * P[Y = y] 
    = dbinom(x, nx, px) * dbinom(y, ny, py) ~ mean( rbinom(bignum, nx, px) == x & rbinom(bignum, ny, py) == y) 
# While the above math formula becomes quite cumbersome for more than just two events, the simulation formula in R extends fairly naturally:  
P[X = x or Y = y or Z = z or ...] ~ mean( rbinom(bignum, nx, px) == x | rbinom(bignum, ny, py) == y | rbinom(bignum, nz, pz) == z | ...) 

Poisson P(m)

Geometric G(p)

Normal N(m, s)

Sampling

When dealing with statistical properties of a real distribution, it's often possible to examine only a subset, or sample, of the whole population, drawn using the sample function:

sample(x, size, replace = FALSE, prob = NULL)

where:

  • x identifies the vector the sample should be drawn from. If length(x) = 1, and x >= 1, then that vector is actually taken to be 1:x
  • size is the number of items to draw. When replace = FALSE, size cannot be greater than x, if length(x) = 1, or than length(x), otherwise.
  • replace indicates whether sampling is done with replacement, i.e. whether each value can be drawn more than once
  • prob By default, all outcomes in x have equal probability: 1/x, when length(x) = 1, or 1/length(x) otherwise. However, it's possible to set different probabilities by adding to the arguments a vector p of probability weights, one for each possible outcome in x, so that length(p) = x, when length(x) = 1, or length(p) = length(x) otherwise.
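
For example:

set.seed(123)
sample(10, 3)                            # 3 values from 1:10, without replacement
sample(c('H', 'T'), 5, replace = TRUE)   # five coin flips
sample(1:3, 5, replace = TRUE, prob = c(0.6, 0.3, 0.1))   # weighted outcomes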

The bigger the sample set, the more representative of the complete population, and thus the higher its accuracy.

Let's say now we are interested in a statistic T = t(x), and for that purpose we start drawing a sample x1 and calculate the value t1 = t(x1). Not surprisingly, we expect that every time we take another random sample xk, we get a different value tk = t(xk) for T. It's useful to get a sense of just how much variability we should expect when estimating T this way. The distribution of T, called the sampling distribution, can help us understand this variability. Using the sample command and a bit of iteration, it's easy to build a sampling distribution in R:

T.distr <- sapply(1:n_sim, function(i) t.stat(sample(x, size)))   # t.stat is the function computing the statistic T

Varying the n_sim argument, and plotting the histograms for different values, it's possible to infer whether the distribution settles on an ultimate shape.

Bootstrapping

While the sampling distribution is built by resampling from the population, the bootstrap distribution is built by resampling from the sample:

  • Take a bootstrap sample (a random sample with replacement of size equal to the original sample size) from the original sample.
  • Record the mean of this bootstrap sample.
  • Repeat previous steps many times to build a bootstrap distribution.
  • Calculate the XX% interval using the percentile or the standard error method.
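
A minimal sketch of the percentile method for a 95% interval (the original sample is simulated here for illustration):

set.seed(42)
x <- rnorm(100, mean = 10)              # stand-in for the original sample
boot.means <- replicate(10000, mean(sample(x, replace = TRUE)))
quantile(boot.means, c(0.025, 0.975))   # 95% percentile interval for the mean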

set.seed(n) fixes the state of the random number generator, making any simulation (including the resampling above) reproducible.

Statistical Inference

z test, Normal Distribution N(mu, sigma)

t test, Student's t Distribution t(nu)

ANOVA

Spatial

R core has no spatial capability. The minimum requirement is to load the maptools, rgdal and sp packages.

There exist two main ways to store spatial data:

  • vector
  • raster

There are three main types of geometry used to depict spatial shapes:

  • points
  • lines
  • polygons

Points

# csv files with two columns representing coordinates for each location 
x <- read.csv('fname') 
# transform x from simple dataframe to spatial object
coordinates(x) <- ~lon+lat
# apply a projection using a spatial reference ==> http://spatialreference.org/ or https://epsg.io/
proj4string(x) <- CRS("refsys")
# plot the locations
plot(x)