Basics
- R can be used at its most basic level as a normal calculator. It can perform the following operations (a few worked examples follow the list):
- arithmetic:
  - addition: `x + y`
  - subtraction: `x - y`
  - multiplication: `x * y`
  - division: `x / y`
  - exponentiation: `x^y` or `x**y`, raises the number on its left to the power of the number on its right
  - modulo: `x %% y`, returns the remainder of the division of the number on its left by the number on its right
  - integer division: `x %/% y`, returns the integer quotient of the division
- comparison:
  `x == y`, `x != y`, `x < y`, `x <= y`, `x > y`, `x >= y`
- logical:
  - NOT: `!x`
  - OR: `x | y`, or `x || y` when we want the result to be a single value (scalar) even if applied to vectors
  - AND: `x & y`, or `x && y` when we want the result to be a scalar
- rounding:
  - `round(x, digits = 0)` rounds the argument to the specified number of decimal places (default is 0). Try out `?round` in the console for variations of `round()` and ways to change the number of digits to round to
  - `floor(x)` returns the largest integer(s) not greater than the corresponding elements of x
  - `ceiling(x)` returns the smallest integer(s) not less than the corresponding elements of x
  - `trunc(x)` returns the integer(s) formed by truncating the values in x toward 0 (equal to floor for positive values, to ceiling for negative ones)
  - `signif(x, digits = 6)` rounds the argument to the specified number of significant digits (default is 6)
- mathematical:
  - absolute value: `abs(x)`
  - sign: `sign(x)`
  - square root: `sqrt(x)`
  - and many others ...
- statistical:
  `sum(x)`, `mean(x)`, `sd(x)`, `var(x)`, and many others ...
- ...
  - `diff(x)` returns a vector of `length(x) - 1` containing the differences between consecutive elements of x
  - `difftime(time1, time2)` returns the time difference between two date-time objects
  - `diffinv(x)` computes the inverse of `diff()`
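A few of the operators above in action (all standard base R):

7 %% 3                       # 1: remainder
7 %/% 3                      # 2: integer quotient
2^10                         # 1024
3 <= 2                       # FALSE
!TRUE                        # FALSE
round(3.14159, digits = 2)   # 3.14
floor(-2.5); ceiling(-2.5)   # -3 and -2
signif(123456, digits = 3)   # 123000
diff(c(1, 4, 9))             # 3 5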
Other special operators specific to R are:

- `:` to create sequences
- `[` and `[[` to index data structures
- `$` and `@` to select elements or slots in data structures
- `x %in% y` to look for elements of x in y
- `%any%`, the pattern followed by user-defined infix operators
- `~` to specify relations in a model, with `.` an additional argument that represents all the remaining features
- `::` and `:::` to access exported and internal objects of a package
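For instance, a minimal illustration of `%in%` and of a user-defined infix operator (the `%+%` operator below is hypothetical, defined on the spot):

c(2, 9) %in% 1:5                      # TRUE FALSE
`%+%` <- function(a, b) paste(a, b)   # a custom infix following the %any% pattern
'hello' %+% 'world'                   # "hello world"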
- Commands don't need to be terminated with any special character.
- R is case-sensitive.
- Spaces are ...
- Comments are one-line only and are identified by the hash sign `#`. It can be put anywhere, and everything after it on the same line is ignored by the interpreter. Adding comments to the code is extremely important to make sure that your future self and others can understand what your code is about.
Help can be found using one of the following commands: `help.start()`, `?topic`, `help(topic)`, `??topic`, `help.search('topic')`, `apropos('topic')`, `demo('pkg_name')`, `example('topic')`
- When referring to paths, backslashes `\` are not admitted. In Windows, it's possible to use a double backslash `\\` instead.
Workspace / Environment
The workspace is the current R working environment available to users to store objects, and includes any user-defined objects.
- `getwd()` returns the working directory, which is the place where R looks by default (when not told otherwise) to retrieve or save any data
- `setwd('pathname')` changes the working directory to pathname. Note that R sees `\` as an escape character, so in Windows the path needs to be inserted using a double backslash `\\`, or a forward slash `/` as commonly found on UNIX systems
- `ls()` lists the objects in the workspace
- `load('myfile.RData')` loads a workspace into the current session
- `save(object_list, file = 'myfile.RData')` saves specific objects to a file
- `save.image()` saves the workspace to the file .RData in the current working directory
- `rm(x)` removes the specified object x from the workspace
- `rm(list = ls())` removes all objects from the workspace
- `gc()` performs a garbage collection
- `options()` views the current option settings, while `options(digits = 3)` sets the number of digits to print on output
- `history()` displays the last 25 commands
- `history(max.show = Inf)` displays all previous commands
- `savehistory(file = 'myfile')` saves the command history; the default file is ".Rhistory" in the working directory
- `loadhistory(file = 'myfile')` loads a command history; the default file is ".Rhistory" from the working directory
- `source('filename')` executes the R code contained in the specified file
- `tempfile()` returns a path usable as the name of a temporary file
- `q()` quits the R session
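A minimal session sketch tying the commands above together (file and directory names are hypothetical):

setwd('C:/projects/analysis')   # a forward slash works on Windows too
x <- rnorm(100)
save(x, file = 'x.RData')       # save a single object
rm(x)                           # remove it from the workspace
load('x.RData')                 # bring it back
ls()                            # "x"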
R System
`.libPaths()`, `R.Version()`, `getRversion()`
File system
- `file.exists(path)` tests whether a file exists
- `file.create(path)` creates a new empty file
- `file.path(...)` builds a path from components, using the platform's separator
Packages
An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. It's worth noting again that package names, like pretty much everything else in R, are case sensitive.
The main source for installing packages is the official CRAN repository, but two other common sources are GitHub and Bioconductor.
To install a package from:
- CRAN: use the command `install.packages('pkg_name')`, where the quotation marks are required. To install multiple packages, first store the quoted names in a vector, then call the previous function on it:

  pkg_list <- c('pkg1_name', 'pkg2_name', ...)
  install.packages(pkg_list)

  To look for a suitable CRAN package to solve a specific problem, the best way is using one of the CRAN Task Views.
- GitHub: first install the devtools package from CRAN as above, and then use the `install_github` function in the format `devtools::install_github('user/repository')`. Note that devtools often needs some extra non-R software on the system to compile packages correctly -- more specifically, Rtools on Windows and Xcode on macOS.
- Bioconductor: use its dedicated installer rather than `install.packages`.
In order to use a package's functionality during an R session, you need to do one of two things:
- load the entire package into your R session with one of the commands `library('pkg_name')` or `require('pkg_name')`, and then call any function embedded in the package using its name alone
- call the desired function including the package name, like this: `pkg_name::fun_name()`
To see which packages are currently loaded into the environment run the function `search()`, while `library()` (without arguments) retrieves the names of all the packages installed on the machine.
R doesn't check beforehand if a package is already installed; if you ask to install a package, it simply does it. If you have a list of packages to use, and want to install only the ones you actually miss:
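One possible sketch of this idiom (the package names are placeholders):

pkgs <- c('pkg1_name', 'pkg2_name', 'pkg3_name')
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)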
To install an entire existing library on a new machine (only for CRAN packages, though):
- export the list of installed packages on an already working machine:

  pkgs.installed <- as.data.frame(installed.packages())
  pkgs.installed <- pkgs.installed[is.na(pkgs.installed$Priority), ]
  write.csv(pkgs.installed[, c('Package', 'Version', 'Depends', 'Imports')], 'pkgs.installed.csv', row.names = FALSE)

- import the list on another machine to install the corresponding packages:

  pkgs.installed <- read.csv('pkgs.installed.csv')
  install.packages(pkgs.installed$Package)
- `install.packages('pkgname')` installs the specified package
- `search()` shows the packages that are currently attached
- `library(pkgname)` makes the complete functionality of the specified package available in the current environment
- `require(pkgname)` equivalent to `library`, but returns FALSE instead of throwing an error when the package is not available
- `pkgname::funname` calls a function without attaching its package
- `vignette()` lists the vignettes available in the installed packages
- `vignette('pkgname')` opens the vignette(s) for the specified package
- `browseVignettes()` browses all available vignettes in the browser
- `browseVignettes('pkgname')` browses the vignettes of the specified package
- `detach('package:pkgname', unload = TRUE)` removes the package from the search path
- `remove.packages('pkgname')` deletes the package from the system
- `update.packages('pkgname')` updates the package
- `installed.packages()` lists all the packages that are installed on the local system
- `available.packages()` lists all packages stored on CRAN that are available to be installed on the local system
The above commands are related to packages officially deployed to CRAN. It's also possible to install packages stored on GitHub using the devtools package:

devtools::install_github('user/reponame')
Core vs Base packages
The following is the list of all base packages:
- base
- compiler
- datasets
- graphics
- grDevices
- grid
- methods
- parallel
- splines
- stats
- stats4
- tcltk
- tools
- translations
- utils
The following is the list of all recommended packages for the current 3.4.4 version, which, together with the above base list, forms what is called core R:
- KernSmooth
- MASS
- Matrix
- boot
- class
- cluster
- codetools
- foreign
- lattice
- mgcv
- nlme
- nnet
- rpart
- spatial
- survival
datasets
Once R is started, there are lots of example data sets available, within R itself and from loaded packages. You can list the data sets by their names and then load a data set into memory to be used in your statistical analysis.

- `data()` lists all the data sets contained in all the packages currently loaded
- `data(dtsname)` loads the dataset called dtsname
- `data(package = .packages(all.available = TRUE))` lists all the data sets in all the packages installed on the local system

Some notable datasets are: iris and mtcars from base R, diamonds from the ggplot2 package, nasa and storms from the dplyr package, and nycflights from the nycflights13 package.
Constants
- A character constant is any string delimited by single quotes (apostrophes) `'`, double quotes (quotation marks) `"` or backticks (backquotes or grave accents) `` ` ``. They can be used interchangeably, but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes.
- Escape sequences inside character constants are started using the backslash character `\` as escape character (an escape character is a character which invokes an alternative interpretation of the subsequent characters in a character sequence). The only admissible escape sequences in R are the following:
  - `\n` newline
  - `\r` carriage return
  - `\t` tab
  - `\b` backspace
  - `\\` backslash
  - `\'` single quote
  - `\"` double quote
  - `` \` `` backtick
  - `\a` alert (bell)
  - `\f` form feed
  - `\v` vertical tab
  - `\nnn` character with given octal code (1, 2 or 3 digits)
  - `\xnn` character with given hex code (1 or 2 hex digits)
  - `\unnnn` Unicode character with given code (1--4 hex digits)
  - `\Unnnnnnnn` Unicode character with given code (1--8 hex digits)
Variables
Assignment of values to variables can be done in a few ways:
var <- value
value -> var
var <- value -> var
assign('var', value)
Note that even though the usual equals sign = is recognized, its use for assignment is highly discouraged, and it should only be used to assign values to arguments in a function call (or to parameters during definition). Moreover, = has lower precedence than <-, so the two should not be mixed in the same command.
Identifiers for variables consist of a sequence of letters, digits, the period . and the underscore _. They must not start with a digit nor underscore, nor with a period followed by a digit. The following reserved words are not valid identifiers:
`TRUE`, `FALSE`, `NaN`, `NULL`, `Inf`, `if`, `else`, `repeat`, `while`, `for`, `in`, `next`, `break`, `function`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_`, `...`, `..1`, `..2`, ...
To display the content of an object to the console, just type its name.
- `ls()` lists all objects
- `rm(x)` removes the object named x
- `rm(list = ls())` removes ALL objects in the current environment
Data Types
Variables
A basic concept in all programming languages is the variable. It allows the coder to store a value, or more generally an object, so that it can be accessed later using its name. The great advantages of doing calculations with variables are reusability and abstraction.
In R it is possible to use either the statement x <- obj or the statement assign('x', obj) to assign the object obj to the variable named x. If x already exists, its old value is overwritten with the new value obj. Note that R is case sensitive! So x and X are considered two different variables.
To print out the value of a variable, it suffices to write its name, if working from the console, or to use the command print if from a script. Notice that R does not print the value of a variable to the console when assigning it. To assign a value and simultaneously print it, the assignment should be surrounded by parentheses, as in (x <- obj).
Variables can be of different nature or type, according to the nature of the object they store. To know more about their data type, here are some functions that can help:

- `class(x)`, `typeof(x)`, `mode(x)`: functions that return the type of an object
- `is.type(x)`: functions that test whether an object is of a given type (e.g. `is.numeric(x)`), returning TRUE or FALSE

It is often necessary to change the way that variables store their object(s), something called coercing or casting:

- `as.type(x)` (e.g. `as.numeric(x)`)
Only certain coercions are allowed though, and some of them, even if possible, lead to loss of information. All integer, numeric, and logical values are coercible to character.
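A few coercions in action, including one that loses information and one that fails:

as.numeric('3.14')   # 3.14
as.integer(3.99)     # 3: the fractional part is dropped
as.character(42)     # "42"
as.numeric('abc')    # NA, with a warning: not every coercion succeeds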
Specific types of value:

- `NA`, tested with `is.na()`. It should be noted that the basic NA is of type logical. There are also other NAs for the other core types: `NA_integer_`, `NA_real_`, `NA_character_`, `NA_complex_`
- `NULL`, tested with `is.null()`
- `Inf`, tested with `is.infinite()`; for example `Inf/n`, where n is any finite number
- `NaN`, tested with `is.nan()`; for example `0/0` and `Inf/Inf`
Testing for and resolving missing values:

- `anyNA(x)` returns TRUE if x contains at least one NA
- `na.omit(x)` returns x with its NA elements removed
- `na.exclude(x)` like na.omit, but differs in how residuals and predictions are padded by model functions
- `na.fail(x)` fails (throws an error) if even one element of x is NA
- `na.pass(x)` ignores any NA value in x
- `na.rm = TRUE`: some functions admit this argument. It is usually set to FALSE by default, so the function will return NA whenever x contains an NA, unless na.rm is set to TRUE
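A quick demonstration of the NA-handling tools above:

x <- c(1, NA, 3)
anyNA(x)               # TRUE
sum(x)                 # NA
sum(x, na.rm = TRUE)   # 4
na.omit(x)             # 1 3, plus an attribute recording the removed position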
Character
Any type of text surrounded by single or double quotation marks indicates that the variable is of type character.

- `tolower(x)`, `toupper(x)` convert a string to lower or upper case
Numeric
Integer
When R encounters a number, it automatically assumes that it's a numeric, whatever the value. To force R to store an integer value as integer type, you have to append L to the number.
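For example:

class(7)    # "numeric"
class(7L)   # "integer"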
Logical
Under the hood, the logical values TRUE and FALSE are coded respectively as 1 and 0. Therefore:
- `as.logical(0)` returns FALSE, and `as.numeric(FALSE)` returns 0
- `as.numeric(TRUE)` returns 1, but `as.logical(x)` returns TRUE for every `x != 0`
Factors
R provides a convenient and efficient way to store values for categorical variables, which take on a limited number of values, herein called levels.
To create a factor in R, use the factor(x) function, where x is the vector to be converted into a factor. This simple way to define a factor lets R order the levels alphabetically, implicitly using sort(unique(x)). A different order can be specified by passing a convenient vector through the levels argument.
If not specified otherwise, the above order is actually not meaningful, and R complains (returning NAs) if relational operators are applied. But if the order itself has a true meaning, in the sense that the underlying variable is at least ordinal, it's possible to set the ordered argument to TRUE. In this case it's also good practice to specify the correct levels. To force an order on an existing unordered factor, it is possible to use the ordered function, as in ordered(f, levels = c(...)).
The general form of the function is:
factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
Once a factor is defined, the unique values of the original variable are mapped to integers, and ...
- `levels(f)` lists all unique values taken on by f (NAs are ignored though)
- `levels(f) <- v` renames the levels of f to a different set of values. Note that it has to be `length(v) = length(levels(f))`, and that the elements in v are associated with the levels by their positions
- `as.numeric(f)` lists the numeric codes associated with the values taken on by f
- `summary(f)` now returns a frequency distribution of the underlying variable
- `plot(f)` now returns a bar chart of the underlying variable
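A short example of the functions above:

f <- factor(c('low', 'high', 'low', 'mid'))
levels(f)       # "high" "low" "mid": alphabetical by default
as.numeric(f)   # 2 1 2 3
summary(f)      # high 1, low 2, mid 1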
The factor type could be used not only to store and manage categorical variables, but also numerical discrete variables, directly, and even continuous, once they have been discretized, a result that can be easily achieved using the cut function:
- `f <- cut(x, breaks = n)`: x is grouped into n evenly spaced buckets
- `f <- cut(x, breaks = c(x1, x2, ..., xn))`: x is grouped into n - 1 bins using the n specified limit values
The above process could also be used to group the values of a discrete variable that assumes too many values, to more easily analyze them.
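For instance, discretizing a small numeric vector:

x <- c(1, 7, 12, 25, 33)
cut(x, breaks = 3)                  # three evenly spaced bins
cut(x, breaks = c(0, 10, 20, 40))   # (0,10] (0,10] (10,20] (20,40] (20,40]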
Often, during the analysis, we encounter factors that have some levels with very low counts. To simplify the analysis, it often helps to drop such levels. In R, this requires two steps:
- first filtering out any rows with the levels that have very low counts,
- then removing these levels from the factor variable with `droplevels`.
This is because the droplevels() function would keep levels that have just 1 or 2 counts; it only drops levels that don't exist in a dataset.
A similar behaviour happens when subsetting a factor: R removes the instances but leaves the level behind! In this case, though, we can directly add `drop = TRUE` to the subsetting operation to have the empty level(s) deleted along with their values.
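A sketch of both behaviours:

f <- factor(c('a', 'a', 'b', 'c'))
f2 <- f[f != 'c']
levels(f2)                         # "a" "b" "c": the empty level remains
levels(droplevels(f2))             # "a" "b"
levels(f[f != 'c', drop = TRUE])   # "a" "b": dropped during subsetting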
R's default behavior when creating data frames is to convert all characters into factors, often causing headaches to users trying to figure out why their character columns are not working properly... To turn off this behavior, simply add the argument `stringsAsFactors = FALSE` to the `data.frame` call. By the way, `data.table` does NOT convert characters to factors automatically.
To assign labels to levels, according to the actual values in the data: fct <- factor(fct, levels = lvls, labels = lbls)
To change the labels of levels, according to the order they are already stored: levels(fct) <- labels
To order the levels of a factor by decreasing frequency:

fct <- factor(fct, levels = names(sort(table(fct), decreasing = TRUE)))
Dates and Times
There are two main objects to represent dates and times in R:

- Date for calendar dates, with the standard format being the ISO `yyyy-mm-dd`, but R automatically recognizes `yyyy/mm/dd` as well
- POSIXct/POSIXlt for date, time and timezone. The standard format in this case is `yyyy-mm-dd hh:mm:ss`
Under the hood, both the above classes are simple numerical values, as the number of days, for the date class, or the number of seconds, for the POSIX objects, since the 1st January 1970. The 1st of January in 1970 is the common origin for representing times and dates in a wide range of programming languages. There is no particular reason for this; it is a simple convention. Of course, it's also possible to create dates and times before 1970; the corresponding numerical values are simply negative in this case.
- `unclass(x)` returns the number of days, if x is Date, or of seconds, if x is POSIXct, since '1970-01-01'
- `Sys.Date()` returns the current date; `date()` returns the current date and time as a character string
- `format(x, format = '')` converts the date to character, following the specified format string
- `weekdays(x)` returns the day(s) of the week for every element in x
- `months(x)` returns the month(s)
- `quarters(x)` returns the quarter(s) (in the form Qn)
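Putting the date functions to work:

d <- as.Date('2018-03-15')
unclass(d)                       # 17605: days since 1970-01-01
weekdays(d)                      # "Thursday"
quarters(d)                      # "Q1"
format(d, format = '%d/%m/%Y')   # "15/03/2018"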
Minor data types
- complex
- raw
Data Structures
- `str(x)` displays information about the internal structure of the object x
- `head(x, n = 6)` lists the first n elements of the object x
- `tail(x, n = 6)` lists the last n elements of the object x
Vector
A set of values of the same type.
The usual way to create a vector by hand is to combine, concatenate, or collect its elements using the function c. Note that all elements must share the same type, or they will be coerced to a common one.
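For example:

v <- c(1, 5, 9)   # numeric vector
c('a', 'b')       # character vector
c(1, 'a')         # "1" "a": elements are coerced to the common character type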
Vectorization
Recycling
Sequences
- `start:stop` ≡ `c(start, start + 1, ..., stop - 1, stop)`: the colon is the easiest way to create a sequence of integers
- `seq()`, `seq_along()`
Matrix
A matrix is simply a vector with a dimension attribute attached. The dimension could be applied using the dim command:
dim(x) <- c(nrows, ncols)
or the matrix command:
X <- matrix(x, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
where nrow and ncol define the dimensions, byrow controls whether the matrix is filled by row or by column, and dimnames optionally assigns names to rows and columns.
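Both routes in action:

x <- 1:6
dim(x) <- c(2, 3)                          # 2 rows, 3 columns, filled by column
X <- matrix(1:6, nrow = 2, byrow = TRUE)   # same data, filled by row instead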
Array
Dataframe
A horizontal or column binding of named vectors of the same length.
List
Data I/O
- In most cases, empty cells, or placeholders, are treated as missing values, which are automatically converted by R to the special value NA.
- Any character column is actually converted into a factor, unless `stringsAsFactors = FALSE`.
- `read.table(fname, sep = '', header = FALSE, dec = '')` is the main command for reading a text file in table format, while creating a dataframe out of it
  - `read.csv()` reads a CSV (Comma Separated Values) file; implies `header = TRUE`, `sep = ','` and `dec = '.'`
  - `read.csv2()` reads a CSV file; implies `header = TRUE`, `sep = ';'` and `dec = ','`
  - `read.delim()` reads a TSV (Tab Separated Values) file; implies `header = TRUE`, `sep = '\t'` and `dec = '.'`
  - `read.delim2()` reads a TSV file; implies `header = TRUE`, `sep = '\t'` and `dec = ','`
  - `read.fwf()` reads a file with fixed-width formatted columns
- `write.table(x, 'fname', row.names = TRUE, col.names = TRUE)` writes the object x to a text file
- Datasets originating from other software, like SPSS, SAS and Minitab, can be loaded after they have been converted into one of the above formats. Alternatively, it's possible to use the foreign package: `foreign::read.spss(x, use.value.labels = TRUE)`
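A hypothetical round trip (the file name is a placeholder):

df <- data.frame(id = 1:3, score = c(8.5, 9.1, 7.3))
write.csv(df, 'scores.csv', row.names = FALSE)
df2 <- read.csv('scores.csv', stringsAsFactors = FALSE)
str(df2)   # same structure, with score read back as numeric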
Programming
Functions
fun.name <- function(arg.req, arg.opt = val.default){
code
return(object)
}
A function returns the object specified in the return call, if present; otherwise, the result of the last executed expression. There can be multiple return calls in a function's body.
There are three main parts to a function:
- arguments, `formals(fun.name)`: the user inputs that the function works on. They can be the data that the function manipulates, or options that affect the calculation.
  Functions can have multiple arguments, separated by a comma (followed by a single space).
  Some or all of the arguments can be optional, if a default value is set for them using an equal sign surrounded by spaces.
  Arguments can be specified by position or by name. By style convention, the first argument, if required, is not named. However, beyond the first couple of arguments you should always use matching by name: it makes the code much easier for everyone to read. This is particularly important if the argument is optional, because it has a default. When overriding a default value, it's good practice to use the name.
  - Every time a function is called, it's given a fresh and clean environment, first populated with the arguments.
  - Once the function has finished its job, its entire environment is destroyed.
  - If a variable present in the code, but not defined therein, is not passed in as an argument, R looks for it outside of the function environment, starting one level up and moving towards the global environment. If it's not found in the global environment either, R throws an error.
  - Any variable defined inside a function does not exist outside of its scope.
  - A variable defined inside a function can have the same name as a variable defined outside the function, but they are completely different objects with different scoping, and the former takes precedence over the latter only inside the function.
- body, `body(fun.name)`: the code that actually performs the manipulation.
- scope, `environment(fun.name)`.
  Scoping is the process of how R looks for a variable's value when given a name. It is usually partitioned as global vs local.
  Every variable defined inside a function is bound to live only inside that function. If you try to access it outside of the scope of that function, you will get an error, because it does not exist there!
  If a variable is not found inside the function, but it exists in the global environment, it is passed into the function by value, meaning that the function can play with it but can't change the original value, even if that's what happens inside the function.
- arguments: required vs optional.
  Optional arguments are ones that don't have to be set by the user, either because they are given a default value, or because the function can infer them from the other data. Even though they don't have to be set, they often provide extra flexibility.
- arguments vs parameters
- In R, functions are objects just like any other object. In particular, they can be arguments to other functions!
- documentation: `help(fun.name)`, `?fun.name`
- nested functions.
  It's often useful to use the result of one function directly in another one, without having to create an intermediate variable.
- anonymous functions
- Function names should be chosen to be descriptive of the action taken, and because of that should always be expressed as verbs.
- Argument names should be expressed as nouns, and must not clash with the names of core R objects.
  Aside from giving your arguments good names, you should put some thought into the order your arguments are in, and whether they should have defaults or not. Arguments are often one of two types:
  - data arguments, that supply the data to compute on
  - detail arguments, that control the details of how the computation is done
  Generally, data arguments should come first, while detail arguments should go at the end, and usually should have default values.
- If in need of displaying messages about the code, it is bad practice to use `cat` or `print`, functions designed just to display output. The official R way to supply simple diagnostic information is the `message` function. The unnamed arguments are pasted together with no separator (and no need for a newline at the end) and by default are printed to the screen.
- Notice that there are three kinds of functions in R:
  - most functions are called closures
  - language constructs are known as special functions
  - a few important functions are known as builtin functions; they use a special evaluation mechanism to make them go faster
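A minimal sketch pulling the pieces above together (the function itself is hypothetical):

describe <- function(x, digits = 2) {   # digits is optional: it has a default
  m <- mean(x)                          # m is local to the function
  round(m, digits = digits)             # the last expression is the return value
}
describe(c(1, 2, 4))               # 2.33: argument matched by position
describe(c(1, 2, 4), digits = 0)   # 2: detail argument matched by name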
Colloquials
- `stopifnot(condition)`
- `stop('message', call. = TRUE)`

Using `stop()` inside an `if` statement that checks the condition, instead of the simpler but often obscure `stopifnot()`, allows you to specify a more informative error message. Writing good error messages is an important part of writing a good function. We recommend your error message tells the user what should be true, not what is false.
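A hedged sketch of this pattern (the function is hypothetical):

geom_mean <- function(x) {
  if (any(x <= 0)) {
    stop('x must contain only positive values', call. = FALSE)
  }
  exp(mean(log(x)))
}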
Side Effects
Side effects describe what happens when running a function that alters the state of your R session. If foo() is a function with no side effects (a.k.a. pure), then when we run x <- foo(), the only change we expect is that the variable x now has a new value. No other variables in the global environment should be changed or created, no output should be printed, no plots displayed, no files saved, no options changed. We know exactly the changes to the state of the session just by reading the call to the function.
Of course functions with side effects are crucial for data analysis. You need to be aware of them, and deliberate in their usage. It's ok to use them if the side effect is desired, but don't surprise users with unexpected side effects.
Robustness
A desired class of functions is the set of so-called pure functions, characterized by the following good traits:
Unstable types
A function is called type-inconsistent when the type or class of its output depends on the type of its input. This unpredictable behaviour is a sign that you shouldn't rely on type-inconsistent functions inside your own functions, but use only alternative functions that are type-consistent!
Typical examples of such functions are the following:
- `[`, which when applied to a dataframe can return a vector or a dataframe, depending on the dimension of the input. A simple way to overcome this problem is adding a `drop = FALSE` argument to the call.
- `sapply`. The easiest way to avoid the situation is using `vapply` instead, or better any of the `map` functions from the purrr package.
Most of the time, switching to stable functions means that we have to accept failures, and possibly write some informative error message to return, or write additional conditionals to decide which action to undertake in case of the above failures.
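The `[` case in a nutshell:

df <- data.frame(a = 1:3, b = letters[1:3])
class(df[, 'a'])                 # "integer": the single column drops to a vector
class(df[, 'a', drop = FALSE])   # "data.frame": type-consistent subsetting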
Non-Standard Evaluation (NSE)
To avoid the problems caused by non-standard evaluation functions, you could avoid using them. In our example, we could achieve the same results by using standard subsetting (for example, the standard bracket subsetting `[` instead of `dplyr::filter()`). But if you do need to use non-standard evaluation functions, it's up to you to provide protection against the problem cases. That means you need to know what the problem cases are, to check for them, and to fail explicitly.
Hidden Arguments
A classic example of a hidden dependence is the stringsAsFactors argument to the read.csv() function (and a few other data frame functions). That's because if the argument is not specified, it inherits its value from getOption("stringsAsFactors"), a global option that a user may change. In general, you want to avoid having the return value of your own functions depend on any global options. That way, you and others can reason about your functions without needing to know the current state of the options.
It is, however, okay to have side effects of a function depend on global options. For example, the print() function uses getOption("digits") as the default for the digits argument. This gives users some control over how results are displayed, but doesn't change the underlying computation.
Conditionals
IF
- `if(condition) { code when TRUE }`
- `if(condition) { code when TRUE } else { code when FALSE }`
- `if(condition) { code when TRUE } else if(condition2) { code when condition2 TRUE } else { code when ALL previous FALSE }`
- `break`
- `ifelse(condition, code when TRUE, code when FALSE)`: it is vectorized!
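Contrasting the scalar and vectorized forms:

x <- c(-2, 5, 0)
ifelse(x > 0, 'pos', 'non-pos')                          # elementwise: "non-pos" "pos" "non-pos"
if (all(x > 0)) 'all positive' else 'not all positive'   # single scalar condition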
Switch
Repetitions or Loops
Repeat
It's a basic infinite repetition of the inner code, with the only way to exit being a break statement triggered by some condition.
# the code in this loop run at least once
repeat {
code
if(condition) break
}
# the code in this loop could possibly never run
repeat {
if(condition) break
code
}
While
break
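A minimal while sketch:

i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1   # without this update the condition would never turn FALSE
}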
For
When you know how many times you want to repeat an action, a for loop is a good option. The idea of the for loop is that you are stepping through a set or a sequence, one element at a time, and performing an action at each step along the way. Moreover, using nested loops it is possible to loop over multidimensional objects, like matrices.
for(elem in set){ code }
for(index in seq){ code }
for(i in seq1){ for(j in seq2){ code } }
- `next`: skips only the current iteration, and continues the loop
- `break`: breaks out of a for loop completely
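next and break in a single loop:

for (i in 1:5) {
  if (i == 2) next    # skip iteration 2
  if (i == 4) break   # leave the loop entirely
  print(i)            # prints 1 and 3
}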
Apply Family
lapply() is possibly the best known, and most used, vectorized version of the classic for loop, and is often preferred (and simpler) in the R world because of its readability and efficiency. As a matter of fact, it's just one member of an entire family of functions that help build repetitions of functions.
- `apply(x, MARGIN, FUN, ...)`
- `lapply(X, FUN, ...)`
  `lapply`, short for list apply, is usually the first and arguably the most commonly used function of R's wide apply family. In general, `lapply(x, funname, ...)` takes a vector or list x, and applies funname to each of its members. If funname requires additional arguments, you pass them after you've specified x and funname.
  The output of lapply is always a list, of the same length as x, where each element is the result of applying funname to the corresponding element of x.
  The function can be one of R's built-in functions, but it can also be a function written by the user. This self-written function can be defined beforehand, or can be inserted directly as an anonymous function, as follows:

  lapply(v, function(x) { code block })

  Because dataframes are essentially lists under the hood, with the columns of the dataframe being the elements of the underlying list, calling lapply with a dataframe as argument applies the specified function to each column of the data frame (see the sketch after this list).
- `sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)`
  Many of the operations performed by `lapply` on the specified vector or list will generate a list whose elements all have the same type and length. Such a list could just as well be represented by a vector, or a matrix. That's exactly what `sapply`, standing for simplified lapply, does: it first performs `lapply`, and then checks whether the result can be correctly represented as an array, which is the case when the lengths of the elements in the output are all the same. If the simplification is not possible, the output is identical to the output the parent `lapply` would have generated, without any warning. Hence, sapply is not a robust tool, because the output structure tends to depend on the inputs, and there's no way to enforce a different behaviour. The usual suggestion is that it's a great tool for interactive use, but not such a safe tool to be included in functions or, worse, in production.
  Another neat feature of sapply is that if it can gather sufficient information from the input, be it a named vector or list, it then tries to meaningfully name the elements of the output structure. To avoid this behaviour, add the argument USE.NAMES = FALSE.
- `vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)`
  With vapply the user can define the exact structure of the output, so as to avoid the unpredictable behaviour typically incurred when using sapply. The additional argument FUN.VALUE expects a template for the return value of the function FUN, such as datatype(n), where datatype can be integer, numeric, character or logical, while n is the length of the result. When the structure of the output of the function does not correspond to the template, vapply throws an error that informs about the mismatch between expected and actual output. vapply can then be considered a more robust and safer version of sapply, and converting all sapply calls to a vapply version is therefore good practice. There's really no reason to use sapply: if the output that lapply would generate can be simplified to an array, you'll want to use vapply, specifying the structure of the expected output, to do this securely. If simplification is not possible, simply stick to lapply.
- `tapply`
- `mapply`
- `rapply`
- `eapply`
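A combined sketch of the three main variants discussed above, using the built-in mtcars dataset:

lapply(mtcars[, 1:2], mean)                           # list: one mean per column
sapply(mtcars[, 1:2], mean)                           # simplified to a named vector
vapply(mtcars[, 1:2], mean, FUN.VALUE = numeric(1))   # same, with the output structure enforced
# vapply(1:3, function(x) c(x, x^2), numeric(1))      # error: output has length 2, not 1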
Debugging
- tryCatch
Functional vs Object Oriented
Object-oriented programming (OOP) is very powerful, but not appropriate for every data analysis workflow. OOP is usually suitable for workflows where you can describe exhaustively a limited number of objects.
One of the principles of OOP is that functions can behave differently for different kinds of object inputs. The summary() function is a good example of this. Since different types of variable need to be summarized in different ways, the output that is displayed to you varies depending upon what you pass into it.
R has many OOP frameworks, some of which are better than others. Knowing how to use S3 is a fundamental R skill. R6 and ReferenceClasses are powerful OOP frameworks, with the former being much simpler. S4 is useful for working with Bioconductor. Don't bother with the others.
S3
R6
ReferenceClasses
Descriptive Statistics
Summaries
Univariate
- frequency distribution: `table(x)`
- Location or Position
- Spread or variability
- Skewness
- Kurtosis
Bivariate
- association: crosstabs
- correlation
Charts
Univariate
- barplot, for categorical variables
- histogram
- boxplot
Bivariate
- highlight table
- scatterplot

  plot(V1, V2,
       main = 'plot title', xlab = 'x-axis label', ylab = 'y-axis label',
       col = 'colour', type = '', pch = shape, cex = size)

  # add more points
  points(V1, V2, ...)

  # add fit lines
  abline(lm(V2 ~ V1, data = dts), ...)   # regression line (y ~ x)
  lines(lowess(V1, V2), ...)             # lowess line (x, y)

  # add a legend
  legend(posx, posy, legend = c(), pch = c(), pt.cex = c(), col = c())

  # add text
  text(posx, posy, 'text')

  In the cases of legends and text, it's possible to use the function `locator(n)`: once the chart has been drawn, R will stop n times for the user to click over the position(s) where to draw the legend and text labels.

  More parameters can be set using the `par` function before running any plot command. In particular, par lets you create a grid to present more than one chart at a time:

  par(mfrow = c(nrows, ncols))   # fills the grid by rows
  par(mfcol = c(nrows, ncols))   # fills the grid by cols

  There are also different options when looking at additional packages:

  car::scatterplot(mpg ~ wt | cyl, data = mtcars,
                   xlab = 'Weight of Car', ylab = 'Miles Per Gallon',
                   main = 'Enhanced Scatter Plot', labels = row.names(mtcars),
                   reg.line = lm, smooth = TRUE, spread = TRUE,
                   boxplots = 'xy', span = 0.5)
Multivariate
- scatterplot matrix
Probability
Basics
Independence and Conditional Probability
The conditional probability of an event A, given that it's known that another event B has happened, is defined as
P(A|B) = P(A & B) / P(B)
Two processes, or events, are called independent if the outcome of any one of them doesn't affect the outcome of the other.
In formulas, A and B are independent if P(A|B) = P(A).
It's easy to prove that when the above happens it's also P(A & B) = P(A) * P(B)
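A quick simulation of the independence identity (a hedged sketch, using two fair coins):

set.seed(1)
A <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)
B <- sample(c(TRUE, FALSE), 1e5, replace = TRUE)
mean(A & B)         # ~0.25
mean(A) * mean(B)   # ~0.25: P(A & B) = P(A) * P(B) under independence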
Simulation with R
There are four main ways in R to get values related to probability distributions. The user needs to concatenate one of the following four prefixes:

- `r` generates n random numbers (if `length(n) > 1`, the length of n, instead of its value, is taken to be the number required)
- `d` computes the probability density function f(x), or p.d.f., where x is a vector of quantiles. For a discrete RV, d is simply the probability that X is exactly equal to x
- `p` calculates the (cumulative) distribution function F(q), or c.d.f., where q is a vector of quantiles
- `q` returns the quantile function Q(p), where p is a vector of probabilities. As the quantile is defined as the smallest value x such that F(x) ≥ p, where F is the distribution function, Q is also the inverse c.d.f., i.e. Q = F<sup>-1</sup>,
followed by the acronym of the desired distribution:
| suffix | distribution | parameters | type | mean | variance | |
|---|---|---|---|---|---|---|
| binom | [Binomial]() | size, prob | discrete | size * prob | size * prob * (1 - prob) | |
| pois | [Poisson]() | = | discrete | |||
| geom | [Geometric]() | = | discrete | |||
| hyper | [Hypergeometric]() | = | discrete | |||
| nbinom | [Negative Binomial]() | = | discrete | |||
| unif | [Uniform]() | = | continuous | |||
| norm | [Normal]() | mean = 0, sd = 1 | continuous |||
| t | [Student t]() | = | continuous | |||
| chisq | [Chi-Square]() | = | continuous | |||
| f | [F]() | = | continuous | |||
| exp | [Exponential]() | = | continuous | |||
| gamma | [Gamma]() | = | continuous | |||
| beta | Beta | = | continuous | |||
| cauchy | [Cauchy]() | = | continuous | |||
| logis | [Logistic]() | = | continuous | |||
| lnorm | [Log Normal]() | = | continuous | |||
| weibull | [Weibull]() | = | continuous | |||
| tukey | [Studentized Range]() | = | continuous | |||
| wilcox | [Wilcoxon Rank Sum Statistic]() | = | continuous | |||
| signrank | [Wilcoxon Signed Rank Statistic]() | = | continuous |
Additional arguments are:

- `log = FALSE`, for d
- `log.p = FALSE`, for p and q
- `lower.tail = TRUE`, for p and q
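The four prefixes at work on a couple of distributions:

dbinom(3, size = 10, prob = 0.5)   # P(X = 3) for X ~ B(10, 0.5)
pbinom(3, size = 10, prob = 0.5)   # P(X <= 3)
qnorm(0.975)                       # 1.959964: the 97.5% quantile of N(0, 1)
rnorm(5)                           # five random draws from N(0, 1)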
Binomial B(n, p)
Model for the number x = 0,1,...,n of successes in n trials, where n is the total number of trials, and p is the probability of success in each trial. The probability function for B(n, p) is given by:
p(x) = choose(n, x) p^x (1-p)^(n-x), (x = 0, …, n),
where the binomial coefficients can be computed in R using the function choose(n, x).
P[X == x; n, p] = dbinom(x, n, p) ~ mean(rbinom(bignum, n, p) == x)
P[X <= x; n, p] = pbinom(x, n, p) ~ mean(rbinom(bignum, n, p) <= x)
P[X >= x; n, p] = 1 - pbinom(x - 1, n, p) ~ mean(rbinom(bignum, n, p) >= x)
E[X] = n * p ~ mean(rbinom(bignum, n, p))
V[X] = n * p * (1 - p) ~ var(rbinom(bignum, n, p))
E[a + m * X] = a + m * E[X] ==> mean(a + m * rbinom(bignum, n, p)) ~ a + m * mean(rbinom(bignum, n, p))
V[a + m * X] = m^2 * V[X] ==> var(a + m * rbinom(bignum, n, p)) ~ m^2 * var(rbinom(bignum, n, p))
E[X + Y] = E[X] + E[Y]
==> mean(rbinom(bignum, nx, px) + rbinom(bignum, ny, py)) ~ mean(rbinom(bignum, nx, px)) + mean(rbinom(bignum, ny, py))
V[X + Y] = V[X] + V[Y] + 2 * cov(X, Y)
==> X, Y ind: var(rbinom(bignum, nx, px) + rbinom(bignum, ny, py)) ~ var(rbinom(bignum, nx, px)) + var(rbinom(bignum, ny, py))
X, Y ind => cov(X, Y) = 0
P[X = x or Y = y] = P[X = x] + P[Y = y] - P[X = x, Y = y]
= dbinom(x, nx, px) + dbinom(y, ny, py) - dbinom(x, nx, px) * dbinom(y, ny, py)
~ mean(rbinom(bignum, nx, px) == x | rbinom(bignum, ny, py) == y)
X, Y ind => P[X = x, Y = y] = P[X = x] * P[Y = y]
= dbinom(x, nx, px) * dbinom(y, ny, py) ~ mean(rbinom(bignum, nx, px) == x & rbinom(bignum, ny, py) == y)
# While the above math formula becomes quite cumbersome for more than just two events, the simulation formula in R extends fairly naturally:
P[X = x or Y = y or Z = z or ...] ~ mean( rbinom(bignum, nx, px) == x | rbinom(bignum, ny, py) == y | rbinom(bignum, nz, pz) == z | ...)
Poisson P(m)
Geometric G(p)
Normal N(m, s)
Sampling
When dealing with the statistical properties of a real distribution, it's often only possible to work with a sample drawn from it:
sample(x, size, replace = FALSE, prob = NULL)
where:
- x identifies the vector from which the sample should be drawn. If length(x) = 1, and x >= 1, then that vector is actually taken to be 1:x
- size is the number of items to choose. When replace = FALSE, size cannot be greater than x, if length(x) = 1, or than length(x) otherwise
- replace indicates whether sampling should be done with replacement
- prob: by default, all outcomes in x have equal probability, 1/x when length(x) = 1, or 1/length(x) otherwise. However, it's possible to set different probabilities by adding to the arguments a vector p of probability weights, one for each possible outcome in x, so that length(p) = x, when length(x) = 1, or length(p) = length(x) otherwise
The bigger the sample set, the more representative of the complete population, and thus the higher its accuracy.
Let's say now we are interested in a statistic T, and for that purpose we start drawing a sample x1 and calculate the value t1 = T(x1). Not surprisingly, we expect that every time we take another random sample xk, we get a different value tk = T(xk) for T. It's useful to get a sense of just how much variability we should expect when estimating T this way. The distribution of T, called the sampling distribution, can help us understand this variability. Using the sample command and a bit of iteration, it's easy to build a sampling distribution in R (pop being the population vector and T the function computing the statistic):

T.distr <- sapply(1:n_sim, function(i) T(sample(pop, size)))

Varying the n_sim argument, and plotting the histograms for different values, it's possible to infer whether there's an ultimate shape.
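For instance, the sampling distribution of the mean from a synthetic population:

set.seed(42)
pop <- rnorm(1e5, mean = 50, sd = 10)   # a synthetic population
T.distr <- sapply(1:1000, function(i) mean(sample(pop, 30)))
hist(T.distr)   # roughly normal, centered near 50
sd(T.distr)     # close to 10 / sqrt(30), the standard error of the mean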
Bootstrapping
While the sampling distribution is built by resampling from the population, the bootstrap distribution is built by resampling from the sample:
- Take a bootstrap sample (a random sample with replacement of size equal to the original sample size) from the original sample.
- Record the mean of this bootstrap sample.
- Repeat previous steps many times to build a bootstrap distribution.
- Calculate the XX% interval using the percentile or the standard error method.
set.seed(n) initializes the random number generator, making the random draws reproducible.
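A percentile-method sketch for the mean:

set.seed(1)
x <- rnorm(50, mean = 5)   # the original sample
boot.means <- sapply(1:2000, function(i) mean(sample(x, length(x), replace = TRUE)))
quantile(boot.means, c(0.025, 0.975))   # 95% percentile interval for the mean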
Statistical Inference
z test, Normal Distribution N(mu, sigma)
t test, Student's t Distribution t(ν)
ANOVA
Spatial
R core has no spatial capability. The minimum requirements are to load the maptools, rgdal, sp packages.
There exist two main ways to store spatial data:
- vector
- raster
There are three ... to depict spatial shapes:
- points
- lines
- polygons
Points
# csv files with two columns representing coordinates for each location
x <- read.csv('fname')
# transform x from simple dataframe to spatial object
coordinates(x) <- ~lon+lat
# apply a projection using a spatial reference ==> http://spatialreference.org/ or https://epsg.io/
proj4string(x) <- CRS("refsys")
# plot the points
plot(x)