R amazing data.table
Basics
A data.table is an enhanced version of the core data.frame. This means that (nearly) everything that works for a data.frame also works for a data.table
You create a data.table:
- from scratch in the same way you'd build a data.frame:
data.table(name = vector, ...) - converting from another structure: permanently with
as.data.table(df)or temporary usingsetDT(df) - reading in a text file in various formats:
fread(fname, ...) - by mean of the fst package:
read_fst(fname, as.data.table = TRUE)
The minimal configuration for dealing with a data.table is: DT[i, j, by] where:
ifilter rows by rownumber, logical condition, regex, ...jtake action on the table: select, update, ...bygroup rows to perform ad-hoc parttionedcalculations
Besides the above parameters, the data.table syntax allows the following additional parameters:
keyandkeybywithwhichallow.cartesianrollandrollends.SDand.SDcolson,mult,nomatchare used to indicate how to join two data.tables.
On top of parameters, data.table also provides specific functions and operators to facilitate data wrangling:
%like%%between%%chin%a fast filtering for character columns.Nreturns the last row number of the dataset.Ireturns the current row number.BY.EACHsetnames(DT, oldnames, newnames)to change column names provided byoldnamesintonewnames, whereoldnamesandnewnameshave the same length and strings inoldnamesare all valid column names forDTsetnames(df, cnames)to substitute all columns names, with the length ofcnamesequals to the number of columns ofDTsetcolorder(DT, cnames)to change the order of columns, putting cnames in front of the dataset.setkey(DT, cname)to setcnameas an index forDTsetkeyv(DT, cnames)to setcnamesas indices forDTsetindex(DT, cnames)setorder(DT, cnames)to ordershift()frank()tables()indices(DT)
Finally, there are some options that could be applied to change data.table behaviour:
datatable.auto.indexdatatable.use.index- ``
- ``
- ``
- ``
Basic understandings:
- column names are viewed as variables when referred to inside the square brackets, so they do not need the container name
- in selection J and grouping by, columns names must be enclosed in a list, do not need quotes, and can be renamed on the spot without enclosing new names in quotes as well:
DT[, list(var1, newvar2 = var2), list(grp1, newgrp2 = grp2)] - the
list()notation injandbycan be substitute with the shorter alias.() - to exclude objects, both in filtering and selection, use
-or! - when filtering or grouping, DT build an index on the involved column(s), which is kept for future reference. This means that subsequent queries on the same column(s) are much faster ==> run
indices(DT)to see - when selecting only one column the result is still a data.table. To get a vector back, do not wrap the column name as a list. More specifically:
df[ , colname]returns a vectordf[ , .(colname)]returns a data.table
- calculations and transformations happen directly in j, after having performed, if existing, the request in i and the grouping in by.
- when grouping, it's possible to also group by a computed variable on the spot
- successive operations should be chained one over the other whenever possible. just be careful
-
to add/update/delete a column use the data.table own
:=operator, that manages all operations by reference instead of using a hidden deep copy:- For one column only:
DT[, varname := value] - for multiple columns there are two ways:
- the LHS := RHS form:
DT[, c('varname1', 'varname2', ...) := .(value1, value2, ...)] - the functional form:
DT[,:=( varname1 = value1, varname2 = value2, ...)]
- the LHS := RHS form:
To delete
varnameuseNULLas value in any of the above forms.It's also possible to operate an update on a subset adding a filter condition in i, or updating separately block of rows while performing grouped aggregations combining
:=with a convenient by condition. It's worth noting that as an analyst you often need to create new intermediate columns to get to the final result, that should be deleted before presenting the final result. For example, a typical process in data wrangling, namely extraction of can be easily peformed with data.table as:DT[][][, := NULL]Similarly, imputing missing values for a numeric variable using the average of the group it
DT[, mtmp := mean(varname, na.rm = TRUE), .(grpvar = some_calculation)][is.na(varname), varname:= mtmp][, mtmp := NULL]Notice that
valuecan actually take a shape of an entire function, when enclosed all the code in braces. In this case, for more clarity, it is anyway a better choice to define previously a helper function, and equal value as a call to that function. - For one column only:
Special helpers for searching and filtering:
%in%same as R core: value matching that returns a vector of the positions of (first) matches of its first argument in its second%chin%a much faster version of%in%only for character vectors%like%allows you to search for a (regex) pattern in a character or a factor vector (usage:col %like% pattern). In particular, the metacharacters^and$indicate that we're searching for strings that begins or ends with the specified pattern.%between%allows you to search for numeric values in the closed interval [val1, val2] (usage:numeric_col %between% c(val1, val2)). Much faster than using the classic structure:x >= val1 & x <= val2
Special symbols and functions:
.N: contains the number of rows (nrow(DT)) in the table (for example, DT[.N] returns the last row).- To add a column id:
DT[, id := 1:.N].
- To add a column id:
uniqueN(col1, ...)return the number of unique values in the specified column or combination of columns.SDstands for _S_ubset of _D_ata, it's a list containing.SDcolsholds the columns that should be included in .SD. By default contains all columns..I[]
Special functions:
shift()
fread
Unique command for reading in file or from different sources (local or remote file, csv/tsv/txt/fwf/..., ...)
Useful arguments:
- `header = FALSE'
nrow = nnumber of rows to read (beyond the header)skip = n|charnumber of rows not to read at the top or untilcharis foundas iscolClasses = charwherecharcan be:c('type1', 'type2', ..., 'typen'): a character vector listing the types for all the n columns in DT, in the same order they appearlist(type = intseq, ...): a named list for all or some of the columns in DT, where the names are one or more data types and the corresponding values are the positions for the columns with that typec('colname' = type, ...): a named list, where the names are variable names and values the corresponding data types
colNamesselect = cols(not to be used withdrop)drop = cols(not to be used withselect)fill = FALSEna.string = char' wherechar` is a list of values to be coded as missing
Tips and tricks (version >= 1.9.8)
-
select column(s) using their numeric position
# select single column dts[, i] # select multiple columns dts[, n1:n2] dts[, c(n1, n2, n3:n4)] -
select column(s) using their names
# select single column dt[, 'X'] # select multiple columns dt[, c('X', 'Y')] dt[, paste0('X', 1:10)] -
negate selection of one or more column(s)
-
substitute any cells under condition with some value:
dt[condition <- valueFor example:
- substitute all
NAwith0:dt[is.na(dt)] <- 0 - substitute all cells equal to
-withNA:dt[dt == '-'] <- NA
- substitute all
-
substitute one or more charachter in all dataset with some other value:
dt <- dt[, lapply(.SD, function(x) gsub('old', 'new', x))] -
calculate a function by group while subsetting by group as well
dt[, .SD[condition, fun(x)] , by = grp]Notice that
dt[condition, fun(x), by = grp]first subsets the entire dataset based oncondition, then appliesfunby group -
select only columns of a certain type
xxxdts[, sapply(dts, is.xxx), with = FALSE] dts[, .SD, .SDcols = sapply(dts, is.xxx)] -
cast columns of a certain type
xxxinto typeyyycols <- colnames(dts))[which(as.vector(dts[, lapply(.SD, class)][1]) == 'xxx')] if(length(cols) > 0) dts[, (cols) := lapply(.SD, as.yyy), .SDcols = cols] -
select (negate) rows by position(s)
dts[-n] dts[-n1:n2] dts[-c(n1, n2:n3)] -
Extract most and least popular records for (column A) when grouped by any combination of (column B, C, ...)
DT[order(columnA), .(columnA = columnA[c(1, .N)]), .(columnB, columnC, ...) ] -
Compute most n popular records for (column A) when grouped by any combination of (column B, C, ...)
DT[, .N, .(columnA, columnB, columnC, ...)][order(-N)][, .(columnA = head(columnA, n)), .(columnB, columnC, ...) ]For worst n records use
tailinstead ofhead -
Find the record corresponding to the shortest trip per month
cols <- c("start_station", "end_station", "start_date", "end_date", "duration") DT[, .SD[which.min(duration)], .(month(start_date)), .SDcols = cols] -
Find the total number of unique start stations and zip codes per month
DT[, lapply(.SD, uniqueN), .(month(start_date)), .SDcols = c("start_station", "zip_code")] -
col_mult <- ... # column to use as common multiplier cols <- c(...) # list of columns affected dts[, cols := lapply(.SD, function(x) = x * col_mult), .SDcols = cols ] -
sum multiple non-contigous columns based on condition over their names:
cols <- names(dts)[grepl('^[what]', tolower(names(dts)))] # prefix cols <- names(dts)[grepl('[what]$', tolower(names(dts)))] # suffix dts[, cname := rowSums(.SD), .SDcols = cols ] -
join tables with key columns having common name(s):
dts[lkp, on = 'y'] dts[lkp, on = c('y1', 'y2')] -
join tables with key columns having different names (yes, only one key has quotes):
dts[lkp, on = c(kdts = 'klkp')] dts[lkp, on = c(k1dts = 'k1lkp', k2dts = 'k2lkp')] -
update one column
xfor some rows only based on the values found in a lookup tablelkpand related to another columny:dts[y %in% lkp[, y], x := lkp[.SD[['y']], .(x), on = 'y'] ]it looks like it doesn't work anymore. Try this:
lkp[dts, on = 'y'][is.na(x), x := i.x][, i.x := NULL] -
update multiple columns
x1, x2for some rows only based on the values found in a lookup tablelkpand related to another columny:dts[y %in% lkp[, y], c('x1', 'x2') := lkp[.SD[['y']], .(x1, x2), on = 'y'] ] -
select the first n rows by group (by efficiency):
dts[dts[, .I[1:n], grp]$V1] dts[, .SD[1:n], grp] dts[, head(.SD, n), grp] -
select the rows in positions n1 to n2 by group:
dts[dts[, .I[n1:n2], grp]$V1] dts[, .SD[n1:n2], grp] -
select all records with a minimum (maximum, or other condition; adapt as necessary) value by group:
dts[dts[, .I[which.min(cname)], grp]$V1] dts[, .SD[which.min(cname)], grp] -
split a column into multiple parts, based on some character(s), adding simultaneously to dataset:
dts[, c('fld1', 'fld2') := tstrsplit(colname, 'sep', fixed = TRUE)] -
when updating/adding column in a derived table, the result is not displayed and is lost, so we usually have to actually store it to be able to see it. To avoid this, just add
[]at the end of the calculation. -
calculate absolute frequency, relative frequency, and percentages distributions:
y <- dts[, .(abs = sum(fldname)), grpname][, `:=`(rel = round(abs / sum(abs), 3), pct = round(100 * abs / sum(abs), 1))] -
extract
nrowsrandom rows fromfilename(does not work on Windows)fread('shuf -n nrows filename') -
download a zip file and open only a specific csv in it:
download.file(fileurl, destfile = 'temp.zip') dts <- fread('unzip -pq temp.zip path/to/filename.csv')Many other variants are possible. The following open the biggest file in the archive:
lst <- unzip('temp.zip', list = TRUE) dts <- fread(paste('unzip -pq temp.zip', lst[order(lst$Length, decreasing = TRUE), 1][1] ) ) -
return the row numbers of a data.table (subject to condition)
DT[condition, which = TRUE]When which equals to TRUE returns the row numbers of x that i matches to. If NA, returns the row numbers of i that have no match in x.DT[, .I[condition]].I is a vector of row numbers of the (subsetted) table.
-
transpose a dataset, keeping column names as a distinct column in the result
y[rows, data.table(t(.SD), keep.rownames = TRUE), .SDcols = cols]
Reshaping a dataset
As a reminder, a dataset is said to be in a:
- long or tidy format, when there is one (or more) column(s) for all possible variable types, and one column only for all the values of those variables. The typical scenario is a grouped query or plot.
- wide format, when the dataset has a different column for each variable, reporting directly the associated value. Each row represents a case study, and each column an attribute/variable. This is the classical input for building a predictive model.
Reshaping a dataset means in general the process to move from the long format to the wide format, and vice versa, and it is based on the two following key functions:
-
meltthat takes a wide-format dataset and transforms it into a long-format. It is possible to specify the dataset only, and by default, the function assumes that all columns with numeric values are variables with values.melt(dataset, id.vars = c('var1', ...), variable.name = 'variable', value.name = 'value')where
id.varscontains the variable(s) that identify individual rows of datavariable.namevalue.name
-
dcastthat takes a long-format dataset and transforms it into the wide-format.dcast(dataset, ids ~ measures, value.var = 'value', fun.aggregate = length, na.rm = TRUE, fill = )where:
idsindicates the column(s) that we want to represent as individual rowsmeasuresindicates the column(s) that contain(s) the names that we want to swing into variables.value.varfun.aggregatethis is needed when not all columns are detailed, so R needs to know how to aggregate multiple rowsna.rmremove NA in the aggregation functionfill