
R amazing data.table

Basics

A data.table is an enhanced version of the core data.frame: (nearly) everything that works for a data.frame also works for a data.table.

You create a data.table:

  • from scratch in the same way you'd build a data.frame: data.table(name = vector, ...)
  • by converting another structure: with a copy using as.data.table(df), or in place (by reference) using setDT(df)
  • reading in a text file in various formats: fread(fname, ...)
  • by means of the fst package: read_fst(fname, as.data.table = TRUE)
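
A minimal sketch of these creation methods (file and object names are placeholders):

    library(data.table)
    dt1 <- data.table(id = 1:3, grp = c('a', 'b', 'a'))        # from scratch
    df  <- data.frame(id = 1:3, grp = c('a', 'b', 'a'))
    dt2 <- as.data.table(df)                                   # copy: df stays a data.frame
    setDT(df)                                                  # in place: df itself becomes a data.table
    # dt3 <- fread('data.csv')                                 # from a text file
    # dt4 <- fst::read_fst('data.fst', as.data.table = TRUE)   # via the fst package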

The minimal configuration for dealing with a data.table is: DT[i, j, by] where:

  • i filters rows by row number, logical condition, regex, ...
  • j takes action on the table: select, update, ...
  • by groups rows to perform ad-hoc partitioned calculations
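
For example, with a hypothetical table DT having columns carrier, month and dep_delay (names assumed for illustration):

    # average departure delay per carrier, restricted to January
    DT[month == 1, .(avg_delay = mean(dep_delay, na.rm = TRUE)), by = carrier]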

Besides the above parameters, the data.table syntax allows the following additional parameters:

  • key and keyby
  • with
  • which
  • allow.cartesian
  • roll and rollends
  • .SD and .SDcols
  • on, mult, nomatch are used to indicate how to join two data.tables.

On top of parameters, data.table also provides specific functions and operators to facilitate data wrangling:

  • %like%
  • %between%
  • %chin% fast filtering for character columns
  • .N returns the number of rows in the current group (or in the whole table), and so can be used as the last row number
  • .I returns the row numbers, within the original table, of the rows in the current group
  • .BY a list holding the current value(s) of the by variables
  • .EACHI used as by = .EACHI to group by each row of i when joining
  • setnames(DT, oldnames, newnames) to change column names provided by oldnames into newnames, where oldnames and newnames have the same length and strings in oldnames are all valid column names for DT
  • setnames(DT, cnames) to substitute all column names, where the length of cnames equals the number of columns of DT
  • setcolorder(DT, cnames) to change the order of columns, putting cnames in front of the dataset.
  • setkey(DT, cname) to set cname as an index for DT
  • setkeyv(DT, cnames) to set cnames as indices for DT
  • setindex(DT, cnames)
  • setorder(DT, cnames) to reorder the rows of DT by the columns in cnames (by reference)
  • shift()
  • frank()
  • tables()
  • indices(DT)
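
A quick sketch of some of these helpers (table and column names are invented for illustration):

    library(data.table)
    dt <- data.table(a = 3:1, b = letters[1:3])
    setnames(dt, c('a', 'b'), c('id', 'code'))   # rename columns by reference
    setcolorder(dt, c('code', 'id'))             # put 'code' first, by reference
    setkey(dt, id)                               # sort by id and set it as the key
    setindex(dt, code)                           # add a secondary index on code
    indices(dt)                                  # lists the secondary indices ("code")
    tables()                                     # summary of all data.tables in memory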

Finally, there are some options that could be applied to change data.table behaviour:

  • datatable.auto.index
  • datatable.use.index
  • ``
  • ``
  • ``
  • ``

Basic understandings:

  • column names are viewed as variables when referred to inside the square brackets, so they do not need the container name
  • in selection (j) and grouping (by), column names must be enclosed in a list, do not need quotes, and can be renamed on the spot without enclosing the new names in quotes either: DT[, list(var1, newvar2 = var2), list(grp1, newgrp2 = grp2)]
  • the list() notation in j and by can be substituted with the shorter alias .()
  • to exclude objects, both in filtering and selection, use - or !
  • when filtering or grouping, data.table builds an index on the involved column(s), which is kept for future use. This means that subsequent queries on the same column(s) are much faster ==> run indices(DT) to see them
  • when selecting a single column wrapped in a list, the result is still a data.table. To get a vector back, do not wrap the column name in a list. More specifically:
    • df[ , colname] returns a vector
    • df[ , .(colname)] returns a data.table
  • calculations and transformations happen directly in j, after the filter in i (if any) and the grouping in by have been applied.
  • when grouping, it's possible to also group by a computed variable on the spot
  • successive operations should be chained one after the other whenever possible: DT[...][...]. Just be careful that each step returns what the next one expects.
  • to add/update/delete a column use the data.table own := operator, that manages all operations by reference instead of using a hidden deep copy:

    • For one column only: DT[, varname := value]
    • for multiple columns there are two ways:
      • the LHS := RHS form: DT[, c('varname1', 'varname2', ...) := .(value1, value2, ...)]
      • the functional form: DT[, `:=`(varname1 = value1, varname2 = value2, ...)]

    To delete varname use NULL as value in any of the above forms.

    It's also possible to operate an update on a subset by adding a filter condition in i, or to update separate blocks of rows while performing grouped aggregations by combining := with a convenient by condition. It's worth noting that as an analyst you often need to create new intermediate columns to get to the final result, and these should be deleted before presenting that result. This typical data-wrangling pattern can be performed with data.table by chaining:

    DT[, tmp := ...][...][, tmp := NULL]

    Similarly, imputing missing values for a numeric variable using the average of the group it belongs to:

    DT[, mtmp := mean(varname, na.rm = TRUE), .(grpvar = some_calculation)][is.na(varname), varname := mtmp][, mtmp := NULL]

    Notice that value can actually be an entire block of code, as long as it is enclosed in braces. In that case, for clarity, it is usually better to define a helper function beforehand and set value to a call to that function.
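
    A hedged sketch of the helper-function approach (the column varname and the helper cap99 are invented for illustration):

    # helper capping a numeric vector at its 99th percentile
    cap99 <- function(x) pmin(x, quantile(x, 0.99, na.rm = TRUE))
    DT[, varname := cap99(varname)]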

Special helpers for searching and filtering:

  • %in% same as core R: value matching that returns a logical vector indicating whether each element of its first argument has a match in its second
  • %chin% a much faster version of %in% only for character vectors
  • %like% allows you to search for a (regex) pattern in a character or a factor vector (usage: col %like% pattern). In particular, the metacharacters ^ and $ indicate that we're searching for strings that begin or end with the specified pattern.
  • %between% allows you to search for numeric values in the closed interval [val1, val2] (usage: numeric_col %between% c(val1, val2)). Much faster than using the classic structure: x >= val1 & x <= val2
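
A hedged sketch of the three helpers, assuming a table DT with a character column name and a numeric column price (names invented):

    DT[name %like% '^New']                # names starting with 'New'
    DT[name %chin% c('London', 'Rome')]   # fast %in% for character vectors
    DT[price %between% c(10, 20)]         # 10 <= price <= 20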

Special symbols and functions:

  • .N: contains the number of rows (nrow(DT)) in the table (for example, DT[.N] returns the last row).
    • To add a column id: DT[, id := 1:.N].
  • uniqueN(col1, ...) returns the number of unique values in the specified column or combination of columns
  • .SD stands for _S_ubset of _D_ata: it's a data.table holding, for each group, all the rows of that group (excluding the grouping columns)
  • .SDcols holds the columns that should be included in .SD. By default it contains all non-grouping columns.
  • .I[condition] used in j, .I holds the row numbers in the original table, so this returns the row numbers matching condition
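
For example (assuming columns grp, x1 and x2, names invented for illustration):

    DT[, .N, by = grp]                                                         # number of rows per group
    DT[, lapply(.SD, mean, na.rm = TRUE), by = grp, .SDcols = c('x1', 'x2')]   # per-group means
    DT[, .I[which.max(x1)], by = grp]                                          # row numbers of the per-group maxima of x1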

Special functions:

  • shift() to compute lagged or leading versions of a vector, as in the sketch below
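
A minimal sketch (columns x and grp are invented):

    DT[, x_lag  := shift(x, 1L, type = 'lag'),  by = grp]    # previous value within each group
    DT[, x_lead := shift(x, 1L, type = 'lead'), by = grp]    # next value within each group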

fread

A single command for reading in files from different sources and in different formats (local or remote file, csv/tsv/txt/fwf/..., ...).

Useful arguments:

  • header = TRUE|FALSE whether the first row contains the column names
  • nrows = n number of rows to read (beyond the header)
  • skip = n|char number of rows to skip at the top, or skip until the first line containing char is found
  • colClasses = char where char can be:
    • c('type1', 'type2', ..., 'typen'): a character vector listing the types for all the n columns in DT, in the same order they appear
    • list(type = intseq, ...): a named list for all or some of the columns in DT, where the names are one or more data types and the corresponding values are the positions for the columns with that type
    • c(colname = 'type', ...): a named character vector, where the names are column names and the values the corresponding data types
  • col.names = cnames to supply the column names
  • select = cols (not to be used with drop)
  • drop = cols (not to be used with select)
  • fill = FALSE
  • na.strings = char where char is a character vector of values to be interpreted as missing
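
A sketch of a typical call (the file name, column names and types are placeholders):

    dts <- fread('data.csv',
                 header     = TRUE,
                 select     = c('id', 'value'),
                 colClasses = c(id = 'integer', value = 'character'),
                 na.strings = c('', 'NA', '-'))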

Tips and tricks (version >= 1.9.8)

  • select column(s) using their numeric position

    # select single column 
    dts[, i]
    # select multiple columns
    dts[, n1:n2]
    dts[, c(n1, n2, n3:n4)]
  • select column(s) using their names

    # select single column
    dt[, 'X']
    # select multiple columns
    dt[, c('X', 'Y')]
    dt[, paste0('X', 1:10)]
  • negate selection of one or more column(s)
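
    A hedged sketch (column names are placeholders); both forms drop the listed columns:

    dt[, !c('X', 'Y')]
    dt[, -c('X', 'Y')]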

  • substitute any cells under condition with some value:

    dt[condition] <- value

    For example:

    • substitute all NA with 0: dt[is.na(dt)] <- 0
    • substitute all cells equal to - with NA: dt[dt == '-'] <- NA
  • substitute one or more characters in the whole dataset with some other value:

    dt <- dt[, lapply(.SD, function(x) gsub('old', 'new', x))]
  • calculate a function by group while subsetting by group as well

    dt[, .SD[condition, fun(x)] , by = grp]

    Notice that dt[condition, fun(x), by = grp] first subsets the entire dataset based on condition, then applies fun by group

  • select only columns of a certain type xxx

    dts[, sapply(dts, is.xxx), with = FALSE]
    dts[, .SD, .SDcols = sapply(dts, is.xxx)]
  • cast columns of a certain type xxx into type yyy

    cols <- colnames(dts)[which(as.vector(dts[, lapply(.SD, class)][1]) == 'xxx')]
    if(length(cols) > 0) dts[, (cols) := lapply(.SD, as.yyy), .SDcols = cols]
  • select (negate) rows by position(s)

    dts[-n]
    dts[-(n1:n2)]
    dts[-c(n1, n2:n3)] 
  • Extract most and least popular records for (column A) when grouped by any combination of (column B, C, ...)

    DT[order(columnA), .(columnA = columnA[c(1, .N)]), .(columnB, columnC, ...) ]
  • Compute most n popular records for (column A) when grouped by any combination of (column B, C, ...)

    DT[, .N, .(columnA, columnB, columnC, ...)][order(-N)][, .(columnA = head(columnA, n)), .(columnB, columnC, ...) ]

    For worst n records use tail instead of head

  • Find the record corresponding to the shortest trip per month

    cols <- c("start_station", "end_station", "start_date", "end_date", "duration")
    DT[, .SD[which.min(duration)], .(month(start_date)), .SDcols = cols]
  • Find the total number of unique start stations and zip codes per month

    DT[, lapply(.SD, uniqueN), .(month(start_date)), .SDcols = c("start_station", "zip_code")]
  • multiply a set of columns by a common multiplier:

    col_mult <- ...   # column to use as common multiplier
    cols <- c(...)    # list of columns affected
    dts[, (cols) := lapply(.SD, function(x) x * col_mult), .SDcols = cols ]
  • sum multiple non-contiguous columns based on a condition over their names:

    cols <- names(dts)[grepl('^what', tolower(names(dts)))]      # prefix
    cols <- names(dts)[grepl('what$', tolower(names(dts)))]      # suffix
    dts[, cname := rowSums(.SD), .SDcols = cols ]
  • join tables with key columns having common name(s):

    dts[lkp, on = 'y']
    dts[lkp, on = c('y1', 'y2')]
  • join tables with key columns having different names (yes, only one key has quotes):

    dts[lkp, on = c(kdts = 'klkp')]
    dts[lkp, on = c(k1dts = 'k1lkp', k2dts = 'k2lkp')]
  • update one column x for some rows only based on the values found in a lookup table lkp and related to another column y:

    dts[y %in% lkp[, y], x := lkp[.SD[['y']], .(x), on = 'y'] ]

    it looks like it doesn't work anymore. Try this:

    lkp[dts, on = 'y'][is.na(x), x := i.x][, i.x := NULL]
  • update multiple columns x1, x2 for some rows only based on the values found in a lookup table lkp and related to another column y:

    dts[y %in% lkp[, y], c('x1', 'x2') := lkp[.SD[['y']], .(x1, x2), on = 'y'] ]
  • select the first n rows by group (alternatives listed by efficiency):

    dts[dts[, .I[1:n], grp]$V1]
    dts[, .SD[1:n], grp]
    dts[, head(.SD, n), grp]
  • select the rows in positions n1 to n2 by group:

    dts[dts[, .I[n1:n2], grp]$V1]
    dts[, .SD[n1:n2], grp]
  • select all records with a minimum (maximum, or other condition; adapt as necessary) value by group:

    dts[dts[, .I[which.min(cname)], grp]$V1]
    dts[, .SD[which.min(cname)], grp]
  • split a column into multiple parts, based on some character(s), adding simultaneously to dataset:

    dts[, c('fld1', 'fld2') := tstrsplit(colname, 'sep', fixed = TRUE)]
  • when updating/adding a column with := in a derived table, the result is returned invisibly: it is not displayed and, unless stored, it is lost. To display it anyway, just add an empty [] at the end of the calculation, as below.
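
    For example (column names are placeholders):

    dts[, newcol := oldcol * 2][]    # the trailing [] prints the updated table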

  • calculate absolute frequency, relative frequency, and percentages distributions:

    y <- dts[, .(abs = sum(fldname)), grpname][, `:=`(rel = round(abs / sum(abs), 3), pct = round(100 * abs / sum(abs), 1))]
  • extract nrows random rows from filename (does not work on Windows)

    fread('shuf -n nrows filename')
  • download a zip file and open only a specific csv in it:

    download.file(fileurl, destfile = 'temp.zip')
    dts <- fread('unzip -pq temp.zip path/to/filename.csv')

    Many other variants are possible. The following opens the biggest file in the archive:

    lst <- unzip('temp.zip', list = TRUE)
    dts <- fread(paste('unzip -pq temp.zip', lst[order(lst$Length, decreasing = TRUE), 1][1] ) )
  • return the row numbers of a data.table (subject to condition)

    • DT[condition, which = TRUE] when which = TRUE, this returns the row numbers of the table that i matches; when which = NA, it returns the row numbers of i that have no match in x.
    • DT[, .I[condition]] .I is a vector of row numbers of the (subsetted) table.
  • transpose a dataset, keeping column names as a distinct column in the result

    y[rows, data.table(t(.SD), keep.rownames = TRUE), .SDcols = cols]

Reshaping a dataset

As a reminder, a dataset is said to be in a:

  • long or tidy format, when there are one (or more) column(s) identifying the variables and a single column holding all their values. The typical scenario is a grouped query or plot.
  • wide format, when the dataset has a different column for each variable, reporting directly the associated value. Each row represents a case study, and each column an attribute/variable. This is the classical input for building a predictive model.

Reshaping a dataset generally means moving from the long format to the wide format, or vice versa, and is based on the two following key functions (a small worked example follows this list):

  • melt that takes a wide-format dataset and transforms it into long format. It is possible to specify the dataset only: by default, the function treats all numeric columns as measured values and the remaining ones as identifiers.

    melt(dataset, id.vars = c('var1', ...), variable.name = 'variable', value.name = 'value')

    where

    • id.vars contains the variable(s) that identify individual rows of data
    • variable.name the name of the new column that will hold the former column names (default 'variable')
    • value.name the name of the new column that will hold the values (default 'value')
  • dcast that takes a long-format dataset and transforms it into the wide-format.

    dcast(dataset, ids ~ measures, value.var = 'value', fun.aggregate = length, na.rm = TRUE, fill = )

    where:

    • ids indicates the column(s) that we want to represent as individual rows
    • measures indicates the column(s) that contain(s) the names that we want to swing into variables.
    • value.var the name of the column holding the values to be spread
    • fun.aggregate this is needed when the combination of ids does not identify a single row, so R needs to know how to aggregate multiple values
    • na.rm passed to the aggregation function to remove NA values before aggregating
    • fill the value with which to fill missing combinations
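
A small worked example (table and column names invented for illustration):

    library(data.table)
    dtw <- data.table(id = 1:2, x1 = c(10, 20), x2 = c(30, 40))   # wide format

    # wide -> long
    dtl <- melt(dtw, id.vars = 'id', variable.name = 'measure', value.name = 'value')

    # long -> wide (back to the original shape)
    dcast(dtl, id ~ measure, value.var = 'value')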