
R amazing data.table

Basics

A data.table is an enhanced version of the core data.frame: (nearly) everything that works for a data.frame also works for a data.table.

You create a data.table:

  • from scratch in the same way you'd build a data.frame: data.table(name = vector, ...)
  • by converting another structure: with a copy using as.data.table(df), or in place (by reference) using setDT(df)
  • reading in a text file in various formats: fread(fname, ...)
  • by means of the fst package: read_fst(fname, as.data.table = TRUE)
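
A minimal sketch of these creation methods (file and object names are placeholders):

    library(data.table)
    dt1 <- data.table(id = 1:3, grp = c('a', 'b', 'a'))        # from scratch
    df  <- data.frame(id = 1:3, grp = c('a', 'b', 'a'))
    dt2 <- as.data.table(df)                                   # copy: df stays a data.frame
    setDT(df)                                                  # in place: df itself becomes a data.table
    # dt3 <- fread('data.csv')                                 # from a text file
    # dt4 <- fst::read_fst('data.fst', as.data.table = TRUE)   # via the fst package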

The minimal configuration for dealing with a data.table is: DT[i, j, by] where:

  • i filters rows by row number, logical condition, regex, ...
  • j takes action on the table: select, update, ...
  • by groups rows to perform ad-hoc partitioned calculations
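
For example, with a hypothetical table DT having columns carrier, month and dep_delay (names assumed for illustration):

    # average departure delay per carrier, restricted to January
    DT[month == 1, .(avg_delay = mean(dep_delay, na.rm = TRUE)), by = carrier]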

Besides the above parameters, the data.table syntax allows the following additional parameters:

  • key and keyby
  • with
  • which
  • allow.cartesian
  • roll and rollends
  • .SD and .SDcols
  • on, mult, nomatch are used to indicate how to join two data.tables.

On top of parameters, data.table also provides specific functions and operators to facilitate data wrangling:

  • %like%
  • %between%
  • %chin% fast filtering for character columns
  • .N returns the number of rows in the current group (or in the whole table), and so can be used as the last row number
  • .I returns the row numbers, within the original table, of the rows in the current group
  • .BY a list holding the current value(s) of the by variables
  • .EACHI used as by = .EACHI to group by each row of i when joining
  • setnames(DT, oldnames, newnames) to change column names provided by oldnames into newnames, where oldnames and newnames have the same length and strings in oldnames are all valid column names for DT
  • setnames(DT, cnames) to substitute all column names, where the length of cnames equals the number of columns of DT
  • setcolorder(DT, cnames) to change the order of columns, putting cnames in front of the dataset.
  • setkey(DT, cname) to set cname as an index for DT
  • setkeyv(DT, cnames) to set cnames as indices for DT
  • setindex(DT, cnames)
  • setorder(DT, cnames) to reorder the rows of DT by the columns in cnames (by reference)
  • shift()
  • frank()
  • tables()
  • indices(DT)
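
A quick sketch of some of these helpers (table and column names are invented for illustration):

    library(data.table)
    dt <- data.table(a = 3:1, b = letters[1:3])
    setnames(dt, c('a', 'b'), c('id', 'code'))   # rename columns by reference
    setcolorder(dt, c('code', 'id'))             # put 'code' first, by reference
    setkey(dt, id)                               # sort by id and set it as the key
    setindex(dt, code)                           # add a secondary index on code
    indices(dt)                                  # lists the secondary indices ("code")
    tables()                                     # summary of all data.tables in memory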

Finally, there are some options that could be applied to change data.table behaviour:

  • datatable.auto.index
  • datatable.use.index
  • ``
  • ``
  • ``
  • ``

Basic understandings:

  • column names are viewed as variables when referred to inside the square brackets, so they do not need the container name
  • in selection (j) and grouping (by), column names must be enclosed in a list, do not need quotes, and can be renamed on the spot without enclosing the new names in quotes either: DT[, list(var1, newvar2 = var2), list(grp1, newgrp2 = grp2)]
  • the list() notation in j and by can be substituted with the shorter alias .()
  • to exclude objects, both in filtering and selection, use - or !
  • when filtering or grouping, data.table builds an index on the involved column(s), which is kept for future use. This means that subsequent queries on the same column(s) are much faster ==> run indices(DT) to see them
  • when selecting a single column wrapped in a list, the result is still a data.table. To get a vector back, do not wrap the column name in a list. More specifically:
    • df[ , colname] returns a vector
    • df[ , .(colname)] returns a data.table
  • calculations and transformations happen directly in j, after the filter in i (if any) and the grouping in by have been applied.
  • when grouping, it's possible to also group by a computed variable on the spot
  • successive operations should be chained one after the other whenever possible: DT[...][...]. Just be careful that each step returns what the next one expects.
  • to add/update/delete a column use the data.table own := operator, that manages all operations by reference instead of using a hidden deep copy:

    • For one column only: DT[, varname := value]
    • for multiple columns there are two ways:
      • the LHS := RHS form: DT[, c('varname1', 'varname2', ...) := .(value1, value2, ...)]
      • the functional form: DT[, `:=`(varname1 = value1, varname2 = value2, ...)]

    To delete varname use NULL as value in any of the above forms.

    It's also possible to operate an update on a subset by adding a filter condition in i, or to update separate blocks of rows while performing grouped aggregations by combining := with a convenient by condition. It's worth noting that as an analyst you often need to create new intermediate columns to get to the final result, and these should be deleted before presenting that result. This typical data-wrangling pattern can be performed with data.table by chaining:

    DT[, tmp := ...][...][, tmp := NULL]

    Similarly, imputing missing values for a numeric variable using the average of the group it belongs to:

    DT[, mtmp := mean(varname, na.rm = TRUE), .(grpvar = some_calculation)][is.na(varname), varname := mtmp][, mtmp := NULL]

    Notice that value can actually be an entire block of code, as long as it is enclosed in braces. In that case, for clarity, it is usually better to define a helper function beforehand and set value to a call to that function.
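
    A hedged sketch of the helper-function approach (the column varname and the helper cap99 are invented for illustration):

    # helper capping a numeric vector at its 99th percentile
    cap99 <- function(x) pmin(x, quantile(x, 0.99, na.rm = TRUE))
    DT[, varname := cap99(varname)]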

Special helpers for searching and filtering:

  • %in% same as core R: value matching that returns a logical vector indicating whether each element of its first argument has a match in its second
  • %chin% a much faster version of %in% only for character vectors
  • %like% allows you to search for a (regex) pattern in a character or a factor vector (usage: col %like% pattern). In particular, the metacharacters ^ and $ indicate that we're searching for strings that begin or end with the specified pattern.
  • %between% allows you to search for numeric values in the closed interval [val1, val2] (usage: numeric_col %between% c(val1, val2)). Much faster than using the classic structure: x >= val1 & x <= val2
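
A hedged sketch of the three helpers, assuming a table DT with a character column name and a numeric column price (names invented):

    DT[name %like% '^New']                # names starting with 'New'
    DT[name %chin% c('London', 'Rome')]   # fast %in% for character vectors
    DT[price %between% c(10, 20)]         # 10 <= price <= 20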

Special symbols and functions:

  • .N: contains the number of rows (nrow(DT)) in the table (for example, DT[.N] returns the last row).
    • To add a column id: DT[, id := 1:.N].
  • uniqueN(col1, ...) returns the number of unique values in the specified column or combination of columns
  • .SD stands for _S_ubset of _D_ata: it's a data.table holding, for each group, all the rows of that group (excluding the grouping columns)
  • .SDcols holds the columns that should be included in .SD. By default it contains all non-grouping columns.
  • .I[condition] used in j, .I holds the row numbers in the original table, so this returns the row numbers matching condition
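
For example (assuming columns grp, x1 and x2, names invented for illustration):

    DT[, .N, by = grp]                                                         # number of rows per group
    DT[, lapply(.SD, mean, na.rm = TRUE), by = grp, .SDcols = c('x1', 'x2')]   # per-group means
    DT[, .I[which.max(x1)], by = grp]                                          # row numbers of the per-group maxima of x1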

Special functions:

  • shift() to compute lagged or leading versions of a vector, as in the sketch below
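
A minimal sketch (columns x and grp are invented):

    DT[, x_lag  := shift(x, 1L, type = 'lag'),  by = grp]    # previous value within each group
    DT[, x_lead := shift(x, 1L, type = 'lead'), by = grp]    # next value within each group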

fread

A single command for reading in files from different sources and in different formats (local or remote file, csv/tsv/txt/fwf/..., ...).

Useful arguments:

  • header = TRUE|FALSE whether the first row contains the column names
  • nrows = n number of rows to read (beyond the header)
  • skip = n|char number of rows to skip at the top, or skip until the first line containing char is found
  • colClasses = char where char can be:
    • c('type1', 'type2', ..., 'typen'): a character vector listing the types for all the n columns in DT, in the same order they appear
    • list(type = intseq, ...): a named list for all or some of the columns in DT, where the names are one or more data types and the corresponding values are the positions for the columns with that type
    • c(colname = 'type', ...): a named character vector, where the names are column names and the values the corresponding data types
  • col.names = cnames to supply the column names
  • select = cols (not to be used with drop)
  • drop = cols (not to be used with select)
  • fill = FALSE
  • na.strings = char where char is a character vector of values to be interpreted as missing
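
A sketch of a typical call (the file name, column names and types are placeholders):

    dts <- fread('data.csv',
                 header     = TRUE,
                 select     = c('id', 'value'),
                 colClasses = c(id = 'integer', value = 'character'),
                 na.strings = c('', 'NA', '-'))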

Tips and tricks (version >= 1.9.8)

  • select column(s) using their numeric position

    # select single column 
    dts[, i]
    # select multiple columns
    dts[, n1:n2]
    dts[, c(n1, n2, n3:n4)]
  • select column(s) using their names

    # select single column
    dt[, 'X']
    # select multiple columns
    dt[, c('X', 'Y')]
    dt[, paste0('X', 1:10)]
  • negate selection of one or more column(s)
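
    A hedged sketch (column names are placeholders); both forms drop the listed columns:

    dt[, !c('X', 'Y')]
    dt[, -c('X', 'Y')]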

  • substitute any cells under condition with some value:

    dt[condition] <- value

    For example:

    • substitute all NA with 0: dt[is.na(dt)] <- 0
    • substitute all cells equal to - with NA: dt[dt == '-'] <- NA
  • substitute one or more characters in the whole dataset with some other value:

    dt <- dt[, lapply(.SD, function(x) gsub('old', 'new', x))]
  • calculate a function by group while subsetting by group as well

    dt[, .SD[condition, fun(x)] , by = grp]

    Notice that dt[condition, fun(x), by = grp] first subsets the entire dataset based on condition, then applies fun by group

  • select only columns of a certain type xxx

    dts[, sapply(dts, is.xxx), with = FALSE]
    dts[, .SD, .SDcols = sapply(dts, is.xxx)]
  • cast columns of a certain type xxx into type yyy

    cols <- colnames(dts)[which(as.vector(dts[, lapply(.SD, class)][1]) == 'xxx')]
    if(length(cols) > 0) dts[, (cols) := lapply(.SD, as.yyy), .SDcols = cols]
  • select (negate) rows by position(s)

    dts[-n]
    dts[-(n1:n2)]
    dts[-c(n1, n2:n3)] 
  • Extract most and least popular records for (column A) when grouped by any combination of (column B, C, ...)

    DT[order(columnA), .(columnA = columnA[c(1, .N)]), .(columnB, columnC, ...) ]
  • Compute most n popular records for (column A) when grouped by any combination of (column B, C, ...)

    DT[, .N, .(columnA, columnB, columnC, ...)][order(-N)][, .(columnA = head(columnA, n)), .(columnB, columnC, ...) ]

    For worst n records use tail instead of head

  • Find the record corresponding to the shortest trip per month

    cols <- c("start_station", "end_station", "start_date", "end_date", "duration")
    DT[, .SD[which.min(duration)], .(month(start_date)), .SDcols = cols]
  • Find the total number of unique start stations and zip codes per month

    DT[, lapply(.SD, uniqueN), .(month(start_date)), .SDcols = c("start_station", "zip_code")]
  • multiply a set of columns by a common multiplier:

    col_mult <- ...   # column to use as common multiplier
    cols <- c(...)    # list of columns affected
    dts[, (cols) := lapply(.SD, function(x) x * col_mult), .SDcols = cols ]
  • sum multiple non-contiguous columns based on a condition over their names:

    cols <- names(dts)[grepl('^what', tolower(names(dts)))]      # prefix
    cols <- names(dts)[grepl('what$', tolower(names(dts)))]      # suffix
    dts[, cname := rowSums(.SD), .SDcols = cols ]
  • join tables with key columns having common name(s):

    dts[lkp, on = 'y']
    dts[lkp, on = c('y1', 'y2')]
  • join tables with key columns having different names (yes, only one key has quotes):

    dts[lkp, on = c(kdts = 'klkp')]
    dts[lkp, on = c(k1dts = 'k1lkp', k2dts = 'k2lkp')]
  • update one column x for some rows only based on the values found in a lookup table lkp and related to another column y:

    dts[y %in% lkp[, y], x := lkp[.SD[['y']], .(x), on = 'y'] ]

    it looks like it doesn't work anymore. Try this:

    lkp[dts, on = 'y'][is.na(x), x := i.x][, i.x := NULL]
  • update multiple columns x1, x2 for some rows only based on the values found in a lookup table lkp and related to another column y:

    dts[y %in% lkp[, y], c('x1', 'x2') := lkp[.SD[['y']], .(x1, x2), on = 'y'] ]
  • select the first n rows by group (alternatives listed by efficiency):

    dts[dts[, .I[1:n], grp]$V1]
    dts[, .SD[1:n], grp]
    dts[, head(.SD, n), grp]
  • select the rows in positions n1 to n2 by group:

    dts[dts[, .I[n1:n2], grp]$V1]
    dts[, .SD[n1:n2], grp]
  • select all records with a minimum (maximum, or other condition; adapt as necessary) value by group:

    dts[dts[, .I[which.min(cname)], grp]$V1]
    dts[, .SD[which.min(cname)], grp]
  • split a column into multiple parts, based on some character(s), adding simultaneously to dataset:

    dts[, c('fld1', 'fld2') := tstrsplit(colname, 'sep', fixed = TRUE)]
  • when updating/adding a column with := in a derived table, the result is returned invisibly: it is not displayed and, unless stored, it is lost. To display it anyway, just add an empty [] at the end of the calculation, as below.
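
    For example (column names are placeholders):

    dts[, newcol := oldcol * 2][]    # the trailing [] prints the updated table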

  • calculate absolute frequency, relative frequency, and percentages distributions:

    y <- dts[, .(abs = sum(fldname)), grpname][, `:=`(rel = round(abs / sum(abs), 3), pct = round(100 * abs / sum(abs), 1))]
  • extract nrows random rows from filename (does not work on Windows)

    fread('shuf -n nrows filename')
  • download a zip file and open only a specific csv in it:

    download.file(fileurl, destfile = 'temp.zip')
    dts <- fread('unzip -pq temp.zip path/to/filename.csv')

    Many other variants are possible. The following opens the biggest file in the archive:

    lst <- unzip('temp.zip', list = TRUE)
    dts <- fread(paste('unzip -pq temp.zip', lst[order(lst$Length, decreasing = TRUE), 1][1] ) )
  • return the row numbers of a data.table (subject to condition)

    • DT[condition, which = TRUE] when which = TRUE, this returns the row numbers of the table that i matches; when which = NA, it returns the row numbers of i that have no match in x.
    • DT[, .I[condition]] .I is a vector of row numbers of the (subsetted) table.
  • transpose a dataset, keeping column names as a distinct column in the result

    y[rows, data.table(t(.SD), keep.rownames = TRUE), .SDcols = cols]

Reshaping a dataset

As a reminder, a dataset is said to be in a:

  • long or tidy format, when there are one (or more) column(s) identifying the variables and a single column holding all their values. The typical scenario is a grouped query or plot.
  • wide format, when the dataset has a different column for each variable, reporting directly the associated value. Each row represents a case study, and each column an attribute/variable. This is the classical input for building a predictive model.

Reshaping a dataset generally means moving from the long format to the wide format, or vice versa, and is based on the two following key functions (a small worked example follows this list):

  • melt that takes a wide-format dataset and transforms it into long format. It is possible to specify the dataset only: by default, the function treats all numeric columns as measured values and the remaining ones as identifiers.

    melt(dataset, id.vars = c('var1', ...), variable.name = 'variable', value.name = 'value')

    where

    • id.vars contains the variable(s) that identify individual rows of data
    • variable.name the name of the new column that will hold the former column names (default 'variable')
    • value.name the name of the new column that will hold the values (default 'value')
  • dcast that takes a long-format dataset and transforms it into the wide-format.

    dcast(dataset, ids ~ measures, value.var = 'value', fun.aggregate = length, na.rm = TRUE, fill = )

    where:

    • ids indicates the column(s) that we want to represent as individual rows
    • measures indicates the column(s) that contain(s) the names that we want to swing into variables.
    • value.var the name of the column holding the values to be spread
    • fun.aggregate this is needed when the combination of ids does not identify a single row, so R needs to know how to aggregate multiple values
    • na.rm passed to the aggregation function to remove NA values before aggregating
    • fill the value with which to fill missing combinations
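
A small worked example (table and column names invented for illustration):

    library(data.table)
    dtw <- data.table(id = 1:2, x1 = c(10, 20), x2 = c(30, 40))   # wide format

    # wide -> long
    dtl <- melt(dtw, id.vars = 'id', variable.name = 'measure', value.name = 'value')

    # long -> wide (back to the original shape)
    dcast(dtl, id ~ measure, value.var = 'value')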