Reference semantics

2024-10-09

This vignette discusses data.table‘s reference semantics which allows to add/update/delete columns of a data.table by reference, and also combine them with i and by. It is aimed at those who are already familiar with data.table syntax, its general form, how to subset rows in i, select and compute on columns, and perform aggregations by group. If you’re not familiar with these concepts, please read the “Introduction to data.table” vignette first.


Data

We will use the same flights data as in the “Introduction to data.table” vignette.

flights <- fread("flights14.csv")
flights
#          year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour
#         <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>
#      1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9
#      2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11
#      3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19
#      4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7
#      5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13
#     ---                                                                                    
# 253312:  2014    10    31         1       -30      UA    LGA    IAH      201     1416    14
# 253313:  2014    10    31        -5       -14      UA    EWR    IAH      189     1400     8
# 253314:  2014    10    31        -8        16      MQ    LGA    RDU       83      431    11
# 253315:  2014    10    31        -4        15      MQ    LGA    DTW       75      502    11
# 253316:  2014    10    31        -5         1      MQ    LGA    SDF      110      659     8
dim(flights)
# [1] 253316     11

Introduction

In this vignette, we will

  1. first discuss reference semantics briefly and look at the two different forms in which the := operator can be used

  2. then see how we can add/update/delete columns by reference in j using the := operator and how to combine with i and by.

  3. and finally we will look at using := for its side-effect and how we can avoid the side effects using copy().

1. Reference semantics

All the operations we have seen so far in the previous vignette resulted in a new data set. We will see how to add new column(s), update or delete existing column(s) on the original data.

a) Background

Before we look at reference semantics, consider the data.frame shown below:

DF = data.frame(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DF
#   ID a  b  c
# 1  b 1  7 13
# 2  b 2  8 14
# 3  b 3  9 15
# 4  a 4 10 16
# 5  a 5 11 17
# 6  c 6 12 18

When we did:

DF$c <- 18:13               # (1) -- replace entire column
# or
DF$c[DF$ID == "b"] <- 15:13 # (2) -- subassign in column 'c'

both (1) and (2) resulted in deep copy of the entire data.frame in versions of R versions < 3.1. It copied more than once. To improve performance by avoiding these redundant copies, data.table utilised the available but unused := operator in R.

Great performance improvements were made in R v3.1 as a result of which only a shallow copy is made for (1) and not deep copy. However, for (2) still, the entire column is deep copied even in R v3.1+. This means the more columns one subassigns to in the same query, the more deep copies R does.

shallow vs deep copy

A shallow copy is just a copy of the vector of column pointers (corresponding to the columns in a data.frame or data.table). The actual data is not physically copied in memory.

A deep copy on the other hand copies the entire data to another location in memory.

When subsetting a data.table using i (e.g., DT[1:10]), a deep copy is made. However, when i is not provided or equals TRUE, a shallow copy is made.

With data.table’s := operator, absolutely no copies are made in both (1) and (2), irrespective of R version you are using. This is because := operator updates data.table columns in-place (by reference).

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form

DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]

(b) The functional form

DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]

Note that the code above explains how := can be used. They are not working examples. We will start using them on flights data.table from the next section.

In the two forms of := shown above, note that we don’t assign the result back to a variable. Because we don’t need to. The input data.table is modified by reference. Let’s go through examples to understand what we mean by this.

For the rest of the vignette, we will work with flights data.table.

2. Add/update/delete columns by reference

a) Add columns by reference

– How can we add columns speed and total delay of each flight to flights data.table?

flights[, `:=`(speed = distance / (air_time/60), # speed in mph (mi/h)
               delay = arr_delay + dep_delay)]   # delay in minutes
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour    speed
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>    <num>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9 413.6490
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11 409.0909
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19 423.0769
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7 395.5414
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13 424.2857
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18 434.3363
#    delay
#    <int>
# 1:    27
# 2:    10
# 3:    11
# 4:   -34
# 5:     3
# 6:     4

## alternatively, using the 'LHS := RHS' form
# flights[, c("speed", "delay") := list(distance/(air_time/60), arr_delay + dep_delay)]

Note that

b) Update some rows of columns by reference - sub-assign by reference

Let’s take a look at all the hours available in the flights data.table:

# get all 'hours' in flights
flights[, sort(unique(hour))]
#  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

We see that there are totally 25 unique values in the data. Both 0 and 24 hours seem to be present. Let’s go ahead and replace 24 with 0.

– Replace those rows where hour == 24 with the value 0

# subassign by reference
flights[hour == 24L, hour := 0L]

Let’s look at all the hours to verify.

# check again for '24'
flights[, sort(unique(hour))]
#  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Exercise:

What is the difference between flights[hour == 24L, hour := 0L] and flights[hour == 24L][, hour := 0L]? Hint: The latter needs an assignment (<-) if you would want to use the result later.

If you can’t figure it out, have a look at the Note section of ?":=".

c) Delete column by reference

– Remove delay column

flights[, c("delay") := NULL]
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour    speed
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>    <num>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9 413.6490
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11 409.0909
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19 423.0769
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7 395.5414
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13 424.2857
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18 434.3363

## or using the functional form
# flights[, `:=`(delay = NULL)]

{#delete-convenience}

d) := along with grouping using by

We have already seen the use of i along with := in Section 2b. Let’s now see how we can use := along with by.

– How can we add a new column which contains for each orig,dest pair the maximum speed?

flights[, max_speed := max(speed), by = .(origin, dest)]
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour    speed
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>    <num>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9 413.6490
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11 409.0909
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19 423.0769
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7 395.5414
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13 424.2857
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18 434.3363
#    max_speed
#        <num>
# 1:  526.5957
# 2:  526.5957
# 3:  526.5957
# 4:  517.5000
# 5:  526.5957
# 6:  518.4507

e) Multiple columns and :=

– How can we add two more columns computing max() of dep_delay and arr_delay for each month, using .SD?

in_cols  = c("dep_delay", "arr_delay")
out_cols = c("max_dep_delay", "max_arr_delay")
flights[, c(out_cols) := lapply(.SD, max), by = month, .SDcols = in_cols]
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour    speed
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>    <num>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9 413.6490
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11 409.0909
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19 423.0769
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7 395.5414
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13 424.2857
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18 434.3363
#    max_speed max_dep_delay max_arr_delay
#        <num>         <int>         <int>
# 1:  526.5957           973           996
# 2:  526.5957           973           996
# 3:  526.5957           973           996
# 4:  517.5000           973           996
# 5:  526.5957           973           996
# 6:  518.4507           973           996

Before moving on to the next section, let’s clean up the newly created columns speed, max_speed, max_dep_delay and max_arr_delay.

# RHS gets automatically recycled to length of LHS
flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18

– How can we update multiple existing columns in place using .SD?

flights[, names(.SD) := lapply(.SD, as.factor), .SDcols = is.character]

Let’s clean up again and convert our newly-made factor columns back into character columns. This time we will make use of .SDcols accepting a function to decide which columns to include. In this case, is.factor() will return the columns which are factors. For more on the Subset of the Data, there is also an SD Usage vignette.

Sometimes, it is also nice to keep track of columns that we transform. That way, even after we convert our columns we would be able to call the specific columns we were updating.

factor_cols <- sapply(flights, is.factor)
flights[, names(.SD) := lapply(.SD, as.character), .SDcols = factor_cols]
str(flights[, ..factor_cols])
# Classes 'data.table' and 'data.frame':	253316 obs. of  3 variables:
#  $ carrier: chr  "AA" "AA" "AA" "AA" ...
#  $ origin : chr  "JFK" "JFK" "JFK" "LGA" ...
#  $ dest   : chr  "LAX" "LAX" "LAX" "PBI" ...
#  - attr(*, ".internal.selfref")=<externalptr>

{.bs-callout .bs-callout-info}

3. := and copy()

:= modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. And at other times it may not be desirable to modify the original object, in which case we can use copy() function, as we will see in a moment.

a) := for its side effect

Let’s say we would like to create a function that would return the maximum speed for each month. But at the same time, we would also like to add the column speed to flights. We could write a simple function as follows:

foo <- function(DT) {
  DT[, speed := distance / (air_time/60)]
  DT[, .(max_speed = max(speed)), by = month]
}
ans = foo(flights)
head(flights)
#     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour    speed
#    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>    <num>
# 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9 413.6490
# 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11 409.0909
# 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19 423.0769
# 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7 395.5414
# 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13 424.2857
# 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18 434.3363
head(ans)
#    month max_speed
#    <int>     <num>
# 1:     1  535.6425
# 2:     2  535.6425
# 3:     3  549.0756
# 4:     4  585.6000
# 5:     5  544.2857
# 6:     6  608.5714

b) The copy() function

In the previous section, we used := for its side effect. But of course this may not be always desirable. Sometimes, we would like to pass a data.table object to a function, and might want to use the := operator, but wouldn’t want to update the original object. We can accomplish this using the function copy().

The copy() function deep copies the input object and therefore any subsequent update by reference operations performed on the copied object will not affect the original object.

There are two particular places where copy() function is essential:

  1. Contrary to the situation we have seen in the previous point, we may not want the input data.table to a function to be modified by reference. As an example, let’s consider the task in the previous section, except we don’t want to modify flights by reference.

    Let’s first delete the speed column we generated in the previous section.

    flights[, speed := NULL]
    

    Now, we could accomplish the task as follows:

    foo <- function(DT) {
      DT <- copy(DT)                              ## deep copy
      DT[, speed := distance / (air_time/60)]     ## doesn't affect 'flights'
      DT[, .(max_speed = max(speed)), by = month]
    }
    ans <- foo(flights)
    head(flights)
    #     year month   day dep_delay arr_delay carrier origin   dest air_time distance  hour
    #    <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>    <int> <int>
    # 1:  2014     1     1        14        13      AA    JFK    LAX      359     2475     9
    # 2:  2014     1     1        -3        13      AA    JFK    LAX      363     2475    11
    # 3:  2014     1     1         2         9      AA    JFK    LAX      351     2475    19
    # 4:  2014     1     1        -8       -26      AA    LGA    PBI      157     1035     7
    # 5:  2014     1     1         2         1      AA    JFK    LAX      350     2475    13
    # 6:  2014     1     1         4         0      AA    EWR    LAX      339     2454    18
    head(ans)
    #    month max_speed
    #    <int>     <num>
    # 1:     1  535.6425
    # 2:     2  535.6425
    # 3:     3  549.0756
    # 4:     4  585.6000
    # 5:     5  544.2857
    # 6:     6  608.5714
    

However we could improve this functionality further by shallow copying instead of deep copying. In fact, we would very much like to provide this functionality for v1.9.8. We will touch up on this again in the data.table design vignette.

  1. When we store the column names on to a variable, e.g., DT_n = names(DT), and then add/update/delete column(s) by reference. It would also modify DT_n, unless we do copy(names(DT)).

    DT = data.table(x = 1L, y = 2L)
    DT_n = names(DT)
    DT_n
    # [1] "x" "y"
    
    ## add a new column by reference
    DT[, z := 3L]
    
    ## DT_n also gets updated
    DT_n
    # [1] "x" "y" "z"
    
    ## use `copy()`
    DT_n = copy(names(DT))
    DT[, w := 4L]
    
    ## DT_n doesn't get updated
    DT_n
    # [1] "x" "y" "z"
    

Summary

The := operator

So far we have seen a whole lot in j, and how to combine it with by and little of i. Let’s turn our attention back to i in the next vignette “Keys and fast binary search based subset” to perform blazing fast subsets by keying data.tables.