data.table vs dplyr: can one do something well the other can't or does poorly? Ask Question

data.table vs dplyr: can one do something well the other can't or does poorly? Ask Question

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested about here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples were the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

  • dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)

Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

ベストアンサー1

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. To keep i, j and by together is by design. By keeping related operations together, it allows to easily optimise operations for speed and more importantly memory usage, and also provide some powerful features, all while maintaining the consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt1000万行から20億行(RAM 100GB)のグループ化、1億~1000万グループとさまざまなグループ化列の比較pandas更新されたベンチマーク、およびも含まれSparkますPolars

ベンチマークでは、以下の残りの側面もカバーすると良いでしょう。

  • 行のサブセットを含むグループ化操作、つまりDT[x > val, sum(y), by = z]型操作。

  • 更新結合などの他の操作をベンチマークします。

  • また、実行時間に加えて、各操作のメモリ フットプリントもベンチマークします。

2. メモリ使用量

  1. filter()dplyr 内のorを含む操作は、slice()メモリ効率が悪くなる可能性があります (data.frames と data.tables の両方で)。この投稿を見る

    ご了承くださいハドリーのコメント速度について語っていますが(dplyr は彼にとって十分に高速です)、ここでの主な懸念はメモリです。

  2. 現時点では、data.table インターフェースを使用すると、参照によって列を変更/更新できます(結果を変数に再割り当てする必要がないことに注意してください)。

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]
    

    しかし、dplyr は参照によって更新されることはありません。dplyr の同等のものは次のようになります (結果を再割り当てする必要があることに注意してください)。

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    これに対する懸念は参照透明性関数内で参照によるdata.tableオブジェクトの更新は必ずしも望ましいとは限りません。しかし、これは非常に便利な機能です。これそしてこれ興味深い事例の投稿。そしてそれを維持したいと考えています。

    shallow()そのため、私たちは、ユーザーに両方の可能性を提供する data.table の関数をエクスポートすることに取り組んでいます。たとえば、関数内で入力 data.table を変更したくない場合は、次のようにします。

     foo <- function(DT) {
         DT = shallow(DT)          ## shallow copy DT
         DT[, newcol := 1L]        ## does not affect the original DT 
         DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
         DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                   ## also get modified.
     }
    

    を使用しないことでshallow()、古い機能が保持されます。

     bar <- function(DT) {
         DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
         DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
     }
    

    を使用して浅いコピーを作成することでshallow()、元のオブジェクトを変更したくないということを理解しています。 を保証しながら、絶対に必要な場合にのみ変更する列をコピーするように、すべてを内部的に処理します。 これを実装すると、参照の透明性の問題が完全に解決され、ユーザーに両方の可能性が提供されます。

    また、一度shallow()エクスポートされると、dplyr の data.table インターフェースはほとんどすべてのコピーを回避するはずです。そのため、dplyr の構文を好む人は、それを data.tables で使用できます。

    しかし、参照による (サブ) 割り当てなど、data.table が提供する多くの機能がまだ欠けています。

  3. 参加しながら集計:

    次のような 2 つの data.tables があるとします。

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3
    

    そして、列で結合しながらsum(z) * mul各行を取得したいとします。次のいずれかの方法があります。DT2x,y

      1. aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

        data.table way

        DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
        

        dplyr equivalent

        DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
           right_join(DF2) %>% mutate(z = z * mul)
        
      1. do it all in one go (using by = .EACHI feature):

        DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
        

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation what we wanted to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this posts for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without avoiding a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                       ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
    
     # case (c)
     DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    
    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition satisfies and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]     
     DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    
    • In case (a), the codes are more or less equivalent. data.table uses familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates it automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as many columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list which returns a list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalent in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join where as data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)
    
     # 1. normal join
     DT1[DT2]            ## data.table syntax
     left_join(DT2, DT1) ## dplyr syntax
    
     # 2. select columns while join    
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
     # 3. aggregate while join
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
         inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
     # 4. update while join
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     ??
    
     # 5. rolling join
     DT1[DT2, roll = -Inf]
     ??
    
     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     ??
    
  • Some might find a separate function for each joins much nicer (left, right, inner, anti, semi etc), whereas as others might like data.table's DT[i, j, by], or merge() which is similar to base R.

  • However dplyr joins do just that. Nothing more. Nothing less.

  • data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before to join as shown above. Otherwise you would materialiase the join with unnecessary columns only to remove them later and that is inefficient.

  • data.tables can aggregate while joining (3) and also update while joining (4), using by = .EACHI feature. Why materialse the entire join result to add/update just a few columns?

  • data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

  • data.table also has mult = argument which selects first, last or all matches (6).

  • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.

  1. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand about all your functions return value.

     DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    
  • .SD's equivalent is .

  • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

  • In dplyr, cannot do that. Have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • アドホック グループ化:dplyr実行中に変数をグループ化して結果を自動的に並べ替えますがsummarise()、必ずしも望ましいとは限りません。

  • 上で述べたように、data.table 結合には多くの利点があります (速度、メモリ効率、構文)。

  • 非等価結合: <=, <, >, >=data.table 結合のその他のすべての利点に加えて、他の演算子を使用した結合を可能にします。

  • 重複範囲結合最近data.tableに実装されました。チェックこの郵便受けベンチマーク付きの概要については、こちらをご覧ください。

  • setorder()参照によって data.tables を非常に高速に並べ替えることができる data.table 内の関数。

  • dplyrはデータベースへのインターフェース同じ構文を使用しますが、現時点では data.table では使用されません。

  • data.table集合演算(Jan Gorecki 著)に相当する、、fsetdifffintersectの高速化を提供します(SQL の場合と同様)。funionfsetequalall

  • data.tableはマスキング警告なしできれいにロードされ、説明されているメカニズムを持っていますここ互換性のため、任意[.data.frameのRパッケージに渡されます。dplyrは基本関数を変更するためfilter、問題が発生する可能性があります。例:lag[ここそしてここ


ついに:

  • データベースについて - data.table が同様のインターフェースを提供できない理由はありませんが、これは現時点では優先事項ではありません。ユーザーがその機能を非常に望んでいる場合は、優先順位が上がる可能性がありますが、わかりません。

  • 並列処理について - 誰かが先に進んで実行するまで、すべては困難です。もちろん、努力が必要です (スレッドセーフであること)。

    • 現在 (v1.9.7 開発中)、 を使用した段階的なパフォーマンス向上のために、時間のかかる既知の部分を並列化する作業が進められていますOpenMP

おすすめ記事