data.table vs dplyr: can one do something well the other can't or does poorly?

Question

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from the data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.

The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. Keeping related operations together makes it easy to optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining consistency in syntax.

1. Speed

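Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and Polars as well.

It would be good for benchmarks to cover these remaining aspects as well:

- Grouping operations that involve a subset of rows, i.e., DT[x > val, sum(y), by = z] type operations.
- Benchmark other operations such as update and joins.
- Also benchmark the memory footprint of each operation in addition to runtime.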
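
As a rough illustration of the first aspect, the sketch below times a grouped aggregation on a subset of rows. The data size, the column contents and the threshold val are arbitrary assumptions chosen for the example, not taken from any published benchmark, and a single system.time() call is only a coarse measurement.

```r
library(data.table)

set.seed(42)
n <- 1e5L                              # assumed size, small enough to run anywhere
DT <- data.table(
  x = runif(n),                        # column used for the row subset
  y = rnorm(n),                        # column to aggregate
  z = sample(100L, n, replace = TRUE)  # grouping column with at most 100 groups
)
val <- 0.5                             # hypothetical threshold

# Time a subset + grouped aggregation of the DT[x > val, sum(y), by = z] type.
timing <- system.time(
  res <- DT[x > val, .(total = sum(y)), by = z]
)

print(timing[["elapsed"]])
print(nrow(res))
```

A serious benchmark would repeat the measurement, vary the number of rows and groups, and record memory as well as elapsed time.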

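2. Memory usage

Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

Note that Hadley's comment talks about speed (dplyr is plenty fast for him), whereas the major concern here is memory.

At the moment, the data.table interface allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).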
```
 # sub-assign by reference, updates 'y' in-place
 DT[x >= 1L, y := NA]
```
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
```
 # copies the entire 'y' column
 ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
```
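
The difference can be observed directly. The following is a minimal sketch with made-up data: the names alias and old are introduced only for this illustration, and the data.frame half uses base R assignment, which has the same copy-on-modify semantics as far as the caller is concerned.

```r
library(data.table)

# data.table: update by reference
DT <- data.table(x = 0:3, y = 11:14)
alias <- DT                   # plain assignment does not copy a data.table
DT[x >= 1L, y := NA]          # sub-assign by reference, no re-assignment needed
print(alias$y)                # the alias sees the update as well

# data.frame: copy-on-modify
DF <- data.frame(x = 0:3, y = 11:14)
old <- DF
DF$y[DF$x >= 1L] <- NA        # the modified column is copied
print(old$y)                  # the earlier binding is unchanged
```

This is also why updating by reference raises the question of referential transparency: any other name bound to the same data.table observes the change.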
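A concern with this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:

    foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                                  ## also get modified.
    }

By not using shallow(), the old functionality is retained:

    bar <- function(DT) {
        DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
    }

By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also copying the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

But it will still lack many features that data.table provides, including (sub)-assignment by reference.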
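
Since shallow() is described above as not yet exported, the sketch below uses the exported copy() function instead. This is a substitution for illustration only: copy() makes a full (deep) copy, so it gives the same safety but none of the memory savings that shallow() is meant to provide. The function names foo_safe and bar_inplace are hypothetical.

```r
library(data.table)

# Hypothetical helper: works on a deep copy, so the caller's table is untouched.
foo_safe <- function(DT) {
  DT <- copy(DT)               # full copy; the planned shallow() would copy less
  DT[, newcol := 1L]
  DT[x > 2L, x := 3L]
  DT[]
}

# Same body without the copy: the caller's table is modified by reference.
bar_inplace <- function(DT) {
  DT[, newcol := 1L]
  DT[x > 2L, x := 3L]
  invisible(DT)
}

orig <- data.table(x = 1:5)

out <- foo_safe(orig)
ncol_after_foo <- ncol(orig)   # the original still has a single column
print(ncol_after_foo)

bar_inplace(orig)
ncol_after_bar <- ncol(orig)   # the original has gained newcol
print(ncol_after_bar)
```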

Aggregate while joining:

Suppose you have two data.tables as follows:

 DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
 #    x y z
 # 1: 1 a 1
 # 2: 1 a 2
 # 3: 1 b 3
 # 4: 1 b 4
 # 5: 2 a 5
 # 6: 2 a 6
 # 7: 2 b 7
 # 8: 2 b 8
 DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
 #    x y mul
 # 1: 1 a   4
 # 2: 2 b   3

And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

(1) aggregate DT1 to get sum(z), then perform a join, and then multiply, or

    # data.table way
    DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

    # dplyr equivalent
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
        right_join(DF2) %>% mutate(z = z * mul)

(2) do it all in one go (using the by = .EACHI feature):
```
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
```

What is the advantage?

- We don't have to allocate memory for the intermediate result.
- We don't have to group/hash twice (once for aggregation and once for joining).
- And more importantly, the operation we wanted to perform is clear by looking at j in (2).

Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

Have a look at this, this and this posts for real usage scenarios.

In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).
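
To check that the two data.table approaches agree, here is a small self-contained sketch using the same example values. It is an illustration under assumptions: it joins with on = instead of keys, spells out the recycled y column, and stores x as double in both tables; the names two_step and one_step are introduced only here.

```r
library(data.table)

DT1 <- data.table(x = c(1, 1, 1, 1, 2, 2, 2, 2),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DT2 <- data.table(x = c(1, 2), y = c("a", "b"), mul = 4:3)

# (1) aggregate first, then join, then multiply: materialises an intermediate table
two_step <- DT1[, .(z = sum(z)), by = .(x, y)][DT2, on = c("x", "y")][
  , .(x, y, z = z * mul)]

# (2) aggregate while joining: j is evaluated once per row of DT2
one_step <- DT1[DT2, .(z = sum(z) * mul), on = c("x", "y"), by = .EACHI]

print(two_step)
print(one_step)
```

Both results have one row per row of DT2, with z equal to 12 for (1, "a") and 45 for (2, "b").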

Update and joins:

Consider the data.table code shown below:

 DT1[DT2, col := i.mul]

This adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

Check this post for a real usage scenario.
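
A minimal runnable sketch of the update join, reusing the example values from above. As assumptions for the illustration, it joins with on = rather than keys and stores x as double in both tables.

```r
library(data.table)

DT1 <- data.table(x = c(1, 1, 1, 1, 2, 2, 2, 2),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8)
DT2 <- data.table(x = c(1, 2), y = c("a", "b"), mul = 4:3)

# Add column 'col' to DT1 by reference, filled from DT2's 'mul' on matching rows.
# Rows of DT1 without a match in DT2 get NA.
DT1[DT2, col := i.mul, on = c("x", "y")]

print(DT1)
```

Only the new column is allocated; the existing columns of DT1 are left where they are.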

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

Basic aggregation/update operations.
```
 # case (a)
 DT[, sum(y), by = z]                       ## data.table syntax
 DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
 DT[, y := cumsum(y), by = z]
 ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

 # case (b)
 DT[x > 2, sum(y), by = z]
 DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
 DT[x > 2, y := cumsum(y), by = z]
 ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))

 # case (c)
 DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
 DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
 DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
 DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
```
- data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
- In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.
  
  This is what we mean when we say the DT[i, j, by] form is consistent.
- Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip the others, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value.
  
  While it returns the same result, using filter() here makes the actual operation less obvious.
  
  It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.
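
To make case (b) concrete, the sketch below runs both data.table operations on the dummy data. The column name total and the variable agg are introduced only for this illustration.

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# Aggregate on the rows where x > 2, grouped by z.
agg <- DT[x > 2, .(total = sum(y)), by = z]
print(agg)

# Update the same rows by reference: same i and by, only j changes.
DT[x > 2, y := cumsum(y), by = z]
print(DT$y)
```

The subset condition and the grouping are written identically in both calls, which is the consistency the DT[i, j, by] form refers to.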

Aggregation / update on multiple columns

 # case (a)
 DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
 DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
 DT[, (cols) := lapply(.SD, sum), by = z]
 ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

 # case (b)
 DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
 DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

 # case (c)
 DT[, c(.N, lapply(.SD, sum)), by = z]     
 DF %>% group_by(z) %>% summarise_each(funs(n(), mean))

- In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions passed to funs().
- data.table's := requires column names to be provided, whereas dplyr generates them automatically.
- In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
- In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.

Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and . (the dot placeholder).
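
As an illustration of returning a list in j, the sketch below runs case (c) on the dummy data. Naming the count explicitly with list(N = .N) and restricting .SD with .SDcols are choices made for this example, not part of the original code.

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# j returns one list per group: the row count once, then one sum per column.
res <- DT[, c(list(N = .N), lapply(.SD, sum)),
          by = z, .SDcols = c("x", "y")]

print(res)
```

Each list element becomes one column, so the count appears once next to the per-column sums.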

Joins

dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

 setkey(DT1, x, y)

 # 1. normal join
 DT1[DT2]            ## data.table syntax
 left_join(DT2, DT1) ## dplyr syntax

 # 2. select columns while join    
 DT1[DT2, .(z, i.mul)]
 left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

 # 3. aggregate while join
 DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
 DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
     inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

 # 4. update while join
 DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
 ??

 # 5. rolling join
 DT1[DT2, roll = -Inf]
 ??

 # 6. other arguments to control output
 DT1[DT2, mult = "first"]
 ??

- Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.
- However dplyr joins do just that. Nothing more. Nothing less.
- data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
- data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
- data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
- data.table also has the mult = argument, which selects first, last or all matches (6).
- data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
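
The rolling join and mult = arguments can be tried on small tables. The sketch below uses made-up data; the table and column names (prices, queries, time, price) are assumptions for the illustration, and the joins use on = rather than keys.

```r
library(data.table)

prices  <- data.table(time = c(1, 5, 10), price = c(100, 105, 110))
queries <- data.table(time = c(3, 7, 12))

# Roll forward (LOCF): use the last observation at or before each query time.
locf <- prices[queries, on = "time", roll = TRUE]

# Roll backward (NOCB): use the next observation at or after each query time.
nocb <- prices[queries, on = "time", roll = -Inf]

print(locf)
print(nocb)

# mult = "first": keep only the first matching row of x for each row of i.
DT1 <- data.table(x = c(1, 1, 2, 2), z = 1:4)
DT2 <- data.table(x = c(1, 2))
first_match <- DT1[DT2, on = "x", mult = "first"]
print(first_match)
```

In the NOCB result the last query has no later observation to carry backward, so its price is NA.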

do()...

dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what each of your functions returns.

 DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
 DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
 DT[, list(x[1:2], y[1]), by = z]
 DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

 DT[, quantile(x, 0.25), by = z]
 DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
 DT[, quantile(x, c(0.25, 0.75)), by = z]
 DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

 DT[, as.list(summary(x)), by = z]
 DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))

- .SD's equivalent is . (the dot placeholder).
- In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.
- In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
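
A short sketch of the two shapes a multi-value function can take in j, using the dummy data. The names long and wide and the column name q are introduced only for this illustration.

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# A vector of length 2 per group gives two rows per group.
long <- DT[, .(q = quantile(x, c(0.25, 0.75))), by = z]

# Wrapping the same result in as.list() gives one row per group,
# with one column per quantile.
wide <- DT[, as.list(quantile(x, c(0.25, 0.75))), by = z]

print(long)
print(wide)
```

Whether the values end up as rows or as columns depends only on whether j returns a vector or a list.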

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

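- fread - fast file reader has been available for a long time now.
- fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
- Automatic indexing - another handy feature to optimise base R syntax as is, internally.
- Ad-hoc grouping: dplyr automatically sorts the results by the grouping variables during summarise(), which may not always be desirable.
- Numerous advantages in data.table joins (for speed, memory efficiency and syntax), as mentioned above.
- Non-equi joins: allow joins using the other operators <=, <, >, >= along with all other advantages of data.table joins.
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
- setorder() function in data.table that allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
- data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.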
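
A few of the listed features can be tried in a couple of lines each. The sketch below uses made-up tables; all table and column names (events, windows, a, b and so on) are assumptions for the illustration.

```r
library(data.table)

# Non-equi join: match each window to the events whose time falls inside it.
events  <- data.table(id = 1:4, t = c(2, 6, 11, 15))
windows <- data.table(start = c(0, 10), end = c(5, 12), label = c("A", "B"))
hits <- events[windows, on = .(t >= start, t <= end), .(label, id = x.id)]
print(hits)

# setorder(): reorder by reference, here by descending t.
setorder(events, -t)
print(events)

# Set operations on tables with identical columns.
a <- data.table(k = c(1L, 2L, 2L, 3L))
b <- data.table(k = c(2L, 3L, 4L))
only_a <- fsetdiff(a, b)     # distinct rows of a that are not in b
common <- fintersect(a, b)   # distinct rows present in both
either <- funion(a, b)       # distinct rows from a and b combined
print(only_a)
print(common)
print(either)
```

With the default all = FALSE the set operations work on distinct rows, as the corresponding SQL operators do.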

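Finally:

- On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.
- On parallelism - everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
  - Progress is currently being made (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.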
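
In current data.table releases the number of threads used by the parallelised parts can be inspected and changed with getDTthreads() and setDTthreads(). The sketch below is a minimal illustration: it pins data.table to one thread around an fwrite()/fread() round trip and then restores the previous setting. The file is a temporary one created only for the example.

```r
library(data.table)

n_threads <- getDTthreads()   # threads data.table is currently allowed to use
print(n_threads)

old <- setDTthreads(1L)       # returns the previous setting

tmp <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:3, b = c("x", "y", "z")), tmp)
back <- fread(tmp)
print(back)

setDTthreads(old)             # restore the previous setting
unlink(tmp)
```

Limiting threads like this is mainly useful on shared machines or when comparing single-threaded timings.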

