Overview
I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
Question
What I want to know is:
- Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):
library(dplyr)
# originally written with %.%, which later dplyr releases replaced with %>%
dat %>%
group_by(name, job) %>%
filter(job != "Boss" | year == min(year)) %>%
mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):
setDT(dat)[,
.SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by=list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).
Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.
Examples
Usage
- dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
Benchmarks
- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point data.table becomes substantially faster.
- @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
- (Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has data.table ~8x faster
Data
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))
Best Answer 1
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intent is to cover each one of these as clearly as possible from data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.
The data.table syntax is consistent in its form - DT[i, j, by]. To keep i, j and by together is by design. Keeping related operations together makes it easy to optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining the consistency in syntax.
1. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compares pandas. See also the updated benchmarks, which include Spark and Polars as well.
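On benchmarks, it would be great to cover these remaining aspects as well:
- Grouping operations involving a subset of rows, i.e., DT[x > val, sum(y), by = z] type operations.
- Benchmark other operations such as update and joins.
- Also benchmark memory footprint for each operation in addition to runtime.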
2. Memory usage
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post. Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
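One way to observe the by-reference behaviour described above is to compare the table's memory address before and after the update. This is only a minimal sketch: the toy table is an assumption made for illustration, and address() is the helper data.table provides for reporting an object's address.

```r
library(data.table)

# Toy data, assumed for illustration only.
DT <- data.table(x = 0:4, y = 11:15)

addr_before <- address(DT)
DT[x >= 1L, y := NA_integer_]   # sub-assign by reference, no re-assignment
addr_after <- address(DT)

# The same object was modified in place rather than replaced by a copy.
same_object <- identical(addr_before, addr_after)
# Rows with x >= 1 now hold NA in y; the remaining row is untouched.
na_count <- sum(is.na(DT$y))
```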
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:
foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will
                              ## also get modified.
}
By not using shallow(), the old functionality is retained:
bar <- function(DT) {
    DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
}
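By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to guarantee that, while copying the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.
Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub)-assignment by reference.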
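Until something like shallow() is exported, the existing copy() function gives the same protection, at the cost of duplicating every column up front. The following is a minimal sketch of that alternative; the function name baz and the toy table are assumptions made for illustration.

```r
library(data.table)

# Hypothetical helper: works on a deep copy, so the caller's table is untouched.
baz <- function(DT) {
  DT <- copy(DT)        # deep copy: all columns are duplicated
  DT[, newcol := 1L]    # only affects the copy
  DT[x > 2L, x := 3L]   # only affects the copy
  DT[]
}

orig <- data.table(x = 1:4)
out  <- baz(orig)
```

Here orig keeps its single column x with the original values, while out carries the new column and the updated x.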
Aggregate while joining:
Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:
1. aggregate DT1 to get sum(z), then perform a join, then multiply (or)
data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% right_join(DF2) %>% mutate(z = z * mul)
2. do it all in one go (using by = .EACHI feature):
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
What is the advantage?
- We don't have to allocate memory for the intermediate result.
- We don't have to group/hash twice (once for aggregation and once for joining).
- And more importantly, the operation we wanted to perform is clear from looking at j in (2).
Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.
Have a look at this, this and this posts for real usage scenarios.
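To make the comparison concrete, here is a minimal sketch that runs both data.table approaches on the two tables from above and checks that they agree. The only liberties taken are that the recycling of y is spelled out with rep() and x is stored as integer; the variable names two_step and one_go are assumptions made for illustration.

```r
library(data.table)

DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8, key = c("x", "y"))
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))

# Approach 1: aggregate, then join, then multiply (materialises an intermediate table).
two_step <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]

# Approach 2: aggregate while joining; j is evaluated once per row of DT2.
one_go <- DT1[DT2, .(z = sum(z) * mul), by = .EACHI]
```

Both results contain one row per row of DT2, with z equal to 12 for (1, a) and 45 for (2, b).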
In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).
Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
Check this post for a real usage scenario.
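As a small illustration of the update-while-joining idiom, the sketch below applies it to tables shaped like DT1 and DT2 from the aggregation example (rebuilt here so the block is self-contained, with the recycling of y written out explicitly). Rows of DT1 without a match in DT2 are left as NA in the new column.

```r
library(data.table)

DT1 <- data.table(x = rep(1:2, each = 4),
                  y = rep(c("a", "a", "b", "b"), 2),
                  z = 1:8, key = c("x", "y"))
DT2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y"))

# Add column 'col' to DT1 by reference, filled from DT2's 'mul' on matching keys.
# The i. prefix refers to a column of the table in i (here DT2).
DT1[DT2, col := i.mul]

matched <- DT1[!is.na(col), .(x, y, z, col)]
```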
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
3. Syntax
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z]                       ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))   ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
## cumsum() must run over the filtered rows only to match the data.table result
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y[x > 2])))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
- data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
- In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum. This is what we mean when we say the DT[i, j, by] form is consistent.
- Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value. While it returns the same result, using filter() here makes the actual operation less obvious. It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.
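The consistency point in case (b) can be seen by running both data.table forms on the dummy data: the same i and by are used, and only j changes between aggregating and updating. A minimal sketch (the table is rebuilt here so the block runs on its own; the name agg is an assumption):

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# Aggregate: restrict to rows where x > 2, then sum y within each z.
agg <- DT[x > 2, sum(y), by = z]

# Update: same i and by, but j now assigns by reference.
# Rows with x <= 2 keep their original y.
DT[x > 2, y := cumsum(y), by = z]
```

After the update, the first two rows of y are unchanged and the remaining rows hold the per-group running totals of the filtered values.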
Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))   ## dplyr syntax
## cols is assumed to hold the names of the columns to update, e.g. c("x", "y")
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
- In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().
- data.table's := requires column names to be provided, whereas dplyr generates them automatically.
- In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
- In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.
Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.
You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and .
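As a concrete instance of "return a list in j", the sketch below combines the group size with per-column sums on the dummy data. Naming the count column N is a choice made here for readability; the table is rebuilt so the block runs on its own.

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# c() concatenates a one-element list holding .N with the list returned by lapply();
# each element of the combined list becomes a column in the result.
res <- DT[, c(list(N = .N), lapply(.SD, sum)), by = z]
```

The count appears once per group, next to the summed x and y columns.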
Joins
dplyr provides separate functions for each type of join whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.
setkey(DT1, x, y)
# 1. normal join
DT1[DT2]            ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
# ??
# 5. rolling join
DT1[DT2, roll = -Inf]
# ??
# 6. other arguments to control output
DT1[DT2, mult = "first"]
# ??
- Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.
- However dplyr joins do just that. Nothing more. Nothing less.
- data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later and that is inefficient.
- data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
- data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
- data.table also has the mult = argument which selects first, last or all matches (6).
- data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
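Rolling joins are easier to picture with a small time-indexed example. The sketch below is an illustration only: the tables prices and trades and their columns are hypothetical, not taken from the answer. It shows roll = TRUE (last observation carried forward) next to roll = -Inf (next observation carried backward).

```r
library(data.table)

# Hypothetical data: prices observed at times 1, 5 and 10; trades at times 3, 7 and 12.
prices <- data.table(t = c(1L, 5L, 10L), price = c(100, 105, 110), key = "t")
trades <- data.table(t = c(3L, 7L, 12L))

# LOCF: each trade picks up the most recent price at or before its time.
locf <- prices[trades, on = "t", roll = TRUE]

# NOCB: each trade picks up the next price at or after its time;
# the trade at t = 12 has no later price, so it gets NA.
nocb <- prices[trades, on = "t", roll = -Inf]
```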
do()...
dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what each of your functions returns.
DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])   ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
- .SD's equivalent is .
- In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.
- In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
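The following sketch runs two of the data.table expressions from the block above on the dummy data to show how j copes with results of different shapes: a length-two vector becomes two rows per group, while a list becomes one row with several columns. The result names q and s are assumptions made for illustration.

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# A vector of length 2 per group: the result has two rows for each value of z.
q <- DT[, quantile(x, c(0.25, 0.75)), by = z]

# A list per group: each element of summary(x) becomes its own column.
s <- DT[, as.list(summary(x)), by = z]
```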
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
4. Features
I have pointed out most of the features here and also in this post. In addition:
- fread - fast file reader has been available for a long time now.
- fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
- Automatic indexing - another handy feature to optimise base R syntax as is, internally.
- Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.
- Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.
- Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
- setorder() function in data.table that allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
- data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
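A brief sketch of two items from this list, the set operations and setorder(), on toy tables. The tables A and B are assumptions made for illustration; by default (all = FALSE) the set operations return distinct rows.

```r
library(data.table)

# Toy tables, assumed for illustration only.
A <- data.table(id = c(1L, 2L, 2L, 3L))
B <- data.table(id = c(2L, 3L, 4L))

common  <- fintersect(A, B)   # rows present in both tables
only_a  <- fsetdiff(A, B)     # rows of A that are absent from B
either  <- funion(A, B)       # distinct rows from A and B combined
same    <- fsetequal(A, B)    # do A and B hold the same set of rows?

# setorder() reorders the rows of B by reference, here in descending id.
setorder(B, -id)
```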
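Finally:
On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.
On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
- Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.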