I have very large tables (30 million rows) that I would like to load as dataframes in R. `read.table()` has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using `scan()` can be quite fast, e.g.:

```r
datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))
```
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

```r
df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))
```
Is there a better way of doing this? Or, quite possibly, a completely different approach to the problem?
Best answer
An update, several years later
This answer is old, and R has moved on. Tweaking `read.table` to run a bit faster has precious little benefit. Your options are:
- Using `vroom` from the tidyverse for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
- Using `fread` in `data.table` for importing data from csv/tab-delimited files directly into R. See mnel's answer.
- Using `read_table` in `readr` (on CRAN from April 2015). This works much like `fread` above. The readme in the link explains the difference between the two functions (`readr` currently claims to be "1.5-2x slower" than `data.table::fread`). A short usage sketch for these three readers follows this list.
- `read.csv.raw` from `iotools` provides a third option for quickly reading CSV files.
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) `read.csv.sql` in the `sqldf` package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the `RODBC` package, and the reverse-depends section of the `DBI` package page. `MonetDB.R` gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its `monetdb.read.csv` function. `dplyr` allows you to work directly with data stored in several types of database.
- Storing data in binary formats can also be useful for improving performance. Use `saveRDS`/`readRDS` (see below), the `h5` or `rhdf5` packages for HDF5 format, or `write_fst`/`read_fst` from the `fst` package.
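For concreteness, here is a minimal sketch of how the three flat-file readers above might be called on the file from the question. The file name, column names, and column types (url as character, the rest numeric) are assumptions carried over from the question, not defaults of these packages:

```r
library(data.table)
library(readr)
library(vroom)

cols <- c("url", "popularity", "mintime", "maxtime")

# data.table: fast C-level parser; col.names labels the headerless file
dt <- fread("myfile", sep = "\t", header = FALSE, col.names = cols)

# readr: read_tsv is the tab-delimited wrapper; "cddd" = character + 3 doubles
tb <- read_tsv("myfile", col_names = cols, col_types = "cddd")

# vroom: indexes the file quickly and materialises columns lazily on access
vt <- vroom("myfile", delim = "\t", col_names = cols, col_types = "cddd")
```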
The original answer
There are a couple of simple things to try, whether you use `read.table` or `scan` (a combined example follows the list).

- Set `nrows` = the number of records in your data (`nmax` in `scan`).
- Make sure that `comment.char=""` to turn off interpretation of comments.
- Explicitly define the classes of each column using `colClasses` in `read.table`.
- Setting `multi.line=FALSE` may also improve performance in `scan`.
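For concreteness, a tuned `read.table` call combining these settings might look like this; the column types and row count are the assumptions stated in the question:

```r
# Assumptions from the question: no header, tab-delimited, 30 million rows,
# one character column followed by three numeric columns.
df <- read.table("myfile",
                 header = FALSE,
                 sep = "\t",
                 colClasses = c("character", "numeric", "numeric", "numeric"),
                 nrows = 30000000,    # lets R allocate the result once
                 comment.char = "")   # turn off comment interpretation
```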
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of `read.table` based on the results.
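As a starting point, base R's built-in profiler is enough to see where the time goes; the output file name here is a placeholder:

```r
Rprof("read-profile.out")               # start collecting timing samples
df <- read.table("myfile", sep = "\t")  # the slow call under investigation
Rprof(NULL)                             # stop profiling
summaryRprof("read-profile.out")        # report which functions dominate
```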
The other option is filtering your data before you read it into R.
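One way to do that, assuming a Unix-like system, is to stream the file through a shell filter so R never parses the rows you don't need; the awk condition below is purely illustrative:

```r
# Keep only rows whose second field (popularity) exceeds 1000 -- an
# arbitrary example threshold -- before R parses them.
keep <- read.table(pipe("awk -F'\\t' '$2 > 1000' myfile"),
                   sep = "\t",
                   colClasses = c("character", "numeric", "numeric", "numeric"),
                   comment.char = "")
```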
Or, if the problem is that you have to read the data in regularly, then use these methods to read it in once, then save the data frame as a binary blob with `save` or `saveRDS`; the next time, you can retrieve it faster with `load` or `readRDS`.
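A minimal sketch of that read-once, cache-as-binary pattern (the .rds file name is arbitrary):

```r
# One-time cost: parse the flat file, then cache it in R's binary format.
df <- read.table("myfile", sep = "\t",
                 colClasses = c("character", "numeric", "numeric", "numeric"))
saveRDS(df, "myfile.rds")

# Later sessions: restore the cached data frame, skipping the parse entirely.
df <- readRDS("myfile.rds")
```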