https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d

I ran the same code, but with a larger CSV file. I generated one five times bigger than the previous one, with 5 000 000 rows and a size of around 487 MB (a sketch of how such a file can be generated follows the first set of results below). I got the following results:

  • csv.DictReader took 9.799003601074219e-05 seconds
  • pd.read_csv took 11.01493215560913 seconds
  • pd.read_csv with chunksize took 11.402302026748657 seconds
  • dask.dataframe took 0.21671509742736816 seconds
  • datatable took 0.7201321125030518 seconds
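For context, the test file was generated locally rather than taken from the article. The original column layout is not specified, so the sketch below is only an assumption of mine: the columns id, value_a, value_b, label and the path large_test.csv are hypothetical, and the resulting file size will not be exactly 487 MB.

    import csv
    import random

    N_ROWS = 5_000_000  # 10_000_000 for the second run

    # Write a synthetic CSV with a few numeric columns and a string label.
    with open("large_test.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value_a", "value_b", "label"])
        for i in range(N_ROWS):
            writer.writerow([i, random.random(), random.random(), f"row_{i}"])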

I re-ran the test with a CSV file of 10 000 000 rows and a size of around 990 MB. The results were as follows:

  • csv.DictReader took 0.00013709068298339844 seconds
  • pd.read_csv took 23.0141019821167 seconds
  • pd.read_csv with chunksize took 24.249807119369507 seconds
  • dask.dataframe took 0.49848103523254395 seconds
  • datatable took 1.45100998878479 seconds

Again ignoring csv.DictReader, dask is by far the fastest, although datatable also performs quite well.
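The exact benchmarking code is in the linked article; below is only a minimal sketch of the kind of harness that produces printouts like the ones above. The file path large_test.csv and the chunksize of 100 000 are assumptions of mine, and the chunked variant concatenates the pieces back into one DataFrame, which the article may or may not do.

    import csv
    import time

    import dask.dataframe as dd
    import datatable as dt
    import pandas as pd

    PATH = "large_test.csv"  # hypothetical path to the generated file

    def timed(label, fn):
        # Time a single call and print it in the same format as above.
        start = time.time()
        fn()
        print(f"{label} took {time.time() - start} seconds")

    # csv.DictReader only constructs a lazy reader here; nothing is parsed
    # until it is iterated, which is why its timing is negligible.
    timed("csv.DictReader", lambda: csv.DictReader(open(PATH)))
    timed("pd.read_csv", lambda: pd.read_csv(PATH))
    timed("pd.read_csv with chunksize",
          lambda: pd.concat(pd.read_csv(PATH, chunksize=100_000)))
    timed("dask.dataframe", lambda: dd.read_csv(PATH))
    timed("datatable", lambda: dt.fread(PATH))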

http://docs.dask.org/en/latest/dataframe.html