I tried to run the same code, but with a larger CSV file. I generated one five times bigger than the previous one, with 5 000 000 rows and a size of around 487 MB. I got the following results:
- csv.DictReader took 9.799003601074219e-05 seconds
- pd.read_csv took 11.01493215560913 seconds
- pd.read_csv with chunksize took 11.402302026748657 seconds
- dask.dataframe took 0.21671509742736816 seconds
- datatable took 0.7201321125030518 seconds
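The post doesn't show how the test files were produced, but a file of this scale can be generated with a short script along the lines of the sketch below. The file name and the column layout are assumptions, not the schema actually used in the benchmark.

```python
import csv

N_ROWS = 5_000_000  # bump to 10_000_000 for the second test file

# Hypothetical schema -- the actual columns of the benchmark file are unknown.
with open("large.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "value"])
    for i in range(N_ROWS):
        writer.writerow([i, f"row_{i}", i * 0.5])
```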
I re-ran the test with a CSV file of 10 000 000 rows and a size of around 990 MB. The results were the following:
- csv.DictReader took 0.00013709068298339844 seconds
- pd.read_csv took 23.0141019821167 seconds
- pd.read_csv with chunksize took 24.249807119369507 seconds
- dask.dataframe took 0.49848103523254395 seconds
- datatable took 1.45100998878479 seconds
Again ignoring csv.DictReader, dask is by far the fastest. However, datatable also performs pretty well.
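For reference, the numbers above presumably come from a timing harness along the lines of the sketch below. The file name, the chunk size, and the use of time.time() are my assumptions; the comment on csv.DictReader explains why its readings are near zero.

```python
import csv
import time

import pandas as pd
import dask.dataframe as dd
import datatable as dt

PATH = "large.csv"  # hypothetical name for the generated test file

# Constructing a DictReader does not parse the file (rows are only read
# lazily when you iterate), which is why its reported time is near zero.
start = time.time()
with open(PATH) as f:
    reader = csv.DictReader(f)
print(f"csv.DictReader took {time.time() - start} seconds")

# pandas, reading the whole file in one go.
start = time.time()
df = pd.read_csv(PATH)
print(f"pd.read_csv took {time.time() - start} seconds")

# pandas, reading in chunks and concatenating (chunk size is an assumption).
start = time.time()
df = pd.concat(pd.read_csv(PATH, chunksize=100_000))
print(f"pd.read_csv with chunksize took {time.time() - start} seconds")

# dask.dataframe
start = time.time()
ddf = dd.read_csv(PATH)
print(f"dask.dataframe took {time.time() - start} seconds")

# datatable
start = time.time()
frame = dt.fread(PATH)
print(f"datatable took {time.time() - start} seconds")
```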