How I learned to stop using CSV and love HDF - 2018-11-06

Comparing file formats for data science work

Waiting for a CSV file to load before any data science can begin is a great excuse for yet another coffee. While CSVs are human-readable, widely supported, and the go-to format for small data sets, they become unwieldy once we reach a few million rows. So while waiting for yet another file to load, I decided to investigate a few alternatives.

(image courtesy of https://xkcd-excuse.com)

Using a mock data set (created in pandas with one datetime64[ns] and six float64 columns), I recorded

  • the time it takes to write the data to disk
  • the time it takes to read the resulting file into memory
  • the on-disk file size

The source code can be found in this Jupyter notebook.
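
In outline, the benchmark looks something like the sketch below. This is not the notebook code; the row count, column names, and file paths here are illustrative placeholders.

```python
import os
import time

import numpy as np
import pandas as pd

# Mock data: one datetime64[ns] column and six float64 columns.
# n_rows is an illustrative value, not the one used in the post.
n_rows = 1_000_000
df = pd.DataFrame(
    {"timestamp": pd.date_range("2018-01-01", periods=n_rows, freq="s")}
)
for i in range(6):
    df[f"value_{i}"] = np.random.randn(n_rows)

# Each entry: (label, write function, read function, file path).
formats = [
    ("csv",
     lambda p: df.to_csv(p, index=False),
     lambda p: pd.read_csv(p, parse_dates=["timestamp"]),
     "data.csv"),
    ("hdf_fixed",
     lambda p: df.to_hdf(p, key="df", mode="w", format="fixed"),
     lambda p: pd.read_hdf(p, "df"),
     "data.h5"),
    ("parquet_gzip",
     lambda p: df.to_parquet(p, compression="gzip"),
     pd.read_parquet,
     "data.parquet"),
]

for label, write, read, path in formats:
    t0 = time.perf_counter()
    write(path)
    t_write = time.perf_counter() - t0

    t0 = time.perf_counter()
    read(path)
    t_read = time.perf_counter() - t0

    size_mb = os.path.getsize(path) / 1e6
    print(f"{label:>12}: write {t_write:6.2f}s  read {t_read:6.2f}s  {size_mb:8.1f} MB")
```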

The results quickly reveal just how bad CSV really is:

Let’s drop CSV from the comparison and increase the number of rows further:

The clear winner for read speed is the fixed HDF format, while Parquet with gzip compression provides marginal on-disk space savings.

The one-off task of converting my development datasets to HDF easily saves hours of my time, and with pandas supporting HDF (almost) seamlessly, there seems to be no reason not to use it for simple tabular data.
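
For a one-off conversion, something along these lines is all it takes (the file names and the parse_dates column are placeholders for whatever your dataset uses):

```python
import pandas as pd

# One-off conversion: load the CSV once, then work from HDF from then on.
# "raw.csv", "raw.h5" and the "timestamp" column are placeholder names.
df = pd.read_csv("raw.csv", parse_dates=["timestamp"])
df.to_hdf("raw.h5", key="df", mode="w", format="fixed")

# Subsequent loads are a single, much faster call:
df = pd.read_hdf("raw.h5", "df")
```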

Update: I have now uploaded a simple csv2hdf command line tool to https://github.com/ig248/tabelio
