TL;DR

This post will show you how to convert Apache Parquet files to CSV and vice versa in a very convenient and fast way πŸš€, using DuckDB πŸ¦†, with Pandas 🐍 as a baseline for comparison

As a quick bonus, we will embed this tool in a small convenient CLI script, easily triggered from your favorite shell πŸ‘¨β€πŸ’»

Let’s go!

Intro

Recently, I’ve been working a bit more on Data Engineering tasks (setting up a data lake, converting data, designing pipelines, cleaning up some data). πŸ“Š

From time to time, I had to convert data to .csv, which is perfect for quickly catching any important info, checking that decryption worked, and so on.. πŸ‘€

But CSV files take up a lot of space, and in order to save some costs on AWS S3 storage, it is much better to store files in the Apache Parquet format ⚑

And eventually, I found myself running the same commands again and again to convert Parquet to CSV and vice versa. I tried to find a CLI tool widely adopted by the Data Engineering community, but unfortunately I couldn’t find one!

And the same inefficient commands kept coming back..

cat

Sadly, with Parquet you cannot visualize your data directly; the following command won’t help you make sense of your file.. πŸ˜₯

❯ head file.parquet
B%R1x,<I/

!
b
B5
 u
:!-
<M=
P(-    bG

As always in optimization problems, there is no free lunch: you cannot have both the convenience of visualizing your data at a glance and a super storage-efficient format. πŸ€‘

But after a while, I remembered a post from a DuckDB advocate showing how DuckDB could handle this kind of operation. Let’s try it on our machine! πŸš€

Comparing Pandas and DuckDB

A few months ago, I came across a LinkedIn post from a DuckDB advocate about crafting a one-line script to efficiently convert a CSV file into a Parquet file.

I decided to give it a try and compare it to the classic tools for the job (like pandas).

Setting up a baseline for a conversion tool

Let’s first download a medium-sized dataset, for instance the MovieLens 25M dataset.

Stay tuned, we will be using it in a future post!

❯ head -n 3 ratings.csv
userId,movieId,rating,timestamp
1,296,5.0,1147880044
1,306,3.5,1147868817

Let’s see, with pandas and pyarrow installed, how the baseline tool behaves on this operation. For the record and for the sake of reproducibility, here is a snapshot of the pip freeze of my whole venv:

❯ pip freeze
numpy==1.25.2
pandas==2.1.0
pyarrow==13.0.0
python-dateutil==2.8.2
pytz==2023.3.post1
six==1.16.0
tzdata==2023.3

And here is how it performs:

❯ /usr/bin/time -l -h -p python -c "import pandas; df=pandas.read_csv('ratings.csv'); df.to_parquet('ratings.parquet')"
real 14,43
user 9,40
sys 2,80
          1774600192  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1037341  page reclaims
                5284  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 864  voluntary context switches
               25149  involuntary context switches
         77809521962  instructions retired
         42836239895  cycles elapsed
          2854670336  peak memory footprint
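
For completeness, the vice-versa direction with pandas is just as short. A minimal sketch, not benchmarked here, assuming the same venv with pyarrow installed:

❯ python -c "import pandas; df=pandas.read_parquet('ratings.parquet'); df.to_csv('ratings.csv', index=False)"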

Now let’s use DuckDB πŸ¦†

Installing it is very simple:

For macOS

brew install duckdb

and for Linux (be sure to grab the right architecture):

curl -SL https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-linux-amd64.zip -o /tmp/duckdb.zip
unzip /tmp/duckdb.zip -d /tmp/duckdb
mv /tmp/duckdb/* /usr/local/bin/
chmod +x /usr/local/bin/duckdb
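
To quickly check that the binary is correctly installed and available on your PATH, you can ask it for its version (the exact output depends on the release you grabbed):

❯ duckdb --version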

Let’s convert that .csv file πŸš€:

❯ /usr/bin/time -l -h -p duckdb -c "COPY (select * from read_csv_auto('ratings.csv')) TO 'ratings.parquet' (FORMAT PARQUET)"
100% β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–
real 9,12
user 24,58
sys 2,36
           603959296  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              551447  page reclaims
                1741  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                 761  voluntary context switches
               34911  involuntary context switches
        135618259891  instructions retired
         96669139421  cycles elapsed
           645996544  peak memory footprint

Let’s compare the data between those runs!

Metric                              pandas           DuckDB
real time (s)                       14.43            9.12
maximum resident set size (bytes)   1 774 600 192    603 959 296
peak memory footprint (bytes)       2 854 670 336    645 996 544

First impression: I love that progress bar; it is always annoying to wait for an operation without ever being sure when it will end.

However, it seems that the file is a little less compressed than the pandas one (there may be some tweaking to do on the DuckDB side, see the example below the listing):

❯ du -h ratings*.parquet
225M    ratings_duckdb.parquet
168M    ratings_pandas.parquet
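
I haven’t re-run the benchmark with it, but recent DuckDB versions let you pick the Parquet codec through the COMPRESSION option of COPY, so that is one knob worth trying (zstd shown here as an example):

❯ duckdb -c "COPY (select * from read_csv_auto('ratings.csv')) TO 'ratings.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"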

However, when looking at the peak memory figures, we can see that DuckDB processes the data in chunks, while pandas loads the whole object into memory. On low-memory systems or with big objects, this can be limiting.

The overall peak is about 4.4 times lower with DuckDB. Excellent!


Developing a small CLI tool - the fancy way

Simply tweak your .zshrc or .bashrc with these incredible functions:
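
Here is a minimal sketch of what such functions can look like, assuming duckdb is available on your PATH and that the output file should land next to the input one (adapt names and paths to your taste):

# CSV -> Parquet: csv_to_parquet file.csv produces file.parquet
csv_to_parquet() {
  duckdb -c "COPY (SELECT * FROM read_csv_auto('$1')) TO '${1%.csv}.parquet' (FORMAT PARQUET)"
}

# Parquet -> CSV: parquet_to_csv file.parquet produces file.csv
parquet_to_csv() {
  duckdb -c "COPY (SELECT * FROM read_parquet('$1')) TO '${1%.parquet}.csv' (FORMAT CSV, HEADER)"
}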

And transform your files with a one-line command:

parquet_to_csv file.parquet

and boom, you get a file.csv πŸ’₯

And do the reverse operation very simply:

csv_to_parquet file.csv

And here is your file.parquet, so fast and efficient! 🏎️
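
Bonus: remember the unreadable head file.parquet from the intro? Since DuckDB can query Parquet files directly, you can also peek at a few rows without converting anything, for instance:

❯ duckdb -c "SELECT * FROM 'file.parquet' LIMIT 5"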

Conclusion

We just covered a very handy DuckDB feature and, with this small example, turned this versatile tool into a convenient CLI utility that will save you a lot of time in your daily Data Engineering life!

Do not hesitate to share some of the smart one-liner commands and functions you keep in your .bashrc or .zshrc!

Lots of ❀️ to the DuckDB team for the incredible work!