Hamilton Ulmer

Tools for Data Analysis & Deep Work

duckdb wishlist, two months in: 'is the duck rude?', cancelable queries, memory footprints, & javascript UDFs
Feb 21, 2022

This is an update to a quick post about my duckdb feature wishlist. We’ve been very productively using duckdb’s node library for almost two months now (basically since I started my new job).

I’ve figured out that really, duckdb can be used to do so much more than just analytical queries; you can build entire analytical systems that do exploratory data analysis for you. People enthusiastically responded to one of my automatic EDA demos on Twitter. We’re going to go very deep on this idea within the context of “data modeling” – figuring out how we transform data throughout a data pipeline. Anyone who has been deep into data science or data engineering knows that a lot of our day-to-day involves cleaning and profiling our datasets. Why aren’t our tools helping us do this more efficiently? Boggles the mind. I have some ideas about how to make this better.

But first – “is the duck rude?” Before I get to my updated wishlist, one point of order: a colleague of mine wondered why the duck on the duckdb website seems to be turning away from her. It seems that the brain may register it this way because the beak is the same color as the body.

Once you see the duck turning away, you can’t unsee it.

The solution is simple: add a triangular easing function for the beak color that peaks as the head faces forward. It’ll then be clear that the duck is not in fact turning away from the viewer.
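
For the curious, here is a minimal sketch of what that easing could look like. Everything in it is an assumption for illustration: I'm pretending the head rotation is a normalized cycle `t` where 0.5 means "facing forward", and the two colors are made up rather than taken from the actual logo animation.

```typescript
type RGB = [number, number, number];

// Triangular easing: 0 at t = 0, rising linearly to 1 at t = 0.5,
// then falling linearly back to 0 at t = 1. The input is wrapped so the
// animation can loop.
function triangularEase(t: number): number {
  const wrapped = ((t % 1) + 1) % 1; // keep t in [0, 1), also for negative t
  return 1 - Math.abs(2 * wrapped - 1);
}

// Linear interpolation between two RGB colors, channel by channel.
function lerpColor(from: RGB, to: RGB, amount: number): RGB {
  return from.map((c, i) => Math.round(c + (to[i] - c) * amount)) as RGB;
}

// Hypothetical colors: the beak matches the body when the head is turned
// away and is fully orange when the head faces the viewer.
const BODY_COLOR: RGB = [255, 240, 0];
const BEAK_ACCENT: RGB = [255, 140, 0];

function beakColor(t: number): RGB {
  return lerpColor(BODY_COLOR, BEAK_ACCENT, triangularEase(t));
}
```

Calling `beakColor(t)` once per animation frame would give a beak that only stands out from the body while the duck is looking at you.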

At any rate, here are the features I’m generally hoping for today, about two months since I went head-first into duckdb land:

  • a way to interrupt & cancel queries – we are still looking for a way to pause or cancel queries. Since we end up running tons of column-summarizing queries, interrupting duckdb’s own query queue is pretty much essential for our use-case of describing all the columns of a dataset. Thankfully Mark Raasveldt merged the interface support for incremental & deferred queries. Now someone just needs to implement interrupts at the client level. But heck, I would love incremental updates as well! That would be a great thing to surface in our user interfaces.
  • a way to estimate or quickly calculate the memory footprint of a result set – it’s easy to take numeric columns and “multiply the rectangle” – the column’s data type determines the width, and the length is the number of rows. But VARCHAR is not as straightforward. We’re looking for a way to estimate these column sizes so we can create a scale-invariant “byte reduction ratio” for our model transforms – bytes in / bytes out for a query transformation.
  • a way to utilize javascript UDFs in the Node client – it looks like Hannes has a PR out that does this, but it hasn’t been merged yet.
  • arrow support – it would be great to have the node library support arrow dataframes. All of the same goodness that we all know and love about Arrow outside of the browser also applies to it inside the browser. No parse step (this is actually huge), smaller footprints, and faster scanning.
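
To make the memory footprint item a bit more concrete, here is a rough sketch of the "multiply the rectangle" idea and the byte reduction ratio. This is not how duckdb accounts for memory internally: the width table holds logical byte widths for a handful of types, and the `avgByteLength` field is a hypothetical stand-in for whatever per-column string-length estimate you can get (say, from a sampling query). All of the names here are mine.

```typescript
// Logical byte widths for a few fixed-width types. Illustrative assumption
// only; duckdb's real in-memory layout may differ.
const FIXED_WIDTH_BYTES: Record<string, number> = {
  BOOLEAN: 1,
  TINYINT: 1,
  SMALLINT: 2,
  INTEGER: 4,
  BIGINT: 8,
  FLOAT: 4,
  DOUBLE: 8,
  DATE: 4,
  TIMESTAMP: 8,
};

interface ColumnProfile {
  name: string;
  type: string;
  // Hypothetical: average bytes per value, needed for variable-width
  // types such as VARCHAR.
  avgByteLength?: number;
}

// width x length for fixed-width columns; average length x rows otherwise.
function estimateColumnBytes(col: ColumnProfile, rowCount: number): number {
  const width = FIXED_WIDTH_BYTES[col.type.toUpperCase()];
  if (width !== undefined) return width * rowCount;
  if (col.avgByteLength !== undefined) {
    return Math.round(col.avgByteLength * rowCount);
  }
  throw new Error(`no size estimate available for ${col.name} (${col.type})`);
}

function estimateTableBytes(cols: ColumnProfile[], rowCount: number): number {
  return cols.reduce((sum, col) => sum + estimateColumnBytes(col, rowCount), 0);
}

// Bytes in / bytes out for one transformation. Values above 1 mean the
// transform shrank the data.
function byteReductionRatio(bytesIn: number, bytesOut: number): number {
  return bytesIn / bytesOut;
}
```

For example, a 1,000-row table with one INTEGER column and one VARCHAR column averaging 12.5 bytes per value comes out at roughly 16,500 bytes under these assumptions. The hard part, and the reason this is on the wishlist, is getting that VARCHAR average cheaply.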