Hamilton Ulmer

Tools for Data Analysis

TIL: dealing with duplicate columns on joins in duckdb & sqlite
Dec 5, 2022

In the DuckDB CLI, when I set .mode json, the following query

WITH x as (
  select 1 as a
  UNION ALL select 2 as a
SELECT * from x JOIN x AS xi on x.a = xi.a

Produces a weird result set:


If you were to parse this JSON string, the first “a” and “b” will get removed by the later ones. This is exactly what you see in Node and also in Rill Developer’s main branch right now. It’s weird behavior, since … you’re throwing out data.

I noticed this because Rill Developer used to return a result set that looked like this:


Which made application development much easier. Through Rill Developer we utilize these :<number> aliases to prevent duplicate keys in lists. Your result set not only has all the row values you’re looking for; it also makes it much easier to select duplicate values in profile queries.

It turns out that if you wrap that final SELECT statement as a CTE and select it - in other words, this:

WITH x as (
  select 1 as a
  UNION ALL select 2 as a
y as (
  SELECT * from x JOIN x AS xi on x.a = xi.a
SELECT * from y

You’ll get these nice :<number> aliases.

This behavior feels bizarre. Why not just always return this? Well, it turns out that sqlite has the same behavior. The choice to mirror this in duckdb was intentional. So while it may seem a little strange at first glance, there’s probably a deeper reason.

It turns out that this is how sqlite also handles these. If I were to materialize the first query as a table or view, or use a CTE, then I get the :<number> part.

It’s hard to fault DuckDB for taking the same approach as the much-more-mature SQLite. And other SQL engines have all sorts of behavior:

  • BigQuery will return a and a_1 for both queries;
  • Postgres will return a and a for both queries;
  • SQL Server will return a and a for the first query, and error for the second
  • Snowflake returns a and a for both queries