finding the smallest available timegrain with DuckDB

finding the smallest available timegrain with DuckDB

Aug 12, 2023

While many machine-generated datasets provide data down to the millisecond, it’s not always clear what granularity you’re getting in a dataset when you first receive it. The data might have been already aggregated, or perhaps it was manually collected and reported on a weekly basis. Instead of relying on trial-and-error to determine the correct time granularity, why not detect it automatically?

At Rill, we put together a simple heuristic for detecting the smallest-available timegrain. This was especially useful for Rill’s auto-generated time series dashboards, where the goal is to provide a starting point for exploration.

The basis of this heuristic is to look at the number of unique values for each timegrain. For instance, if I extract the milliseconds out of a column, and it turns out I only have one unique millisecond value, most likely the data is rolled up to a larger grain.

The SQL (in DuckDB) is below (with the source in Rill Developer here):

-- Pull the column. Optionally, it might make sense to sample if the
-- total number of rows is large enough to make this query run slowly.
WITH cleaned_column AS (
    SELECT "<COLUMN>" as cd
    from "<TABLE>"
    <SAMPLE CLAUSE, IF NEEDED>
),
-- For each component of the timestamp, count the
-- number of unique instances. So for instance, we'll
-- count the number of unique hours (a number between 0-23).
time_grains as (
SELECT
    approx_count_distinct(extract('years' from cd)) as year,
    approx_count_distinct(extract('months' from cd)) as month,
    approx_count_distinct(extract('dayofyear' from cd)) as dayofyear,
    approx_count_distinct(extract('dayofmonth' from cd)) as dayofmonth,
    -- This checks if the last day of the month is the reported period.
    min(cd = last_day(cd)) = TRUE as lastdayofmonth,
    approx_count_distinct(extract('weekofyear' from cd)) as weekofyear,
    approx_count_distinct(extract('dayofweek' from cd)) as dayofweek,
    approx_count_distinct(extract('hour' from cd)) as hour,
    approx_count_distinct(extract('minute' from cd)) as minute,
    approx_count_distinct(extract('second' from cd)) as second,
    approx_count_distinct(extract('millisecond' from cd) -
      extract('seconds' from cd) * 1000) as ms
FROM cleaned_column
)
SELECT
  -- Estimate the smallest available timegrain based on
  -- the CTE above.
  --
  -- For each one of these extracted unique time component,
  -- if the distinct count is 1, then this timegrain is probably not
  -- the smallest available one.
  COALESCE(
    -- We start by looking at milliseconds, then working up to hours.
      case WHEN ms > 1 THEN 'milliseconds' else NULL END,
      CASE WHEN second > 1 THEN 'seconds' else NULL END,
      CASE WHEN minute > 1 THEN 'minutes' else null END,
      CASE WHEN hour > 1 THEN 'hours' else null END,
      -- We need to reverse from here: years, months, weeks, days.
      -- for `years`, we know that if there is more than one
      -- dayofyear & more than one year, then it's plausible that
      -- `years` is the right resolution.
      CASE WHEN dayofyear = 1 and year > 1 THEN 'years' else null END,
      -- similarly, for months, if there is only one day of the month
      -- OR all the days are the last day of the month,
      -- AND there is more than one month,
      -- then it's plausible that we have the `months` timegrain.
      CASE WHEN (dayofmonth = 1 OR lastdayofmonth) and month > 1
        THEN 'months' else null END,
      -- It gets easier from here.
      CASE WHEN dayofweek = 1 and weekofyear > 1
        THEN 'weeks' else null END,
      CASE WHEN hour = 1 THEN 'days' else null END
  ) as estimatedSmallestTimeGrain
FROM time_grains

Tradeoffs

If my obervations are reported in half-day or three-hour periods, then we can’t detect the timegrain.
This heuristic also doesn’t quite handle quarterly reporting either. This could be potentially addressed by simply looking for the number of unique days in the year and building out a heuristic.
Another limitation is that it assumes the timestamps are evenly distributed across the time periods. If for some reason the data is sparse or clustered, it may detect a finer grain than is truly present. Looking at the distribution of timestamps could help mitigate this.
Ultimately, this heuristic should be one piece of a larger framework for autodetecting the right timegrain.

Performance

The performance of this heuristic is great in practice, especially if you choose to sample from the column first. You really don’t need all of the data to get value out of this query.

Thanks to Marissa Gorlick for working through this heuristic with me some time in 2022!

Hamilton Ulmer

Tradeoffs

Performance