Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Expressing complex logic in case statements in dbt using a lookup table

Tags:

sql

dbt

I'm reworking a good part of our Analytics Data Warehouse and am trying to build things out in a much more modular way. We've swapped to DBT vs an in house transformation tool and I'm trying to take advantage of the functionality it offers.

Previously, the way we classified our rental segments was in a series of CASE statements which evaluate a few fields. These are (sudocode)

CASE WHEN rental_rule_type <> monthly 
AND rental_length BETWEEN 6 AND 24
AND rental_day IN (0,1,2,3,4)
AND rental_starts IN (5,6,7,8,9,10,11)
THEN weekday_daytime_rental

This obviously works. But it's ugly and hard to update. If we want to adjust this, we'll need to do so in the SQL rather than in a lookup table.

What I'd like to build is a simple lookup table that holds these values that can be adjusted at a later date to easily adjust how we classify these rentals, but I'm not sure what the best approach is.

My current thought is to layer these conditions into an excel file, load it into the warehouse with DBT and then join on these conditions, however I'm not sure if that would end up being cleaner logic or not. It would mean there are no hardcoded values in the code, but it would likely still result in a ton of ugly cases and joins.

I think there are some global variables I could define as well in DBT which may help with this?

Anyone approach something similar? Would love to hear some best practices.

like image 493
Dug Avatar asked Sep 08 '25 08:09

Dug


2 Answers

Love the question here and actually tconbeer's answer as well. However, if you want an answer that "favor's readability over DRY-ness", there is an another appropriate middle ground here which is generally regarded as a dbt best-practice: model defined sets.

Example:

{% set rental_rule_type = ["bank_transfer"] %}
{% set rental_length_low = ["6"] %}
{% set rental_length_high = ["24"] %}
{% set rental_days = ["0","1","2","3","4"] %}
{% set rental_starts = ["5","6","7","8","9","10","11"] %}



with some_cte as (
  select * from {{ ref('some_source') }}
)
 select *,
CASE    
        WHEN rental_rule_type <> {{rental_rule_type}}
        AND rental_length BETWEEN {{rental_length_low}} AND {{rental_length_high}}
        AND rental_day IN (
            {% for rental_day in rental_days %}
            {{rental_day}} {%- if not loop.last -%}, {%- endif -%}
            {% endfor %}
        )
        AND rental_starts IN (
            {% for rental_start in rental_starts %}
            {{rental_start}} {%- if not loop.last -%}, {%- endif -%}
            {% endfor %}
        )
THEN weekday_daytime_rental
from some_cte

Example 2 (equivalent but cleaner as suggested in comment below):

{% set rental_rule_type = ["bank_transfer"] %}
{% set rental_length_low = ["6"] %}
{% set rental_length_high = ["24"] %}
{% set rental_days = ["0","1","2","3","4"] %}
{% set rental_starts = ["5","6","7","8","9","10","11"] %}



with some_cte as (
  select * from {{ ref('some_source') }}
)
 select *,
CASE    
        WHEN rental_rule_type <> {{rental_rule_type}}
        AND rental_length BETWEEN {{rental_length_low}} AND {{rental_length_high}}
        AND rental_day IN ( {{ rental_days | join(", ") }} )
        AND rental_starts IN (  {{ rental_starts | join(", ") }} )
THEN weekday_daytime_rental
from some_cte

In this format, all the logic stays accessible to someone who is reading the model but also makes changing the logic much more accessible since it contains all the variables to a single location.

Also, much easier to see that variables / macros are in play here at a quick glance of the model vs if your case statement is buried deep in a chain of CTEs or a more complex select statement after some CTEs.

warning I haven't compiled this so I'm not 100% it will be successful as-is. Should get you started though if you go in this direction.

like image 94
sgdata Avatar answered Sep 10 '25 04:09

sgdata


I've tried what you describe, and I've generally regretted it.

Logic really should be expressed as code, not data. It should be source-controlled, reviewable, and support multiple environments (dev and prod).

Seed files (with dbt seed) are kind-of data, and kind-of code, since they get checked into source control alongside the code. This at least solves the multiple environments problem, but it makes code review extremely difficult.

I'd recommend doing what software engineers do -- encapsulate the logic into easily-understandable and easily-testable components, and then compose those components in your model. Macros work pretty well for this.

For example, your case statement above could become a macro called is_weekday_daytime_rental():

{% macro is_weekday_daytime_rental(rental_rule_type, rental_legnth, rental_day, rental_starts) %}
CASE 
  WHEN {{ rental_rule_type }} <> monthly 
    AND {{ rental_length }} BETWEEN 6 AND 24
    AND {{ rental_day }} IN (0,1,2,3,4)
    AND {{ rental_starts }} IN (5,6,7,8,9,10,11)
  THEN true
  ELSE false
END
{% endmacro %}

then you could call that macro in your model, like:

CASE
    WHEN
        {{ is_weekly_daytime_rental(
            "rental_rule_type",
            "rental_legnth",
            "rental_day",
            "rental_starts"
        ) }} 
    THEN weekday_daytime_rental
    WHEN ...

But let's do better. Assuming you're also going to have is_weekend_daytime_rental, then each of those component bits of logic should be its own macro that you can reuse:

{% macro is_weekday_daytime_rental(rental_rule_type, rental_legnth, rental_day, rental_starts) %}
CASE 
  WHEN {{ is_daily_rental(rental_rule_type, rental_length) }}
    AND {{ is_weekday(rental_day) }}
    AND {{ is_daytime(rental_starts) }}
  THEN true
  ELSE false
END
{% endmacro %}

where each component looks like:

{% macro is_weekday(day_number) %}
CASE
    WHEN {{ day_number }} IN (0, 1, 2, 3, 4)
    THEN true
    ELSE false
END
{% endmacro %} 
like image 25
tconbeer Avatar answered Sep 10 '25 03:09

tconbeer