Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternative to CASE WHEN to create bins from a column

Tags:

sql

hive

hiveql

I have a table where a column (COL1) can have values from 0 to 1 (inclusive). From this column I would like to create a new column that categorizes COL1 using predefined ranges (i.e. create bins). When the number of categories / bins is small, this can easily be done using CASE WHEN, for example:

SELECT
       CASE WHEN COL1 <= 0.2 THEN 1
            WHEN COL1 > 0.2 AND COL1 <= 0.4 THEN 2
            ....
            WHEN COL1 > 0.8 AND COL1 <= 1 THEN 5
       END AS COL1_bin
FROM   Table

Since I would like to have 100 bins, I am looking for a way to do this more 'automatically' using a function, where I can for example give a minimum, maximum and stepsize / binsize (e.g., min = 0, max = 1, stepsize = 0.01)

I am doing this in HiveQL, which is supposed to be very similar to MySQL.

like image 247
Anonymous Avatar asked Dec 06 '25 12:12

Anonymous


2 Answers

Here a way to do this. If supposing you would like the bin ranges to be 0.1,0.2,0.3,... till 1. Then you would generate 10 rows using the block row_gen and setting the step_size as row_number/10, and also by setting the lower_limit and upper_limit for each bin.

After that you would need to check if the value for col1 fits in the corresponding lower and upper bound of the bin as follows. (I used left join to include those conditions where the col1 value is null. If this is not a likely scenario then it can be changed to a regular join)

with row_gen 
  as (select top 10 
             ,row_number() over(order by 1) as bin_id
             ,row_number() over(order by 1)/10 as as lower_limit
             ,row_number() over(order by 1)/10 + 1/10 as as upper_limit
        from table /*any table that has 10 rows*/ 
      )
   select a.*
          ,b.bin_id
     from your_table a
left join row_gen b
       on a.col1 between b.lower_rnk and b.upper_rnk
like image 164
George Joseph Avatar answered Dec 09 '25 13:12

George Joseph


I'm not sure if Hive has a binning function (Postgres does). But you can use arithmetic:

select floor( greatest(least($maxval, col1), $minval) - $minval) /
              $step_size
            ) as bin
from t;

If you want the min and maximum values from the data, you can use window functions:

select floor( greatest(least(maxval, col1), minval) - minval) /
              $step_size
            ) as bin
from (select t.*,
             min(col1) over () as minval,
             max(col1) over () as maxval
      from t
     ) t;

Strictly speaking, you don't need a subquery for the calculation. But this allows the bin to be used for aggregation.

like image 31
Gordon Linoff Avatar answered Dec 09 '25 13:12

Gordon Linoff



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!