I have a table where a column (COL1) can have values from 0 to 1 (inclusive). From this column I would like to create a new column that categorizes COL1 using predefined ranges (i.e. create bins). When the number of categories / bins is small, this can easily be done using CASE WHEN, for example:
SELECT
CASE WHEN COL1 <= 0.2 THEN 1
WHEN COL1 > 0.2 AND COL1 <= 0.4 THEN 2
....
WHEN COL1 > 0.8 AND COL1 <= 1 THEN 5
END AS COL1_bin
FROM Table
Since I would like to have 100 bins, I am looking for a way to do this more 'automatically' using a function, where I can for example give a minimum, maximum and stepsize / binsize (e.g., min = 0, max = 1, stepsize = 0.01)
I am doing this in HiveQL, which is supposed to be very similar to MySQL.
Here a way to do this. If supposing you would like the bin ranges to be 0.1,0.2,0.3,... till 1. Then you would generate 10 rows using the block row_gen and setting the step_size as row_number/10, and also by setting the lower_limit and upper_limit for each bin.
After that you would need to check if the value for col1 fits in the corresponding lower and upper bound of the bin as follows. (I used left join to include those conditions where the col1 value is null. If this is not a likely scenario then it can be changed to a regular join)
with row_gen
as (select top 10
,row_number() over(order by 1) as bin_id
,row_number() over(order by 1)/10 as as lower_limit
,row_number() over(order by 1)/10 + 1/10 as as upper_limit
from table /*any table that has 10 rows*/
)
select a.*
,b.bin_id
from your_table a
left join row_gen b
on a.col1 between b.lower_rnk and b.upper_rnk
I'm not sure if Hive has a binning function (Postgres does). But you can use arithmetic:
select floor( greatest(least($maxval, col1), $minval) - $minval) /
$step_size
) as bin
from t;
If you want the min and maximum values from the data, you can use window functions:
select floor( greatest(least(maxval, col1), minval) - minval) /
$step_size
) as bin
from (select t.*,
min(col1) over () as minval,
max(col1) over () as maxval
from t
) t;
Strictly speaking, you don't need a subquery for the calculation. But this allows the bin to be used for aggregation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With