I am using Snowflake to write my sql queries. We have a huge table with billions of records containing customer information. The goal is to get random sample out and use R to look at the distributions. Unfortunately, we can't use JDBC/ODBC connection from RStudio to the Database. It is a limitation. So I am left with pulling extract from Snowflake and importing into R.
The difficulty, is we have a column called CUSTOMER SEGMENT which has almost 24 unique values. The goal is to get a sample which represents significant proportion from each segment. I tried the following query;
SELECT DISTINCT *
FROM test sample(10)
to obtain random sample where each row has 10 percent probability of being selected. But I am not obtaining sample from each values of customer segment. May I know of any sql commands, which can help to stratify based on Customer segment. thanks in advance.
An alternative way of sampling for more equal-sized partitions is to use round robin sampling
select t.*
from (select t.*,
row_number() over (partition by segment order by random()) as seqnum,
count(*) over () as cnt
from test t
) t
where seqnum <= 20;
The "20" says up to 20 rows for each segment.
This can be modified for a percentage based sample. It is not clear if that is necessary.
For some large number of rows, this will approach a stratified sample.
select *
from test
order by row_number() over (partition by segment_1, segment_2 order by random()) /
count(*) over (partition by segment_1, segment_2)
limit 1000000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With