How to perform stratification based on a column in Snowflake

Question

I am using Snowflake to write my sql queries. We have a huge table with billions of records containing customer information. The goal is to get random sample out and use R to look at the distributions. Unfortunately, we can't use JDBC/ODBC connection from RStudio to the Database. It is a limitation. So I am left with pulling extract from Snowflake and importing into R.

The difficulty, is we have a column called CUSTOMER SEGMENT which has almost 24 unique values. The goal is to get a sample which represents significant proportion from each segment. I tried the following query;

SELECT DISTINCT *
FROM test sample(10)

to obtain random sample where each row has 10 percent probability of being selected. But I am not obtaining sample from each values of customer segment. May I know of any sql commands, which can help to stratify based on Customer segment. thanks in advance.

Gordon Linoff · Accepted Answer

An alternative way of sampling for more equal-sized partitions is to use round robin sampling

select t.*
from (select t.*, 
             row_number() over (partition by segment order by random()) as seqnum,
             count(*) over () as cnt
      from test t
     ) t
where seqnum <= 20;

The "20" says up to 20 rows for each segment.

This can be modified for a percentage based sample. It is not clear if that is necessary.

jasonchang2018 · Answer

For some large number of rows, this will approach a stratified sample.

select     *
from       test
order by   row_number() over (partition by segment_1, segment_2 order by random()) /
           count(*) over (partition by segment_1, segment_2)
limit      1000000

How to perform stratification based on a column in Snowflake

Tags:

sql

r

snowflake-cloud-data-platform

jay

2 Answers

Gordon Linoff

jasonchang2018

Recent Activity

Donate For Us

How to perform stratification based on a column in Snowflake

Tags:

sql

r

snowflake-cloud-data-platform

jay

2 Answers

Gordon Linoff

jasonchang2018

Related questions

Recent Activity

Donate For Us