Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySql: Evenly distributed sample size on unevenly distributed data

Let's say I have 10.000 booking customer profiles. These profiles have the following variables:

  • Duration (days to go on holiday)
  • Destination (maybe Brazil)
  • People_amount (how many go with)
  • Takeoff (date when they want to leave)

I want to pass 1.000 (10%) of them through a booking engine for pricing, but to develop an independent insights analysis I have to (as much as I can) evenly distribute the characteristics of the profiles. E.g. if all the the profiles have 3 varieties of People_amount (1, 2 and 3), ultimately I would like to have a selection within the 10% of 33,33% with People_amount = 1, 33,33% with People_amount = 2 and 33,33% with People_amount = 3.

But...

Because the profile set isn't evenly distributed (for example 70% of all profiles consist of People_amount = 1) I can't figure out how to maybe find/create a sort of loop (or something else) that fills up the SELECT by all varieties within that characteristic untill 1 is exhausted and goes further with the rest.

Maybe an example on how I would like to fill up the 10% sample of my 10k profiles:

Profile_id  People_amount                                     Profile_id  People_amount
1           1                                                           1           1
2           1                                                           5           2
3           1                                                           8           3
4           1       --> Filling the sample by even distribution         2           1
5           2       of available profile characteristics                6           2
6           2                                                           9           3   
7           2                                                           3           1
8           3                                                           7           2
9           3                                                           4           1

Hope you can help!

like image 754
Dev Avatar asked Oct 17 '25 20:10

Dev


1 Answers

You can use union for that with a limit on each sub select:

(SELECT * FROM profiles WHERE People_amount=1 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=2 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=3 LIMIT 333)

The brackets are needed to apply the LIMIT to each sub select.

A more dynamic approach

In case the number of possible values of people_amount is not known, the above method is not workable. Then I would propose a query where the ORDER BY clause distributes people_amount values based on number of occurrences. It will not give an exact equal distribution, but the distinct values will have a comparable presence in the result set:

select     p.*
from       (
            select   people_amount,  
                     count(*) as occurrences
            from     profiles
            group by people_amount) as stats
inner join profiles p
        on p.people_amount = stats.people_amount         
order by   rand() * stats.occurrences
limit      1000

SQL fiddle (if not overloaded).

If you want to extend this to other columns, like Destination, you could do that as follows:

select     p.*
from       (
            select   people_amount,  
                     destination,
                     count(*) as occurrences
            from     profiles
            group by people_amount,
                     destination) as stats
inner join profiles p
        on p.people_amount = stats.people_amount         
       and p.destination = stats.destination
order by   rand() * stats.occurrences
limit      1000

The idea is that values that have low occurrences will get lower order-by values, and so will pop up more often in the beginning of the result set, compensating for their low frequency.

like image 182
trincot Avatar answered Oct 19 '25 11:10

trincot