Let's say I have 10.000 booking customer profiles. These profiles have the following variables:
I want to pass 1.000 (10%) of them through a booking engine for pricing, but to develop an independent insights analysis I have to (as much as I can) evenly distribute the characteristics of the profiles. E.g. if all the the profiles have 3 varieties of People_amount (1, 2 and 3), ultimately I would like to have a selection within the 10% of 33,33% with People_amount = 1, 33,33% with People_amount = 2 and 33,33% with People_amount = 3.
But...
Because the profile set isn't evenly distributed (for example 70% of all profiles consist of People_amount = 1) I can't figure out how to maybe find/create a sort of loop (or something else) that fills up the SELECT by all varieties within that characteristic untill 1 is exhausted and goes further with the rest.
Maybe an example on how I would like to fill up the 10% sample of my 10k profiles:
Profile_id People_amount Profile_id People_amount 1 1 1 1 2 1 5 2 3 1 8 3 4 1 --> Filling the sample by even distribution 2 1 5 2 of available profile characteristics 6 2 6 2 9 3 7 2 3 1 8 3 7 2 9 3 4 1
Hope you can help!
You can use union
for that with a limit on each sub select:
(SELECT * FROM profiles WHERE People_amount=1 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=2 LIMIT 333)
UNION
(SELECT * FROM profiles WHERE People_amount=3 LIMIT 333)
The brackets are needed to apply the LIMIT
to each sub select.
In case the number of possible values of people_amount
is not known, the above method is not workable. Then I would propose a query where the ORDER BY
clause distributes people_amount
values based on number of occurrences. It will not give an exact equal distribution, but the distinct values will have a comparable presence in the result set:
select p.*
from (
select people_amount,
count(*) as occurrences
from profiles
group by people_amount) as stats
inner join profiles p
on p.people_amount = stats.people_amount
order by rand() * stats.occurrences
limit 1000
SQL fiddle (if not overloaded).
If you want to extend this to other columns, like Destination
, you could do that as follows:
select p.*
from (
select people_amount,
destination,
count(*) as occurrences
from profiles
group by people_amount,
destination) as stats
inner join profiles p
on p.people_amount = stats.people_amount
and p.destination = stats.destination
order by rand() * stats.occurrences
limit 1000
The idea is that values that have low occurrences will get lower order-by values, and so will pop up more often in the beginning of the result set, compensating for their low frequency.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With