
How can I generate a binary classification dataset and control the overlap between the 2 classes?

Is it possible to control the overlap while generating a dataset with sklearn.datasets.make_classification?

I want to pass an overlap percentage between 2 classes and have the generated classes overlap according to that percentage.

The detailed requirement is: generate an n-class classification dataset (Gaussian-distributed) that can be controlled through the covariance, the overlap percentage, and the shape of the plot, such as a diagonal, a straight line, or a horizontal line.

[Image: Shape of plot]

Purvish Jariwala asked Nov 23 '25 11:11


1 Answer

make_classification places normally distributed clusters of points around the vertices of a hypercube. You might be able to achieve something like "overlapping classes according to a given percentage" by tweaking the class_sep parameter in specific cases, but I don't think it would work generally.

A solution might be to create binary classification data sets by sampling from Gaussian distributions with known mean and variance. Here's a short demo:

import numpy as np
from numpy.random import default_rng
import matplotlib.pyplot as plt

rng = default_rng()

N_POINTS = 10000
SCALE = 1.3

train_data = np.c_[
    np.r_[rng.normal(5, SCALE, (N_POINTS, 2)), rng.normal(10, SCALE, (N_POINTS, 2))],
    np.r_[np.zeros((N_POINTS, 1)), np.ones((N_POINTS, 1))],
]

# Plotting
fig1, ax = plt.subplots()
ax.scatter(train_data[:, 0], train_data[:, 1], c=train_data[:, 2])
ax.set_box_aspect(1)
plt.show()

Here's an example where SCALE = 0.5:

[Image: two Gaussian blobs with scale 0.5; they are nowhere close to overlapping]

... and here's an example where SCALE = 1.3:

[Image: two Gaussian blobs with scale 1.3; they appear to overlap slightly]

Each coordinate drawn by rng.normal falls within two standard deviations of its mean about 95% of the time, so most points land within roughly 2 * SCALE of the class centers, which we placed at (5, 5) and (10, 10).

The distance between the means is about 7.071 (the square root of 50, from (5, 5) to (10, 10)). Knowing that distance and the expected radius within which each class falls, you can change the SCALE parameter and estimate how much overlap to expect between your classes.
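To make that estimate concrete, here is a minimal sketch that turns a target overlap into a value for SCALE. It rests on an assumption that is not in the question: "overlap" is taken to mean the fraction of points that land closer to the other class's mean than to their own. For two equally sized isotropic Gaussians with the same standard deviation, that fraction is Phi(-d / (2 * sigma)), where d is the distance between the means and Phi is the standard normal CDF. The function names are made up for illustration.

```python
from statistics import NormalDist

import numpy as np
from numpy.random import default_rng


def expected_overlap(mean_a, mean_b, scale):
    """Fraction of points expected to lie nearer the other class's mean.

    Assumes two isotropic Gaussians with the same standard deviation
    (scale) and the same number of points per class.
    """
    distance = float(np.linalg.norm(np.subtract(mean_b, mean_a)))
    return NormalDist().cdf(-distance / (2.0 * scale))


def scale_for_overlap(mean_a, mean_b, overlap):
    """Invert expected_overlap: the scale that yields the target overlap."""
    if not 0.0 < overlap < 0.5:
        raise ValueError("overlap must be strictly between 0 and 0.5")
    distance = float(np.linalg.norm(np.subtract(mean_b, mean_a)))
    return distance / (2.0 * NormalDist().inv_cdf(1.0 - overlap))


def empirical_overlap(mean_a, mean_b, scale, n_points=10000, seed=None):
    """Sample both classes and count points nearer the wrong mean."""
    rng = default_rng(seed)
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    class_a = rng.normal(mean_a, scale, (n_points, mean_a.size))
    class_b = rng.normal(mean_b, scale, (n_points, mean_b.size))

    def nearer_to_b(points):
        to_a = np.linalg.norm(points - mean_a, axis=1)
        to_b = np.linalg.norm(points - mean_b, axis=1)
        return to_b < to_a

    wrong = nearer_to_b(class_a).sum() + (~nearer_to_b(class_b)).sum()
    return wrong / (2 * n_points)


if __name__ == "__main__":
    scale = scale_for_overlap((5, 5), (10, 10), overlap=0.10)
    print("scale for 10% overlap:", scale)
    print("empirical check:", empirical_overlap((5, 5), (10, 10), scale, seed=0))
```

Other definitions of overlap (for example, the shared area under the two density curves) give different numbers, so treat the formula as one reasonable choice and not the only one.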

Once you've done that, you can translate your findings back into parameters of sklearn.datasets.make_blobs, such as centers for the means and cluster_std for the standard deviation.
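As a rough sketch of that translation, the demo above could be reproduced with make_blobs by passing the two means as centers and SCALE as cluster_std. The random_state value is arbitrary and only there to make the run repeatable. Note that make_blobs shuffles the rows by default, so the two classes are interleaved instead of stacked.

```python
import numpy as np
from sklearn.datasets import make_blobs

N_POINTS = 10000
SCALE = 1.3

# Two classes centered at (5, 5) and (10, 10), N_POINTS samples each.
X, y = make_blobs(
    n_samples=2 * N_POINTS,
    centers=np.array([[5.0, 5.0], [10.0, 10.0]]),
    cluster_std=SCALE,
    random_state=0,  # arbitrary seed, chosen for reproducibility
)

# Same layout as train_data in the demo: two feature columns, then the label.
train_data = np.c_[X, y]
```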

Alexander L. Hayes answered Nov 26 '25 02:11


