Is it possible to control class overlap when generating a dataset with sklearn.datasets.make_classification?
I want to pass an overlap percentage between two classes and have the classes overlap by that percentage.
The detailed requirement is:
Generate an n-class classification dataset (Gaussian-distributed) in a way that lets me control the covariance, the overlap percentage, and the shape of the plotted data (diagonal, straight line, horizontal line, etc.).

make_classification places normally distributed clusters of points around the vertices of a hypercube. You might be able to achieve something like "overlapping classes according to a given percentage" by tweaking the class_sep parameter in specific cases, but I don't think it would work generally.
A solution might be to create binary classification data sets by sampling from Gaussian distributions with known mean and variance. Here's a short demo:
import numpy as np
from numpy.random import default_rng
import matplotlib.pyplot as plt

rng = default_rng()
N_POINTS = 10000
SCALE = 1.3  # standard deviation shared by both classes

# Columns 0 and 1 are the features; column 2 is the class label (0 or 1).
train_data = np.c_[
    np.r_[rng.normal(5, SCALE, (N_POINTS, 2)), rng.normal(10, SCALE, (N_POINTS, 2))],
    np.r_[np.zeros((N_POINTS, 1)), np.ones((N_POINTS, 1))],
]

# Plotting
fig1, ax = plt.subplots()
ax.scatter(train_data[:, 0], train_data[:, 1], c=train_data[:, 2])
ax.set_box_aspect(1)
plt.show()
Here's an example where SCALE = 0.5:

... and here's an example where SCALE = 1.3:

Most samples generated by rng.normal fall within two standard deviations of the mean along each axis (roughly 95% per coordinate); here the means are located at (5, 5) and (10, 10).
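To check the two-standard-deviation rule of thumb empirically, here is a minimal sketch. The seed, the sample count, and the variable names are my own choices for illustration. Note that "within two standard deviations" holds for about 95% of values along a single axis, while the fraction of 2D points inside a circle of radius 2 * SCALE around the mean is lower, about 86%.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the check is repeatable
N_POINTS = 100_000
SCALE = 1.3
MEAN = 5.0

samples = rng.normal(MEAN, SCALE, (N_POINTS, 2))

# Fraction of individual coordinates within 2 standard deviations of the mean.
per_axis = np.mean(np.abs(samples - MEAN) <= 2 * SCALE)

# Fraction of 2D points inside a circle of radius 2 * SCALE around (MEAN, MEAN).
radius = np.linalg.norm(samples - MEAN, axis=1)
within_radius = np.mean(radius <= 2 * SCALE)

print(f"per coordinate: {per_axis:.3f}, within radius: {within_radius:.3f}")
```

The circle-based number is the one that matters when you picture each class as a disc on the scatter plot.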
Knowing the SCALE parameter, the distance between the two means (about 7.071, i.e. the square root of 50), and the radius within which most of each class's points fall, you can estimate how much overlap to expect between the classes.
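One way to turn that estimate into a number is sketched below. It assumes "overlap" means the overlap coefficient, i.e. the area shared by the two class densities, which is only one possible definition and may not be what the question intends. For two Gaussians with the same isotropic standard deviation, that area depends only on the distance between the means divided by the standard deviation. The function names overlap_coefficient and scale_for_overlap are hypothetical helpers, not part of any library.

```python
import math
from statistics import NormalDist

def overlap_coefficient(mean_a, mean_b, scale):
    """Shared area of two isotropic Gaussians with equal standard deviation."""
    d = math.dist(mean_a, mean_b)
    # Along the line joining the means the problem is one-dimensional:
    # each density contributes the tail beyond the midpoint.
    return 2.0 * NormalDist().cdf(-d / (2.0 * scale))

def scale_for_overlap(mean_a, mean_b, target_overlap):
    """Standard deviation that yields the requested overlap coefficient."""
    if not 0.0 < target_overlap < 1.0:
        raise ValueError("target_overlap must be strictly between 0 and 1")
    d = math.dist(mean_a, mean_b)
    z = NormalDist().inv_cdf(target_overlap / 2.0)  # negative for overlap < 1
    return d / (-2.0 * z)

mean_a, mean_b = (5.0, 5.0), (10.0, 10.0)
print(overlap_coefficient(mean_a, mean_b, 1.3))  # overlap for SCALE = 1.3
print(scale_for_overlap(mean_a, mean_b, 0.20))   # SCALE needed for 20% overlap
```

With balanced classes, half of the overlap coefficient is the error rate of the best possible classifier, which gives a second way to read the requested percentage.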
Once you've done that, you can translate your findings back into parameters of sklearn.datasets.make_blobs, such as centers and cluster_std.
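As a rough sketch of that translation, the call below reproduces the two-class demo with make_blobs, reusing the same means and SCALE. The random_state value is an arbitrary choice of mine, and cluster_std is isotropic here, so this does not cover the covariance or shape control asked about in the question.

```python
import numpy as np
from sklearn.datasets import make_blobs

N_POINTS = 10000
SCALE = 1.3

# One entry in n_samples per class; centers holds the two class means.
X, y = make_blobs(
    n_samples=[N_POINTS, N_POINTS],
    centers=np.array([[5.0, 5.0], [10.0, 10.0]]),
    cluster_std=SCALE,
    random_state=0,  # arbitrary seed, assumed here for reproducibility
)

print(X.shape, np.bincount(y))
```

X holds the features and y the class labels, so the pair can replace train_data[:, :2] and train_data[:, 2] from the demo.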