How can I generate a binary classification dataset and control the overlap between the 2 classes?

Question

Is it possible to control the overlap between classes when generating a dataset with sklearn.datasets.make_classification?

I want to pass an overlap percentage for the 2 classes and have the generated classes overlap by that percentage.

The detailed requirement is: generate an n-class classification dataset (in a Gaussian manner) in such a way that we can control it by specifying the covariance, the overlap percentage, and the shape of the plot, such as a diagonal, a straight line, a horizontal line, etc.

Alexander L. Hayes · Accepted Answer

make_classification places clusters of normally distributed points at the vertices of a hypercube whose sides have length 2 * class_sep. You might be able to achieve something like "overlapping classes according to a given percentage" by tweaking the class_sep parameter in specific cases, but I don't think it would work generally.
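
To get a feel for what class_sep does in the simplest two-feature case, here is a minimal sketch. The helper name centroid_distance and the parameter choices (one cluster per class, no label noise, a fixed random_state) are assumptions made for illustration. It only shows that the class centers move apart as class_sep grows; it does not map a class_sep value to an overlap percentage.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep, seed=0):
    """Distance between the two class means for a given class_sep (illustrative helper)."""
    X, y = make_classification(
        n_samples=2000,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=1,
        flip_y=0.0,  # no label noise, so labels follow the clusters exactly
        class_sep=class_sep,
        random_state=seed,
    )
    return float(np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)))


for sep in (0.5, 1.0, 3.0):
    print(f"class_sep={sep}: centroid distance ~ {centroid_distance(sep):.2f}")
```

Because make_classification also applies a random linear transformation to each cluster, the spread of each class is not something you set directly, which is one reason the overlap is hard to pin down this way.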

A solution might be to create binary classification data sets by sampling from Gaussian distributions with known mean and variance. Here's a short demo:

import numpy as np
from numpy.random import default_rng
import matplotlib.pyplot as plt

rng = default_rng()

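N_POINTS = 10000  # samples per class
SCALE = 1.3       # standard deviation shared by both classes

# Columns 0 and 1 hold the features; column 2 holds the class label.
train_data = np.c_[
    # Class 0 is centered at (5, 5), class 1 at (10, 10).
    np.r_[rng.normal(5, SCALE, (N_POINTS, 2)), rng.normal(10, SCALE, (N_POINTS, 2))],
    np.r_[np.zeros((N_POINTS, 1)), np.ones((N_POINTS, 1))],
]

# Plotting: color each point by its label and keep the axes box square.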
fig1, ax = plt.subplots()
ax.scatter(train_data[:, 0], train_data[:, 1], c=train_data[:, 2])
ax.set_box_aspect(1)
plt.show()

Here's an example where SCALE = 0.5:

[Figure: two Gaussian blobs with SCALE = 0.5; they are nowhere close to overlapping]

... and here's an example where SCALE = 1.3:

[Figure: two Gaussian blobs with SCALE = 1.3; they appear to overlap slightly]

Each coordinate drawn by rng.normal falls within two standard deviations of its mean about 95% of the time, so most samples should land close to the means we located at (5, 5) and (10, 10).
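
The two-standard-deviation rule of thumb applies per coordinate. If you want a radius around the class center instead, the radial distance of an isotropic 2-D Gaussian follows a Rayleigh distribution, so a circle of radius 2 * SCALE should hold a fraction 1 - exp(-2) of the points. Here is a minimal sketch that checks this empirically; the seed and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

SCALE = 1.3
N_POINTS = 100_000
center = np.array([5.0, 5.0])

points = rng.normal(center, SCALE, (N_POINTS, 2))

# Radial distance of every point from its class center, in units of SCALE.
radii = np.linalg.norm(points - center, axis=1) / SCALE

frac_within_2 = float(np.mean(radii <= 2.0))
expected = 1.0 - np.exp(-2.0)  # Rayleigh CDF evaluated at r = 2

print(f"empirical fraction within 2 * SCALE: {frac_within_2:.3f}")
print(f"theoretical fraction:                {expected:.3f}")
```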

Knowing that the distance between your means is about 7.071 (the distance from (5, 5) to (10, 10)) and knowing the expected radius within which your data should fall, you can change the SCALE parameter and estimate how much overlap to expect between your classes.
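
If you need an actual number, one possible definition of "overlap" (my assumption, not something sklearn defines) is the overlap coefficient: the area shared by the two class densities. For two isotropic Gaussians with the same standard deviation whose means are a distance d apart, this works out to 2 * Phi(-d / (2 * SCALE)), where Phi is the standard normal CDF. The sketch below computes it and also inverts it to find the SCALE for a requested overlap. The function names are hypothetical, and only the standard library is used.

```python
from math import dist
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def overlap_fraction(mean_a, mean_b, scale):
    """Overlap coefficient of two isotropic Gaussians sharing one standard deviation."""
    d = dist(mean_a, mean_b)
    return 2.0 * STD_NORMAL.cdf(-d / (2.0 * scale))


def scale_for_overlap(mean_a, mean_b, overlap):
    """Invert overlap_fraction: the scale giving the requested overlap (0 < overlap < 1)."""
    d = dist(mean_a, mean_b)
    return -d / (2.0 * STD_NORMAL.inv_cdf(overlap / 2.0))


means = ((5, 5), (10, 10))
for scale in (0.5, 1.3, 3.0):
    print(f"SCALE={scale}: overlap ~ {overlap_fraction(*means, scale):.4%}")

print(f"SCALE for 10% overlap ~ {scale_for_overlap(*means, 0.10):.3f}")
```

With equal class sizes, half of this overlap is the error rate of the best possible classifier, so even the "slight" visual overlap at SCALE = 1.3 corresponds to an overlap of well under 1%.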

Once you've done that, you can translate your findings back into parameters of sklearn.datasets.make_blobs, such as centers and cluster_std.
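
As a rough sketch of that translation, the demo's two classes could be expressed with make_blobs like this. The fixed random_state is only there for repeatability, and stacking the labels into a third column is just to mirror the train_data layout from the demo.

```python
import numpy as np
from sklearn.datasets import make_blobs

N_POINTS = 10000  # samples per class
SCALE = 1.3       # standard deviation shared by both classes

# centers: one row per class; cluster_std: standard deviation of every blob.
X, y = make_blobs(
    n_samples=2 * N_POINTS,  # split evenly between the two centers
    centers=[(5, 5), (10, 10)],
    cluster_std=SCALE,
    random_state=0,  # fixed seed, only for repeatability
)

# Same layout as train_data in the demo: two feature columns, then the label.
train_data = np.c_[X, y]
print(train_data.shape)
```

Note that make_blobs shuffles the samples by default, so the rows are not grouped by class the way they are in the NumPy version.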

Tags: python, machine-learning, classification

Purvish Jariwala
