 

How should I predict Target Variable if it is not included in the test data for a binary classification task

I have a binary classification task with two datasets (train.csv and test.csv). The training data contains the independent variables (x1, x2, x3) and a target variable (y), while the test data contains only the independent variables. I want to make predictions (logistic regression) on both datasets. The only problem is that my test data doesn't have a target variable. I'm not sure how to approach this, since the data has already been split and the two files have different numbers of rows. How do I make predictions on the test set if the target variable is missing? Sample data is below. You can use any module to demonstrate this (e.g. sklearn); I just want to see the approach.

import pandas as pd

data1 = {'x1':['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2':[13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
   }

train = pd.DataFrame(data1)
print(train)

data2 = {'x1':['Female', 'Female', 'Male', 'Male', 'Male'],
    'x2':[16, 20, 33, 29, 18],
    'x3': [235.1, 395.0, 290.3, 118.6, 345.1]
   }

test = pd.DataFrame(data2)
print(test)

Omomaxi asked Sep 07 '25 04:09



1 Answer

The test dataset, as the name suggests, is used only to evaluate your model, which is why its labels are typically withheld. Your task is to produce predictions for the test data using a model learned from the training dataset. During training, you use the given labels (what you call the 'target variable') of the training dataset to fit the model.

You can learn more about this concept e.g. here.

For your case, where the goal is to fit a logistic regression model, you use the feature-label pairs ((x1, x2, x3), y) in the training dataset to learn the model parameters. Once the model is trained, it can produce predictions for new data: feed the test data points (x1, x2, x3) to the model to obtain the predicted class y for each one.

Using sklearn and the data samples you provided:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame(data1)
test = pd.DataFrame(data2)

# build feature arrays from the training data (x1 encoded as 1 for 'Male', 0 for 'Female')
X_train = np.array([(train['x1']=='Male').astype(int), train['x2'], train['x3']]).T
y_train = np.array(train['y'])  # labels

# train the model, i.e. fit the logistic function to the training data
model = LogisticRegression().fit(X_train, y_train)

# create predictions on the test set (same encoding as for the training data)
X_test = np.array([(test['x1']=='Male').astype(int), test['x2'], test['x3']]).T
y_pred = model.predict(X_test)  # predicted labels from the learned model
print(y_pred)
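
To make it concrete why no target column is needed at prediction time, here is a minimal, self-contained sketch that reproduces the model's predictions by hand from the learned parameters. It assumes a binary problem with classes 0 and 1; the `encode` helper is a hypothetical name introduced only for this illustration, and `max_iter=1000` is an arbitrary choice to give the solver more room on unscaled data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sample data copied from the question
train = pd.DataFrame({
    'x1': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2': [13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})
test = pd.DataFrame({
    'x1': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'x2': [16, 20, 33, 29, 18],
    'x3': [235.1, 395.0, 290.3, 118.6, 345.1],
})

def encode(df):
    """Hypothetical helper: x1 becomes 1 for 'Male' and 0 otherwise; returns a float array."""
    return np.column_stack([(df['x1'] == 'Male').astype(int), df['x2'], df['x3']]).astype(float)

X_train, y_train = encode(train), train['y'].to_numpy()
X_test = encode(test)  # features only: the test set has no y column

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Prediction uses only the features and the learned weights:
# linear score -> sigmoid -> class 1 when the score is positive (probability above 0.5)
scores = X_test @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-scores))
manual_labels = (scores > 0).astype(int)

print(manual_proba)           # estimated probability of class 1 per test row
print(manual_labels)          # labels computed by hand
print(model.predict(X_test))  # labels from sklearn, for comparison
```

The labels only enter through `fit`; everything after that line depends on `X_test` and the fitted coefficients alone.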

The predictions for the test dataset can usually be submitted somewhere to obtain an evaluation of how well your model classified the data.
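
If there is nowhere to submit predictions, one option is to estimate performance locally from the labelled training data alone. The following is a rough sketch using stratified 3-fold cross-validation; the fold count, `random_state` and `max_iter` values are assumptions picked for this tiny sample, and with only ten rows the resulting scores are far too noisy to mean much.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Training data copied from the question (the only data with labels)
train = pd.DataFrame({
    'x1': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'x2': [13, 20, 21, 19, 18, 78, 22, 33, 56, 10],
    'x3': [335.5, 455.3, 109.4, 228.0, 220.9, -1.223, 700.4, 446.9, 499.1, 776.4],
    'y': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Same encoding as in the answer: x1 becomes 1 for 'Male', 0 otherwise
X = np.column_stack([(train['x1'] == 'Male').astype(int), train['x2'], train['x3']]).astype(float)
y = train['y'].to_numpy()

# Stratified folds keep at least one positive example in every held-out fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Each fold: fit on two thirds of the training rows, score accuracy on the rest
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='accuracy')

print(scores)         # one accuracy value per fold
print(scores.mean())  # average held-out accuracy
```

After checking the model this way, you would refit on the full training set before predicting on test.csv.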

So in short: the training dataset is used to learn/fit a model, and the test dataset is used to evaluate its performance.

NMme answered Sep 09 '25 18:09
