As part of a project, I am trying to use the random forest classifier from Python's SKLearn library. I have been using this tutorial as a guide: https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/.
My code follows this tutorial line by line, but the only major difference is the structure of the data. In the tutorial, there are 4 features (4 columns in the data table), and each entry in a column is a number. In my code, I have 1 feature (1 column in the data table), and each entry in a column is a numpy array. When I call the fit() function, I get the following error: ValueError: setting an array element with a sequence.
Here is my code:
import pandas as pd
import numpy as np
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
trainingData = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
vectors_train = []
for i in range(0, len(trainingData)):
    vectors_train.append(trainingData[i][0])
testingData = [[[1, 0, 0], 0.77], [[30, 0, 5], 30], [[0, 0, 0], 0.77], [[0, 0, 0], 0.77]]
vectors_test = []
for i in range(0, len(testingData)):
    vectors_test.append(testingData[i][0])
dataframe_training = pd.DataFrame(trainingData)
dataframe_training['is_train'] = True
dataframe_testing = pd.DataFrame(testingData)
dataframe_testing['is_train'] = False
frames = [dataframe_training, dataframe_testing]
dataframe = pd.concat(frames)
dataframe.rename(index = str, columns = {0: 'Vector', 1: 'Label', 2: 'is_train'})
train, test = dataframe[dataframe['is_train']==True], dataframe[dataframe['is_train']==False]
features = dataframe.columns[:1]
labels, uniques = pd.factorize(train[1], sort = True)
clf = RandomForestClassifier()
clf.fit(train[features], labels) # Value error occurs here
I am confused by what the error actually means. What array element is being set to a sequence, and where is this sequence? I'm also aware that train[features] is a DataFrame object, and that the fit() function takes in two parameters, both of which must be array-like. Since labels is an array, and the error specifically points to the first parameter being the problem, is there a data type conversion I have to do?
When I replace the line clf.fit(train[features], labels) with clf.fit(vectors_train, labels), the error goes away. However, I want to know why it is not working when I use the same strategy as the tutorial and how to get it to work in a similar fashion.
Any help would be much appreciated. Thanks!
Remove the features variable and make the last line:
clf.fit(train[0].tolist(), labels)
No error raised with the code above.
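For reference, here is a minimal self-contained sketch of that fix, using only the training rows from the question. It also shows one possible way to stay closer to the tutorial by spreading each vector over its own numeric columns. The column names v0, v1, v2 and the n_estimators and random_state settings are assumptions made for this sketch, not part of the original code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Training rows copied from the question: [vector, label]
trainingData = [[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]]
train = pd.DataFrame(trainingData)   # column 0 holds the vectors, column 1 the labels

# Encode the labels as integer codes, as in the question
labels, uniques = pd.factorize(train[1], sort=True)

# Option 1 (the fix above): a plain list of lists, one inner list per sample
X_lists = train[0].tolist()

# Option 2 (closer to the tutorial): one numeric column per vector component.
# The column names v0, v1, v2 are invented for this sketch.
X_frame = pd.DataFrame(X_lists, columns=["v0", "v1", "v2"], index=train.index)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_lists, labels)    # fits without the ValueError
clf.fit(X_frame, labels)    # refits on the same numbers, now in DataFrame form
```

Both calls hand fit() a 4 x 3 table of numbers, which is the layout the tutorial's 4-feature DataFrame has as well.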
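Your code isn't working because dataframe.columns[:1] returns a sequence containing one column label, so train[features] is a one-column DataFrame in which every cell is itself a list. fit() tries to convert that into a 2-D numeric array, finds a sequence where it expects a single number, and raises the ValueError. dataframe.columns[0], on the other hand, returns just the label itself (the int 0), so train[features] becomes a single column rather than a DataFrame. Passing that column to clf.fit directly still won't work, since fit() needs a list or array of numeric rows, but train[features].tolist() will also work.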
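A short sketch of what happens under the hood, using only pandas and NumPy. It assumes that fit() ends up requesting a float array from its first argument, which is what np.asarray(..., dtype=float) imitates here; the variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Training rows copied from the question: [vector, label]
train = pd.DataFrame([[[0, 0, 3], 0.77], [[24, 0, 5], 30], [[0, 0, 4], 0.77], [[0, 0, 0], 0.77]])

features = train.columns[:1]     # a sequence holding one column label
as_frame = train[features]       # one-column DataFrame; each cell is a 3-item list
print(as_frame.shape)            # (4, 1)

# Requesting a numeric array fails: each "element" is a list, not a number
try:
    np.asarray(as_frame, dtype=float)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print("ValueError:", err)

# tolist() on the single column gives a list of lists,
# which NumPy reads as 4 rows with 3 numeric columns each
as_lists = train[0].tolist()
as_array = np.asarray(as_lists, dtype=float)
print(as_array.shape)            # (4, 3)
```

So the "array element" in the message is one cell of the feature table, and the "sequence" is the list stored in that cell.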