I'm trying to predict the profit each film made, using data from IMDb.
My dataframe and features are as follows:
Actor1 Actor2 Actor3 Actor4 Day Director Genre1 Genre2 Genre3 \
0 0 0 0 0 19.0 0 0 0 0
1 1 1 1 1 6.0 1 1 1 1
2 2 2 2 2 20.0 2 0 2 2
3 3 3 3 3 9.0 3 2 0 -1
4 4 4 4 4 9.0 4 3 3 3
Language Month Production Rated Runtime Writer Year BoxOffice
0 1 0 0 0 118.0 0 2007.0 37500000.0
1 2 1 1 0 151.0 1 2006.0 132300000.0
2 1 1 2 1 130.0 2 2006.0 53100000.0
3 1 2 1 0 117.0 3 2007.0 210500000.0
4 4 3 3 2 117.0 4 2006.0 244052771.0
and the value I'm trying to predict (target) is the BoxOffice.
I'm following the sklearn documentation exactly as written (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error):
from sklearn import preprocessing, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

X = dataset[:, 0:16]  # features
Y = dataset[:, 16]    # target (BoxOffice)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
mean_squared_error(Y_test, regr.predict(X_test))
and the output is always something along the lines of: 11385650623660550 ($11,385,650,623,660,500.00)
while the mean of the BoxOffice is: 107989121
I've tried multiple different approaches, including cross-validation and other models (Keras), and I feel like I've tried everything.
The returned error is extremely high, which makes me suspect the problem is not in the model or the data, but something else that I'm missing.
I think your problem is not related to the mean squared error; it is the model itself.
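As a quick sanity check (a sketch using the numbers from your post): MSE is in squared units of the target, so taking its square root puts it back on the BoxOffice scale, and ~1.14e16 squared dollars corresponds to an RMSE of roughly 1.07e8, i.e. the same order of magnitude as the mean BoxOffice.

import numpy as np
from sklearn.metrics import mean_squared_error

# MSE is in squared dollars; the square root (RMSE) is in dollars and easier to interpret
mse = mean_squared_error(Y_test, regr.predict(X_test))
rmse = np.sqrt(mse)   # sqrt(1.1385e16) is roughly 1.07e8
print(mse, rmse)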
For your categorical features, I recommend trying another encoding method, such as OneHotEncoder; LabelEncoder is not a good option for linear regression.
(For more information: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html and
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
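As a rough sketch (assuming your original dataframe is named df and that the label-encoded columns shown above are the categorical ones), pandas' get_dummies does the same job as OneHotEncoder here:

import pandas as pd

# categorical column names, taken from the dataframe shown in the question
categorical_cols = ['Actor1', 'Actor2', 'Actor3', 'Actor4', 'Director',
                    'Genre1', 'Genre2', 'Genre3', 'Language', 'Month',
                    'Production', 'Rated', 'Writer']

# one-hot encode the categorical columns and rebuild X / Y
df_encoded = pd.get_dummies(df, columns=categorical_cols)
X = df_encoded.drop('BoxOffice', axis=1).values
Y = df_encoded['BoxOffice'].values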
Before training your model, take a look at the correlation of your numeric features with the target variable; some of them may be irrelevant. For categorical features, you can try other methods to analyze their relationship with the target, such as boxplots (see the sketch below).
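A minimal sketch (again assuming the dataframe is named df, with the column names shown in the question):

import matplotlib.pyplot as plt

# correlation of the numeric features with the target
print(df[['Day', 'Runtime', 'Year', 'BoxOffice']].corr()['BoxOffice'])

# boxplot of the target grouped by one categorical feature (Rated, as an example)
df.boxplot(column='BoxOffice', by='Rated')
plt.show()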
Linear regression treats its inputs as continuous numeric variables, so you may want to try other algorithms as well. Just make sure you have enough background before applying them.
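As one example (a sketch with default settings, not a tuned model), a tree-based regressor such as RandomForestRegressor copes better with label-encoded categories and is a reasonable next thing to try:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# reuse the same train/test split from the question
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)
print(mean_squared_error(Y_test, rf.predict(X_test)))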