I'm working on a neural network system to perform SED fitting as part of a studentship project at the University of Western Australia.
I have created a set of around 20,000 runs through an SED fitting program known as MAGPHYS. Each run has 42 input values and the 32 output values that we're interested in (the program produces more outputs, but we don't need them).
I've been experimenting with the Keras neural network package in order to create a network to learn this function.
My current network design uses 4 hidden layers, fully interconnected, with 30 units in each layer. Each layer uses tanh activation functions. I also have a 42-dimensional input layer and a 32-dimensional output layer, both also using tanh activation, for a total of 6 layers.
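from keras.models import Sequential
from keras.layers.core import Dense
from keras.optimizers import SGD

model = Sequential()
loss = 'mse'
optimiser = SGD(lr=0.01, momentum=0.0, decay=0, nesterov=True)
# first 30-unit tanh layer, fed by the 42 inputs
model.add(Dense(output_dim=30, input_dim=42, init='glorot_uniform', activation='tanh'))
# four more 30-unit tanh layers
for i in range(0, 4):
    model.add(Dense(output_dim=30, input_dim=30, init='glorot_uniform', activation='tanh'))
# 32-dimensional output layer
model.add(Dense(output_dim=32, input_dim=30, init='glorot_uniform', activation='tanh'))
model.compile(loss=loss, optimizer=optimiser)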
I have been using min/max normalisation of my input and output data to squash all of the values between 0 and 1. I'm using a stochastic gradient descent optimiser and I've experimented with various loss functions such as mean squared error, mean absolute error, mean absolute percentage error etc.
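The min/max scaling described above can be sketched as follows. This is only an illustration assuming NumPy arrays, not the code actually used for these runs; the helper names and the random placeholder data are hypothetical. It also shows the inverse step needed to turn network outputs back into physical values.

```python
import numpy as np

def minmax_fit(values):
    """Return the per-column minimum and range of a 2-D array."""
    lo = values.min(axis=0)
    span = values.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return lo, span

def minmax_apply(values, lo, span):
    """Scale each column into the range [0, 1]."""
    return (values - lo) / span

def minmax_invert(scaled, lo, span):
    """Map scaled values back to their original units."""
    return scaled * span + lo

# Placeholder data standing in for the real MAGPHYS outputs (hypothetical values).
rng = np.random.RandomState(0)
outputs = rng.uniform(8.0, 10.0, size=(100, 32))

lo, span = minmax_fit(outputs)
scaled = minmax_apply(outputs, lo, span)
restored = minmax_invert(scaled, lo, span)
```

The minimum and range should be computed on the training set only and then reused unchanged for validation data and for un-scaling predictions.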
The main issue is that regardless of how I structure my network, it simply generates output values that are around the average of all of the training output values. It does not appear as though the network has actually learned the function correctly; it just generates values around the average. Worse still, some network designs I've experimented with, particularly those that use linear activation functions, will generate ONLY the average of the output values and will not vary at all.
Example (for one of the 32 outputs):
Output   Correct
9.42609868658  =   9.647
9.26345946681  =   9.487
9.43403506231  =   9.522
9.35685760748  =   9.792
9.20564885211  =   9.287
9.39240577382  =   8.002
Notice how all of the outputs are just around the 9.2 - 9.4 value, even though these values are quite incorrect.
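One way to quantify this "collapse to the average" is to compare the spread of the predictions with the spread of the true values, and to compare the network's error with a baseline that always predicts the mean. The sketch below does this for the six rows listed above, assuming NumPy; it is a diagnostic illustration, not part of the original training code.

```python
import numpy as np

# The six predicted/correct pairs from the example above.
predicted = np.array([9.42609868658, 9.26345946681, 9.43403506231,
                      9.35685760748, 9.20564885211, 9.39240577382])
correct = np.array([9.647, 9.487, 9.522, 9.792, 9.287, 8.002])

# Spread of the predictions versus spread of the true values.
pred_std = predicted.std()
true_std = correct.std()

# Error of the network versus a baseline that always predicts the mean target.
model_mse = np.mean((predicted - correct) ** 2)
baseline_mse = np.mean((correct.mean() - correct) ** 2)

print("std of predictions:", pred_std)
print("std of true values:", true_std)
print("network MSE:", model_mse)
print("mean-baseline MSE:", baseline_mse)
```

On these six rows the predictions vary far less than the true values and the network does no better than the constant baseline, which is the behaviour described above. Six rows is of course a tiny sample; the same comparison over the full validation set would be more informative.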
With all of this in mind, what causes a network such as mine to generate these sorts of outputs that are all around the average?
What sort of things can I try to remedy this problem and create a network, of some sort, to actually generate correct outputs?
I just want to throw out some thoughts on this specific problem, in addition to CAFEBABE's comment:
42 input features is not a lot to work with. That's not something you can necessarily fix, but it means you'll want wider hidden layers (i.e. more nodes) to give the network more capacity to separate the different input patterns. Furthermore, 20K observations isn't exactly a large dataset. If you can get more data, you should; this is pretty much always the case.
If you have a concrete reason for min/max normalization, then disregard this point, but you could consider applying BatchNormalization to your input, which tends to help the network's ability to predict accurately. This essentially keeps the inputs to each activation closer to the middle of the function, rather than the saturated ends.
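To illustrate the idea of keeping inputs near the middle of the activation, here is a minimal sketch of standardising the input columns to zero mean and unit variance with NumPy. Note the assumption: this is a one-off preprocessing step on the data, not the BatchNormalization layer itself (which normalises per mini-batch and learns its own scale and shift). The function name and the random placeholder data are hypothetical.

```python
import numpy as np

def standardise(train, other):
    """Standardise both arrays using statistics from the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (train - mean) / std, (other - mean) / std

# Placeholder inputs with a wide value range (hypothetical, not MAGPHYS data).
rng = np.random.RandomState(0)
X_train = rng.uniform(0.0, 1000.0, size=(200, 42))
X_test = rng.uniform(0.0, 1000.0, size=(50, 42))

X_train_std, X_test_std = standardise(X_train, X_test)
```

With inputs centred on zero, a tanh unit starts out in its roughly linear region instead of near saturation, where gradients are small.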
You should experiment more with your optimization, for example by trying rmsprop or adam, or different learning rates. Also try some different activation functions. Recent research includes ReLU, ELU, PReLU and SReLU, all of which are available in keras.
Also try including some regularization to avoid overfitting. Look into Dropout, or L1/L2 weight penalties.
While having a deeper model (ie more layers) does often help, reducing the data dimensions from 42 features, down to 30, is likely hurting your ability to separate the data. Try something bigger, like 100, or 500, or 1000.
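To get a feel for how much capacity the width adds, the following sketch counts the trainable weights and biases of a stack of fully connected layers. The helper is hypothetical and ignores any extra parameters from layers such as BatchNormalization; the two layouts are the 30-unit network built by the code in the question and, as an assumed alternative, a network with two 500-unit hidden layers.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Layout built by the code in the question: 42 inputs, five 30-unit layers, 32 outputs.
narrow = [42, 30, 30, 30, 30, 30, 32]
# An assumed wider alternative: two hidden layers of 500 units each.
wide = [42, 500, 500, 32]

narrow_params = dense_param_count(narrow)
wide_params = dense_param_count(wide)
print(narrow_params, wide_params)
```

The wider layout has far more parameters than there are training runs, which is one reason regularization and a held-out validation set matter when going this wide.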
An example model you could try would be something like:
# imports
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import ELU     
# data shapes
n_obs, n_feat = 20000, 42
n_hidden = 500 # play with this, bigger tends to give better separability
n_class = 32
# instantiate model
model = Sequential()
# first layer --- input
model.add(Dense(input_dim = n_feat, output_dim = n_hidden))
model.add(BatchNormalization())
model.add(ELU())
model.add(Dropout(p=0.2)) # means that 20% of the nodes are turned off, randomly
# second layer --- hidden
model.add(Dense(input_dim = n_hidden, output_dim = n_hidden))
model.add(BatchNormalization())
model.add(ELU())
model.add(Dropout(p=0.2))
# third layer --- output
# linear output: the 32 targets are continuous values, not class probabilities
model.add(Dense(input_dim = n_hidden, output_dim = n_class))
model.add(Activation('linear'))
# configure optimization (mean squared error, since this is a regression problem)
model.compile(optimizer = 'rmsprop', loss = 'mse')
# split your data, so you test it on validation data
# data: (n_obs, n_feat) array of inputs; targets: (n_obs, n_class) array of outputs
X_train, X_test, Y_train, Y_test = train_test_split(data, targets)
# train your model
model.fit(X_train, Y_train, validation_data = (X_test, Y_test))
Best of luck!