I am working on early and late fusion of CNN features. I have extracted features from multiple layers of a CNN. For early fusion, I take the features of three different layers and concatenate them horizontally: F = [F1' F2' F3']. For late fusion, I was reading this paper. They mention doing supervised learning twice, but I could not understand how it works.
For example, this is the image taken from the above-mentioned paper. The first stage has three different feature sets, and for the first round of supervised learning the label would be, say, 1 in a 4-class image set. Suppose the outputs are [1 1 3], i.e. the third classifier is wrong. My question is: is the multimodal feature concatenation then simply [1 1 3] paired with label 1 for a class-1 image?
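To make the early fusion step concrete, this is roughly what I am doing; the array names and dimensions are just placeholders:

```python
import numpy as np

# Features taken from three different CNN layers for one image
# (dimensions are placeholders)
F1 = np.random.rand(256)
F2 = np.random.rand(512)
F3 = np.random.rand(1024)

# Early fusion: horizontal concatenation F = [F1' F2' F3']
F = np.concatenate([F1, F2, F3])  # shape (1792,)
```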

For example
Model-1 : [[0.3], [0.7]]
Model-2 : [[0.2], [0.8]]
Model-3 : [[0.6], [0.4]]
Now you will concatenate (Multi-modal Features Combination) the results as follows:
[0.3, 0.2, 0.6, 0.7, 0.8, 0.4]
The above feature vector goes as input to your final supervised learner; as shown in the diagram, the concept scores serve as the input to that supervised learner.
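A minimal NumPy sketch of that combination step, using the example scores above (array names are placeholders):

```python
import numpy as np

# Per-class concept scores from the three base models, as column vectors
model_1 = np.array([[0.3], [0.7]])
model_2 = np.array([[0.2], [0.8]])
model_3 = np.array([[0.6], [0.4]])

# Multi-modal feature combination: stack the score columns and flatten
fused = np.hstack([model_1, model_2, model_3]).ravel()
print(fused)  # [0.3 0.2 0.6 0.7 0.8 0.4]
```

This fused vector is what the final supervised learner sees for each sample.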
In the paper, they describe this as follows:
We concatenate the visual vector vi with the text vector ti.
After feature normalization, we obtain early fusion vector ei.
Then ei serves as the input for an SVM.
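A rough sketch of that early fusion step, assuming scikit-learn and random placeholder features (the dimensions and the linear SVM are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Hypothetical visual vectors v_i and text vectors t_i, one row per sample
v = np.random.rand(100, 512)
t = np.random.rand(100, 300)
y = np.random.randint(0, 4, 100)  # labels for a 4-class problem

# Early fusion: concatenate, then L2-normalize to obtain e_i
e = normalize(np.hstack([v, t]), norm="l2")

# e_i serves as the input for an SVM
svm = SVC(kernel="linear")
svm.fit(e, y)
```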
Now, let's talk about the implementation of this model.
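Below is a sketch of the whole late-fusion pipeline under some assumptions: the three feature sets are random placeholders, the base learners and the fusion learner are scikit-learn SVMs, and half of the data is held out so the fusion classifier is not trained on scores the base classifiers produced for their own training samples. It is meant as a starting point, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical features extracted from three different CNN layers
F1 = np.random.rand(200, 256)
F2 = np.random.rand(200, 512)
F3 = np.random.rand(200, 1024)
y = np.random.randint(0, 4, 200)  # 4-class labels

# Split the samples: one part trains the base classifiers,
# the other part provides concept scores for the fusion classifier
idx = np.arange(200)
train_idx, fuse_idx = train_test_split(idx, test_size=0.5, random_state=0)

# First round of supervised learning: one classifier per feature set
concept_scores = []
for F in (F1, F2, F3):
    clf = SVC(kernel="linear", probability=True)
    clf.fit(F[train_idx], y[train_idx])
    concept_scores.append(clf.predict_proba(F[fuse_idx]))

# Multi-modal feature combination: concatenate the per-model concept scores
fused = np.hstack(concept_scores)

# Second round of supervised learning on the fused concept scores
final_clf = SVC(kernel="linear")
final_clf.fit(fused, y[fuse_idx])
```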