I am working on early and late fusion of CNN features. I have extracted features from multiple layers of a CNN. For early fusion, I take the features of three different layers and concatenate them horizontally: F = [F1' F2' F3']. For late fusion, I was reading this paper. They mention doing supervised learning twice, but I could not understand how it works.
For example, this is the image taken from the above-mentioned paper. The first stage has three different feature sets, and for the first round of supervised learning the label would be, say, 1 in a 4-class image set. Suppose the outputs are [1 1 3], i.e. the third classifier is wrong. My question is: is the multimodal feature concatenation then simply [1 1 3] paired with label 1 for a class-1 image?
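To make the early fusion step concrete, this is roughly what I am doing; the array names and dimensions are just placeholders:

```python
import numpy as np

# Features taken from three different CNN layers for one image
# (dimensions are placeholders)
F1 = np.random.rand(256)
F2 = np.random.rand(512)
F3 = np.random.rand(1024)

# Early fusion: horizontal concatenation F = [F1' F2' F3']
F = np.concatenate([F1, F2, F3])  # shape (1792,)
```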

For example
Model-1 : [[0.3], [0.7]]
Model-2 : [[0.2], [0.8]]
Model-3 : [[0.6], [0.4]]
Now you will concatenate (Multi-modal Features Combination) the results as follows:
[0.3, 0.2, 0.6, 0.7, 0.8, 0.4]
The above feature vector goes as input to your final supervised learner; as shown in the diagram, the concept scores serve as the input to that supervised learner.
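A minimal NumPy sketch of that combination step, using the example scores above (array names are placeholders):

```python
import numpy as np

# Per-class concept scores from the three base models, as column vectors
model_1 = np.array([[0.3], [0.7]])
model_2 = np.array([[0.2], [0.8]])
model_3 = np.array([[0.6], [0.4]])

# Multi-modal feature combination: stack the score columns and flatten
fused = np.hstack([model_1, model_2, model_3]).ravel()
print(fused)  # [0.3 0.2 0.6 0.7 0.8 0.4]
```

This fused vector is what the final supervised learner sees for each sample.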
In the paper, they describe this as follows:
We concatenate the visual vector vi with the text vector ti.
After feature normalization, we obtain early fusion vector ei.
Then ei serves as the input for an SVM.
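A rough sketch of that early fusion step, assuming scikit-learn and random placeholder features (the dimensions and the linear SVM are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Hypothetical visual vectors v_i and text vectors t_i, one row per sample
v = np.random.rand(100, 512)
t = np.random.rand(100, 300)
y = np.random.randint(0, 4, 100)  # labels for a 4-class problem

# Early fusion: concatenate, then L2-normalize to obtain e_i
e = normalize(np.hstack([v, t]), norm="l2")

# e_i serves as the input for an SVM
svm = SVC(kernel="linear")
svm.fit(e, y)
```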
Now, let's talk about the implementation of this model.
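Below is a sketch of the whole late-fusion pipeline under some assumptions: the three feature sets are random placeholders, the base learners and the fusion learner are scikit-learn SVMs, and half of the data is held out so the fusion classifier is not trained on scores the base classifiers produced for their own training samples. It is meant as a starting point, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical features extracted from three different CNN layers
F1 = np.random.rand(200, 256)
F2 = np.random.rand(200, 512)
F3 = np.random.rand(200, 1024)
y = np.random.randint(0, 4, 200)  # 4-class labels

# Split the samples: one part trains the base classifiers,
# the other part provides concept scores for the fusion classifier
idx = np.arange(200)
train_idx, fuse_idx = train_test_split(idx, test_size=0.5, random_state=0)

# First round of supervised learning: one classifier per feature set
concept_scores = []
for F in (F1, F2, F3):
    clf = SVC(kernel="linear", probability=True)
    clf.fit(F[train_idx], y[train_idx])
    concept_scores.append(clf.predict_proba(F[fuse_idx]))

# Multi-modal feature combination: concatenate the per-model concept scores
fused = np.hstack(concept_scores)

# Second round of supervised learning on the fused concept scores
final_clf = SVC(kernel="linear")
final_clf.fit(fused, y[fuse_idx])
```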