 

Late fusion for the CNN features

I am working on early and late fusion of CNN features. I have extracted features from multiple layers of a CNN. For early fusion, I captured the features of three different layers and horizontally concatenated them: F = [F1' F2' F3']. For late fusion, I was reading this paper. It says to do supervised learning twice, but I couldn't understand how.
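The early-fusion step described above can be sketched in NumPy (the feature dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical feature vectors from three different CNN layers
F1 = rng.rand(256)    # e.g. features from an early layer
F2 = rng.rand(512)    # features from a middle layer
F3 = rng.rand(1024)   # features from a late layer

# Early fusion: horizontal concatenation F = [F1' F2' F3']
F = np.concatenate([F1, F2, F3])
print(F.shape)  # (1792,)
```

The fused vector F would then be fed to a single classifier, e.g. an SVM.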

For example, this is the image taken from the paper mentioned above. The first stage has three different features, and for the first round of supervised learning the label will be, say, 1 in a 4-class image set. Suppose the three classifiers output [1 1 3], i.e. the third classifier is wrong. My question is: is the multimodal feature concatenation then [1 1 3] with label 1 (say, for a class-1 image)?

[figure from the paper: three unimodal supervised learners feeding a multimodal feature combination stage]

Addee asked Dec 14 '25
1 Answer

  • I might be wrong on this, but here is my understanding (I am not sure about my answer).
  • Let's say you have 2 classes and 3 different models.
  • Every model will then output a vector of shape (2 x 1).
  • For example

    Model-1 : [[0.3], [0.7]]
    Model-2 : [[0.2], [0.8]]
    Model-3 : [[0.6], [0.4]]

  • Now you concatenate the results (the "Multi-modal Features Combination" step) as follows:
    [0.3, 0.7, 0.2, 0.8, 0.6, 0.4]

  • The above feature vector goes as input to your final supervised learner; as the diagram shows, the concept scores are the input to the second supervised learner.

  • The paper describes this as follows:
    We concatenate the visual vector vi with the text vector ti.
    After feature normalization, we obtain early fusion vector ei.
    Then ei serves as the input for a SVM.

  • Now, let's talk about the implementation of this model.

  • First, train Model-1, Model-2, and Model-3 individually.
  • Then freeze the weights of Model-1, Model-2, and Model-3, extract their scores, combine them into the feature vector discussed above, and pass it to the final supervised learner for training.
  • Think of the three unimodal supervised learners as feature extractors: concatenate their outputs just as you did for early fusion and pass them to an SVM.
  • I would use the class scores as feature vectors rather than the hard predictions you assumed.
  • Why class scores and not the actual predictions? Because the class scores represent the unimodal models' confidence in their predictions for each class.
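The two-stage pipeline above can be sketched with scikit-learn. This is only an illustration under assumed data shapes: the three logistic regressions stand in for the unimodal learners, their `predict_proba` outputs are the "concept scores", and an SVM is the final supervised learner. (In practice you would extract the scores on held-out data to avoid overfitting the second stage.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n_samples, n_classes = 200, 2

# Hypothetical features from three different layers/modalities
X1 = rng.rand(n_samples, 64)
X2 = rng.rand(n_samples, 32)
X3 = rng.rand(n_samples, 16)
y = rng.randint(0, n_classes, n_samples)

# Stage 1: train each unimodal learner individually
models = [LogisticRegression(max_iter=1000).fit(X, y)
          for X in (X1, X2, X3)]

# Extract class *scores* (probabilities), not hard predictions
scores = [m.predict_proba(X) for m, X in zip(models, (X1, X2, X3))]

# Multi-modal features combination: concatenate scores per sample
fused = np.hstack(scores)   # shape (n_samples, 3 * n_classes)

# Stage 2: final supervised learner on the fused score vectors
final = SVC().fit(fused, y)
print(fused.shape)  # (200, 6)
```

Swapping in your actual CNN feature extractors for X1-X3 gives the late-fusion setup from the question.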
Jai answered Dec 16 '25


