 

Not able to remove duplicate images with hashing

My aim is to remove identical images like the following:

Image 1: https://i.sstatic.net/8dLPo.png

Image 2: https://i.sstatic.net/hF11m.png

Currently, I am using average hashing with

  • a hash size of 32 (smaller hash sizes produce collisions)
  • a threshold of 10-20
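
For reference, the setup described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the exact code in use: it assumes each image has already been converted to grayscale and resized to a small square grid (a library such as Pillow or ImageHash would normally do that step), and the names `average_hash`, `hamming_distance` and `is_duplicate` are hypothetical.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is above the mean, else 0.

    `pixels` is a square 2D list of grayscale values (already resized).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_duplicate(pixels_a, pixels_b, threshold):
    """Flag two images as duplicates when their hashes differ in at most `threshold` bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= threshold
```

With a hash size of 32 the hash has 32 x 32 = 1024 bits, so a threshold of 10-20 tolerates roughly 1-2% differing bits.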

I tried pHash as well, but it also removes images that are merely similar, like the following, which I don't want:

Image 3: https://i.sstatic.net/CwZ09.png

Image 4: https://i.sstatic.net/HvAaJ.png

So I am looking for a technique that can identify that

  • Image 1 and Image 2 are identical
  • Image 3 and Image 4 are distinct

Kindly help; I have been stuck on this problem for a long time.

Note: The type of images is different every time, so I can't even invest time to learn deep learning and give it a try.

Sahil Lohiya asked Jan 30 '26 00:01



1 Answer

"So I can't even invest time to learn deep learning and give it a try."

Something worth trying is a similar-image search utility that applies Locality Sensitive Hashing (LSH) with random projection to the image representations computed by a pretrained image classifier. This kind of search engine is also known as a near-duplicate (or near-dup) image detector. Because the classifier is pretrained, you use it as is and do not have to train anything yourself.

You need to load a pretrained model and extract its embedding layer block.

You can download one like this (the leading ! runs a shell command from a notebook cell):

!wget -q https://git.io/JuMq0 -O flower_model_bit_0.96875.zip
!unzip -qq flower_model_bit_0.96875.zip

A different model may suit your images better (e.g. one pretrained on ImageNet or a similar dataset).

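You then build an embedding model around it:

import tensorflow as tf

# Assumption: 224x224 inputs; set this to the size your model expects.
IMAGE_SIZE = 224

bit_model = tf.keras.models.load_model("flower_model_bit_0.96875")

# Rescale pixels to [0, 1], run the second layer of the loaded model
# (assumed to be its feature extractor), then normalize the embeddings.
embedding_model = tf.keras.Sequential(
    [
        tf.keras.layers.Input((IMAGE_SIZE, IMAGE_SIZE, 3)),
        tf.keras.layers.Rescaling(scale=1.0 / 255),
        bit_model.layers[1],
        tf.keras.layers.Normalization(mean=0, variance=1),
    ],
    name="embedding_model",
)

and use its output as the input to a hash function, like you would use pixel data when calculating hashes: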

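import numpy as np


def hash_func(embedding, random_vectors):
    """Hash a batch of embeddings (n, dim) into one integer per embedding.

    random_vectors has shape (dim, num_bits).
    """
    embedding = np.array(embedding)

    # Random projection: one bit per random vector, set when the
    # projection of the embedding onto that vector is positive.
    bools = np.dot(embedding, random_vectors) > 0
    return [bool2int(bool_vec) for bool_vec in bools]


def bool2int(x):
    """Pack a sequence of booleans into an integer (element i becomes bit i)."""
    y = 0
    for i, j in enumerate(x):
        if j:
            y += 1 << i
    return y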
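
The snippet above leaves `random_vectors` and the lookup step undefined. As a rough sketch of how they could fit together, the following dependency-free version draws Gaussian random vectors, hashes each embedding the same way (bit i is set when the projection onto vector i is positive), and groups image IDs by hash. The names `make_random_vectors`, `lsh_hash` and `build_buckets` are hypothetical, plain lists stand in for NumPy arrays, and the embeddings are assumed to come from an embedding model like the one above.

```python
import random
from collections import defaultdict


def make_random_vectors(dim, num_bits, seed=0):
    """Draw `num_bits` Gaussian random vectors of length `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_hash(embedding, random_vectors):
    """Return an integer whose bit i is set when the projection
    of `embedding` onto random vector i is positive."""
    value = 0
    for i, vector in enumerate(random_vectors):
        projection = sum(e * v for e, v in zip(embedding, vector))
        if projection > 0:
            value |= 1 << i
    return value


def build_buckets(embeddings, random_vectors):
    """Group image IDs by hash; `embeddings` maps image ID -> vector."""
    buckets = defaultdict(list)
    for image_id, embedding in embeddings.items():
        buckets[lsh_hash(embedding, random_vectors)].append(image_id)
    return buckets
```

Images that land in the same bucket are near-duplicate candidates. Fewer bits make the buckets coarser (more images are grouped together), more bits make them stricter. A single table can miss some near-duplicates, so several tables with different random vectors are commonly combined.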

The full tutorial is described here.

mrk answered Feb 01 '26 13:02
