I have a project to detect and count seal cubs (the animal) in aerial images taken over a beach. The seal cubs are black and small compared to the adult seals, which are brown and large.
Some seal cubs overlap or are partly occluded. The beach is close to yellow in color, but there are some black rocks that make detection harder.
What kind of descriptor is most suitable for my project: HOG, SIFT, or Haar-like features?
I'm asking about the theory behind this problem. I think the first step should be to choose the descriptor that best represents the object (perhaps combining several weak features, though that may not be necessary), and then train a classifier with a machine learning method such as boosting, an SVM, or a neural network. Am I right?
Sample image:

I'm not sure I agree that selecting the right descriptor is the right place to start. A fundamental issue is that all the objects are similar in shape. There are also substantial gradients within each animal, and the variety of poses is another issue. I would break the problem into two simpler steps:
1. Unique object detection (edge detection, watershed, graph cut, etc.), something like the "count blood cells" problem.
2. Object classification based on color and area (normalized for camera perspective). Compute the fraction of "yellow" pixels and "black" pixels in each object and use those values, along with the object size, as inputs to an object classifier (neural networks are a fun solution here!).
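To make the two steps concrete, here is a minimal pure-Python sketch on a toy image. Everything in it is an assumption for illustration: the colors, the thresholds in `is_yellow` and `is_black`, the helper names, and the hand-written rule `looks_like_cub`, which only stands in for a trained classifier. Step 1 is reduced to plain connected-component labelling of "not sand" pixels, so it will not split animals that touch.

```python
from collections import deque

# Assumed toy colors (RGB); real images need thresholds tuned per scene.
SAND, BLACK, BROWN = (220, 190, 100), (20, 20, 20), (120, 80, 50)

def is_yellow(rgb):
    r, g, b = rgb
    return r > 150 and g > 120 and b < 110  # assumed "sand" threshold

def is_black(rgb):
    return max(rgb) < 60  # assumed "black" threshold

def label_objects(mask):
    """Step 1: 4-connected component labelling of a boolean mask."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:  # breadth-first flood fill of one object
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

def object_features(image, labels, count):
    """Step 2 inputs: (area, black fraction, yellow fraction) per object."""
    stats = {i: [0, 0, 0] for i in range(1, count + 1)}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab:
                stats[lab][0] += 1
                stats[lab][1] += int(is_black(image[y][x]))
                stats[lab][2] += int(is_yellow(image[y][x]))
    return [(a, k / a, s / a) for a, k, s in stats.values()]

def looks_like_cub(area, black_frac, yellow_frac, max_area=6):
    # Hypothetical rule standing in for a trained classifier.
    return black_frac > 0.5 and area <= max_area

rows = [
    "........",
    ".kk.....",
    ".kk..bbb",
    ".....bbb",
    ".....bbb",
]
palette = {".": SAND, "k": BLACK, "b": BROWN}
image = [[palette[c] for c in row] for row in rows]
mask = [[not is_yellow(px) for px in row] for row in image]

labels, count = label_objects(mask)
features = object_features(image, labels, count)
cubs = sum(looks_like_cub(*f) for f in features)
print(count, "objects,", cubs, "cub(s)")
```

Because the mask here is simply "not sand", the yellow fraction is zero by construction. It becomes informative once the regions come from edge detection or watershed and can include sand pixels. Black rocks would still pass the color test, which is one reason area belongs among the features.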
It is a fairly cluttered scene, so I would expect both of these algorithms to require some fine-tuning. If your requirements allow some level of analyst interaction, provide some sliders so the analyst can adjust each of the thresholds in your algorithms.
Accuracy in computer vision algorithms seems to rely heavily on being able to fine-tune them to a specific problem. If you can make assumptions about the pictures you are handing your algorithm, such as the fact that all of them are aerial images of seals on a similar beach scene, then you can take advantage of that. Before trying to get too fancy with local features, I'd say you might want to try something like watershed segmentation and count the number of non-background segments. Watershed provides a convenient mechanism called "markers" for incorporating prior knowledge about your input to differentiate between "background" and "foreground" segments.
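As a rough illustration of how markers steer the result, here is a small pure-Python sketch of marker-based watershed by priority flooding. The function name `watershed_from_markers`, the toy elevation map and the seed positions are all made up for this example. In practice you would more likely call a library routine such as OpenCV's `cv2.watershed` or scikit-image's `skimage.segmentation.watershed`, both of which accept a marker image.

```python
import heapq

def watershed_from_markers(elevation, markers):
    """Flood a 2D elevation map outward from labelled seed pixels.

    elevation: 2D list of numbers, e.g. edge strength (high on boundaries).
    markers:   2D list of ints, 0 = unknown, >0 = seed label.
    Each pixel takes the label of the seed that reaches it over the
    lowest pass; ties are broken by insertion order.
    """
    h, w = len(elevation), len(elevation[0])
    labels = [[0] * w for _ in range(h)]
    heap, order = [], 0
    for y in range(h):
        for x in range(w):
            if markers[y][x]:
                heapq.heappush(heap, (elevation[y][x], order, y, x, markers[y][x]))
                order += 1
    while heap:
        level, _, y, x, label = heapq.heappop(heap)
        if labels[y][x]:
            continue  # already claimed by a seed that arrived earlier
        labels[y][x] = label
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not labels[ny][nx]:
                # Cost of reaching a neighbor is the highest point on the path.
                heapq.heappush(heap, (max(level, elevation[ny][nx]), order, ny, nx, label))
                order += 1
    return labels

# Toy edge-strength map: two touching animals (values 1) separated by a
# weak ridge (5), enclosed by a strong outline (9), on flat sand (0).
elevation = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 9, 9, 9, 9, 0],
    [0, 9, 1, 1, 5, 1, 1, 9, 0],
    [0, 9, 9, 9, 9, 9, 9, 9, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
# Prior knowledge as markers: label 1 = background, labels 2 and 3 = one seed per animal.
markers = [[0] * 9 for _ in range(5)]
markers[0][0] = 1
markers[2][2] = 2
markers[2][6] = 3

segments = watershed_from_markers(elevation, markers)
foreground = {lab for row in segments for lab in row} - {1}
print("non-background segments:", len(foreground))
for row in segments:
    print(row)
```

The two animals come out as separate segments even though they touch, because each one had its own seed. Choosing those seeds well (for example from dark-pixel thresholds or distance-transform peaks) is where most of the tuning effort would go.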
An approach like this might be easier and possibly more accurate than local features. In my experience, I haven't been able to extract and match lots of meaningful features from organic subject matter (like faces or animals) using SIFT and SURF features. For me they have tended to work better on pictures of rooms or buildings with lots of angles.