Generalizing an Animal Recognition CNN Model with Background Removal


Camera traps are a widely used tool in wildlife research and conservation, but factors such as switching between cameras, poor performance in extreme environments, and damage by wildlife can limit the technology's effectiveness. Our project addresses the camera-switching problem by adding an optimal-transport layer to a modified AlexNet CNN trained on the iWildCam dataset from the WILDS benchmark collected by Stanford. To make the model more efficient and precise, we also remove the background from each image so that only the animal remains, and we apply z-score normalization to bring pixel intensities to a common scale. As a baseline, we build the same model without the optimal-transport layer or these preprocessing steps and compare the two.


One of the important open problems in machine learning today is training on one distribution and testing on a different one, commonly called dataset shift or domain adaptation. Dataset shift can stem from the way input features are used, the way training and test sets are selected, data sparsity, changes in the data distribution due to non-stationary environments, and shifts in activation patterns within layers of deep neural networks. Several companies are working on this problem: Amazon published "CrossNorm and SelfNorm for Generalization under Distribution Shifts" in 2021, and Apple published "Bridging the Domain Gap for Neural Models." The primary focus of this project is the domain adaptation problem in wild-animal recognition on images from camera traps. Camera traps are placed in the wild so that scientists can collect information about wild animals; a camera trap is triggered when it detects a moving object and can automatically collect large quantities of pictures. These data are crucial for monitoring the count and density of animal species in an area and for understanding animal behavior.


Our approach consists of the three methods below. The first two prepare the dataset before model building; the third describes the model itself.

Background Removal

Background removal subtracts a background image of a location from the original image, producing an image that highlights the animal in the photo. In a domain adaptation setting, where models are trained on photos from one set of camera traps and tested on another, background removal reduces variation between camera-trap deployments. Since different camera traps can have vastly different backgrounds, removing the background ensures that the resulting images contain only animal features, which are consistent across camera traps.
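A minimal sketch of this idea in NumPy, assuming the background for a location is estimated as the per-pixel median over a stack of images from the same camera (the function names, the median estimator, and the difference threshold are our illustrative choices, not details specified above):

```python
import numpy as np

def estimate_background(images):
    """Estimate a static background as the per-pixel median over a stack of
    H x W x C uint8 images taken by the same camera-trap deployment."""
    stack = np.stack(images, axis=0).astype(np.float32)
    return np.median(stack, axis=0)

def remove_background(image, background, threshold=30.0):
    """Keep only pixels that differ from the background by more than
    `threshold` in at least one channel; zero out everything else."""
    diff = np.abs(image.astype(np.float32) - background)
    mask = diff.max(axis=-1, keepdims=True) > threshold
    return (image * mask).astype(np.uint8)
```

In practice the threshold would need tuning per deployment, since lighting and vegetation movement make real camera-trap backgrounds far less static than this sketch assumes.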


Z-Score Normalization

After background subtraction, apply z-score normalization to the intensities of each color channel of each pixel.
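The normalization step can be sketched as follows, assuming per-channel statistics are computed over each image individually (the epsilon guard is our addition for channels with zero variance):

```python
import numpy as np

def zscore_normalize(image):
    """Standardize each color channel to zero mean and unit variance so that
    all images share a common pixel-intensity scale."""
    img = image.astype(np.float32)
    mean = img.mean(axis=(0, 1), keepdims=True)  # per-channel mean
    std = img.std(axis=(0, 1), keepdims=True)    # per-channel std
    return (img - mean) / (std + 1e-8)           # epsilon guards flat channels
```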

Modified AlexNet Model

We modified the architecture of AlexNet: 5 learnable convolutional layers and 3 fully connected layers (Krizhevsky et al. 2012), with ReLU as the nonlinear activation function. The input is an image of size 227x227, and the output is a softmaxed probability distribution across 100 animal-category classes.
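A PyTorch sketch of such a network, with layer sizes following Krizhevsky et al. (2012) and the 100-class head as the modification described above (the exact layer widths of our modified model are not given in the text, so the standard AlexNet dimensions are assumed):

```python
import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    """AlexNet-style CNN: 5 convolutional layers + 3 fully connected layers,
    ReLU activations, 227x227 input, softmax over num_classes categories."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)          # (N, 256, 6, 6) for 227x227 input
        x = torch.flatten(x, 1)
        return torch.softmax(self.classifier(x), dim=1)
```

For training with `nn.CrossEntropyLoss` one would return the raw logits instead and apply softmax only at inference; the explicit softmax here mirrors the probability-distribution output described above.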


The model's performance on test data from unseen locations improves with the application of background removal.


Accuracy on both the original test set and the balanced test set increases with the use of this technique.


However, it is worth noting that the "Background Mask" method requires considerably longer preprocessing time and somewhat higher prediction time than the other approaches.

Contact Us

If you have any questions or any good ideas about our project, please contact us!