Mind the Backbone: Minimizing Backbone Distortion for Robust Object Detection

Kuniaki Saito1     Donghyun Kim2     Piotr Teterwak1     Rogerio Feris2     Kate Saenko1, 3   

1. Boston University     2. MIT-IBM Watson AI Lab     3. Meta    

Paper | Code





Abstract

Building object detectors that are robust to domain shifts is critical for real-world applications. Prior approaches fine-tune a pre-trained backbone and risk overfitting it to in-distribution (ID) data and distorting features useful for out-of-distribution (OOD) generalization. We propose to use Relative Gradient Norm (RGN) as a way to measure the vulnerability of a backbone to feature distortion, and show that high RGN is indeed correlated with lower OOD performance. Our analysis of RGN yields interesting findings: some backbones lose OOD robustness during fine-tuning, but others gain robustness because their architecture prevents the parameters from changing too much from the initial model. Given these findings, we present recipes to boost OOD robustness for both types of backbones. Specifically, we investigate regularization and architectural choices for minimizing gradient updates so as to prevent the tuned backbone from losing generalizable features. Our proposed techniques complement each other and show substantial improvements over baselines on diverse architectures and datasets.


Paper

arXiv, 2023.

Citation

Kuniaki Saito, Donghyun Kim, Piotr Teterwak, Rogerio Feris, Kate Saenko
"Mind the Backbone: Minimizing Backbone Distortion for Robust Object Detection". Bibtex

Code





What is domain generalization?

Fig.1. Left: training domains. Right: testing domains.


Domain generalization refers to the ability of a machine learning model to generalize its knowledge and perform well on new, unseen domains that differ from the ones it was trained on. In traditional machine learning settings, models are trained and tested on data from the same domain, under the assumption that training and testing data follow the same distribution. In real-world scenarios, however, the data distribution can vary across domains, such as different geographical locations, times, or sources. As shown in Fig. 1, we aim to train a model that generalizes to diverse domains given only a single domain as training data.


Model distortion matters in domain generalization

Fig.2. A pre-trained model with generalizable representations can suffer from model distortion.

Domain generalization has been actively explored in image classification. One effective approach is to take a model pre-trained on a large-scale dataset and fine-tune it on the downstream task. Such a model has knowledge of diverse categories and has learned invariances across domains. However, if we naively fine-tune it, the model may perform well on the training domain but poorly on out-of-distribution (OOD) data: the backbone can lose its generalizable representations when its parameters are updated too much to fit the training data. In other words, distorting the model parameters can hurt OOD performance. On the other hand, if we freeze the backbone and tune only the linear head, the model is not distorted and may achieve high OOD accuracy, but it may underperform on in-distribution (ID) data because the backbone is not adapted to the training task. The key message from work on image classification is that we need to control model distortion during fine-tuning. We analyze the behavior of backbone models in object detection and propose simple techniques to train a model that generalizes across diverse domains.


Analysis on Model Distortion

Fig.3. DP: Decoder probing, freezing pre-trained backbone. FT: Fine-tuning.

First, we compare two approaches: decoder probing, which freezes the pre-trained backbone underneath the detector, and plain fine-tuning. Fig. 3 shows that networks like ResNet50 suffer a significant drop in OOD performance when the backbone is updated, while other networks gain performance from fine-tuning. The two setups are sketched below.
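As a minimal sketch, assuming a torchvision Faster R-CNN as a stand-in detector (the paper's actual detector and training setup may differ), the two setups amount to toggling gradients on the backbone:

```python
import torchvision

# Load a detector with a pre-trained ResNet50 backbone.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Decoder probing (DP): freeze the pre-trained backbone and train only
# the detection head on top of it.
for param in model.backbone.parameters():
    param.requires_grad = False

# Plain fine-tuning (FT): update every parameter, backbone included.
# for param in model.parameters():
#     param.requires_grad = True
```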

Fig.4. Relative gradient norm (RGN) per layer for three models. Legend shows RGN averaged over all layers.

We quantify model distortion with the Relative Gradient Norm (RGN): for each layer, the norm of the gradient divided by the norm of the parameters, where the gradient is computed on the detection loss. ResNet50 suffers from large model distortion, while the other two models have relatively small RGN across most layers.
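RGN is straightforward to measure during training. Here is a minimal sketch, assuming a PyTorch backbone on which `loss.backward()` has already been called with the detection loss (`model` is a placeholder name, not the paper's exact code):

```python
import torch

def relative_gradient_norms(backbone: torch.nn.Module) -> dict:
    """Per-layer RGN: gradient norm divided by parameter norm."""
    rgns = {}
    for name, param in backbone.named_parameters():
        if param.grad is None:
            continue
        rgns[name] = (param.grad.norm() / (param.norm() + 1e-12)).item()
    return rgns

# Usage, after backpropagating the detection loss:
#   rgns = relative_gradient_norms(model.backbone)
#   avg_rgn = sum(rgns.values()) / len(rgns)  # the scalar shown in Fig. 4's legend
```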


Proposed Method

Guided by this analysis, we regularize the backbone so that it stays close to its initial parameters during fine-tuning, applying both architecture- and objective-level regularization. See our paper for details; a sketch of the objective-level idea follows.
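To illustrate the objective-level idea, here is a hedged sketch: an L2 penalty that pulls the backbone toward its pre-trained weights (in the spirit of L2-SP). This illustrates the principle only and is not the paper's exact regularizer:

```python
import torch

def backbone_anchor_penalty(backbone: torch.nn.Module, init_params: dict,
                            weight: float = 1e-3) -> torch.Tensor:
    """L2 penalty that keeps the backbone close to its initial parameters."""
    penalty = 0.0
    for name, param in backbone.named_parameters():
        penalty = penalty + (param - init_params[name]).pow(2).sum()
    return weight * penalty

# Snapshot the pre-trained backbone before training:
#   init_params = {n: p.detach().clone() for n, p in model.backbone.named_parameters()}
# Then add the penalty to the detection loss at each step:
#   loss = detection_loss + backbone_anchor_penalty(model.backbone, init_params)
```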


Summary of Results

Fig.5. Our proposed tuning methods show remarkable improvements in OOD for diverse backbones.

We evaluate our approach on diverse networks and datasets. As shown in Fig. 5, the proposed fine-tuning is effective across diverse backbones. As shown in Fig. 6, it is also effective for transformer-based backbones, image corruptions, and long-tailed object detection.

Fig.6. The proposed method is effective for a Transformer-based backbone (Left), image corruptions (Center), and long-tailed object detection (Right).

Demo

Demo. Left: Baseline fine-tuning, Right: Our proposed fine-tuning.