
Abstract

With the increasing importance of video data in real-world applications, there is a rising need for efficient object detection methods that utilize temporal information. While existing video object detection (VOD) techniques employ various strategies to address this challenge, they typically depend on locally adjacent frames or randomly sampled images within a clip. Although recent Transformer-based VOD methods have shown promising results, their reliance on multiple inputs and additional network complexity to incorporate temporal information limits their practical applicability. In this paper, we propose a novel approach to single-image object detection, called Context Enhanced TRansformer (CETR), which incorporates temporal context into DETR using a newly designed memory module. To store temporal information efficiently, we construct a class-wise memory that collects contextual information across the data. Additionally, we present a classification-based sampling technique that selectively utilizes the memory relevant to the current image. At test time, we introduce a memory adaptation method that updates individual memories according to the test distribution. Experiments on the CityCam and ImageNet VID datasets demonstrate the efficiency of our framework in various video settings.
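The test-time memory adaptation mentioned above could be pictured as an exponential-moving-average update of each class-wise memory slot. The sketch below is only an illustration under that assumption; the function name, momentum value, and feature shapes are hypothetical and not taken from the paper:

```python
import numpy as np

def adapt_memory(class_memory, test_feature, momentum=0.95):
    """Hypothetical test-time update: blend a stored class-wise
    memory vector with a feature extracted from the current test
    image via an exponential moving average (EMA)."""
    return momentum * class_memory + (1.0 - momentum) * test_feature

# toy usage with 8-dimensional feature vectors
mem = np.zeros(8)          # stored memory for one class
feat = np.ones(8)          # feature from the current test image
mem = adapt_memory(mem, feat)
```

A higher momentum keeps the memory stable across frames, while a lower one lets it track distribution shifts in the test stream more quickly.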

Qualitative Results

Qualitative comparisons on the CityCam dataset.

Quantitative Results

The table presents our main results on the CityCam test set. We applied our proposed framework to single-frame DETR-like methods and conducted quantitative comparisons with other single-frame detection methods and a multi-frame DETR-like method, TransVOD [TPAMI'2023], which uses Deformable DETR [ICLR'2021] as its baseline. Compared to the single-frame baseline, our method improves AP by 1.5%, with only a marginal increase in allocated memory and a marginal decrease in FPS.

Performance comparison with state-of-the-art real-time VOD methods using a ResNet-101 backbone on the ImageNet VID dataset. Here we report AP50, which is commonly used as mean average precision (mAP) in other VOD methods.

Main Architecture

Overview of our framework. CETR builds upon the DETR architecture. Within our framework, a pivotal component is the context memory module (CMM), which serves as the input to the Transformer encoder. Subsequently, the encoded memory features are passed through the classification network. The predicted probability serves as a threshold for score-based sampling. The sampled class-wise memory is aggregated with the query via the cross-attention mechanism within the memory-guided Transformer decoder (MGD).
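The score-based sampling step described above can be sketched as selecting the class-wise memory slots whose predicted class probability clears a threshold. This is a minimal illustration under our own assumptions; the function name, threshold value, and memory shapes are hypothetical, not taken from the paper:

```python
import numpy as np

def score_based_sampling(memory, probs, threshold=0.5):
    """Hypothetical sketch: keep only the class-wise memory slots
    whose classifier probability meets the threshold.

    memory: (C, d) array, one d-dim memory vector per class
    probs:  (C,) array of per-class probabilities
    Returns the sampled memory and the boolean keep mask."""
    keep = probs >= threshold
    return memory[keep], keep

# toy example: 4 classes, each with an 8-dim memory vector
rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 8))
probs = np.array([0.9, 0.2, 0.7, 0.1])   # classifier output per class
sampled, mask = score_based_sampling(memory, probs)
print(sampled.shape)  # (2, 8): only classes 0 and 2 are kept
```

The sampled memory would then play the role of keys and values in the decoder's cross-attention, so queries attend only to context deemed relevant to the current image.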

Citation

Acknowledgements

The website template was borrowed from Michaël Gharbi.