Abstract
Modeling temporal dependencies, and the associated computational load, remain challenges in video understanding. Here we focus on a more efficient sampling of color and temporal information. Instead of taking all color channels from the same frame, we sample them from different consecutive frames, capturing richer temporal information without increasing the computational load. We demonstrate the effectiveness of our approach for 2D-CNNs, 3D-CNNs, and Transformers, obtaining significant performance improvements on two benchmarks. The improvements are 2.43% on UCF101 and 4.55% on HMDB51 for the ResNet18, 10.28% and 7.12% for the 3D-ResNet18, and 15.11% and 13.71% for the UniFormerV2. These gains come at no additional cost, simply by changing the way color is sampled.
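The abstract does not spell out the exact sampling scheme, but the idea of drawing color channels from consecutive frames can be illustrated with a minimal sketch. The code below assumes one plausible assignment (R from frame t, G from frame t+1, B from frame t+2); the function name and the specific channel-to-frame mapping are illustrative, not taken from the paper.

```python
import numpy as np

def temporal_color_sampling(clip: np.ndarray) -> np.ndarray:
    """Recombine color channels across consecutive frames.

    clip: array of shape (T, H, W, 3), channels in RGB order.
    Returns an array of shape (T-2, H, W, 3) where output frame t
    takes R from input frame t, G from frame t+1, and B from
    frame t+2 (an assumed assignment for illustration).
    """
    t = clip.shape[0]
    return np.stack(
        [clip[: t - 2, ..., 0],   # R channel from frame t
         clip[1 : t - 1, ..., 1],  # G channel from frame t+1
         clip[2:, ..., 2]],        # B channel from frame t+2
        axis=-1,
    )

# Usage: a short synthetic clip of 8 frames of size 4x4
clip = np.random.rand(8, 4, 4, 3).astype(np.float32)
mixed = temporal_color_sampling(clip)
print(mixed.shape)  # (6, 4, 4, 3)
```

Because each output frame still has three channels of the original spatial resolution, the network input shape, and hence the compute cost, is unchanged; only the temporal content carried by the channels differs.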
| Original language | English |
|---|---|
| Title of host publication | Lecture Notes in Computer Science |
| Number of pages | 14 |
| Volume | 15293 |
| Publisher | Springer Nature Singapore |
| Publication date | 02.12.2024 |
| Pages | 413-426 |
| ISBN (Print) | 978-981-96-6598-3 |
| ISBN (Electronic) | 978-981-96-6596-9 |
| Publication status | Published - 02.12.2024 |