Video Understanding Using 2D-CNNs on Salient Spatio-Temporal Slices

Yaxin Hu*, Erhardt Barth

*Corresponding author for this work

Abstract

Video understanding remains a challenge even with advanced deep-learning methods, which typically sample only a few frames from which spatial and temporal features are extracted. Such down-sampling often leads to the loss of critical temporal information. Moreover, current state-of-the-art methods incur high computational costs. 2D Convolutional Neural Networks (2D-CNNs) have proven effective at capturing the spatial features of images but cannot make use of temporal information. To address these challenges, we propose to apply 2D-CNNs not only to images, i.e., xy-slices of the video, but also to salient spatio-temporal xt and yt slices, thereby efficiently capturing both the spatial and the temporal information of the entire video. As 2D-CNNs are known to extract local spatial orientation in xy, they can now also extract motion, which appears as local orientation in xt and yt. We complement the approach with a simple strategy for sampling the most informative slices and show that it outperforms alternative approaches on a number of tasks, especially those in which actions are defined by their dynamics, i.e., by spatio-temporal patterns.
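The slicing operation itself is simple to illustrate. The sketch below is not the authors' implementation: it assumes a grayscale video tensor of shape (T, H, W) and uses temporal variance as a hypothetical stand-in for the paper's (unspecified) saliency criterion, since rows and columns with little change over time carry little motion information. The function name extract_salient_slices and the parameter n_slices are illustrative choices.

```python
import numpy as np

def extract_salient_slices(video: np.ndarray, n_slices: int = 8):
    """Extract xt and yt slices from a video of shape (T, H, W).

    Saliency proxy (an assumption, not the paper's criterion):
    slices with higher temporal variance are treated as more
    informative, since static rows/columns carry little motion.
    """
    T, H, W = video.shape

    # yt-slices: fix a column x, keep all rows and frames -> each (T, H)
    yt_all = np.stack([video[:, :, x] for x in range(W)])  # (W, T, H)
    # xt-slices: fix a row y, keep all columns and frames -> each (T, W)
    xt_all = np.stack([video[:, y, :] for y in range(H)])  # (H, T, W)

    # Score each slice by its variance over time, averaged over space.
    yt_scores = yt_all.var(axis=1).mean(axis=-1)  # (W,)
    xt_scores = xt_all.var(axis=1).mean(axis=-1)  # (H,)

    # Keep the n_slices highest-scoring slices of each kind.
    yt_top = yt_all[np.argsort(yt_scores)[-n_slices:]]
    xt_top = xt_all[np.argsort(xt_scores)[-n_slices:]]
    return xt_top, yt_top

# Each selected slice is an ordinary 2D image in which motion appears
# as local orientation, so it can be fed to any standard 2D-CNN.
video = np.random.rand(64, 112, 112).astype(np.float32)  # toy input
xt, yt = extract_salient_slices(video)
print(xt.shape, yt.shape)  # (8, 64, 112) (8, 64, 112)
```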

Original language: English
Title of host publication: Lecture Notes in Computer Science: Artificial Neural Networks and Machine Learning – ICANN 2024
Volume: 15018
Publisher: Springer, Cham
Publication date: 17.09.2024
Pages: 256-270
ISBN (Print): 978-3-031-72337-7
ISBN (Electronic): 978-3-031-72338-4
Publication status: Published - 17.09.2024
