Abstract
Video understanding remains a challenge even with advanced deep-learning methods, which typically sample only a few frames from which spatial and temporal features are extracted. Such down-sampling often leads to the loss of critical temporal information, and current state-of-the-art methods also involve high computational costs. 2D Convolutional Neural Networks (2D-CNNs) have proven effective at capturing the spatial features of images but cannot exploit temporal information. To address these challenges, we propose to apply 2D-CNNs not only to images, i.e., the xy-slices of a video, but also to salient spatio-temporal xt- and yt-slices, thereby efficiently capturing both the spatial and the temporal information of the entire video. Since 2D-CNNs are known to extract local spatial orientation in the xy-plane, they can likewise extract motion, which appears as local orientation in the xt- and yt-planes. We complement the approach with a simple strategy for sampling the most informative slices and show that it outperforms alternative approaches on a number of tasks, especially when actions are defined by their dynamics, i.e., by spatio-temporal patterns.
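To make the slicing idea concrete, the following is a minimal NumPy sketch of how xt- and yt-slices can be cut from a video volume and ranked by a simple informativeness score. The temporal-variance criterion and all function names here are illustrative assumptions, not the paper's actual saliency-based sampling strategy.

```python
import numpy as np

def extract_slices(video: np.ndarray, num_slices: int = 8):
    """Cut xt- and yt-slices from a video volume and keep the most informative.

    video: array of shape (T, H, W) (grayscale for simplicity).
    Returns two lists of 2D slices, each usable as input to a 2D-CNN.
    NOTE: ranking by temporal variance is an illustrative stand-in for the
    paper's saliency-based sampling strategy.
    """
    T, H, W = video.shape

    # xt-slices: fix a row y, keep all frames and columns -> shape (T, W).
    xt_slices = [video[:, y, :] for y in range(H)]
    # yt-slices: fix a column x, keep all frames and rows -> shape (T, H).
    yt_slices = [video[:, :, x] for x in range(W)]

    def top_k(slices, k):
        # Score each slice by how much it varies over time; nearly static
        # slices carry little motion information.
        scores = [s.var(axis=0).mean() for s in slices]
        order = np.argsort(scores)[::-1]
        return [slices[i] for i in order[:k]]

    return top_k(xt_slices, num_slices), top_k(yt_slices, num_slices)

# Example: a random 32-frame clip of size 64x64.
clip = np.random.rand(32, 64, 64).astype(np.float32)
xt, yt = extract_slices(clip)
print(xt[0].shape, yt[0].shape)  # (32, 64) (32, 64)
```

In such a slice, motion becomes a local orientation: an object moving along x traces a tilted streak in the xt-plane, which is exactly the kind of oriented pattern a 2D-CNN is good at detecting.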
| Original language | English |
| --- | --- |
| Title of host publication | Lecture Notes in Computer Science: Artificial Neural Networks and Machine Learning – ICANN 2024 |
| Volume | 15018 |
| Publisher | Springer, Cham |
| Publication date | 17.09.2024 |
| Pages | 256–270 |
| ISBN (Print) | 978-3-031-72337-7 |
| ISBN (Electronic) | 978-3-031-72338-4 |
| Publication status | Published - 17.09.2024 |