Novel Design Ideas that Improve Video-Understanding Networks with Transformers

Yaxin Hu*, Erhardt Barth

*Corresponding author for this work

Abstract

With the development of deep learning, video understanding has become a promising yet challenging research field. In recent years, different transformer architectures have shown state-of-the-art performance on most benchmarks. Although transformers can process longer temporal sequences and therefore perform better than convolutional networks, they require huge datasets and have high computational costs. The inputs to video transformers are usually clips sampled from a video, and the length of these clips is limited by the available computing resources. In this paper, we introduce novel methods for sampling and tokenizing the input video so as to better capture its dynamics without a large increase in computational cost. Moreover, we introduce MinBlocks, a novel architecture inspired by neural processing in biological vision. The combination of variable tubes and MinBlocks improves network performance by 10.67%.
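The abstract does not spell out the implementations, so the following is a minimal PyTorch sketch under two assumptions: that "variable tubes" refers to tubelet-style tokenization with several temporal extents, and that a MinBlock combines two learned branches with an elementwise minimum (an AND-like operation loosely motivated by conjunctive responses in biological vision). All class names, shapes, and parameters here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class VariableTubeTokenizer(nn.Module):
    """Sketch of variable-tube tokenization (assumption, not the paper's code):
    the clip is split into non-overlapping spatio-temporal tubes of several
    temporal lengths, and each tube is projected to one token."""
    def __init__(self, embed_dim=768, patch_size=16, tube_lengths=(2, 4, 8), in_chans=3):
        super().__init__()
        # One 3D conv per tube length; kernel == stride == tube shape, so each
        # output position corresponds to exactly one non-overlapping tube.
        self.projs = nn.ModuleList(
            nn.Conv3d(in_chans, embed_dim,
                      kernel_size=(t, patch_size, patch_size),
                      stride=(t, patch_size, patch_size))
            for t in tube_lengths
        )

    def forward(self, video):  # video: (B, C, T, H, W)
        # Each projection yields (B, D, T', H', W') -> flatten to (B, N_i, D).
        tokens = [p(video).flatten(2).transpose(1, 2) for p in self.projs]
        return torch.cat(tokens, dim=1)  # (B, sum(N_i), D)

class MinBlock(nn.Module):
    """Sketch of a MinBlock (assumption): two linear branches over the same
    normalized input are fused by an elementwise minimum, with a residual."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.branch_a = nn.Linear(dim, dim)
        self.branch_b = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, dim)
        h = self.norm(x)
        return x + torch.minimum(self.branch_a(h), self.branch_b(h))

# Usage: a 16-frame clip yields tubes of length 2, 4, and 8 side by side.
clip = torch.randn(2, 3, 16, 224, 224)
x = VariableTubeTokenizer()(clip)   # (2, 2744, 768) for 224x224, 16 frames
x = MinBlock(768)(x)
```

One point of this design is that short tubes keep fine temporal detail while long tubes summarize slower dynamics, so the token count grows only modestly compared with uniformly short tubes.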

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks (IJCNN)
Publisher: IEEE
Publication date: 2024
Pages: 1-7
ISBN (Print): 979-8-3503-5932-9
ISBN (Electronic): 979-8-3503-5931-2
Publication status: Published - 2024
