Novel Design Ideas that Improve Video-Understanding Networks with Transformers

Yaxin Hu*, Erhardt Barth

*Corresponding author for this work

Abstract

With the development of deep learning, video understanding has become a promising and challenging research field. In recent years, different transformer architectures have shown state-of-the-art performance on most benchmarks. Although transformers can process longer temporal sequences and therefore perform better than convolutional networks, they require huge datasets and have high computational costs. The inputs to video transformers are usually clips sampled from a video, and the length of the clips is limited by the available computing resources. In this paper, we introduce novel methods to sample and tokenize the input video so as to better capture the dynamics of the input without a large increase in computational cost. Moreover, we introduce MinBlocks, a novel architecture inspired by neural processing in biological vision. The combination of variable tubes and MinBlocks improves network performance by 10.67%.

Original language: English
Title of host publication: 2024 International Joint Conference on Neural Networks (IJCNN)
Publisher: IEEE
Publication date: 2024
Pages: 1-7
ISBN (Print): 979-8-3503-5932-9
ISBN (Electronic): 979-8-3503-5931-2
Publication status: Published - 2024

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
  2. SDG 4 - Quality Education
  3. SDG 9 - Industry, Innovation, and Infrastructure
  4. SDG 11 - Sustainable Cities and Communities
  5. SDG 12 - Responsible Consumption and Production
  6. SDG 14 - Life Below Water
  7. SDG 15 - Life on Land
