Abstract
With the development of deep learning, video understanding has become a promising yet challenging research field. In recent years, various transformer architectures have achieved state-of-the-art performance on most benchmarks. Although transformers can process longer temporal sequences and therefore outperform convolutional networks, they require huge datasets and incur high computational costs. The input to a video transformer is usually a clip sampled from a video, and the clip length is limited by the available computing resources. In this paper, we introduce novel methods to sample and tokenize the input video so as to better capture its dynamics without a large increase in computational cost. Moreover, we introduce MinBlocks, a novel architecture inspired by neural processing in biological vision. The combination of variable tubes and MinBlocks improves network performance by 10.67%.
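The abstract refers to tokenizing the input video into tubes. As a rough illustration only (the paper's variable-tube scheme is not detailed here), the sketch below shows the standard fixed-size spatio-temporal tubelet tokenization used by video transformers: a clip of shape (T, H, W, C) is split into non-overlapping (t, h, w) tubelets, each flattened into one token. All names and sizes are hypothetical.

```python
import numpy as np


def tubelet_tokenize(video, t=2, h=16, w=16):
    """Split a video array (T, H, W, C) into non-overlapping
    spatio-temporal tubelets of size (t, h, w) and flatten each
    tubelet into a single token vector.

    Hypothetical sketch of standard tubelet embedding; the paper's
    variable-tube sampling is not specified in this abstract.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0, "dims must divide evenly"
    tokens = (
        video
        # carve the clip into a grid of tubelets
        .reshape(T // t, t, H // h, h, W // w, w, C)
        # group the grid axes together and the within-tubelet axes together
        .transpose(0, 2, 4, 1, 3, 5, 6)
        # one row per tubelet, flattened to a token vector
        .reshape(-1, t * h * w * C)
    )
    return tokens


# an 8-frame, 32x32 RGB clip -> (8/2)*(32/16)*(32/16) = 16 tokens of length 2*16*16*3
clip = np.zeros((8, 32, 32, 3), dtype=np.float32)
tokens = tubelet_tokenize(clip)
```

A longer tubelet along the time axis (larger `t`) captures more motion per token at the same token count, which is one way such tokenization can trade temporal coverage against compute.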
| Original language | English |
|---|---|
| Title of host publication | 2024 International Joint Conference on Neural Networks (IJCNN) |
| Publisher | IEEE |
| Publication date | 2024 |
| Pages | 1-7 |
| ISBN (Print) | 979-8-3503-5932-9 |
| ISBN (Electronic) | 979-8-3503-5931-2 |
| Publication status | Published - 2024 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs):
- SDG 3 Good Health and Well-being
- SDG 4 Quality Education
- SDG 9 Industry, Innovation, and Infrastructure
- SDG 11 Sustainable Cities and Communities
- SDG 12 Responsible Consumption and Production
- SDG 14 Life Below Water
- SDG 15 Life on Land
Fingerprint
Dive into the research topics of 'Novel Design Ideas that Improve Video-Understanding Networks with Transformers'. Together they form a unique fingerprint.