Improved Audio Scene Classification Based on Label-Tree Embeddings and Convolutional Neural Networks

Huy Phan, Lars Hertel, Marco Maass, Philipp Koch, Radoslaw Mazur, Alfred Mertins


In this paper, we present an efficient approach for audio scene classification. We aim at learning representations for scene examples by exploring the structure of their class labels. A category taxonomy is automatically learned by collectively optimizing a tree-structured clustering of the given labels into multiple metaclasses. A scene recording is then transformed into a label-tree embedding image. Elements of the image represent the likelihoods that the scene instance belongs to the metaclasses. We investigate classification with label-tree embedding features learned from different low-level features as well as their fusion. We show that the combination of multiple features is essential to obtain good performance. While averaging label-tree embedding images over time yields good performance, we argue that average pooling possesses an intrinsic shortcoming. We alternatively propose an improved classification scheme to bypass this limitation. We aim at automatically learning common templates that are useful for the classification task from these images using simple but tailored convolutional neural networks. The trained networks are then employed as a feature extractor that matches the learned templates across a label-tree embedding image and produces the maximum matching scores as features for classification. Since audio scenes exhibit rich content, template learning and matching on low-level features would be inefficient. With label-tree embedding features, we have quantized and reduced the low-level features into the likelihoods of the metaclasses, on which the template learning and matching are efficient. We study both training convolutional neural networks on stacked label-tree embedding images and multistream networks. Experimental results on the DCASE2016 and LITIS Rouen datasets demonstrate the efficiency of the proposed methods.
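The contrast between the two pooling schemes described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the array layout (metaclasses by time frames), the function names `average_pool` and `max_template_scores`, and the hand-set template are all assumptions made for illustration, and in the paper the templates are learned by a convolutional network rather than fixed by hand. The toy data shows one property of averaging over time, namely that it ignores temporal order; whether this is the exact shortcoming the authors argue is not stated in the abstract.

```python
import numpy as np

def average_pool(lte_image):
    """Average a (n_metaclasses, n_frames) embedding image over time."""
    return lte_image.mean(axis=1)

def max_template_scores(lte_image, templates):
    """Slide each template along the time axis and keep its best match.

    templates has shape (n_templates, n_metaclasses, width); the result has
    one maximum matching score per template.
    """
    width = templates.shape[2]
    # windows: (n_metaclasses, n_positions, width)
    windows = np.lib.stride_tricks.sliding_window_view(lte_image, width, axis=1)
    # scores: (n_templates, n_positions), inner product of template and window
    scores = np.einsum("mpw,tmw->tp", windows, templates)
    return scores.max(axis=1)

# Two toy embedding images (2 metaclasses, 4 frames) with identical time
# averages but different temporal arrangements. Values are made up.
image_a = np.array([[1.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
image_b = np.array([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0]])

# One hypothetical template: metaclass 0 active for two consecutive frames.
templates = np.array([[[1.0, 1.0],
                       [0.0, 0.0]]])

avg_a, avg_b = average_pool(image_a), average_pool(image_b)
max_a = max_template_scores(image_a, templates)
max_b = max_template_scores(image_b, templates)
```

Under these assumptions, `avg_a` and `avg_b` are identical, while `max_a` and `max_b` differ, so the max-matching features separate two recordings that average pooling cannot tell apart.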
Original language: English
Journal: IEEE/ACM Trans. Audio, Speech, and Language Processing (TASLP)
Issue number: 6
Pages (from-to): 1278-1290
Number of pages: 13
Publication status: Published - 01.06.2017

