Publications

SEAR: Semantically-grounded Audio Representations

Abstract

Audio supports visual storytelling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, and background context. Visual captions provide a condensed view of an image as a natural-language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, on the order of 9.6M captions, and learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally …

Date
2023
Authors
Rajat Hebbar, Digbalay Bose, Shrikanth Narayanan
Book
Proceedings of the 31st ACM International Conference on Multimedia
Pages
2785-2794