Publications
SEAR: Semantically-grounded Audio Representations
Abstract
Audio supports visual storytelling in movies through the use of different sounds. These sounds are often tied to different visual elements, including foreground entities, the interactions between them, and background context. Visual captions provide a condensed view of an image in the form of a natural language description of entities and the relationships between them. In this work, we utilize visual captions to semantically ground audio representations in a self-supervised setup. We leverage state-of-the-art vision-language models to augment movie datasets with visual captions at scale, on the order of 9.6M captions, and learn audio representations from over 2500 hours of movie data. We evaluate the utility of the learned representations and show state-of-the-art performance on two movie understanding tasks, genre and speaking-style classification, outperforming video-based methods and audio baselines. Finally …
- Date
- 2023
- Authors
- Rajat Hebbar, Digbalay Bose, Shrikanth Narayanan
- Book
- Proceedings of the 31st ACM International Conference on Multimedia
- Pages
- 2785-2794