Toward visual voice activity detection for unconstrained videos

Abstract

The prevalent audio-based Voice Activity Detection (VAD) systems are challenged by the presence of ambient noise and are sensitive to variations in the type of noise. The use of information from the visual modality, when available, can help overcome some of the problems of audio-based VAD. Existing visual-VAD systems, however, do not operate directly on the whole image but require intermediate face detection, face landmark detection, and subsequent facial feature extraction from the lip region. In this work, we present an end-to-end trainable Hierarchical Context Aware (HiCA) architecture for visual-VAD on videos obtained in unconstrained environments, which can be trained with videos as input and audio speech labels as output. The network is designed to account for local and global temporal information in a video sequence. In contrast to existing visual-VAD systems, our proposed approach does not rely …

Date: 2019
Authors: Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
Conference: 2019 IEEE International Conference on Image Processing (ICIP)
Pages: 2991-2995
Publisher: IEEE

Information Sciences Institute
