Egocentric videos provide comprehensive contexts for user and scene understanding, spanning from multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively mitigates challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
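To make the world-locking idea concrete, below is a minimal sketch of the underlying coordinate transform, assuming per-frame head orientation is available as a head-to-world rotation matrix (e.g., from IMU or SLAM). The function names (head_to_world, sphere_to_angles) and the toy example are illustrative only, not the paper's implementation.

# Minimal sketch of world-locking directional tokens (not the authors' code).
# Assumes head pose is a 3x3 head-to-world rotation matrix R_head; all
# names here are hypothetical placeholders.
import numpy as np

def head_to_world(directions: np.ndarray, R_head: np.ndarray) -> np.ndarray:
    """Rotate unit direction vectors from the head-locked frame into the
    world-locked frame, cancelling the wearer's head rotation.

    directions: (N, 3) unit vectors for N multisensory tokens (head frame).
    R_head:     (3, 3) head orientation (head-to-world rotation).
    """
    return directions @ R_head.T  # batched form of world = R_head @ head

def sphere_to_angles(dirs: np.ndarray) -> np.ndarray:
    """Convert unit vectors to (azimuth, elevation) angles on the sphere."""
    az = np.arctan2(dirs[:, 1], dirs[:, 0])
    el = np.arcsin(np.clip(dirs[:, 2], -1.0, 1.0))
    return np.stack([az, el], axis=-1)

# Toy example: a source observed straight ahead in the head frame
# after the wearer has turned the head 90 degrees to the left.
token_dir = np.array([[1.0, 0.0, 0.0]])
theta = np.pi / 2  # head yaw (rotation about the vertical z-axis)
R_head = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
world_dir = head_to_world(token_dir, R_head)
print(np.degrees(sphere_to_angles(world_dir)))  # azimuth ~ +90 deg

Because a static source keeps a fixed world-frame direction under this transform no matter how the head turns, embeddings placed on the world-locked sphere stay spatially aligned across frames, which is the property the abstract credits with improved synchronization between modalities.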
@InProceedings{yun2024spherical,
author = {Yun, Heeseung and Gao, Ruohan and Ananthabhotla, Ishwarya and Kumar, Anurag and Donley, Jacob and Li, Chao and Kim, Gunhee and Ithapu, Vamsi Krishna and Murdock, Calvin},
title = {Spherical World-Locking for Audio-Visual Localization in Egocentric Videos},
booktitle = {ECCV},
year = {2024}
}