CVPR 2026 Workshop

Visual Geometry Grounded Novel-View Acoustic Synthesis

A feed-forward framework for viewpoint-accurate binaural audio synthesis without explicit 3D visual rendering, dense point maps, or target-view images.

Jay Polra · Dhwanil Chauhan · Wenjun Huang · Kyle Toth · Xianhui Wang · Yang Ni

Purdue University Northwest · University of California, Irvine · CIVS · San Diego State University

This work is supported by the Steel Manufacturing Simulation and Visualization Consortium (SMSVC).

Geometry-grounded acoustic synthesis overview showing visual viewpoints and binaural channel audio

Abstract

We present a unified framework for novel-view acoustic synthesis that directly grounds spatial audio generation in feed-forward visual geometry. The method synthesizes accurate and immersive binaural audio in 3D spaces without requiring target-view images, dense point maps, or ground-truth poses for the input video.

The framework combines learned visual representations and geometry from feed-forward scene encoding with geometry-aware binauralization. Its Geometry-Grounded Acoustic Decoder dynamically attends to cross-modal features that embed local and global geometry across audio and visual modalities. Experiments show improved quality and viewpoint accuracy across NVAS benchmarks while avoiding time-consuming explicit rendering of novel-view images or dense point maps.

No costly photogrammetry: Bypasses Structure-from-Motion, explicit dense point maps, and target-view image rendering during inference.
Geometry-aware audio: Uses feed-forward visual geometry as grounding for listener-conditioned binaural transfer prediction.
Sparse-view robust: Remains operational with sparse reference frames where reconstruction-dependent pipelines can fail.

Method

The pipeline constructs a multimodal context from sparse reference video frames and aligned audio. A multi-view geometry encoder extracts visual context and pose-aware geometry. Acoustic prototype initialization encodes reference mono and binaural audio into scene-specific transfer features.

The Geometry-Grounded Acoustic Decoder queries this context with frequency-aware target-pose tokens, attends over visual, geometric, and acoustic features, and predicts target-view binaural transfer fields. Spectral binaural synthesis then applies the predicted transfer fields to mono audio to reconstruct left-right binaural output.
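
The decoding step described above can be sketched as single-head cross-attention. This is a minimal NumPy illustration, not the authors' implementation: the token dimensions, the sinusoidal frequency code, the helper names (`make_queries`, `cross_attention`), and the random matrices standing in for learned weights are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_queries(target_pose, num_bins, w_pose):
    # Hypothetical "frequency-aware target-pose tokens": one query per
    # frequency bin, formed by adding a sinusoidal code of the bin index
    # to a linear embedding of the target pose.
    freq = np.linspace(0.0, 1.0, num_bins)[:, None]      # (F, 1)
    rates = np.arange(1, w_pose.shape[1] + 1)[None, :]   # (1, D)
    freq_code = np.sin(np.pi * freq * rates)             # (F, D)
    return freq_code + target_pose @ w_pose              # (F, D)

def cross_attention(queries, context):
    # Scaled dot-product attention; the context tokens serve as both keys
    # and values (learned projections are omitted for brevity).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)  # (F, N)
    return weights @ context, weights

rng = np.random.default_rng(0)
D, F = 16, 8  # assumed feature width and number of frequency bins

# Random stand-ins for the visual, geometric, and acoustic features.
visual = rng.standard_normal((5, D))
geometry = rng.standard_normal((5, D))
acoustic = rng.standard_normal((3, D))
context = np.concatenate([visual, geometry, acoustic], axis=0)  # (13, D)

# Assumed pose layout: 3D position followed by a 3D view direction.
target_pose = np.array([0.5, -1.0, 0.2, 0.0, 0.0, 1.0])
w_pose = rng.standard_normal((6, D)) / np.sqrt(6)  # stand-in for learned weights
w_out = rng.standard_normal((D, 2)) / np.sqrt(D)   # stand-in output head

queries = make_queries(target_pose, F, w_pose)
attended, weights = cross_attention(queries, context)
transfer_field = attended @ w_out  # (F, 2): one value per bin for left and right
```

Each row of `weights` shows how strongly one frequency-bin query draws on each visual, geometric, or acoustic token, which is the sense in which the decoder "attends over" the multimodal context.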

Detailed method diagram for multimodal context construction, geometry grounded acoustic decoder, and binaural audio synthesis
Framework overview: multimodal context construction, GGAD decoding, and spectral binaural synthesis.
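
The final spectral binaural synthesis stage can be illustrated with a short NumPy sketch. It assumes, purely for illustration, that a transfer field acts as a multiplicative gain on the short-time Fourier transform of the mono signal; the page does not specify the exact parameterization, and the FFT size, hop length, sample rate, and function names below are hypothetical.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    # Hann-windowed frames followed by a real FFT per frame.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # (T, n_fft // 2 + 1)

def istft(spec, n_fft=256, hop=64):
    # Weighted overlap-add with squared-window normalization.
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    length = n_fft + hop * (spec.shape[0] - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

def binauralize(mono, transfer_left, transfer_right):
    # Apply one transfer field per ear in the spectral domain, then invert.
    spec = stft(mono)
    left = istft(spec * transfer_left)
    right = istft(spec * transfer_right)
    return np.stack([left, right])  # (2, num_samples)

sample_rate = 16000  # assumed
t = np.arange(sample_rate) / sample_rate
mono = np.sin(2 * np.pi * 440.0 * t)

# Toy transfer fields: unit gain on the left, half gain on the right,
# constant across frequency. A predicted field would vary with time,
# frequency, and target pose.
n_bins = 256 // 2 + 1
binaural = binauralize(mono, np.ones(n_bins), 0.5 * np.ones(n_bins))
```

With these toy gains the right channel is simply the left channel at half amplitude, which makes the effect of the transfer field easy to verify before substituting a predicted, complex-valued field.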

Results

Dataset       Params   FPS   MAG ↓    ENV ↓    LRE ↓    DPAM ↓
RWAVS         3.24M    189   0.3485   0.1424   0.9589   0.2705
Replay-NVAS   3.24M    398   0.1590   0.0400   0.8060   0.2240

Lower is better for MAG, ENV, LRE, and DPAM.
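
To make the LRE column concrete, the sketch below computes a left-right energy ratio error under one definition commonly used in the novel-view acoustic synthesis literature: the absolute difference between the left/right energy ratios of the predicted and ground-truth audio. The page does not define the metric, so this formula, the function names, and the `eps` constant are assumptions and may differ from the evaluation code behind the table.

```python
import numpy as np

def left_right_energy_ratio(binaural, eps=1e-8):
    # binaural has shape (2, num_samples): row 0 is left, row 1 is right.
    left, right = binaural
    return np.sum(left ** 2) / (np.sum(right ** 2) + eps)

def lre(pred, target):
    # Assumed definition: absolute difference of the two energy ratios.
    return abs(left_right_energy_ratio(pred) - left_right_energy_ratio(target))

signal = np.sin(np.linspace(0.0, 20.0 * np.pi, 1000))

# Ground truth is louder on the left (energy ratio of about 4);
# the prediction is balanced (energy ratio of about 1).
target = np.stack([signal, 0.5 * signal])
pred = np.stack([signal, signal])

error = lre(pred, target)
print(round(error, 3))  # 3.0
```

A prediction that places the source on the wrong side, or with the wrong interaural level difference, raises this error even when each channel on its own sounds plausible.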

Key outcome

The method improves over AV-Cloud on RWAVS while reducing reliance on reconstruction-heavy preprocessing. Under stricter 50/50 train/test splits, the gains are especially visible in LRE and DPAM, indicating stronger viewpoint generalization.

Visualizations

Qualitative visualization of target-view binaural synthesis. The comparison shows the target view, ground truth, our model, AV-Cloud, and AV-NeRF, along with left and right channel waveform overlays. Our model more closely matches the ground-truth spectrogram and stereo waveform structure, demonstrating improved viewpoint-aware spatial audio generation.

Poster preview

BibTeX

@InProceedings{Polra_2026_CVPR,
    author    = {Polra, Jay and Chauhan, Dhwanil and Huang, Wenjun and Toth, Kyle and Wang, Xianhui and Ni, Yang},
    title     = {Visual Geometry Grounded Novel-View Acoustic Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7435-7444}
}