Visual Geometry Grounded Novel-View Acoustic Synthesis

Geometry-grounded acoustic synthesis overview showing visual viewpoints and binaural channel audio

Abstract

We present a unified framework for novel-view acoustic synthesis that directly grounds spatial audio generation in feed-forward visual geometry. The method synthesizes accurate and immersive binaural audio in 3D spaces without requiring viewpoint images, dense point maps, or ground-truth poses for input video.

The framework combines learned visual representations and geometry from feed-forward scene encoding with geometry-aware binauralization. Its Geometry-Grounded Acoustic Decoder dynamically attends to cross-modal features that embed local and global geometry across audio and visual modalities. Experiments show improved quality and viewpoint accuracy across NVAS benchmarks while avoiding time-consuming explicit rendering of novel-view images or dense point maps.

No costly photogrammetryBypasses Structure-from-Motion, explicit dense point maps, and target-view image rendering during inference.

Geometry-aware audioUses feed-forward visual geometry as grounding for listener-conditioned binaural transfer prediction.

Sparse-view robustRemains operational with sparse reference frames where reconstruction-dependent pipelines can fail.

Method

The pipeline constructs a multimodal context from sparse reference video frames and aligned audio. A multi-view geometry encoder extracts visual context and pose-aware geometry. Acoustic prototype initialization encodes reference mono and binaural audio into scene-specific transfer features.

The Geometry-Grounded Acoustic Decoder queries this context with frequency-aware target-pose tokens, attends over visual, geometric, and acoustic features, and predicts target-view binaural transfer fields. Spectral binaural synthesis then applies the predicted transfer fields to mono audio to reconstruct left-right binaural output.

Detailed method diagram for multimodal context construction, geometry grounded acoustic decoder, and binaural audio synthesis — Framework overview: multimodal context construction, GGAD decoding, and spectral binaural synthesis.

Results

Dataset	Params	FPS	MAG ↓	ENV ↓	LRE ↓	DPAM ↓
RWAVS	3.24M	189	0.3485	0.1424	0.9589	0.2705
Replay-NVAS	3.24M	398	0.1590	0.0400	0.8060	0.2240

Lower is better for MAG, ENV, LRE, and DPAM. Replace or extend this table with camera-ready numbers as needed.

Key outcome

The method improves over AV-Cloud on RWAVS while reducing reliance on reconstruction-heavy preprocessing. Under stricter 50/50 train/test splits, the gains are especially visible in LRE and DPAM, indicating stronger viewpoint generalization.

Visualizations

Qualitative visualization of target-view binaural synthesis. The comparison shows the target view, ground truth, our model, AV-Cloud, and AV-NeRF, along with left and right channel waveform overlays. Our model more closely matches the ground-truth spectrogram and stereo waveform structure, demonstrating improved viewpoint-aware spatial audio generation.

BibTeX

@InProceedings{Polra_2026_CVPR,
    author    = {Polra, Jay and Chauhan, Dhwanil and Huang, Wenjun and Toth, Kyle and Wang, Xianhui and Ni, Yang},
    title     = {Visual Geometry Grounded Novel-View Acoustic Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7435-7444}
}