Novel-View Acoustic Synthesis | Dhwanil R. Chauhan

Existing methods for novel-view acoustic synthesis (NVAS) depend on expensive per-scene 3D reconstruction pipelines (Structure-from-Motion, dense point maps, neural rendering) that are slow, fragile under sparse inputs, and impractical for real deployment. We rethink the problem by removing the reconstruction stage altogether.

Our approach grounds spatial audio synthesis directly in feed-forward visual geometry, bypassing explicit 3D reconstruction entirely. Given a short video clip as input, our framework builds a multimodal context from reference views, combining visual semantics, estimated scene geometry, and acoustically grounded prototype embeddings. The Geometry-Grounded Acoustic Decoder (GGAD) then retrieves listener-conditioned acoustic transfer fields using cross-attention over this context.
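The retrieval step described above can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, in which one listener-conditioned query attends over a set of context tokens. This is not the GGAD implementation: the function names, the toy 2-D vectors, and the reading of the output as an "acoustic transfer feature" are all assumptions made for illustration, and the real decoder presumably uses learned projections, multiple heads, and much higher-dimensional features.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a 1-D sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(
    query: Vector, keys: List[Vector], values: List[Vector]
) -> Tuple[Vector, Vector]:
    """Attend from one query over (key, value) context tokens.

    Returns the attention-weighted sum of the values and the weights.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal in length")
    scale = 1.0 / math.sqrt(len(query))  # standard 1/sqrt(d) scaling
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Hypothetical toy data: one listener query and three reference-view tokens.
listener_query = [1.0, 0.0]                                  # assumed listener embedding
context_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]         # assumed geometry-derived keys
context_values = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]        # assumed acoustic features

transfer_feature, attention = cross_attend(listener_query, context_keys, context_values)
print("attention weights:", [round(w, 3) for w in attention])
print("retrieved feature:", [round(x, 3) for x in transfer_feature])
```

In this toy setup the reference token whose key aligns best with the listener query receives the largest weight, so the retrieved feature is pulled toward that token's value. The same mechanism is what lets a decoder of this kind condition its output on the listener without needing an image from the target view.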

My contribution was twofold: designing the output representation extracted from the VGGT feed-forward geometry model, and formulating the query/key structure of the GGAD cross-attention mechanism. Together these form the architectural core that enables geometry-aware binauralization without requiring target-view images or dense point cloud reconstruction.

On the RWAVS and Replay-NVAS benchmarks, our framework outperforms prior baselines in audio quality, efficiency, and robustness to sparse reference frames, and it runs significantly faster than reconstruction-dependent methods.

(Polra et al., 2026)

References

2026

CVPR Workshop
Visual Geometry Grounded Novel-View Acoustic Synthesis

Jay Polra, Dhwanil Chauhan, Wenjun Huang, Kyle Toth, Xianhui Wang, and Yang Ni

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2026

Purdue University Northwest · UC Irvine · San Diego State University


We present the first unified framework for novel-view acoustic synthesis that entirely bypasses explicit 3D visual rendering and costly photogrammetry by directly grounding spatial audio generation in feed-forward visual geometry. Our framework synthesizes accurate and immersive spatial audio in 3D spaces without requiring viewpoint images, dense point maps, or ground-truth poses for input video. We propose the Geometry-Grounded Acoustic Decoder (GGAD) to dynamically attend to cross-modal features embedding local and global geometries in audio and visual modalities. Extensive experiments show that our framework outperforms prior work across various benchmarks in high-quality, viewpoint-accurate spatial audio synthesis, without requiring time-consuming explicit rendering of novel-view images or dense point maps.

@inproceedings{polra2026nvas,
  title     = {Visual Geometry Grounded Novel-View Acoustic Synthesis},
  author    = {Polra, Jay and Chauhan, Dhwanil and Huang, Wenjun and Toth, Kyle and Wang, Xianhui and Ni, Yang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2026},
  url       = {https://dhwanil832.github.io/projects/nvas/},
  note      = {Purdue University Northwest · UC Irvine · San Diego State University}
}