Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
Published as an arXiv preprint, 2025
We propose Dasheng AudioGen, a unified text-to-audio framework for general audio scene generation. Using structured multi-perspective descriptions and a unified semantic-acoustic representation, it jointly generates speech, music, sound effects, and environmental sounds end to end, with quality approaching that of real recordings across multiple audio categories.
Recommended citation: Jiahao Mei, Heinrich Dinkel, Yadong Niu, et al. "Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text." arXiv, 2025.
