Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

Jiahao Mei1,2, Heinrich Dinkel2, Yadong Niu2, Xingwei Sun2, Gang Li2, Yifan Liao2, Jiahao Zhou2, Junbo Zhang2, Jian Luan2, Mengyue Wu1

1 X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China
2 MiLM Plus, Xiaomi Inc., Beijing, China

A unified framework for generating coherent mixed-audio scenes with speech, music, sound effects, and ambient acoustics from text.

Abstract

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks.

Contributions

  • Dasheng AudioGen. A unified framework for general audio generation that synthesizes speech, music, sound effects, and environmental acoustics within a single coherent audio scene.
  • Structured multi-view captions. Complex audio scenes are decomposed into global descriptions, speaker style, speech transcripts, sound events, music, and acoustic environments for fine-grained and controllable generation.
  • Semantic-acoustic latent generation. Dasheng AudioGen operates on high-dimensional DashengTokenizer representations rather than low-dimensional acoustic VAEs, preserving semantic priors and acoustic details for overlapping sources.
  • Comprehensive evaluation pipeline. The system is evaluated across standard benchmarks, MECAT single-type and mixed-type scenes, human evaluation, and PAFI, showing strong performance in complex mixed scenes.
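
The abstract names a flow-matching DiT as the generator but this page does not state its training objective. As a rough orientation only, a commonly used linear-path (rectified-flow style) flow-matching objective is written out below. Every symbol is an assumption introduced for illustration: x_1 stands for a target latent from the shared semantic-acoustic space, x_0 for Gaussian noise, c for the structured caption condition, and v_θ for the DiT velocity predictor. The actual objective used by Dasheng AudioGen may differ.

```latex
% Generic linear-path flow-matching objective (illustrative, not taken from the paper)
\begin{aligned}
x_0 &\sim \mathcal{N}(0, I), \qquad t \sim \mathcal{U}[0, 1], \\
x_t &= (1 - t)\, x_0 + t\, x_1, \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\, x_1,\, t}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2 .
\end{aligned}
```

Under this formulation, sampling would integrate dx/dt = v_θ(x, t, c) from t = 0 (noise) to t = 1 (latent), after which a decoder maps the latent back to a waveform.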

Overview

<|caption|> <|speech|> <|asr|> <|music|> <|sfx|> <|env|>

Structured multi-view audio scene captioning and agentic generation pipeline. Special tokens such as <|music|> describe different components of the target audio scene, while an agentic prompt refiner converts a simple scene description into a structured caption for fine-grained control.
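
To make the structured caption idea concrete, the following is a minimal sketch of how the views could be serialized into one prompt string using the special tokens shown in the overview. The mapping of each token to a view, the token order, the space-separated layout, and the function name `build_structured_caption` are all assumptions for illustration; the page does not specify the exact format the model expects.

```python
from typing import Dict, List

# Special tokens as shown in the overview. Which view each token carries is
# an assumption inferred from the token names, not confirmed by the source.
VIEW_TOKENS: Dict[str, str] = {
    "caption": "<|caption|>",  # global description (assumed)
    "speech": "<|speech|>",    # speaker style (assumed)
    "asr": "<|asr|>",          # speech transcript (assumed)
    "music": "<|music|>",      # music description
    "sfx": "<|sfx|>",          # sound events (assumed)
    "env": "<|env|>",          # acoustic environment (assumed)
}


def build_structured_caption(views: Dict[str, str]) -> str:
    """Serialize the given views in a fixed token order, skipping empty ones."""
    unknown = set(views) - set(VIEW_TOKENS)
    if unknown:
        raise KeyError(f"unknown views: {sorted(unknown)}")
    parts: List[str] = []
    for name, token in VIEW_TOKENS.items():
        text = views.get(name, "").strip()
        if text:  # a view that is absent or blank is simply omitted
            parts.append(f"{token} {text}")
    return " ".join(parts)


# Hypothetical scene: a barista speaking over background music in a cafe.
prompt = build_structured_caption({
    "caption": "A barista greets a customer in a busy cafe.",
    "speech": "Female speaker, warm and cheerful tone.",
    "asr": "Good morning, what can I get you?",
    "music": "Soft jazz piano in the background.",
    "sfx": "Espresso machine hissing, cups clinking.",
    "env": "Small indoor room with mild reverberation.",
})
print(prompt)
```

Omitting a view (for example, leaving out `asr` and `speech` for a scene without talking) yields a shorter prompt with only the remaining tokens, which is one plausible way the same format could cover both single-type and mixed-type scenes.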

Listening Demos

Selected Audio Scenes

Browse generated examples across mixed audio, clean speech, music, and sound effects.

Mix Audio