Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

The Hugging Face demo and OpenClaw skill are currently available. The code is under open-source review and will be released soon.

Jiahao Mei1,2, Heinrich Dinkel2, Yadong Niu2, Xingwei Sun2, Gang Li2, Yifan Liao2, Jiahao Zhou2, Junbo Zhang2, Jian Luan2, Mengyue Wu1

1 X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China  2 MiLM Plus, Xiaomi Inc., Beijing, China

A unified framework for generating coherent mixed-audio scenes with speech, music, sound effects, and ambient acoustics from text.

Abstract

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into individual descriptions for speech content, speaker style, sound effects, and music, thereby enabling fine-grained control over audio generation. Furthermore, we employ a high-dimensional unified semantic-acoustic representation (DashengTokenizer) as the shared latent space for flow matching. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. Extensive subjective and objective experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized expert models in single-type generation tasks. Demos and model checkpoints are available at this project page.

Contributions

  • Dasheng AudioGen. A unified framework for general audio generation that synthesizes speech, music, sound effects, and environmental acoustics within a single coherent audio scene.
  • Structured multi-view captions. Complex audio scenes are decomposed into global descriptions, speaker style, speech transcripts, sound events, music, and acoustic environments for fine-grained and controllable generation.
  • Semantic-acoustic latent generation. Dasheng AudioGen operates on high-dimensional DashengTokenizer representations rather than low-dimensional acoustic VAEs, preserving semantic priors and acoustic details for overlapping sources.
  • Comprehensive evaluation pipeline. The system is evaluated across standard benchmarks, MECAT single-type and mixed-type scenes, human evaluation, and PAFI, showing strong performance in complex mixed scenes.
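The latent-generation contribution above trains a flow-matching model over high-dimensional DashengTokenizer features. As a minimal sketch of the flow-matching objective, assuming linear (rectified-flow style) interpolation paths and using a placeholder `velocity_model` where the real system conditions a network on the structured caption:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(latent, velocity_model):
    """Regress the predicted velocity toward the straight-line target x1 - x0.

    `latent` stands in for a batch of DashengTokenizer-style latent frames;
    the interpolation path and loss form are assumptions for illustration.
    """
    x1 = latent                                 # data sample in latent space
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                # point on the linear path
    target_velocity = x1 - x0                   # constant velocity of that path
    pred = velocity_model(xt, t)
    return float(np.mean((pred - target_velocity) ** 2))

# Toy check with a model that always predicts zero velocity.
latent = rng.standard_normal((4, 8))
loss = flow_matching_loss(latent, lambda xt, t: np.zeros_like(xt))
```

At inference, the learned velocity field would be integrated from noise to a latent, which the tokenizer's decoder then maps back to audio.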

Overview

<|caption|> <|speech|> <|asr|> <|music|> <|sfx|> <|env|>

Structured multi-view audio scene captioning and agentic generation pipeline. Special tokens such as <|music|> describe different components of the target audio scene, while an agentic prompt refiner converts a simple scene description into a structured caption for fine-grained control.
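A structured multi-view caption can be pictured as the per-view descriptions concatenated behind their special tokens. The helper below is a hypothetical sketch of that assembly; the actual token order and field semantics of the released prompt format are assumptions based only on the tokens shown above:

```python
def build_structured_caption(caption, speech=None, asr=None,
                             music=None, sfx=None, env=None):
    """Join per-view descriptions behind their special tokens,
    skipping views that are absent from the scene."""
    views = [
        ("<|caption|>", caption),  # global scene description
        ("<|speech|>", speech),    # speaker style, e.g. "warm male voice"
        ("<|asr|>", asr),          # spoken transcript
        ("<|music|>", music),      # musical content
        ("<|sfx|>", sfx),          # sound events
        ("<|env|>", env),          # acoustic environment / reverb
    ]
    return " ".join(f"{tok} {text}" for tok, text in views if text)

prompt = build_structured_caption(
    caption="A street performer sings over light rain",
    speech="warm male voice, slight rasp",
    asr="thank you all for stopping by",
    music="acoustic guitar, mid tempo",
    sfx="rain on pavement, distant traffic",
    env="open street, light natural reverb",
)
```

Decomposing the scene this way is what lets the generator be steered per component, e.g. changing only the `<|music|>` field while keeping the transcript fixed.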

Listening Demos

Selected Audio Scenes

Browse generated examples across mixed audio, clean speech, music, and sound effects.


Mixed Audio