Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

The Hugging Face demo and OpenClaw skill are currently available. The code is under open-source review and will be released soon.

Jiahao Mei1,2, Heinrich Dinkel2, Yadong Niu2, Xingwei Sun2, Gang Li2, Yifan Liao2, Jiahao Zhou2, Junbo Zhang2, Jian Luan2, Mengyue Wu1

1 X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China  2 MiLM Plus, Xiaomi Inc., Beijing, China

A unified framework for generating coherent mixed-audio scenes with speech, music, sound effects, and ambient acoustics from text.

Abstract

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into individual descriptions for speech content, speaker style, sound effects, and music, thereby enabling fine-grained control over audio generation. Furthermore, we employ a high-dimensional unified semantic-acoustic representation (DashengTokenizer) as the shared latent space for flow matching. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. Extensive subjective and objective experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized expert models in single-type generation tasks. Demos and model checkpoints are available at this project page.

Contributions

  • Dasheng AudioGen. A unified framework for general audio generation that synthesizes speech, music, sound effects, and environmental acoustics within a single coherent audio scene.
  • Structured multi-view captions. Complex audio scenes are decomposed into global descriptions, speaker style, speech transcripts, sound events, music, and acoustic environments for fine-grained and controllable generation.
  • Semantic-acoustic latent generation. Dasheng AudioGen operates on high-dimensional DashengTokenizer representations rather than low-dimensional acoustic VAEs, preserving semantic priors and acoustic details for overlapping sources.
  • Comprehensive evaluation pipeline. The system is evaluated across standard benchmarks, MECAT single-type and mixed-type scenes, human evaluation, and PAFI, showing strong performance in complex mixed scenes.
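The latent-generation contribution above trains a flow-matching model over high-dimensional DashengTokenizer features. As a minimal sketch of the flow-matching objective, assuming linear (rectified-flow style) interpolation paths and using a placeholder `velocity_model` where the real system conditions a network on the structured caption:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(latent, velocity_model):
    """Regress the predicted velocity toward the straight-line target x1 - x0.

    `latent` stands in for a batch of DashengTokenizer-style latent frames;
    the interpolation path and loss form are assumptions for illustration.
    """
    x1 = latent                                 # data sample in latent space
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))      # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                # point on the linear path
    target_velocity = x1 - x0                   # constant velocity of that path
    pred = velocity_model(xt, t)
    return float(np.mean((pred - target_velocity) ** 2))

# Toy check with a model that always predicts zero velocity.
latent = rng.standard_normal((4, 8))
loss = flow_matching_loss(latent, lambda xt, t: np.zeros_like(xt))
```

At inference, the learned velocity field would be integrated from noise to a latent, which the tokenizer's decoder then maps back to audio.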

Overview

<|caption|> <|speech|> <|asr|> <|music|> <|sfx|> <|env|>

Structured multi-view audio scene captioning and agentic generation pipeline. Special tokens such as <|music|> describe different components of the target audio scene, while an agentic prompt refiner converts a simple scene description into a structured caption for fine-grained control.
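A structured multi-view caption can be pictured as the per-view descriptions concatenated behind their special tokens. The helper below is a hypothetical sketch of that assembly; the actual token order and field semantics of the released prompt format are assumptions based only on the tokens shown above:

```python
def build_structured_caption(caption, speech=None, asr=None,
                             music=None, sfx=None, env=None):
    """Join per-view descriptions behind their special tokens,
    skipping views that are absent from the scene."""
    views = [
        ("<|caption|>", caption),  # global scene description
        ("<|speech|>", speech),    # speaker style, e.g. "warm male voice"
        ("<|asr|>", asr),          # spoken transcript
        ("<|music|>", music),      # musical content
        ("<|sfx|>", sfx),          # sound events
        ("<|env|>", env),          # acoustic environment / reverb
    ]
    return " ".join(f"{tok} {text}" for tok, text in views if text)

prompt = build_structured_caption(
    caption="A street performer sings over light rain",
    speech="warm male voice, slight rasp",
    asr="thank you all for stopping by",
    music="acoustic guitar, mid tempo",
    sfx="rain on pavement, distant traffic",
    env="open street, light natural reverb",
)
```

Decomposing the scene this way is what lets the generator be steered per component, e.g. changing only the `<|music|>` field while keeping the transcript fixed.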

Listening Demos

Selected Audio Scenes

Browse generated examples across mixed audio, clean speech, music, and sound effects.


Mixed Audio