About
I am a Master’s student in Computer Science (graduating 2028.3) at Shanghai Jiao Tong University (SJTU), working in the X-LANCE Lab under the supervision of Prof. Mengyue Wu and Prof. Kai Yu.
I received my B.E. in Computer Science from East China Normal University (ECNU) in 2025, where I worked in the ICALK Lab under the supervision of Prof. Daoguo Dong and Prof. Liang He.
My research interests lie in general audio understanding & generation and LLMs. Feel free to contact me via WeChat or email.
Research Interests
- General audio understanding & generation, including unified understanding and controllable generation of speech, music, and sound effects
- Audio LLM, including perception, understanding, reasoning, generation, and evaluation
- Multimodal generation & alignment
- EEG representation learning & affective computing
Education
Shanghai Jiao Tong University, Computer Science, M.S., 2025.9 - 2028.3
X-LANCE Lab, advised by Prof. Mengyue Wu and Prof. Kai Yu
East China Normal University, Computer Science, B.E., 2021.9 - 2025.3
- ICALK Lab, advised by Prof. Daoguo Dong and Prof. Liang He
Work Experience
Xiaomi Inc. (Beijing), XiaoAI PLUS · Speech Team, Algorithm Engineer, 2026.1 - 2026.7
- Advised by Heinrich Dinkel
- End-to-end mixed audio (speech, music, sound effects) generation system pretraining & optimization Dasheng AudioGen
Alibaba Group (Hangzhou), Tongyi Lab · NLP Team, Algorithm Engineer, 2024.1 - 2024.8
- Advised by Yuning Wu and Ming Yan
- Multi-agent storybook generation system MM-StoryAgent, deployed as product
- LLM creative writing evaluation benchmark WritingBench
- Unified audio generation framework UniFlow-Audio
Publications
Unified Audio Understanding & Generation
Jiahao Mei, Heinrich Dinkel, Yadong Niu, et al. “Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text.” arXiv, 2026. (Under review at NeurIPS)
- A unified text-to-audio framework for general audio scene generation, achieving end-to-end collaborative generation of speech, music, sound effects, and environmental sounds with performance approaching real recordings.
Xuenan Xu*, Jiahao Mei*, Zihao Zheng, et al. “UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities.” arXiv, 2025. (Under review at NeurIPS)
- The first fully open-source Flow Matching-based unified audio generation framework with a novel Dual-Fusion mechanism for both Time-Align and Non-Time-Align tasks. Supports text, audio, and video inputs across 7 tasks (TTS, TTA, V2A, etc.).
Heinrich Dinkel, Xingwei Sun, Gang Li, Jiahao Mei, et al. “DashengTokenizer: One Layer is Enough for Unified Audio Understanding and Generation.” arXiv, 2026. (Under review at NeurIPS)
- A unified continuous audio tokenizer for understanding and generation, significantly outperforming mainstream codec/tokenizer baselines on speech, music, and environmental sound tasks.
Controllable Music Generation
Jialing Zou*, Jiahao Mei*, Xudong Nan, et al. “TEAdapter: Supply Vivid Guidance for Controllable Text-to-Music Generation.” IEEE ICME, 2024.
- A lightweight plugin for diffusion models enabling fine-grained controllable music generation through chord, melody, and instrument features with structure-function-based multi-adapter collaboration.
Jiahao Mei, Xuenan Xu, Zeyu Xie, et al. “LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment.” arXiv, 2025.
- Latent Affective Representation Alignment for continuous fine-grained emotion control in music generation, accepting continuous valence-arousal values as input to decouple emotion from musical content.
Multimodal Generation
Xuenan Xu, Jiahao Mei, Chenliang Li, et al. “MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio.” NAACL, 2025.
- Open-source multi-modal multi-agent story video generation framework. 85K+ visits on ModelScope, deployed as product.
Kaiyuan Liu, Jiahao Mei, Hengyu Zhang, et al. “Moyun: A Diffusion-Based Model for Style-Specific Chinese Calligraphy Generation.” ACM MM Workshop, 2025.
- Vision Mamba and TripleLabel-based Chinese calligraphy generation model trained on 1.9M+ calligraphy images (Mobao dataset).
Evaluation & Datasets
Yuning Wu, Jiahao Mei, Ming Yan, et al. “WritingBench: A Comprehensive Benchmark for Generative Writing.” NeurIPS, 2025.
- A comprehensive benchmark covering 6 domains, 100 sub-domains, and 1239 queries with a dynamic evaluation framework achieving 83% human agreement.
Jialing Zou*, Jiahao Mei*, Guangze Ye, et al. “EMID: An Emotional Aligned Dataset in Audio-Visual Modality.” ACM MM Workshop, 2023.
- A high-quality music-image cross-modal matching dataset (30K+ pairs) using emotional consistency as the basis for cross-modal alignment.
Awards & Honors
- Excellent Graduate, Department of Education, Shanghai Province, China, 2024
- China International College Students’ Innovation Competition (formerly Internet plus), Shanghai Gold Award (Team Leader), 2024
- National Scholarship for Undergraduate, Ministry of Education, China, 2024
- Huaxin Scholarship, East China Normal University, 2024
- Nezha Scholarship, East China Normal University, 2023
