Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs, whose nodes are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer integrates these nodes and produces contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with broad semantic diversity. Furthermore, we present consensus preference optimization, which facilitates audio generation through consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
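To make the event-graph guidance concrete, below is a minimal sketch of how timestamped events from a prompt could be encoded as graph nodes (semantic feature plus onset/offset attributes) and contextualized before conditioning a diffusion model. This is an illustrative assumption-based sketch, not the paper's implementation: the class name `EventGraphEncoder`, all dimensions, the use of a standard transformer encoder as a stand-in for the graph transformer, and the random embeddings replacing a real text encoder are all hypothetical.

```python
# Hypothetical sketch: event nodes = semantic embedding + (onset, offset);
# self-attention over nodes stands in for the graph transformer that models
# inter-event connections. Not the authors' code.
import torch
import torch.nn as nn

class EventGraphEncoder(nn.Module):
    def __init__(self, text_dim=64, hidden_dim=128, num_layers=2, num_heads=4):
        super().__init__()
        # Project concatenated semantic + temporal node features to hidden_dim.
        self.node_proj = nn.Linear(text_dim + 2, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.graph_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_emb, onsets, offsets):
        # text_emb: (B, N, text_dim) semantic features, one per event node
        # onsets/offsets: (B, N) timestamps in seconds
        temporal = torch.stack([onsets, offsets], dim=-1)          # (B, N, 2)
        nodes = self.node_proj(torch.cat([text_emb, temporal], dim=-1))
        # Every node attends to every other node, capturing inter-event relations.
        return self.graph_transformer(nodes)                        # (B, N, hidden_dim)

# Toy usage: two events, e.g. a cap gun at 1.0-1.4 s and a door at 4.9-5.3 s.
enc = EventGraphEncoder()
text_emb = torch.randn(1, 2, 64)            # stand-in for real text-encoder embeddings
onsets = torch.tensor([[1.0, 4.9]])
offsets = torch.tensor([[1.4, 5.3]])
event_embeddings = enc(text_emb, onsets, offsets)  # would condition the diffusion model
print(event_embeddings.shape)                      # torch.Size([1, 2, 128])
```

In this sketch, the contextualized node embeddings would be injected into the diffusion transformer (e.g., via cross-attention) as the guidance signal described above.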
Audio demos: each example below pairs a ground-truth clip with the corresponding DegDiT (Ours) generation for the given prompt.

Prompt: A cap gun fires from 1.045 to 1.438 seconds and again from 2.46 to 2.775 seconds, followed by a door sound from 4.91 to 5.288 seconds.
Prompt: Waves and surf can be heard continuously from 1.448 to 10.0 seconds.
Prompt: A horse neighs and whinnies at intervals from 1.468 to 3.326 seconds, again from 3.852 to 5.71 seconds, and from 6.671 to 8.529 seconds.
Prompt: Fireworks erupt from 0.724 to 2.841 seconds, then again from 3.558 to 5.675 seconds, and from 6.468 to 8.585 seconds.
Prompt: A ding occurs from 1.705 to 2.013 seconds, followed by croaking sounds from 4.671 to 5.325 seconds and 6.006 to 6.66 seconds.
Prompt: A police car siren sounds from 1.488 to 5.669 seconds, and again from 6.754 to 10.0 seconds.
Prompt: A sneeze occurs from 1.495 to 2.33 seconds and then again from 2.921 to 3.756 seconds.
Prompt: A dog barks from 2.158 to 2.417 seconds and again from 3.88 to 4.139 seconds, while keys jangle from 6.484 to 9.059 seconds.
Prompt: Biting occurs from 1.469 to 2.358 seconds, followed by the sound of filing (rasp) from 4.091 to 6.531 seconds.
Prompt: A cat meows twice, from 1.025 to 1.82 seconds and again from 2.992 to 3.693 seconds, typewriter keys clack from 5.517 to 6.965 seconds, and a series of beeps occurs, first from 8.272 to 8.734 seconds and then from 9.29 to 9.518 seconds.
Experiments on the AudioCondition and DESED datasets.
Case study comparing TangoFlux, AudioComposer, and DegDiT.
@article{liu2025degdit,
title = {DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer},
author = {Yisu Liu and Chenxing Li and Wanqian Zhang and Wenfu Wang and Meng Yu and Ruibo Fu and Zheng Lin and Weiping Wang and Dong Yu},
journal = {arXiv preprint arXiv:2508.13786},
year = {2025}
}