Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs, whose nodes are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer integrates these nodes and produces contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with broad semantic diversity. Furthermore, we present consensus preference optimization, which facilitates audio generation through consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
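To make the event-graph guidance concrete, below is a minimal sketch of how timestamped events from a prompt could be encoded as graph nodes (semantic feature plus onset/offset attributes) and contextualized before conditioning a diffusion model. This is an illustrative assumption-based sketch, not the paper's implementation: the class name `EventGraphEncoder`, all dimensions, the use of a standard transformer encoder as a stand-in for the graph transformer, and the random embeddings replacing a real text encoder are all hypothetical.

```python
# Hypothetical sketch: event nodes = semantic embedding + (onset, offset);
# self-attention over nodes stands in for the graph transformer that models
# inter-event connections. Not the authors' code.
import torch
import torch.nn as nn

class EventGraphEncoder(nn.Module):
    def __init__(self, text_dim=64, hidden_dim=128, num_layers=2, num_heads=4):
        super().__init__()
        # Project concatenated semantic + temporal node features to hidden_dim.
        self.node_proj = nn.Linear(text_dim + 2, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.graph_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_emb, onsets, offsets):
        # text_emb: (B, N, text_dim) semantic features, one per event node
        # onsets/offsets: (B, N) timestamps in seconds
        temporal = torch.stack([onsets, offsets], dim=-1)          # (B, N, 2)
        nodes = self.node_proj(torch.cat([text_emb, temporal], dim=-1))
        # Every node attends to every other node, capturing inter-event relations.
        return self.graph_transformer(nodes)                        # (B, N, hidden_dim)

# Toy usage: two events, e.g. a cap gun at 1.0-1.4 s and a door at 4.9-5.3 s.
enc = EventGraphEncoder()
text_emb = torch.randn(1, 2, 64)            # stand-in for real text-encoder embeddings
onsets = torch.tensor([[1.0, 4.9]])
offsets = torch.tensor([[1.4, 5.3]])
event_embeddings = enc(text_emb, onsets, offsets)  # would condition the diffusion model
print(event_embeddings.shape)                      # torch.Size([1, 2, 128])
```

In this sketch, the contextualized node embeddings would be injected into the diffusion transformer (e.g., via cross-attention) as the guidance signal described above.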
Audio demos: each example below pairs a ground-truth clip with the corresponding DegDiT (Ours) generation for the given prompt.

Prompt: A cap gun fires from 1.045 to 1.438 seconds and again from 2.46 to 2.775 seconds, followed by a door sound from 4.91 to 5.288 seconds.
Prompt: Waves and surf can be heard continuously from 1.448 to 10.0 seconds.
Prompt: A horse neighs and whinnies at intervals from 1.468 to 3.326 seconds, again from 3.852 to 5.71 seconds, and from 6.671 to 8.529 seconds.
Prompt: Fireworks erupt from 0.724 to 2.841 seconds, then again from 3.558 to 5.675 seconds, and from 6.468 to 8.585 seconds.
Prompt: A ding occurs from 1.705 to 2.013 seconds, followed by croaking sounds from 4.671 to 5.325 seconds and 6.006 to 6.66 seconds.
Prompt: A police car siren sounds from 1.488 to 5.669 seconds, and again from 6.754 to 10.0 seconds.
Prompt: A sneeze occurs from 1.495 to 2.33 seconds and then again from 2.921 to 3.756 seconds.
Prompt: A dog barks from 2.158 to 2.417 seconds and again from 3.88 to 4.139 seconds, while keys jangle from 6.484 to 9.059 seconds.
Prompt: Biting occurs from 1.469 to 2.358 seconds, followed by the sound of filing (rasp) from 4.091 to 6.531 seconds.
Prompt: A cat meows twice, from 1.025 to 1.82 seconds and again from 2.992 to 3.693 seconds, typewriter keys clack from 5.517 to 6.965 seconds, and a series of beeps occurs, first from 8.272 to 8.734 seconds and then from 9.29 to 9.518 seconds.
Experiments on the AudioCondition and DESED datasets.
Case study comparing TangoFlux, AudioComposer, and DegDiT.
@article{liu2025degdit,
title = {DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer},
author = {Yisu Liu and Chenxing Li and Wanqian Zhang and Wenfu Wang and Meng Yu and Ruibo Fu and Zheng Lin and Weiping Wang and Dong Yu},
journal = {arXiv preprint arXiv:2508.13786},
year = {2025}
}