
TunaDance

Music-to-dance generation with a Gradio web UI and single-command CLI.

TunaDance builds on FineDance (ICCV 2023), a diffusion-based model that generates full-body 3D dance from music. This fork fine-tunes the original model on additional data and for more epochs beyond the original 2000, and adds a user-friendly interface layer and macOS support so you can go from an audio file to a rendered dance video without touching the inference internals.

[Original Paper] | [Original Project Page] | [Original Repo]

What's New (vs. upstream FineDance)

  • Gradio Web UI (app.py): upload music in the browser and get a dance video back. No CLI knowledge required.
  • Single-command CLI (generate_dance.py): one command handles the full pipeline of audio feature extraction, diffusion sampling, SMPLX rendering, and audio-video muxing.
  • macOS / MPS support: updated render.py, vis.py, and the inference code to run on Apple Silicon via MPS, with a dedicated environment_macos.yaml.
  • Accepts any audio format: input is automatically converted to WAV via ffmpeg (.mp3, .wav, .m4a, .flac, .ogg, etc.).
  • Fine-tuned checkpoint: the original FineDance model, fine-tuned on additional data and for more epochs beyond the original 2000, improving dance quality and diversity.
  • Cleaned-up repo: removed wandb logs, debug scripts, and hardcoded paths.
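As a rough illustration of the audio-conversion feature, a helper like the one below could build and run the ffmpeg command that normalizes arbitrary input formats to WAV. This is a hedged sketch, not the actual code in generate_dance.py: the function names and the 22050 Hz sample rate are assumptions.

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 22050) -> list[str]:
    """Return an ffmpeg argv that converts `src` to a mono WAV.

    -y overwrites the output, -ac 1 downmixes to mono, -ar sets the sample rate.
    """
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]

def to_wav(src: str, out_dir: str = ".") -> str:
    """Convert `src` to WAV under `out_dir`, skipping files already in WAV."""
    src_path = Path(src)
    if src_path.suffix.lower() == ".wav":
        return str(src_path)
    dst = str(Path(out_dir) / (src_path.stem + ".wav"))
    subprocess.run(build_ffmpeg_cmd(str(src_path), dst), check=True)
    return dst
```

Keeping the command construction separate from the subprocess call makes the conversion easy to test without ffmpeg installed.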

Model Details

| Property | Value |
| --- | --- |
| Architecture | Transformer decoder with Gaussian diffusion |
| Input | 35-dim audio features (onset, 20 MFCC, 12 chroma, peak/beat one-hot) per 4 s window |
| Output | SMPLX body motion, 319-dim (4 contact + 3 translation + 52 joints × 6D rotation) |
| Checkpoint | assets/checkpoints/train-2000.pt (fine-tuned beyond 2000 epochs on additional data) |
| Body model | SMPLX (full body with hands) |
| Training data | FineDance dataset (7.7 hours of music-dance pairs) |
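The dimensionalities in the table can be sanity-checked with a small sketch. The placeholder arrays below exist only to illustrate how the 35-dim audio feature and the 319-dim motion vector break down per frame; the component ordering follows the table above, not the dataset code.

```python
import numpy as np

# 35-dim audio feature per frame: 1 onset-strength value, 20 MFCCs,
# 12 chroma bins, and one-hot peak / beat indicators (1 + 1).
audio_feat = np.concatenate([
    np.zeros(1),   # onset strength
    np.zeros(20),  # MFCC
    np.zeros(12),  # chroma
    np.zeros(1),   # peak one-hot
    np.zeros(1),   # beat one-hot
])

# 319-dim motion vector per frame: 4 foot-contact labels, 3 root-translation
# values, and 52 SMPLX joints in the 6D rotation representation.
motion = np.concatenate([
    np.zeros(4),       # contact
    np.zeros(3),       # translation
    np.zeros(52 * 6),  # joint rotations (6D)
])
```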

Quick Start

Prerequisites

```bash
# Install the conda environment
conda env create -f environment.yaml         # Linux / CUDA
conda env create -f environment_macos.yaml   # macOS (Apple Silicon)

conda activate FineNet
```

Download the pretrained checkpoint and SMPLX model from Google Drive and place them under assets/.

Web UI (Recommended)

```bash
python app.py
```

Open http://127.0.0.1:7861 in your browser. Upload a music file and click Generate Dance.

Command Line

```bash
python generate_dance.py /path/to/music.mp3
```

Output is saved to output/<songname>_dance.mp4. Use --output for a custom path:

```bash
python generate_dance.py /path/to/music.mp3 --output my_dance.mp4
```

Output Specs

| Property | Value |
| --- | --- |
| Resolution | 1200 × 1200 |
| Frame rate | 30 fps |
| Duration | ~30 seconds |
| Body model | SMPLX (full body with hands) |

How It Works

  1. Audio conversion: converts the input to WAV if needed, via ffmpeg
  2. Feature extraction: slices the audio into 4 s windows (2 s stride) and extracts 35-dim features using librosa
  3. Dance generation: the diffusion model generates an SMPLX motion sequence from the audio features
  4. Rendering: converts the motion to SMPLX meshes and renders 900 frames at 30 fps with pyrender
  5. Muxing: merges the rendered video with the original audio via ffmpeg
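Step 2 above slices the audio into overlapping windows. A minimal sketch of that slicing, using the window and stride lengths stated in the text (the function name is an assumption, not taken from the repo):

```python
def window_starts(total_secs: float, window: float = 4.0, stride: float = 2.0):
    """Start times (in seconds) of 4 s analysis windows taken every 2 s.

    Only windows that fit entirely inside the audio are kept.
    """
    starts = []
    t = 0.0
    while t + window <= total_secs:
        starts.append(t)
        t += stride
    return starts
```

For a ~30 second clip this yields 14 windows (starts at 0, 2, ..., 26 s), each of which is featurized independently before diffusion sampling.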

Training

Training is only needed if you want to train from scratch; the pretrained checkpoint is already provided.

```bash
python data/code/pre_motion.py                               # preprocess
accelerate launch train_seq.py --batch_size 32 --epochs 200  # train
```

Key flags:

  • --batch_size: default 400; reduce to 32 or lower for Mac MPS
  • --epochs: default 2000
  • --checkpoint: resume from a saved checkpoint

FineDance Dataset

The dataset (7.7 hours) is available from Google Drive or Baidu Cloud. Place it under ./data.

```python
import numpy as np

data = np.load("motion/001.npy")  # one motion clip, shape (frames, dims)
smpl_trans = data[:, :3]          # root translation per frame
smpl_poses = data[:, 3:]          # joint rotations per frame
```
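To make the slicing above concrete, here is a self-contained round trip on dummy data. The 15-column width is arbitrary and chosen only for illustration; only the `[:, :3]` / `[:, 3:]` split mirrors the loading snippet.

```python
import numpy as np

# Dummy motion array: 5 frames, 3 translation columns + 12 pose columns.
data = np.arange(5 * 15, dtype=np.float64).reshape(5, 15)

smpl_trans = data[:, :3]  # root translation per frame
smpl_poses = data[:, 3:]  # joint rotations per frame

# The two slices partition each frame exactly, so stacking them
# back together recovers the original array.
rebuilt = np.hstack([smpl_trans, smpl_poses])
```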

Two dataset splits are provided:

  • FineDance@Genre (recommended): broader genre coverage in the test set
  • FineDance@Dancer: splits by dancer identity

Project Structure

```text
TunaDance/
├── app.py                   # Gradio web UI  [NEW]
├── generate_dance.py        # End-to-end CLI [NEW]
├── environment_macos.yaml   # macOS conda env [NEW]
├── train_seq.py             # Training script
├── test.py                  # Original inference script
├── render.py                # SMPLX mesh rendering (updated for MPS)
├── vis.py                   # Skeleton/FK utilities (updated for MPS)
├── args.py                  # CLI argument definitions
├── assets/
│   ├── checkpoints/
│   │   └── train-2000.pt    # Pretrained model (2000 epochs)
│   └── smpl_model/
│       └── smplx/
│           └── SMPLX_NEUTRAL.npz
├── model/
│   ├── model.py             # SeqModel (transformer decoder)
│   └── diffusion.py         # Gaussian diffusion
├── dataset/
│   └── FineDance_dataset.py
└── data/
    └── finedance/           # Training data (music + motion pairs)
```

Acknowledgments

This project is built on FineDance by Li et al. We thank the original authors for their work.

Upstream acknowledgments: EDGE, MDM, Adan, Diffusion, SMPLX.

Citation

```bibtex
@inproceedings{li2023finedance,
  title={FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation},
  author={Li, Ronghui and Zhao, Junfan and Zhang, Yachao and Su, Mingyang and Ren, Zeping and Zhang, Han and Tang, Yansong and Li, Xiu},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10234--10243},
  year={2023}
}
```