
TunaDance

Music-to-dance generation with a Gradio web UI and single-command CLI.

TunaDance builds on FineDance (ICCV 2023), a diffusion-based model that generates full-body 3D dance from music. This fork fine-tunes the original model on additional data and for more epochs beyond the original 2000, and adds a user-friendly interface layer and macOS support so you can go from an audio file to a rendered dance video without touching the inference internals.

[Original Paper] | [Original Project Page] | [Original Repo]

What's New (vs. upstream FineDance)

  • Gradio Web UI (app.py): upload music in the browser and get a dance video back. No CLI knowledge required.
  • Single-command CLI (generate_dance.py): one command handles the full pipeline of audio feature extraction, diffusion sampling, SMPLX rendering, and audio-video muxing.
  • macOS / MPS support: updated render.py, vis.py, and the inference code to run on Apple Silicon via MPS, with a dedicated environment_macos.yaml.
  • Accepts any audio format: input is automatically converted to WAV via ffmpeg (.mp3, .wav, .m4a, .flac, .ogg, etc.).
  • Fine-tuned checkpoint: the original FineDance model, fine-tuned on additional data and for more epochs beyond the original 2000, improving dance quality and diversity.
  • Cleaned-up repo: removed wandb logs, debug scripts, and hardcoded paths.
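As a rough illustration of the audio-conversion feature, a helper like the one below could build and run the ffmpeg command that normalizes arbitrary input formats to WAV. This is a hedged sketch, not the actual code in generate_dance.py: the function names and the 22050 Hz sample rate are assumptions.

```python
import subprocess
from pathlib import Path

def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 22050) -> list[str]:
    """Return an ffmpeg argv that converts `src` to a mono WAV.

    -y overwrites the output, -ac 1 downmixes to mono, -ar sets the sample rate.
    """
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]

def to_wav(src: str, out_dir: str = ".") -> str:
    """Convert `src` to WAV under `out_dir`, skipping files already in WAV."""
    src_path = Path(src)
    if src_path.suffix.lower() == ".wav":
        return str(src_path)
    dst = str(Path(out_dir) / (src_path.stem + ".wav"))
    subprocess.run(build_ffmpeg_cmd(str(src_path), dst), check=True)
    return dst
```

Keeping the command construction separate from the subprocess call makes the conversion easy to test without ffmpeg installed.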

Model Details

| Property | Value |
| --- | --- |
| Architecture | Transformer decoder with Gaussian diffusion |
| Input | 35-dim audio features (onset, 20 MFCC, 12 chroma, peak/beat one-hot) per 4 s window |
| Output | SMPLX body motion, 319-dim (4 contact + 3 translation + 52 joints × 6D rotation) |
| Checkpoint | assets/checkpoints/train-2000.pt (fine-tuned beyond 2000 epochs on additional data) |
| Body model | SMPLX (full body with hands) |
| Training data | FineDance dataset (7.7 hours of music-dance pairs) |
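The dimensionalities in the table can be sanity-checked with a small sketch. The placeholder arrays below exist only to illustrate how the 35-dim audio feature and the 319-dim motion vector break down per frame; the component ordering follows the table above, not the dataset code.

```python
import numpy as np

# 35-dim audio feature per frame: 1 onset-strength value, 20 MFCCs,
# 12 chroma bins, and one-hot peak / beat indicators (1 + 1).
audio_feat = np.concatenate([
    np.zeros(1),   # onset strength
    np.zeros(20),  # MFCC
    np.zeros(12),  # chroma
    np.zeros(1),   # peak one-hot
    np.zeros(1),   # beat one-hot
])

# 319-dim motion vector per frame: 4 foot-contact labels, 3 root-translation
# values, and 52 SMPLX joints in the 6D rotation representation.
motion = np.concatenate([
    np.zeros(4),       # contact
    np.zeros(3),       # translation
    np.zeros(52 * 6),  # joint rotations (6D)
])
```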

Quick Start

Prerequisites

```bash
# Install the conda environment
conda env create -f environment.yaml         # Linux / CUDA
conda env create -f environment_macos.yaml   # macOS (Apple Silicon)

conda activate FineNet
```

Download the pretrained checkpoint and SMPLX model from Google Drive and place them under assets/.

Web UI (Recommended)

```bash
python app.py
```

Open http://127.0.0.1:7861 in your browser. Upload a music file and click Generate Dance.

Command Line

```bash
python generate_dance.py /path/to/music.mp3
```

Output is saved to output/<songname>_dance.mp4. Use --output for a custom path:

```bash
python generate_dance.py /path/to/music.mp3 --output my_dance.mp4
```

Output Specs

| Property | Value |
| --- | --- |
| Resolution | 1200 × 1200 |
| Frame rate | 30 fps |
| Duration | ~30 seconds |
| Body model | SMPLX (full body with hands) |

How It Works

  1. Audio conversion: converts the input to WAV if needed, via ffmpeg
  2. Feature extraction: slices the audio into 4 s windows (2 s stride) and extracts 35-dim features using librosa
  3. Dance generation: the diffusion model generates an SMPLX motion sequence from the audio features
  4. Rendering: converts the motion to SMPLX meshes and renders 900 frames at 30 fps with pyrender
  5. Muxing: merges the rendered video with the original audio via ffmpeg
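Step 2 above slices the audio into overlapping windows. A minimal sketch of that slicing, using the window and stride lengths stated in the text (the function name is an assumption, not taken from the repo):

```python
def window_starts(total_secs: float, window: float = 4.0, stride: float = 2.0):
    """Start times (in seconds) of 4 s analysis windows taken every 2 s.

    Only windows that fit entirely inside the audio are kept.
    """
    starts = []
    t = 0.0
    while t + window <= total_secs:
        starts.append(t)
        t += stride
    return starts
```

For a ~30 second clip this yields 14 windows (starts at 0, 2, ..., 26 s), each of which is featurized independently before diffusion sampling.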

Training

Training is only needed if you want to train from scratch; the pretrained checkpoint is already provided.

```bash
python data/code/pre_motion.py                               # preprocess
accelerate launch train_seq.py --batch_size 32 --epochs 200  # train
```

Key flags:

  • --batch_size: default 400; reduce to 32 or lower for Mac MPS
  • --epochs: default 2000
  • --checkpoint: resume from a saved checkpoint

FineDance Dataset

The dataset (7.7 hours) is available from Google Drive or Baidu Cloud. Place it under ./data.

```python
import numpy as np

data = np.load("motion/001.npy")  # one motion clip, shape (frames, dims)
smpl_trans = data[:, :3]          # root translation per frame
smpl_poses = data[:, 3:]          # joint rotations per frame
```
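To make the slicing above concrete, here is a self-contained round trip on dummy data. The 15-column width is arbitrary and chosen only for illustration; only the `[:, :3]` / `[:, 3:]` split mirrors the loading snippet.

```python
import numpy as np

# Dummy motion array: 5 frames, 3 translation columns + 12 pose columns.
data = np.arange(5 * 15, dtype=np.float64).reshape(5, 15)

smpl_trans = data[:, :3]  # root translation per frame
smpl_poses = data[:, 3:]  # joint rotations per frame

# The two slices partition each frame exactly, so stacking them
# back together recovers the original array.
rebuilt = np.hstack([smpl_trans, smpl_poses])
```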

Two dataset splits are provided:

  • FineDance@Genre (recommended): broader genre coverage in the test set
  • FineDance@Dancer: splits by dancer identity

Project Structure

```text
TunaDance/
├── app.py                   # Gradio web UI  [NEW]
├── generate_dance.py        # End-to-end CLI [NEW]
├── environment_macos.yaml   # macOS conda env [NEW]
├── train_seq.py             # Training script
├── test.py                  # Original inference script
├── render.py                # SMPLX mesh rendering (updated for MPS)
├── vis.py                   # Skeleton/FK utilities (updated for MPS)
├── args.py                  # CLI argument definitions
├── assets/
│   ├── checkpoints/
│   │   └── train-2000.pt    # Pretrained model (2000 epochs)
│   └── smpl_model/
│       └── smplx/
│           └── SMPLX_NEUTRAL.npz
├── model/
│   ├── model.py             # SeqModel (transformer decoder)
│   └── diffusion.py         # Gaussian diffusion
├── dataset/
│   └── FineDance_dataset.py
└── data/
    └── finedance/           # Training data (music + motion pairs)
```

Acknowledgments

This project is built on FineDance by Li et al. We thank the original authors for their work.

Upstream acknowledgments: EDGE, MDM, Adan, Diffusion, SMPLX.

Citation

```bibtex
@inproceedings{li2023finedance,
  title={FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation},
  author={Li, Ronghui and Zhao, Junfan and Zhang, Yachao and Su, Mingyang and Ren, Zeping and Zhang, Han and Tang, Yansong and Li, Xiu},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={10234--10243},
  year={2023}
}
```