
KV-Control (T-Concat v4 backbone)

Sparse-keyframe, multi-joint controllable text-to-motion generation. The repository at github.com/Tevior/KV-Control contains the full training and inference code.

What is here

| Path | Content | Size |
|---|---|---|
| `base_t_concat_v4/model/net_best_fid.tar` | Pre-trained T-Concat v4 masked-transformer base (the paper's main backbone) | 168 MB |
| `kv_control/model/net_best_kps.tar` | KV-Control adapter trained on the base above | 520 MB |
| `vqvae/net_best_fid.pth` | Part-aware VQ-VAE tokenizer (128 codes × 6 parts) | 236 MB |
| `vqvae/skeleton_partition.json` | Skeleton partition for the part-aware VQ | 1 KB |
| `stats/{mean,std}.npy` | Normalization stats matching the released VQ | 4 KB |
| `clip/ViT-B-32.pt` | OpenAI CLIP ViT-B/32 visual + text encoder | 336 MB |
| `t2m/Comp_v6_KLD005/opt.txt` + `meta/` | Frozen evaluation encoder config & stats | 3 KB |
| `t2m/text_mot_match/model/finest.tar` | Pre-trained text-motion eval encoder (Guo et al., 2022) | 235 MB |
| `t2m/length_estimator/model/finest.tar` | Pre-trained motion-length predictor | 1.7 MB |
| `aux/body_models/` | SMPL neutral mesh + face / J_regressor (SMPL license) | 234 MB |
| `aux/glove/` | Vocab files for the length estimator | 10 MB |
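The `stats/{mean,std}.npy` files hold per-dimension normalization statistics that are applied to raw motion features before tokenization. A minimal sketch of the usual z-normalization round-trip (the 263-dim HumanML3D-style feature layout and the random stand-in data are assumptions; the repository's data pipeline is authoritative):

```python
import numpy as np

def normalize(motion, mean, std, eps=1e-8):
    """Z-normalize raw motion features with the released stats."""
    return (motion - mean) / (std + eps)

def denormalize(motion, mean, std, eps=1e-8):
    """Invert the normalization to recover raw features."""
    return motion * (std + eps) + mean

# Stand-in stats; in practice load stats/mean.npy and stats/std.npy.
mean = np.random.randn(263).astype(np.float32)
std = np.abs(np.random.randn(263)).astype(np.float32) + 0.5
motion = np.random.randn(60, 263).astype(np.float32)  # 60 frames

norm = normalize(motion, mean, std)
recovered = denormalize(norm, mean, std)
```

Using stats that match the released VQ-VAE matters: a tokenizer trained on normalized inputs will produce degraded codes if fed features scaled with different statistics.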

How to use

git clone https://github.com/Tevior/KV-Control.git
cd KV-Control
bash scripts/download_checkpoints.sh   # populates checkpoints/, aux/ → glove/, body_models/

Refer to the GitHub README for installation and quick-start commands.
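Despite the `.tar` extension, checkpoints like `net_best_fid.tar` are typically `torch.save` archives rather than tarballs. A hedged inspection sketch (the dummy checkpoint written here is a stand-in; substitute the real path, e.g. `checkpoints/base_t_concat_v4/model/net_best_fid.tar`, and check the repo's inference code for the actual key names):

```python
import torch

# Write a tiny stand-in checkpoint so the snippet is self-contained;
# the released files follow the same torch.save format.
torch.save({"net": {"weight": torch.zeros(4)}, "epoch": 0}, "net_best_fid.tar")

# Load on CPU first and inspect the keys before restoring a model.
ckpt = torch.load("net_best_fid.tar", map_location="cpu")
print(sorted(ckpt.keys()))  # ['epoch', 'net']
```

Loading with `map_location="cpu"` avoids device errors when the checkpoint was saved on a GPU machine.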

Licenses

  • Our weights (base_t_concat_v4, kv_control, vqvae, stats): MIT.
  • CLIP ViT-B/32: released by OpenAI under MIT.
  • SMPL body model under aux/body_models/: original SMPL license (research use only).
  • Text-motion eval encoder and length estimator under t2m/: redistributed from the HumanML3D / Guo et al. (2022) release for reproducibility.

Citation

@article{kvcontrol2026,
  title  = {KV-Control: Sparse-Keyframe Multi-Joint Text-to-Motion Generation},
  author = {... (under review) ...},
  year   = {2026},
}