skeleton-gif
Deterministic text → skeleton-GIF generator. Any prompt produces a 512×512 looping GIF of a skeleton performing the requested action, with an emotion visible on its face + body posture, in an optional scene backdrop.
Zero hallucination guarantee. No diffusion model is in the generation path. The only ML component is facebook/bart-large-mnli, used strictly for zero-shot text classification into a closed label set (92 actions × 10 emotions × 78 scenes). Rendering is pure procedural PIL code: we draw every bone, every joint, every backdrop ourselves.
Quick start
```python
from skeleton_gif_model import SkeletonGif

model = SkeletonGif.from_pretrained("ocmannazirbriet/skeleton-gif")  # or local dir
out = model("a sad man reading a book in a bedroom")
print(out.action, out.emotion, out.scene)  # 'reading' 'sad' 'bedroom'
out.save("result.gif")
```
Install
```shell
pip install -r requirements.txt
```
Dependencies: pillow, transformers, torch, huggingface_hub.
On first call, the underlying facebook/bart-large-mnli model is downloaded (~1.6 GB) and cached. Prompts that match the built-in keyword rules skip the classifier entirely.
How it works
prompt ──(keyword match / BART zero-shot)──▶ (action, emotion, scene)
       ──(procedural PIL render, 24 frames)──▶ .gif
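The confidence gating on the classification step can be sketched as follows. This is an illustration, not the library's code: `gate` is a hypothetical helper, and the classifier score is passed in directly rather than running the BART zero-shot pipeline, so the routing logic can be shown in isolation. The threshold and default values are the ones documented in this section.

```python
# Sketch of the per-channel confidence gating (hypothetical helper, not the
# library's API). Thresholds and defaults are taken from the description below.
THRESHOLDS = {"action": 0.35, "emotion": 0.55, "scene": 0.55}
DEFAULTS = {"action": "standing_idle", "emotion": "neutral", "scene": "none"}

def gate(channel: str, label: str, score: float) -> str:
    """Accept the classifier's top label only if it clears the channel threshold."""
    return label if score >= THRESHOLDS[channel] else DEFAULTS[channel]
```

Under this sketch, a low-confidence scene call such as `gate("scene", "casino", 0.40)` falls back to `"none"`, while `gate("action", "reading", 0.80)` passes through unchanged.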
- Prompt parsing. SkeletonGif.__call__ routes the prompt through a deterministic keyword matcher; if nothing unambiguous hits, it falls back to zero-shot classification over the closed label set. Confidence thresholds on each channel (action ≥ 0.35, emotion ≥ 0.55, scene ≥ 0.55) send low-confidence calls to safe defaults (standing_idle / neutral / none).
- Skeleton keyframes. Each of the 92 actions is a pure function action_X(t ∈ [0, 1)) -> {joint_id: (x, y)} computing joint positions at frame time t, on a 15-joint OpenPose-style rig.
- Emotion transform. A post-processing pass reshapes posture (slouch / lean / tremble / bounce) and sets face parameters (mouth and eye shapes) per emotion.
- PIL render. Each frame: scene backdrop → bones → ribs + pelvis → joints → prop → skull w/ face.
- GIF export. 24 frames at 10 fps, looping.
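The keyframe → render → export path above can be sketched end to end. Everything here is illustrative: action_waving is a hypothetical keyframe function in the documented shape, only a few of the 15 joints are animated, and the drawing is far cruder than the real renderer. The GIF export parameters (24 frames, 10 fps, infinite loop) match this section.

```python
import io
import math
from PIL import Image, ImageDraw

# Hypothetical keyframe function in the documented shape:
# action_X(t in [0, 1)) -> {joint_id: (x, y)}. Only 4 joints for brevity.
def action_waving(t: float) -> dict:
    swing = math.sin(2 * math.pi * t) * 40  # arm swings once per loop
    return {
        "head": (256, 120),
        "neck": (256, 170),
        "pelvis": (256, 300),
        "r_hand": (330 + swing, 150),
    }

BONES = [("head", "neck"), ("neck", "pelvis"), ("neck", "r_hand")]

def render_frame(t: float) -> Image.Image:
    im = Image.new("RGB", (512, 512), "white")
    d = ImageDraw.Draw(im)
    joints = action_waving(t)
    for a, b in BONES:                       # bones as thick lines
        d.line([joints[a], joints[b]], fill="black", width=6)
    for x, y in joints.values():             # joints as dots
        d.ellipse([x - 8, y - 8, x + 8, y + 8], fill="black")
    return im

# 24 frames at 10 fps (duration=100 ms per frame), looping forever (loop=0).
frames = [render_frame(i / 24) for i in range(24)]
buf = io.BytesIO()
frames[0].save(buf, format="GIF", save_all=True,
               append_images=frames[1:], duration=100, loop=0)
gif_bytes = buf.getvalue()
```

Because every step is a pure function of t, re-running this snippet yields the same bytes every time, which is the property the real renderer relies on.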
Outputs
The model returns a SkeletonGifOutput dataclass:
| field | type | description |
|---|---|---|
| prompt | str | original input |
| action | str | one of 92 closed-set labels |
| emotion | str | one of 10 closed-set labels |
| scene | str | one of 78 closed-set labels (or "none") |
| gif_bytes | bytes | GIF payload; write directly or use .save(path) |
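The output type might be reconstructed roughly like this. It is a sketch, not the actual source: the field names follow the table above, and the save helper is assumed to simply write gif_bytes to disk.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkeletonGifOutput:
    # Field names follow the model card's table; this is a sketch, not the source.
    prompt: str
    action: str
    emotion: str
    scene: str
    gif_bytes: bytes

    def save(self, path: str) -> None:
        """Write the GIF payload directly to disk (assumed behavior)."""
        Path(path).write_bytes(self.gif_bytes)
```

Since gif_bytes is the complete payload, it can equally be streamed to an HTTP response or any other sink that accepts raw GIF bytes.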
Label sets
See config.json for the full canonical label lists. A few highlights:
- Actions: walking, dancing, reading, working, eating, vacuuming, cooking, football, cricket, basketball, climbing, swimming, playing_guitar, meditating, texting, shopping, bowing, hugging, … (92 total)
- Emotions: happy, sad, angry, tired, excited, neutral, scared, surprised, bored, confused
- Scenes: bedroom, kitchen, office, park, beach, church, space, stadium, museum, cemetery, castle, airport, casino, nightclub, farm, living_room, … (78 total, plus none)
Guarantees
- Output is always a .gif: PIL writes the bytes directly.
- Output is always a skeleton: procedural drawing; no diffusion model can drift.
- Emotion is always visible β on the face and in body posture.
- No hallucination. Classifications are chosen from a closed set; rendering is deterministic math. For any fixed (prompt, engine version) the output is bit-identical.
- No image-generation API. BART-MNLI is text-only; nothing ever hits a remote image service.
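The bit-identical claim can be checked mechanically by hashing the output of two runs and comparing digests. In this sketch the renderer is stubbed with an arbitrary pure function (ENGINE_VERSION and render_stub are invented names); the point is only that a pure function of (prompt, engine version) is reproducible by construction.

```python
import hashlib

ENGINE_VERSION = "0.1"  # hypothetical version tag, for illustration only

def render_stub(prompt: str) -> bytes:
    # Stand-in for the procedural renderer: a pure function of its inputs,
    # so repeated calls are bit-identical by construction.
    return f"GIF89a::{ENGINE_VERSION}::{prompt}".encode("utf-8")

def gif_digest(prompt: str) -> str:
    """SHA-256 of the rendered bytes; equal digests imply equal GIFs."""
    return hashlib.sha256(render_stub(prompt)).hexdigest()
```

Running gif_digest twice on the same prompt yields the same hex digest, which is how a regression test could pin the engine's output across releases.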
Limitations
- Art style is intentionally minimal (stick-figure-plus-skull). Fluid / photographic motion is not an output mode.
- Labels are a closed set. Unusual prompts snap to the nearest canonical label or to neutral defaults β not to novel concepts.
- Scene backdrops are schematic, not photorealistic.
Citation
If you use this in academic work, cite the closed-set procedural-rendering approach:

```bibtex
@software{skeleton_gif,
  title = {skeleton-gif: deterministic text-to-GIF with zero hallucination},
  year  = {2026},
  url   = {https://huggingface.co/ocmannazirbriet/skeleton-gif}
}
```
License
MIT.