skeleton-gif
Deterministic text → skeleton-GIF generator. Any prompt produces a 512×512 looping GIF of a skeleton performing the requested action, with an emotion visible on its face + body posture, in an optional scene backdrop.
Zero hallucination guarantee. No diffusion model is in the generation path. The only ML component is facebook/bart-large-mnli, used strictly for zero-shot text classification into a closed label set (92 actions × 10 emotions × 78 scenes). Rendering is pure procedural PIL code: we draw every bone, every joint, every backdrop ourselves.
Quick start
```python
from skeleton_gif_model import SkeletonGif

model = SkeletonGif.from_pretrained("ocmannazirbriet/skeleton-gif")  # or local dir
out = model("a sad man reading a book in a bedroom")
print(out.action, out.emotion, out.scene)  # 'reading' 'sad' 'bedroom'
out.save("result.gif")
```
Install
```shell
pip install -r requirements.txt
```
Dependencies: pillow, transformers, torch, huggingface_hub.
On first call, the underlying facebook/bart-large-mnli model is downloaded (~1.6 GB) and cached. Prompts that match the built-in keyword rules skip the classifier entirely.
How it works
prompt ──(keyword match / BART zero-shot)──▶ (action, emotion, scene)
       ──(procedural PIL render, 24 frames)──▶ .gif
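The confidence gating on the classification step can be sketched as follows. This is an illustration, not the library's code: `gate` is a hypothetical helper, and the classifier score is passed in directly rather than running the BART zero-shot pipeline, so the routing logic can be shown in isolation. The threshold and default values are the ones documented in this section.

```python
# Sketch of the per-channel confidence gating (hypothetical helper, not the
# library's API). Thresholds and defaults are taken from the description below.
THRESHOLDS = {"action": 0.35, "emotion": 0.55, "scene": 0.55}
DEFAULTS = {"action": "standing_idle", "emotion": "neutral", "scene": "none"}

def gate(channel: str, label: str, score: float) -> str:
    """Accept the classifier's top label only if it clears the channel threshold."""
    return label if score >= THRESHOLDS[channel] else DEFAULTS[channel]
```

Under this sketch, a low-confidence scene call such as `gate("scene", "casino", 0.40)` falls back to `"none"`, while `gate("action", "reading", 0.80)` passes through unchanged.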
- Prompt parsing. SkeletonGif.__call__ routes the prompt through a deterministic keyword matcher; if nothing unambiguous hits, it falls back to zero-shot classification over the closed label set. Confidence thresholds on each channel (action ≥ 0.35, emotion ≥ 0.55, scene ≥ 0.55) send low-confidence calls to safe defaults (standing_idle / neutral / none).
- Skeleton keyframes. Each of the 92 actions is a pure function action_X(t ∈ [0, 1)) -> {joint_id: (x, y)} computing joint positions at frame time t, on a 15-joint OpenPose-style rig.
- Emotion transform. A post-processing pass reshapes posture (slouch / lean / tremble / bounce) and sets face parameters (mouth and eye shapes) per emotion.
- PIL render. Each frame: scene backdrop → bones → ribs + pelvis → joints → prop → skull w/ face.
- GIF export. 24 frames at 10 fps, looping.
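The keyframe → render → export path above can be sketched end to end. Everything here is illustrative: action_waving is a hypothetical keyframe function in the documented shape, only a few of the 15 joints are animated, and the drawing is far cruder than the real renderer. The GIF export parameters (24 frames, 10 fps, infinite loop) match this section.

```python
import io
import math
from PIL import Image, ImageDraw

# Hypothetical keyframe function in the documented shape:
# action_X(t in [0, 1)) -> {joint_id: (x, y)}. Only 4 joints for brevity.
def action_waving(t: float) -> dict:
    swing = math.sin(2 * math.pi * t) * 40  # arm swings once per loop
    return {
        "head": (256, 120),
        "neck": (256, 170),
        "pelvis": (256, 300),
        "r_hand": (330 + swing, 150),
    }

BONES = [("head", "neck"), ("neck", "pelvis"), ("neck", "r_hand")]

def render_frame(t: float) -> Image.Image:
    im = Image.new("RGB", (512, 512), "white")
    d = ImageDraw.Draw(im)
    joints = action_waving(t)
    for a, b in BONES:                       # bones as thick lines
        d.line([joints[a], joints[b]], fill="black", width=6)
    for x, y in joints.values():             # joints as dots
        d.ellipse([x - 8, y - 8, x + 8, y + 8], fill="black")
    return im

# 24 frames at 10 fps (duration=100 ms per frame), looping forever (loop=0).
frames = [render_frame(i / 24) for i in range(24)]
buf = io.BytesIO()
frames[0].save(buf, format="GIF", save_all=True,
               append_images=frames[1:], duration=100, loop=0)
gif_bytes = buf.getvalue()
```

Because every step is a pure function of t, re-running this snippet yields the same bytes every time, which is the property the real renderer relies on.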
Outputs
The model returns a SkeletonGifOutput dataclass:
| field | type | description |
|---|---|---|
| prompt | str | original input |
| action | str | one of 92 closed-set labels |
| emotion | str | one of 10 closed-set labels |
| scene | str | one of 78 closed-set labels (or "none") |
| gif_bytes | bytes | GIF payload; write directly or use .save(path) |
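The output type might be reconstructed roughly like this. It is a sketch, not the actual source: the field names follow the table above, and the save helper is assumed to simply write gif_bytes to disk.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SkeletonGifOutput:
    # Field names follow the model card's table; this is a sketch, not the source.
    prompt: str
    action: str
    emotion: str
    scene: str
    gif_bytes: bytes

    def save(self, path: str) -> None:
        """Write the GIF payload directly to disk (assumed behavior)."""
        Path(path).write_bytes(self.gif_bytes)
```

Since gif_bytes is the complete payload, it can equally be streamed to an HTTP response or any other sink that accepts raw GIF bytes.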
Label sets
See config.json for the full canonical label lists. A few highlights:
- Actions: walking, dancing, reading, working, eating, vacuuming, cooking, football, cricket, basketball, climbing, swimming, playing_guitar, meditating, texting, shopping, bowing, hugging, … (92 total)
- Emotions: happy, sad, angry, tired, excited, neutral, scared, surprised, bored, confused
- Scenes: bedroom, kitchen, office, park, beach, church, space, stadium, museum, cemetery, castle, airport, casino, nightclub, farm, living_room, … (78 total, plus none)
Guarantees
- Output is always a .gif: PIL writes the bytes directly.
- Output is always a skeleton: procedural drawing; no diffusion model can drift.
- Emotion is always visible β on the face and in body posture.
- No hallucination. Classifications are chosen from a closed set; rendering is deterministic math. For any fixed (prompt, engine version) the output is bit-identical.
- No image-generation API. BART-MNLI is text-only; nothing ever hits a remote image service.
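The bit-identical claim can be checked mechanically by hashing the output of two runs and comparing digests. In this sketch the renderer is stubbed with an arbitrary pure function (ENGINE_VERSION and render_stub are invented names); the point is only that a pure function of (prompt, engine version) is reproducible by construction.

```python
import hashlib

ENGINE_VERSION = "0.1"  # hypothetical version tag, for illustration only

def render_stub(prompt: str) -> bytes:
    # Stand-in for the procedural renderer: a pure function of its inputs,
    # so repeated calls are bit-identical by construction.
    return f"GIF89a::{ENGINE_VERSION}::{prompt}".encode("utf-8")

def gif_digest(prompt: str) -> str:
    """SHA-256 of the rendered bytes; equal digests imply equal GIFs."""
    return hashlib.sha256(render_stub(prompt)).hexdigest()
```

Running gif_digest twice on the same prompt yields the same hex digest, which is how a regression test could pin the engine's output across releases.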
Limitations
- Art style is intentionally minimal (stick-figure-plus-skull). Fluid / photographic motion is not an output mode.
- Labels are a closed set. Unusual prompts snap to the nearest canonical label or to neutral defaults β not to novel concepts.
- Scene backdrops are schematic, not photorealistic.
Citation
If you use this in academic work, cite the closed-set procedural-rendering approach:

```bibtex
@software{skeleton_gif,
  title = {skeleton-gif: deterministic text-to-GIF with zero hallucination},
  year  = {2026},
  url   = {https://huggingface.co/ocmannazirbriet/skeleton-gif}
}
```
License
MIT.