title: intellite-500m-sft
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false

intellite-500M SFT: RLHF data collector

Serves the SFT-tuned intellite 500M model in a chat UI. Every assistant reply gets 👍 / 👎 buttons; each rating appends one JSONL record to a local folder that a CommitScheduler pushes to a dataset repo on the Hub every 5 minutes.
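A minimal sketch of the logging path described above, assuming `huggingface_hub.CommitScheduler` is the scheduler in use. The folder name `feedback` and the helper names `make_scheduler` and `append_feedback` are illustrative, not taken from app.py; the record fields follow the data format documented further down.

```python
import json
import threading
import uuid
from datetime import datetime
from pathlib import Path

FEEDBACK_DIR = Path("feedback")  # hypothetical local folder name
# One file per process, following the data_<uuid>.jsonl naming in this README.
FEEDBACK_FILE = FEEDBACK_DIR / f"data_{uuid.uuid4()}.jsonl"


def make_scheduler(repo_id, token):
    """Push FEEDBACK_DIR to a dataset repo every 5 minutes."""
    # Imported lazily so the rest of the sketch runs without huggingface_hub.
    from huggingface_hub import CommitScheduler

    return CommitScheduler(
        repo_id=repo_id,
        repo_type="dataset",
        folder_path=FEEDBACK_DIR,
        path_in_repo="data",
        every=5,  # minutes between background commits
        token=token,
    )


def append_feedback(path, system, prompt_messages, response, liked, lock):
    """Append one rating as a single JSONL line while holding `lock`."""
    record = {
        "ts": datetime.now().isoformat(timespec="seconds"),
        "system": system,
        "prompt_messages": prompt_messages,
        "response": response,
        "liked": bool(liked),
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # Holding the lock keeps a commit from reading a half-written line.
    with lock, path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

In a running Space, the lock passed in would normally be `scheduler.lock`, so that writes and scheduled commits never overlap; a plain `threading.Lock()` is enough for local experiments.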

Weights are loaded from a bundled bf16 checkpoint (best.pt, ~1 GB).

Best sampling defaults are baked into the sliders: temp 0.7 · top-k 40 · top-p 0.7 · rep penalty 1.1, found by a grid sweep against this checkpoint. You can override them per message in the right-side panel.
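To make the four knobs concrete, here is a pure-Python sketch of how they are commonly combined when choosing the next token. The order of operations (repetition penalty, then temperature, then top-k, then top-p) and the function name `sample_next` are assumptions for illustration; the Space's actual sampler may differ.

```python
import math
import random


def sample_next(logits, prev_ids, temperature=0.7, top_k=40, top_p=0.7,
                rep_penalty=1.1, rng=random):
    """Pick one token id from raw logits using the default slider values."""
    logits = list(logits)
    # Repetition penalty (CTRL-style): push already-generated tokens down.
    for i in set(prev_ids):
        logits[i] = logits[i] / rep_penalty if logits[i] > 0 else logits[i] * rep_penalty
    # Temperature below 1 sharpens the distribution.
    logits = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for i, p in zip(ranked, probs):
        kept_ids.append(i)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # random.choices renormalises the remaining weights.
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]
```

With a low top-p such as 0.7, a confident model often ends up with only one or two candidates, which is why these defaults give fairly conservative replies.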

Setup

  1. Upload the SFT checkpoint to the Space root as best.pt (or set INTELLITE_CKPT=/path/to/file.pt in Settings → Variables).
  2. Create the dataset repo ProCreations/Intellite-storage (the scheduler will auto-create it on first push too).
  3. Set HF_TOKEN in Settings → Secrets to a token with write scope on the dataset repo. Without it, the Space still runs, but feedback persists only in-memory until the container restarts.
  4. (Optional) Override FEEDBACK_REPO in Settings → Variables if you want to use a different dataset repo.
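The three settings above can be read in one place. This is a sketch only: the variable names come from the steps above, while `SpaceConfig`, `load_config`, and the idea that the defaults are `best.pt` and `ProCreations/Intellite-storage` are assumptions drawn from this README rather than from app.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SpaceConfig:
    ckpt_path: str
    feedback_repo: str
    hf_token: Optional[str]

    @property
    def can_push(self) -> bool:
        """True when a token is available for pushing feedback to the Hub."""
        return bool(self.hf_token)


def load_config(env: Mapping[str, str] = os.environ) -> SpaceConfig:
    """Read Space variables and secrets, falling back to the README defaults."""
    return SpaceConfig(
        ckpt_path=env.get("INTELLITE_CKPT", "best.pt"),
        feedback_repo=env.get("FEEDBACK_REPO", "ProCreations/Intellite-storage"),
        # Treat an empty secret the same as a missing one.
        hf_token=env.get("HF_TOKEN") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the fallback behaviour easy to check locally, for example `load_config({}).can_push` is `False`.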

Data format

Each record is a single line of JSONL in data/data_<uuid>.jsonl on the dataset repo (one file per Space replica/restart):

{"ts":"2026-04-25T15:23:45","system":"You are a helpful, honest, and concise assistant.","prompt_messages":[{"role":"user","content":"..."},{"role":"assistant","content":"..."},{"role":"user","content":"..."}],"response":"...","liked":true}

Each record is exactly (prompt, response, reward∈{0,1}), a shape that preference and RL trainers can consume with little preprocessing. For DPO, group records by identical prompt_messages and pair a liked=true response (chosen) with a liked=false one (rejected). For REINFORCE/PPO, feed liked as a reward.
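The grouping rule for DPO and the reward mapping for REINFORCE/PPO can be sketched as follows. The function names are hypothetical, and the grouping key uses only prompt_messages as described above; records that differ only in their system prompt therefore land in the same group unless you add `system` to the key.

```python
import json
from collections import defaultdict
from itertools import product


def build_dpo_pairs(records):
    """Pair liked and disliked responses that share identical prompt_messages."""
    groups = defaultdict(lambda: {"chosen": [], "rejected": []})
    for rec in records:
        # Canonical JSON turns the message list into a hashable grouping key.
        key = json.dumps(rec["prompt_messages"], sort_keys=True, ensure_ascii=False)
        bucket = "chosen" if rec["liked"] else "rejected"
        groups[key][bucket].append(rec["response"])

    pairs = []
    for key, group in groups.items():
        # Every liked response is paired with every disliked one for that prompt.
        for chosen, rejected in product(group["chosen"], group["rejected"]):
            pairs.append({
                "prompt_messages": json.loads(key),
                "chosen": chosen,
                "rejected": rejected,
            })
    return pairs


def to_rewards(records):
    """Map each record to (prompt_messages, response, reward) with reward in {0.0, 1.0}."""
    return [
        (rec["prompt_messages"], rec["response"], 1.0 if rec["liked"] else 0.0)
        for rec in records
    ]
```

Prompts that have only liked or only disliked responses produce no DPO pairs, so the pair count is usually much smaller than the record count.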

Downloading the data

hf download ProCreations/Intellite-storage --repo-type=dataset --local-dir ./rlhf-data
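Once the command above has finished, the per-replica files can be merged into one list of records. A small sketch, assuming the download landed in ./rlhf-data and the files follow the data/data_<uuid>.jsonl layout described earlier; `load_feedback` is an illustrative name.

```python
import json
from pathlib import Path


def load_feedback(root="./rlhf-data"):
    """Read every data/data_*.jsonl file under `root` into a list of dicts."""
    records = []
    for path in sorted(Path(root).glob("data/data_*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    # Skip lines that fail to parse instead of aborting the load.
                    print(f"skipping malformed line {path.name}:{line_no}")
    return records
```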

Hardware: ZeroGPU (half-H200, dynamic)

This Space runs on Hugging Face ZeroGPU: a half-H200 slice (70 GB VRAM) is allocated on demand each time you press Send, then released when the reply finishes. Per-message latency:

  • Cold start (first message after idle): ~3–5 s of GPU queueing + ~2 s model warm
  • Warm: ~5–10 s for a typical 200–400 token reply (≈80 tok/s on H200)
  • Max-length 800-token reply: ~10–15 s

The chat function is decorated with @spaces.GPU(duration=60) so the GPU stays allocated for the duration of the streamed reply, then releases.
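A sketch of what that decoration can look like for a streaming chat function. Only `@spaces.GPU(duration=60)` comes from this README; the local no-op fallback, the `generate_tokens` placeholder (which just echoes the prompt word by word instead of running the model), and the `chat` signature are assumptions for illustration.

```python
try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU(duration=60)  # request the GPU for up to 60 s per call
except ImportError:
    # Local fallback (an assumption of this sketch): leave the function undecorated.
    def gpu(fn):
        return fn


def generate_tokens(prompt):
    """Placeholder for model generation: yields one token at a time."""
    for token in prompt.split():
        yield token


@gpu
def chat(message, history):
    """Stream the reply; the GPU is held while this generator is running."""
    partial = ""
    for token in generate_tokens(message):
        partial = f"{partial} {token}".strip()
        yield partial  # each yield updates the chat bubble with the text so far
```

Because the decorated function is a generator, the allocation spans the whole streamed reply rather than a single forward pass, which is the behaviour described above.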

ZeroGPU has a per-account daily quota (3.5 min free / 25 min PRO); heavy users will hit a queue. Generation is otherwise free.
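For a rough sense of what those quotas buy, the figures above can be turned into a replies-per-day estimate. This is back-of-the-envelope arithmetic, assuming the quota is consumed by actual generation time at about 80 tok/s and ignoring queueing and warm-up; `replies_per_day` is an illustrative helper.

```python
def replies_per_day(quota_minutes, tokens_per_reply,
                    tokens_per_second=80.0, overhead_s=0.0):
    """Estimate how many replies fit in a daily GPU quota."""
    seconds_per_reply = tokens_per_reply / tokens_per_second + overhead_s
    return int(quota_minutes * 60 // seconds_per_reply)


free_max_length = replies_per_day(3.5, 800)  # 210 s / 10 s per reply = 21
pro_max_length = replies_per_day(25, 800)    # 1500 s / 10 s per reply = 150
```

Shorter replies stretch the quota further; any per-message overhead can be passed as `overhead_s` to get a more conservative number.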

If the Space stalls on cold container boot, give it ~30 s: that's the 1 GB of bf16 weights downloading from ProCreations/intellite-500m-sft. Subsequent restarts hit the cached copy.