RLHF — PPO fine-tuning 🚧 not trained yet

Optimize an LLM with PPO against a reward model — the classic RLHF recipe.

Status: documented recipe (placeholder). This is a reproducible pipeline from Ropedia Academy for an advanced, GPU-heavy task. Everything needed to run it is specified below: base model, objective, dataset, config, and the exact evaluation. The weights, metrics, and figures replace this placeholder once you run the notebook on a GPU and publish the results (see How to fill this repo). Trained models from the collection can be tried live in the Ropedia demos Space.

At a glance


Base model	An SFT LLM + a reward model (demo: GPT-2 sentiment)
Task	RL fine-tuning against a reward model
Training objective	PPO — maximize reward-model score with a KL penalty to the reference.
Track	LM · Language & multimodal
Built on	huggingface/trl (PPOTrainer)
Notebook
Compute / storage / time	GPU required — see the Compute · storage · time table in the notebook
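
The training objective in the table (reward-model score with a KL penalty to the reference) is commonly turned into per-token rewards before the PPO update. The sketch below shows one such formulation in plain Python. The function name and the `beta` value are illustrative assumptions, not settings taken from the notebook.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    score: float,
    beta: float = 0.2,  # hypothetical KL coefficient, not taken from the notebook
) -> List[float]:
    """Per-token rewards: a KL penalty on every token, plus the score on the last one."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Sampled estimate of KL(policy || reference) at each generated token.
    kl_per_token = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards = [-beta * kl for kl in kl_per_token]
    # The reward model scores the whole response, so its score is added at the final token.
    rewards[-1] += score
    return rewards


# Toy log-probabilities for a three-token response (made-up numbers).
rewards = kl_shaped_rewards([-1.0, -0.5, -2.0], [-1.5, -0.5, -1.0], score=1.0, beta=0.1)
print(rewards)
```

A larger `beta` keeps the policy closer to the reference model at the cost of a lower reward-model score; a smaller one does the opposite.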

Dataset

Source: a prompt set. The demo draws its prompts from IMDB movie reviews.
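
A prompt set for this kind of demo can be built by cutting each review down to a short prefix that the policy then continues. The sketch below assumes whitespace splitting in place of a real tokenizer, and the prefix lengths and sample reviews are made up for illustration; the notebook's own data cell is the authority.

```python
import random
from typing import List


def make_prompts(
    texts: List[str],
    min_words: int = 2,  # hypothetical bounds, not taken from the notebook
    max_words: int = 8,
    seed: int = 0,
) -> List[str]:
    """Cut each text to a random-length prefix so the policy has to write the rest."""
    rng = random.Random(seed)
    prompts = []
    for text in texts:
        words = text.split()
        if len(words) < min_words:
            continue  # too short to make a usable prompt
        n = rng.randint(min_words, min(max_words, len(words)))
        prompts.append(" ".join(words[:n]))
    return prompts


# Stand-in reviews, not real IMDB rows.
reviews = [
    "This film was a complete surprise and I enjoyed every minute of it",
    "Bad",
    "I expected more from the director after the last two movies",
]
prompts = make_prompts(reviews)
print(prompts)
```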

Training config

GPU-scale — the notebook ships a demo profile (free Colab T4) and a full profile, with an exact Compute · storage · time table. Hyperparameters (optimizer, steps, batch, LoRA rank, …) are in the training cell.

Evaluation results

⏳ Pending — run the notebook on a GPU to fill this in. This lab reports mean reward (+ KL-to-reference) on a held-out split (see its Evaluate cell).
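
As a rough illustration of the two reported numbers, the sketch below averages reward-model scores and a per-response KL estimate over a held-out batch. The function name, the dictionary keys, and the toy values are assumptions; the notebook's Evaluate cell defines the actual computation.

```python
from statistics import mean
from typing import Dict, List


def summarize_eval(
    rewards: List[float],
    policy_logprobs: List[List[float]],
    ref_logprobs: List[List[float]],
) -> Dict[str, float]:
    """Mean reward and mean sequence-level KL estimate over held-out responses."""
    if not rewards or not (len(rewards) == len(policy_logprobs) == len(ref_logprobs)):
        raise ValueError("need one reward and one pair of log-prob lists per response")
    kls = []
    for pol, ref in zip(policy_logprobs, ref_logprobs):
        if len(pol) != len(ref):
            raise ValueError("policy and reference log-probs must align token by token")
        # Summing per-token log-ratios gives a sampled KL(policy || reference) per response.
        kls.append(sum(p - r for p, r in zip(pol, ref)))
    # The key names below are hypothetical, not the notebook's metrics.json schema.
    return {"mean_reward": mean(rewards), "mean_kl": mean(kls)}


# Two toy responses with made-up scores and log-probabilities.
summary = summarize_eval(
    rewards=[0.5, 1.5],
    policy_logprobs=[[-1.0, -1.0], [-0.5]],
    ref_logprobs=[[-1.5, -1.0], [-1.5]],
)
print(summary)
```

Reporting both numbers together matters: a high mean reward paired with a large KL usually signals that the policy has drifted far from the reference model.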

Inference example

No weights are published yet. After a GPU run, load the checkpoint/adapter the notebook saves (it also has a ready inference cell). Base model: An SFT LLM + a reward model (demo: GPT-2 sentiment).
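
Once a checkpoint exists, loading it could look like the sketch below. It assumes the notebook saves a full causal-LM checkpoint in the standard `transformers` format; a LoRA adapter would need to be loaded with `peft` instead. The checkpoint path and the sampling settings are placeholders.

```python
def generate_continuation(checkpoint_dir: str, prompt: str, max_new_tokens: int = 32) -> str:
    """Sample one continuation of `prompt` from a locally saved checkpoint."""
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sampling settings here are illustrative only
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (hypothetical path; requires a finished GPU run):
# print(generate_continuation("path/to/saved/checkpoint", "This movie was"))
```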

How to fill this repo

Open the notebook in Colab → Runtime → GPU → Run all (runs the real pipeline).
Run its Publish to the Hugging Face Hub step (or HfApi().upload_folder(...)) — the checkpoint + metrics.json + figures replace this placeholder.
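
For the manual upload route, a minimal sketch with `huggingface_hub` follows. It assumes you are already logged in with a token that has write access. The helper names and the pre-upload check are illustrative; `metrics.json` is the only file name this card states, so it is the only one checked.

```python
import tempfile
from pathlib import Path
from typing import List

REQUIRED_FILES = ("metrics.json",)  # the only artifact name given in this card


def missing_artifacts(folder: str) -> List[str]:
    """Return the required files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def publish(folder: str, repo_id: str) -> None:
    """Upload a results folder to a model repo on the Hub after a basic sanity check."""
    missing = missing_artifacts(folder)
    if missing:
        raise FileNotFoundError(f"missing before upload: {missing}")
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add trained checkpoint, metrics and figures",
    )


# Offline demonstration of the check only; nothing is uploaded here.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_artifacts(tmp)
    (Path(tmp) / "metrics.json").write_text("{}")
    after = missing_artifacts(tmp)
print(before, after)

# Example call (requires a finished run and Hub credentials):
# publish("path/to/results", "cy0307/lm-rlhf-ppo")
```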

[ ] train / run on a GPU · [ ] upload weights · [ ] add metrics.json · [ ] add figures · [ ] swap in the real results card

Limitations

Not yet trained — no numbers to report. The pipeline is GPU-heavy (see the compute table); on free Colab use the demo-scale settings. This is an educational, reproducible recipe, not a tuned production release.

License

Code: MIT (this repository). The library it builds on (huggingface/trl), the base model, and the dataset are each under their own licenses. Check the upstream sources before redistribution.

Citation

@misc{ropedia_academy,
  title  = {Ropedia Academy: an interactive course on embodied \& spatial AI},
  author = {Ropedia Academy},
  year   = {2026},
  howpublished = {\url{https://chaoyue0307.github.io/ropedia-academy/}}
}

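Method / original work: Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT), 2022; Schulman et al., "Proximal Policy Optimization Algorithms" (PPO), 2017.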

Related assets

🚀 Live demos: https://huggingface.co/spaces/cy0307/ropedia-demos
🤗 All models + collection: https://huggingface.co/cy0307
📚 Course & all labs: https://chaoyue0307.github.io/ropedia-academy/ · Labs tab
💻 Source / notebooks: github.com/ChaoYue0307/ropedia-academy
🔗 Relates to tracks: A · D

Documented placeholder in the Ropedia Academy collection — train it on a GPU to publish the real model. Contributions welcome on GitHub.


Model tree for cy0307/lm-rlhf-ppo: base model lvwerra/gpt2-imdb (this model is listed as a fine-tune of it).
