devops-pipeline-gym / docs /video_script.md

yashash045

video_script: fill placeholders, fix framing line

1c58eb9 verified about 1 month ago


	# 90-Second Video Script — DevOps Pipeline Gym

	Length target: 85-90 seconds (roughly 255-270 words spoken at ~3 words per second)
	Format: Screen recording (Loom or OBS) with you reading off this script
	Upload: YouTube unlisted, link from README badge row
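The length target can be sanity-checked with a quick calculation. The sketch below assumes the ~3 words-per-second pace stated above and takes the section durations from the section headings in this script. The names `SECTION_SECONDS` and `word_budget` are illustrative only and are not part of the project.

```python
# Section durations in seconds, copied from the section headings in this script.
SECTION_SECONDS = {
    "hook": 12,
    "environment": 23,
    "results": 30,
    "wrap": 20,
}

WORDS_PER_SECOND = 3  # assumed speaking pace (~3 words per second)


def word_budget(seconds, wps=WORDS_PER_SECOND):
    """Return the approximate number of spoken words that fit in `seconds`."""
    return round(seconds * wps)


total_seconds = sum(SECTION_SECONDS.values())
budgets = {name: word_budget(sec) for name, sec in SECTION_SECONDS.items()}

print(total_seconds)               # 85
print(word_budget(total_seconds))  # 255
print(budgets["results"])          # 90
```

If a section's spoken text runs well past its budget, trimming that section is likely cheaper than speeding up the read.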

	Recording setup:
	- Open three browser tabs in advance: HF Space, Colab notebook, BLOG.md
	- Open one terminal showing a sample `[STEP]` log
	- Hit record, read smoothly, single take if possible

	---

	## Section 1 — Hook (0:00–0:12, ~12 seconds)

	> "Frontier LLMs already know how to fix a broken database. They can recite connection pool errors in their sleep. What they don't reliably do is check before changing anything."

	[Cut to terminal showing a `[STEP]` log with "view_pipeline" actions before "deploy"]

	> "Incident response is sequencing, not knowledge. We built an OpenEnv environment that trains exactly that."

	---

	## Section 2 — The Environment (0:12–0:35, ~23 seconds)

	[Switch to the HF Space, click `/reset` or show the Gradio demo]

	> "Five microservices in a dependency graph. Nine actions split across three roles — DEV, SRE, OPS — that rotate between steps the way a real on-call handoff would. Health is masked until you investigate."

	[Show the role-gated action panel, click `view_logs`, see service health update]
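For presenters who want a mental model of "health is masked until you investigate", here is a toy sketch. It is not the project's implementation: the class `MaskedEnv`, the service names, and the choice to rotate roles after every `view_logs` call are all assumptions made for illustration. Role-gating of actions is not modeled here.

```python
from itertools import cycle

ROLES = ("DEV", "SRE", "OPS")  # the three roles named in the script

# Hypothetical service names; the script only says there are five microservices.
TRUE_HEALTH = {
    "gateway": "healthy",
    "auth": "healthy",
    "orders": "degraded",
    "payments": "healthy",
    "database": "down",
}


class MaskedEnv:
    """Toy model: a service's health stays hidden until it is investigated."""

    def __init__(self, true_health):
        self._true_health = dict(true_health)
        self._investigated = set()
        self._roles = cycle(ROLES)
        self.current_role = next(self._roles)

    def observe(self):
        """Return health per service, masking services not yet investigated."""
        return {
            name: (status if name in self._investigated else "unknown")
            for name, status in self._true_health.items()
        }

    def view_logs(self, service):
        """Investigate one service, then hand off to the next role."""
        self._investigated.add(service)
        self.current_role = next(self._roles)
        return self.observe()


env = MaskedEnv(TRUE_HEALTH)
print(env.current_role, env.observe()["database"])  # DEV unknown
obs = env.view_logs("database")
print(env.current_role, obs["database"])            # SRE down
```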

	> "The reward is six deterministic Python components, bounded per step. No LLM judge in the loop. Same trajectory in, same score out, every time."
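The "same trajectory in, same score out" claim can be illustrated with a minimal sketch of a reward built from pure functions and clamped per step. Everything specific here is an assumption: the six component definitions, their weights, the step-record fields, and the bound `STEP_BOUND` are placeholders, since the script only states that there are six deterministic components bounded per step.

```python
STEP_BOUND = 1.0  # assumed per-step bound, not a value taken from the project


def clamp(value, low=-STEP_BOUND, high=STEP_BOUND):
    """Restrict a per-step reward to the range [low, high]."""
    return max(low, min(high, value))


# Six placeholder components. Each is a pure function of the step record,
# so there is no randomness and no model call anywhere in the scoring path.
COMPONENTS = (
    lambda s: 0.3 if s["investigated_first"] else -0.3,
    lambda s: 0.4 * s["health_delta"],
    lambda s: -0.2 if s["redundant_action"] else 0.0,
    lambda s: -0.5 if s["caused_outage"] else 0.0,
    lambda s: 0.1 if s["role_allowed"] else -0.4,
    lambda s: -0.05,  # small constant time cost per step
)


def step_reward(step):
    """Sum the six components and clamp the result to the per-step bound."""
    return clamp(sum(component(step) for component in COMPONENTS))


def trajectory_score(steps):
    """Score a whole trajectory as the sum of bounded step rewards."""
    return sum(step_reward(s) for s in steps)


good_step = {"investigated_first": True, "health_delta": 0.0,
             "redundant_action": False, "caused_outage": False,
             "role_allowed": True}
bad_step = {"investigated_first": False, "health_delta": -1.0,
            "redundant_action": True, "caused_outage": True,
            "role_allowed": False}

trajectory = [good_step, bad_step]
print(step_reward(bad_step))  # -1.0 (clamped at the lower bound)
print(trajectory_score(trajectory) == trajectory_score(trajectory))  # True
```

Because every component depends only on the recorded step, re-scoring a saved trajectory reproduces the same number, which is the property the narration relies on.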

	---

	## Section 3 — Results (0:35–1:05, ~30 seconds)

	[Switch to the bar chart from the Colab — base vs trained]

	> "We trained Qwen3 1.7B with QLoRA on 80 expert trajectories. Same task, same seed, same prompt format — same scoring rubric across all baselines."

	[Read the numbers off the chart, e.g.]

	> "Untrained Qwen2.5 7B baseline on judgment_call: -1.200 reward. Our trained Qwen3 1.7B with the SFT adapter: -0.044 reward. That's a +1.156 delta. A 1.7B model trained on 80 trajectories beats an untrained 7B baseline from the same Qwen family, on the same task, same seed, same prompt format."
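Before recording, it is worth re-deriving the quoted delta from the two rewards so the spoken number matches the chart. A minimal check, using only the two values quoted in this section (the variable names are illustrative):

```python
baseline_reward = -1.200  # untrained Qwen2.5 7B on judgment_call (from this script)
trained_reward = -0.044   # trained Qwen3 1.7B with the SFT adapter (from this script)

# Delta is trained minus baseline; a positive value means the trained model scored higher.
delta = trained_reward - baseline_reward
print(f"{delta:+.3f}")  # +1.156
```

If the eval numbers are refreshed, rerun this and update the spoken delta in the same edit.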

	[Switch to the frontier model chart if rendered, or just describe]

	> "And here's the interesting part — that 1.7B trained model outperforms several 70B-plus frontier baselines on the same task. Because we trained the right skill, not the bigger model."

	---

	## Section 4 — Why It Matters + Wrap (1:05–1:25, ~20 seconds)

	> "The bigger thesis: deterministic, verifiable RL environments for professional decisions are the missing rung between toy gridworlds and shipping real agents. We picked DevOps because failures are well-documented and graders can be pure functions. The same approach generalizes — legal triage, incident command, supply chain rerouting."

	> "Try it on the Colab badge in our README. Play it interactively in our Gradio demo. We're Team Tripod. Thanks for watching."

	---

	## Tips

	- Pace. Don't rush. 85 seconds is plenty. Pauses between sentences are fine.
	- Tone. Matter-of-fact and curious. Not hyped.
	- Don't memorize. Read it. Eyes on the script, voice on the explanation.
	- One take is fine. If you fumble, just re-record that section. Loom and OBS both let you trim.
	- The numbers in Section 3 come from the current eval results. Re-record only that section if the numbers change significantly.

	## Fallback if you run out of time

	If recording is taking too long, cut Section 4 down to one line:
	> "Try it on the Colab badge in our README. We're Team Tripod."

	The hook, the environment, and the results are the parts that earn the score. Section 4 is a bonus.