NEW RELEASE: Esper 3.1 for Qwen 3.6!

- Your dedicated DevOps expert: Esper 3.1 maximizes DevOps and architecture helpfulness, powered by high-difficulty DevOps and architecture data generated with DeepSeek-V3.1-Terminus!
- Improved coding performance: challenging code-reasoning datasets stretch DeepSeek-V3.1-Terminus and DeepSeek-V3.2 to the limits, allowing Esper 3.1 to tackle harder coding tasks!
- AI to build AI: our high-difficulty AI expertise data boosts Esper 3.1's MLOps, AI architecture, AI research, and general reasoning skills.

Get it now: ValiantLabs/Qwen3.6-35B-A3B-Esper3.1

We're working on more finetunes for the newest Qwen and Gemma models, and we've also started on the agentic-first datasets for Esper 4 :) We're going to make open source better and better for your work!

Please note that real-life financial and family concerns have popped up and have limited the time we can devote to our open-source work :( If you would like to see Esper 4 and our other releases speed up instead of slowing down, this is the best way you can help us: sequelbox/SupportOpenSource

No matter what, we'll keep fighting and we won't give up!

with love,
allegra

reacted to Ujjwal-Tyagi's post with 👍 22 days ago

We are hiring at Shirova AI, a research lab in India. We need AI researchers and engineers to work in our lab. We can help researchers move to nearby workspaces, or they can work from home without ever coming to the lab. We're building our founding team, so the pay will be good. There is plenty of room to learn, so don't hesitate to email us at: careers@shirova.com

reacted to anakin87's post with ❤️ 22 days ago

How does LLM training with RL environments work?

It all starts with 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗩𝗲𝗿𝗶𝗳𝗶𝗮𝗯𝗹𝗲 𝗥𝗲𝘄𝗮𝗿𝗱𝘀
- question asked
- model generates reasoning + answer
- answer checked against ground truth
- reward drives RL training
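
A minimal sketch of the "answer checked against ground truth" step. The `Answer:` marker and the 0/1 reward values are assumptions made for illustration, not something the post or the course prescribes:

```python
def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth, else 0.0."""
    marker = "Answer:"  # hypothetical output format: reasoning, then "Answer: <value>"
    if marker not in completion:
        return 0.0  # no parsable answer, so no reward
    # Take whatever follows the last marker as the final answer.
    answer = completion.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0


print(verifiable_reward("2 + 2 is four. Answer: 4", "4"))
print(verifiable_reward("I am not sure.", "4"))
```

The reward is deterministic and needs no learned judge, which is what makes it "verifiable".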

In this setup, the environment is simple: fixed questions and answers, rollout logic, reward(s)

Consider a more complex tic-tac-toe env ❌⭕
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions

(envs can also include tools)
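
To make the three additions concrete, here is a rough sketch of such an env. The class name `TicTacToeEnv`, the `opponent_skill` parameter, and the reward values are all hypothetical; real environments (including the one in the course linked below) may look quite different:

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that mark has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


class TicTacToeEnv:
    def __init__(self, opponent_skill=0.5, seed=None):
        self.opponent_skill = opponent_skill  # 0.0 = random, 1.0 = always wins/blocks if it can
        self.rng = random.Random(seed)
        self.board = [" "] * 9

    def reset(self):
        self.board = [" "] * 9
        return "".join(self.board)

    def _opponent_move(self):
        free = [i for i, cell in enumerate(self.board) if cell == " "]
        if self.rng.random() < self.opponent_skill:
            for mark in ("O", "X"):  # first try to win, then try to block
                for i in free:
                    trial = self.board.copy()
                    trial[i] = mark
                    if winner(trial) == mark:
                        return i
        return self.rng.choice(free)

    def step(self, move):
        """Model plays "X" at `move`; returns (board, reward, done)."""
        if not 0 <= move < 9 or self.board[move] != " ":
            return "".join(self.board), -1.0, True  # assumed penalty: invalid move ends the game
        self.board[move] = "X"
        if winner(self.board) == "X":
            return "".join(self.board), 1.0, True
        if " " not in self.board:
            return "".join(self.board), 0.0, True  # draw
        self.board[self._opponent_move()] = "O"
        if winner(self.board) == "O":
            return "".join(self.board), -1.0, True
        return "".join(self.board), 0.0, " " not in self.board
```

Each call to `step` is one turn of the multi-turn interaction: the model sees the board string, picks a cell, and the opponent replies before the next turn.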

---

What happens during training?

We use 𝗚𝗿𝗼𝘂𝗽 𝗥𝗲𝗹𝗮𝘁𝗶𝘃𝗲 𝗣𝗼𝗹𝗶𝗰𝘆 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗮𝘁𝗶𝗼𝗻 with a tic-tac-toe env

No critic model needed, the group is the baseline
Simpler than PPO

1️⃣ Rollout generation: from the same board, model plays N games via sampling
2️⃣ Each game scored with deterministic rewards (win, format, ...)
3️⃣ Mean score computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ Model updated to favor trajectories above baseline
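
Steps 3️⃣ and 4️⃣ fit in a few lines. This sketch follows the description above literally (score minus group mean); the function name `group_advantages` is made up for illustration, and many GRPO implementations additionally divide by the group's standard deviation:

```python
def group_advantages(scores: list[float]) -> list[float]:
    """Advantage of each rollout = its score minus the mean score of its group."""
    if not scores:
        raise ValueError("need at least one rollout score")
    baseline = sum(scores) / len(scores)  # the group mean acts as the baseline, no critic
    return [score - baseline for score in scores]


# Four games sampled from the same board: two wins (1.0) and two losses (0.0).
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

Rollouts with a positive advantage are the "above baseline" trajectories that the update in step 5️⃣ makes more likely. If every rollout in a group gets the same score, all advantages are zero and that group contributes no learning signal.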

🔁 Repeat

For a deep dive, check out
🌱 https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs
