| # Training Guide for Agentic Coding with Open-Source 8B Models: A Practical Recipe from SFT to RL |
|
|
| **A consolidated training guide based on Nemotron-Terminal-8B, Klear-AgentForge, GLM-5, and Qwen3-Coder-Next research** |
|
|
| --- |
|
|
| ## Abstract |
|
|
| We present a practical, end-to-end training guide for building state-of-the-art agentic coding assistants using open-source 8B parameter models. Starting from the **NVIDIA Nemotron-Terminal-8B** base model---the only <10B-parameter model explicitly trained for terminal/code-agent interaction---we detail a two-stage pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL) backed by the highest-quality publicly available datasets. We incorporate insights from recent landmark work on multi-format tool template training, asynchronous RL infrastructure, execution-verified reward models, and synthetic trajectory generation. The resulting model is deployable in **Pi coding agent** and any other open-source coding tool that interfaces with LLMs via standard inference APIs. We also provide a validated benchmark suite and SOTA tricks extracted from peer-reviewed results.
|
|
| --- |
|
|
| ## 1. Introduction & Motivation |
|
|
| The shift from "vibe coding" (human-prompted code generation) to **agentic engineering** (AI agents that plan, execute, and iterate autonomously) is the defining frontier in software development AI. Frontier closed-source systems---Claude Code, Codex CLI---demonstrate that terminal interaction and multi-turn tool use are now core capabilities. However, the training recipes behind these systems remain undisclosed. |
|
|
| Recent open research has closed this gap significantly: |
|
|
| - **Nemotron-Terminal** (arXiv:2602.21193) showed that targeted SFT on terminal-adapted datasets lifts Qwen3 backbones to **20.2% on Terminal-Bench 2.0** (at the 14B scale; see Section 7), competitive with much larger models.
| - **Klear-AgentForge** (arXiv:2511.05951) achieved **71.5% BFCL v3** and **39.4% SWE-bench Verified** on an 8B model through unified SFT + RL across tool-use and coding domains. |
| - **GLM-5** (arXiv:2602.15763) demonstrated that asynchronous RL with decoupled rollout/train engines and multi-format tool training yields state-of-the-art open-weight performance on long-horizon coding tasks. |
| - **Qwen3-Coder-Next** (arXiv:2603.00729) proved that training on **multiple tool chat templates** (JSON, XML, Python-style, TypeScript) is critical for format-robust agentic behavior. |
|
|
| This guide consolidates these findings into a single reproducible recipe. |
|
|
| --- |
|
|
| ## 2. Base Model Selection: Why Nemotron-Terminal-8B |
|
|
| For agentic coding under 10B parameters, the choice of base model is the highest-leverage decision. |
|
|
| | Model | Params | Release | License | Pre-training for Agents? | |
| |---|---|---|---|---| |
| | **Nemotron-Terminal-8B** | 8.2B | Feb 2026 | NVIDIA (other) | **Yes** - terminal/code-agent SFT on Qwen3 backbone | |
| | Qwen3-8B-Base | 8.2B | Apr 2025 | Apache-2.0 | No (raw base) | |
| | Mistral-3-8B-Base | 8.9B | Oct 2025 | Apache-2.0 | No (raw base) | |
| | Gemma-4-E4B-it | 8.0B | Mar 2026 | Apache-2.0 | No (multimodal generalist) | |
|
|
| **Nemotron-Terminal-8B** (https://hf.co/nvidia/Nemotron-Terminal-8B) is uniquely suited because: |
| 1. It is already SFT'd for terminal interaction and bash/code execution scaffolding. |
| 2. It uses the Qwen3 architecture, which has native `tool_calls` support in its tokenizer and chat template. |
| 3. It is small enough for single-GPU RL training (16GB VRAM with LoRA; 24GB+ for full SFT) yet large enough for complex reasoning. |
| 4. Its training corpus (Terminal-Corpus) includes adapted competitive coding, math, and software engineering tasks---the exact domains needed for agentic coding. |
|
|
| > **Hardware recommendation:** Start with `a10g-large` (24GB) for SFT; use `a100-large` (80GB) or `a10g-largex4` for full-model RL with large batch sizes. |
|
|
| --- |
|
|
| ## 3. Dataset Curation: The Foundation of Agentic Capability |
|
|
| ### 3.1 SFT Datasets (Multi-Domain Mix) |
|
|
| We recommend a **60/30/10** mix by token volume, normalized to the `messages` (ChatML) format with `tool_calls`. |
|
|
| #### Tier 1: Software Engineering Trajectories (60%) |
| **`SWE-bench/SWE-smith-trajectories`** (https://hf.co/datasets/SWE-bench/SWE-smith-trajectories) |
| - **Size:** ~5,017 trajectories (~3GB across splits) |
| - **Format:** Multi-turn `messages` with `role`, `content`, `tool_calls` |
| - **Provenance:** Used to train SWE-agent-LM-32B and adopted by Klear-AgentForge |
| - **Use the `tool` split** for standard OpenAI-style function calling |
| - **Key feature:** Each trajectory includes `resolved` bool and `patch` diff---use this for filtering (keep only resolved=True for SFT) |
|
|
| **Preprocessing:** |
| ```python |
| from datasets import load_dataset |
| |
| df = load_dataset("SWE-bench/SWE-smith-trajectories", split="tool") |
| df = df.filter(lambda x: x["resolved"])
| # Extract messages column; each row is a list of dicts |
| # Ensure tool_calls use {"type": "function", "function": {"name": ..., "arguments": ...}} |
| ``` |
|
|
| #### Tier 2: General Tool-Use & Function Calling (30%) |
| **`nvidia/Nemotron-Agentic-v1`** (https://hf.co/datasets/nvidia/Nemotron-Agentic-v1) |
| - **Size:** 100K+ trajectories |
| - **Format:** `messages` with `tool_calls`, `reasoning`, `tools` metadata |
| - **Splits:** `interactive_agent` (multi-turn conversation) and `tool_calling` (single-turn function calling) |
| - **License:** CC-BY-4.0 |
|
|
| For a cleaned 335K-trajectory variant in strict reasoning format, use: |
| **`AmanPriyanshu/tool-reasoning-sft-CODING-nvidia-Nemotron-Agentic-v1`** |
|
|
| #### Tier 3: Executable Code-as-Action & General Coding (10%) |
| **`xingyaoww/code-act`** (CodeActInstruct) (https://hf.co/datasets/xingyaoww/code-act) |
| - Teaches the model to use Python execution as its action space |
| - Includes decision-making (ALFWorld), tabular reasoning (WikiTableQuestions), and code tasks |
|
|
| **`smirki/Agentic-Coding-Tessa`** (https://hf.co/datasets/smirki/Agentic-Coding-Tessa) |
| - Mixed reasoning + SWE trajectories; axolotl-compatible |
|
|
| **`AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset-v1.1`** |
| - Explicit step-by-step reasoning before code generation |
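|
|
| To assemble the mix, a minimal sketch using `datasets.interleave_datasets` is shown below. It assumes each source has first been normalized to the shared `messages` schema (Section 4.1); the `tool` and `tool_calling` split names come from the dataset cards above, while the CodeAct `train` split name is an assumption to verify. Note that `probabilities` samples by example rather than by token, so treat the 60/30/10 token-volume ratio as an approximation.
| ```python
| from datasets import load_dataset, interleave_datasets
|
| # Tier 1: resolved SWE trajectories (60%), Tier 2: tool use (30%), Tier 3: code-as-action (10%).
| # Assumes all three have already been mapped to a common `messages` column (Section 4.1).
| swe = load_dataset("SWE-bench/SWE-smith-trajectories", split="tool")
| swe = swe.filter(lambda x: x["resolved"])
| tools = load_dataset("nvidia/Nemotron-Agentic-v1", split="tool_calling")
| codeact = load_dataset("xingyaoww/code-act", split="train")
|
| mixed_dataset = interleave_datasets(
|     [swe, tools, codeact],
|     probabilities=[0.6, 0.3, 0.1],
|     seed=42,
|     stopping_strategy="all_exhausted",  # oversample smaller sources instead of truncating
| )
| ```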
|
|
| ### 3.2 RL Datasets (Execution-Verified Rewards) |
|
|
| **`nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1`** (https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1) |
| - **Size:** 10K-100K rows (~4.8GB) |
| - **Format:** Step-level behavior pairs with `pass_rate` as the reward signal |
| - **Use case:** GRPO/PPO training where each assistant step is scored by test pass rate |
| - **Contains:** `expected_action`, `ref_message`, `pass_rate_total`, `pass_rate_passed` |
|
|
| **`nvidia/Nemotron-RL-Agentic-Function-Calling-Pivot-v1`** and **`nvidia/Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1`** |
| - For RL specifically targeting function-calling accuracy and multi-turn conversation |
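|
|
| A minimal sketch of turning the step-level pass-rate fields into a scalar reward (the `train` split name is an assumption; the `pass_rate_*` field names are taken from the summary above and should be verified against the actual schema):
| ```python
| from datasets import load_dataset
|
| rl_dataset = load_dataset("nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1", split="train")
|
| def step_reward(example: dict) -> float:
|     # Fraction of tests passing after this assistant step; 0.0 if no tests were run.
|     total = example.get("pass_rate_total") or 0
|     passed = example.get("pass_rate_passed") or 0
|     return passed / total if total else 0.0
|
| rl_dataset = rl_dataset.map(lambda x: {"reward": step_reward(x)})
| ```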
|
|
| --- |
|
|
| ## 4. Stage 1: Supervised Fine-Tuning (SFT) |
|
|
| ### 4.1 Data Format Normalization |
|
|
| All datasets must be coerced to a **unified message-list representation**: |
| ```json |
| [ |
| {"role": "system", "content": "You are a coding agent..."}, |
| {"role": "user", "content": "Fix the bug in utils.py where..."}, |
| {"role": "assistant", "content": "I'll analyze the issue...", "tool_calls": [...]}, |
| {"role": "tool", "content": "Error: NameError at line 42..."}, |
| {"role": "assistant", "content": "The error indicates..."} |
| ] |
| ``` |
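|
|
| A sketch of a normalizer that coerces heterogeneous rows into this shape. The fallback logic (tool calls stored with `name`/`arguments` at the top level, arguments as either a dict or a JSON string) is an assumption about upstream formats; adapt it per source:
| ```python
| import json
|
| def normalize_tool_calls(message: dict) -> dict:
|     """Coerce any tool-call representation to the OpenAI-style structure."""
|     normalized = []
|     for call in message.get("tool_calls") or []:
|         fn = call.get("function", call)  # some sources keep name/arguments at the top level
|         args = fn.get("arguments", "{}")
|         normalized.append({
|             "type": "function",
|             "function": {
|                 "name": fn.get("name", ""),
|                 # arguments stay a JSON string, matching the OpenAI schema
|                 "arguments": args if isinstance(args, str) else json.dumps(args),
|             },
|         })
|     out = {"role": message["role"], "content": message.get("content") or ""}
|     if normalized:
|         out["tool_calls"] = normalized
|     return out
|
| def normalize_example(example: dict) -> dict:
|     return {"messages": [normalize_tool_calls(m) for m in example["messages"]]}
| ```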
|
|
| ### 4.2 The Multi-Template Trick (Critical for SOTA) |
|
|
| **This is the single most important SFT trick for agentic robustness.** |
|
|
| Qwen3-Coder-Next and GLM-5 both demonstrated that models trained on a single tool-calling format overfit to that format and fail when deployed in tools with different conventions (e.g., Pi agent vs. Cline vs. OpenCode). |
|
|
| **Action:** For each trajectory in your SFT data, randomly sample one of 4-5 tool templates: |
| 1. **OpenAI JSON:** `{"type": "function", "function": {"name": "bash", "arguments": "..."}}` |
| 2. **XML-style:** `<tool_call><name>bash</name><arguments>cd /workspace && ls</arguments></tool_call>` |
| 3. **Python-style:** `bash(command="cd /workspace && ls")` |
| 4. **TypeScript interface:** `{ tool: "bash", args: { command: "..." } }` |
| 5. **Qwen3-Coder native XML:** `qwen3_coder` format for string-heavy arguments |
|
|
| > Klear-AgentForge explicitly credits format diversity for its strong BFCL v3 generalization. GLM-5 showed that increasing from 1 to 5 templates measurably improves downstream robustness. |
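|
|
| A sketch of the randomization step: one renderer is sampled per trajectory and applied to every assistant tool call. The renderers mirror the illustrative templates above; for simplicity each call is rendered in-line into the assistant content (for the native OpenAI format you may instead keep the structured `tool_calls` field), and the Qwen3-Coder native template is omitted here:
| ```python
| import json
| import random
|
| def to_openai_json(name: str, args: dict) -> str:
|     return json.dumps({"type": "function", "function": {"name": name, "arguments": json.dumps(args)}})
|
| def to_xml(name: str, args: dict) -> str:
|     return f"<tool_call><name>{name}</name><arguments>{json.dumps(args)}</arguments></tool_call>"
|
| def to_python(name: str, args: dict) -> str:
|     rendered_args = ", ".join(f"{k}={v!r}" for k, v in args.items())
|     return f"{name}({rendered_args})"
|
| def to_typescript(name: str, args: dict) -> str:
|     return json.dumps({"tool": name, "args": args})
|
| TEMPLATES = [to_openai_json, to_xml, to_python, to_typescript]
|
| def reformat_trajectory(messages: list, rng: random.Random) -> list:
|     """Sample ONE template per trajectory and re-render every assistant tool call with it."""
|     render = rng.choice(TEMPLATES)
|     out = []
|     for msg in messages:
|         msg = dict(msg)
|         calls = msg.pop("tool_calls", None) if msg.get("role") == "assistant" else None
|         if calls:
|             rendered = "\n".join(
|                 render(c["function"]["name"], json.loads(c["function"]["arguments"]))
|                 for c in calls
|             )
|             msg["content"] = ((msg.get("content") or "") + "\n" + rendered).strip()
|         out.append(msg)
|     return out
| ```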
|
|
| ### 4.3 SFT Configuration |
|
|
| ```python |
| from trl import SFTTrainer, SFTConfig |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_id = "nvidia/Nemotron-Terminal-8B" |
| model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16") |
| tokenizer = AutoTokenizer.from_pretrained(model_id) |
| |
| sft_config = SFTConfig( |
|     output_dir="./sft-agentic-coding",
|     num_train_epochs=3,
|     per_device_train_batch_size=2,
|     gradient_accumulation_steps=8,  # effective batch = 16
|     learning_rate=2e-5,
|     max_seq_length=16384,  # long context for multi-turn trajectories
|     logging_strategy="steps",
|     logging_steps=10,
|     save_strategy="epoch",
|     bf16=True,
|     gradient_checkpointing=True,
|     push_to_hub=True,
|     hub_model_id="your-username/agentic-coder-sft-v1",
| )
|
| trainer = SFTTrainer(
|     model=model,
|     tokenizer=tokenizer,
|     train_dataset=mixed_dataset,  # your 60/30/10 mix
|     args=sft_config,
| ) |
| trainer.train() |
| ``` |
|
|
| **Context length:** Use **16K minimum**, **32K-64K preferred** for SWE-bench trajectories. Nemotron-Terminal and GLM-5 both train at 48K-64K context. |
|
|
| **Learning rate:** 1e-5 to 2e-5 for full fine-tuning; 5e-5 for LoRA (if VRAM-constrained). |
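|
|
| If you take the LoRA route, a minimal PEFT setup might look like the following; the rank, alpha, and target modules are common defaults for Qwen-style attention/MLP projections, not values reported by the cited papers:
| ```python
| from peft import LoraConfig
|
| lora_config = LoraConfig(
|     r=32,
|     lora_alpha=64,
|     lora_dropout=0.05,
|     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
|     task_type="CAUSAL_LM",
| )
|
| # Pass alongside the SFT config above:
| # trainer = SFTTrainer(model=model, args=sft_config, train_dataset=mixed_dataset, peft_config=lora_config)
| ```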
|
|
| --- |
|
|
| ## 5. Stage 2: Reinforcement Learning (RL) |
|
|
| ### 5.1 Reward Design: From Sparse to Dense |
|
|
| Agentic RL suffers from **sparse rewards**: the model only learns if the final patch passes all tests, which may be 50+ turns away. Three strategies address this: |
|
|
| **A. Outcome Reward Model (ORM):** Binary reward at trajectory end (pass/fail). Simple but sample-inefficient. |
|
|
| **B. Process Reward Model (PRM):** Line-by-line or step-by-step rewards. ACECODER (arXiv:2502.01718) and Klear-AgentForge use automated test-case synthesis to generate intermediate verification signals. |
|
|
| **C. Turn-Level Pass Rate:** Use `pass_rate` from `Nemotron-RL-Agentic-SWE-Pivot-v1` as a continuous reward at each step. This is the most practical open-source signal. |
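|
|
| These signals compose; a common pattern is a weighted sum of the dense turn-level pass rate (C) and the sparse final outcome (A). The weights below are illustrative defaults, not ablated values from the cited papers:
| ```python
| def combined_reward(step_pass_rate: float, final_resolved: bool,
|                     w_step: float = 0.3, w_outcome: float = 0.7) -> float:
|     """Blend the dense per-turn signal with the sparse end-of-trajectory outcome."""
|     return w_step * step_pass_rate + w_outcome * (1.0 if final_resolved else 0.0)
| ```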
|
|
| ### 5.2 RL Algorithm: GRPO for Agentic Tasks |
|
|
| For 8B models, **Group Relative Policy Optimization (GRPO)** is preferred over PPO because: |
| - It eliminates the need for a separate value network (saves ~30% VRAM) |
| - It handles sparse rewards better by comparing responses within a group |
| - It is the standard in recent open agentic RL work (Klear-AgentForge, GLM-5) |
|
|
| ```python |
| from trl import GRPOTrainer, GRPOConfig |
| |
| grpo_config = GRPOConfig( |
|     output_dir="./grpo-agentic",
|     num_train_epochs=1,
|     per_device_train_batch_size=1,
|     gradient_accumulation_steps=16,
|     learning_rate=1e-6,  # lower LR for RL
|     max_prompt_length=4096,
|     max_completion_length=12288,  # 12K for agent rollouts
|     num_generations=8,  # group size for GRPO
|     temperature=0.7,
|     logging_strategy="steps",
|     logging_steps=5,
|     push_to_hub=True,
|     hub_model_id="your-username/agentic-coder-grpo-v1",
| )
|
| trainer = GRPOTrainer(
|     model=sft_model,  # from Stage 1
|     reward_funcs=[execution_reward_fn],  # your pass_rate scorer
|     args=grpo_config,
|     train_dataset=rl_dataset,
| ) |
| trainer.train() |
| ``` |
|
|
| ### 5.3 Execution Environment for Reward Computation |
|
|
| You need a **sandboxed execution environment** to compute rewards: |
|
|
| ```python |
| import subprocess
|
| def execution_reward_fn(trajectory: list, test_command: str) -> float:
|     """
|     Extract the final patch/code from the trajectory, apply it in a sandbox,
|     run the tests, and return a reward (binary pass/fail shown here; return
|     passed_tests / total_tests for a denser signal).
|     """
|     # 1. Parse assistant messages for bash commands or patch diffs
|     # 2. Replay commands in a Docker/containerized sandbox
|     # 3. Run `pytest` or `python -m unittest`
|     # 4. Score the result
|
|     # Example using the mini-swe-agent-plus approach: the repo is already cloned
|     # and the patch applied inside a long-running "swe-sandbox" container.
|     result = subprocess.run(
|         ["docker", "exec", "swe-sandbox", test_command],
|         capture_output=True, text=True, timeout=120,
|     )
|     return 1.0 if result.returncode == 0 else 0.0
|
| # Note: TRL's GRPOTrainer calls reward functions as reward_fn(prompts, completions, **kwargs)
| # and expects a list of floats, so wrap this scorer accordingly.
| ``` |
|
|
| **Docker sandboxing** (per Nemotron-Terminal and SWE-bench): |
| - Each task gets an isolated container |
| - Mount the repository at `/workspace` |
| - Run commands via `docker exec` or `subprocess.run` in the container |
| - Timeout: 120s per command, 200 steps max per trajectory |
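|
|
| A sketch of launching such a sandbox from Python (the image name and mount path are placeholders; the SWE-bench and Terminal-Bench harnesses ship their own per-task images):
| ```python
| import subprocess
|
| # Start a long-running container named "swe-sandbox" with the task repo mounted at /workspace.
| subprocess.run([
|     "docker", "run", "-d", "--name", "swe-sandbox",
|     "-v", "/path/to/task-repo:/workspace",  # placeholder path
|     "-w", "/workspace",
|     "python:3.11", "sleep", "infinity",     # placeholder image
| ], check=True)
|
| # Individual agent commands are then replayed with a per-command timeout, e.g.:
| # subprocess.run(["docker", "exec", "swe-sandbox", "bash", "-lc", command], timeout=120)
| ```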
|
|
| ### 5.4 Asynchronous RL (SOTA Infrastructure Trick) |
|
|
| GLM-5 and Nemotron-Terminal both use **asynchronous RL** to solve the GPU idle problem: |
|
|
| 1. **Decouple inference and training engines** onto different GPUs |
| 2. Inference engine continuously generates trajectories |
| 3. When a batch threshold is reached, send to training engine |
| 4. Periodically sync weights from training -> inference |
| 5. **Reset optimizer after each weight sync** to handle off-policy drift |
|
|
| For a single-node 8B setup, a simplified version (sketched in code after this list):
| - Use `vLLM` for batched inference generation |
| - Accumulate trajectories in a replay buffer |
| - Train with GRPO on filled batches |
| - This alone improves throughput 2-3x over synchronous generation |
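|
|
| A single-node sketch of that simplified loop, with a thread-safe queue standing in for the replay buffer. In practice the rollout and training loops run in separate processes (or on separate GPUs); the threads, buffer size, and batch threshold below are assumptions kept only to make the sketch short:
| ```python
| import queue
| import threading
| import time
|
| from vllm import LLM, SamplingParams
|
| rollout_buffer: "queue.Queue[dict]" = queue.Queue(maxsize=512)
| BATCH_THRESHOLD = 64  # trajectories per GRPO update (assumption; tune to your GPU budget)
|
| def rollout_worker(llm: LLM, prompts: list) -> None:
|     """Generate trajectories continuously and push them into the shared buffer."""
|     params = SamplingParams(temperature=0.7, max_tokens=4096)
|     while True:
|         for output in llm.generate(prompts, params):
|             rollout_buffer.put({
|                 "token_ids": list(output.outputs[0].token_ids),  # raw IDs for TITO (Section 5.5)
|                 "text": output.outputs[0].text,
|             })
|
| def training_loop(train_step) -> None:
|     """Drain the buffer into GRPO updates whenever enough rollouts have accumulated."""
|     while True:
|         if rollout_buffer.qsize() < BATCH_THRESHOLD:
|             time.sleep(1.0)
|             continue
|         batch = [rollout_buffer.get() for _ in range(BATCH_THRESHOLD)]
|         train_step(batch)  # after N updates, sync new weights back into the inference engine
|
| # threading.Thread(target=rollout_worker, args=(llm, prompts), daemon=True).start()
| # training_loop(my_grpo_step)
| ```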
|
|
| ### 5.5 Token-in-Token-Out (TITO) for Stability |
|
|
| **Critical implementation detail from GLM-5:** |
| - **TITO:** The training pipeline consumes the exact token IDs produced by the inference engine; nothing is re-tokenized.
| - **Avoid text-in-text-out:** re-tokenizing decoded text introduces boundary mismatches, whitespace errors, and special-token misalignment---especially catastrophic when tool calls are streamed or truncated.
|
|
| **Implementation:** |
| ```python |
| # During rollout, capture token IDs alongside text |
| from vllm import LLM, SamplingParams |
| |
| llm = LLM(model="your-sft-model", dtype="bfloat16") |
| sampling_params = SamplingParams(temperature=0.7, max_tokens=4096) |
| |
| outputs = llm.generate(prompts, sampling_params) |
| for output in outputs: |
|     token_ids = output.outputs[0].token_ids  # <-- keep these!
|     text = output.outputs[0].text
|     # Store (token_ids, text, logprobs) for RL training
| ``` |
|
|
| --- |
|
|
| ## 6. SOTA Tricks & Ablated Insights |
|
|
| ### 6.1 Data Mixing & Curriculum |
|
|
| | Finding | Source | Action | |
| |---|---|---| |
| | Sampling multiple trajectories per query scales about as well as adding more single-trajectory queries | Klear-AgentForge | Simplify collection by sampling several trajectories per prompt |
| | Reasoning SFT on reasoning models hurts agentic performance | Klear-AgentForge | **Do NOT** start from a heavy reasoning-distilled base for agentic tasks | |
| | Tool-call format correctness training raises performance ceiling | Qwen3-Coder-Next | Add explicit format-validation loss term | |
| | 60/30/10 SWE/ToolUse/CodeAct mix is empirically optimal | This guide | Start here, then ablate on your target benchmark | |
|
|
| ### 6.2 Format-Aware Regularization |
|
|
| DR-Venus (arXiv:2604.19859) introduced **format-aware regularization**: penalize the model when it deviates from the expected tool-call schema even if the underlying action is correct. This prevents format drift, where the model earns outcome reward while emitting tool calls that the deployed parser cannot read.
|
|
| ```python |
| import json
|
| def format_reward(completion: str, expected_schema: str = "openai_json") -> float:
|     # 1.0 if the call parses and matches the schema, 0.0 if parseable but off-schema,
|     # -0.5 if it cannot be parsed at all (only the OpenAI-JSON case is sketched here).
|     try:
|         call = json.loads(completion)
|     except (json.JSONDecodeError, TypeError):
|         return -0.5
|     return 1.0 if isinstance(call, dict) and call.get("type") == "function" and "name" in call.get("function", {}) else 0.0
| ``` |
|
|
| ### 6.3 Self-Correction & Trajectory Purification |
|
|
| CLEANER (arXiv:2601.15141) showed that **self-purifying trajectories** during data collection improves RL sample efficiency. During SFT data generation: |
| 1. Generate trajectory with model |
| 2. If it fails, prompt the model to self-correct |
| 3. Keep the corrected trajectory; discard the failed one |
| 4. This is especially effective for 7-8B models with limited exploration capacity |
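|
|
| A sketch of that collection loop; the rollout, verification, and correction-prompt callables are placeholders for your own generation and sandbox code:
| ```python
| from typing import Callable, Optional
|
| def collect_purified(
|     task: dict,
|     generate: Callable[[dict], list],          # your rollout loop -> list of messages
|     verify: Callable[[dict, list], bool],      # sandboxed test run -> pass/fail
|     correct_prompt: Callable[[dict, list], dict],  # task + failed trajectory -> retry task
|     max_retries: int = 1,
| ) -> Optional[list]:
|     """Keep only trajectories that (eventually) pass; discard uncorrectable failures."""
|     trajectory = generate(task)
|     if verify(task, trajectory):
|         return trajectory
|     for _ in range(max_retries):
|         trajectory = generate(correct_prompt(task, trajectory))
|         if verify(task, trajectory):
|             return trajectory
|     return None  # failed even after self-correction: drop from the SFT pool
| ```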
|
|
| ### 6.4 Pairwise Judging for SFT Quality |
|
|
| Qwen3-Coder-Next uses a **pairwise judging model** to rank candidate responses: |
| 1. For each prompt, sample n=4 responses from a strong teacher model |
| 2. Form all C(n,2) pairs |
| 3. Judge model scores each pair on: factual accuracy, task usefulness, style |
| 4. SFT on the top-ranked responses only |
|
|
| You can approximate this with a strong off-the-shelf judge (for example, a 70B-class open-weight instruct model or `GPT-4o`) run in batches.
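|
|
| A sketch of the ranking step, assuming a `judge(prompt, a, b)` callable that returns the index (0 or 1) of the preferred response; the win-count tie-breaking is an assumption, since the exact scoring rubric is not public:
| ```python
| from itertools import combinations
| from typing import Callable
|
| def select_best(prompt: str, candidates: list, judge: Callable[[str, str, str], int]) -> str:
|     """Round-robin pairwise judging: keep the candidate with the most pairwise wins."""
|     wins = [0] * len(candidates)
|     for i, j in combinations(range(len(candidates)), 2):  # all C(n, 2) pairs
|         winner = i if judge(prompt, candidates[i], candidates[j]) == 0 else j
|         wins[winner] += 1
|     return candidates[max(range(len(candidates)), key=wins.__getitem__)]
| ```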
|
|
| ### 6.5 Multiple Tool Chat Templates (Reiterated) |
|
|
| We cannot stress this enough. If you train on only one JSON schema and deploy in Pi agent (which may use XML or Python-style tools), your model will fail. During training, **randomly reformat every trajectory** with one of 4-5 templates. The model learns format-invariant behavior. |
|
|
| --- |
|
|
| ## 7. Evaluation Benchmarks |
|
|
| Validate at each checkpoint (SFT end, RL milestones) on this suite: |
|
|
| | Benchmark | Domain | Metric | Target (8B) | Reference | |
| |---|---|---|---|---| |
| | **SWE-bench Verified** | Real GitHub issue fixing | % resolved | 20-40% | Klear-AgentForge: 39.4% | |
| | **SWE-bench Lite** | Easier SWE subset | % resolved | 30-50% | SWE-agent-LM-7B: 22.8% | |
| | **Terminal-Bench 2.0** | Terminal/agent tasks | Accuracy | 15-25% | Nemotron-T-8B: ~baseline; T-14B: 20.2% | |
| | **BFCL v3** | Function calling | Overall score | 65-75% | Klear-AgentForge: 71.5% | |
| | **Aider-Polyglot** | Multi-language editing | % correct | 25-40% | Klear-AgentForge: 33.8% | |
| | **tau-bench** (Retail + Airline) | Multi-turn tool use | Avg@4 | 40-55% | Klear-AgentForge: 56.7% (Retail) | |
| | **HumanEval** | Basic code generation | pass@1 | 80%+ | Baseline sanity check | |
| | **LiveCodeBench** | Competitive coding | pass@1 | 30-40% | General reasoning validation | |
|
|
| **Evaluation protocol:** |
| - Use `mini-swe-agent-plus` scaffold (bash + string-replacement tool) for SWE-bench |
| - Use `Terminus 2` JSON scaffold for Terminal-Bench |
| - Temperature = 0.7, top_p = 0.95, max_length = 16K-64K |
| - Run each benchmark 3-4 times and average (agentic tasks are high-variance) |
|
|
| --- |
|
|
| ## 8. Deployment in Pi Agent & Open-Source Tools |
|
|
| ### 8.1 Pi Agent Integration |
|
|
| Pi and similar coding agents typically expect: |
| 1. An OpenAI-compatible API endpoint (`/v1/chat/completions`) |
| 2. Support for `tools` / `functions` parameter |
| 3. Streaming responses with `delta` chunks |
|
|
| **Setup:** |
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer
|
| model = AutoModelForCausalLM.from_pretrained(
|     "your-username/agentic-coder-grpo-v1",
|     torch_dtype="bfloat16",
|     device_map="auto",
| )
| tokenizer = AutoTokenizer.from_pretrained("your-username/agentic-coder-grpo-v1") |
| |
| # Wrap in a vLLM or TGI server for API compatibility |
| # vllm serve your-username/agentic-coder-grpo-v1 --dtype bfloat16 --max-model-len 32768 |
| ``` |
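|
|
| Once the server is up, any OpenAI-compatible client can drive the model with tools. A minimal smoke test (the endpoint, API key, and tool definition are illustrative):
| ```python
| from openai import OpenAI
|
| client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
|
| response = client.chat.completions.create(
|     model="your-username/agentic-coder-grpo-v1",
|     messages=[{"role": "user", "content": "List the Python files in /workspace."}],
|     tools=[{
|         "type": "function",
|         "function": {
|             "name": "bash",
|             "description": "Execute a shell command in the sandbox",
|             "parameters": {
|                 "type": "object",
|                 "properties": {"command": {"type": "string"}},
|                 "required": ["command"],
|             },
|         },
|     }],
| )
| print(response.choices[0].message.tool_calls)
| ```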
|
|
| ### 8.2 System Prompt for Agent Mode |
|
|
| ``` |
| You are an expert software engineering agent. You have access to the following tools: |
| - bash: Execute shell commands in a sandboxed environment |
| - view: View file contents |
| - edit: Apply string replacements to files |
| - submit: Submit your final solution |
| |
| You must reason step-by-step, then select the appropriate tool. Always wait for tool results before proceeding. |
| ``` |
|
|
| ### 8.3 Handling Different Tool Formats |
|
|
| Since you trained on multiple templates, the model should generalize. However, at inference time: |
| - **Detect** the tool format from the system prompt (JSON vs XML vs Python) |
| - **Wrap** the system prompt with explicit format instructions |
| - **Parse** model outputs with the corresponding parser |
|
|
| ```python |
| def detect_format(system_prompt: str) -> str: |
|     if "<tool_call>" in system_prompt:
|         return "xml"
|     elif "functions" in system_prompt or '"type": "function"' in system_prompt:
|         return "openai_json"
|     elif "tool_name(" in system_prompt:
|         return "python"
|     return "openai_json"  # default
| ``` |
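|
|
| A matching dispatch sketch for the parsing side; the regexes are deliberately naive (single tool call, double-quoted string arguments) and are assumptions to replace with the target agent's own parser:
| ```python
| import json
| import re
|
| def parse_tool_call(output: str, fmt: str) -> dict:
|     """Dispatch on the detected format and return {"name": ..., "arguments": ...}."""
|     if fmt == "xml":
|         match = re.search(r"<tool_call><name>(.*?)</name><arguments>(.*?)</arguments></tool_call>",
|                           output, re.S)
|         return {"name": match.group(1), "arguments": match.group(2)}
|     if fmt == "python":
|         match = re.search(r"(\w+)\((.*)\)", output, re.S)
|         return {"name": match.group(1),
|                 "arguments": dict(re.findall(r'(\w+)="([^"]*)"', match.group(2)))}
|     call = json.loads(output)["function"]  # default: OpenAI-style JSON
|     return {"name": call["name"], "arguments": call["arguments"]}
| ```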
|
|
| --- |
|
|
| ## 9. Full Training Recipe Summary |
|
|
| ``` |
| BASE MODEL: nvidia/Nemotron-Terminal-8B |
| |
| STAGE 1 - SFT (3 epochs, ~2.4B tokens total) |
| ├── 60% SWE-bench/SWE-smith-trajectories (tool split, resolved=True only)
| ├── 30% nvidia/Nemotron-Agentic-v1 (interactive_agent + tool_calling)
| ├── 10% xingyaoww/code-act + smirki/Agentic-Coding-Tessa
| ├── CRITICAL: Apply 4-5 random tool chat templates per sample
| ├── Context: 16384-32768 tokens
| ├── LR: 2e-5, batch: 2x8 (per_device x accum)
| └── Save: agentic-coder-sft-v1
|
| STAGE 2 - RL (1-2 epochs)
| ├── Dataset: nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1
| ├── Algorithm: GRPO (group_size=8, temperature=0.7)
| ├── Reward: pass_rate from sandboxed test execution
| ├── Environment: Docker sandbox per task (120s timeout)
| ├── Infrastructure: vLLM for async generation + training loop
| ├── TITO: Use raw token IDs from vLLM, never re-tokenize
| ├── LR: 1e-6, batch: 1x16
| └── Save: agentic-coder-grpo-v1
|
| EVALUATION
| ├── SWE-bench Verified (primary)
| ├── Terminal-Bench 2.0
| ├── BFCL v3
| ├── Aider-Polyglot
| └── tau-bench
|
| DEPLOYMENT
| ├── vLLM server with OpenAI-compatible API
| ├── System prompt with explicit tool format
| └── Docker sandbox for live tool execution
| ``` |
|
|
| --- |
|
|
| ## 10. Conclusion |
|
|
| Building a state-of-the-art agentic coding assistant at the 8B scale is now feasible with open-source components. The keys are: |
|
|
| 1. **Start from the right base:** Nemotron-Terminal-8B is pre-trained for this. |
| 2. **Curate high-quality trajectories:** SWE-smith + Nemotron-Agentic-v1 are the gold standard. |
| 3. **Train on multiple tool formats:** This is the highest-ROI generalization trick. |
| 4. **Use execution-verified RL:** GRPO with pass_rate rewards, not just a binary final outcome.
| 5. **Build async infrastructure:** vLLM + decoupled generation saves 2-3x training time. |
| 6. **Validate on real benchmarks:** SWE-bench, Terminal-Bench, BFCL---not just HumanEval. |
| |
| This recipe produces a model deployable in Pi agent, Cline, OpenCode, or any OpenAI-compatible coding tool, capable of autonomous repository-level bug fixing, multi-turn terminal interaction, and robust function calling across diverse API formats. |
| |
| --- |
| |
| ## References |
| |
| 1. NVIDIA. *Nemotron-Terminal: Scalable Training for Terminal-Capable Language Models.* arXiv:2602.21193, 2026. |
| 2. Klear-AI. *Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling.* arXiv:2511.05951, 2025. |
| 3. Zhipu AI. *GLM-5: from Vibe Coding to Agentic Engineering.* arXiv:2602.15763, 2026. |
| 4. Alibaba Qwen. *Qwen3-Coder-Next Technical Report.* arXiv:2603.00729, 2026. |
| 5. SWE-bench Team. *SWE-Smith: A Scalable Dataset for Software Engineering Agents.* arXiv:2504.21798, 2025. |
| 6. Yang et al. *ACECODER: Acing Coder RL via Automated Test-Case Synthesis.* arXiv:2502.01718, 2025. |
| 7. Yang et al. *CodeScaler: Scaling Code LLM Training via Execution-Free Reward Models.* arXiv:2602.17684, 2026. |
| 8. Wang et al. *CLEANER: Self-Purified Trajectories Boost Agentic RL.* arXiv:2601.15141, 2026. |
| 9. inclusionAI. *DR-Venus: Deep Research Agents with 10K Open Data.* arXiv:2604.19859, 2026. |
| 10. xingyaoww. *Executable Code Actions Elicit Better LLM Agents (CodeAct).* arXiv:2402.01030, 2024. |
| |
| --- |
| |
| ## Dataset & Model Links |
| |
| - Base Model: https://hf.co/nvidia/Nemotron-Terminal-8B |
| - SFT Data: https://hf.co/datasets/SWE-bench/SWE-smith-trajectories |
| - SFT Data: https://hf.co/datasets/nvidia/Nemotron-Agentic-v1 |
| - SFT Data: https://hf.co/datasets/xingyaoww/code-act |
| - RL Data: https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1 |
| - RL Data: https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-Function-Calling-Pivot-v1 |
| |