# Training Guide for Agentic Coding with Open-Source 8B Models: A Practical Recipe from SFT to RL

**A consolidated training guide based on Nemotron-Terminal-8B, Klear-AgentForge, GLM-5, and Qwen3-Coder-Next research**

---

## Abstract

We present a practical, end-to-end training guide for building state-of-the-art agentic coding assistants using open-source 8B-parameter models. Starting from the **NVIDIA Nemotron-Terminal-8B** base model---the only <10B-parameter model explicitly trained for terminal/code-agent interaction---we detail a two-stage pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL) backed by the highest-quality publicly available datasets. We incorporate insights from recent landmark work on multi-format tool-template training, asynchronous RL infrastructure, execution-verified reward models, and synthetic trajectory generation. The resulting model is deployable in the **Pi coding agent** and any other open-source coding tool that interfaces with LLMs via standard inference APIs. We also provide a validated benchmark suite and SOTA tricks extracted from peer-reviewed results.

---

## 1. Introduction & Motivation

The shift from "vibe coding" (human-prompted code generation) to **agentic engineering** (AI agents that plan, execute, and iterate autonomously) is the defining frontier in software development AI. Frontier closed-source systems---Claude Code, Codex CLI---demonstrate that terminal interaction and multi-turn tool use are now core capabilities. However, the training recipes behind these systems remain undisclosed.

Recent open research has closed this gap significantly:

- **Nemotron-Terminal** (arXiv:2602.21193) showed that targeted SFT on terminal-adapted datasets can lift a Qwen3-8B base to **20.2% on Terminal-Bench 2.0**, competitive with much larger models.
- **Klear-AgentForge** (arXiv:2511.05951) achieved **71.5% BFCL v3** and **39.4% SWE-bench Verified** on an 8B model through unified SFT + RL across tool-use and coding domains.
- **GLM-5** (arXiv:2602.15763) demonstrated that asynchronous RL with decoupled rollout/train engines and multi-format tool training yields state-of-the-art open-weight performance on long-horizon coding tasks.
- **Qwen3-Coder-Next** (arXiv:2603.00729) proved that training on **multiple tool chat templates** (JSON, XML, Python-style, TypeScript) is critical for format-robust agentic behavior.

This guide consolidates these findings into a single reproducible recipe.

---

## 2. Base Model Selection: Why Nemotron-Terminal-8B

For agentic coding under 10B parameters, the choice of base model is the highest-leverage decision.

| Model | Params | Release | License | Agent-Specific Training? |
|---|---|---|---|---|
| **Nemotron-Terminal-8B** | 8.2B | Feb 2026 | NVIDIA (other) | **Yes** - terminal/code-agent SFT on Qwen3 backbone |
| Qwen3-8B-Base | 8.2B | Apr 2025 | Apache-2.0 | No (raw base) |
| Mistral-3-8B-Base | 8.9B | Oct 2025 | Apache-2.0 | No (raw base) |
| Gemma-4-E4B-it | 8.0B | Mar 2026 | Apache-2.0 | No (multimodal generalist) |

**Nemotron-Terminal-8B** (https://hf.co/nvidia/Nemotron-Terminal-8B) is uniquely suited because:

1. It is already SFT'd for terminal interaction and bash/code execution scaffolding.
2. It uses the Qwen3 architecture, which has native `tool_calls` support in its tokenizer and chat template (see the sketch after this list).
3. It is small enough for single-GPU RL training (16GB VRAM with LoRA; 24GB+ for full SFT) yet large enough for complex reasoning.
4. Its training corpus (Terminal-Corpus) includes adapted competitive coding, math, and software engineering tasks---the exact domains needed for agentic coding.
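Point 2 is easy to sanity-check before committing to the pipeline. The sketch below is a non-authoritative example: it assumes the published tokenizer ships a tool-aware, Qwen3-style chat template (accepting a `tools` argument), and the `bash` tool schema is hypothetical.

```python
# Minimal sketch: confirm the tokenizer renders tool schemas natively.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Terminal-8B")

# Hypothetical tool schema, OpenAI function-calling style.
tools = [{
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command in the sandbox",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "List the files in /workspace."},
]

# If the template is tool-aware, the rendered prompt will contain the
# serialized tool schema; if it does not, fall back to embedding tool
# descriptions in the system prompt.
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)
```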
> **Hardware recommendation:** Start with `a10g-large` (24GB) for SFT; use `a100-large` (80GB) or `a10g-largex4` for full-model RL with large batch sizes.

---

## 3. Dataset Curation: The Foundation of Agentic Capability

### 3.1 SFT Datasets (Multi-Domain Mix)

We recommend a **60/30/10** mix by token volume, normalized to the `messages` (ChatML) format with `tool_calls`.

#### Tier 1: Software Engineering Trajectories (60%)

**`SWE-bench/SWE-smith-trajectories`** (https://hf.co/datasets/SWE-bench/SWE-smith-trajectories)

- **Size:** ~5,017 trajectories (~3GB across splits)
- **Format:** Multi-turn `messages` with `role`, `content`, `tool_calls`
- **Provenance:** Used to train SWE-agent-LM-32B and adopted by Klear-AgentForge
- **Use the `tool` split** for standard OpenAI-style function calling
- **Key feature:** Each trajectory includes a `resolved` bool and a `patch` diff---use these for filtering (keep only `resolved=True` for SFT)

**Preprocessing:**

```python
from datasets import load_dataset

# Keep only trajectories whose final patch actually resolved the issue.
ds = load_dataset("SWE-bench/SWE-smith-trajectories", split="tool")
ds = ds.filter(lambda x: x["resolved"])

# Each row's `messages` column is a list of dicts; ensure tool_calls follow
# {"type": "function", "function": {"name": ..., "arguments": ...}}
```

#### Tier 2: General Tool-Use & Function Calling (30%)

**`nvidia/Nemotron-Agentic-v1`** (https://hf.co/datasets/nvidia/Nemotron-Agentic-v1)

- **Size:** 100K+ trajectories
- **Format:** `messages` with `tool_calls`, `reasoning`, and `tools` metadata
- **Splits:** `interactive_agent` (multi-turn conversation) and `tool_calling` (single-turn function calling)
- **License:** CC-BY-4.0

For a cleaned 335K-trajectory variant in strict reasoning format, use:
**`AmanPriyanshu/tool-reasoning-sft-CODING-nvidia-Nemotron-Agentic-v1`**

#### Tier 3: Executable Code-as-Action & General Coding (10%)

**`xingyaoww/code-act`** (CodeActInstruct) (https://hf.co/datasets/xingyaoww/code-act)

- Teaches the model to use Python execution as its action space
- Includes decision-making (ALFWorld), tabular reasoning (WikiTableQuestions), and code tasks

**`smirki/Agentic-Coding-Tessa`** (https://hf.co/datasets/smirki/Agentic-Coding-Tessa)

- Mixed reasoning + SWE trajectories; axolotl-compatible

**`AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset-v1.1`**

- Explicit step-by-step reasoning before code generation

### 3.2 RL Datasets (Execution-Verified Rewards)

**`nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1`** (https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1)

- **Size:** 10K-100K rows (~4.8GB)
- **Format:** Step-level behavior pairs with `pass_rate` as the reward signal
- **Use case:** GRPO/PPO training where each assistant step is scored by test pass rate
- **Contains:** `expected_action`, `ref_message`, `pass_rate_total`, `pass_rate_passed`

**`nvidia/Nemotron-RL-Agentic-Function-Calling-Pivot-v1`** and **`nvidia/Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1`**

- For RL specifically targeting function-calling accuracy and multi-turn conversation
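Before moving on to training: to assemble the Tier 1-3 SFT mix from Section 3.1, one simple approach is probability-weighted interleaving. The sketch below is illustrative only: it assumes each tier has already been normalized to the same `messages` schema, the Tier 3 split name is an assumption (check the dataset card), and sampling probabilities approximate the 60/30/10 token-volume split by example count (re-weight if your tiers have very different average trajectory lengths).

```python
from datasets import interleave_datasets, load_dataset

# Assumes each tier has been normalized to a shared `messages` schema first.
tier1 = load_dataset("SWE-bench/SWE-smith-trajectories", split="tool")
tier2 = load_dataset("nvidia/Nemotron-Agentic-v1", split="interactive_agent")
tier3 = load_dataset("xingyaoww/code-act", split="train")  # split name assumed

# Approximate the 60/30/10 mix by sampling probability; "all_exhausted"
# keeps cycling the smaller tiers until the largest one is consumed.
mixed_dataset = interleave_datasets(
    [tier1, tier2, tier3],
    probabilities=[0.6, 0.3, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
```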
---

## 4. Stage 1: Supervised Fine-Tuning (SFT)

### 4.1 Data Format Normalization

All datasets must be coerced to a **unified message-list representation**:

```json
[
  {"role": "system", "content": "You are a coding agent..."},
  {"role": "user", "content": "Fix the bug in utils.py where..."},
  {"role": "assistant", "content": "I'll analyze the issue...", "tool_calls": [...]},
  {"role": "tool", "content": "Error: NameError at line 42..."},
  {"role": "assistant", "content": "The error indicates..."}
]
```

### 4.2 The Multi-Template Trick (Critical for SOTA)

**This is the single most important SFT trick for agentic robustness.**

Qwen3-Coder-Next and GLM-5 both demonstrated that models trained on a single tool-calling format overfit to that format and fail when deployed in tools with different conventions (e.g., Pi agent vs. Cline vs. OpenCode).

**Action:** For each trajectory in your SFT data, randomly sample one of 4-5 tool templates:

1. **OpenAI JSON:** `{"type": "function", "function": {"name": "bash", "arguments": "..."}}`
2. **XML-style:** `<bash>cd /workspace && ls</bash>`
3. **Python-style:** `bash(command="cd /workspace && ls")`
4. **TypeScript interface:** `{ tool: "bash", args: { command: "..." } }`
5. **Qwen3-Coder native XML:** the `qwen3_coder` format for string-heavy arguments

> Klear-AgentForge explicitly credits format diversity for its strong BFCL v3 generalization. GLM-5 showed that increasing from 1 to 5 templates measurably improves downstream robustness.

### 4.3 SFT Configuration

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Terminal-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

sft_config = SFTConfig(
    output_dir="./sft-agentic-coding",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch = 16
    learning_rate=2e-5,
    max_seq_length=16384,            # long context for multi-turn trajectories
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/agentic-coder-sft-v1",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=mixed_dataset,  # your 60/30/10 mix
    args=sft_config,
)
trainer.train()
```

**Context length:** Use **16K minimum**, **32K-64K preferred** for SWE-bench trajectories. Nemotron-Terminal and GLM-5 both train at 48K-64K context.

**Learning rate:** 1e-5 to 2e-5 for full fine-tuning; 5e-5 for LoRA (if VRAM-constrained).

---

## 5. Stage 2: Reinforcement Learning (RL)

### 5.1 Reward Design: From Sparse to Dense

Agentic RL suffers from **sparse rewards**: the model only learns whether the final patch passes all tests, which may be 50+ turns away. Three strategies address this:

**A. Outcome Reward Model (ORM):** Binary reward at trajectory end (pass/fail). Simple but sample-inefficient.

**B. Process Reward Model (PRM):** Line-by-line or step-by-step rewards. ACECODER (arXiv:2502.01718) and Klear-AgentForge use automated test-case synthesis to generate intermediate verification signals.

**C. Turn-Level Pass Rate:** Use `pass_rate` from `Nemotron-RL-Agentic-SWE-Pivot-v1` as a continuous reward at each step. This is the most practical open-source signal.
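A minimal sketch of strategy C, assuming the Pivot-v1 fields listed in Section 3.2 (`pass_rate_passed`, `pass_rate_total`); the helper name and the exact column semantics are assumptions to verify against the dataset card.

```python
def turn_level_reward(example: dict) -> float:
    """Continuous per-step reward from the test pass rate.

    Assumes the row carries `pass_rate_passed` / `pass_rate_total`
    as described in Section 3.2; verify names against the dataset card.
    """
    total = example.get("pass_rate_total") or 0
    passed = example.get("pass_rate_passed") or 0
    if total <= 0:
        return 0.0
    return passed / total


# Example: a step where 7 of 10 tests pass earns a reward of 0.7.
print(turn_level_reward({"pass_rate_passed": 7, "pass_rate_total": 10}))
```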
### 5.2 RL Algorithm: GRPO for Agentic Tasks

For 8B models, **Group Relative Policy Optimization (GRPO)** is preferred over PPO because:

- It eliminates the need for a separate value network (saves ~30% VRAM)
- It handles sparse rewards better by comparing responses within a group
- It is the standard in recent open agentic RL work (Klear-AgentForge, GLM-5)

```python
from trl import GRPOTrainer, GRPOConfig

grpo_config = GRPOConfig(
    output_dir="./grpo-agentic",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-6,              # lower LR for RL
    max_prompt_length=4096,
    max_completion_length=12288,     # 12K for agent rollouts
    num_generations=8,               # group size for GRPO
    temperature=0.7,
    logging_strategy="steps",
    logging_steps=5,
    push_to_hub=True,
    hub_model_id="your-username/agentic-coder-grpo-v1",
)

trainer = GRPOTrainer(
    model=sft_model,                      # from Stage 1
    reward_funcs=[execution_reward_fn],   # your pass_rate scorer
    args=grpo_config,
    train_dataset=rl_dataset,
)
trainer.train()
```

### 5.3 Execution Environment for Reward Computation

You need a **sandboxed execution environment** to compute rewards:

```python
import subprocess
import tempfile


def execution_reward_fn(trajectory: list, test_command: str) -> float:
    """
    Extract the final patch/code from the trajectory, apply it to the repo,
    run the tests, and return a reward.

    Full pipeline:
      1. Parse assistant messages for bash commands or patch diffs
      2. Replay commands in a Docker/containerized sandbox
      3. Run `pytest` or `python -m unittest`
      4. Return pass_rate = passed_tests / total_tests

    This simplified version (mini-swe-agent-plus style) returns a binary
    pass/fail reward from a single test command.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # Clone the repo into tmpdir, apply the patch, then run the test
        # command inside the pre-started sandbox container.
        result = subprocess.run(
            ["docker", "exec", "swe-sandbox", "bash", "-lc", test_command],
            capture_output=True, text=True, timeout=120
        )
        passed = result.returncode == 0
        return 1.0 if passed else 0.0
```

**Docker sandboxing** (per Nemotron-Terminal and SWE-bench):

- Each task gets an isolated container
- Mount the repository at `/workspace`
- Run commands via `docker exec` or `subprocess.run` in the container
- Timeout: 120s per command, 200 steps max per trajectory

### 5.4 Asynchronous RL (SOTA Infrastructure Trick)

GLM-5 and Nemotron-Terminal both use **asynchronous RL** to solve the GPU idle problem:

1. **Decouple inference and training engines** onto different GPUs
2. The inference engine continuously generates trajectories
3. When a batch threshold is reached, send the batch to the training engine
4. Periodically sync weights from training -> inference
5. **Reset the optimizer after each weight sync** to handle off-policy drift

For a single-node 8B setup, a simplified version (sketched below) works well:

- Use `vLLM` for batched inference generation
- Accumulate trajectories in a replay buffer
- Train with GRPO on filled batches
- This alone improves throughput 2-3x over synchronous generation
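A minimal, single-process sketch of that simplified setup: vLLM produces a batch of rollouts, each is scored and pushed into a buffer, and a training step fires whenever the buffer is full. The `score_trajectory` and `train_step` helpers are hypothetical placeholders (your reward function from Section 5.3 and your GRPO update, respectively); a true asynchronous setup runs generation and training in separate processes or on separate GPUs and syncs weights periodically.

```python
from collections import deque

from vllm import LLM, SamplingParams

llm = LLM(model="your-username/agentic-coder-sft-v1", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=4096)

buffer = deque()
BATCH_THRESHOLD = 64  # trajectories per training step


def fill_and_train(prompts, score_trajectory, train_step):
    """Accumulate scored rollouts; train whenever the buffer is full."""
    for output in llm.generate(prompts, params):
        completion = output.outputs[0]
        buffer.append({
            "prompt_token_ids": output.prompt_token_ids,
            "token_ids": completion.token_ids,   # keep for TITO (Section 5.5)
            "reward": score_trajectory(completion.text),
        })
        if len(buffer) >= BATCH_THRESHOLD:
            batch = [buffer.popleft() for _ in range(BATCH_THRESHOLD)]
            train_step(batch)  # hypothetical GRPO update on the batch
```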
### 5.5 Token-in-Token-Out (TITO) for Stability

**Critical implementation detail from GLM-5:**

- **TITO:** The training pipeline consumes the exact token IDs produced by the inference engine. No re-tokenization.
- **Why not text-in-text-out:** Re-tokenizing decoded text introduces boundary mismatches, whitespace errors, and special-token misalignment---especially catastrophic when tool calls are streamed or truncated.

**Implementation:**

```python
# During rollout, capture token IDs alongside text
from vllm import LLM, SamplingParams

llm = LLM(model="your-sft-model", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    token_ids = output.outputs[0].token_ids  # <-- keep these!
    text = output.outputs[0].text
    # Store (token_ids, text, logprobs) for RL training
    # (request logprobs via SamplingParams(logprobs=...) if your objective needs them)
```

---

## 6. SOTA Tricks & Ablated Insights

### 6.1 Data Mixing & Curriculum

| Finding | Source | Action |
|---|---|---|
| Multi-trajectory per query ~= single-trajectory scaling | Klear-AgentForge | Simplify data collection by sampling multiple trajectories per prompt |
| Reasoning SFT on reasoning models hurts agentic performance | Klear-AgentForge | **Do NOT** start from a heavy reasoning-distilled base for agentic tasks |
| Tool-call format correctness training raises the performance ceiling | Qwen3-Coder-Next | Add an explicit format-validation loss term |
| 60/30/10 SWE/ToolUse/CodeAct mix is empirically optimal | This guide | Start here, then ablate on your target benchmark |

### 6.2 Format-Aware Regularization

DR-Venus (arXiv:2604.19859) introduced **format-aware regularization**: penalize the model when it deviates from the expected tool-call schema even if the underlying action is correct. This prevents "reward hacking" where models learn to guess correctly but format incorrectly.

```python
def format_reward(completion: str, expected_schema: str) -> float:
    # Use a lightweight parser or regex to validate JSON/XML structure
    # Return 1.0 if valid, 0.0 if malformed, -0.5 if completely broken
    ...
```

### 6.3 Self-Correction & Trajectory Purification

CLEANER (arXiv:2601.15141) showed that **self-purifying trajectories** during data collection improves RL sample efficiency. During SFT data generation:

1. Generate a trajectory with the model
2. If it fails, prompt the model to self-correct
3. Keep the corrected trajectory; discard the failed one
4. This is especially effective for 7-8B models with limited exploration capacity

### 6.4 Pairwise Judging for SFT Quality

Qwen3-Coder-Next uses a **pairwise judging model** to rank candidate responses:

1. For each prompt, sample n=4 responses from a strong teacher model
2. Form all C(n,2) pairs
3. The judge model scores each pair on factual accuracy, task usefulness, and style
4. SFT on the top-ranked responses only

You can approximate this with a strong off-the-shelf judge like `Qwen3-72B` or `GPT-4o` run in batches.

### 6.5 Multiple Tool Chat Templates (Reiterated)

We cannot stress this enough. If you train on only one JSON schema and deploy in Pi agent (which may use XML or Python-style tools), your model will fail. During training, **randomly reformat every trajectory** with one of 4-5 templates, as in the sketch below. The model learns format-invariant behavior.
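A toy illustration of that reformatting step. The template strings are illustrative only, not the exact schemas used by any particular agent; XML escaping and nested arguments are ignored for brevity.

```python
import json
import random


def render_tool_call(name: str, args: dict, template: str) -> str:
    """Render one tool call in the requested surface format."""
    if template == "openai_json":
        return json.dumps({"type": "function",
                           "function": {"name": name, "arguments": json.dumps(args)}})
    if template == "xml":
        inner = "".join(f"<{k}>{v}</{k}>" for k, v in args.items())
        return f"<{name}>{inner}</{name}>"
    if template == "python":
        rendered = ", ".join(f"{k}={v!r}" for k, v in args.items())
        return f"{name}({rendered})"
    if template == "typescript":
        return json.dumps({"tool": name, "args": args})
    raise ValueError(f"unknown template: {template}")


# During SFT preprocessing: pick one template per trajectory, then rewrite
# every assistant tool call in that trajectory with it.
template = random.choice(["openai_json", "xml", "python", "typescript"])
print(render_tool_call("bash", {"command": "cd /workspace && ls"}, template))
```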
---

## 7. Evaluation Benchmarks

Validate at each checkpoint (SFT end, RL milestones) on this suite:

| Benchmark | Domain | Metric | Target (8B) | Reference |
|---|---|---|---|---|
| **SWE-bench Verified** | Real GitHub issue fixing | % resolved | 20-40% | Klear-AgentForge: 39.4% |
| **SWE-bench Lite** | Easier SWE subset | % resolved | 30-50% | SWE-agent-LM-7B: 22.8% |
| **Terminal-Bench 2.0** | Terminal/agent tasks | Accuracy | 15-25% | Nemotron-T-8B: ~baseline; T-14B: 20.2% |
| **BFCL v3** | Function calling | Overall score | 65-75% | Klear-AgentForge: 71.5% |
| **Aider-Polyglot** | Multi-language editing | % correct | 25-40% | Klear-AgentForge: 33.8% |
| **tau-bench** (Retail + Airline) | Multi-turn tool use | Avg@4 | 40-55% | Klear-AgentForge: 56.7% (Retail) |
| **HumanEval** | Basic code generation | pass@1 | 80%+ | Baseline sanity check |
| **LiveCodeBench** | Competitive coding | pass@1 | 30-40% | General reasoning validation |

**Evaluation protocol:**

- Use the `mini-swe-agent-plus` scaffold (bash + string-replacement tool) for SWE-bench
- Use the `Terminus 2` JSON scaffold for Terminal-Bench
- Temperature = 0.7, top_p = 0.95, max_length = 16K-64K
- Run each benchmark 3-4 times and average (agentic tasks are high-variance)

---

## 8. Deployment in Pi Agent & Open-Source Tools

### 8.1 Pi Agent Integration

Pi and similar coding agents typically expect:

1. An OpenAI-compatible API endpoint (`/v1/chat/completions`)
2. Support for the `tools` / `functions` parameter
3. Streaming responses with `delta` chunks

**Setup:**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local sanity check of the trained checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "your-username/agentic-coder-grpo-v1",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-username/agentic-coder-grpo-v1")

# For serving, wrap in a vLLM or TGI server for API compatibility:
# vllm serve your-username/agentic-coder-grpo-v1 --dtype bfloat16 --max-model-len 32768
```

### 8.2 System Prompt for Agent Mode

```
You are an expert software engineering agent. You have access to the following tools:
- bash: Execute shell commands in a sandboxed environment
- view: View file contents
- edit: Apply string replacements to files
- submit: Submit your final solution

You must reason step-by-step, then select the appropriate tool.
Always wait for tool results before proceeding.
```

### 8.3 Handling Different Tool Formats

Since you trained on multiple templates, the model should generalize. However, at inference time:

- **Detect** the tool format from the system prompt (JSON vs. XML vs. Python)
- **Wrap** the system prompt with explicit format instructions
- **Parse** model outputs with the corresponding parser

```python
def detect_format(system_prompt: str) -> str:
    """Heuristically infer the tool-call convention a client expects."""
    if "<tool_call>" in system_prompt:
        return "xml"
    elif "functions" in system_prompt or '"type": "function"' in system_prompt:
        return "openai_json"
    elif "tool_name(" in system_prompt:
        return "python"
    return "openai_json"  # default
```

A full client-side flow is sketched below.
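To close the loop, here is a hedged sketch of how a Pi-style client would call the served model through the OpenAI-compatible endpoint started above. The port, served model name, and `bash` tool schema are assumptions for illustration, and depending on the server you may need to enable its tool-call parsing options when launching.

```python
from openai import OpenAI

# Points at the local vLLM server started with `vllm serve ...` above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical bash tool schema, matching the agent-mode system prompt.
tools = [{
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Execute a shell command in the sandbox",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

response = client.chat.completions.create(
    model="your-username/agentic-coder-grpo-v1",
    messages=[
        {"role": "system", "content": "You are an expert software engineering agent."},
        {"role": "user", "content": "Run the test suite and report failures."},
    ],
    tools=tools,
    temperature=0.7,
)

# If the model decided to call a tool, the call arrives in tool_calls.
print(response.choices[0].message.tool_calls)
```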
---

## 9. Full Training Recipe Summary

```
BASE MODEL: nvidia/Nemotron-Terminal-8B

STAGE 1 - SFT (3 epochs, ~2.4B tokens total)
├── 60% SWE-bench/SWE-smith-trajectories (tool split, resolved=True only)
├── 30% nvidia/Nemotron-Agentic-v1 (interactive_agent + tool_calling)
├── 10% xingyaoww/code-act + smirki/Agentic-Coding-Tessa
├── CRITICAL: Apply 4-5 random tool chat templates per sample
├── Context: 16384-32768 tokens
├── LR: 2e-5, batch: 2x8 (per_device x accum)
└── Save: agentic-coder-sft-v1

STAGE 2 - RL (1-2 epochs)
├── Dataset: nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1
├── Algorithm: GRPO (group_size=8, temperature=0.7)
├── Reward: pass_rate from sandboxed test execution
├── Environment: Docker sandbox per task (120s timeout)
├── Infrastructure: vLLM for async generation + training loop
├── TITO: Use raw token IDs from vLLM, never re-tokenize
├── LR: 1e-6, batch: 1x16
└── Save: agentic-coder-grpo-v1

EVALUATION
├── SWE-bench Verified (primary)
├── Terminal-Bench 2.0
├── BFCL v3
├── Aider-Polyglot
└── tau-bench

DEPLOYMENT
├── vLLM server with OpenAI-compatible API
├── System prompt with explicit tool format
└── Docker sandbox for live tool execution
```

---

## 10. Conclusion

Building a state-of-the-art agentic coding assistant at the 8B scale is now feasible with open-source components. The keys are:

1. **Start from the right base:** Nemotron-Terminal-8B is pre-trained for this.
2. **Curate high-quality trajectories:** SWE-smith + Nemotron-Agentic-v1 are the gold standard.
3. **Train on multiple tool formats:** This is the highest-ROI generalization trick.
4. **Use execution-verified RL:** GRPO with pass_rate rewards, not just a binary outcome reward.
5. **Build async infrastructure:** vLLM + decoupled generation cuts training time by 2-3x.
6. **Validate on real benchmarks:** SWE-bench, Terminal-Bench, BFCL---not just HumanEval.

This recipe produces a model deployable in Pi agent, Cline, OpenCode, or any OpenAI-compatible coding tool, capable of autonomous repository-level bug fixing, multi-turn terminal interaction, and robust function calling across diverse API formats.

---

## References

1. NVIDIA. *Nemotron-Terminal: Scalable Training for Terminal-Capable Language Models.* arXiv:2602.21193, 2026.
2. Klear-AI. *Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling.* arXiv:2511.05951, 2025.
3. Zhipu AI. *GLM-5: From Vibe Coding to Agentic Engineering.* arXiv:2602.15763, 2026.
4. Alibaba Qwen. *Qwen3-Coder-Next Technical Report.* arXiv:2603.00729, 2026.
5. SWE-bench Team. *SWE-Smith: A Scalable Dataset for Software Engineering Agents.* arXiv:2504.21798, 2025.
6. Yang et al. *ACECODER: Acing Coder RL via Automated Test-Case Synthesis.* arXiv:2502.01718, 2025.
7. Yang et al. *CodeScaler: Scaling Code LLM Training via Execution-Free Reward Models.* arXiv:2602.17684, 2026.
8. Wang et al. *CLEANER: Self-Purified Trajectories Boost Agentic RL.* arXiv:2601.15141, 2026.
9. inclusionAI. *DR-Venus: Deep Research Agents with 10K Open Data.* arXiv:2604.19859, 2026.
10. Wang et al. *Executable Code Actions Elicit Better LLM Agents (CodeAct).* arXiv:2402.01030, 2024.

---

## Dataset & Model Links

- Base Model: https://hf.co/nvidia/Nemotron-Terminal-8B
- SFT Data: https://hf.co/datasets/SWE-bench/SWE-smith-trajectories
- SFT Data: https://hf.co/datasets/nvidia/Nemotron-Agentic-v1
- SFT Data: https://hf.co/datasets/xingyaoww/code-act
- RL Data: https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-SWE-Pivot-v1
- RL Data: https://hf.co/datasets/nvidia/Nemotron-RL-Agentic-Function-Calling-Pivot-v1