Qwen3-4B-CodeAgent

A fine-tuned code-execution and code-review agent based on Qwen3-4B-Instruct, trained to follow a structured ReAct workflow (Plan → Execute → Reflect → Finish) with XML-formatted responses.

Model Description

This model is a LoRA fine-tuned version of Qwen3-4B-Instruct designed to function as an autonomous coding agent. It generates structured XML responses that can be parsed by an orchestration framework to execute code, review results, and iteratively debug.

| Attribute | Value |
| --- | --- |
| Base Model | Qwen3-4B-Instruct (3.6B params) |
| Architecture | Qwen3ForCausalLM, 36 layers, 2560 hidden size, GQA (32 heads / 8 KV heads) |
| Fine-tuning Method | LoRA (4-bit quantization, r=32, alpha=32) |
| Framework | Unsloth + TRL SFTTrainer |
| Training Data | m-a-p/Code-Feedback (~47K train samples) |
| Context Length | 4096 tokens |
| Precision | bfloat16 (merged weights) |

Intended Use

This model is designed for building code agent systems that need structured, parseable output. It is suitable for:

  • Automated code generation with execution feedback loops
  • Code review and iterative debugging pipelines
  • Tool-augmented LLM applications with sandbox execution
  • Educational coding assistants

Output Format

The model outputs XML-structured responses following a ReAct workflow:

```xml
<agent_response>
  <node>Plan</node>
  <next_node>Execute</next_node>
  <content>
    ## Analysis
    The task requires implementing a binary search algorithm.

    ## Plan
    1. Define the function signature
    2. Implement iterative binary search
    3. Handle edge cases (empty array, target not found)
  </content>
</agent_response>
```
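An orchestration framework needs to extract the three fields from each reply. A minimal standard-library sketch (the helper name `parse_agent_response` is illustrative, not part of this release; regex is used rather than an XML parser because `<content>` may contain unescaped code characters such as `<` or `&`):

```python
import re

def parse_agent_response(raw: str) -> dict:
    """Extract node, next_node, and content from an <agent_response> block."""
    fields = {}
    for tag in ("node", "next_node", "content"):
        # Non-greedy match up to the first closing tag; DOTALL lets content span lines.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

reply = """<agent_response>
  <node>Plan</node>
  <next_node>Execute</next_node>
  <content>
    ## Plan
    1. Define the function signature
  </content>
</agent_response>"""
print(parse_agent_response(reply)["node"])  # Plan
```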

Node Types

| Node | Trigger | Content | Next Node |
| --- | --- | --- | --- |
| Plan | User sends a task | Markdown-formatted solution plan | Execute |
| Execute | After Plan or Reflect | `{"tool_name": "python_sandbox", "arguments": {"code": "..."}}` | Execute |
| Reflect | Execute fails (`exit_code=1`) | Root-cause analysis and fix direction | Execute |
| Finish | Execute succeeds (`exit_code=0`) | Task summary | Finish |

Standard Workflow

Plan → Execute → (failure → Reflect → Execute → ...) → Finish
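The workflow above can be sketched as a small driver loop that feeds sandbox results back to the model until it reaches Finish. This is an assumed host-side implementation, not part of the released model: `run_model` and `run_sandbox` are hypothetical hooks (the model call and the tool executor you supply).

```python
import json

def run_agent(task, run_model, run_sandbox, max_steps=10):
    """Drive the ReAct loop until the model emits a Finish node.

    run_model(messages) -> dict with "node", "next_node", "content"  (hypothetical hook)
    run_sandbox(code)   -> (exit_code, output)                       (hypothetical hook)
    """
    messages = [{"role": "user", "content": f"Task: {task}\n\nCurrent state: Start"}]
    for _ in range(max_steps):
        reply = run_model(messages)
        messages.append({"role": "assistant", "content": str(reply)})
        if reply["node"] == "Finish":
            return reply["content"]
        if reply["node"] == "Execute":
            # The Execute node's content is a JSON tool call per the node table.
            call = json.loads(reply["content"])
            exit_code, output = run_sandbox(call["arguments"]["code"])
            # Feed the execution result back; the model then picks Reflect or Finish.
            messages.append({"role": "user", "content": f"exit_code={exit_code}\n{output}"})
    raise RuntimeError("max_steps exceeded without reaching Finish")
```

With stub hooks, a Plan → Execute → Finish trajectory terminates and returns the Finish summary.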

Usage

With Transformers

```python
import torch  # needed for torch.no_grad() below
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Nanami14138/qwen3-4b-instruct-code-agent"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

Prompting Strategy (System Prompt)

This model is designed as a ReAct-based code agent. To make it follow the state machine (Plan → Execute → Reflect → Finish) strictly and emit the structured XML format, **it is strongly recommended to use the following system prompt at inference time**:

```python
system_prompt = """You are a professional code-execution and code-review agent that follows a ReAct workflow.

## Output Format
Every reply must strictly use the following XML format:
<agent_response>
  <node>current node</node>
  <next_node>next node</next_node>
  <content>output content</content>
</agent_response>

## Node Definitions
### Plan
- Trigger: entered immediately after receiving the user's task
- <content>: analyze the task requirements and output a solution plan in Markdown
- <next_node>: Execute

### Execute
- Trigger: entered after Plan or Reflect
- <content>: output {"tool_name": "python_sandbox", "arguments": {"code": "your code"}}
- <next_node>: Execute (await the execution result)

### Reflect
- Trigger: entered after Execute fails (exit_code=1)
- <content>: analyze the failure, locate the root cause, and give a fix direction
- <next_node>: Execute (re-run after fixing)

### Finish
- Trigger: entered after Execute succeeds (exit_code=0)
- <content>: output a task summary
- <next_node>: Finish

## Standard Workflow
Plan → Execute → (failure → Reflect → Execute → ...) → Finish"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Task: Write a Python function to check if a number is prime.\n\nCurrent state: Start"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.1, top_p=0.95)

response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

With Unsloth (Faster Inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Nanami14138/qwen3-4b-instruct-code-agent",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
# Then use the same message format as above
```
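The model only emits the `python_sandbox` tool call; executing it is left to the host framework. A minimal subprocess-based sketch of such a tool, matching the `(exit_code, output)` contract implied by the node table (the function name and return shape are assumptions, and a bare subprocess is **not** a security sandbox — use a container or jail for untrusted code):

```python
import subprocess
import sys

def python_sandbox(code: str, timeout: int = 10):
    """Run code in a fresh Python interpreter; return (exit_code, combined output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        # Map a hung script onto the failure path so the agent enters Reflect.
        return 1, "TimeoutExpired"

exit_code, output = python_sandbox("print(2 + 2)")  # exit_code == 0, output == "4\n"
```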

Training Details

Data

Trained on m-a-p/Code-Feedback, a multi-turn code conversation dataset with ~66K examples. The data was processed into three pools:

| Pool | Description | Train Samples | Ratio |
| --- | --- | --- | --- |
| Pool A (Base SFT) | Single-turn code Q&A, plain text | 117 | 0.2% |
| Pool B (Code Review) | Multi-turn debug/review → ReAct XML format | 29,562 | 62.3% |
| Pool C (Discussion) | Multi-turn code discussion → ReAct XML format | 17,737 | 37.4% |

The system prompt is injected at training time (not stored in the data) to ensure consistent behavior.
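Injection at training time can be sketched as a `map` over the dataset that prepends a fixed system turn to each conversation. This is an illustrative sketch, not the release's actual preprocessing script; the `"messages"` field name is assumed from common chat-dataset conventions:

```python
SYSTEM_PROMPT = "You are a professional code-execution and code-review agent..."  # abridged

def inject_system_prompt(example: dict) -> dict:
    # Prepend the fixed system turn; because the prompt is never stored in the
    # data, every pool trains against the exact same instruction text.
    return {"messages": [{"role": "system", "content": SYSTEM_PROMPT}] + example["messages"]}

# e.g. with Hugging Face datasets: dataset = dataset.map(inject_system_prompt)
example = {"messages": [{"role": "user", "content": "Fix this bug..."}]}
injected = inject_system_prompt(example)
```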

Hyperparameters

| Parameter | Value |
| --- | --- |
| LoRA rank (r) | 32 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.1 |
| Batch size | 4 × 4 (gradient accumulation) = 16 effective |
| Max sequence length | 4096 |
| Precision | 4-bit (training), bfloat16 (merged) |
| Optimizer | AdamW 8-bit |
| Epochs | 3 planned (stopped early at step 620/8892, ~7% of total steps) |
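For readers reproducing the adapter setup, the table's LoRA settings can be expressed as a PEFT config. Note this is a sketch of equivalent settings (training actually used Unsloth's wrapper, and `lora_dropout`/`bias` values are assumed defaults not stated above):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,   # assumed; not stated in the table
    bias="none",        # assumed; not stated in the table
    task_type="CAUSAL_LM",
)
```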

Training Curve

| Step | Train Loss | Eval Loss |
| --- | --- | --- |
| 20 | 1.927 | 1.905 |
| 100 | 0.649 | 0.573 |
| 200 | 0.463 | 0.454 |
| 300 | 0.412 | 0.422 |
| 400 | 0.413 | 0.409 |
| 500 | 0.374 | 0.401 |
| 600 | 0.383 | 0.397 |

Train and eval loss both fell from ~1.9 to ~0.4, with eval loss tracking train loss closely and no sign of overfitting. The checkpoint at step 620 was merged for this release.

Hardware

  • NVIDIA L20 (48GB); LoRA training ran on a single GPU of an 8× L20 node

Evaluation

HumanEval (10-problem subset)

| Metric | Score |
| --- | --- |
| Pass@1 | 62.6% |
| Pass@2 | 71.14% |
| Pass@3 | 75.61% |
| Avg tokens/problem | 215.2 |

Evaluation was conducted on a 10-problem subset of HumanEval. Full 164-problem evaluation is planned.
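The Pass@k metrics above are conventionally computed with the unbiased estimator from the HumanEval paper; assuming that convention was used here, a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct, passes."""
    if n - c < k:
        # Fewer incorrect samples than k: some correct sample is always drawn.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 generations per problem, 2 of which pass the tests:
print(round(pass_at_k(3, 2, 1), 3))  # 0.667
```

Per-problem estimates are then averaged across the evaluation set.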

Limitations

  • Early checkpoint: This model was merged at step 620 of 8892 total steps (~7% of the planned schedule). Performance will likely improve with continued training.
  • English-centric data: The training data (Code-Feedback) is predominantly in English. Chinese language coding tasks may have lower quality.
  • XML format dependency: The model is trained to output structured XML. Without the system prompt, it may not follow the expected format.
  • No real execution: The training data simulates tool responses; the model has not been trained with actual code execution feedback.
  • Limited code languages: While the training data covers multiple languages, Python is heavily overrepresented.
  • Hallucination risk: Like all LLMs, the model may generate plausible but incorrect code, especially for complex algorithms or domain-specific tasks.

Ethical Considerations

  • The model should not be used to generate malicious code or exploit vulnerabilities.
  • Generated code should always be reviewed by a human before deployment in production systems.
  • The model may reproduce biases present in the training data (e.g., coding style preferences, library choices).

Citation

If you use this model, please cite the base model and training dataset:

```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}

@misc{code-feedback,
  title={Code-Feedback: Multi-turn Code Conversation Dataset},
  author={m-a-p},
  url={https://huggingface.co/datasets/m-a-p/Code-Feedback}
}
```