# Qwen3-4B-CodeAgent

A fine-tuned code-execution and code-review agent based on Qwen3-4B-Instruct, trained to follow a structured ReAct (Plan → Execute → Reflect → Finish) workflow with XML-formatted responses.
## Model Description
This model is a LoRA fine-tuned version of Qwen3-4B-Instruct designed to function as an autonomous coding agent. It generates structured XML responses that can be parsed by an orchestration framework to execute code, review results, and iteratively debug.
| Attribute | Value |
|---|---|
| Base Model | Qwen3-4B-Instruct (4B params, ~3.6B non-embedding) |
| Architecture | Qwen3ForCausalLM, 36 layers, 2560 hidden size, GQA (32 heads / 8 KV heads) |
| Fine-tuning Method | LoRA (4-bit quantization + LoRA r=32, alpha=32) |
| Framework | Unsloth + TRL SFTTrainer |
| Training Data | m-a-p/Code-Feedback (~47K train samples) |
| Context Length | 4096 tokens |
| Precision | bfloat16 (merged weights) |
## Intended Use
This model is designed for building code agent systems that need structured, parseable output. It is suitable for:
- Automated code generation with execution feedback loops
- Code review and iterative debugging pipelines
- Tool-augmented LLM applications with sandbox execution
- Educational coding assistants
## Output Format

The model outputs XML-structured responses following a ReAct workflow:

```xml
<agent_response>
<node>Plan</node>
<next_node>Execute</next_node>
<content>
## Analysis
The task requires implementing a binary search algorithm.
## Plan
1. Define the function signature
2. Implement iterative binary search
3. Handle edge cases (empty array, target not found)
</content>
</agent_response>
```
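An orchestration layer needs to pull these three fields out of the model's reply. A minimal sketch (the `parse_agent_response` helper is ours, not part of any shipped API); a regex is used rather than a strict XML parser because `<content>` typically holds Markdown or JSON that is not valid XML:

```python
import re

def parse_agent_response(text: str) -> dict:
    """Extract node, next_node, and content from an <agent_response> reply."""
    fields = {}
    for tag in ("node", "next_node", "content"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = m.group(1).strip() if m else None
    return fields

sample = """<agent_response>
<node>Plan</node>
<next_node>Execute</next_node>
<content>
## Plan
1. Implement iterative binary search
</content>
</agent_response>"""

parsed = parse_agent_response(sample)
print(parsed["node"], "->", parsed["next_node"])  # → Plan -> Execute
```

Missing tags come back as `None`, which the caller can treat as a malformed turn and retry.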
## Node Types

| Node | Trigger | Content | Next Node |
|---|---|---|---|
| Plan | User sends a task | Markdown-formatted solution plan | Execute |
| Execute | After Plan or Reflect | `{"tool_name": "python_sandbox", "arguments": {"code": "..."}}` | Execute |
| Reflect | Execute fails (exit_code=1) | Root cause analysis and fix direction | Execute |
| Finish | Execute succeeds (exit_code=0) | Task summary | Finish |
## Standard Workflow

Plan → Execute → (failure → Reflect → Execute → ...) → Finish
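This state machine can be driven by a simple loop. The sketch below is ours, under stated assumptions: `generate_step` (one model call returning the parsed XML reply as a dict) and `run_sandbox` (code execution returning an observation with an exit code) are hypothetical stand-ins for your own infrastructure:

```python
import json

def run_agent(task, generate_step, run_sandbox, max_turns=8):
    """Drive the Plan -> Execute -> (Reflect -> Execute)* -> Finish loop."""
    state, observation = "Start", None
    for _ in range(max_turns):
        # reply: {"node": ..., "next_node": ..., "content": ...}
        reply = generate_step(task, state, observation)
        state = reply["next_node"]
        if reply["node"] == "Finish":
            return reply["content"]  # task summary
        if reply["node"] == "Execute":
            # content is the JSON tool call from the Node Types table
            call = json.loads(reply["content"])
            observation = run_sandbox(call["arguments"]["code"])
            # observation's exit_code steers the next turn:
            # 0 -> the model should emit Finish, 1 -> Reflect
    return None  # turn budget exhausted without Finish
```

The `max_turns` cap guards against the model looping between Reflect and Execute indefinitely.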
## Usage

### With Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Nanami14138/qwen3-4b-instruct-code-agent"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
### 🛠️ Prompting Strategy

This model is designed as a ReAct-based code agent. To make it follow the state machine (Plan → Execute → Reflect → Finish) strictly and emit structured XML, **it is strongly recommended to use the following system prompt at inference time** (kept in Chinese, as injected during training):
```python
system_prompt = """你是一个专业的代码执行与Code Review智能Agent,遵循ReAct工作流。
## 输出格式
你的每一次回复都必须严格使用以下XML格式:
<agent_response>
<node>当前节点</node>
<next_node>下一个节点</next_node>
<content>输出内容</content>
</agent_response>
## 节点定义
### Plan(规划)
- 触发:收到用户任务后立即进入
- <content>:分析任务需求,以 Markdown 格式输出解决方案规划
- <next_node>:Execute
### Execute(执行)
- 触发:Plan 或 Reflect 之后进入
- <content>:输出 {"tool_name": "python_sandbox", "arguments": {"code": "你的代码"}}
- <next_node>:Execute(等待执行结果)
### Reflect(反思)
- 触发:Execute 执行失败(exit_code=1)后进入
- <content>:分析失败原因,定位根因,给出修正方向
- <next_node>:Execute(修正后重新执行)
### Finish(完成)
- 触发:Execute 执行成功(exit_code=0)后进入
- <content>:输出任务总结
- <next_node>:Finish
## 标准工作流
Plan → Execute → (失败 → Reflect → Execute → ...) → Finish"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "任务:Write a Python function to check if a number is prime.\n\n当前状态:Start"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1, top_p=0.95)
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
### With Unsloth (Faster Inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Nanami14138/qwen3-4b-instruct-code-agent",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
# Then use the same message format as above
```
## Training Details

### Data

Trained on m-a-p/Code-Feedback, a multi-turn code conversation dataset (~66K raw examples), processed into ~47K training samples split across three pools:
| Pool | Description | Train Samples | Ratio |
|---|---|---|---|
| Pool A (Base SFT) | Single-turn code Q&A, plain text | 117 | 0.2% |
| Pool B (Code Review) | Multi-turn debug/review → ReAct XML format | 29,562 | 62.3% |
| Pool C (Discussion) | Multi-turn code discussion → ReAct XML format | 17,737 | 37.4% |
The system prompt is injected at training time (not stored in the data) to ensure consistent behavior.
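That injection step can be sketched as follows. `SYSTEM_PROMPT` abbreviates the ReAct prompt shown in the Usage section, and `to_training_messages` is a hypothetical helper, not code from the actual training pipeline:

```python
# Abbreviated stand-in for the full ReAct system prompt used at training time.
SYSTEM_PROMPT = "You are a code execution and code review agent following the ReAct workflow..."

def to_training_messages(task, agent_turns):
    """Serialize one sample: the fixed system prompt is prepended here,
    at formatting time, instead of being stored in every data row."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    for turn in agent_turns:  # each turn: {"node": ..., "next_node": ..., "content": ...}
        xml = (
            "<agent_response>\n"
            f"<node>{turn['node']}</node>\n"
            f"<next_node>{turn['next_node']}</next_node>\n"
            f"<content>{turn['content']}</content>\n"
            "</agent_response>"
        )
        messages.append({"role": "assistant", "content": xml})
    return messages
```

Keeping the prompt out of the stored data means it can be revised once and re-applied to all ~47K samples consistently.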
### Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank (r) | 32 |
| LoRA alpha | 32 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup ratio | 0.1 |
| Batch size | 4 × 4 (gradient accumulation) = 16 effective |
| Max sequence length | 4096 |
| Precision | LoRA 4-bit (training), bfloat16 (merged) |
| Optimizer | AdamW 8-bit |
| Epochs | 3 planned (stopped early at step 620 of 8,892, ~7% of the schedule) |
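The LoRA rows of this table map onto a standard PEFT adapter configuration. A sketch, assuming the Hugging Face `peft` library (the actual run used Unsloth's wrapper, and dropout is not stated in the card):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                       # LoRA rank
    lora_alpha=32,              # scaling alpha (alpha/r = 1.0)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,           # assumed default; not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
)
```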
### Training Curve
| Step | Train Loss | Eval Loss |
|---|---|---|
| 20 | 1.927 | 1.905 |
| 100 | 0.649 | 0.573 |
| 200 | 0.463 | 0.454 |
| 300 | 0.412 | 0.422 |
| 400 | 0.413 | 0.409 |
| 500 | 0.374 | 0.401 |
| 600 | 0.383 | 0.397 |
Loss decreased from 1.90 to 0.40 with no signs of overfitting. The checkpoint at step 620 was merged for this release.
### Hardware

- 8× NVIDIA L20 node (48 GB each); training ran on a single GPU, made feasible by 4-bit quantization with LoRA
## Evaluation

### HumanEval (10-problem subset)
| Metric | Score |
|---|---|
| Pass@1 | 62.6% |
| Pass@2 | 71.14% |
| Pass@3 | 75.61% |
| Avg tokens/problem | 215.2 |
Evaluation was conducted on a 10-problem subset of HumanEval. Full 164-problem evaluation is planned.
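The card does not state how Pass@k was computed; for reference, the standard unbiased estimator from the HumanEval paper, given `n` generated samples per problem of which `c` pass, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 samples per problem, 1 passing:
print(round(pass_at_k(3, 1, 1), 4))  # → 0.3333
```

Per-problem estimates are averaged over the benchmark to give the reported score.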
## Limitations

- Early checkpoint: This model was merged at step 620 of 8,892 total steps (~7% of the planned schedule). Performance will likely improve with continued training.
- English-centric data: The training data (Code-Feedback) is predominantly English; output quality on Chinese-language coding tasks may be lower.
- XML format dependency: The model is trained to output structured XML. Without the system prompt, it may not follow the expected format.
- No real execution: The training data simulates tool responses; the model has not been trained with actual code execution feedback.
- Limited code languages: While the training data covers multiple languages, Python is heavily overrepresented.
- Hallucination risk: Like all LLMs, the model may generate plausible but incorrect code, especially for complex algorithms or domain-specific tasks.
## Ethical Considerations
- The model should not be used to generate malicious code or exploit vulnerabilities.
- Generated code should always be reviewed by a human before deployment in production systems.
- The model may reproduce biases present in the training data (e.g., coding style preferences, library choices).
## Citation

If you use this model, please cite the base model and training dataset:

```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}

@misc{code-feedback,
  title={Code-Feedback: Multi-turn Code Conversation Dataset},
  author={m-a-p},
  url={https://huggingface.co/datasets/m-a-p/Code-Feedback}
}
```
### With llama-cpp-python (GGUF)

```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Nanami14138/qwen3-4b-instruct-code-agent",
    filename="qwen3-4b-instruct-code-agent-q4_k_m.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```