aiXapply-4B-SFT
Overview | Resources | Quick Start | Continue Integration | Dataset | Training | Evaluation | Results | Citation
aiXapply-4B-SFT is the supervised fine-tuned aiXapply model for Full-File Apply. Given an original file and a localized update snippet, it generates the complete updated file while preserving everything outside the requested edit.
Use this SFT model as the default choice for high full-file Apply accuracy and long-context fidelity. It reaches 94.4% average equivalence accuracy on the 1,637-sample main benchmark and shows stronger long-context structural preservation in the reported generalization experiments. For the RL-aligned variant used in the latency/accuracy frontier and cross-format experiments, also see aiXcoder/aiXapply-4B-RL.
This model is part of the official artifact release for paper:
AiXapply: Fast and Reliable Full-File Code Integration with Specialized Small Models for IDE Workflows
Overview
Modern coding assistants often produce a local edit snippet first. The hard downstream step is applying that snippet to the original file without changing unrelated code. Unified diffs are compact but brittle, and search-and-replace is easy to generate but depends on exact string matching. aiXapply treats this downstream step as a standalone code-integration task.
In an IDE workflow, an upstream coding assistant proposes an update snippet, aiXapply expands it into a complete updated file, and the IDE presents the resulting diff for review. See the code repository for figures, scripts, and full experiment details.
The repository includes:
| Component | Path |
|---|---|
| OpenAI-compatible inference scripts | experiments/aiXapply/ |
| Experiment entrypoints for full-file Apply, unified diff, and search-and-replace | experiments/ |
| Shared evaluation and six-class error taxonomy | experiments/evaluation/ |
| Multi-language data construction pipeline | data_generation/ |
| SFT and RL training scripts | training/sft/, training/rl/ |
| Continue IDE integration adapter | continue_config/ |
Highlights
- High accuracy: aiXapply-SFT reaches 94.4% average equivalence accuracy on the 1,637-sample main benchmark, close to Qwen3.5-397B-A17B (94.8%) and above DeepSeek-V3.2 (91.6%).
- Fast full-file generation: with n-gram speculative decoding, aiXapply reaches 1.06s average latency and 2692 tokens/s on a single A100 40GB GPU.
- Deployment-ready apply backend: the model can be served behind an OpenAI-compatible endpoint and used as a dedicated `apply` model in Continue.
- Reproducible pipeline: data generation, training, inference, scoring, and error classification scripts are included.
Resources
This release is split into one GitHub repository and three Hugging Face artifacts:
| Artifact | Release target | Description |
|---|---|---|
| Code repository | GitHub | Open-source project repository containing inference scripts, data construction code, training recipes, evaluation tools, Continue integration, and documentation. |
| Test dataset | Hugging Face Dataset | Public evaluation set for Full-File Apply, covering 20 programming languages and file formats. Use this artifact to reproduce benchmark scores without rebuilding the training data pipeline. |
| RL model | Hugging Face Model | 4B Apply model post-trained with reinforcement learning / GRPO. It is optimized for task-level correctness, locality, and robustness under alternative edit representations. |
| SFT model | Hugging Face Model | 4B Apply model trained with supervised fine-tuning. It provides strong in-distribution accuracy and better long-context structural preservation in our experiments. |
Task Definition
Full-File Apply takes:
<language>{language}</language>
<source_file>{original full file}</source_file>
<update_snippet>{localized update snippet}</update_snippet>
and returns:
<update_file>{complete updated file}</update_file>
The task has three core requirements:
- Complete output: the model must return the full updated file, not a patch or partial fragment.
- No side effects: content outside the requested edit region should remain identical to the source file.
- Placeholder expansion: markers such as `// ... existing code ...` mean "copy the corresponding original content exactly"; placeholders must not appear in the final output.
If anchors in the update snippet are ambiguous or cannot be located safely, the model should fail conservatively rather than hallucinate an unrelated edit.
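As a concrete illustration, here is the toy Python instance from the Quick Start below, written in the task's input format:

```
<language>python</language>
<source_file>def add(a, b):
    return a + b
def main():
    print(add(1, 2))
</source_file>
<update_snippet># ... existing code ...
def main():
    print(add(7, 8))
</update_snippet>
```

The expected output expands the omission marker back into the unchanged `add` function:

```
<update_file>def add(a, b):
    return a + b
def main():
    print(add(7, 8))
</update_file>
```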
Quick Start
Install
git clone --depth 1 --recurse-submodules https://github.com/aixcoder-plugin/aiXapply-4B.git
cd aiXapply-4B
python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
For model serving, install a vLLM build compatible with your CUDA and PyTorch environment.
Serve a Model with vLLM
export WEIGHT_DIR=/path/to/aiXapply-4B-RL # or /path/to/aiXapply-4B-SFT
export SERVE_MODEL_NAME=aiXapply-4B-RL
CUDA_VISIBLE_DEVICES=0 vllm serve "$WEIGHT_DIR" \
--host 0.0.0.0 \
--port 12003 \
--served-model-name "$SERVE_MODEL_NAME" \
--tensor-parallel-size 1 \
--enable-chunked-prefill \
--kv-cache-dtype auto \
--max-num-batched-tokens 4096 \
--max-model-len 32768 \
--gpu-memory-utilization 0.95 \
--speculative-config '{"method":"ngram","num_speculative_tokens":128,"prompt_lookup_max":7}'
Use `--max-model-len 262144` only if your serving setup has enough memory for the full long-context configuration.
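Before sending Apply requests, you can sanity-check the server with the standard OpenAI-compatible model listing route that vLLM exposes:

```bash
# Should list the served model name, e.g. "aiXapply-4B-RL"
curl http://127.0.0.1:12003/v1/models
```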
Call the Endpoint
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:12003/v1", api_key="local")
system_prompt = """You are a deterministic Code Patching Engine. Your task is to synthesize an "Updated File" by applying a partial "Update Snippet" to the provided "Source File".
### Algorithm
1. **Context Matching**: Analyze the `Update Snippet` to identify the context anchors (the lines of code surrounding the changes). Locate the exact corresponding block in the `Source File`. The match must be unique.
2. **Code Merging**: Replace the matched block in the `Source File` with the logic from the `Update Snippet`.
3. **Expansion**: The `Update Snippet` contains omission markers (e.g., `// ... existing code ...`). You MUST replace these markers with the original, unchanged lines from the `Source File`.
4. **Output Generation**: Output the FULL content of the resulting file.
### Constraints
- **NO Laziness**: Never output comments like `// ... rest of code ...` in the final output. You must write out every single line of the final code.
- **Strict Fidelity**: Preserve the original indentation style (spaces/tabs) and comments of the Source File for all unchanged parts.
- **Safety**: If the context in the snippet is ambiguous or cannot be found, output nothing inside the tags.
### Output Format
<update_file>[Your final code here]</update_file>"""
user_prompt = """<language>{language}</language>
<source_file>{source_file}</source_file>
<update_snippet>{update_snippet}</update_snippet>
Please generate the full updated code strictly following the instructions."""
LANGUAGE = "python"
SOURCE_FILE = """def add(a, b):
    return a + b
def main():
    print(add(1, 2))
"""
UPDATE_SNIPPET = """# ... existing code ...
def main():
    print(add(7, 8))
"""
response = client.chat.completions.create(
    model="aiXapply-4B-RL",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt.format(language=LANGUAGE, source_file=SOURCE_FILE, update_snippet=UPDATE_SNIPPET)},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
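The raw completion wraps the result in <update_file> tags (the Continue proxy described below strips them automatically). When calling the endpoint directly, a minimal extraction step, sketched here rather than taken from the released scripts, might look like:

```python
import re

def extract_update_file(completion: str) -> str:
    """Return the file body inside <update_file>...</update_file> tags."""
    match = re.search(r"<update_file>(.*?)</update_file>", completion, re.DOTALL)
    if match is None or not match.group(1).strip():
        # Conservative failure mode: per the system prompt's Safety rule, the
        # model emits nothing inside the tags when anchors are ambiguous.
        raise ValueError("no applied file found in completion")
    return match.group(1)

updated_file = extract_update_file(response.choices[0].message.content)
print(updated_file)
```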
Continue Integration
continue_config/ contains an adapter for using aiXapply as Continue's dedicated Apply backend.
The recommended local workflow is:
Continue -> continue_apply_proxy.py -> OpenAI-compatible aiXapply endpoint
Start the proxy:
cd continue_config
export APPLY_PROXY_UPSTREAM_CHAT_URL="http://127.0.0.1:12003/v1/chat/completions"
export APPLY_PROXY_HOST="127.0.0.1"
export APPLY_PROXY_PORT="14124"
python3 continue_apply_proxy.py
Then merge the apply model block from continue_config/continue.config.yaml.example into your Continue config. The proxy strips <update_file>...</update_file> tags before returning the result to Continue and supports streaming responses.
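For orientation, a hypothetical sketch of what such a block can look like in Continue's YAML config; the authoritative version, including the exact field names, is the shipped example file:

```yaml
# Illustrative sketch only -- copy the real block from
# continue_config/continue.config.yaml.example.
models:
  - name: aiXapply
    provider: openai
    model: aiXapply-4B-RL
    apiBase: http://127.0.0.1:14124/v1   # the proxy address configured above
    roles:
      - apply
```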
See continue_config/README.md for configuration details and troubleshooting.
Dataset
The public test dataset is released separately on Hugging Face. It contains the benchmark examples used to evaluate aiXapply and comparable models. Each example follows the Apply format:
<source_file, update_snippet, update_file>
The broader training-data construction pipeline is included in this repository. It synthesizes Apply examples from real-world commits, including CommitPack-style records with (old_file, new_file, commit_message).
Figure 2: Dataset construction pipeline. Raw CommitPack records are sampled, consistency-verified, solvability-filtered, and split into train/test sets.
High-level pipeline:
- Sampling and filtering: keep localized same-file edits and balance languages/formats.
- Change description generation: make the intent of each commit explicit.
- Snippet synthesis: produce a localized `update_snippet` and full-file ground truth.
- Consistency verification: ensure every diff is explained by the snippet and no extra change is introduced.
- Solvability filtering: remove ambiguous or non-reproducible samples, then convert to training format.
Dataset scale:
| Split | Samples | Notes |
|---|---|---|
| Train | 19,347 | Multi-language Apply training examples |
| Test | 1,637 | Public Hugging Face test dataset |
The test set covers C, C++, Dockerfile, Go, HTML, INI, Java, JavaScript, JSON, Makefile, Markdown, Python, reStructuredText, Rust, Shell, SQL, Text, TypeScript, XML, and YAML.
See data_generation/README.md for scripts, configs, and reconstruction steps.
Training
aiXapply is trained from a Qwen3-4B backbone with two complementary strategies:
- SFT: direct supervised learning from `(source_file, update_snippet)` to `update_file`.
- RL / GRPO: task-level optimization with rewards based on equivalence, patch correctness, and side-effect penalties.
The released model artifacts are aiXapply-4B-SFT and aiXapply-4B-RL. Use the SFT model as the default choice for high full-file Apply accuracy and long-context fidelity; use the RL model when you want the RL-aligned variant used in the latency/accuracy frontier and cross-format experiments.
SFT
python -m pip install --extra-index-url https://download.pytorch.org/whl/cu128 -r training/sft/requirements.txt
cd training/sft
WANDB_PROJECT=aiXapply_sft \
WANDB_RUN_NAME=qwen3-4b-sft \
accelerate launch --config_file fsdp_config.yaml run_sft.py \
--train_dataset_path /path/to/train.parquet \
--test_dataset_path /path/to/test.parquet \
--model_name /path/to/Qwen3-4B \
--output_dir checkpoints/full_finetune
Update training/sft/fsdp_config.yaml for your machine, especially num_processes and context-parallel settings.
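For example, on a single 8-GPU node the relevant fields would typically look like the following (a hypothetical fragment; check it against the actual training/sft/fsdp_config.yaml):

```yaml
# Hypothetical fragment -- field names follow Accelerate's config format.
distributed_type: FSDP
num_machines: 1
num_processes: 8   # one process per GPU
```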
RL / GRPO
The RL setup uses veRL. A typical training environment can be started with:
docker pull verlai/verl:vllm011.latest
export WORKSPACE=/path/to/workspace
docker create -it --runtime=nvidia --gpus all --net=host --ipc=host \
--cap-add=SYS_ADMIN \
-v "$WORKSPACE:$WORKSPACE" \
--entrypoint /bin/bash \
--name aixapply_verl \
verlai/verl:vllm011.latest \
-c "sleep infinity"
docker start aixapply_verl
docker exec -it aixapply_verl bash
Inside the container:
git submodule update --init --recursive
cd training/rl/verl
pip install -e .
pip install -e .[sglang]
cd ../../..
cd training/rl
MODEL_PATH=/path/to/Qwen3-4B \
TRAIN_FILES=/path/to/train.parquet \
TEST_FILES=/path/to/test.parquet \
bash run_qwen3-4b_sgl_megatron_multi_grpo.sh
Training is resource-intensive; the paper experiments use multi-GPU A100-class hardware.
Evaluation
Run inference:
python experiments/aiXapply/infer_openai.py \
--provider local \
--data-path /path/to/test.parquet
The local provider in experiments/aiXapply/infer_openai.py expects an OpenAI-compatible endpoint at http://127.0.0.1:12003/v1. If you serve the model on a different port or with a different served model name, update the local provider config in that script before running evaluation.
Score predictions:
python experiments/evaluation/run_evaluation.py \
-i predictions/xxx.jsonl \
--classify_errors
Optional LLM-assisted error classification:
export OPENAI_BASE_URL="http://your_endpoint/v1"
export OPENAI_MODEL="your_judge_model"
python experiments/evaluation/run_evaluation.py \
-i predictions/xxx.jsonl \
--classify_errors \
--llm
The primary metric is equivalence accuracy:
- Code files are compared with Pygments token equivalence (see the sketch after this list).
- Structured formats such as JSON, YAML, XML, and INI are parsed or classified as invalid when parsing fails.
- Errors can be grouped into `OUTPUT_INVALID`, `PATCH_NOT_APPLIED`, `PATCH_INCOMPLETE`, `PATCH_INCORRECT`, `WRONG_POSITION`, and `OUT_OF_PATCH_SIDE_EFFECT`.
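A simplified sketch of what Pygments token equivalence means for code files; the released evaluator in experiments/evaluation/ is the reference implementation, and this standalone version only illustrates the idea:

```python
# Minimal sketch: treat two files as equivalent when their Pygments token
# streams match after dropping purely-whitespace tokens. Deliberately
# simplified (e.g. it is too lax for indentation-sensitive code).
from pygments.lexers import get_lexer_by_name

def token_equivalent(pred: str, gold: str, language: str = "python") -> bool:
    lexer = get_lexer_by_name(language)

    def tokens(code: str):
        return [(ttype, value) for ttype, value in lexer.get_tokens(code)
                if value.strip()]

    return tokens(pred) == tokens(gold)
```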
See experiments/README.md and experiments/evaluation/README.md for the full experiment layout.
Results
In the latency/accuracy frontier experiments, aiXapply-RL keeps full-file Apply accuracy while reducing latency to an interactive range; aiXapply-SFT provides the strongest reported main-benchmark accuracy and long-context results.
Main Benchmark
Average equivalence accuracy on the 1,637-example aiXapply test set:
| Model | Avg Accuracy |
|---|---|
| Qwen3-4B baseline | 0.626 |
| Fast-Apply-7B | 0.620 |
| DeepSeek-V3.2 | 0.916 |
| GLM-5 | 0.921 |
| aiXapply-RL | 0.938 |
| aiXapply-SFT | 0.944 |
| Qwen3.5-397B-A17B | 0.948 |
Editing Paradigms
Using the same DeepSeek-V3.2 model for each representation, full-file Apply improves one-shot accuracy over common edit formats:
| Representation | Accuracy | Avg Latency |
|---|---|---|
| Unified diff | 0.560 | 14.22s |
| Search-and-replace | 0.749 | 28.48s |
| Full-file Apply | 0.916 | 108.96s |
| aiXapply-RL full-file Apply | 0.938 | 1.44s |
Speculative Decoding
| Method | Avg Latency | P95 Latency | Throughput |
|---|---|---|---|
| No speculation | 28.83s | 90.23s | 102.04 tokens/s |
| Suffix default | 5.75s | 20.74s | 509.54 tokens/s |
| N-gram default | 2.17s | 6.94s | 1343.99 tokens/s |
| N-gram best (n=7, k=128) | 1.06s | 3.38s | 2692.01 tokens/s |
Generalization
| Setting | DeepSeek-V3.2 | aiXapply-RL | aiXapply-SFT |
|---|---|---|---|
| Long context | 0.588 | 0.647 | 0.843 |
| Untrained languages avg. | 0.932 | 0.938 | 0.941 |
| Random placeholders avg. | 0.932 | 0.948 | 0.951 |
| Chunk file avg. | 0.850 | 0.881 | 0.900 |
Industrial Deployment
In the aiXcoder IDE plugin, aiXapply is deployed as a dedicated Apply service after the upstream model generates an update snippet. In production traces, the Apply stage drops from 50s average latency to 1.89s, with P95 latency reduced from 89s to 3.78s. The setup also offloads full-file generation from the upstream large model, improving serving capacity and reducing cost.
Repository Notes
- The current release focuses on single-file Apply. Multi-file edits and interactive multi-step editing are future work.
- aiXapply optimizes deterministic integration, not semantic validation. You should still run tests and review generated diffs before accepting edits.
- Do not commit secrets, checkpoints, datasets, or generated prediction artifacts unless they are intentionally part of a release.
Contributing
Contributions are welcome. Please read CONTRIBUTING.md before opening issues or pull requests.
For useful bug reports, include the script or endpoint you ran, the command/configuration, the observed output or traceback, and enough model/provider context to reproduce the problem.
License
This model is licensed under the Apache License 2.0. See the code repository LICENSE for details.
Citation
If you find aiXapply useful, please cite:
@misc{jiang2026aixapply,
  title  = {AiXapply: Fast and Reliable Full-File Code Integration with Specialized Small Models for IDE Workflows},
  author = {Jiang, Siyuan and Cai, Xiang and Wang, Peixu and Han, Yu and Dong, Yihong and Ning, Wei and Guo, Xuyuan and Wen, Jincheng and Zhao, Wei and Li, Ge},
  year   = {2026},
  url    = {https://github.com/aixcoder-plugin/aiXapply-4B}
}