# FastEdit 1.7B
A fine-tuned Qwen2.5-Coder-1.5B-Instruct for merging code edit snippets into source files. Given an original code chunk (~35 lines) and a compact edit snippet with context markers, the model produces the merged result.
This model is designed to be used with the FastEdit toolkit, which handles AST scoping, deterministic edits, and post-processing. Using the model directly requires the exact prompt format described below.
## Model variants

All variants are in this repo under subfolders:

| Subfolder | Format | Size | Use case |
|---|---|---|---|
| `bf16/` | BF16 safetensors | 3.2 GB | Fine-tuning, reference, GPU serving via vLLM/TGI |
| `mlx-8bit/` | MLX 8-bit | 1.7 GB | Apple Silicon (recommended for local use) |
| `gguf/` | GGUF Q8_0 | 1.7 GB | llama.cpp, LM Studio, Ollama |
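If you only need one variant locally, you can filter the download by subfolder pattern. This is standard `huggingface_hub` usage; the choice of variant here is just an example:

```python
# Fetch only the MLX 8-bit subfolder (swap the pattern for "bf16/*" or "gguf/*").
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="continuous-lab/FastEdit",
    allow_patterns=["mlx-8bit/*"],
)
print(path)  # local snapshot directory containing mlx-8bit/
```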
## Prompt format

The model expects a specific two-message chat format. Using a different prompt will produce poor results.
### System message

```
You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. /no_think
```

The `/no_think` suffix disables Qwen's thinking mode; without it, the model may emit thousands of reasoning tokens before producing output.
User message
Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.
<code>{original_code}</code>
<update>{update_snippet}</update>
Provide the complete updated code.
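If you assemble the prompt programmatically, a small helper keeps the template exact. `build_user_message` is an illustrative name, not part of the toolkit, and the blank lines between blocks are an assumption about whitespace:

```python
# Illustrative helper that fills the user-message template above.
def build_user_message(original_code: str, update_snippet: str) -> str:
    return (
        "Merge all changes from the <update> snippet into the <code> below.\n"
        "- Preserve the code's structure, order, comments, and indentation exactly.\n"
        "- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.\n"
        "- Do not include any additional text, explanations, placeholders, ellipses, or code fences.\n\n"
        f"<code>{original_code}</code>\n\n"
        f"<update>{update_snippet}</update>\n\n"
        "Provide the complete updated code."
    )
```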
### Expected output

The model outputs the merged code wrapped in `<updated-code>` tags:

```
<updated-code>
def process(data):
    try:
        result = transform(data)
        return result
    except Error as e:
        return {"error": str(e)}
</updated-code>
```
## Complete example

Original code (what tree-sitter extracts for the target function):

```python
def process(data):
    result = transform(data)
    return result
```

Edit snippet (what the user/agent writes):

```python
def process(data):
    try:
        # ... existing code ...
    except Error as e:
        return {"error": str(e)}
```

Model output:

```
<updated-code>
def process(data):
    try:
        result = transform(data)
        return result
    except Error as e:
        return {"error": str(e)}
</updated-code>
```
The model understands `# ... existing code ...` markers (and language-specific variants like `// ... existing code ...`) as instructions to preserve the original lines in that region.
## How it fits into FastEdit

In production, the model is the fallback, not the primary path:

- **AST scoping**: tree-sitter finds the target function by name (~35 lines), so the model never sees the whole file
- **Deterministic text-match**: 74% of edits are resolved by matching context lines and splicing in new lines (0 tokens, <1 ms)
- **Model merge**: the remaining 26% of edits (structural changes like wrapping in try/catch, full rewrites) go to this model
The model only ever processes ~35-line chunks. It was trained on function-scoped edits, not whole files. Feeding it large inputs will degrade quality.
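For intuition, here is a minimal sketch of the deterministic path for the two simplest snippet shapes (pure append and pure prepend). The names and scope are assumptions, not FastEdit's actual implementation, which resolves many more edit shapes:

```python
# Illustrative sketch of a deterministic marker splice (0 tokens, no model call).
MARKERS = {"# ... existing code ...", "// ... existing code ..."}

def try_simple_splice(original: str, snippet: str) -> str | None:
    """Return the merged chunk for a pure append/prepend snippet, else None."""
    lines = snippet.splitlines()
    marked = [i for i, line in enumerate(lines) if line.strip() in MARKERS]
    if len(marked) != 1:
        return None                      # zero or multiple markers: use the model
    i = marked[0]
    if i == 0:                           # marker first: keep original, append the rest
        return original.rstrip("\n") + "\n" + "\n".join(lines[1:]) + "\n"
    if i == len(lines) - 1:              # marker last: prepend the rest, keep original
        return "\n".join(lines[:-1]) + "\n" + original
    return None                          # anything structural falls through to the model
```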
## Using without FastEdit

If you want to use the model directly (without the toolkit), you need to:

- **Scope the input yourself**: extract only the target function/class, not the whole file
- **Use the exact prompt format above**: different prompts will produce different (worse) results
- **Parse the output**: extract the text between `<updated-code>` and `</updated-code>` tags
- **Handle edge cases**: the model may emit `<think>` blocks (strip them), use variant tag names (`<update-code>`, `<updated_code>`), or truncate output on long functions
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# BF16 (GPU / fine-tuning)
model = AutoModelForCausalLM.from_pretrained(
    "continuous-lab/FastEdit", subfolder="bf16", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("continuous-lab/FastEdit", subfolder="bf16")

messages = [
    {"role": "system", "content": "You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. /no_think"},
    {"role": "user", "content": """Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.

<code>def process(data):
    result = transform(data)
    return result</code>

<update>def process(data):
    try:
        # ... existing code ...
    except Error as e:
        return {"error": str(e)}</update>

Provide the complete updated code."""},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding
result = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Parse: extract text between <updated-code> and </updated-code>
```
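A tolerant parser covering the edge cases listed above might look like the following. The regexes and fallback behavior are assumptions about reasonable handling, not toolkit code:

```python
import re

def parse_merge_output(raw: str) -> str | None:
    """Extract merged code, tolerating <think> blocks and tag-name variants."""
    # Strip any <think>...</think> reasoning block first.
    raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Accept <updated-code>, <update-code>, <updated_code>, etc.
    m = re.search(r"<(updated?[-_]code)>(.*?)</\1>", raw, flags=re.DOTALL)
    if m:
        return m.group(2).strip("\n")
    # Truncated output: opening tag present but closing tag missing.
    m = re.search(r"<updated?[-_]code>(.*)", raw, flags=re.DOTALL)
    return m.group(1).strip("\n") if m else None
```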
## Training
- Base model: Qwen2.5-Coder-1.5B-Instruct
- Task: Code edit merging across 13 languages
## Evaluation
Tested on 22 structurally distinct edit patterns (73 cases) across 13 languages:
| Path | Accuracy | Avg tokens | Avg latency |
|---|---|---|---|
| Deterministic (74% of edits) | 100% | 0 | <1ms |
| Model (26% of edits) | 92% | ~40 | ~500ms |
| Combined | ~98% | ~10 | ~130ms |
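The combined row is consistent with weighting the two paths: 0.74 × 100% + 0.26 × 92% ≈ 97.9% accuracy, 0.26 × 40 ≈ 10 tokens on average, and 0.26 × 500 ms ≈ 130 ms average latency.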
Per-language model accuracy (156-example benchmark):
| Language | Accuracy |
|---|---|
| Python, Java, Kotlin, C, PHP | 92% |
| JavaScript, TypeScript, Rust, Swift | 85% |
| Go, C++, Ruby | 77% |
## Limitations

- Performance degrades on inputs longer than ~100 lines.
- Does not handle whole-file edits well; use the FastEdit toolkit's AST scoping.
- The edit snippet must use `# ... existing code ...` markers (or the language equivalent) for context preservation. Without markers, the model treats the entire snippet as a replacement.
- Languages not in the training set may work but are untested.
## License
MIT
## Serving with SGLang

Install from pip and start the server:

```bash
pip install sglang

python3 -m sglang.launch_server \
  --model-path "continuous-lab/FastEdit" \
  --host 0.0.0.0 \
  --port 30000
```

Or use the Docker image:

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "continuous-lab/FastEdit" \
  --host 0.0.0.0 \
  --port 30000
```

Call the server using curl (OpenAI-compatible API):

```bash
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "continuous-lab/FastEdit",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
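The curl example above is only a generic smoke test. To exercise the model's actual task through the same OpenAI-compatible endpoint, a request could look like this; the client setup and dummy `api_key` are assumptions:

```python
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API; the key is unused locally.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

SYSTEM = ("You are a coding assistant that helps merge code updates, "
          "ensuring every modification is fully integrated. /no_think")

user = """Merge all changes from the <update> snippet into the <code> below.
- Preserve the code's structure, order, comments, and indentation exactly.
- Output only the updated code, enclosed within <updated-code> and </updated-code> tags.
- Do not include any additional text, explanations, placeholders, ellipses, or code fences.

<code>def process(data):
    result = transform(data)
    return result</code>

<update>def process(data):
    try:
        # ... existing code ...
    except Error as e:
        return {"error": str(e)}</update>

Provide the complete updated code."""

resp = client.chat.completions.create(
    model="continuous-lab/FastEdit",
    messages=[{"role": "system", "content": SYSTEM},
              {"role": "user", "content": user}],
    temperature=0,
    max_tokens=512,
)
print(resp.choices[0].message.content)  # <updated-code>...</updated-code>
```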