# Qwen3-4B-Instruct-2507-ZIP-RC
This model is a modified version of Qwen/Qwen3-4B-Instruct-2507, trained to support Zero-Overhead Introspection (ZIP-RC) for adaptive test-time compute.
It was created as part of a paper-replication experiment for "Zero-Overhead Introspection for Adaptive Test-Time Compute" (Manvi et al., 2025).
| Links | Description |
|---|---|
| Quickstart Notebook | Run adaptive inference immediately. |
| Replication Notebook | Full experiments and reproduction. |
## Model Description
This model retains the full reasoning capabilities of the base Qwen3-4B-Instruct-2507 model but features a fine-tuned LM head. The head has been trained to repurpose unused logit space to predict, at every token step, a joint distribution over expected reward (correctness) and remaining generation length.
This allows the model to "introspect" during generation with zero computational overhead, enabling:
- Adaptive Sampling: Dynamically pruning low-quality trajectories.
- Budget Management: Balancing compute cost vs. accuracy.
- Self-Correction: Detecting when a reasoning path is failing before it finishes.
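These decisions are driven by the predicted joint distribution. As a purely illustrative sketch, the snippet below scores a partial trajectory by trading expected reward against expected remaining cost; the function name, the normalized length bins, and the utility rule are hypothetical stand-ins, not the actual Meta-MDP logic in the ziprc library (see the Usage section for the real API).

```python
import torch

def should_prune(joint_dist: torch.Tensor, alpha: float = 0.1, beta: float = 0.05) -> bool:
    """Toy utility rule over a predicted [reward_bins, length_bins] distribution."""
    reward_bins, length_bins = joint_dist.shape
    # Bin centers, assuming linearly spaced bins normalized to [0, 1]
    reward_centers = (torch.arange(reward_bins) + 0.5) / reward_bins
    length_centers = (torch.arange(length_bins) + 0.5) / length_bins
    expected_reward = (joint_dist.sum(dim=1) * reward_centers).sum()
    expected_length = (joint_dist.sum(dim=0) * length_centers).sum()
    # Hypothetical rule: keep the trajectory only if reward minus cost clears the threshold
    utility = expected_reward - beta * expected_length
    return utility.item() < alpha

# A uniform 8x7 prediction gives E[R] = 0.5, so this toy rule keeps it (prints False)
print(should_prune(torch.full((8, 7), 1 / 56)))
```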
## Usage
### 1. Quick Start: Adaptive Inference
The easiest way to use the model is via the ziprc helper library, which handles the Meta-MDP logic (branching, pruning, and swapping).
```python
import torch
import sys
import os
from huggingface_hub import hf_hub_download
# 1. Download the helper script dynamically
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the downloaded module
import ziprc
# 3. Run Inference
model = ziprc.ZIPRCModel(ziprc.ZIPRCConfig())
sampler = ziprc.ZIPRCSampler(model)
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
trajectories = sampler.generate(prompt, initial_samples=2)
best = sampler.select_best_trajectory(trajectories)
print(f"Confidence: {best['final_score']:.2%}")
### 2. Advanced Usage: Streaming & Configuration
This example shows how to configure the pruning aggressiveness (alpha) and cost penalty (beta), and how to stream the result to see the introspection in action.
```python
import sys
import os
from huggingface_hub import hf_hub_download
# 1. Download the helper script dynamically from the repo
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the module
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler
# 3. Configure and Load Model
# Note: The model weights are downloaded automatically here
cfg = ZIPRCConfig(
    model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc",
    alpha=0.1,            # Threshold for pruning
    beta=0.05,            # Cost penalty
    smoothing_window=3    # For stable predictions
)
model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)
# 4. Generate with Introspection
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
# generate_stream produces trajectories with introspection data
trajectories = sampler.generate_stream(prompt, initial_samples=2)
# Select the best answer based on the introspection score
best = sampler.select_best_trajectory(trajectories)
print(f"Confidence: {best['final_score']:.2%}")
print(f"Answer: {model.tokenizer.decode(best['ids'][0], skip_special_tokens=True)}")
### 3. Low-Level: Reading the Logits
You can manually decode the introspection signal (Reward and Cost) from the reserved tokens in the logits without using the sampler.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "dataopsnick/Qwen3-4B-Instruct-2507-zip-rc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
# Configuration used during training
reward_bins = 8
length_bins = 7
total_zip_tokens = 56
zip_start_offset = 56
# ZIP tokens are located at the very end of the vocabulary
zip_start_id = model.config.vocab_size - zip_start_offset
def get_introspection_probs(logits):
    """Extracts the joint distribution P(Reward, Length) from the logits."""
    # Slice the reserved ZIP logits
    zip_logits = logits[:, zip_start_id : zip_start_id + total_zip_tokens]
    # Softmax over the flat ZIP tokens to get valid probabilities
    probs = F.softmax(zip_logits, dim=-1)
    # Reshape to [Batch, Reward_Bins, Length_Bins]
    return probs.view(-1, reward_bins, length_bins)
# Example Inference Step
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(inputs.input_ids)
    next_token_logits = outputs.logits[:, -1, :]
# Get Introspection Signal (Zero Overhead)
joint_dist = get_introspection_probs(next_token_logits)
# 1. Marginalize over length to get P(Reward) distribution
p_reward = joint_dist.sum(dim=2) # Shape: [Batch, Reward_Bins]
# 2. Calculate Expected Reward (Confidence)
# The reward bins are linearly spaced [0, 1]. We use bin centers for the weighted sum.
# centers = 0.0625, 0.1875, ..., 0.9375
reward_grid = torch.linspace(0.0625, 0.9375, reward_bins).to(model.device)
# E[R] = sum(P(r) * r)
expected_reward = (p_reward * reward_grid).sum(dim=1).item()
print(f"Model Confidence: {expected_reward:.2%}")
### 4. OpenAI-Compatible Streaming (Async)
This method exposes the introspection data (zip_rc field) alongside standard text generation chunks, suitable for integration with frontends.
```python
import asyncio
import nest_asyncio
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler
# 1. Setup (Run once)
# This patch is required for running async loops in Colab/Jupyter
nest_asyncio.apply()
# Load Model
cfg = ZIPRCConfig(model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc")
model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)
async def consume_inference_stream():
    prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
    print(f"User: {prompt}\n" + "-" * 60)
    print("Assistant (Streaming with Introspection):")
    # 2. Get the OpenAI-compatible stream
    # Returns an async generator yielding chunk objects
    stream = sampler.openai(prompt, max_tokens=256)
    final_clean_answer = ""
    async for chunk in stream:
        # --- Channel A: Standard Text (compatible with standard UIs) ---
        # Use .get() to safely handle the final chunk, where delta is empty
        delta = chunk.choices[0].delta
        content = delta.get("content", "")
        if content:
            print(content, end="", flush=True)
        # --- Channel B: Zero-Overhead Introspection (the "Pareto" gain) ---
        # Read the side-channel data to see what the model is predicting
        # without running a separate reward-model inference.
        if hasattr(chunk, 'zip_rc'):
            info = chunk.zip_rc
            # If the model performs a meta-action (branching/pruning), log it.
            # Skip 'finished' here to avoid reading utility/score fields it lacks.
            if info.action not in ['keep', 'finished']:
                print(f"\n[META-ACTION: {info.action} | Utility: {info.utility:.4f}] ", end="")
            # Check for the final answer
            if info.action == 'finished' and getattr(info, 'final_text', None):
                final_clean_answer = info.final_text
            # Optional: peek at the "Confidence" (expected correctness) in real time
            # if info.step % 10 == 0:
            #     print(f" (Conf: {info.lhs_score:.1%}) ", end="")
    print("\n" + "-" * 40)
    print("FINAL BEST ANSWER (Clean):")
    print("-" * 40)
    print(final_clean_answer)
# 3. Execution
loop = asyncio.get_event_loop()
loop.run_until_complete(consume_inference_stream())
```
### 5. Local Server Deployment
You can deploy an OpenAI-compatible API server that streams both text and introspection data.
```python
import sys
import os
import asyncio
import uvicorn
from huggingface_hub import hf_hub_download
# 1. Download server.py
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="server.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the app
# NOTE: This will load the model weights again if they aren't cached.
# If you are low on VRAM, restart your runtime before running this cell.
from server import app
# 3. Run the Server (Colab/Jupyter Compatible)
HOST = "0.0.0.0"
PORT = 8000
config = uvicorn.Config(app, host=HOST, port=PORT)
server = uvicorn.Server(config)
try:
    # Check if we are in an existing loop (Colab)
    loop = asyncio.get_running_loop()
    print(f"Server running in background on http://{HOST}:{PORT}")
    loop.create_task(server.serve())
except RuntimeError:
    # Standard script execution
    asyncio.run(server.serve())
```
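Once the server is up, any OpenAI-style client can talk to it. The sketch below uses the `openai` Python client; it assumes the server exposes the standard `/v1` chat-completions route and forwards the extra `zip_rc` field on streamed chunks, so check `server.py` for the actual routes and payload shape.

```python
from openai import OpenAI

# Point the client at the local ZIP-RC server (the API key is unused but required by the client)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc",
    messages=[{"role": "user", "content": "How many shoes are five adults and three dogs wearing?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    # Introspection data, if the server forwards it, rides along as an extra field
    zip_rc = getattr(chunk, "zip_rc", None)
    if zip_rc is not None:
        pass  # e.g. inspect expected reward / remaining length here
```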
## Citation
```bibtex
@article{manvi2025ziprc,
  title   = {Zero-Overhead Introspection for Adaptive Test-Time Compute},
  author  = {Manvi, Rohin and Hong, Joey and Seyde, Tim and Labonne, Maxime and Lechner, Mathias and Levine, Sergey},
  journal = {arXiv preprint arXiv:2512.01457},
  year    = {2025}
}
```