# Qwen3-4B-Instruct-2507-ZIP-RC
This model is a modified version of Qwen/Qwen3-4B-Instruct-2507, trained to support Zero-Overhead Introspection (ZIP-RC) for adaptive test-time compute.
It was created as part of a paper-replication experiment for "Zero-Overhead Introspection for Adaptive Test-Time Compute" (Manvi et al., 2025).
| Links | Description |
|---|---|
| Quickstart Notebook | Run adaptive inference immediately. |
| Replication Notebook | Full experiments and reproduction. |
## Model Description
This model retains the full reasoning capabilities of the base Qwen3-4B-Instruct-2507 model but features a fine-tuned LM head. The head has been trained to repurpose unused logit space to predict, at every token step, a joint distribution over expected reward (correctness) and remaining generation length.
This allows the model to "introspect" during generation with zero computational overhead, enabling:
- Adaptive Sampling: Dynamically pruning low-quality trajectories.
- Budget Management: Balancing compute cost vs. accuracy.
- Self-Correction: Detecting when a reasoning path is failing before it finishes.
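These decisions are driven by the predicted joint distribution. As a purely illustrative sketch, the snippet below scores a partial trajectory by trading expected reward against expected remaining cost; the function name, the normalized length bins, and the utility rule are hypothetical stand-ins, not the actual Meta-MDP logic in the ziprc library (see the Usage section for the real API).

```python
import torch

def should_prune(joint_dist: torch.Tensor, alpha: float = 0.1, beta: float = 0.05) -> bool:
    """Toy utility rule over a predicted [reward_bins, length_bins] distribution."""
    reward_bins, length_bins = joint_dist.shape
    # Bin centers, assuming linearly spaced bins normalized to [0, 1]
    reward_centers = (torch.arange(reward_bins) + 0.5) / reward_bins
    length_centers = (torch.arange(length_bins) + 0.5) / length_bins
    expected_reward = (joint_dist.sum(dim=1) * reward_centers).sum()
    expected_length = (joint_dist.sum(dim=0) * length_centers).sum()
    # Hypothetical rule: keep the trajectory only if reward minus cost clears the threshold
    utility = expected_reward - beta * expected_length
    return utility.item() < alpha

# A uniform 8x7 prediction gives E[R] = 0.5, so this toy rule keeps it (prints False)
print(should_prune(torch.full((8, 7), 1 / 56)))
```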
## Usage
### 1. Quick Start: Adaptive Inference
The easiest way to use the model is via the ziprc helper library, which handles the Meta-MDP logic (branching, pruning, and swapping).
```python
import torch
import sys
import os
from huggingface_hub import hf_hub_download
# 1. Download the helper script dynamically
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the downloaded module
import ziprc
# 3. Run Inference
model = ziprc.ZIPRCModel(ziprc.ZIPRCConfig())
sampler = ziprc.ZIPRCSampler(model)
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
trajectories = sampler.generate(prompt, initial_samples=2)
best = sampler.select_best_trajectory(trajectories)
print(f"Confidence: {best['final_score']:.2%}")
### 2. Advanced Usage: Streaming & Configuration
This example shows how to configure the pruning aggressiveness (alpha) and cost penalty (beta), and how to stream the result to see the introspection in action.
```python
import sys
import os
from huggingface_hub import hf_hub_download
# 1. Download the helper script dynamically from the repo
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="ziprc.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the module
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler
# 3. Configure and Load Model
# Note: The model weights are downloaded automatically here
cfg = ZIPRCConfig(
    model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc",
    alpha=0.1,            # Threshold for pruning
    beta=0.05,            # Cost penalty
    smoothing_window=3    # For stable predictions
)
model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)
# 4. Generate with Introspection
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
# generate_stream produces trajectories with introspection data
trajectories = sampler.generate_stream(prompt, initial_samples=2)
# Select the best answer based on the introspection score
best = sampler.select_best_trajectory(trajectories)
print(f"Confidence: {best['final_score']:.2%}")
print(f"Answer: {model.tokenizer.decode(best['ids'][0], skip_special_tokens=True)}")
### 3. Low-Level: Reading the Logits
You can manually decode the introspection signal (Reward and Cost) from the reserved tokens in the logits without using the sampler.
```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "dataopsnick/Qwen3-4B-Instruct-2507-zip-rc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
# Configuration used during training
reward_bins = 8
length_bins = 7
total_zip_tokens = 56
zip_start_offset = 56
# ZIP tokens are located at the very end of the vocabulary
zip_start_id = model.config.vocab_size - zip_start_offset
def get_introspection_probs(logits):
    """Extracts the joint distribution P(Reward, Length) from the logits."""
    # Slice the reserved ZIP logits
    zip_logits = logits[:, zip_start_id : zip_start_id + total_zip_tokens]
    # Softmax over the flat ZIP tokens to get valid probabilities
    probs = F.softmax(zip_logits, dim=-1)
    # Reshape to [Batch, Reward_Bins, Length_Bins]
    return probs.view(-1, reward_bins, length_bins)
# Example Inference Step
prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(inputs.input_ids)
    next_token_logits = outputs.logits[:, -1, :]
# Get Introspection Signal (Zero Overhead)
joint_dist = get_introspection_probs(next_token_logits)
# 1. Marginalize over length to get P(Reward) distribution
p_reward = joint_dist.sum(dim=2) # Shape: [Batch, Reward_Bins]
# 2. Calculate Expected Reward (Confidence)
# The reward bins are linearly spaced [0, 1]. We use bin centers for the weighted sum.
# centers = 0.0625, 0.1875, ..., 0.9375
reward_grid = torch.linspace(0.0625, 0.9375, reward_bins).to(model.device)
# E[R] = sum(P(r) * r)
expected_reward = (p_reward * reward_grid).sum(dim=1).item()
print(f"Model Confidence: {expected_reward:.2%}")
### 4. OpenAI-Compatible Streaming (Async)
This method exposes the introspection data (zip_rc field) alongside standard text generation chunks, suitable for integration with frontends.
```python
import asyncio
import nest_asyncio
from ziprc import ZIPRCModel, ZIPRCConfig, ZIPRCSampler
# 1. Setup (Run once)
# This patch is required for running async loops in Colab/Jupyter
nest_asyncio.apply()
# Load Model
cfg = ZIPRCConfig(model_name="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc")
model = ZIPRCModel(cfg)
sampler = ZIPRCSampler(model)
async def consume_inference_stream():
    prompt = "Solve the following logic puzzle: Five adults check into a hotel with three dogs. How many shoes are they all wearing?"
    print(f"User: {prompt}\n" + "-" * 60)
    print("Assistant (Streaming with Introspection):")
    # 2. Get the OpenAI-compatible stream
    # Returns an async generator yielding chunk objects
    stream = sampler.openai(prompt, max_tokens=256)
    final_clean_answer = ""
    async for chunk in stream:
        # --- Channel A: Standard Text (compatible with standard UIs) ---
        # Use .get() to safely handle the final chunk, where delta is empty
        delta = chunk.choices[0].delta
        content = delta.get("content", "")
        if content:
            print(content, end="", flush=True)
        # --- Channel B: Zero-Overhead Introspection (the "Pareto" gain) ---
        # Read the side-channel data to see what the model is predicting
        # without running a separate reward-model inference.
        if hasattr(chunk, 'zip_rc'):
            info = chunk.zip_rc
            # If the model performs a meta-action (branching/pruning), log it.
            # Skip 'finished' here to avoid reading utility/score fields it lacks.
            if info.action not in ['keep', 'finished']:
                print(f"\n[META-ACTION: {info.action} | Utility: {info.utility:.4f}] ", end="")
            # Check for the final answer
            if info.action == 'finished' and getattr(info, 'final_text', None):
                final_clean_answer = info.final_text
            # Optional: peek at the "Confidence" (expected correctness) in real time
            # if info.step % 10 == 0:
            #     print(f" (Conf: {info.lhs_score:.1%}) ", end="")
    print("\n" + "-" * 40)
    print("FINAL BEST ANSWER (Clean):")
    print("-" * 40)
    print(final_clean_answer)
# 3. Execution
loop = asyncio.get_event_loop()
loop.run_until_complete(consume_inference_stream())
```
### 5. Local Server Deployment
You can deploy an OpenAI-compatible API server that streams both text and introspection data.
```python
import sys
import os
import asyncio
import uvicorn
from huggingface_hub import hf_hub_download
# 1. Download server.py
script_path = hf_hub_download(repo_id="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc", filename="server.py")
sys.path.append(os.path.dirname(script_path))
# 2. Import the app
# NOTE: This will load the model weights again if they aren't cached.
# If you are low on VRAM, restart your runtime before running this cell.
from server import app
# 3. Run the Server (Colab/Jupyter Compatible)
HOST = "0.0.0.0"
PORT = 8000
config = uvicorn.Config(app, host=HOST, port=PORT)
server = uvicorn.Server(config)
try:
    # Check if we are in an existing loop (Colab)
    loop = asyncio.get_running_loop()
    print(f"Server running in background on http://{HOST}:{PORT}")
    loop.create_task(server.serve())
except RuntimeError:
    # Standard script execution
    asyncio.run(server.serve())
```
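Once the server is up, any OpenAI-style client can talk to it. The sketch below uses the `openai` Python client; it assumes the server exposes the standard `/v1` chat-completions route and forwards the extra `zip_rc` field on streamed chunks, so check `server.py` for the actual routes and payload shape.

```python
from openai import OpenAI

# Point the client at the local ZIP-RC server (the API key is unused but required by the client)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="dataopsnick/Qwen3-4B-Instruct-2507-zip-rc",
    messages=[{"role": "user", "content": "How many shoes are five adults and three dogs wearing?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    # Introspection data, if the server forwards it, rides along as an extra field
    zip_rc = getattr(chunk, "zip_rc", None)
    if zip_rc is not None:
        pass  # e.g. inspect expected reward / remaining length here
```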
## Citation
```bibtex
@article{manvi2025ziprc,
  title   = {Zero-Overhead Introspection for Adaptive Test-Time Compute},
  author  = {Manvi, Rohin and Hong, Joey and Seyde, Tim and Labonne, Maxime and Lechner, Mathias and Levine, Sergey},
  journal = {arXiv preprint arXiv:2512.01457},
  year    = {2025}
}
```