Commit 9907df0 (verified) · Parent(s): 0061d01 · committed by acnagle

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,3 +1,276 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ license: apache-2.0
+ library_name: vllm
+ tags:
+ - reasoning
+ - chain-of-thought
+ - efficiency
+ - inference-optimization
+ - qwen3
+ base_model: Qwen/Qwen3-14B
+ base_model_relation: finetune
+ pipeline_tag: text-generation
+ ---
+
+ # Terminator-Qwen3-14B
+
+ **Terminator** is a lightweight neural module that predicts when a reasoning language model has reached its final answer during chain-of-thought (CoT) generation. When the Terminator detects the model has committed to an answer, it truncates the remaining reasoning and forces the model to begin its response, thereby delivering the same answer with significantly less computation.
+
+ This repository contains everything needed to run **Terminator-Qwen3-14B**:
+
+ - Trained Terminator checkpoint (1 extra transformer layer + prediction head)
+ - vLLM plugin code (`vllm_terminator/`) for high-performance serving
+ - Server launcher and streaming client
+ - Standalone HuggingFace inference script (no server required)
+ - Automated setup script
+
+ **Note**: Terminator currently supports **single-GPU, single-sequence inference only**.
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ # 1. Clone the repository (requires Git LFS: https://git-lfs.com)
+ git lfs install
+ git clone https://huggingface.co/acnagle/Terminator-Qwen3-14B
+ cd Terminator-Qwen3-14B
+
+ # 2. Run automated setup (creates conda env, installs vllm, downloads base model)
+ ./setup.sh
+
+ # 3. Start the server
+ ./start_server.sh
+
+ # 4. In another terminal, chat with the model
+ python client.py --interactive
+ ```
+
+ ---
+
+ ## Requirements
+
+ - **GPU**: Single NVIDIA GPU with at least ~40GB VRAM (e.g., A100 40GB)
+ - **CUDA**: Compatible CUDA driver (version 12.9 or newer recommended)
+ - **Python**: 3.12
+ - **OS**: Linux (recommended) or any OS supported by vLLM
+
+ ---
+
+ ## Installation
+
+ ### Option A: Automated Setup
+
+ The `setup.sh` script handles everything:
+
+ ```bash
+ ./setup.sh
+ ```
+
+ This will:
+ 1. Create a conda environment called `terminator` with Python 3.12
+ 2. Install [uv](https://docs.astral.sh/uv/), [vLLM](https://docs.vllm.ai/), and [openai](https://pypi.org/project/openai/)
+ 3. Download Qwen3-14B base model weights (~28GB) from HuggingFace
+ 4. Create the model directory (`model_dir/`)
+
+ ### Option B: Manual Setup
+
+ **1. Create a Python environment**
+
+ Using conda or micromamba:
+
+ ```bash
+ conda create -n terminator python=3.12 -y
+ conda activate terminator
+ ```
+
+ **2. Install uv**
+
+ ```bash
+ pip install --upgrade uv
+ ```
+
+ Or see the [uv installation guide](https://docs.astral.sh/uv/getting-started/installation/).
+
+ **3. Install vLLM**
+
+ ```bash
+ uv pip install vllm --torch-backend=auto
+ ```
+
+ See the [vLLM installation guide](https://docs.vllm.ai/en/latest/getting_started/installation/) for alternative installation methods (ROCm, CPU, etc.).
+
+ **4. Install openai (for the client)**
+
+ ```bash
+ uv pip install openai
+ ```
+
+ **5. Set up the model directory**
+
+ This downloads the base Qwen3-14B weights and creates a vLLM-ready model directory:
+
+ ```bash
+ python setup_model_dir.py
+ ```
+
+ The script accepts optional arguments:
+
+ | Argument | Default | Description |
+ |----------|---------|-------------|
+ | `--checkpoint` | `./terminator.pt` | Path to the Terminator checkpoint |
+ | `--output-dir` | `./model_dir` | Output model directory |
+ | `--threshold` | `0.7` | Prediction threshold for Terminator activation |
+ | `--window-size` | `10` | Sliding window size for majority vote |
+ | `--exit-message` | *(built-in message)* | Message injected when Terminator fires |
+
+ ---
+
+ ## Starting the Server
+
+ ```bash
+ ./start_server.sh
+ ```
+
+ Or with custom configuration:
+
+ ```bash
+ VLLM_GPU_UTIL=0.70 VLLM_MAX_MODEL_LEN=8192 ./start_server.sh
+ ```
+
+ The server exposes an **OpenAI-compatible API** on the configured port (default: 8000).
+
+ ### Configuration
+
+ Set these environment variables before running `start_server.sh` or `serve.py`:
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `VLLM_GPU_UTIL` | `0.90` | Fraction of GPU memory to use for the model |
+ | `VLLM_MAX_MODEL_LEN` | *(auto)* | Maximum context length in tokens |
+ | `VLLM_PORT` | `8000` | Server port |
+ | `VLLM_ENFORCE_EAGER` | `0` | Set to `1` to disable CUDA graphs |
+ | `VLLM_API_KEY` | *(none)* | Require this API key from clients |
+ | `VLLM_SERVED_NAME` | `Terminator-Qwen3-14B` | Model name reported by the API |
+
+ ---
+
+ ## Standalone Inference (No Server)
+
+ **Recommendation:** For the best performance, use the vLLM server described above. vLLM uses KV caching, CUDA graphs, and optimized kernels, making it **significantly faster** than HuggingFace-native inference. The script below is provided for quick testing and demos where spinning up a server is inconvenient.
+
+ To run without a vLLM server, use the HuggingFace-native inference script:
+
+ ```bash
+ python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"
+ ```
+
+ This loads the model directly via HuggingFace `transformers` and runs token-by-token generation with the Terminator head. Thinking content is streamed in dimmed text; the final answer is shown in bold.
+
+ | Argument | Default | Description |
+ |----------|---------|-------------|
+ | `--prompt` | *(required)* | Input prompt |
+ | `--model` | `Qwen/Qwen3-14B` | HuggingFace model name or path |
+ | `--checkpoint` | `./terminator.pt` | Path to the Terminator checkpoint |
+ | `--threshold` | `0.7` | Prediction threshold |
+ | `--window-size` | `10` | Sliding window size for majority vote |
+ | `--exit-message` | *(built-in message)* | Message injected when Terminator fires (empty string to disable) |
+ | `--max-tokens` | `32768` | Maximum tokens to generate |
+ | `--temperature` | `0.6` | Sampling temperature |
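The script samples with temperature plus top-k (20) and top-p (0.95) filtering from `transformers`. As a rough illustration of what the top-p (nucleus) rule does to a distribution, here is a plain-Python sketch; it is an editorial example, not the library implementation:

```python
def top_p_filter(probs, p=0.95):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        total += prob
        if total >= p:
            break
    # Renormalize the surviving tokens into a proper distribution
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Example: a peaked 4-token distribution with p=0.9
dist = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)
print(sorted(dist))  # the lowest-probability token is pruned
```

Lowering `p` prunes more of the tail, making generation more deterministic; the script's default of 0.95 keeps most of the probability mass.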
+
+ ---
+
+ ## Using the Client (vLLM Server)
+
+ ### Single Prompt
+
+ ```bash
+ python client.py --prompt "What is the sum of the first 100 natural numbers?"
+ ```
+
+ ### Interactive Mode
+
+ ```bash
+ python client.py --interactive
+ ```
+
+ This starts a multi-turn conversation with the model. Thinking content is displayed in dimmed text; the final answer is shown in bold.
+
+ ### Client Options
+
+ | Argument | Default | Description |
+ |----------|---------|-------------|
+ | `--base-url` | `http://localhost:8000/v1` | Server URL |
+ | `--max-tokens` | *(server default)* | Maximum tokens to generate |
+ | `--temperature` | `0.6` | Sampling temperature |
+
+ ### Using the API Directly
+
+ The server is OpenAI-compatible. You can use any OpenAI client library. Replace `localhost` with your server's address if connecting remotely:
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="Terminator-Qwen3-14B",
+     messages=[{"role": "user", "content": "What is 25 * 37?"}],
+     temperature=0.6,
+     extra_body={"chat_template_kwargs": {"enable_thinking": True}},
+ )
+
+ # Thinking content (chain-of-thought)
+ print(response.choices[0].message.reasoning_content)
+
+ # Final answer
+ print(response.choices[0].message.content)
+ ```
+
+ ---
+
+ ## How Terminator Works
+
+ Terminator is a single transformer layer followed by a prediction head, trained on top of a frozen Qwen3-14B base model. The transformer layer (initialized as a copy of the base model's final layer, then fine-tuned) takes the hidden states from the LLM and processes them before the prediction head, which outputs a per-token binary prediction: *has the model reached its final answer?*
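Schematically, the prediction head reduces each processed hidden state to a single logit, and a sigmoid turns that logit into a stop probability. The toy below is a dependency-free sketch of that reduction with made-up weights; the real head is the trained FFN in `terminator_head.py`:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_stop(hidden, weights, bias=0.0):
    """Per-token binary prediction: P(model has reached its final answer).

    `hidden` is a toy stand-in for the processed hidden state, and
    `weights`/`bias` are hypothetical, not the trained parameters.
    """
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return sigmoid(logit)

# Toy 4-dim hidden state and weight vector
p = predict_stop([0.2, -0.1, 0.4, 0.3], [1.0, 0.5, -0.2, 0.8])
print(round(p, 3))
```

A probability above the activation threshold (default 0.7) counts as one "stop" vote for the sliding-window logic described next.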
+
+ During generation, Terminator maintains a **sliding window** of the most recent predictions. When a majority of the predictions in the window exceed the threshold (default: 0.7), the model is considered to have reached its final answer. At that point:
+
+ 1. A short **exit message** is injected into the reasoning (e.g., *"I've run out of thinking tokens. I need to commit to a final answer."*) to help the model transition smoothly.
+ 2. The `</think>` token is forced, ending the reasoning phase.
+ 3. The model generates its final answer normally.
+
+ This allows the model to skip potentially thousands of redundant reasoning tokens while preserving answer quality.
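The stopping rule above can be simulated without the model: given a stream of per-token stop probabilities, the controller fires as soon as more than half of the last `window_size` predictions exceed the threshold. A minimal sketch of that logic, using the defaults from the tables above:

```python
from collections import deque

def find_exit_step(predictions, threshold=0.7, window_size=10):
    """Return the step at which Terminator would fire, or None if it never does."""
    window = deque(maxlen=window_size)
    for step, p in enumerate(predictions):
        window.append(p)
        if len(window) == window_size:
            above = sum(1 for q in window if q > threshold)
            if above / window_size > 0.5:
                return step  # majority of the window is confident: stop here
    return None  # reasoning ended naturally

# 30 uncertain predictions followed by confident ones:
# the window needs 6 of 10 confident votes, so it fires a few tokens
# after the predictor becomes confident, not on the first spike.
preds = [0.1] * 30 + [0.95] * 10
print(find_exit_step(preds))
```

Raising `--threshold` or `--window-size` makes termination more conservative (fewer but later exits); lowering them trades some answer-quality safety margin for larger token savings.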
+
+ ---
+
+ ## File Structure
+
+ ```
+ Terminator-Qwen3-14B/
+ ├── README.md              This file
+ ├── terminator.pt          Trained Terminator checkpoint
+ ├── vllm_terminator/       vLLM plugin package
+ │   ├── __init__.py        Registers the model architecture with vLLM
+ │   ├── model.py           Qwen3TerminatorForCausalLM model class
+ │   └── terminator_head.py FFN classifier and checkpoint loading
+ ├── inference_hf.py        Standalone HuggingFace inference (no server)
+ ├── serve.py               vLLM server launcher
+ ├── setup_model_dir.py     Model directory setup (downloads base weights)
+ ├── client.py              Streaming chat client (connects to vLLM server)
+ ├── setup.sh               Automated setup script
+ └── start_server.sh        Server launcher with sensible defaults
+ ```
+
+ ---
+
+ ## Citation
+
+ *Coming soon.*
+
+ ---
+
+ ## License
+
+ This project builds on [Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) by the Qwen team. Please refer to the Qwen3 license for base model usage terms.
client.py ADDED
@@ -0,0 +1,203 @@
+ #!/usr/bin/env python3
+ """
+ Client for the Terminator vLLM server.
+
+ Supports single-prompt and multi-turn conversation modes with streaming
+ output. Thinking content is displayed in dimmed text; answer content in
+ normal text.
+
+ Usage:
+     # Single prompt
+     python client.py --prompt "What is the sum of the first 100 natural numbers?"
+
+     # Interactive multi-turn conversation
+     python client.py --interactive
+
+     # Custom server URL and max tokens
+     python client.py --base-url http://localhost:8001/v1 --max-tokens 8192 --prompt "Hello"
+ """
+
+ import argparse
+ import sys
+
+ from openai import OpenAI
+
+
+ # ANSI escape codes
+ DIM = "\033[2m"
+ BOLD = "\033[1m"
+ RESET = "\033[0m"
+
+ BANNER_LINES = [
+     r"████████╗███████╗██████╗ ███╗ ███╗██╗███╗ ██╗ █████╗ ████████╗ ██████╗ ██████╗ ",
+     r"╚══██╔══╝██╔════╝██╔══██╗████╗ ████║██║████╗ ██║██╔══██╗╚══██╔══╝██╔═══██╗██╔══██╗",
+     r" ██║ █████╗ ██████╔╝██╔████╔██║██║██╔██╗ ██║███████║ ██║ ██║ ██║██████╔╝",
+     r" ██║ ██╔══╝ ██╔══██╗██║╚██╔╝██║██║██║╚██╗██║██╔══██║ ██║ ██║ ██║██╔══██╗",
+     r" ██║ ███████╗██║ ██║██║ ╚═╝ ██║██║██║ ╚████║██║ ██║ ██║ ╚██████╔╝██║ ██║",
+     r" ╚═╝ ╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝╚═╝ ╚═══╝╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝",
+ ]
+
+ # Dark red -> light red gradient (one color per row)
+ _GRADIENT_RGB = [
+     (140, 0, 0),
+     (165, 15, 15),
+     (190, 35, 35),
+     (215, 55, 55),
+     (235, 70, 70),
+     (255, 90, 90),
+ ]
+
+
+ def print_banner() -> None:
+     for line, (r, g, b) in zip(BANNER_LINES, _GRADIENT_RGB):
+         print(f"\033[38;2;{r};{g};{b}m{line}{RESET}")
+
+
+ def detect_model(client: OpenAI) -> str:
+     """Auto-detect the served model name from the server."""
+     try:
+         models = client.models.list()
+         if not models.data:
+             print("ERROR: No models available on the server.", file=sys.stderr)
+             sys.exit(1)
+         return models.data[0].id
+     except Exception as e:
+         print(f"ERROR: Could not connect to server: {e}", file=sys.stderr)
+         sys.exit(1)
+
+
+ def stream_response(
+     client: OpenAI,
+     model: str,
+     messages: list[dict],
+     max_tokens: int | None,
+     temperature: float,
+ ) -> str:
+     """Stream a chat completion response.
+
+     Thinking content is printed in dim text, answer content in normal text.
+     Returns the assistant's answer content (for conversation history).
+     """
+     kwargs = dict(
+         model=model,
+         messages=messages,
+         temperature=temperature,
+         stream=True,
+         extra_body={"chat_template_kwargs": {"enable_thinking": True}},
+     )
+     if max_tokens is not None:
+         kwargs["max_tokens"] = max_tokens
+
+     stream = client.chat.completions.create(**kwargs)
+
+     in_thinking = False
+     in_answer = False
+     full_content = ""
+
+     try:
+         for chunk in stream:
+             if not chunk.choices:
+                 continue
+             delta = chunk.choices[0].delta
+
+             reasoning = getattr(delta, "reasoning_content", None)
+             if reasoning:
+                 if not in_thinking:
+                     sys.stdout.write(f"\n{DIM}Thinking...\n")
+                     in_thinking = True
+                 sys.stdout.write(reasoning)
+                 sys.stdout.flush()
+
+             if delta.content:
+                 if not in_answer:
+                     if in_thinking:
+                         sys.stdout.write(RESET)
+                     sys.stdout.write(f"\n{BOLD}Answer:{RESET}\n")
+                     in_answer = True
+                 sys.stdout.write(delta.content)
+                 sys.stdout.flush()
+                 full_content += delta.content
+     except KeyboardInterrupt:
+         pass
+     finally:
+         sys.stdout.write(RESET)
+         sys.stdout.flush()
+
+     print()
+     return full_content
+
+
+ def run_single(client, model, prompt, max_tokens, temperature):
+     """Run a single prompt and exit."""
+     messages = [{"role": "user", "content": prompt}]
+     stream_response(client, model, messages, max_tokens, temperature)
+
+
+ def run_interactive(client, model, max_tokens, temperature):
+     """Interactive multi-turn conversation loop."""
+     messages = []
+     print()
+     print_banner()
+     print()
+     print(f"  Connected to {BOLD}{model}{RESET}")
+     print(f"  Type your message and press Enter. Type {BOLD}quit{RESET} or Ctrl+C to exit.")
+     print(f"  {DIM}Note: input is single-line only — compose your full message before pressing Enter.{RESET}")
+     print(f"  {DIM}      Please ensure that copied text is formatted as a single line before pasting.{RESET}")
+     print()
+
+     while True:
+         try:
+             user_input = input(f"{BOLD}>>>{RESET} ")
+         except (KeyboardInterrupt, EOFError):
+             print("\nGoodbye!")
+             break
+
+         if user_input.strip().lower() in ("quit", "exit", "q"):
+             print("Goodbye!")
+             break
+
+         if not user_input.strip():
+             continue
+
+         messages.append({"role": "user", "content": user_input})
+         content = stream_response(client, model, messages, max_tokens, temperature)
+         messages.append({"role": "assistant", "content": content})
+         print()
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description=__doc__,
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+     )
+     mode = parser.add_mutually_exclusive_group(required=True)
+     mode.add_argument("--prompt", type=str, help="Single prompt to send")
+     mode.add_argument(
+         "--interactive", action="store_true",
+         help="Start an interactive multi-turn conversation",
+     )
+     parser.add_argument(
+         "--base-url", default="http://localhost:8000/v1",
+         help="vLLM server URL (default: http://localhost:8000/v1)",
+     )
+     parser.add_argument(
+         "--max-tokens", type=int, default=None,
+         help="Maximum tokens to generate (default: server decides based on context length)",
+     )
+     parser.add_argument(
+         "--temperature", type=float, default=0.6,
+         help="Sampling temperature (default: 0.6)",
+     )
+     args = parser.parse_args()
+
+     client = OpenAI(base_url=args.base_url, api_key="EMPTY")
+     model = detect_model(client)
+
+     if args.prompt:
+         run_single(client, model, args.prompt, args.max_tokens, args.temperature)
+     else:
+         run_interactive(client, model, args.max_tokens, args.temperature)
+
+
+ if __name__ == "__main__":
+     main()
inference_hf.py ADDED
@@ -0,0 +1,383 @@
+ #!/usr/bin/env python3
+ """
+ HuggingFace-native inference for Terminator-Qwen3-14B.
+
+ Loads the frozen Qwen3 base model + trained Terminator head (FFN + optional
+ extra transformer layers) directly via HuggingFace transformers.
+
+ Generates chain-of-thought reasoning token-by-token. The Terminator FFN
+ predicts when the final answer has been reached; when a sliding-window
+ majority vote exceeds the threshold, an exit message is injected and the
+ model transitions to answering mode.
+
+ Usage:
+     python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"
+
+     python inference_hf.py \\
+         --prompt "Solve x^2 - 5x + 6 = 0" \\
+         --model Qwen/Qwen3-14B \\
+         --checkpoint terminator.pt \\
+         --threshold 0.7 --window-size 10
+ """
+
+ import argparse
+ import os
+ import sys
+ from pathlib import Path
+
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from transformers import TopKLogitsWarper, TopPLogitsWarper, TemperatureLogitsWarper
+ from transformers.generation.logits_process import LogitsProcessorList
+
+ # ---------------------------------------------------------------------------
+ # Imports from the project
+ # ---------------------------------------------------------------------------
+
+ # Local: TerminatorFFN + checkpoint loader
+ _script_dir = Path(__file__).resolve().parent
+ sys.path.insert(0, str(_script_dir))
+ from vllm_terminator.terminator_head import load_terminator_checkpoint
+
+ # Parent dir: ExtraTransformerLayers from terminator_utils
+ _repo_root = _script_dir.parent
+ sys.path.insert(0, str(_repo_root))
+ from terminator_utils import ExtraTransformerLayers
+
+ # ---------------------------------------------------------------------------
+ # ANSI escape codes
+ # ---------------------------------------------------------------------------
+ DIM = "\033[2m"
+ BOLD = "\033[1m"
+ RESET = "\033[0m"
+
+
+ def load_model_and_tokenizer(model_name, device):
+     """Load base Qwen3 model and tokenizer."""
+     print(f"Loading tokenizer: {model_name}")
+     tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     think_token_id = tokenizer.convert_tokens_to_ids("<think>")
+     think_end_token_id = tokenizer.convert_tokens_to_ids("</think>")
+     if think_token_id == tokenizer.unk_token_id or think_end_token_id == tokenizer.unk_token_id:
+         raise ValueError(
+             f"<think>/</think> tokens not in tokenizer! "
+             f"IDs: {think_token_id}, {think_end_token_id}"
+         )
+
+     print(f"Loading model: {model_name}")
+     model = AutoModelForCausalLM.from_pretrained(
+         model_name,
+         torch_dtype=torch.bfloat16,
+         device_map={"": device},
+         trust_remote_code=True,
+     )
+     for param in model.parameters():
+         param.requires_grad = False
+     model.eval()
+
+     print(
+         f"Model loaded: {model.config.num_hidden_layers} layers, "
+         f"hidden size {model.config.hidden_size}"
+     )
+     return model, tokenizer, think_token_id, think_end_token_id
+
+
+ def build_extra_layers(base_model, checkpoint_config, extra_layers_state_dict, device):
+     """Reconstruct extra transformer layers from checkpoint state dict."""
+     num_extra_layers = checkpoint_config.get("num_extra_layers", 0)
+     if num_extra_layers == 0 or extra_layers_state_dict is None:
+         return None
+
+     print(f"Reconstructing {num_extra_layers} extra transformer layer(s)...")
+     base_layer_class = base_model.model.layers[0].__class__
+     model_config = base_model.config
+     rotary_emb = getattr(base_model.model, "rotary_emb", None)
+
+     extra_layers = ExtraTransformerLayers(
+         base_layer_class, num_extra_layers, model_config, rotary_emb=rotary_emb
+     ).to(device)
+     extra_layers.load_state_dict(extra_layers_state_dict)
+     extra_layers.eval()
+
+     param_count = sum(p.numel() for p in extra_layers.parameters())
+     print(f"Extra layers loaded ({param_count:,} parameters)")
+     return extra_layers
+
+
+ def generate_with_terminator(
+     prompt,
+     model,
+     tokenizer,
+     ffn,
+     extra_layers,
+     layer_idx,
+     think_token_id,
+     think_end_token_id,
+     threshold,
+     window_size,
+     exit_message,
+     max_tokens,
+     temperature,
+     device,
+ ):
+     """Generate a response with Terminator early-exit logic.
+
+     Follows the same generation pattern as inference_terminator.py:mode1_generate().
+     Streams thinking tokens to the terminal as they are produced.
+     """
+     # Format prompt via chat template
+     messages = [{"role": "user", "content": prompt}]
+     prompt_text = tokenizer.apply_chat_template(
+         messages, tokenize=False, add_generation_prompt=True
+     )
+
+     # Tokenize and append <think>
+     prompt_ids = tokenizer(
+         prompt_text, add_special_tokens=False, return_tensors="pt"
+     )["input_ids"].to(device).long()
+
+     input_ids = torch.cat(
+         [prompt_ids, torch.tensor([[think_token_id]], dtype=torch.long, device=device)],
+         dim=1,
+     )
+
+     # Sampling processors
+     logits_processor = LogitsProcessorList([
+         TemperatureLogitsWarper(temperature=temperature),
+         TopKLogitsWarper(top_k=20),
+         TopPLogitsWarper(top_p=0.95),
+     ])
+
+     # Sliding-window state
+     predictions_list = []
+     reasoning_tokens = []
+     early_exit = False
+
+     # Start streaming thinking output
+     sys.stdout.write(f"\n{DIM}Thinking...\n")
+     sys.stdout.flush()
+
+     for step in range(max_tokens):
+         attention_mask = torch.ones_like(input_ids)
+
+         # Hook to capture hidden states from the target layer
+         captured = {}
+
+         def hook_fn(module, input, output):
+             if isinstance(output, tuple):
+                 captured["hidden"] = output[0].detach()
+             else:
+                 captured["hidden"] = output.detach()
+
+         target_layer = model.model.layers[layer_idx]
+         handle = target_layer.register_forward_hook(hook_fn)
+
+         with torch.no_grad():
+             outputs = model(
+                 input_ids=input_ids,
+                 attention_mask=attention_mask,
+                 use_cache=False,
+             )
+
+         handle.remove()
+
+         hidden_states = captured["hidden"]  # [1, seq_len, hidden_size]
+
+         # Make prediction once we have at least one thinking token
+         if len(reasoning_tokens) > 0:
+             if extra_layers is not None:
+                 h = hidden_states.float()
+                 h = extra_layers(h, attention_mask=attention_mask)
+                 last_h = h[:, -1:, :]
+                 logits_pred = ffn(last_h.float())
+             else:
+                 last_h = hidden_states[:, -1:, :]
+                 logits_pred = ffn(last_h.float())
+
+             pred = torch.sigmoid(logits_pred)
+             predictions_list.append(pred[0, 0].item())
+
+             # Sliding-window majority vote
+             if len(predictions_list) >= window_size:
+                 window = predictions_list[-window_size:]
+                 n_above = sum(1 for p in window if p > threshold)
+                 if n_above / window_size > 0.5:
+                     early_exit = True
+                     break
+
+         # Sample next token — LogitsProcessorList expects 2D [batch, vocab]
+         next_logits = outputs.logits[:, -1, :]  # [1, vocab_size]
+         next_logits = logits_processor(input_ids, next_logits)
+         probs = F.softmax(next_logits, dim=-1)
+         next_token = torch.multinomial(probs, num_samples=1)  # [1, 1]
+
+         # Natural </think>
+         if next_token.item() == think_end_token_id:
+             break
+
+         input_ids = torch.cat([input_ids, next_token], dim=1)
+         reasoning_tokens.append(next_token.item())
+
+         # Stream the token
+         token_text = tokenizer.decode([next_token.item()], skip_special_tokens=False)
+         sys.stdout.write(token_text)
+         sys.stdout.flush()
+
+     # End thinking section
+     if early_exit and exit_message:
+         sys.stdout.write(exit_message)
+     sys.stdout.write(f"{RESET}\n")
+     sys.stdout.flush()
+
+     # Build input for final answer generation
+     if early_exit and exit_message:
+         exit_ids = tokenizer(
+             exit_message, add_special_tokens=False, return_tensors="pt"
+         )["input_ids"].to(device).long()
+         input_ids = torch.cat(
+             [input_ids, exit_ids,
+              torch.tensor([[think_end_token_id]], dtype=torch.long, device=device)],
+             dim=1,
+         )
+     else:
+         input_ids = torch.cat(
+             [input_ids,
+              torch.tensor([[think_end_token_id]], dtype=torch.long, device=device)],
+             dim=1,
+         )
+
+     # Generate final answer
+     attention_mask = torch.ones_like(input_ids)
+     with torch.no_grad():
+         final_outputs = model.generate(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             max_new_tokens=max_tokens,
+             do_sample=True,
+             temperature=temperature,
+             top_p=0.95,
+             top_k=20,
+             pad_token_id=tokenizer.pad_token_id,
+             eos_token_id=tokenizer.eos_token_id,
+         )
+
+     # Extract answer (everything after last </think>)
+     full_seq = final_outputs[0]
+     end_positions = (full_seq == think_end_token_id).nonzero(as_tuple=True)[0]
+     if len(end_positions) > 0:
+         answer_tokens = full_seq[end_positions[-1].item() + 1 :]
+         answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
+     else:
+         answer = ""
+
+     # Print answer
+     sys.stdout.write(f"{BOLD}Answer:{RESET}\n{answer}\n")
+     sys.stdout.flush()
+
+     # Summary
+     n_reasoning = len(reasoning_tokens)
+     exit_reason = "predictor" if early_exit else "natural_end"
+     print(
+         f"\n{DIM}[{exit_reason} | "
+         f"{n_reasoning} thinking tokens | "
+         f"{len(predictions_list)} predictions]{RESET}"
+     )
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description=__doc__,
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+     )
+     parser.add_argument("--prompt", type=str, required=True, help="Input prompt")
+     parser.add_argument(
+         "--model", type=str, default="Qwen/Qwen3-14B", help="HuggingFace model name"
+     )
+     parser.add_argument(
+         "--checkpoint",
+         type=str,
+         default=None,
+         help="Path to terminator .pt checkpoint (default: ./terminator.pt)",
+     )
+     parser.add_argument(
+         "--threshold", type=float, default=0.7, help="Per-prediction binarization threshold"
+     )
+     parser.add_argument(
+         "--window-size", type=int, default=10, help="Sliding-window size for majority vote"
+     )
+     parser.add_argument(
+         "--exit-message",
+         type=str,
+         default="\nI've run out of thinking tokens. I need to commit to a final answer.",
+         help="Message injected when terminator fires (empty string to disable)",
+     )
+     parser.add_argument(
+         "--max-tokens", type=int, default=32768, help="Max tokens to generate"
+     )
+     parser.add_argument(
+         "--temperature", type=float, default=0.6, help="Sampling temperature"
+     )
+     parser.add_argument(
+         "--device", type=str, default="cuda", help="Device (default: cuda)"
+     )
+     args = parser.parse_args()
+
+     # Resolve checkpoint path
+     if args.checkpoint is None:
+         args.checkpoint = str(_script_dir / "terminator.pt")
+
+     if not Path(args.checkpoint).exists():
+         print(f"ERROR: Checkpoint not found: {args.checkpoint}", file=sys.stderr)
+         sys.exit(1)
+
+     # Handle empty exit message
+     if args.exit_message == "":
+         args.exit_message = None
+
+     device = torch.device(args.device if torch.cuda.is_available() else "cpu")
+
+     # Load base model
+     model, tokenizer, think_id, think_end_id = load_model_and_tokenizer(
+         args.model, device
+     )
+
+     # Load terminator checkpoint
+     rms_eps = getattr(model.config, "rms_norm_eps", 1e-6)
+     ffn, ckpt_config, layer_idx, num_extra_layers, extra_sd = load_terminator_checkpoint(
+         args.checkpoint, rms_norm_eps=rms_eps, device=device
+     )
+     ffn_params = sum(p.numel() for p in ffn.parameters())
+     print(
+         f"Terminator FFN loaded (layer_idx={layer_idx}, "
+         f"threshold={args.threshold}, window={args.window_size}, "
+         f"params={ffn_params:,})"
+     )
+
+     # Extra layers
+     extra_layers = build_extra_layers(model, ckpt_config, extra_sd, device)
+
+     # Generate
+     generate_with_terminator(
+         prompt=args.prompt,
+         model=model,
+         tokenizer=tokenizer,
+         ffn=ffn,
+         extra_layers=extra_layers,
+         layer_idx=layer_idx,
+         think_token_id=think_id,
+         think_end_token_id=think_end_id,
+         threshold=args.threshold,
+         window_size=args.window_size,
+         exit_message=args.exit_message,
+         max_tokens=args.max_tokens,
+         temperature=args.temperature,
+         device=device,
+     )
+
+
+ if __name__ == "__main__":
+     main()
serve.py ADDED
@@ -0,0 +1,95 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ vLLM API server launcher for Qwen3TerminatorForCausalLM.
4
+
5
+ Imports vllm_terminator BEFORE vLLM initialises, which registers
6
+ Qwen3TerminatorForCausalLM with vLLM's ModelRegistry.
7
+
8
+ NOTE: Terminator currently supports single-GPU, single-sequence inference only.
9
+ Tensor parallelism and concurrent sequences are not supported.
10
+
11
+ Environment variables:
12
+ VLLM_MODEL — path to terminator model directory (required)
13
+ VLLM_PORT — port (default 8000)
14
+ VLLM_GPU_UTIL — GPU memory fraction (default 0.90)
15
+ VLLM_MAX_MODEL_LEN — max context length
16
+ VLLM_DTYPE — dtype (default "auto")
17
+ VLLM_API_KEY — require this API key from clients
18
+ VLLM_SERVED_NAME — override served model name
19
+ VLLM_HOST — bind address (default 0.0.0.0)
20
+ NO_PREFIX_CACHING — set to 1 to disable prefix caching
21
+ VLLM_ENFORCE_EAGER — set to 1 to disable CUDA graphs (default 0)
22
+ REASONING_PARSER — set to "qwen3" to enable <think>/</think> parsing
23
+ (splits reasoning_content from content in API responses)
24
+
25
+ Example:
26
+ VLLM_MODEL=./model_dir python serve.py
27
+ """
28
+
29
+ import os
30
+ import runpy
31
+ import sys
32
+
33
+ # -----------------------------------------------------------------------
34
+ # CRITICAL: import vllm_terminator HERE, before any vLLM code runs.
35
+ # This registers Qwen3TerminatorForCausalLM with vLLM's ModelRegistry.
36
+ # -----------------------------------------------------------------------
37
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
38
+ import vllm_terminator # noqa: F401 (registers the model as a side effect)
39
+
40
+
41
+ def env(name, default=None, required=False):
42
+ v = os.environ.get(name, default)
43
+ if required and (v is None or v == ""):
44
+ print(f"Missing required env var: {name}", file=sys.stderr)
45
+ sys.exit(2)
46
+ return v
47
+
48
+
49
+ def main():
50
+ model = env("VLLM_MODEL", required=True)
51
+ host = env("VLLM_HOST", "0.0.0.0")
52
+ port = env("VLLM_PORT", "8000")
53
+ max_len = env("VLLM_MAX_MODEL_LEN", None)
54
+ gpu_util = env("VLLM_GPU_UTIL", "0.90")
55
+ served_name = env("VLLM_SERVED_NAME", None)
56
+ dtype = env("VLLM_DTYPE", "auto")
57
+ api_key = env("VLLM_API_KEY", None)
58
+ no_prefix_caching = env("NO_PREFIX_CACHING", "0")
59
+ enforce_eager = env("VLLM_ENFORCE_EAGER", "0")
60
+ reasoning_parser = env("REASONING_PARSER", None)
61
+
62
+ argv = [
63
+ "vllm.entrypoints.openai.api_server",
64
+ "--model", model,
65
+ "--host", host,
66
+ "--port", str(port),
67
+ "--dtype", dtype,
68
+ "--gpu-memory-utilization", str(gpu_util),
69
+ "--tensor-parallel-size", "1",
70
+ "--max-num-seqs", "1",
71
+ ]
72
+
73
+ if served_name:
74
+ argv += ["--served-model-name", served_name]
75
+ if max_len:
76
+ argv += ["--max-model-len", str(max_len)]
77
+ if api_key:
78
+ argv += ["--api-key", api_key]
79
+ if no_prefix_caching == "1":
80
+ argv += ["--enable-prefix-caching", "False"]
81
+ if enforce_eager == "1":
82
+ argv += ["--enforce-eager"]
83
+ if reasoning_parser:
84
+ argv += ["--reasoning-parser", reasoning_parser]
85
+
86
+ print("Launching vLLM Terminator server with:\n " + " ".join(argv[1:]), flush=True)
87
+
88
+ # Replace sys.argv so vLLM's argparse sees these arguments, then run the
89
+ # server module in-process (so vllm_terminator registration persists).
90
+ sys.argv = argv
91
+ runpy.run_module("vllm.entrypoints.openai.api_server", run_name="__main__")
92
+
93
+
94
+ if __name__ == "__main__":
95
+ main()
setup.sh ADDED
@@ -0,0 +1,125 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ # ==========================================================================
5
+ # Terminator-Qwen3-14B — Automated Setup
6
+ #
7
+ # This script:
8
+ # 1. Creates a conda environment with Python 3.12
9
+ # 2. Installs uv, vllm, and openai
10
+ # 3. Downloads Qwen3-14B base model weights and creates the model directory
11
+ #
12
+ # Prerequisites:
13
+ # - NVIDIA GPU with sufficient VRAM (minimum ~40GB for Qwen3-14B)
14
+ # - CUDA drivers installed
15
+ # - conda or micromamba installed
16
+ #
17
+ # Usage:
18
+ # ./setup.sh
19
+ # ==========================================================================
20
+
21
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
22
+ ENV_NAME="${TERMINATOR_ENV_NAME:-terminator}"
23
+
24
+ echo ""
25
+ echo "====================================="
26
+ echo " Terminator-Qwen3-14B Setup"
27
+ echo "====================================="
28
+ echo ""
29
+
30
+ # ------------------------------------------------------------------
31
+ # Step 1: Create conda environment
32
+ # ------------------------------------------------------------------
33
+
34
+ # Detect conda or micromamba
35
+ if command -v micromamba &>/dev/null; then
36
+ CONDA_CMD="micromamba"
37
+ elif command -v conda &>/dev/null; then
38
+ CONDA_CMD="conda"
39
+ else
40
+ echo "ERROR: Neither conda nor micromamba found."
41
+ echo ""
42
+ echo "Install micromamba:"
43
+ echo ' "${SHELL}" <(curl -L micro.mamba.pm/install.sh)'
44
+ echo ""
45
+ echo "Or install conda:"
46
+ echo " https://docs.conda.io/en/latest/miniconda.html"
47
+ exit 1
48
+ fi
49
+
50
+ echo "[1/3] Setting up Python environment..."
51
+
52
+ # Check if environment already exists
53
+ if $CONDA_CMD env list 2>/dev/null | grep -q "^${ENV_NAME} \|/${ENV_NAME}\$"; then
54
+ echo " Environment '${ENV_NAME}' already exists. Activating..."
55
+ else
56
+ echo " Creating environment '${ENV_NAME}' with Python 3.12..."
57
+ $CONDA_CMD create -n "${ENV_NAME}" python=3.12 -y
58
+ fi
59
+
60
+ # Helper: run a command inside the conda/micromamba environment
61
+ run_in_env() {
62
+ if [ "$CONDA_CMD" = "micromamba" ]; then
63
+ micromamba run -n "${ENV_NAME}" "$@"
64
+ else
65
+ # For conda, activate in a subshell
66
+ (eval "$(conda shell.bash hook 2>/dev/null)" && conda activate "${ENV_NAME}" && "$@")
67
+ fi
68
+ }
69
+
70
+ echo " Python: $(run_in_env python --version)"
71
+
72
+ # ------------------------------------------------------------------
73
+ # Step 2: Install packages
74
+ # ------------------------------------------------------------------
75
+
76
+ echo ""
77
+ echo "[2/3] Installing packages..."
78
+
79
+ echo " Installing uv..."
80
+ run_in_env pip install --upgrade uv --quiet
81
+
82
+ echo " Installing vllm (this may take a few minutes)..."
83
+ run_in_env uv pip install vllm --torch-backend=auto
84
+
85
+ echo " Installing openai (for client)..."
86
+ run_in_env uv pip install openai
87
+
88
+ echo " Installing accelerate (for HF inference)..."
89
+ run_in_env uv pip install accelerate
90
+
91
+ echo " Done."
92
+
93
+ # ------------------------------------------------------------------
94
+ # Step 3: Set up model directory
95
+ # ------------------------------------------------------------------
96
+
97
+ echo ""
98
+ echo "[3/3] Setting up model directory..."
99
+ echo " This downloads Qwen3-14B base weights (~28GB) from HuggingFace."
100
+ echo " (Skipped if already cached.)"
101
+ echo ""
102
+
103
+ cd "$SCRIPT_DIR"
104
+ run_in_env python setup_model_dir.py
105
+
106
+ # ------------------------------------------------------------------
107
+ # Done
108
+ # ------------------------------------------------------------------
109
+
110
+ echo ""
111
+ echo "====================================="
112
+ echo " Setup Complete!"
113
+ echo "====================================="
114
+ echo ""
115
+ echo "To start the server:"
116
+ echo " $CONDA_CMD activate ${ENV_NAME}"
117
+ echo " cd $SCRIPT_DIR"
118
+ echo " ./start_server.sh"
119
+ echo ""
120
+ echo "Then in another terminal:"
121
+ echo " $CONDA_CMD activate ${ENV_NAME}"
122
+ echo " cd $SCRIPT_DIR"
123
+ echo " python client.py --interactive"
124
+ echo ""
125
+ echo "See README.md for configuration options (GPU memory, context length, etc.)"
setup_model_dir.py ADDED
@@ -0,0 +1,128 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Create a vLLM-ready model directory for Qwen3TerminatorForCausalLM.
4
+
5
+ Downloads the base Qwen3-14B config and weights from HuggingFace (if not
6
+ already cached), then creates a model directory with:
7
+ - config.json (Qwen3-14B base config + terminator fields)
8
+ - tokenizer files (symlinked from HF cache)
9
+ - model weights (symlinked from HF cache)
10
+
11
+ Usage:
12
+ # Default: uses ./terminator.pt checkpoint, creates ./model_dir
13
+ python setup_model_dir.py
14
+
15
+ # Custom paths and settings:
16
+ python setup_model_dir.py \\
17
+ --checkpoint /path/to/terminator.pt \\
18
+ --output-dir /path/to/model_dir \\
19
+ --threshold 0.5
20
+ """
21
+
22
+ import argparse
23
+ import os
24
+ import sys
25
+ from pathlib import Path
26
+
27
+ from huggingface_hub import snapshot_download
28
+ from transformers import AutoConfig
29
+
30
+
31
+ def main():
32
+ parser = argparse.ArgumentParser(
33
+ description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
34
+ )
35
+ parser.add_argument(
36
+ "--base-model", default="Qwen/Qwen3-14B",
37
+ help="HuggingFace model ID for the base model (default: Qwen/Qwen3-14B).",
38
+ )
39
+ parser.add_argument(
40
+ "--checkpoint", type=Path, default="./terminator.pt",
41
+ help="Path to trained terminator .pt checkpoint (default: ./terminator.pt).",
42
+ )
43
+ parser.add_argument(
44
+ "--output-dir", type=Path, default="./model_dir",
45
+ help="Destination directory (default: ./model_dir; created if missing).",
46
+ )
47
+ parser.add_argument(
48
+ "--threshold", type=float, default=0.7,
49
+ help="Terminator firing threshold (default 0.7).",
50
+ )
51
+ parser.add_argument(
52
+ "--window-size", type=int, default=10,
53
+ help="Sliding window size for majority vote (default 10).",
54
+ )
55
+ parser.add_argument(
56
+ "--exit-message", type=str,
57
+ default="\nI've run out of thinking tokens. I need to commit to a final answer.",
58
+ help="Message forced when terminator fires (default: standard exit message). "
59
+ "Set to empty string to disable.",
60
+ )
61
+ parser.add_argument(
62
+ "--no-download", action="store_true",
63
+ help="Fail if the base model is not already cached locally "
64
+ "(by default, downloads from HuggingFace if needed).",
65
+ )
66
+ parser.add_argument(
67
+ "--force", action="store_true",
68
+ help="Overwrite files in existing output directory.",
69
+ )
70
+ args = parser.parse_args()
71
+
72
+ checkpoint = args.checkpoint.resolve()
73
+ out_dir = args.output_dir.resolve()
74
+
75
+ if not checkpoint.is_file():
76
+ print(f"ERROR: checkpoint not found: {checkpoint}", file=sys.stderr)
77
+ sys.exit(1)
78
+
79
+ out_dir.mkdir(parents=True, exist_ok=True)
80
+
81
+ # --- Build patched config.json ---
82
+ print(f"Loading config for {args.base_model} from HF cache...")
83
+ config = AutoConfig.from_pretrained(args.base_model)
84
+
85
+ config.architectures = ["Qwen3TerminatorForCausalLM"]
86
+ config.terminator_checkpoint_path = str(checkpoint)
87
+ config.terminator_threshold = args.threshold
88
+ config.terminator_window_size = args.window_size
89
+ config.terminator_exit_message = args.exit_message
90
+
91
+ # Remove auto_map if present from an older span-predictor config
92
+ if hasattr(config, "auto_map"):
93
+ del config.auto_map
94
+
95
+ config.save_pretrained(out_dir)
96
+ print(f" Wrote config.json -> {out_dir / 'config.json'}")
97
+
98
+ # --- Symlink weights and tokenizer files from HF cache ---
99
+ print(f"Locating {args.base_model} in HF cache...")
100
+ allow_download = not args.no_download
101
+ base_dir = Path(snapshot_download(args.base_model, local_files_only=not allow_download))
102
+ print(f" Found: {base_dir}")
103
+
104
+ linked = 0
105
+ for src in sorted(base_dir.iterdir()):
106
+ if src.name in ("config.json",):
107
+ continue # we already wrote our own
108
+
109
+ dst = out_dir / src.name
110
+ if dst.exists() or dst.is_symlink():
111
+ if args.force:
112
+ dst.unlink()
113
+ else:
114
+ continue
115
+
116
+ os.symlink(src, dst)
117
+ print(f" Linked {src.name}")
118
+ linked += 1
119
+
120
+ print(f"\nDone. Linked {linked} files into {out_dir}")
121
+ print(f"\nTo start the server:")
122
+ print(f" ./start_server.sh")
123
+ print(f"\nOr manually:")
124
+ print(f" VLLM_MODEL={out_dir} REASONING_PARSER=qwen3 python serve.py")
125
+
126
+
127
+ if __name__ == "__main__":
128
+ main()
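For quick reference, the terminator-specific fields this script patches into config.json look roughly like the following (a sketch using the script's defaults; the checkpoint path is illustrative, and the real file also carries all base Qwen3-14B fields):

```python
# Sketch of the extra config.json fields written by setup_model_dir.py.
# Default values shown; "terminator_checkpoint_path" is illustrative.
terminator_fields = {
    "architectures": ["Qwen3TerminatorForCausalLM"],
    "terminator_checkpoint_path": "/abs/path/to/terminator.pt",
    "terminator_threshold": 0.7,    # --threshold
    "terminator_window_size": 10,   # --window-size
    "terminator_exit_message": (
        "\nI've run out of thinking tokens."
        " I need to commit to a final answer."
    ),
}
```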
start_server.sh ADDED
@@ -0,0 +1,48 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ # ==========================================================================
5
+ # Terminator-Qwen3-14B — Server Launcher
6
+ #
7
+ # Starts the vLLM server with the Terminator model.
8
+ # Run setup.sh first to create the model directory.
9
+ #
10
+ # Configuration (set as environment variables before running):
11
+ #
12
+ # VLLM_GPU_UTIL GPU memory fraction to use (default: 0.90)
13
+ #
14
+ # VLLM_MAX_MODEL_LEN Maximum context length in tokens (default: server picks)
15
+ #
16
+ # VLLM_PORT Server port (default: 8000)
17
+ #
18
+ # VLLM_ENFORCE_EAGER Set to 1 to disable CUDA graphs (default: 0)
19
+ # Use if you encounter CUDA graph compilation errors.
20
+ # NOTE: VLLM_ENFORCE_EAGER=1 will result in slower responses (CUDA graphs disabled)
21
+ #
22
+ # VLLM_API_KEY Require this API key from clients (default: none)
23
+ #
24
+ # Usage:
25
+ # ./start_server.sh
26
+ # or to manually override default environment variables:
27
+ # VLLM_GPU_UTIL=0.70 VLLM_MAX_MODEL_LEN=8192 ./start_server.sh
28
+ # ==========================================================================
29
+
30
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
31
+ MODEL_DIR="${SCRIPT_DIR}/model_dir"
32
+
33
+ if [ ! -d "$MODEL_DIR" ]; then
34
+ echo "ERROR: Model directory not found at: $MODEL_DIR" >&2
35
+ echo "" >&2
36
+ echo "Run setup first:" >&2
37
+ echo " ./setup.sh" >&2
38
+ echo "" >&2
39
+ echo "Or manually:" >&2
40
+ echo " python setup_model_dir.py" >&2
41
+ exit 1
42
+ fi
43
+
44
+ export VLLM_MODEL="$MODEL_DIR"
45
+ export REASONING_PARSER="${REASONING_PARSER:-qwen3}"
46
+ export VLLM_SERVED_NAME="${VLLM_SERVED_NAME:-Terminator-Qwen3-14B}"
47
+
48
+ exec python "$SCRIPT_DIR/serve.py"
terminator.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:94deb40f18fe3d3e642b0dd1b07d0833b1363d389c0ed01df2946b533b715c97
3
+ size 1321295486
vllm_terminator/__init__.py ADDED
@@ -0,0 +1,19 @@
1
+ """
2
+ vLLM plugin that registers the Qwen3 Terminator model.
3
+
4
+ Usage:
5
+ import vllm_terminator # registers the model architecture
6
+
7
+ Then set ``"architectures": ["Qwen3TerminatorForCausalLM"]`` in the
8
+ HuggingFace config.json alongside::
9
+
10
+ "terminator_checkpoint_path": "/path/to/layer_-1.pt",
11
+ "terminator_threshold": 0.7
12
+ """
13
+
14
+ from vllm import ModelRegistry
15
+
16
+ ModelRegistry.register_model(
17
+ "Qwen3TerminatorForCausalLM",
18
+ "vllm_terminator.model:Qwen3TerminatorForCausalLM",
19
+ )
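The registered model's firing rule — a majority vote over a sliding window of sigmoid predictions — can be sketched in isolation (assumed defaults: threshold 0.7, window size 10; the real logic lives in `Qwen3TerminatorForCausalLM.compute_logits()`):

```python
from collections import deque

def should_fire(history, threshold=0.7, window_size=10):
    """Return True when a majority of the last window_size predictions
    exceed the threshold -- mirrors the terminator's firing condition."""
    if len(history) < window_size:
        return False  # window not yet full
    n_above = sum(1 for p in history if p > threshold)
    return n_above / window_size > 0.5

history = deque(maxlen=10)
for p in [0.2] * 5 + [0.9] * 10:  # predictions climb above the threshold
    history.append(p)

print(should_fire(history))  # window now holds ten 0.9s -> fires
```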
vllm_terminator/model.py ADDED
@@ -0,0 +1,553 @@
1
+ """
2
+ vLLM-compatible Qwen3 model with Terminator FFN for early reasoning truncation.
3
+
4
+ The Terminator predicts when chain-of-thought reasoning has reached the final
5
+ answer. When a majority of recent predictions (sliding window) exceed a
6
+ threshold, the model forces generation of a configurable exit message followed
7
+ by </think> to truncate reasoning early. The exit message helps the model
8
+ transition smoothly from thinking to answering mode.
9
+
10
+ Supports optional extra transformer layers between the base model and the FFN
11
+ head. These layers get their own KV cache via vLLM's auto-discovery mechanism.
12
+
13
+ Constraints:
14
+ - layer_idx = -1 (last layer)
15
+ - sliding_window strategy (majority vote over last window_size predictions)
16
+ - lag_size = 0
17
+ - batch_size = 1
18
+ """
19
+
20
+ from collections import deque
21
+ from collections.abc import Iterable
22
+ from itertools import islice
23
+
24
+ import torch
25
+ from torch import nn
26
+ from transformers import AutoTokenizer
27
+
28
+ from vllm.config import VllmConfig
29
+ from vllm.distributed import get_pp_group
30
+ from vllm.logger import init_logger
31
+ from vllm.model_executor.layers.logits_processor import LogitsProcessor
32
+ from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
33
+ from vllm.model_executor.models.interfaces import SupportsPP
34
+ from vllm.model_executor.models.qwen2 import Qwen2Model
35
+ from vllm.model_executor.models.qwen3 import Qwen3DecoderLayer
36
+ from vllm.model_executor.model_loader.weight_utils import default_weight_loader
37
+ from vllm.model_executor.models.utils import (
38
+ AutoWeightsLoader,
39
+ PPMissingLayer,
40
+ maybe_prefix,
41
+ )
42
+ from vllm.sequence import IntermediateTensors
43
+
44
+ from .terminator_head import TerminatorFFN
45
+
46
+ logger = init_logger(__name__)
47
+
48
+
49
+ class Qwen3TerminatorModel(Qwen2Model):
50
+ """Qwen3 backbone that captures pre-norm hidden states for the Terminator FFN."""
51
+
52
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
53
+ super().__init__(
54
+ vllm_config=vllm_config,
55
+ prefix=prefix,
56
+ decoder_layer_type=Qwen3DecoderLayer,
57
+ )
58
+ self._pre_norm_hidden: torch.Tensor | None = None
59
+
60
+ def forward(
61
+ self,
62
+ input_ids: torch.Tensor,
63
+ positions: torch.Tensor,
64
+ intermediate_tensors: IntermediateTensors | None = None,
65
+ inputs_embeds: torch.Tensor | None = None,
66
+ ) -> torch.Tensor | IntermediateTensors:
67
+ if get_pp_group().is_first_rank:
68
+ if inputs_embeds is not None:
69
+ hidden_states = inputs_embeds
70
+ else:
71
+ hidden_states = self.embed_input_ids(input_ids)
72
+ residual = None
73
+ else:
74
+ assert intermediate_tensors is not None
75
+ hidden_states = intermediate_tensors["hidden_states"]
76
+ residual = intermediate_tensors["residual"]
77
+
78
+ aux_hidden_states = []
79
+ for idx, layer in enumerate(
80
+ islice(self.layers, self.start_layer, self.end_layer)
81
+ ):
82
+ if idx in self.aux_hidden_state_layers:
83
+ aux_hidden_states.append(hidden_states + residual)
84
+ hidden_states, residual = layer(positions, hidden_states, residual)
85
+
86
+ if not get_pp_group().is_last_rank:
87
+ return IntermediateTensors(
88
+ {"hidden_states": hidden_states, "residual": residual}
89
+ )
90
+
91
+ # Capture pre-norm hidden states (matches training's forward hook output).
92
+ # In the fused residual pattern, pre-norm state = hidden_states + residual.
93
+ self._pre_norm_hidden = hidden_states + residual
94
+
95
+ hidden_states, _ = self.norm(hidden_states, residual)
96
+
97
+ if len(aux_hidden_states) > 0:
98
+ return hidden_states, aux_hidden_states
99
+
100
+ return hidden_states
101
+
102
+
103
+ class Qwen3TerminatorForCausalLM(nn.Module, SupportsPP):
104
+ """
105
+ Qwen3 causal LM with an attached Terminator FFN that can force </think>
106
+ generation when the model predicts the final answer has been reached.
107
+ """
108
+
109
+ packed_modules_mapping = {
110
+ "qkv_proj": ["q_proj", "k_proj", "v_proj"],
111
+ "gate_up_proj": ["gate_proj", "up_proj"],
112
+ }
113
+
114
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
115
+ super().__init__()
116
+ config = vllm_config.model_config.hf_config
117
+ quant_config = vllm_config.quant_config
118
+
119
+ self.config = config
120
+ self.quant_config = quant_config
121
+
122
+ # --- Base Qwen3 model ---
123
+ self.model = Qwen3TerminatorModel(
124
+ vllm_config=vllm_config, prefix=maybe_prefix(prefix, "model")
125
+ )
126
+
127
+ if get_pp_group().is_last_rank:
128
+ if config.tie_word_embeddings:
129
+ self.lm_head = self.model.embed_tokens
130
+ else:
131
+ self.lm_head = ParallelLMHead(
132
+ config.vocab_size,
133
+ config.hidden_size,
134
+ quant_config=quant_config,
135
+ prefix=maybe_prefix(prefix, "lm_head"),
136
+ )
137
+ else:
138
+ self.lm_head = PPMissingLayer()
139
+
140
+ self.logits_processor = LogitsProcessor(config.vocab_size)
141
+ self.make_empty_intermediate_tensors = (
142
+ self.model.make_empty_intermediate_tensors
143
+ )
144
+
145
+ # --- Terminator FFN ---
146
+ terminator_checkpoint_path = getattr(
147
+ config, "terminator_checkpoint_path", None
148
+ )
149
+ self._terminator_threshold = getattr(
150
+ config, "terminator_threshold", 0.7
151
+ )
152
+ self._terminator_window_size = getattr(
153
+ config, "terminator_window_size", 10
154
+ )
155
+
156
+ if terminator_checkpoint_path:
157
+ self._terminator_checkpoint_path = terminator_checkpoint_path
158
+ self._terminator_enabled = True
159
+
160
+ # Load checkpoint metadata to construct FFN with correct architecture
161
+ checkpoint = torch.load(
162
+ terminator_checkpoint_path, map_location="cpu", weights_only=False
163
+ )
164
+ terminator_config = checkpoint["config"]
165
+ self._terminator_layer_idx = checkpoint["layer_idx"]
166
+ self._terminator_state_dict = checkpoint["state_dict"]
167
+
168
+ self.terminator_ffn = TerminatorFFN(
169
+ hidden_size=terminator_config["hidden_size"],
170
+ num_hidden_layers=terminator_config.get("ffn_layers", 1),
171
+ activation=terminator_config.get("ffn_activation", "gelu"),
172
+ intermediate_size=terminator_config.get(
173
+ "ffn_intermediate_size", None
174
+ ),
175
+ dropout=0.0,
176
+ rms_norm_eps=getattr(config, "rms_norm_eps", 1e-6),
177
+ )
178
+
179
+ logger.info(
180
+ "Terminator FFN created (layer_idx=%d, threshold=%.2f, "
181
+ "window_size=%d, params=%d)",
182
+ self._terminator_layer_idx,
183
+ self._terminator_threshold,
184
+ self._terminator_window_size,
185
+ sum(p.numel() for p in self.terminator_ffn.parameters()),
186
+ )
187
+
188
+ # --- Extra transformer layers ---
189
+ self._num_extra_layers = terminator_config.get(
190
+ "num_extra_layers", 0
191
+ )
192
+ self._extra_layers_state_dict = checkpoint.get(
193
+ "extra_layers_state_dict", None
194
+ )
195
+
196
+ if self._num_extra_layers > 0 and self._extra_layers_state_dict is not None:
197
+ cache_config = vllm_config.cache_config
198
+ # Use indices starting after base layers to avoid
199
+ # extract_layer_index() collisions in KV cache binding.
200
+ num_base_layers = config.num_hidden_layers
201
+ self.terminator_extra_layers = nn.ModuleList([
202
+ Qwen3DecoderLayer(
203
+ config=config,
204
+ cache_config=cache_config,
205
+ quant_config=quant_config,
206
+ prefix=f"terminator_extra_layers.{num_base_layers + i}",
207
+ )
208
+ for i in range(self._num_extra_layers)
209
+ ])
210
+ logger.info(
211
+ "Terminator extra layers created (n=%d, params=%d)",
212
+ self._num_extra_layers,
213
+ sum(
214
+ p.numel()
215
+ for p in self.terminator_extra_layers.parameters()
216
+ ),
217
+ )
218
+ else:
219
+ self._num_extra_layers = 0
220
+ self._extra_layers_state_dict = None
221
+ else:
222
+ self._terminator_enabled = False
223
+ self._num_extra_layers = 0
224
+ self._extra_layers_state_dict = None
225
+ logger.info(
226
+ "No terminator_checkpoint_path in config; "
227
+ "terminator disabled (running as standard Qwen3)"
228
+ )
229
+
230
+ # --- Think token IDs ---
231
+ tokenizer = AutoTokenizer.from_pretrained(
232
+ vllm_config.model_config.tokenizer,
233
+ trust_remote_code=vllm_config.model_config.trust_remote_code,
234
+ )
235
+ self._think_token_id = tokenizer.convert_tokens_to_ids("<think>")
236
+ self._think_end_token_id = tokenizer.convert_tokens_to_ids("</think>")
237
+ logger.info(
238
+ "<think>=%d, </think>=%d",
239
+ self._think_token_id,
240
+ self._think_end_token_id,
241
+ )
242
+
243
+ # --- Exit message ---
244
+ # Pre-tokenize the exit message + </think> so we can force one token
245
+ # per step when the terminator fires.
246
+ _default_exit_msg = (
247
+ "\nI've run out of thinking tokens."
248
+ " I need to commit to a final answer."
249
+ )
250
+ exit_msg = getattr(config, "terminator_exit_message", _default_exit_msg)
251
+ if exit_msg:
252
+ msg_ids = tokenizer.encode(exit_msg, add_special_tokens=False)
253
+ self._exit_sequence: list[int] = msg_ids + [self._think_end_token_id]
254
+ else:
255
+ self._exit_sequence = [self._think_end_token_id]
256
+ logger.info(
257
+ "Exit sequence: %d tokens (message=%r)",
258
+ len(self._exit_sequence),
259
+ exit_msg if exit_msg else "<none>",
260
+ )
261
+
262
+ # --- Per-request state (batch_size=1) ---
263
+ self._is_thinking = False
264
+ self._pred_buffer: torch.Tensor | None = None # lazily allocated in forward()
265
+ self._prev_output_token_id: int | None = None # argmax from previous compute_logits()
266
+ self._prediction_history: deque[float] = deque(
267
+ maxlen=self._terminator_window_size if self._terminator_enabled else 1
268
+ )
269
+ self._forcing_exit: bool = False
270
+ self._forcing_idx: int = 0
271
+
272
+ def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
273
+ return self.model.embed_input_ids(input_ids)
274
+
275
+ # ------------------------------------------------------------------
276
+ # Thinking state tracking
277
+ # ------------------------------------------------------------------
278
+
279
+ def _update_thinking_state(self, input_ids: torch.Tensor) -> None:
280
+ """Set ``_is_thinking`` based on the last think token in *input_ids*."""
281
+ think_pos = (input_ids == self._think_token_id).nonzero(as_tuple=True)[0]
282
+ end_pos = (input_ids == self._think_end_token_id).nonzero(as_tuple=True)[0]
283
+
284
+ last_think = think_pos[-1].item() if len(think_pos) > 0 else -1
285
+ last_end = end_pos[-1].item() if len(end_pos) > 0 else -1
286
+
287
+ if last_think > last_end:
288
+ self._is_thinking = True
289
+ elif last_end > last_think:
290
+ self._is_thinking = False
291
+
292
+ # ------------------------------------------------------------------
293
+ # Forward / compute_logits
294
+ # ------------------------------------------------------------------
295
+
296
+ def forward(
297
+ self,
298
+ input_ids: torch.Tensor,
299
+ positions: torch.Tensor,
300
+ intermediate_tensors: IntermediateTensors | None = None,
301
+ inputs_embeds: torch.Tensor | None = None,
302
+ ) -> torch.Tensor | IntermediateTensors:
303
+ hidden_states = self.model(
304
+ input_ids, positions, intermediate_tensors, inputs_embeds
305
+ )
306
+
307
+ if self._terminator_enabled:
308
+ # Update thinking state during eager execution only (prefill).
309
+ # .nonzero() / .item() are not CUDA-graph-capture-safe.
310
+ # Reset all per-request state so predictions, exit-forcing,
311
+ # and output token tracking don't leak between requests.
312
+ if not torch.cuda.is_current_stream_capturing():
313
+ self._update_thinking_state(input_ids)
314
+ self._prev_output_token_id = None
315
+ self._prediction_history.clear()
316
+ self._forcing_exit = False
317
+ self._forcing_idx = 0
318
+
319
+ # Run extra layers + FFN unconditionally — all ops (RMSNorm,
320
+ # Linear, Attention with KV cache, sigmoid) are CUDA-graph-safe.
321
+ # The prediction is written in-place to a pre-allocated buffer
322
+ # that persists across graph replays.
323
+ pre_norm = self.model._pre_norm_hidden
324
+
325
+ if self._num_extra_layers > 0:
326
+ # Run extra transformer layers on ALL tokens so KV cache
327
+ # is populated during prefill and updated during decode.
328
+ extra_hidden = pre_norm
329
+ extra_residual = None
330
+ for layer in self.terminator_extra_layers:
331
+ extra_hidden, extra_residual = layer(
332
+ positions, extra_hidden, extra_residual
333
+ )
334
+ # Reconstruct pre-norm state for the FFN:
335
+ # hidden + residual gives the un-normed output,
336
+ # matching what the FFN's own RMSNorm expects.
337
+ ffn_input = extra_hidden + extra_residual
338
+ else:
339
+ ffn_input = pre_norm
340
+
341
+ logit = self.terminator_ffn(ffn_input[-1:])
342
+ pred = torch.sigmoid(logit)
343
+ if self._pred_buffer is None:
344
+ # First eager forward (warmup) — allocate the buffer.
345
+ # Subsequent calls (including graph capture) reuse it.
346
+ self._pred_buffer = torch.zeros_like(pred)
347
+ self._pred_buffer.copy_(pred)
348
+
349
+ return hidden_states
350
+
351
+ def compute_logits(
352
+ self,
353
+ hidden_states: torch.Tensor,
354
+ ) -> torch.Tensor | None:
355
+ logits = self.logits_processor(self.lm_head, hidden_states)
356
+
357
+ # compute_logits() is called outside the CUDA graph, so .item()
358
+ # and data-dependent branching are safe here.
359
+ if self._terminator_enabled:
360
+ # --- Exit message forcing ---
361
+ # When the terminator has fired, we walk through the pre-tokenized
362
+ # exit sequence one token per step, skipping all prediction logic.
363
+ if self._forcing_exit:
364
+ token_id = self._exit_sequence[self._forcing_idx]
365
+ logits.fill_(float("-inf"))
366
+ logits[:, token_id] = 0.0
367
+ self._forcing_idx += 1
368
+ if self._forcing_idx >= len(self._exit_sequence):
369
+ # Done forcing — last token was </think>.
370
+ self._forcing_exit = False
371
+ self._is_thinking = False
372
+ logger.debug(
373
+ "Exit sequence complete (%d tokens forced)",
374
+ len(self._exit_sequence),
375
+ )
376
+ self._prev_output_token_id = token_id
377
+ return logits
378
+
379
+ # Track thinking state from the previous step's output token.
380
+ # During CUDA graph replay forward() doesn't execute, so
381
+ # _update_thinking_state() never sees generated <think> tokens.
382
+ # Instead we infer the state from the argmax of the previous
383
+ # step's logits (exact for greedy; heuristic for sampling).
384
+ if self._prev_output_token_id == self._think_token_id:
385
+ if not self._is_thinking:
386
+ self._prediction_history.clear()
387
+ self._is_thinking = True
388
+ elif self._prev_output_token_id == self._think_end_token_id:
389
+ self._is_thinking = False
390
+
391
+        if self._is_thinking and self._pred_buffer is not None:
+            pred = self._pred_buffer.item()
+            self._prediction_history.append(pred)
+
+            if len(self._prediction_history) >= self._terminator_window_size:
+                n_above = sum(
+                    1 for p in self._prediction_history
+                    if p > self._terminator_threshold
+                )
+                vote = n_above / self._terminator_window_size
+                if vote > 0.5:
+                    # Majority of sliding window exceeds threshold —
+                    # enter exit-message forcing mode.
+                    logger.debug(
+                        "Terminator FIRING: pred=%.3f, "
+                        "window=[%s] (%d/%d above %.2f, vote=%.2f)",
+                        pred,
+                        ", ".join(f"{p:.3f}" for p in self._prediction_history),
+                        n_above,
+                        self._terminator_window_size,
+                        self._terminator_threshold,
+                        vote,
+                    )
+                    # Force the first token of the exit sequence now,
+                    # and set up state so subsequent calls continue.
+                    self._forcing_exit = True
+                    self._forcing_idx = 0
+                    token_id = self._exit_sequence[0]
+                    logits.fill_(float("-inf"))
+                    logits[:, token_id] = 0.0
+                    self._forcing_idx = 1
+                    if self._forcing_idx >= len(self._exit_sequence):
+                        self._forcing_exit = False
+                        self._is_thinking = False
+                    self._prev_output_token_id = token_id
+                    return logits
+                else:
+                    logger.debug(
+                        "Terminator: pred=%.3f, "
+                        "window=[%s] (%d/%d above %.2f, vote=%.2f)",
+                        pred,
+                        ", ".join(f"{p:.3f}" for p in self._prediction_history),
+                        n_above,
+                        self._terminator_window_size,
+                        self._terminator_threshold,
+                        vote,
+                    )
+            else:
+                logger.debug(
+                    "Terminator: pred=%.3f, filling window (%d/%d)",
+                    pred,
+                    len(self._prediction_history),
+                    self._terminator_window_size,
+                )
+
+        # Record argmax for next step's thinking-state tracking.
+        self._prev_output_token_id = logits[0].argmax().item()
+
+        return logits
+
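The firing rule above is a sliding-window majority vote over recent terminator predictions. It can be sketched standalone with a bounded `deque`, which evicts the oldest prediction automatically (the `WINDOW` and `THRESHOLD` values here are arbitrary stand-ins for the server's configured `_terminator_window_size` and `_terminator_threshold`):

```python
from collections import deque

WINDOW, THRESHOLD = 4, 0.5  # illustrative values, not the shipped defaults

history = deque(maxlen=WINDOW)  # oldest prediction falls off automatically

def should_fire(pred):
    """Return True once a strict majority of the window exceeds THRESHOLD."""
    history.append(pred)
    if len(history) < WINDOW:
        return False  # still filling the window
    n_above = sum(1 for p in history if p > THRESHOLD)
    return n_above / WINDOW > 0.5

fired = [should_fire(p) for p in [0.1, 0.6, 0.7, 0.8, 0.9]]
print(fired)  # → [False, False, False, True, True]
```

The window both debounces single noisy predictions and delays firing until the signal has been sustained for several decode steps.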
+    # ------------------------------------------------------------------
+    # Weight loading
+    # ------------------------------------------------------------------
+
+    # Mapping from HF checkpoint weight names to vLLM fused names.
+    # Mirrors the stacked_params_mapping in Qwen2Model.load_weights().
+    _extra_layer_stacked_mapping = [
+        # (vllm_param, hf_name, shard_id)
+        ("qkv_proj", "q_proj", "q"),
+        ("qkv_proj", "k_proj", "k"),
+        ("qkv_proj", "v_proj", "v"),
+        ("gate_up_proj", "gate_proj", 0),
+        ("gate_up_proj", "up_proj", 1),
+    ]
+
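The effect of this mapping is a pure string transformation from HF checkpoint keys to vLLM's fused parameter names. A hypothetical standalone sketch (the real loader additionally routes each tensor to a shard of the fused weight; shard IDs are dropped here for brevity):

```python
# Stand-in for _extra_layer_stacked_mapping, shard IDs omitted.
STACKED = [
    ("qkv_proj", "q_proj"), ("qkv_proj", "k_proj"), ("qkv_proj", "v_proj"),
    ("gate_up_proj", "gate_proj"), ("gate_up_proj", "up_proj"),
]

def to_vllm_name(hf_name):
    """Map an HF checkpoint key onto the fused vLLM parameter name."""
    name = hf_name.replace("layers.", "terminator_extra_layers.", 1)
    for fused, part in STACKED:
        if part in name:
            return name.replace(part, fused)
    return name  # non-fused params (norms, o_proj, down_proj) pass through

print(to_vllm_name("layers.0.self_attn.q_proj.weight"))
# → terminator_extra_layers.0.self_attn.qkv_proj.weight
```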
+    def _load_extra_layers_weights(self, loaded: set[str]) -> None:
+        """Load extra transformer layer weights from the checkpoint.
+
+        The checkpoint stores HF-format keys
+        (``layers.0.self_attn.q_proj.weight``) which must be mapped to
+        vLLM's fused names (``terminator_extra_layers.0.self_attn.qkv_proj``).
+        This mirrors the ``stacked_params_mapping`` approach used by
+        ``Qwen2Model.load_weights()``.
+        """
+        if self._extra_layers_state_dict is None:
+            return
+
+        params_dict = dict(self.named_parameters(remove_duplicate=False))
+
+        for ckpt_name, tensor in self._extra_layers_state_dict.items():
+            # Remap checkpoint prefix to model module path.
+            name = ckpt_name.replace("layers.", "terminator_extra_layers.", 1)
+
+            if "rotary_emb.inv_freq" in name:
+                continue
+
+            # Check stacked (fused) projection mapping.
+            for param_name, weight_name, shard_id in self._extra_layer_stacked_mapping:
+                if weight_name not in name:
+                    continue
+                name = name.replace(weight_name, param_name)
+                if name.endswith(".bias") and name not in params_dict:
+                    continue
+                param = params_dict[name]
+                weight_loader = getattr(
+                    param, "weight_loader", default_weight_loader
+                )
+                if weight_loader == default_weight_loader:
+                    weight_loader(param, tensor)
+                else:
+                    weight_loader(param, tensor, shard_id)
+                loaded.add(name)
+                break
+            else:
+                # Direct (non-fused) parameter — norms, o_proj, down_proj.
+                if name.endswith(".bias") and name not in params_dict:
+                    continue
+                if name not in params_dict:
+                    logger.warning(
+                        "Skipping extra-layer weight %s (no matching param)",
+                        name,
+                    )
+                    continue
+                param = params_dict[name]
+                weight_loader = getattr(
+                    param, "weight_loader", default_weight_loader
+                )
+                weight_loader(param, tensor)
+                loaded.add(name)
+
+        del self._extra_layers_state_dict
+        self._extra_layers_state_dict = None
+        logger.info(
+            "Terminator extra layers weights loaded from checkpoint",
+        )
+
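The inner loop above uses Python's `for`/`else` clause: the `else` branch runs only when the loop completes without `break`, which is exactly the "no fused mapping matched, treat as a direct parameter" fallthrough. A tiny illustrative sketch of just that control flow (names hypothetical):

```python
def classify(name):
    """Label a weight name as 'fused' or 'direct', mirroring the
    break/else fallthrough in the loader above."""
    for frag in ("q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"):
        if frag in name:
            kind = "fused"
            break
    else:
        # No break fired: the whole mapping was exhausted without a match.
        kind = "direct"
    return kind

print(classify("layers.0.self_attn.q_proj.weight"))  # → fused
print(classify("layers.0.input_layernorm.weight"))   # → direct
```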
+    def load_weights(
+        self, weights: Iterable[tuple[str, torch.Tensor]]
+    ) -> set[str]:
+        skip = ["terminator_ffn.", "terminator_extra_layers."]
+        if self.config.tie_word_embeddings:
+            skip.append("lm_head.")
+
+        loader = AutoWeightsLoader(self, skip_prefixes=skip)
+        loaded = loader.load_weights(weights)
+
+        # Load terminator FFN and extra layers from the separate .pt
+        # checkpoint (not from the HF safetensors).
+        if self._terminator_enabled:
+            self.terminator_ffn.load_state_dict(self._terminator_state_dict)
+            del self._terminator_state_dict  # free memory
+            logger.info(
+                "Terminator FFN weights loaded from %s",
+                self._terminator_checkpoint_path,
+            )
+            # Tell vLLM these weights have been handled so it doesn't
+            # complain about uninitialized parameters.
+            for name in self.terminator_ffn.state_dict():
+                loaded.add(f"terminator_ffn.{name}")
+
+            self._load_extra_layers_weights(loaded)
+
+        return loaded
vllm_terminator/terminator_head.py ADDED
@@ -0,0 +1,135 @@
+"""
+Terminator FFN head for vLLM integration.
+
+Mirrors LayerFFN from terminator_utils.py but uses a standalone RMSNorm
+compatible with checkpoints trained using HuggingFace's Qwen2RMSNorm.
+"""
+
+from pathlib import Path
+from typing import Any, Dict, Optional, Tuple
+
+import torch
+import torch.nn as nn
+
+
+class SimpleRMSNorm(nn.Module):
+    """RMSNorm with a `weight` parameter matching Qwen2RMSNorm's state_dict."""
+
+    def __init__(self, hidden_size: int, eps: float = 1e-6):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+
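The normalization above divides each vector by its root mean square (plus a small epsilon for stability), so the output has unit RMS before the learned `weight` rescaling. A framework-free numerical check with arbitrary illustrative values:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Plain-Python mirror of SimpleRMSNorm.forward on a single vector."""
    variance = sum(v * v for v in x) / len(x)   # mean of squares
    scale = 1.0 / math.sqrt(variance + eps)     # torch.rsqrt equivalent
    return [w * v * scale for w, v in zip(weight, x)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
# The RMS of [3, 4] is sqrt(12.5); with unit weights the output's RMS is 1
# (up to the epsilon term).
rms_out = math.sqrt(sum(v * v for v in out) / len(out))
print(round(rms_out, 6))  # → 1.0
```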
+
+class TerminatorFFN(nn.Module):
+    """
+    Feed-forward network for per-position binary classification.
+
+    Architecture mirrors LayerFFN from terminator_utils.py:
+    - Pre-normalization with RMSNorm
+    - Linear projection(s) to scalar logit
+    - No sigmoid (outputs raw logits)
+
+    State dict keys match training checkpoints exactly:
+    - norm.weight
+    - network.weight, network.bias (1-layer case)
+    - network.0.weight, network.0.bias, ... (multi-layer case)
+    """
+
+    def __init__(
+        self,
+        hidden_size: int,
+        num_hidden_layers: int = 1,
+        activation: str = 'gelu',
+        intermediate_size: Optional[int] = None,
+        dropout: float = 0.0,
+        rms_norm_eps: float = 1e-6,
+    ):
+        super().__init__()
+
+        self.hidden_size = hidden_size
+        self.norm = SimpleRMSNorm(hidden_size, eps=rms_norm_eps)
+
+        if num_hidden_layers == 1:
+            self.network = nn.Linear(hidden_size, 1)
+        else:
+            if intermediate_size is None:
+                intermediate_size = hidden_size * 2
+
+            layers = []
+            layers.append(nn.Linear(hidden_size, intermediate_size))
+
+            act_fn = {'relu': nn.ReLU, 'gelu': nn.GELU, 'tanh': nn.Tanh}
+            if activation not in act_fn:
+                raise ValueError(f"Unknown activation: {activation}")
+
+            layers.append(act_fn[activation]())
+            layers.append(nn.Dropout(dropout))
+
+            for _ in range(num_hidden_layers - 2):
+                layers.append(nn.Linear(intermediate_size, intermediate_size))
+                layers.append(act_fn[activation]())
+                layers.append(nn.Dropout(dropout))
+
+            layers.append(nn.Linear(intermediate_size, 1))
+            self.network = nn.Sequential(*layers)
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            hidden_states: [num_tokens, hidden_size] or [batch, seq_len, hidden_size]
+
+        Returns:
+            logits: [num_tokens] or [batch, seq_len] raw logits
+        """
+        hidden_states = self.norm(hidden_states)
+        output = self.network(hidden_states)
+        return output.squeeze(-1)
+
+
+def load_terminator_checkpoint(
+    checkpoint_path: str,
+    rms_norm_eps: float = 1e-6,
+    device: torch.device = torch.device("cpu"),
+) -> Tuple[TerminatorFFN, Dict[str, Any], int, int, Optional[Dict[str, Any]]]:
+    """
+    Load a trained terminator checkpoint and construct the FFN.
+
+    Args:
+        checkpoint_path: Path to layer_*.pt checkpoint file
+        rms_norm_eps: Epsilon for RMSNorm (from base model config)
+        device: Device to load onto
+
+    Returns:
+        (ffn, config, layer_idx, num_extra_layers, extra_layers_state_dict)
+    """
+    checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
+
+    config = checkpoint["config"]
+    layer_idx = checkpoint["layer_idx"]
+    hidden_size = config["hidden_size"]
+
+    ffn = TerminatorFFN(
+        hidden_size=hidden_size,
+        num_hidden_layers=config.get("ffn_layers", 1),
+        activation=config.get("ffn_activation", "gelu"),
+        intermediate_size=config.get("ffn_intermediate_size", None),
+        dropout=0.0,  # No dropout at inference
+        rms_norm_eps=rms_norm_eps,
+    )
+
+    ffn.load_state_dict(checkpoint["state_dict"])
+    ffn.to(device)
+    ffn.eval()
+
+    num_extra_layers = config.get("num_extra_layers", 0)
+    extra_layers_state_dict = checkpoint.get("extra_layers_state_dict", None)
+
+    return ffn, config, layer_idx, num_extra_layers, extra_layers_state_dict