Initial upload: policy checkpoint, config, model card, source

Files changed (7)

README.md +98 -0
config.json +17 -0
policy_best.pt +3 -0
src/__init__.py +0 -0
src/grpo/__init__.py +0 -0
src/grpo/action_space.py +114 -0
src/grpo/policy.py +273 -0

README.md ADDED

	@@ -0,0 +1,98 @@

+---
+language:
+- en
+license: apache-2.0
+tags:
+- speculative-decoding
+- layer-skipping
+- grpo
+- clasp
+- llama
+- efficiency
+base_model: meta-llama/Meta-Llama-3-8B
+---
+# CASM — CLaSp Adaptive Skip Mask
+CASM is a lightweight GRPO-trained skip policy for self-speculative decoding with [Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B). It learns which transformer layers to bypass during the draft stage of speculative decoding, replacing the CLaSp dynamic-programming optimizer with a small neural policy that runs in microseconds.
+## How it works
+Each self-speculative decoding cycle runs the *same* frozen model in two modes (draft and verify) and then refreshes the skip mask:
+1. **Draft** — selected decoder layers are skipped, producing `K` candidate tokens cheaply.
+2. **Verify** — the full model validates the draft block and accepts the longest matching prefix.
+3. **Policy update** — CASM observes the verify hidden states and chooses a new skip mask for the next cycle.
+CASM replaces step 3's DP solver with a small policy network (~0.9 M parameters: a linear projection of the hidden states followed by a 2-layer Transformer encoder) that maps per-layer hidden states to a skip-mask distribution. It is trained end-to-end with GRPO against a reward combining token-acceptance rate, decoding speed, and mismatch regularization.
+## Architecture
+| Component | Description |
+|---|---|
+| `HiddenStateProjector` | Projects per-layer hidden states `[L, d_model]` → `[L, 128]` |
+| `ScalarFeatureEmbedder` | Embeds 5 scalar context features (acceptance rate, latency, position, mask age, temperature) |
+| `PolicyEncoder` | 2-layer Transformer encoder over layer positions |
+| `logit_head` | Per-layer skip logits → top-M selection |
+| `AcceptanceRateHead` | Predicts E[τ/K] → optimal draft length K* |
+**Parameters:** ~0.9 M (about 0.52 M in the `4096 → 128` projection and 0.40 M in the encoder, counted from the layer shapes in `src/grpo/policy.py`)
+**Base model:** `meta-llama/Meta-Llama-3-8B` (32 layers, hidden_dim=4096)
+**Skip budget:** 8 layers per draft cycle
+## Usage
+```python
+import torch
+from src.grpo.policy import SkipPolicy
+# Load policy
+ckpt = torch.load("policy_best.pt", map_location="cpu")
+policy = SkipPolicy(
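+    hidden_dim=4096, n_layers=32, n_skip=8,
+    policy_dim=128, context_tokens=1,
+    # keep_prefix / keep_suffix default to 2 in SkipPolicy; config.json uses 0.
+    keep_prefix=0,
+    keep_suffix=0,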
+)
+policy.load_state_dict(ckpt["policy_state_dict"])
+policy.eval()
+# During self-speculative decoding, call after each verify pass:
+# hidden_states: tuple of (L+1) tensors from model output_hidden_states=True
+mask, draft_len = policy.greedy_mask(
+    hidden_states,
+    last_tau=accepted_tokens,
+    draft_len=current_draft_len,
+    position=current_position,
+    max_len=max_new_tokens,
+)
+# mask: list of 0/1 per layer (1 = skip this layer during draft)
+# draft_len: recommended tokens to draft next cycle
+```
+See [grpo-clasp](https://github.com/dayne-2stacks/grpo-clasp) for the full training and evaluation codebase.
+## Training
+Trained with GRPO on SpecBench-style prompts using `meta-llama/Meta-Llama-3-8B` on a single A100 80 GB for 10 000 steps. Imitation warm-start from CLaSp DP masks was used for the first ~1000 steps.
+| Metric | Value |
+|---|---|
+| Training steps | 10 000 |
+| Eval reward | 99.8 |
+| Test reward | 100.4 |
+| GPU | NVIDIA A100 80 GB |
+## Citation
+If you use CASM, please cite the CLaSp paper and this repository:
+```bibtex
+@misc{casm2026,
+  author = {Dayne Guy},
+  title  = {CASM: CLaSp Adaptive Skip Mask},
+  year   = {2026},
+  url    = {https://huggingface.co/dayngerous/CASM}
+}
+```
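
The verify step in the README's "How it works" list accepts the longest matching prefix of the draft block. A minimal, dependency-free sketch of that rule follows, assuming greedy verification over token ids. `accepted_prefix_len` and `acceptance_rate` are hypothetical helper names used for illustration; they are not part of `src/grpo`.

```python
from typing import Sequence


def accepted_prefix_len(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count leading draft tokens that agree with the full model's tokens."""
    tau = 0
    for draft_token, verified_token in zip(draft, verified):
        if draft_token != verified_token:
            break  # everything after the first mismatch is discarded
        tau += 1
    return tau


def acceptance_rate(draft: Sequence[int], verified: Sequence[int]) -> float:
    """tau / K for one cycle, guarding against an empty draft block."""
    return accepted_prefix_len(draft, verified) / max(len(draft), 1)


# Toy token ids: the fourth draft token disagrees with the full model.
draft = [11, 42, 7, 99, 3]
verified = [11, 42, 7, 15, 3]
print(accepted_prefix_len(draft, verified), acceptance_rate(draft, verified))
```

The ratio `tau / K` computed here corresponds to the `last_tau_norm` feature (`last_tau / draft_len`) that `ScalarFeatureEmbedder` feeds to the policy.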

config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "model_type": "casm_skip_policy",
+  "hidden_dim": 4096,
+  "n_layers": 32,
+  "n_skip": 8,
+  "policy_dim": 128,
+  "n_heads": 4,
+  "n_encoder_layers": 2,
+  "keep_prefix": 0,
+  "keep_suffix": 0,
+  "context_tokens": 1,
+  "draft_len_choices": [4, 8, 12, 16, 24, 32, 48, 64],
+  "base_model": "meta-llama/Meta-Llama-3-8B",
+  "training_steps": 10000,
+  "eval_reward": 99.773,
+  "test_reward": 100.429
+}
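
The config values above are enough for a back-of-the-envelope parameter count. The sketch below assumes the standard `nn.TransformerEncoderLayer` parameter layout (fused q/k/v projection, output projection, two feed-forward layers, two LayerNorms) with `dim_feedforward = 4 * policy_dim`, and five scalar features, as in `src/grpo/policy.py`. The helper names are hypothetical.

```python
from typing import Dict


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights (plus optional bias) of a dense layer n_in -> n_out."""
    return n_in * n_out + (n_out if bias else 0)


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """One Transformer encoder layer: attention, feed-forward, two LayerNorms."""
    attention = linear_params(d_model, 3 * d_model) + linear_params(d_model, d_model)
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # weight and bias for each of the two norms
    return attention + feed_forward + layer_norms


def policy_params(config: Dict[str, int], n_scalar_features: int = 5) -> Dict[str, int]:
    """Per-component parameter counts for the skip policy (assumed layout)."""
    d = config["policy_dim"]
    counts = {
        "projector": linear_params(config["hidden_dim"], d, bias=False),
        "scalar_embedder": linear_params(n_scalar_features, d, bias=False),
        "encoder": config["n_encoder_layers"] * encoder_layer_params(d, 4 * d),
        "heads": 2 * linear_params(d, 1),  # logit_head and acceptance head
    }
    counts["total"] = sum(counts.values())
    return counts


config = {"hidden_dim": 4096, "policy_dim": 128, "n_encoder_layers": 2}
for name, n in policy_params(config).items():
    print(f"{name:>16}: {n:,}")
```

Under these assumptions the total is 921,730 parameters. At 4 bytes per float32 value that is about 3.69 MB, close to the 3,698,949-byte size recorded in the `policy_best.pt` LFS pointer.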

policy_best.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:762abdcb49e258e4bf2c29402429722f8e6cd0f09344e38cf137575946977783
+size 3698949

src/__init__.py ADDED

File without changes

src/grpo/__init__.py ADDED

File without changes

src/grpo/action_space.py ADDED

	@@ -0,0 +1,114 @@

+"""Action space definitions for the GRPO skip policy.
+The action is a binary skip mask S ∈ {0,1}^L.
+This module provides samplers and constraint enforcement for different
+action space parameterizations.
+"""
+from typing import List, Optional, Tuple
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+class TopMActionSampler(nn.Module):
+    """Fixed-budget action sampler: select exactly M layers to skip.
+    Policy outputs per-layer logits; the top-M scoring eligible layers are skipped.
+    During training, uses a straight-through Gumbel-top-K estimator for gradients.
+    During evaluation / rollouts, uses deterministic top-M argmax.
+    Args:
+        n_layers: total number of transformer layers.
+        n_skip: skip budget M.
+        keep_prefix: number of layers at start that cannot be skipped.
+        keep_suffix: number of layers at end that cannot be skipped.
+    """
+    def __init__(
+        self,
+        n_layers: int,
+        n_skip: int,
+        keep_prefix: int = 2,
+        keep_suffix: int = 2,
+    ):
+        super().__init__()
+        self.n_layers = n_layers
+        self.n_skip = n_skip
+        self.keep_prefix = keep_prefix
+        self.keep_suffix = keep_suffix
+        # Mask for eligible layers
+        eligible = torch.zeros(n_layers, dtype=torch.bool)
+        eligible[keep_prefix : n_layers - keep_suffix] = True
+        self.register_buffer("eligible", eligible)
+    def forward(self, logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
+        """Sample a hard skip mask from policy logits, with straight-through gradients.
+        Args:
+            logits: [n_layers] raw skip logits from policy network.
+            temperature: sharpens the sigmoid surrogate used in the backward pass. For any
+                temperature > 0 the top-k of (logits + gumbel) / temperature is unchanged.
+        Returns:
+            hard_mask: [n_layers] binary tensor (differentiable via straight-through).
+        """
+        # Exclude ineligible layers by setting their logits to -inf
+        masked_logits = logits.clone()
+        masked_logits[~self.eligible] = float("-inf")
+        scale = masked_logits[self.eligible].std().detach().clamp(min=1.0)
+        masked_logits = masked_logits / scale
+        # Gumbel-top-K for differentiable discrete selection
+        gumbel = -torch.log(-torch.log(torch.clamp(torch.rand_like(masked_logits), 1e-9, 1.0)))
+        perturbed = (masked_logits + gumbel) / temperature
+        # Select top n_skip eligible indices
+        topk_vals, topk_idx = torch.topk(perturbed, self.n_skip)
+        hard_mask = torch.zeros(self.n_layers, device=logits.device)
+        hard_mask.scatter_(0, topk_idx, 1.0)
+        # Straight-through estimator: use hard mask in forward, soft mask in backward
+        soft_mask = torch.sigmoid(masked_logits / temperature)
+        return hard_mask + (soft_mask - soft_mask.detach())
+    def greedy_mask(self, logits: torch.Tensor) -> List[int]:
+        """Deterministic top-M mask for inference."""
+        masked_logits = logits.clone()
+        masked_logits[~self.eligible] = float("-inf")
+        _, topk_idx = torch.topk(masked_logits, self.n_skip)
+        mask = torch.zeros(self.n_layers, dtype=torch.long)
+        mask[topk_idx] = 1
+        return mask.tolist()
+    def log_prob(self, logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+        """Log-probability of a discrete mask under the Plackett-Luce model.
+        Each selected layer is drawn from a categorical over the remaining eligible
+        layers (sequential conditioning, as in Gumbel-top-K sampling). Layers are taken in
+        ascending index order, so this scores that one ordering, not the unordered set.
+        """
+        # Apply the same unit-std normalization used in forward() so that log_p_old
+        # (computed at rollout time) and log_p_new (computed during the PPO update)
+        # are on the same scale and the ratio exp(log_p_new - log_p_old) is correct.
+        # Clamp after normalizing for numerical safety in log_softmax.
+        eligible_logits = logits[self.eligible]
+        scale = eligible_logits.std().detach().clamp(min=1.0)
+        eligible_logits = (eligible_logits / scale).clamp(-50.0, 50.0)
+        selected_indices = mask[self.eligible].bool().nonzero(as_tuple=True)[0]
+        log_p = logits.new_zeros(())
+        # Use a bool exclusion mask (no grad) instead of in-place modification of
+        # eligible_logits (which has requires_grad=True during _update_policy).
+        # In-place ops on a grad tensor mid-loop corrupt autograd's version counter
+        # and can silently produce NaN gradients.
+        exclusion = torch.zeros(eligible_logits.shape[0], dtype=torch.bool,
+                                device=logits.device)
+        for idx in selected_indices:
+            masked = eligible_logits.masked_fill(exclusion, float("-inf"))
+            log_p = log_p + F.log_softmax(masked, dim=0)[idx]
+            exclusion = exclusion.clone()
+            exclusion[idx] = True
+        return log_p
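
The sequential conditioning used by `log_prob` can be illustrated without torch. The sketch below is a plain-Python Plackett-Luce calculation on made-up logits; `pl_log_prob` and `set_prob` are hypothetical names, and the std-normalization and clamping from `TopMActionSampler` are deliberately left out. It shows that the score depends on the order in which the selected layers are conditioned on, and that summing over all orderings gives the probability of the unordered set.

```python
import math
from itertools import combinations, permutations
from typing import Sequence


def pl_log_prob(logits: Sequence[float], order: Sequence[int]) -> float:
    """Log-probability of drawing `order` one index at a time without replacement."""
    remaining = list(range(len(logits)))
    log_p = 0.0
    for idx in order:
        # Softmax over the indices that have not been selected yet.
        log_z = math.log(sum(math.exp(logits[j]) for j in remaining))
        log_p += logits[idx] - log_z
        remaining.remove(idx)
    return log_p


def set_prob(logits: Sequence[float], chosen: Sequence[int]) -> float:
    """Probability of the unordered set: sum over every ordering of `chosen`."""
    return sum(math.exp(pl_log_prob(logits, perm)) for perm in permutations(chosen))


logits = [2.0, 0.0, -1.0, 0.5]  # illustrative values, not taken from the checkpoint
print(pl_log_prob(logits, (0, 1)), pl_log_prob(logits, (1, 0)))
print(sum(set_prob(logits, pair) for pair in combinations(range(4), 2)))
```

The two orderings of the same pair score differently, while the set probabilities over all pairs sum to one. A log-probability computed for a single fixed ordering therefore differs from the set log-probability by an ordering-dependent term.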

src/grpo/policy.py ADDED

	@@ -0,0 +1,273 @@

+"""GRPO skip policy network.
+Architecture:
+  Input: per-layer hidden state projections z_l ∈ R^{d'} for l=0..L-1
+         + scalar context features (last_tau, last_ms, position, age, ...)
+  Encoder: 2-layer Transformer encoder over layer index l (treating l as seq pos)
+  Output: skip logits u ∈ R^L (one scalar u_l per layer)
+          + scalar p̂ ∈ (0,1) predicting E[τ/K] (acceptance rate)
+  Action: top-M selection via TopMActionSampler; K derived from p̂ analytically
+The policy is lightweight by design — it should be orders of magnitude smaller
+than the verify model to keep training cost negligible.
+"""
+from typing import List, Optional, Tuple
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .action_space import TopMActionSampler
+_DEFAULT_DRAFT_LEN_CHOICES = [4, 8, 12, 16, 24, 32, 48, 64]
+def optimal_draft_len(p_hat: float, choices: List[int]) -> int:
+    """Return the K from choices closest to the heuristic target K* = 1/(1−p̂).
+    Intuition: under a geometric acceptance model, 1/(1−p) is the expected
+    position of the first rejected draft token. Beyond this point, extra draft
+    tokens are increasingly likely to be rejected.
+    Args:
+        p_hat: predicted per-token acceptance probability in (0, 1).
+        choices: discrete candidate K values (must be non-empty).
+    """
+    p_hat = max(0.01, min(p_hat, 0.99))
+    k_natural = 1.0 / (1.0 - p_hat)
+    return min(choices, key=lambda k: abs(k - k_natural))
+class HiddenStateProjector(nn.Module):
+    """Project per-layer hidden states from d → d_policy.
+    Input:  tuple of [1, seq_len, d] tensors (one per layer, L+1 total)
+    Output: [L, d_policy] (one projected vector per transformer layer)
+    When context_tokens > 1, the last K token positions (K = context_tokens here, not the
+    draft length) are mean-pooled per layer before projection, giving the policy a richer
+    view of recent context. If the sequence is shorter than K, all available tokens are used.
+    """
+    def __init__(
+        self,
+        hidden_dim: int,
+        policy_dim: int,
+        n_layers: int,
+        context_tokens: int = 1,
+    ):
+        super().__init__()
+        self.n_layers = n_layers
+        self.context_tokens = context_tokens
+        self.proj = nn.Linear(hidden_dim, policy_dim, bias=False)
+    def forward(
+        self, hidden_states: Tuple[torch.Tensor, ...]
+    ) -> torch.Tensor:
+        """Extract last K-token mean hidden state from each layer, project, return [L, d_p]."""
+        # hidden_states has L+1 entries: [embed, layer_0_out, ..., layer_{L-1}_out]
+        # We use layers 1..L (i.e., skip the embedding output at index 0)
+        layer_hs = [
+            hs[0, -self.context_tokens:, :].mean(dim=0)   # [d]  (mean over last K tokens)
+            for hs in hidden_states[1:self.n_layers + 1]
+        ]
+        stacked = torch.stack(layer_hs, dim=0)   # [L, d]
+        return self.proj(stacked.float())         # [L, d_policy]
+class PolicyEncoder(nn.Module):
+    """2-layer Transformer encoder over layer indices (treating L layers as seq).
+    Input:  [L, d_policy], with the scalar-feature embedding already added to each position
+    Output: [L, d_policy]
+    """
+    def __init__(self, d_policy: int, n_heads: int = 4, n_encoder_layers: int = 2):
+        super().__init__()
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=d_policy,
+            nhead=n_heads,
+            dim_feedforward=d_policy * 4,
+            dropout=0.0,
+            batch_first=True,
+            norm_first=True,
+        )
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_encoder_layers)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """x: [L, d_policy] → [L, d_policy]"""
+        return self.encoder(x.unsqueeze(0)).squeeze(0)   # add/remove batch dim
+class ScalarFeatureEmbedder(nn.Module):
+    """Embeds scalar context features and adds them to each layer position."""
+    FEATURE_NAMES = [
+        "last_tau_norm",       # last_tau / draft_len (acceptance rate)
+        "latency_norm",        # last cycle ms / 1000 (rough normalization)
+        "position_norm",       # current token position / max_len
+        "age_norm",            # mask age / update_interval
+        "temperature",         # generation temperature (fixed per run)
+    ]
+    N_FEATURES = len(FEATURE_NAMES)
+    def __init__(self, d_policy: int):
+        super().__init__()
+        self.embed = nn.Linear(self.N_FEATURES, d_policy, bias=False)
+    def forward(
+        self,
+        last_tau: int,
+        draft_len: int,
+        last_ms: float,
+        position: int,
+        max_len: int,
+        age: int,
+        update_interval: int,
+        temperature: float,
+    ) -> torch.Tensor:
+        """Return [d_policy] scalar feature embedding."""
+        feats = torch.tensor([
+            last_tau / max(draft_len, 1),
+            last_ms / 1000.0,
+            position / max(max_len, 1),
+            age / max(update_interval, 1),
+            temperature,
+        ], dtype=torch.float32, device=self.embed.weight.device)
+        return self.embed(feats)
+class AcceptanceRateHead(nn.Module):
+    """Predicts E[τ/K] from mean-pooled encoder output via scalar regression.
+    Trained with MSE against observed τ/K each rollout — no policy gradient needed.
+    The prediction p̂ is used to derive optimal draft length K* analytically via
+    optimal_draft_len().
+    """
+    def __init__(self, d_policy: int):
+        super().__init__()
+        self.head = nn.Linear(d_policy, 1, bias=True)
+    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
+        """encoded: [L, d_policy] → scalar p̂ ∈ (0, 1)"""
+        pooled = encoded.mean(dim=0)                     # [d_policy]
+        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # scalar
+class SkipPolicy(nn.Module):
+    """Full skip policy: projects hidden states → encodes → outputs skip logits
+    and a predicted acceptance rate p̂.
+    Usage::
+        policy = SkipPolicy(hidden_dim=4096, n_layers=32, n_skip=16, policy_dim=128)
+        skip_logits, p_hat = policy(hidden_states, last_tau=8, ...)
+        mask, draft_len = policy.greedy_mask(hidden_states, ...)
+    """
+    def __init__(
+        self,
+        hidden_dim: int,
+        n_layers: int,
+        n_skip: int,
+        policy_dim: int = 128,
+        n_heads: int = 4,
+        n_encoder_layers: int = 2,
+        keep_prefix: int = 2,
+        keep_suffix: int = 2,
+        draft_len_choices: Optional[List[int]] = None,
+        context_tokens: int = 1,
+    ):
+        super().__init__()
+        self.n_layers = n_layers
+        self.n_skip = n_skip
+        self.draft_len_choices = (
+            draft_len_choices if draft_len_choices is not None
+            else _DEFAULT_DRAFT_LEN_CHOICES
+        )
+        self.projector = HiddenStateProjector(hidden_dim, policy_dim, n_layers, context_tokens)
+        self.scalar_embedder = ScalarFeatureEmbedder(policy_dim)
+        self.encoder = PolicyEncoder(policy_dim, n_heads, n_encoder_layers)
+        self.logit_head = nn.Linear(policy_dim, 1, bias=True)
+        self.sampler = TopMActionSampler(n_layers, n_skip, keep_prefix, keep_suffix)
+        self.acceptance_head = AcceptanceRateHead(policy_dim)
+    def forward(
+        self,
+        hidden_states: Tuple[torch.Tensor, ...],
+        last_tau: int = 0,
+        draft_len: int = 16,
+        last_ms: float = 0.0,
+        position: int = 0,
+        max_len: int = 256,
+        age: int = 0,
+        update_interval: int = 1,
+        temperature: float = 0.0,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute per-layer skip logits and predicted acceptance rate.
+        Returns:
+            skip_logits:  [n_layers]
+            p_hat:        scalar ∈ (0, 1), predicted E[τ/K]
+        """
+        z = self.projector(hidden_states)                     # [L, d_policy]
+        scalar_emb = self.scalar_embedder(
+            last_tau, draft_len, last_ms, position, max_len,
+            age, update_interval, temperature,
+        )                                                      # [d_policy]
+        z = z + scalar_emb.unsqueeze(0)                       # [L, d_policy]
+        encoded = self.encoder(z)                              # [L, d_policy]
+        skip_logits = self.logit_head(encoded).squeeze(-1)    # [L]
+        p_hat = self.acceptance_head(encoded)                 # scalar
+        return skip_logits, p_hat
+    def sample_mask(
+        self,
+        hidden_states: Tuple[torch.Tensor, ...],
+        temperature: float = 1.0,
+        **kwargs,
+    ) -> Tuple[List[int], int, torch.Tensor]:
+        """Sample a skip mask and derive draft length from predicted acceptance rate.
+        draft_len is selected deterministically via optimal_draft_len(p̂) (no RL); log_p covers only the skip mask action.
+        `temperature` is the sampler temperature; it is not forwarded to forward(), whose temperature feature stays at 0.0.
+        Returns:
+            hard_mask:   List[int] of length n_layers
+            draft_len:   int, derived from p̂
+            log_p:       scalar tensor, log π(mask | h)
+        """
+        skip_logits, p_hat = self.forward(hidden_states, **kwargs)
+        soft_mask = self.sampler(skip_logits, temperature=temperature)
+        hard_mask = (soft_mask.detach() > 0.5).long().tolist()
+        log_p = self.sampler.log_prob(skip_logits, soft_mask.detach())
+        draft_len = optimal_draft_len(p_hat.detach().item(), self.draft_len_choices)
+        return hard_mask, draft_len, log_p
+    def greedy_mask(
+        self,
+        hidden_states: Tuple[torch.Tensor, ...],
+        **kwargs,
+    ) -> Tuple[List[int], int]:
+        """Deterministic greedy mask and draft length for evaluation."""
+        with torch.no_grad():
+            skip_logits, p_hat = self.forward(hidden_states, **kwargs)
+        mask = self.sampler.greedy_mask(skip_logits)
+        draft_len = optimal_draft_len(p_hat.item(), self.draft_len_choices)
+        return mask, draft_len
+    def compile_for_inference(self) -> None:
+        """Replace forward with a torch.compile'd version for faster inference.
+        Call once after policy.eval() and before the generation loop.
+        Use fullgraph=False to tolerate the torch.tensor() call inside
+        ScalarFeatureEmbedder without needing to refactor it.
+        """
+        self.forward = torch.compile(self.forward, mode="max-autotune", fullgraph=False)
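
To make the draft-length rule in `optimal_draft_len` concrete, the sketch below reproduces it in plain Python and evaluates it on a few acceptance rates, using the `draft_len_choices` from `config.json`. `expected_accepted` is a hypothetical helper that assumes each draft token is accepted independently with probability `p`; that independence is a modelling assumption, not something the repository states.

```python
from typing import List

DRAFT_LEN_CHOICES = [4, 8, 12, 16, 24, 32, 48, 64]  # draft_len_choices in config.json


def optimal_draft_len(p_hat: float, choices: List[int]) -> int:
    """Same rule as policy.py: clamp p_hat, then pick the choice nearest 1/(1 - p_hat)."""
    p_hat = max(0.01, min(p_hat, 0.99))
    k_natural = 1.0 / (1.0 - p_hat)
    return min(choices, key=lambda k: abs(k - k_natural))


def expected_accepted(p: float, k: int) -> float:
    """E[accepted tokens] for k drafts when token i survives with probability p**i."""
    return sum(p ** i for i in range(1, k + 1))


for p_hat in (0.5, 0.8, 0.93, 0.97, 1.0):
    k = optimal_draft_len(p_hat, DRAFT_LEN_CHOICES)
    gain = expected_accepted(min(p_hat, 0.99), k)
    print(f"p_hat={p_hat:.2f} -> K={k}, expected accepted tokens ~ {gain:.2f}")
```

With these choices, any `p_hat` below roughly 0.83 maps to the smallest option `K = 4`, and the clamp at 0.99 caps the target near 100, so the largest option `K = 64` is selected only for very high predicted acceptance.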