Ram07 committed
Commit d0f40b0 · verified · 1 Parent(s): 160bf89

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,161 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ pipeline_tag: text-generation
6
+ tags:
7
+ - bitnet
8
+ - quantization
9
+ - early-exit
10
+ - layer-skipping
11
+ - efficient-transformers
12
+ datasets:
13
+ - roneneldan/TinyStories
14
+ ---
15
+
16
+ # bitskip-v2-earlyexit
17
+
18
+ BitSkip v2 with 4-bit activation quantization, ternary weights, and a Hadamard transform.
19
+
20
+ ## Model Description
21
+
22
+ This model implements a 24-layer transformer with early exit loss and quadratic layer dropout for efficient inference. It was trained on the TinyStories dataset with layer-wise auxiliary supervision to enable flexible speed-quality tradeoffs during inference.
23
+
24
+ ## Architecture Details
25
+
26
+ - **Layers**: 24
27
+ - **Hidden dimension**: 2048
28
+ - **Attention heads**: 32 (64-dimensional each)
29
+ - **Key-Value heads**: 8 (Grouped Query Attention with 4:1 ratio)
30
+ - **FFN intermediate size**: 4096
31
+ - **Position embeddings**: Rotary Position Embeddings (RoPE)
32
+ - **Normalization**: RMSNorm
33
+ - **Activation**: SwiGLU (for MLP)
34
+ - **Parameters**: ~1.06B
35
+
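+ The values above mirror the fields in this repo's `config.json`. A quick way to confirm them (`trust_remote_code=True` is required because the config and model classes ship with the repo; see `auto_map` in `config.json`):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("your-username/bitskip-v2-earlyexit", trust_remote_code=True)
+ print(config.num_hidden_layers)        # 24
+ print(config.hidden_size)              # 2048
+ print(config.num_attention_heads)      # 32
+ print(config.num_key_value_heads)      # 8
+ print(config.intermediate_size)        # 4096
+ print(config.max_position_embeddings)  # 2048
+ ```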
36
+ ### Quantization Scheme
37
+
38
+ - **Weights**: Ternary {-1, 0, 1}
39
+ - **Activations**: 4-bit quantization (post-Hadamard)
40
+ - **Hadamard transform**: Fast Walsh-Hadamard Transform (FWHT), applied to activations before quantization and inverted after the linear projection
41
+
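+ The scheme is implemented in `models/h_bitlinear.py`. Stripped of the straight-through estimator used during training, its core reduces to:
+
+ ```python
+ import torch
+
+ def ternary_weights(w: torch.Tensor) -> torch.Tensor:
+     """Ternary {-1, 0, +1} weights with a single absmean scale."""
+     scale = w.abs().mean().clamp(min=1e-5)
+     q = torch.zeros_like(w)
+     q[w > 0.5 * scale] = 1.0
+     q[w < -0.5 * scale] = -1.0
+     return q * scale
+
+ def quantize_activations_4bit(x: torch.Tensor) -> torch.Tensor:
+     """Per-token absmax 4-bit quantization, applied after the Hadamard transform."""
+     scale = x.abs().max(dim=-1, keepdim=True)[0].clamp(min=1e-5)
+     q = (x / scale * 7).round().clamp(-8, 7)  # 4-bit integer range [-8, 7]
+     return q / 7 * scale                      # dequantize: fake quantization (QAT), no low-bit kernels
+ ```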
42
+ ## Training Details
43
+
44
+ ### Dataset
45
+ - **Source**: TinyStories (2.1M stories)
46
+ - **Tokenizer**: GPT-2 BPE (vocab size: 50,257)
47
+ - **Sequence length**: 512 tokens
48
+
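+ A minimal preprocessing sketch consistent with the setup above (assuming the dataset's `text` field and the standard GPT-2 tokenizer; the actual training script is not part of this upload):
+
+ ```python
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ dataset = load_dataset("roneneldan/TinyStories", split="train")
+ tokenizer = AutoTokenizer.from_pretrained("gpt2")  # BPE, vocab size 50,257
+
+ def encode(batch):
+     return tokenizer(batch["text"], truncation=True, max_length=512)
+
+ dataset = dataset.map(encode, batched=True, remove_columns=dataset.column_names)
+ ```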
49
+ ### Training Techniques
50
+
51
+ **Quadratic Layer Dropout:**
52
+ - Progressive dropout: p_l = 0.5 × (l/(L-1))²
53
+ - Normalized so Σp_l = 1.0
54
+ - The final layer is never dropped
55
+ - Later layers are dropped more often, making earlier-layer representations more robust for early exit (see the sketch below)
56
+
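+ The schedule used by `QuadraticLayerDropout` in `models/model_v2_earlyexit.py` condenses to:
+
+ ```python
+ num_layers, max_p = 24, 0.5
+ probs = [max_p * (i / (num_layers - 1)) ** 2 for i in range(num_layers)]
+ total = sum(probs)
+ probs = [p / total for p in probs]  # normalize so the probabilities sum to 1
+ # During training, layer i is skipped with probability probs[i];
+ # the final layer (i == num_layers - 1) is never dropped.
+ ```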
57
+ **Early Exit Loss:**
58
+ - All layers share the same LM head
59
+ - Loss = main_loss + 0.3 × early_exit_loss
60
+ - Layer-proportional weighting: w_i = (i+1)/L
61
+ - Enables flexible early exit at inference
62
+
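+ A condensed view of `compute_early_exit_loss` in `models/model_v2_earlyexit.py` (variable names here are illustrative):
+
+ ```python
+ import torch.nn.functional as F
+
+ def early_exit_loss(all_layer_hidden_states, lm_head, labels):
+     n = len(all_layer_hidden_states)
+     weights = [(i + 1) / n for i in range(n)]   # layer-proportional: deeper layers weigh more
+     weight_sum = sum(weights)
+     weights = [w / weight_sum for w in weights]
+
+     total = 0.0
+     for w, hidden in zip(weights, all_layer_hidden_states):
+         logits = lm_head(hidden)                # the LM head is shared across all layers
+         shift_logits = logits[..., :-1, :].contiguous()
+         shift_labels = labels[..., 1:].contiguous()
+         total = total + w * F.cross_entropy(
+             shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
+         )
+     return total
+
+ # Total training loss: loss = main_loss + 0.3 * early_exit_loss(...)
+ ```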
63
+ ### Hyperparameters
64
+
65
+ - **Optimizer**: AdamW
66
+ - **Learning rate**: 3e-4
67
+ - **Warmup steps**: 4000
68
+ - **Batch size**: 16 (effective: 64)
69
+ - **Training steps**: 50000
70
+ - **Gradient clipping**: 0.5
71
+
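+ The training script itself is not included in this upload. A minimal loop consistent with the table above (`model` and `dataloader` are assumed to exist; the linear warmup shape and the 4× gradient accumulation implied by 16 → 64 are assumptions):
+
+ ```python
+ import torch
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
+ scheduler = torch.optim.lr_scheduler.LambdaLR(
+     optimizer, lambda step: min(1.0, (step + 1) / 4000)  # assumed linear warmup over 4000 steps
+ )
+
+ accum = 4  # assumed: batch 16 × 4 accumulation = effective batch 64
+ for step, batch in enumerate(dataloader):
+     loss = model(**batch, labels=batch["input_ids"]).loss / accum
+     loss.backward()
+     if (step + 1) % accum == 0:
+         torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
+         optimizer.step()
+         scheduler.step()
+         optimizer.zero_grad()
+ ```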
72
+ ## Performance
73
+
74
+ ### Perplexity (TinyStories validation)
75
+
76
+ | Exit Layer | Perplexity | Speed (tok/s) |
77
+ |------------|------------|---------------|
78
+ | All layers | TBD | TBD |
79
+ | Layer 18 | TBD | TBD |
80
+ | Layer 12 | TBD | TBD |
81
+ | Layer 6 | TBD | TBD |
82
+
83
+ ### Training Stability
84
+
85
+ - **Gradient norms**: 50-110
86
+ - **Final loss**: TBD
87
+
88
+ ## Usage
89
+
90
+ ### Installation
91
+
92
+ ```bash
93
+ pip install transformers torch
94
+ ```
95
+
96
+ ### Basic Inference
97
+
98
+ ```python
99
+ from transformers import AutoTokenizer, AutoModelForCausalLM
100
+
101
+ # Load model (trust_remote_code is required because the model code ships with the repo)
102
+ model = AutoModelForCausalLM.from_pretrained("your-username/bitskip-v2-earlyexit", trust_remote_code=True)
103
+ tokenizer = AutoTokenizer.from_pretrained("your-username/bitskip-v2-earlyexit")
104
+
105
+ # Generate text
106
+ inputs = tokenizer("Once upon a time", return_tensors="pt")
107
+ outputs = model.generate(**inputs, max_length=100)
108
+ print(tokenizer.decode(outputs[0]))
109
+ ```
110
+
111
+ ### Early Exit Inference
112
+
113
+ ```python
114
+ # Exit at layer 12 for faster inference
115
+ model.set_exit_layer(12)
116
+ outputs = model.generate(**inputs, max_length=100)
117
+ # 1.5-2x faster with minimal quality loss
118
+ ```
119
+
120
+ ### Benchmark Different Exit Layers
121
+
122
+ ```python
123
+ for exit_layer in [6, 12, 18, 24]:
124
+     model.set_exit_layer(exit_layer)
125
+     outputs = model.generate(**inputs, max_length=100)
126
+     print(f"Layer {exit_layer}: {tokenizer.decode(outputs[0])}")
127
+ ```
128
+
129
+ ## Limitations
130
+
131
+ - **Inference speed**: Quantized models use fake quantization (QAT) without specialized kernels, resulting in slower inference than full-precision despite lower bit-width
132
+ - **Training instability**: 4-bit models (v2) exhibit gradient explosion (norms 50-110) requiring careful hyperparameter tuning
133
+ - **Dataset scope**: Trained only on TinyStories; may not generalize to other domains without fine-tuning
134
+
135
+ ## Citation
136
+
137
+ If you use this model, please cite:
138
+
139
+ ```bibtex
140
+ @article{bitnet,
141
+ title={BitNet: Scaling 1-bit Transformers for Large Language Models},
142
+ author={Wang, Hongyu and Ma, Shuming and Dong, Li and others},
143
+ journal={arXiv preprint arXiv:2310.11453},
144
+ year={2023}
145
+ }
146
+
147
+ @article{layerskip,
148
+ title={LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding},
149
+ author={Elhoushi, Mostafa and Shrivastava, Akshat and Liskovich, Diana and others},
150
+ journal={arXiv preprint arXiv:2404.16710},
151
+ year={2024}
152
+ }
153
+ ```
154
+
155
+ ## License
156
+
157
+ MIT License
158
+
159
+ ## Contact
160
+
161
+ For questions or issues, please open an issue on the model repository.
config.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "architectures": [
3
+ "BitSkipV2ForCausalLMWithEarlyExit"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "model_v2_earlyexit.BitSkipV2EarlyExitConfig",
7
+ "AutoModelForCausalLM": "model_v2_earlyexit.BitSkipV2ForCausalLMWithEarlyExit"
8
+ },
9
+ "early_exit_loss_weight": 0.3,
10
+ "hidden_size": 2048,
11
+ "inference_exit_layer": null,
12
+ "intermediate_size": 4096,
13
+ "max_dropout_prob": 0.5,
14
+ "max_position_embeddings": 2048,
15
+ "model_type": "bitskip_v2_earlyexit",
16
+ "num_attention_heads": 32,
17
+ "num_hidden_layers": 24,
18
+ "num_key_value_heads": 8,
19
+ "rms_norm_eps": 1e-05,
20
+ "rope_theta": 10000.0,
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.45.2",
23
+ "vocab_size": 50257
24
+ }
generation_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.45.2"
4
+ }
inference.py ADDED
@@ -0,0 +1,37 @@
1
+ """
2
+ Inference script for bitskip-v2-earlyexit
3
+ """
4
+
5
+ import torch
6
+ from transformers import AutoTokenizer, AutoModelForCausalLM
7
+
8
+ def main():
9
+ # Load from HuggingFace Hub or local path
10
+ model_path = "." # Current directory or specify repo_id
11
+
12
+ print("Loading model...")
13
+ model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
14
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
15
+
16
+ model.eval()
17
+ print("Model loaded!")
18
+
19
+ # Example generation
20
+ prompt = "Once upon a time"
21
+ inputs = tokenizer(prompt, return_tensors="pt")
22
+
23
+ print(f"\nPrompt: {prompt}\n")
24
+
25
+ # Full model
26
+ print("Generating with all layers...")
27
+ outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
28
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
29
+
30
+ # Early exit at layer 12
31
+ print("\nGenerating with early exit at layer 12...")
32
+ model.set_exit_layer(12)
33
+ outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
34
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
35
+
36
+ if __name__ == "__main__":
37
+ main()
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:780f971cbb9a8460636d3de74c7620d7371f7e6895439face3eab2bfd887ebfc
3
+ size 3837873528
models/__init__.py ADDED
@@ -0,0 +1 @@
1
+ """Model files for bitskip-v2-earlyexit"""
models/h_bitlinear.py ADDED
@@ -0,0 +1,116 @@
1
+ """
2
+ H-BitLinear layer for BitSkip v2 (4-bit activations WITH Hadamard transform)
3
+ OPTIMIZED: Fast Hadamard transform implementation
4
+ """
5
+
6
+ import math
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+
11
+
12
+ def hadamard_transform(x):
13
+ """
14
+ Fast Walsh-Hadamard Transform (FWHT) - OPTIMIZED VERSION.
15
+
16
+ This vectorized implementation is MUCH faster than the loop version.
17
+ Uses divide-and-conquer butterfly pattern for O(n log n) complexity.
18
+ """
19
+ orig_shape = x.shape
20
+ n = x.shape[-1]
21
+
22
+ # Ensure dimension is power of 2
23
+ assert n & (n - 1) == 0, f"Dimension must be power of 2, got {n}"
24
+
25
+ # Flatten to 2D for transform
26
+ x = x.reshape(-1, n)
27
+
28
+ # Fast Hadamard transform using butterfly pattern
29
+ h = 1
30
+ while h < n:
31
+ # Vectorized butterfly operations (MUCH faster than loops!)
32
+ x = x.reshape(-1, n // (2 * h), 2, h)
33
+ x_even = x[:, :, 0, :] # Even-indexed blocks of length h
34
+ x_odd = x[:, :, 1, :] # Odd-indexed blocks of length h
35
+
36
+ # Butterfly: (a, b) -> (a+b, a-b)
37
+ # Build the result as a new tensor: assigning into x in place overwrites x_even
+ # (a view of x) with a+b before the second write, which would yield (a+b, a)
38
+ x = torch.stack((x_even + x_odd, x_even - x_odd), dim=2)
39
+
40
+ x = x.reshape(-1, n)
41
+ h *= 2
42
+
43
+ # Normalize
44
+ x = x / math.sqrt(n)
45
+
46
+ # Reshape back
47
+ return x.reshape(orig_shape)
48
+
49
+
50
+ class HBitLinear(nn.Module):
51
+ """
52
+ H-BitLinear: Hadamard transform + Ternary weights + 4-bit activations.
53
+
54
+ Flow:
55
+ 1. LayerNorm
56
+ 2. Hadamard transform (key preprocessing step!)
57
+ 3. 4-bit quantization
58
+ 4. Linear operation with ternary weights
59
+ 5. Inverse Hadamard transform
60
+ """
61
+
62
+ def __init__(self, in_features, out_features, bias=False):
63
+ super().__init__()
64
+
65
+ # Ensure power of 2 for Hadamard
66
+ assert in_features & (in_features - 1) == 0, \
67
+ f"in_features must be power of 2 for Hadamard, got {in_features}"
68
+ assert out_features & (out_features - 1) == 0, \
69
+ f"out_features must be power of 2 for Hadamard, got {out_features}"
70
+
71
+ self.in_features = in_features
72
+ self.out_features = out_features
73
+
74
+ # Weight and bias
75
+ self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
76
+ self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
77
+
78
+ # LayerNorm before Hadamard
79
+ self.norm = nn.LayerNorm(in_features)
80
+
81
+ def forward(self, x):
82
+ """
83
+ Forward with Hadamard preprocessing + 4-bit quantization.
84
+ """
85
+ # 1. LayerNorm
86
+ x = self.norm(x)
87
+
88
+ # 2. Hadamard transform (KEY STEP for v2!)
89
+ x_hadamard = hadamard_transform(x)
90
+
91
+ # 3. 4-bit quantization (works better after Hadamard)
92
+ x_scale = x_hadamard.abs().max(dim=-1, keepdim=True)[0].clamp(min=1e-5)
93
+ x_quant = (x_hadamard / x_scale * 7).round().clamp(-8, 7) # 4-bit: -8 to 7
94
+ x_quant = x_quant / 7 * x_scale
95
+
96
+ # STE for gradients
97
+ if self.training:
98
+ x_quant = x_hadamard + (x_quant - x_hadamard).detach()
99
+
100
+ # 4. Ternary weight quantization (same as v1)
101
+ w_scale = self.weight.abs().mean().clamp(min=1e-5)
102
+ w_quant = torch.zeros_like(self.weight)
103
+ w_quant[self.weight > 0.5 * w_scale] = 1.0
104
+ w_quant[self.weight < -0.5 * w_scale] = -1.0
105
+ w_quant = w_quant * w_scale
106
+
107
+ if self.training:
108
+ w_quant = self.weight + (w_quant - self.weight).detach()
109
+
110
+ # 5. Linear operation
111
+ output = F.linear(x_quant, w_quant, self.bias)
112
+
113
+ # 6. Inverse Hadamard transform
114
+ output = hadamard_transform(output)
115
+
116
+ return output
models/model_v2_earlyexit.py ADDED
@@ -0,0 +1,413 @@
1
+ """
2
+ BitSkip v2 with Early Exit Loss and Quadratic Dropout
3
+ - H-BitLinear quantization (4-bit + Hadamard)
4
+ - Quadratic layer dropout (normalized sum=1)
5
+ - Early exit loss from all layers
6
+ - HuggingFace compatible
7
+ """
8
+
9
+ import torch
10
+ import torch.nn as nn
11
+ import torch.nn.functional as F
12
+ import math
13
+ from transformers import PreTrainedModel, PretrainedConfig, GenerationMixin
14
+ from transformers.modeling_outputs import CausalLMOutputWithPast
15
+ from typing import Optional, Tuple
16
+
17
+ from .h_bitlinear import HBitLinear
18
+
19
+
20
+ class BitSkipV2EarlyExitConfig(PretrainedConfig):
21
+ model_type = "bitskip_v2_earlyexit"
22
+
23
+ def __init__(
24
+ self,
25
+ vocab_size=50257,
26
+ hidden_size=2048,
27
+ num_hidden_layers=24,
28
+ num_attention_heads=32,
29
+ num_key_value_heads=8,
30
+ intermediate_size=4096,
31
+ max_position_embeddings=2048,
32
+ rms_norm_eps=1e-5,
33
+ rope_theta=10000.0,
34
+ early_exit_loss_weight=0.3,
35
+ max_dropout_prob=0.5,
36
+ inference_exit_layer=None,
37
+ **kwargs
38
+ ):
39
+ self.vocab_size = vocab_size
40
+ self.hidden_size = hidden_size
41
+ self.num_hidden_layers = num_hidden_layers
42
+ self.num_attention_heads = num_attention_heads
43
+ self.num_key_value_heads = num_key_value_heads
44
+ self.intermediate_size = intermediate_size
45
+ self.max_position_embeddings = max_position_embeddings
46
+ self.rms_norm_eps = rms_norm_eps
47
+ self.rope_theta = rope_theta
48
+ self.early_exit_loss_weight = early_exit_loss_weight
49
+ self.max_dropout_prob = max_dropout_prob
50
+ self.inference_exit_layer = inference_exit_layer
51
+ super().__init__(**kwargs)
52
+
53
+
54
+ class QuadraticLayerDropout(nn.Module):
55
+ """Quadratic layer dropout normalized to sum=1."""
56
+
57
+ def __init__(self, num_layers, max_dropout_prob=0.5):
58
+ super().__init__()
59
+ self.num_layers = num_layers
60
+
61
+ dropout_probs = []
62
+ for i in range(num_layers):
63
+ prob = max_dropout_prob * ((i / max(num_layers - 1, 1)) ** 2)
64
+ dropout_probs.append(prob)
65
+
66
+ total_prob = sum(dropout_probs)
67
+ if total_prob > 0:
68
+ dropout_probs = [p / total_prob for p in dropout_probs]
69
+
70
+ self.dropout_probs = dropout_probs
71
+
72
+ def should_drop_layer(self, layer_idx):
73
+ if not self.training or layer_idx >= self.num_layers - 1:
74
+ return False
75
+ return torch.rand(1).item() < self.dropout_probs[layer_idx]
76
+
77
+
78
+ class RMSNorm(nn.Module):
79
+ def __init__(self, hidden_size, eps=1e-6):
80
+ super().__init__()
81
+ self.weight = nn.Parameter(torch.ones(hidden_size))
82
+ self.variance_epsilon = eps
83
+
84
+ def forward(self, hidden_states):
85
+ input_dtype = hidden_states.dtype
86
+ hidden_states = hidden_states.to(torch.float32)
87
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
88
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
89
+ return self.weight * hidden_states.to(input_dtype)
90
+
91
+
92
+ class RotaryEmbedding(nn.Module):
93
+ def __init__(self, dim, max_position_embeddings=2048, base=10000):
94
+ super().__init__()
95
+ inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
96
+ self.register_buffer("inv_freq", inv_freq)
97
+
98
+ def forward(self, x, position_ids):
99
+ inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
100
+ position_ids_expanded = position_ids[:, None, :].float()
101
+ freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)
102
+ emb = torch.cat((freqs, freqs), dim=-1)
103
+ return emb.cos().to(x.dtype), emb.sin().to(x.dtype)
104
+
105
+
106
+ def rotate_half(x):
107
+ x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
108
+ return torch.cat((-x2, x1), dim=-1)
109
+
110
+
111
+ def apply_rotary_pos_emb(q, k, cos, sin):
112
+ # cos/sin are (batch, seq_len, head_dim); add a head axis so they broadcast
+ # against (batch, num_heads, seq_len, head_dim) for any batch size
+ cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)
+ q_embed = (q * cos) + (rotate_half(q) * sin)
113
+ k_embed = (k * cos) + (rotate_half(k) * sin)
114
+ return q_embed, k_embed
115
+
116
+
117
+ class BitSkipV2Attention(nn.Module):
118
+ def __init__(self, config):
119
+ super().__init__()
120
+ self.hidden_size = config.hidden_size
121
+ self.num_heads = config.num_attention_heads
122
+ self.head_dim = self.hidden_size // self.num_heads
123
+ self.num_key_value_heads = config.num_key_value_heads
124
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
125
+
126
+ self.q_proj = HBitLinear(self.hidden_size, self.num_heads * self.head_dim)
127
+ self.k_proj = HBitLinear(self.hidden_size, self.num_key_value_heads * self.head_dim)
128
+ self.v_proj = HBitLinear(self.hidden_size, self.num_key_value_heads * self.head_dim)
129
+ self.o_proj = HBitLinear(self.hidden_size, self.hidden_size)
130
+
131
+ self.rotary_emb = RotaryEmbedding(self.head_dim, config.max_position_embeddings, config.rope_theta)
132
+
133
+ def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, use_cache=False):
134
+ bsz, q_len, _ = hidden_states.size()
135
+
136
+ query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
137
+ key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
138
+ value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
139
+
140
+ cos, sin = self.rotary_emb(value_states, position_ids)
141
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
142
+
143
+ if past_key_value is not None:
144
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
145
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
146
+
147
+ past_key_value = (key_states, value_states) if use_cache else None
148
+
149
+ key_states = key_states.repeat_interleave(self.num_key_value_groups, dim=1)
150
+ value_states = value_states.repeat_interleave(self.num_key_value_groups, dim=1)
151
+
152
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
153
+ if attention_mask is not None:
154
+ attn_weights = attn_weights + attention_mask
155
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
156
+ attn_output = torch.matmul(attn_weights, value_states)
157
+ attn_output = attn_output.transpose(1, 2).contiguous().reshape(bsz, q_len, self.hidden_size)
158
+ attn_output = self.o_proj(attn_output)
159
+
160
+ return attn_output, None, past_key_value
161
+
162
+
163
+ class BitSkipV2MLP(nn.Module):
164
+ def __init__(self, config):
165
+ super().__init__()
166
+ self.gate_proj = HBitLinear(config.hidden_size, config.intermediate_size)
167
+ self.up_proj = HBitLinear(config.hidden_size, config.intermediate_size)
168
+ self.down_proj = HBitLinear(config.intermediate_size, config.hidden_size)
169
+
170
+ def forward(self, x):
171
+ return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
172
+
173
+
174
+ class BitSkipV2DecoderLayer(nn.Module):
175
+ def __init__(self, config):
176
+ super().__init__()
177
+ self.self_attn = BitSkipV2Attention(config)
178
+ self.mlp = BitSkipV2MLP(config)
179
+ self.input_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
180
+ self.post_attention_layernorm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
181
+
182
+ def forward(self, hidden_states, attention_mask=None, position_ids=None, past_key_value=None, use_cache=False):
183
+ residual = hidden_states
184
+ hidden_states = self.input_layernorm(hidden_states)
185
+ hidden_states, _, present_key_value = self.self_attn(
186
+ hidden_states, attention_mask, position_ids, past_key_value, use_cache
187
+ )
188
+ hidden_states = residual + hidden_states
189
+
190
+ residual = hidden_states
191
+ hidden_states = self.post_attention_layernorm(hidden_states)
192
+ hidden_states = self.mlp(hidden_states)
193
+ hidden_states = residual + hidden_states
194
+
195
+ return (hidden_states,) + ((present_key_value,) if use_cache else ())
196
+
197
+
198
+ class BitSkipV2PreTrainedModel(PreTrainedModel):
199
+ config_class = BitSkipV2EarlyExitConfig
200
+ base_model_prefix = "model"
201
+ supports_gradient_checkpointing = True
202
+
203
+ def _init_weights(self, module):
204
+ if isinstance(module, (nn.Linear, HBitLinear)):
205
+ if hasattr(module, 'weight'):
206
+ module.weight.data.normal_(mean=0.0, std=0.02)
207
+ if hasattr(module, 'bias') and module.bias is not None:
208
+ module.bias.data.zero_()
209
+ elif isinstance(module, nn.Embedding):
210
+ module.weight.data.normal_(mean=0.0, std=0.02)
211
+
212
+
213
+ class BitSkipV2Model(BitSkipV2PreTrainedModel):
214
+ def __init__(self, config):
215
+ super().__init__(config)
216
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
217
+ self.layers = nn.ModuleList([BitSkipV2DecoderLayer(config) for _ in range(config.num_hidden_layers)])
218
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
219
+ self.gradient_checkpointing = False
220
+ self.layer_dropout = QuadraticLayerDropout(config.num_hidden_layers, config.max_dropout_prob)
221
+ self.post_init()
222
+
223
+ def forward(self, input_ids, attention_mask=None, position_ids=None, past_key_values=None, use_cache=False, output_hidden_states=False, return_all_layer_outputs=False):
224
+ hidden_states = self.embed_tokens(input_ids)
225
+
226
+ if position_ids is None:
227
+ position_ids = torch.arange(input_ids.shape[1], dtype=torch.long, device=input_ids.device)
228
+ position_ids = position_ids.unsqueeze(0)
229
+
230
+ next_decoder_cache = () if use_cache else None
231
+ all_layer_hidden_states = []
232
+
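+ # Early exit: when inference_exit_layer is set (via set_exit_layer), only the
+ # first N decoder layers are run; the final RMSNorm and LM head still apply.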
233
+ num_layers_to_run = self.config.inference_exit_layer if self.config.inference_exit_layer else len(self.layers)
234
+ num_layers_to_run = min(num_layers_to_run, len(self.layers))
235
+
236
+ for idx in range(num_layers_to_run):
237
+ layer = self.layers[idx]
238
+ past_key_value = past_key_values[idx] if past_key_values else None
239
+
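+ # Quadratic layer dropout: during training this layer may be skipped entirely;
+ # its input is recorded unchanged so the early-exit loss still covers this depth.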
240
+ if self.training and self.layer_dropout.should_drop_layer(idx):
241
+ all_layer_hidden_states.append(hidden_states)
242
+ continue
243
+
244
+ if self.gradient_checkpointing and self.training:
245
+ layer_outputs = self._gradient_checkpointing_func(
246
+ layer.__call__,
247
+ hidden_states,
248
+ attention_mask,
249
+ position_ids,
250
+ past_key_value,
251
+ use_cache,
252
+ )
253
+ else:
254
+ layer_outputs = layer(hidden_states, attention_mask, position_ids, past_key_value, use_cache)
255
+
256
+ hidden_states = layer_outputs[0]
257
+ all_layer_hidden_states.append(hidden_states)
258
+
259
+ if use_cache:
260
+ next_decoder_cache += (layer_outputs[1],)
261
+
262
+ hidden_states = self.norm(hidden_states)
263
+ all_layer_hidden_states.append(hidden_states)
264
+
265
+ if return_all_layer_outputs:
266
+ return hidden_states, next_decoder_cache, all_layer_hidden_states
267
+ else:
268
+ return hidden_states, next_decoder_cache, None
269
+
270
+
271
+ class BitSkipV2ForCausalLMWithEarlyExit(BitSkipV2PreTrainedModel, GenerationMixin):
272
+ _tied_weights_keys = ["lm_head.weight"]
273
+
274
+ def __init__(self, config):
275
+ super().__init__(config)
276
+ self.model = BitSkipV2Model(config)
277
+ self.vocab_size = config.vocab_size
278
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
279
+ self.post_init()
280
+
281
+ def get_input_embeddings(self):
282
+ return self.model.embed_tokens
283
+
284
+ def set_input_embeddings(self, value):
285
+ self.model.embed_tokens = value
286
+
287
+ def get_output_embeddings(self):
288
+ return self.lm_head
289
+
290
+ def set_output_embeddings(self, new_embeddings):
291
+ self.lm_head = new_embeddings
292
+
293
+ def compute_early_exit_loss(self, all_layer_hidden_states, labels):
294
+ """Compute early exit loss with layer-proportional weighting."""
295
+ num_layers = len(all_layer_hidden_states)
296
+
297
+ weights = [(i + 1) / num_layers for i in range(num_layers)]
298
+ weight_sum = sum(weights)
299
+ weights = [w / weight_sum for w in weights]
300
+
301
+ total_exit_loss = 0.0
302
+
303
+ for i, hidden_states in enumerate(all_layer_hidden_states):
304
+ logits = self.lm_head(hidden_states)
305
+ shift_logits = logits[..., :-1, :].contiguous()
306
+ shift_labels = labels[..., 1:].contiguous()
307
+
308
+ loss_fct = nn.CrossEntropyLoss()
309
+ layer_loss = loss_fct(shift_logits.view(-1, self.vocab_size), shift_labels.view(-1))
310
+
311
+ total_exit_loss += weights[i] * layer_loss
312
+
313
+ return total_exit_loss
314
+
315
+ def forward(
316
+ self,
317
+ input_ids=None,
318
+ attention_mask=None,
319
+ position_ids=None,
320
+ past_key_values=None,
321
+ inputs_embeds=None,
322
+ labels=None,
323
+ use_cache=None,
324
+ output_attentions=None,
325
+ output_hidden_states=None,
326
+ return_dict=None,
327
+ ):
328
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
329
+ return_all = self.training and labels is not None
330
+
331
+ hidden_states, past_key_values_output, all_layer_hidden_states = self.model(
332
+ input_ids=input_ids,
333
+ attention_mask=attention_mask,
334
+ position_ids=position_ids,
335
+ past_key_values=past_key_values,
336
+ use_cache=use_cache,
337
+ output_hidden_states=output_hidden_states,
338
+ return_all_layer_outputs=return_all,
339
+ )
340
+
341
+ logits = self.lm_head(hidden_states)
342
+ logits = logits.float()
343
+
344
+ loss = None
345
+ if labels is not None:
346
+ shift_logits = logits[..., :-1, :].contiguous()
347
+ shift_labels = labels[..., 1:].contiguous()
348
+ loss_fct = nn.CrossEntropyLoss()
349
+ main_loss = loss_fct(shift_logits.view(-1, self.vocab_size), shift_labels.view(-1))
350
+
351
+ if all_layer_hidden_states is not None and len(all_layer_hidden_states) > 0:
352
+ early_exit_loss = self.compute_early_exit_loss(all_layer_hidden_states[:-1], labels)
353
+ loss = main_loss + self.config.early_exit_loss_weight * early_exit_loss
354
+ else:
355
+ loss = main_loss
356
+
357
+ if not return_dict:
358
+ output = (logits,) + (past_key_values_output,)
359
+ return (loss,) + output if loss is not None else output
360
+
361
+ return CausalLMOutputWithPast(
362
+ loss=loss,
363
+ logits=logits,
364
+ past_key_values=past_key_values_output,
365
+ hidden_states=None,
366
+ attentions=None,
367
+ )
368
+
369
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
370
+ if past_key_values is not None:
371
+ past_length = past_key_values[0][0].shape[2]
372
+ if input_ids.shape[1] > past_length:
373
+ remove_prefix_length = past_length
374
+ else:
375
+ remove_prefix_length = input_ids.shape[1] - 1
376
+ input_ids = input_ids[:, remove_prefix_length:]
377
+
378
+ position_ids = kwargs.get("position_ids", None)
379
+ if attention_mask is not None and position_ids is None:
380
+ position_ids = attention_mask.long().cumsum(-1) - 1
381
+ position_ids.masked_fill_(attention_mask == 0, 1)
382
+ if past_key_values:
383
+ position_ids = position_ids[:, -input_ids.shape[1] :]
384
+
385
+ if inputs_embeds is not None and past_key_values is None:
386
+ model_inputs = {"inputs_embeds": inputs_embeds}
387
+ else:
388
+ model_inputs = {"input_ids": input_ids}
389
+
390
+ model_inputs.update({
391
+ "position_ids": position_ids,
392
+ "past_key_values": past_key_values,
393
+ "use_cache": kwargs.get("use_cache"),
394
+ "attention_mask": attention_mask,
395
+ })
396
+ return model_inputs
397
+
398
+ @staticmethod
399
+ def _reorder_cache(past_key_values, beam_idx):
400
+ reordered_past = ()
401
+ for layer_past in past_key_values:
402
+ reordered_past += (
403
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
404
+ )
405
+ return reordered_past
406
+
407
+ def set_exit_layer(self, exit_layer):
408
+ self.config.inference_exit_layer = exit_layer
409
+ self.model.config.inference_exit_layer = exit_layer
410
+
411
+
412
+ BitSkipV2EarlyExitConfig.register_for_auto_class()
413
+ BitSkipV2ForCausalLMWithEarlyExit.register_for_auto_class("AutoModelForCausalLM")
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "pad_token": "<|endoftext|>",
5
+ "unk_token": "<|endoftext|>"
6
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "50256": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ }
12
+ },
13
+ "bos_token": "<|endoftext|>",
14
+ "clean_up_tokenization_spaces": false,
15
+ "eos_token": "<|endoftext|>",
16
+ "model_max_length": 1024,
17
+ "pad_token": "<|endoftext|>",
18
+ "tokenizer_class": "GPT2Tokenizer",
19
+ "unk_token": "<|endoftext|>"
20
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff