smithblack-0 committed 21 days ago

Commit e6fbdc8 (verified) · 1 parent: 0b88b09

Update architecture and tokenizer

Files changed (4)

README.md +64 -64
config.json +22 -22
huggingface.py +256 -256
tokenizer_config.json +12 -12

README.md CHANGED

@@ -1,64 +1,64 @@
----
-language:
-- en
-license: mit
-library_name: transformers
-pipeline_tag: text-generation
-tags:
-- pytorch
-- research
-- llama
----
-# advanced-transformers-lib -- Llama 3 Baseline
-A Llama 3-style decoder-only transformer architecture for research. No pretrained
-weights -- pull the architecture from the Hub and instantiate a freshly initialised
-model from config. Override any parameter at instantiation time.
-> **Important:** `trust_remote_code=True` is required. It downloads the architecture
-> source files from the Hub and imports them into your Python process. Review the
-> source at [smithblack-0/llama3_baseline](https://huggingface.co/smithblack-0/llama3_baseline) before use.
-## Usage
-```python
-from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
-# Pull architecture config -- override any parameter at instantiation time
-config = AutoConfig.from_pretrained(
-    "smithblack-0/llama3_baseline",
-    trust_remote_code=True,
-    num_hidden_layers=16,  # example override
-)
-# Instantiate with fresh random weights -- no checkpoint required
-model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
-# Load tokenizer
-tokenizer = AutoTokenizer.from_pretrained("smithblack-0/llama3_baseline")
-# Save and reload after training
-model.save_pretrained("./checkpoint")
-model = AutoModelForCausalLM.from_pretrained("./checkpoint", trust_remote_code=True)
-```
-## Default Configuration
-| Parameter | Default |
-|-----------|---------|
-| `vocab_size` | 50277 |
-| `hidden_size` | 768 |
-| `intermediate_size` | 1568 |
-| `num_hidden_layers` | 24 |
-| `num_attention_heads` | 16 |
-| `num_key_value_heads` | 4 |
-| `head_dim` | 48 |
-| `max_position_embeddings` | 8192 |
-| `rope_theta` | 500000.0 |
-## License
-MIT. Clean-room synthesis: the human author has not read the Llama source code.
-Architectural decisions derive from the published paper. Tokenizer is GPT-NeoX
-(`EleutherAI/gpt-neox-20b`, Apache 2.0).

+---
+language:
+- en
+license: mit
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- pytorch
+- research
+- llama
+---
+# advanced-transformers-lib -- Llama 3 Baseline
+A Llama 3-style decoder-only transformer architecture for research. No pretrained
+weights -- pull the architecture from the Hub and instantiate a freshly initialised
+model from config. Override any parameter at instantiation time.
+> **Important:** `trust_remote_code=True` is required. It downloads the architecture
+> source files from the Hub and imports them into your Python process. Review the
+> source at [smithblack-0/llama3_baseline](https://huggingface.co/smithblack-0/llama3_baseline) before use.
+## Usage
+```python
+from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
+# Pull architecture config -- override any parameter at instantiation time
+config = AutoConfig.from_pretrained(
+    "smithblack-0/llama3_baseline",
+    trust_remote_code=True,
+    num_hidden_layers=16,  # example override
+)
+# Instantiate with fresh random weights -- no checkpoint required
+model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
+# Load tokenizer
+tokenizer = AutoTokenizer.from_pretrained("smithblack-0/llama3_baseline")
+# Save and reload after training
+model.save_pretrained("./checkpoint")
+model = AutoModelForCausalLM.from_pretrained("./checkpoint", trust_remote_code=True)
+```
+## Default Configuration
+| Parameter | Default |
+|-----------|---------|
+| `vocab_size` | 50277 |
+| `hidden_size` | 768 |
+| `intermediate_size` | 1568 |
+| `num_hidden_layers` | 24 |
+| `num_attention_heads` | 16 |
+| `num_key_value_heads` | 4 |
+| `head_dim` | 48 |
+| `max_position_embeddings` | 8192 |
+| `rope_theta` | 500000.0 |
+## License
+MIT. Clean-room synthesis: the human author has not read the Llama source code.
+Architectural decisions derive from the published paper. Tokenizer is GPT-NeoX
+(`EleutherAI/gpt-neox-20b`, Apache 2.0).

config.json CHANGED

@@ -1,22 +1,22 @@
-{
-  "attention_dropout": 0.0,
-  "auto_map": {
-    "AutoConfig": "configuration.Llama3Config",
-    "AutoModelForCausalLM": "huggingface.Llama3ForCausalLM"
-  },
-  "head_dim": 48,
-  "hidden_size": 768,
-  "intermediate_size": 1568,
-  "max_position_embeddings": 8192,
-  "model_type": "llama3_baseline",
-  "num_attention_heads": 16,
-  "num_hidden_layers": 24,
-  "num_key_value_heads": 4,
-  "rms_norm_eps": 1e-05,
-  "rope_parameters": null,
-  "rope_theta": 500000.0,
-  "tie_word_embeddings": false,
-  "transformers_version": "5.3.0",
-  "use_cache": true,
-  "vocab_size": 50277
-}

+{
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration.Llama3Config",
+    "AutoModelForCausalLM": "huggingface.Llama3ForCausalLM"
+  },
+  "head_dim": 48,
+  "hidden_size": 768,
+  "intermediate_size": 1568,
+  "max_position_embeddings": 8192,
+  "model_type": "llama3_baseline",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "num_key_value_heads": 4,
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": null,
+  "rope_theta": 500000.0,
+  "tie_word_embeddings": false,
+  "transformers_version": "5.3.0",
+  "use_cache": true,
+  "vocab_size": 50277
+}
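
The attention-related values in this config can be cross-checked with a little arithmetic. The sketch below is illustrative only: `attention_shapes` is a hypothetical helper, and it assumes the usual grouped-query attention layout (query width = `num_attention_heads * head_dim`, key/value width = `num_key_value_heads * head_dim`). The backbone in `model.py` is not part of this diff, so that layout is an assumption rather than something the commit confirms.

```python
# Values copied from the config.json shown in this commit.
config = {
    "hidden_size": 768,
    "head_dim": 48,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
}


def attention_shapes(cfg: dict) -> dict:
    """Derive projection widths implied by a grouped-query attention layout.

    Hypothetical helper for illustration; not part of the repository.
    """
    heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    if heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")
    q_width = heads * cfg["head_dim"]
    return {
        "q_width": q_width,                        # total width of all query heads
        "kv_width": kv_heads * cfg["head_dim"],    # total width of all key (or value) heads
        "queries_per_kv_head": heads // kv_heads,  # query heads sharing one KV head
        "q_matches_hidden": q_width == cfg["hidden_size"],
    }


shapes = attention_shapes(config)
print(shapes)
```

Under that assumption the query width equals `hidden_size`, and each key/value head is shared by four query heads.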

huggingface.py CHANGED

@@ -1,256 +1,256 @@
-"""HuggingFace wrapper for the Llama 3 baseline.
-Llama3ForCausalLM wraps Llama3Model with everything a researcher needs to
-train, evaluate, and generate from it through the HuggingFace ecosystem:
-token embedding, vocabulary projection, next-token loss, weight tying, and
-the full AutoClass and GenerationMixin contracts.
-The token embedding lives here, not on the backbone. Llama3Model is a pure
-transformer stack that accepts pre-embedded hidden states — it has no knowledge
-of tokens or vocabulary. This is the correct HF convention: the backbone is
-modality-agnostic; the token interface belongs on the task wrapper.
-The LM head projects the backbone's (batch, seq, hidden_size) output to
-(batch, seq, vocab_size) logits. When labels are provided, cross-entropy loss
-is computed with a one-position shift: token i predicts token i+1. The shift
-is applied here rather than expected from the caller — a causal LM always
-trains this way and there is no use case for an unshifted loss.
-Weight tying: when config.tie_word_embeddings is True, lm_head.weight is
-directly assigned to embed_tokens.weight after post_init(). Both matrices are
-shape (vocab_size, hidden_size) — same shape, no transpose needed.
-KV caching uses HuggingFace's Cache protocol. GenerationMixin creates and
-manages the DynamicCache for generate() calls, passing it as past_key_values
-on every forward call. The backbone updates the cache in place and returns the
-same object. _reorder_cache delegates to DynamicCache.reorder_cache() for beam
-search, keeping all beam-reordering logic inside the cache implementation.
-Returns a CausalLMOutputWithPast. ModelOutput subclasses support both attribute
-access (output.logits) and dict-style access (output["logits"]), satisfying
-GenerationMixin's attribute access requirements while keeping existing code unchanged.
-"""
-import torch
-import torch.nn as nn
-from transformers import PreTrainedModel, GenerationMixin
-from transformers.cache_utils import Cache, DynamicCache
-from transformers.modeling_outputs import CausalLMOutputWithPast
-from .configuration import Llama3Config
-from .model import Llama3Model
-class Llama3ForCausalLM(PreTrainedModel, GenerationMixin):
-    """Llama 3 causal language model: token embedding, backbone, LM head, HF contract.
-    Owns the token embedding and LM head. Delegates all transformer computation
-    to Llama3Model. Adds loss computation for training, weight tying between the
-    LM head and the input embedding, and the full HuggingFace AutoClass and
-    GenerationMixin contracts.
-    Args:
-        config: Model configuration. Must be a ``Llama3Config`` instance.
-    """
-    config_class = Llama3Config
-    base_model_prefix = "model"
-    _no_split_modules = ["DecoderLayer"]
-    supports_gradient_checkpointing = True
-    def __init__(self, config: Llama3Config) -> None:
-        super().__init__(config)
-        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
-        self.model = Llama3Model(config)
-        # No bias — consistent with all other projections in this architecture.
-        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-        self.post_init()
-        # Direct weight tying: both matrices are (vocab_size, hidden_size) — same shape,
-        # no transpose. Explicit here for visibility; post_init() → tie_weights() also
-        # performs this via get_input/output_embeddings(), but that is less readable.
-        if config.tie_word_embeddings:
-            self.lm_head.weight = self.embed_tokens.weight
-    def _init_weights(self, module: nn.Module) -> None:
-        # Suppress HF's default reinitialisation pass. HF's _init_weights overwrites
-        # all Linear and Embedding weights with normal(0, 0.02) after construction,
-        # silently replacing PyTorch's own defaults (kaiming_uniform_ for Linear,
-        # normal(0,1) for Embedding). PyTorch's reset_parameters() already ran at
-        # construction time and those initialisations should stand.
-        pass
-    def get_input_embeddings(self) -> nn.Embedding:
-        """Return the token embedding matrix. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
-        return self.embed_tokens
-    def set_input_embeddings(self, value: nn.Embedding) -> None:
-        """Replace the token embedding matrix. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
-        self.embed_tokens = value
-    def get_output_embeddings(self) -> nn.Linear:
-        """Return the LM head. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
-        return self.lm_head
-    def set_output_embeddings(self, value: nn.Linear) -> None:
-        """Replace the LM head. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
-        self.lm_head = value
-    def _reorder_cache(
-        self, past_key_values: Cache, beam_idx: torch.Tensor
-    ) -> Cache:
-        """Reorder the KV cache to match beam reordering during beam search.
-        GenerationMixin calls this after pruning and reordering beams at each
-        step. beam_idx[i] is the old batch position whose cache should move to
-        position i. DynamicCache.reorder_cache() handles the index-select on
-        every stored tensor's batch dimension, keeping the cache consistent with
-        the reordered beam hypotheses.
-        Args:
-            past_key_values: The active Cache object.
-            beam_idx: 1-D tensor of shape (batch * num_beams,) mapping new batch
-                positions to old ones.
-        Returns:
-            The same Cache object, reordered in place.
-        """
-        past_key_values.reorder_cache(beam_idx)
-        return past_key_values
-    def forward(
-        self,
-        input_ids: torch.Tensor,
-        position_ids: torch.Tensor | None = None,
-        past_key_values: Cache | None = None,
-        use_cache: bool | None = None,
-        output_hidden_states: bool | None = None,
-        labels: torch.Tensor | None = None,
-        cache_position: torch.Tensor | None = None,
-        **kwargs,
-    ) -> CausalLMOutputWithPast:
-        """Run the causal language model.
-        Args:
-            input_ids: Token indices of shape (batch, seq_len).
-            position_ids: Absolute positions of shape (batch, seq_len). Passed
-                through to the backbone. When use_cache=True and this is None,
-                derived from cache_position.
-            past_key_values: A HuggingFace Cache object from a prior step, or
-                None. When use_cache=True and this is None, a fresh DynamicCache
-                is created here before calling the backbone.
-            use_cache: Whether to accumulate and return a KV cache. When True
-                and no cache is provided, a DynamicCache is created. When False,
-                None is passed to the backbone regardless of what was provided.
-                Defaults to config.use_cache when None.
-            output_hidden_states: Whether to return per-layer hidden states.
-                Passed through to the backbone.
-            labels: Target token indices of shape (batch, seq_len) for computing
-                next-token prediction loss. The loss is computed over positions
-                1..seq_len predicting from positions 0..seq_len-1 — the shift
-                is applied internally. Positions with label value -100 are
-                ignored by cross-entropy, following the HuggingFace convention
-                for padding and masked positions.
-            cache_position: 1-D integer tensor of shape (seq_len,) giving the
-                absolute position of each input token in the full sequence.
-                Provided by GenerationMixin during generate(). When use_cache=True
-                and this is None, it is derived from the current cache length.
-            **kwargs: Additional keyword arguments passed by GenerationMixin
-                (e.g. return_dict). Accepted and ignored for forward compatibility.
-                We always return CausalLMOutputWithPast regardless of return_dict.
-        Returns:
-            CausalLMOutputWithPast with fields:
-            - ``logits``: vocabulary scores of shape (batch, seq_len, vocab_size).
-              Always present.
-            - ``loss``: scalar cross-entropy loss, or None if labels not provided.
-            - ``past_key_values``: the updated Cache object, or None.
-            - ``hidden_states``: per-layer hidden states, or None.
-        """
-        if kwargs.get("attention_mask") is not None:
-            raise ValueError(
-                "attention_mask is not supported. This model does not support padding masks. "
-                "For training on variable-length sequences, use right-padding with -100 labels."
-            )
-        # Resolve both flags against config defaults. Config sets the default;
-        # per-call arguments override it. Both fields in Llama3Config remain live.
-        use_cache = use_cache if use_cache is not None else self.config.use_cache
-        output_hidden_states = (
-            output_hidden_states
-            if output_hidden_states is not None
-            else self.config.output_hidden_states
-        )
-        # Cache lifecycle is owned here — the backbone only receives a cache or None
-        # and never decides whether to create one.
-        if use_cache:
-            if past_key_values is None:
-                past_key_values = DynamicCache()
-        else:
-            past_key_values = None
-        inputs_embeds = self.embed_tokens(input_ids)
-        batch, seq_len, _ = inputs_embeds.shape
-        # For training (use_cache=False), positions are always 0..seq_len-1.
-        # This is not inference from state — it is a trivial fact about a
-        # non-cached forward pass. The backbone requires explicit position_ids.
-        if not use_cache and position_ids is None:
-            position_ids = torch.arange(seq_len, device=inputs_embeds.device).unsqueeze(0).expand(batch, -1)
-        causal_mask = None
-        if use_cache:
-            # cache_position is GenerationMixin's responsibility. If it is absent,
-            # positions are unknown and any mask or RoPE encoding we produce would be
-            # silently wrong — potentially corrupting a checkpoint. Crash immediately.
-            if cache_position is None:
-                raise ValueError(
-                    "cache_position must be provided when use_cache=True. "
-                    "GenerationMixin supplies this automatically during generate(). "
-                    "If calling forward() directly with use_cache=True, pass cache_position explicitly."
-                )
-            # Derive position_ids for RoPE from cache_position when not provided.
-            # This is a valid computation: cache_position is the authoritative source
-            # of absolute sequence positions, and position_ids is its batch-expanded form.
-            if position_ids is None:
-                position_ids = cache_position.unsqueeze(0).expand(batch, -1)
-            # Build the causal attention mask. For each query at absolute position p,
-            # it may attend to all keys at positions 0..p. k_len is the full sequence
-            # length after this step: one past the last query position.
-            k_len = int(cache_position[-1].item()) + 1
-            k_positions = torch.arange(k_len, device=inputs_embeds.device)
-            # mask[q, k] = True when key position k is within the causal horizon of query q.
-            # Shape: (1, 1, seq_len, k_len) — broadcast over batch and head dimensions.
-            causal_mask = (k_positions[None, :] <= cache_position[:, None]).unsqueeze(0).unsqueeze(0)
-        backbone_out = self.model(
-            inputs_embeds,
-            position_ids=position_ids,
-            past_key_values=past_key_values,
-            output_hidden_states=output_hidden_states,
-            causal_mask=causal_mask,
-        )
-        logits = self.lm_head(backbone_out["last_hidden_state"])
-        loss = None
-        if labels is not None:
-            # Shift so that each position predicts the next token. The final
-            # logit has no target; the first label has no corresponding input.
-            shift_logits = logits[:, :-1, :].contiguous()
-            shift_labels = labels[:, 1:].contiguous()
-            loss = nn.functional.cross_entropy(
-                shift_logits.view(-1, self.config.vocab_size),
-                shift_labels.view(-1),
-            )
-        return CausalLMOutputWithPast(
-            logits=logits,
-            loss=loss,
-            past_key_values=backbone_out["past_key_values"],
-            hidden_states=backbone_out["hidden_states"],
-        )

+"""HuggingFace wrapper for the Llama 3 baseline.
+Llama3ForCausalLM wraps Llama3Model with everything a researcher needs to
+train, evaluate, and generate from it through the HuggingFace ecosystem:
+token embedding, vocabulary projection, next-token loss, weight tying, and
+the full AutoClass and GenerationMixin contracts.
+The token embedding lives here, not on the backbone. Llama3Model is a pure
+transformer stack that accepts pre-embedded hidden states — it has no knowledge
+of tokens or vocabulary. This is the correct HF convention: the backbone is
+modality-agnostic; the token interface belongs on the task wrapper.
+The LM head projects the backbone's (batch, seq, hidden_size) output to
+(batch, seq, vocab_size) logits. When labels are provided, cross-entropy loss
+is computed with a one-position shift: token i predicts token i+1. The shift
+is applied here rather than expected from the caller — a causal LM always
+trains this way and there is no use case for an unshifted loss.
+Weight tying: when config.tie_word_embeddings is True, lm_head.weight is
+directly assigned to embed_tokens.weight after post_init(). Both matrices are
+shape (vocab_size, hidden_size) — same shape, no transpose needed.
+KV caching uses HuggingFace's Cache protocol. GenerationMixin creates and
+manages the DynamicCache for generate() calls, passing it as past_key_values
+on every forward call. The backbone updates the cache in place and returns the
+same object. _reorder_cache delegates to DynamicCache.reorder_cache() for beam
+search, keeping all beam-reordering logic inside the cache implementation.
+Returns a CausalLMOutputWithPast. ModelOutput subclasses support both attribute
+access (output.logits) and dict-style access (output["logits"]), satisfying
+GenerationMixin's attribute access requirements while keeping existing code unchanged.
+"""
+import torch
+import torch.nn as nn
+from transformers import PreTrainedModel, GenerationMixin
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.modeling_outputs import CausalLMOutputWithPast
+from .configuration import Llama3Config
+from .model import Llama3Model
+class Llama3ForCausalLM(PreTrainedModel, GenerationMixin):
+    """Llama 3 causal language model: token embedding, backbone, LM head, HF contract.
+    Owns the token embedding and LM head. Delegates all transformer computation
+    to Llama3Model. Adds loss computation for training, weight tying between the
+    LM head and the input embedding, and the full HuggingFace AutoClass and
+    GenerationMixin contracts.
+    Args:
+        config: Model configuration. Must be a ``Llama3Config`` instance.
+    """
+    config_class = Llama3Config
+    base_model_prefix = "model"
+    _no_split_modules = ["DecoderLayer"]
+    supports_gradient_checkpointing = True
+    def __init__(self, config: Llama3Config) -> None:
+        super().__init__(config)
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.model = Llama3Model(config)
+        # No bias — consistent with all other projections in this architecture.
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.post_init()
+        # Direct weight tying: both matrices are (vocab_size, hidden_size) — same shape,
+        # no transpose. Explicit here for visibility; post_init() → tie_weights() also
+        # performs this via get_input/output_embeddings(), but that is less readable.
+        if config.tie_word_embeddings:
+            self.lm_head.weight = self.embed_tokens.weight
+    def _init_weights(self, module: nn.Module) -> None:
+        # Suppress HF's default reinitialisation pass. HF's _init_weights overwrites
+        # all Linear and Embedding weights with normal(0, 0.02) after construction,
+        # silently replacing PyTorch's own defaults (kaiming_uniform_ for Linear,
+        # normal(0,1) for Embedding). PyTorch's reset_parameters() already ran at
+        # construction time and those initialisations should stand.
+        pass
+    def get_input_embeddings(self) -> nn.Embedding:
+        """Return the token embedding matrix. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
+        return self.embed_tokens
+    def set_input_embeddings(self, value: nn.Embedding) -> None:
+        """Replace the token embedding matrix. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
+        self.embed_tokens = value
+    def get_output_embeddings(self) -> nn.Linear:
+        """Return the LM head. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
+        return self.lm_head
+    def set_output_embeddings(self, value: nn.Linear) -> None:
+        """Replace the LM head. Required by PreTrainedModel for weight tying and resize_token_embeddings."""
+        self.lm_head = value
+    def _reorder_cache(
+        self, past_key_values: Cache, beam_idx: torch.Tensor
+    ) -> Cache:
+        """Reorder the KV cache to match beam reordering during beam search.
+        GenerationMixin calls this after pruning and reordering beams at each
+        step. beam_idx[i] is the old batch position whose cache should move to
+        position i. DynamicCache.reorder_cache() handles the index-select on
+        every stored tensor's batch dimension, keeping the cache consistent with
+        the reordered beam hypotheses.
+        Args:
+            past_key_values: The active Cache object.
+            beam_idx: 1-D tensor of shape (batch * num_beams,) mapping new batch
+                positions to old ones.
+        Returns:
+            The same Cache object, reordered in place.
+        """
+        past_key_values.reorder_cache(beam_idx)
+        return past_key_values
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        position_ids: torch.Tensor | None = None,
+        past_key_values: Cache | None = None,
+        use_cache: bool | None = None,
+        output_hidden_states: bool | None = None,
+        labels: torch.Tensor | None = None,
+        cache_position: torch.Tensor | None = None,
+        **kwargs,
+    ) -> CausalLMOutputWithPast:
+        """Run the causal language model.
+        Args:
+            input_ids: Token indices of shape (batch, seq_len).
+            position_ids: Absolute positions of shape (batch, seq_len). Passed
+                through to the backbone. When use_cache=True and this is None,
+                derived from cache_position.
+            past_key_values: A HuggingFace Cache object from a prior step, or
+                None. When use_cache=True and this is None, a fresh DynamicCache
+                is created here before calling the backbone.
+            use_cache: Whether to accumulate and return a KV cache. When True
+                and no cache is provided, a DynamicCache is created. When False,
+                None is passed to the backbone regardless of what was provided.
+                Defaults to config.use_cache when None.
+            output_hidden_states: Whether to return per-layer hidden states.
+                Passed through to the backbone.
+            labels: Target token indices of shape (batch, seq_len) for computing
+                next-token prediction loss. The loss is computed over positions
+                1..seq_len predicting from positions 0..seq_len-1 — the shift
+                is applied internally. Positions with label value -100 are
+                ignored by cross-entropy, following the HuggingFace convention
+                for padding and masked positions.
+            cache_position: 1-D integer tensor of shape (seq_len,) giving the
+                absolute position of each input token in the full sequence.
+                Provided by GenerationMixin during generate(). When use_cache=True
+                and this is None, it is derived from the current cache length.
+            **kwargs: Additional keyword arguments passed by GenerationMixin
+                (e.g. return_dict). Accepted and ignored for forward compatibility.
+                We always return CausalLMOutputWithPast regardless of return_dict.
+        Returns:
+            CausalLMOutputWithPast with fields:
+            - ``logits``: vocabulary scores of shape (batch, seq_len, vocab_size).
+              Always present.
+            - ``loss``: scalar cross-entropy loss, or None if labels not provided.
+            - ``past_key_values``: the updated Cache object, or None.
+            - ``hidden_states``: per-layer hidden states, or None.
+        """
+        if kwargs.get("attention_mask") is not None:
+            raise ValueError(
+                "attention_mask is not supported. This model does not support padding masks. "
+                "For training on variable-length sequences, use right-padding with -100 labels."
+            )
+        # Resolve both flags against config defaults. Config sets the default;
+        # per-call arguments override it. Both fields in Llama3Config remain live.
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        # Cache lifecycle is owned here — the backbone only receives a cache or None
+        # and never decides whether to create one.
+        if use_cache:
+            if past_key_values is None:
+                past_key_values = DynamicCache()
+        else:
+            past_key_values = None
+        inputs_embeds = self.embed_tokens(input_ids)
+        batch, seq_len, _ = inputs_embeds.shape
+        # For training (use_cache=False), positions are always 0..seq_len-1.
+        # This is not inference from state — it is a trivial fact about a
+        # non-cached forward pass. The backbone requires explicit position_ids.
+        if not use_cache and position_ids is None:
+            position_ids = torch.arange(seq_len, device=inputs_embeds.device).unsqueeze(0).expand(batch, -1)
+        causal_mask = None
+        if use_cache:
+            # cache_position is GenerationMixin's responsibility. If it is absent,
+            # positions are unknown and any mask or RoPE encoding we produce would be
+            # silently wrong — potentially corrupting a checkpoint. Crash immediately.
+            if cache_position is None:
+                raise ValueError(
+                    "cache_position must be provided when use_cache=True. "
+                    "GenerationMixin supplies this automatically during generate(). "
+                    "If calling forward() directly with use_cache=True, pass cache_position explicitly."
+                )
+            # Derive position_ids for RoPE from cache_position when not provided.
+            # This is a valid computation: cache_position is the authoritative source
+            # of absolute sequence positions, and position_ids is its batch-expanded form.
+            if position_ids is None:
+                position_ids = cache_position.unsqueeze(0).expand(batch, -1)
+            # Build the causal attention mask. For each query at absolute position p,
+            # it may attend to all keys at positions 0..p. k_len is the full sequence
+            # length after this step: one past the last query position.
+            k_len = int(cache_position[-1].item()) + 1
+            k_positions = torch.arange(k_len, device=inputs_embeds.device)
+            # mask[q, k] = True when key position k is within the causal horizon of query q.
+            # Shape: (1, 1, seq_len, k_len) — broadcast over batch and head dimensions.
+            causal_mask = (k_positions[None, :] <= cache_position[:, None]).unsqueeze(0).unsqueeze(0)
+        backbone_out = self.model(
+            inputs_embeds,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            output_hidden_states=output_hidden_states,
+            causal_mask=causal_mask,
+        )
+        logits = self.lm_head(backbone_out["last_hidden_state"])
+        loss = None
+        if labels is not None:
+            # Shift so that each position predicts the next token. The final
+            # logit has no target; the first label has no corresponding input.
+            shift_logits = logits[:, :-1, :].contiguous()
+            shift_labels = labels[:, 1:].contiguous()
+            loss = nn.functional.cross_entropy(
+                shift_logits.view(-1, self.config.vocab_size),
+                shift_labels.view(-1),
+            )
+        return CausalLMOutputWithPast(
+            logits=logits,
+            loss=loss,
+            past_key_values=backbone_out["past_key_values"],
+            hidden_states=backbone_out["hidden_states"],
+        )
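
Two pieces of index bookkeeping in `forward()` above, the causal mask built from `cache_position` and the one-position label shift, can be sketched without PyTorch. This is a minimal pure-Python illustration, not the author's code: `causal_mask` and `shifted_cross_entropy` are hypothetical helper names, tensors are replaced by nested lists, and a single unbatched sequence is assumed. Note that, as written in the diff, `forward()` raises a `ValueError` when `use_cache=True` and `cache_position` is missing, whereas the `cache_position` docstring entry says the value would be derived from the cache length.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss, per the docstring above


def causal_mask(cache_position):
    """Return mask[q][k], True when key position k is visible to query q.

    cache_position holds the absolute position of each input token.
    """
    k_len = cache_position[-1] + 1  # one past the last query position
    return [[k <= q_pos for k in range(k_len)] for q_pos in cache_position]


def shifted_cross_entropy(logits, labels):
    """Mean next-token loss: the logits at position i are scored against labels[i + 1]."""
    total, counted = 0.0, 0
    for row, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padded or masked target
        log_norm = math.log(sum(math.exp(score) for score in row))
        total += log_norm - row[target]  # negative log-softmax of the target
        counted += 1
    return total / counted if counted else float("nan")


# Prefill: three prompt tokens at positions 0..2 give a lower-triangular mask.
prefill_mask = causal_mask([0, 1, 2])
# Decode: one new token at position 3 sees all four key positions.
decode_mask = causal_mask([3])

# Uniform logits over a 4-token vocabulary; the last target is ignored.
uniform_logits = [[0.0] * 4 for _ in range(3)]
loss = shifted_cross_entropy(uniform_logits, [0, 1, IGNORE_INDEX])
print(prefill_mask, decode_mask, loss)
```

With uniform logits the loss reduces to `log(vocab_size)`, which is a convenient sanity check for a freshly initialised model.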

tokenizer_config.json CHANGED

@@ -1,13 +1,13 @@
-{
-  "add_prefix_space": false,
-  "backend": "tokenizers",
-  "bos_token": "<|endoftext|>",
-  "eos_token": "<|endoftext|>",
-  "errors": "replace",
-  "is_local": false,
-  "model_max_length": 1000000000000000019884624838656,
-  "pad_token": "<|padding|>",
-  "tokenizer_class": "GPTNeoXTokenizerFast",
-  "trim_offsets": true,
-  "unk_token": "<|endoftext|>"
 }

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "is_local": false,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<|padding|>",
+  "tokenizer_class": "GPTNeoXTokenizerFast",
+  "trim_offsets": true,
+  "unk_token": "<|endoftext|>"
 }
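
The `model_max_length` value above is numerically `int(1e30)`, which to my knowledge is the placeholder Transformers writes when a tokenizer has no explicit length limit. A caller who wants a usable truncation length may therefore prefer the model's `max_position_embeddings` (8192 in the `config.json` of this commit). The sketch below shows one way to pick that fallback; `effective_max_length` and `UNSET_SENTINEL` are hypothetical names, and the fallback policy is an assumption rather than something this repository prescribes.

```python
# Values copied from tokenizer_config.json and config.json in this commit.
tokenizer_config = {"model_max_length": 1000000000000000019884624838656}
model_config = {"max_position_embeddings": 8192}

# Assumed "no limit set" placeholder; equals the value stored above.
UNSET_SENTINEL = int(1e30)


def effective_max_length(tok_cfg: dict, model_cfg: dict) -> int:
    """Pick a usable sequence limit, ignoring the tokenizer's placeholder value.

    Hypothetical helper for illustration; not part of the repository.
    """
    tok_limit = tok_cfg["model_max_length"]
    model_limit = model_cfg["max_position_embeddings"]
    if tok_limit >= UNSET_SENTINEL:
        return model_limit  # tokenizer imposes no real limit
    return min(tok_limit, model_limit)


max_length = effective_max_length(tokenizer_config, model_config)
print(max_length)
```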