chengyanwu committed
Commit ccda2ec · 1 Parent(s): e6dee89
.gitignore CHANGED
@@ -1 +1,2 @@
1
  upload.py
 
 
1
  upload.py
2
+ tester.py
README.md CHANGED
@@ -12,86 +12,88 @@ datasets:
12
  library_name: transformers
13
  ---
14
 
15
- <img alt="OLMoE Logo." src="olmoe-logo.png" width="250px">
16
-
17
 
18
  # Model Summary
19
 
20
- > OLMoE-1B-7B is a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024 (0924). It yields state-of-the-art performance among models with a similar cost (1B) and is competitive with much larger models like Llama2-13B. OLMoE is 100% open-source.
21
-
22
- This information and more can also be found on the [**OLMoE GitHub repository**](https://github.com/allenai/OLMoE).
23
- - **Paper**: https://arxiv.org/abs/2409.02060
24
- - **Pretraining** [Checkpoints](https://hf.co/allenai/OLMoE-1B-7B-0924), [Code](https://github.com/allenai/OLMo/tree/Muennighoff/MoE), [Data](https://huggingface.co/datasets/allenai/OLMoE-mix-0924) and [Logs](https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3).
25
- - **SFT (Supervised Fine-Tuning)** [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT), [Code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [Data](https://hf.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-sft-logs.txt).
26
- - **DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization)**, [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct), [Preference Data](https://hf.co/datasets/allenai/ultrafeedback_binarized_cleaned), [DPO code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [KTO code](https://github.com/Muennighoff/kto/blob/master/kto.py) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt).
27
 
28
- # Use
 
 
 
29
 
30
- Install `transformers` **from source** until a release after [this PR](https://github.com/huggingface/transformers/pull/32406) & `torch` and run:
31
 
32
  ```python
33
- from transformers import OlmoeForCausalLM, AutoTokenizer
34
- import torch
35
-
36
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
37
-
38
- # Load different ckpts via passing e.g. `revision=step10000-tokens41B`
39
- model = OlmoeForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924").to(DEVICE)
40
- tokenizer = AutoTokenizer.from_pretrained("allenai/OLMoE-1B-7B-0924")
41
- inputs = tokenizer("Bitcoin is", return_tensors="pt")
42
- inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
43
- out = model.generate(**inputs, max_length=64)
44
- print(tokenizer.decode(out[0]))
45
- # > # Bitcoin is a digital currency that is created and held electronically. No one controls it. Bitcoins aren’t printed, like dollars or euros – they’re produced by people and businesses running computers all around the world, using software that solves mathematical
46
- ```
47
 
48
- You can list all revisions/branches by installing `huggingface-hub` & running:
49
- ```python
50
- from huggingface_hub import list_repo_refs
51
- out = list_repo_refs("allenai/OLMoE-1B-7B-0924")
52
- branches = [b.name for b in out.branches]
 
 
 
53
  ```
54
 
55
- Important branches:
56
- - `step1200000-tokens5033B`: Pretraining checkpoint used for annealing. There are a few more checkpoints after this one but we did not use them.
57
- - `main`: Checkpoint annealed from `step1200000-tokens5033B` for an additional 100B tokens (23,842 steps). We use this checkpoint for our adaptation (https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT & https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct).
58
- - `fp32`: FP32 version of `main`. The model weights were stored in FP32 during training but we did not observe any performance drop from casting them to BF16 after training so we upload all weights in BF16. If you want the original FP32 checkpoint for `main` you can use this one. You will find that it yields slightly different results but should perform around the same on benchmarks.
59
-
60
- # Evaluation Snapshot
61
-
62
- | Model | Active Params | Open Data | MMLU | HellaSwag | ARC-Chall. | ARC-Easy | PIQA | WinoGrande |
63
- |-----------------------------|---------------|-----------|------|-----------|------------|----------|------|------------|
64
- | **LMs with ~1B active parameters** | | | | | | | | |
65
- | **OLMoE-1B-7B** | **1.3B** | **✅** | **54.1** | **80.0** | **62.1** | **84.2** | **79.8** | **70.2** |
66
- | DCLM-1B | 1.4B | ✅ | 48.5 | 75.1 | 57.6 | 79.5 | 76.6 | 68.1 |
67
- | TinyLlama-1B | 1.1B | ✅ | 33.6 | 60.8 | 38.1 | 69.5 | 71.7 | 60.1 |
68
- | OLMo-1B (0724) | 1.3B | ✅ | 32.1 | 67.5 | 36.4 | 53.5 | 74.0 | 62.9 |
69
- | Pythia-1B | 1.1B | ✅ | 31.1 | 48.0 | 31.4 | 63.4 | 68.9 | 52.7 |
70
- | **LMs with ~2-3B active parameters** | | | | | | | | |
71
- | Qwen1.5-3B-14B | 2.7B | ❌ | **62.4** | 80.0 | **77.4** | **91.6** | **81.0** | 72.3 |
72
- | Gemma2-3B | 2.6B | ❌ | 53.3 | 74.6 | 67.5 | 84.3 | 78.5 | 71.8 |
73
- | JetMoE-2B-9B | 2.2B | ❌ | 49.1 | **81.7** | 61.4 | 81.9 | 80.3 | 70.7 |
74
- | DeepSeek-3B-16B | 2.9B | ❌ | 45.5 | 80.4 | 53.4 | 82.7 | 80.1 | **73.2** |
75
- | StableLM-2B | 1.6B | ❌ | 40.4 | 70.3 | 50.6 | 75.3 | 75.6 | 65.8 |
76
- | OpenMoE-3B-9B | 2.9B | ✅ | 27.4 | 44.4 | 29.3 | 50.6 | 63.3 | 51.9 |
77
- | **LMs with ~7-9B active parameters** | | | | | | | | |
78
- | Gemma2-9B | 9.2B | ❌ | **70.6** | **87.3** | **89.5** | **95.5** | **86.1** | **78.8** |
79
- | Llama3.1-8B | 8.0B | ❌ | 66.9 | 81.6 | 79.5 | 91.7 | 81.1 | 76.6 |
80
- | DCLM-7B | 6.9B | ✅ | 64.4 | 82.3 | 79.8 | 92.3 | 80.1 | 77.3 |
81
- | Mistral-7B | 7.3B | ❌ | 64.0 | 83.0 | 78.6 | 90.8 | 82.8 | 77.9 |
82
- | OLMo-7B (0724) | 6.9B | ✅ | 54.9 | 80.5 | 68.0 | 85.7 | 79.3 | 73.2 |
83
- | Llama2-7B | 6.7B | ❌ | 46.2 | 78.9 | 54.2 | 84.0 | 77.5 | 71.7 |
84
-
85
- # Citation
86
-
87
- ```bibtex
88
- @misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
89
- title={OLMoE: Open Mixture-of-Experts Language Models},
90
- author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
91
- year={2024},
92
- eprint={2409.02060},
93
- archivePrefix={arXiv},
94
- primaryClass={cs.CL},
95
- url={https://arxiv.org/abs/2409.02060},
96
- }
97
- ```
 
12
  library_name: transformers
13
  ---
14
 
 
 
15
 
16
  # Model Summary
17
+ # OLMoE with Adapters
18
+
19
+ This repository contains an extension of the OLMo model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while freezing most of the original parameters, resulting in much more efficient training.
20
+
21
+ ## Model Architecture
22
+
23
+ The `OlmoEWithAdaptersForCausalLM` model extends the original OLMo architecture by:
24
+
25
+ 1. Adding small adapter layers (bottleneck layers) to each MLP block
26
+ 2. Allowing selective freezing of the base model's parameters
27
+ 3. Training only the adapter parameters (~0.1-1% of total parameters)
28
+
29
+ Key components:
30
+ - `OlmoEWithAdaptersMLP`: MLP layer with additional adapter modules
31
+ - `OlmoEWithAdaptersDecoderLayer`: Decoder layer incorporating adapter MLPs
32
+ - `OlmoEWithAdaptersModel`: Full model with adapter-based decoder layers
33
+ - `OlmoEWithAdaptersForCausalLM`: Causal language model with adapters
34
+
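The bottleneck adapter described above can be sketched as a small standalone module. This is an illustrative sketch, not the repository's actual `OlmoEWithAdaptersMLP`; the class name `BottleneckAdapter` and the zero-initialized up-projection are assumptions:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal bottleneck adapter: down-project, non-linearity, up-project,
    with a residual connection around the whole block."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_size)
        self.act = nn.GELU()
        self.up = nn.Linear(adapter_size, hidden_size)
        # Zero-init the up-projection so the adapter starts as the identity
        # and training begins from the frozen base model's behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

Because only `down` and `up` are trained, each adapter adds roughly `2 * hidden_size * adapter_size` parameters per MLP block.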
35
+ ## Training Script
36
+
37
+ The `train.py` script provides a complete workflow for fine-tuning the model:
38
+
39
+ ### Features:
40
+ - Parameter-efficient fine-tuning using adapters
41
+ - Support for various datasets through Hugging Face datasets library
42
+ - Customizable adapter size
43
+ - Option to freeze/unfreeze different components
44
+ - Training with AdamW optimizer and learning rate scheduling
45
+ - Evaluation with perplexity metrics
46
+ - Model checkpointing and saving
47
+
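The perplexity metric used for evaluation is just the exponential of the mean per-token negative log-likelihood (the cross-entropy loss). A minimal sketch, with an illustrative helper name:

```python
import math

def perplexity(nll_per_token):
    """Perplexity from a sequence of per-token negative log-likelihoods
    (natural log), e.g. the cross-entropy losses collected during eval."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))
```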
48
+ ### Usage:
49
+
50
+ ```bash
51
+ python train.py \
52
+ --model_name_or_path allenai/OLMo-7B \
53
+ --adapter_size 64 \
54
+ --freeze_base_model True \
55
+ --dataset_name wikitext \
56
+ --dataset_config_name wikitext-2-raw-v1 \
57
+ --output_dir ./olmoe-adapter-finetuned \
58
+ --num_train_epochs 3 \
59
+ --per_device_train_batch_size 4 \
60
+ --per_device_eval_batch_size 4 \
61
+ --learning_rate 5e-5 \
62
+ --warmup_steps 100 \
63
+ --logging_steps 100 \
64
+ --save_steps 1000 \
65
+ --seed 42
66
+ ```
67
 
68
+ ## Benefits of Adapter-Based Fine-Tuning
69
 
70
+ 1. **Efficiency**: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
71
+ 2. **Storage**: Store only adapter weights rather than full fine-tuned models
72
+ 3. **Composability**: Multiple adapters can be trained for different tasks and swapped at inference time
73
+ 4. **Reduced Overfitting**: Lower parameter count helps prevent overfitting on small datasets
74
 
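The efficiency point can be made concrete with a small helper that freezes everything except adapter parameters and reports the trainable fraction. This is a sketch under stated assumptions: the helper name and the convention of matching parameter names on the substring "adapter" are illustrative, not the training script's actual API:

```python
import torch.nn as nn

def freeze_base_and_count(model: nn.Module, adapter_keyword: str = "adapter"):
    """Freeze every parameter whose name does not contain `adapter_keyword`;
    return (trainable, total) parameter counts."""
    trainable = total = 0
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total
```

The ratio `trainable / total` is what ends up in the ~0.1-1% range for typical adapter sizes.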
75
+ ## How to Use the Fine-Tuned Model
76
 
77
  ```python
78
+ from transformers import AutoTokenizer
79
+ from modeling_olmoe import OlmoEWithAdaptersForCausalLM
80
+
81
+ # Load the fine-tuned model
82
+ model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
83
+ tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")
84
+
85
+ # Generate text
86
+ inputs = tokenizer("Once upon a time", return_tensors="pt")
87
+ outputs = model.generate(**inputs, max_length=50)
88
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
89
  ```
90
 
91
+ ## Adapter Size Recommendations
92
+
93
+ The adapter size determines the parameter efficiency vs. performance trade-off:
94
+
95
+ - **Small datasets**: 16-32 dimensions
96
+ - **Medium datasets**: 64-128 dimensions
97
+ - **Large datasets**: 128-256 dimensions
98
+
99
+ For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
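For reference, the parameter cost of one adapter grows linearly with the adapter size. This sketch assumes a two-linear-layer bottleneck with biases (the actual adapter layout may differ), using the hidden size of 2048 from the config:

```python
def adapter_params(hidden_size: int, adapter_size: int) -> int:
    """Parameters in one bottleneck adapter: down (h -> a) and up (a -> h)
    projections, both with biases."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return down + up

for a in (16, 64, 256):
    print(f"adapter_size={a}: {adapter_params(2048, a):,} params per adapter")
```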
__pycache__/configuration_olmoe.cpython-311.pyc ADDED
Binary file (2.35 kB). View file
 
__pycache__/modeling_kvlatent.cpython-311.pyc ADDED
Binary file (33.9 kB). View file
 
__pycache__/modeling_latent_attention.cpython-311.pyc ADDED
Binary file (9.57 kB). View file
 
__pycache__/modeling_olmoe.cpython-311.pyc ADDED
Binary file (43.7 kB). View file
 
__pycache__/random.cpython-311.pyc ADDED
Binary file (54.4 kB). View file
 
__pycache__/train.cpython-311.pyc ADDED
Binary file (13.6 kB). View file
 
config.json CHANGED
@@ -1,31 +1,27 @@
1
  {
2
  "architectures": [
3
- "OlmoeForCausalLM"
4
  ],
5
- "attention_bias": false,
6
- "attention_dropout": 0.0,
7
- "clip_qkv": null,
8
- "eos_token_id": 50279,
9
- "hidden_act": "silu",
10
  "hidden_size": 2048,
11
- "initializer_range": 0.02,
12
- "intermediate_size": 1024,
13
- "max_position_embeddings": 4096,
14
- "model_type": "olmoe",
15
- "norm_topk_prob": false,
16
  "num_attention_heads": 16,
17
- "num_experts": 64,
18
- "num_experts_per_tok": 8,
19
- "num_hidden_layers": 16,
20
- "num_key_value_heads": 16,
21
- "output_router_logits": false,
 
 
22
  "pad_token_id": 1,
23
- "rope_scaling": null,
24
- "rope_theta": 10000.0,
25
- "router_aux_loss_coef": 0.01,
26
- "tie_word_embeddings": false,
27
- "torch_dtype": "bfloat16",
28
- "transformers_version": "4.43.0.dev0",
29
  "use_cache": true,
30
- "vocab_size": 50304
31
- }
1
  {
2
  "architectures": [
3
+ "KVLatentForCausalLM"
4
  ],
5
+ "model_type": "kvlatent",
6
  "hidden_size": 2048,
7
+ "num_hidden_layers": 24,
8
  "num_attention_heads": 16,
9
+ "num_key_value_heads": 8,
10
+ "num_latents": 64,
11
+ "intermediate_size": 8192,
12
+ "hidden_act": "gelu",
13
+ "initializer_range": 0.02,
14
+ "rms_norm_eps": 1e-5,
15
+ "vocab_size": 50304,
16
  "pad_token_id": 1,
17
+ "bos_token_id": 50256,
18
+ "eos_token_id": 50256,
19
+ "attention_dropout": 0.0,
20
+ "attention_bias": false,
 
 
21
  "use_cache": true,
22
+ "tie_word_embeddings": false,
23
+ "rope_theta": 10000.0,
24
+ "rope_scaling": null,
25
+ "max_position_embeddings": 4096,
26
+ "torch_dtype": "bfloat16"
27
+ }
generate.py ADDED
@@ -0,0 +1,87 @@
1
+ #!/usr/bin/env python
2
+ """
3
+ Example usage script to evaluate a fine-tuned OlmoE adapter model
4
+ and demonstrate generation with adapters.
5
+ """
6
+
7
+ import argparse
8
+ import torch
9
+ from transformers import AutoTokenizer
10
+ from modeling_olmoe import OlmoEWithAdaptersForCausalLM, OlmoConfig
11
+
12
+ def generate_text(
13
+ model_path: str,
14
+ prompt: str,
15
+ max_new_tokens: int = 128,
16
+ temperature: float = 0.7,
17
+ top_p: float = 0.9,
18
+ device: str = "auto",
19
+ ):
20
+ """Generate text using a fine-tuned OlmoE adapter model."""
21
+ # Determine device
22
+ if device == "auto":
23
+ device = "cuda" if torch.cuda.is_available() else "cpu"
24
+ print(f"Using device: {device}")
25
+
26
+ # Load tokenizer and model
27
+ print(f"Loading model from {model_path}")
28
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
29
+
30
+ # Load config and update with adapter settings if needed
31
+ config = OlmoConfig.from_pretrained(model_path)
32
+
33
+ # Load adapter model
34
+ model = OlmoEWithAdaptersForCausalLM.from_pretrained(
35
+ model_path,
36
+ torch_dtype=torch.float16 if device == "cuda" else torch.float32,
37
+ )
38
+ model = model.to(device)
39
+ model.eval()
40
+
41
+ # Tokenize input
42
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
43
+
44
+ # Generate text
45
+ print("\nGenerating text...\n")
46
+ with torch.no_grad():
47
+ outputs = model.generate(
48
+ input_ids,
49
+ max_new_tokens=max_new_tokens,
50
+ do_sample=True,
51
+ temperature=temperature,
52
+ top_p=top_p,
53
+ )
54
+
55
+ # Decode the generated text
56
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
57
+
58
+ print(f"Prompt: {prompt}")
59
+ print("\nGenerated text:")
60
+ print("=" * 40)
61
+ print(generated_text)
62
+ print("=" * 40)
63
+
64
+ return generated_text
65
+
66
+ def main():
67
+ parser = argparse.ArgumentParser(description="Generate text with OlmoE adapter model")
68
+ parser.add_argument("--model_path", type=str, required=True, help="Path to the fine-tuned model")
69
+ parser.add_argument("--prompt", type=str, default="This is an example of", help="Prompt for text generation")
70
+ parser.add_argument("--max_new_tokens", type=int, default=128, help="Maximum number of new tokens to generate")
71
+ parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
72
+ parser.add_argument("--top_p", type=float, default=0.9, help="Top-p sampling parameter")
73
+ parser.add_argument("--device", type=str, default="auto", help="Device to use (cuda, cpu, or auto)")
74
+
75
+ args = parser.parse_args()
76
+
77
+ generate_text(
78
+ model_path=args.model_path,
79
+ prompt=args.prompt,
80
+ max_new_tokens=args.max_new_tokens,
81
+ temperature=args.temperature,
82
+ top_p=args.top_p,
83
+ device=args.device,
84
+ )
85
+
86
+ if __name__ == "__main__":
87
+ main()
model-00001-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:5e3cff7e367794685c241169072c940d200918617d5e2813f1c387dff52d845e
3
- size 4997744872
model-00002-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:15ef5c730ee3cfed7199498788cd2faf337203fc74b529625e7502cdd759f4a7
3
- size 4997235176
model-00003-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:a9abac4ac1b55c9adabac721a02fa39971f103eea9a65c310972b1246de76e04
3
- size 3843741912
modeling_olmoe.py ADDED
@@ -0,0 +1,822 @@
1
+ # modeling_olmoe.py - Extended version of OLMo for custom training
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from typing import Callable, Dict, Optional, Tuple, Union, Any
7
+ # Import necessary components from transformers
8
+ from transformers.activations import ACT2FN
9
+ from transformers.cache_utils import Cache, DynamicCache
10
+ from transformers.generation import GenerationMixin
11
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
12
+ from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
13
+ # from transformers.modeling_layers import GradientCheckpointingLayer
14
+ from torch.utils.checkpoint import checkpoint
15
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
16
+ # from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
17
+ from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
18
+ from transformers.processing_utils import Unpack
19
+ from transformers.utils import LossKwargs, is_torch_flex_attn_available, logging
20
+ from transformers import OlmoConfig
21
+
22
+ # Import flex attention components if available
23
+ if is_torch_flex_attn_available():
24
+ from torch.nn.attention.flex_attention import BlockMask
25
+ # from transformers.integrations.flex_attention import make_flex_block_causal_mask
26
+
27
+ from functools import partial
28
+ # Define GradientCheckpointingLayer since it's missing
29
+ class GradientCheckpointingLayer(nn.Module):
30
+ gradient_checkpointing = False
31
+ def __call__(self, *args, **kwargs):
32
+ # Use checkpoint on `forward` when enabled
33
+ if self.gradient_checkpointing and self.training:
34
+ return checkpoint(self.forward, *args, use_reentrant=False, **kwargs)
35
+ return super().__call__(*args, **kwargs)
36
+
37
+ def forward(self, *args, **kwargs):
38
+ # To be implemented by subclasses
39
+ raise NotImplementedError("Subclasses must implement `forward`")
40
+
41
+ import math
42
+ import functools
43
+
44
+ # Define our own dynamic_rope_update decorator and ROPE_INIT_FUNCTIONS
45
+ def dynamic_rope_update(func):
46
+ """
47
+ Decorator for updating RoPE embeddings when using RoPE scaling strategies.
48
+ """
49
+ @functools.wraps(func)
50
+ def wrapper(self, *args, **kwargs):
51
+ # Only dynamic scaling needs to modify the positional encodings
52
+ if self.rope_type == "dynamic" and hasattr(self, "original_max_seq_len"):
53
+ if self.config.rope_scaling is None:
54
+ return func(self, *args, **kwargs)
55
+ # Extract max_position_embeddings from the actual model
56
+ current_ctx_len = kwargs.get("position_ids", None)
57
+ if current_ctx_len is not None:
58
+ # position_ids shape is [batch_size, seq_len]
59
+ current_ctx_len = current_ctx_len.shape[-1]
60
+
61
+ # If we're inside a context window we've seen before, we don't have to change anything
62
+ if current_ctx_len is not None and current_ctx_len <= self.max_seq_len_cached:
63
+ return func(self, *args, **kwargs)
64
+
65
+ current_ctx_len = self.config.max_position_embeddings if current_ctx_len is None else current_ctx_len
66
+ scaling_factor = self.config.rope_scaling["factor"]
67
+
68
+ self.max_seq_len_cached = min(
69
+ int(self.original_max_seq_len * scaling_factor),
70
+ self.config.rope_scaling.get("max_position_embeddings", float("inf"))
71
+ )
72
+
73
+ # Reset the cached maximum position embeddings to the new value
74
+ power = 0.0 if scaling_factor <= 1.0 else -0.5
75
+ self.inv_freq = self.original_inv_freq * (scaling_factor ** power)
76
+
77
+ return func(self, *args, **kwargs)
78
+
79
+ return wrapper
80
+
81
+ def get_default_rope_init(config, device=None):
82
+ """
83
+ Default initialization for rotary position embeddings.
84
+ """
85
+ head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
86
+ inv_freq = 1.0 / (config.rope_theta ** (torch.arange(0, head_dim, 2).float().to(device) / head_dim))
87
+ return inv_freq, None
88
+
89
+ def get_linear_rope_init(config, device=None):
90
+ """
91
+ Linear initialization for dynamic scaling rotary position embeddings.
92
+ """
93
+ base = get_default_rope_init(config, device)[0]
94
+ scaling_factor = config.rope_scaling["factor"]
95
+
96
+ # Scale the base frequencies
97
+ return base / scaling_factor, scaling_factor
98
+
99
+ def get_dynamic_rope_init(config, device=None):
100
+ """
101
+ Dynamic initialization for dynamic scaling rotary position embeddings (NTK approach).
102
+ """
103
+ head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
104
+ scaling_factor = config.rope_scaling["factor"]
105
+
106
+ # Adjust the base frequencies by a power of the scaling factor
107
+ power = 0.0 if scaling_factor <= 1.0 else -0.5
108
+ inv_freq = 1.0 / (config.rope_theta **
109
+ (torch.arange(0, head_dim, 2).float().to(device) / head_dim))
110
+ inv_freq = inv_freq * (scaling_factor ** power)
111
+
112
+ return inv_freq, scaling_factor
113
+
114
+ # Define the dictionary of RoPE initialization functions
115
+ ROPE_INIT_FUNCTIONS = {
116
+ "default": get_default_rope_init,
117
+ "linear": get_linear_rope_init,
118
+ "dynamic": get_dynamic_rope_init,
119
+ }
120
+
121
+ def can_return_tuple(inputs):
122
+ # Copied logic from the original source
123
+ return getattr(inputs, "return_tuple", False) if hasattr(inputs, "return_tuple") else False
124
+
125
+ # Start Modeling Code
126
+ logger = logging.get_logger(__name__)
127
+
128
+ # Core OLMo components (reused from original implementation)
129
+ class OlmoLayerNorm(nn.Module):
130
+ """LayerNorm but with no learnable weight or bias."""
131
+
132
+ def __init__(self, hidden_size: int) -> None:
133
+ super().__init__()
134
+ self.normalized_shape = (hidden_size,)
135
+
136
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
137
+ orig_dtype = hidden_states.dtype
138
+ return F.layer_norm(hidden_states.to(dtype=torch.float32), self.normalized_shape, None, None, eps=1e-5).to(
139
+ orig_dtype
140
+ )
141
+
142
+
143
+ class OlmoMLP(nn.Module):
144
+ def __init__(self, config):
145
+ super().__init__()
146
+ self.config = config
147
+ self.hidden_size = config.hidden_size
148
+ self.intermediate_size = config.intermediate_size
149
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
150
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
151
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
152
+ self.act_fn = ACT2FN[config.hidden_act]
153
+
154
+ def forward(self, x):
155
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
156
+ return down_proj
157
+
158
+
159
+ # Helper functions for rotary position embeddings
160
+ def rotate_half(x):
161
+ """Rotates half the hidden dims of the input."""
162
+ x1 = x[..., : x.shape[-1] // 2]
163
+ x2 = x[..., x.shape[-1] // 2 :]
164
+ return torch.cat((-x2, x1), dim=-1)
165
+
166
+
167
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
168
+ """Applies Rotary Position Embedding to the query and key tensors."""
169
+ cos = cos.unsqueeze(unsqueeze_dim)
170
+ sin = sin.unsqueeze(unsqueeze_dim)
171
+ q_embed = (q * cos) + (rotate_half(q) * sin)
172
+ k_embed = (k * cos) + (rotate_half(k) * sin)
173
+ return q_embed, k_embed
174
+
175
+
176
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
177
+ """
178
+ Repeats key/value states for grouped queries attention.
179
+ """
180
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
181
+ if n_rep == 1:
182
+ return hidden_states
183
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
184
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
185
+
186
+
187
+ def eager_attention_forward(
188
+ module: nn.Module,
189
+ query: torch.Tensor,
190
+ key: torch.Tensor,
191
+ value: torch.Tensor,
192
+ attention_mask: Optional[torch.Tensor],
193
+ scaling: float,
194
+ dropout: float = 0.0,
195
+ **kwargs,
196
+ ):
197
+ """Default eager implementation of multi-head attention"""
198
+ key_states = repeat_kv(key, module.num_key_value_groups)
199
+ value_states = repeat_kv(value, module.num_key_value_groups)
200
+
201
+ attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
202
+ if attention_mask is not None:
203
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
204
+ attn_weights = attn_weights + causal_mask
205
+
206
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
207
+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
208
+ attn_output = torch.matmul(attn_weights, value_states)
209
+ attn_output = attn_output.transpose(1, 2).contiguous()
210
+
211
+ return attn_output, attn_weights
212
+
213
+
214
+ class OlmoAttention(nn.Module):
215
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
216
+
217
+ def __init__(self, config: OlmoConfig, layer_idx: int):
218
+ super().__init__()
219
+ self.config = config
220
+ self.layer_idx = layer_idx
221
+ self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
222
+ self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
223
+ self.scaling = self.head_dim**-0.5
224
+ self.attention_dropout = config.attention_dropout
225
+ self.is_causal = True
226
+
227
+ self.q_proj = nn.Linear(
228
+ config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
229
+ )
230
+ self.k_proj = nn.Linear(
231
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
232
+ )
233
+ self.v_proj = nn.Linear(
234
+ config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
235
+ )
236
+ self.o_proj = nn.Linear(
237
+ config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
238
+ )
239
+
240
+ def forward(
241
+ self,
242
+ hidden_states: torch.Tensor,
243
+ position_embeddings: Tuple[torch.Tensor, torch.Tensor],
244
+ attention_mask: Optional[torch.Tensor],
245
+ past_key_value: Optional[Cache] = None,
246
+ cache_position: Optional[torch.LongTensor] = None,
247
+ **kwargs,
248
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
249
+ input_shape = hidden_states.shape[:-1]
250
+ hidden_shape = (*input_shape, -1, self.head_dim)
251
+
252
+ query_states = self.q_proj(hidden_states)
253
+ key_states = self.k_proj(hidden_states)
254
+ value_states = self.v_proj(hidden_states)
255
+
256
+ if self.config.clip_qkv is not None:
257
+ query_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
258
+ key_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
259
+ value_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
260
+
261
+ query_states = query_states.view(hidden_shape).transpose(1, 2)
262
+ key_states = key_states.view(hidden_shape).transpose(1, 2)
263
+ value_states = value_states.view(hidden_shape).transpose(1, 2)
264
+
265
+ cos, sin = position_embeddings
266
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
267
+
268
+ if past_key_value is not None:
269
+ # sin and cos are specific to RoPE models; cache_position needed for the static cache
270
+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
271
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
272
+
273
+ attention_interface: Callable = eager_attention_forward
274
+ if self.config._attn_implementation != "eager":
275
+ if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
276
+ logger.warning_once(
277
+ "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
278
+ 'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
279
+ )
280
+ else:
281
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
282
+
283
+ attn_output, attn_weights = attention_interface(
284
+ self,
285
+ query_states,
286
+ key_states,
287
+ value_states,
288
+ attention_mask,
289
+ dropout=0.0 if not self.training else self.attention_dropout,
290
+ scaling=self.scaling,
291
+ **kwargs,
292
+ )
293
+
294
+ attn_output = attn_output.reshape(*input_shape, -1).contiguous()
295
+ attn_output = self.o_proj(attn_output)
296
+ return attn_output, attn_weights
297
+
298
+
299
+ class OlmoDecoderLayer(GradientCheckpointingLayer):
+     def __init__(self, config: OlmoConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = OlmoAttention(config=config, layer_idx=layer_idx)
+
+         self.mlp = OlmoMLP(config)
+         self.input_layernorm = OlmoLayerNorm(config.hidden_size)
+         self.post_attention_layernorm = OlmoLayerNorm(config.hidden_size)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+         **kwargs,
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         return outputs
+
+
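The layer follows the standard pre-norm residual pattern: `x = x + attn(norm(x))` followed by `x = x + mlp(norm(x))`. A toy scalar sketch of just that control flow, with stub sublayers standing in for the real modules:

```python
def decoder_layer(x, norm, attn, mlp):
    # Self-attention block with residual connection
    x = x + attn(norm(x))
    # Feed-forward block with residual connection
    x = x + mlp(norm(x))
    return x

# Stub sublayers: norm is identity, attn adds 1, mlp doubles
out = decoder_layer(1.0, norm=lambda v: v, attn=lambda v: v + 1, mlp=lambda v: v * 2)
```

With these stubs the attention block yields 1 + 2 = 3, and the MLP block yields 3 + 6 = 9.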
+ class OlmoRotaryEmbedding(nn.Module):
+     def __init__(self, config: OlmoConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     @torch.no_grad()
+     @dynamic_rope_update
+     def forward(self, x, position_ids):
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+         position_ids_expanded = position_ids[:, None, :].float()
+
+         device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+         with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+             freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos() * self.attention_scaling
+             sin = emb.sin() * self.attention_scaling
+
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
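For the "default" rope_type, the init function produces one inverse frequency per dimension pair, `inv_freq[i] = theta^(-2i/head_dim)`, and the forward pass turns positions into cos/sin tables. A pure-Python sketch of that math (assuming the usual base theta = 10000 and a toy head_dim of 4; not the model's actual sizes):

```python
import math

def inv_frequencies(head_dim, theta=10000.0):
    # inv_freq[i] = theta^(-2i / head_dim), one frequency per dimension pair
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def cos_sin(position, inv_freq):
    # Angle for each frequency at this absolute position
    angles = [position * f for f in inv_freq]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

inv_freq = inv_frequencies(head_dim=4)
cos0, sin0 = cos_sin(0, inv_freq)
```

Position 0 always yields cos = 1 and sin = 0, so RoPE leaves the first token's query/key unrotated.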
+ # Base model classes
+ class OlmoEPreTrainedModel(PreTrainedModel):
+     """Base class for OlmoE models with additional extensibility features"""
+
+     config_class = OlmoConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["OlmoDecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_flex_attn = True
+     _supports_cache_class = True
+     _supports_quantized_cache = True
+     _supports_static_cache = True
+     _supports_attention_backend = True
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+
+
+ class OlmoEModel(OlmoEPreTrainedModel):
+     """Extended OLMo base model with additional customization points"""
+
+     def __init__(self, config: OlmoConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [OlmoDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = OlmoLayerNorm(config.hidden_size)
+         self.rotary_emb = OlmoRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     def _update_causal_mask(
+         self,
+         attention_mask: Union[torch.Tensor, "BlockMask"],
+         input_tensor: torch.Tensor,
+         cache_position: torch.Tensor,
+         past_key_values: Cache,
+         output_attentions: bool = False,
+     ):
+         if self.config._attn_implementation == "flash_attention_2":
+             if attention_mask is not None and (attention_mask == 0.0).any():
+                 return attention_mask
+             return None
+         # if self.config._attn_implementation == "flex_attention":
+         #     if isinstance(attention_mask, torch.Tensor):
+         #         attention_mask = make_flex_block_causal_mask(attention_mask)
+         #     return attention_mask
+
+         past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+         using_compilable_cache = past_key_values.is_compileable if past_key_values is not None else False
+
+         if self.config._attn_implementation == "sdpa" and not using_compilable_cache and not output_attentions:
+             if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                 attention_mask,
+                 inputs_embeds=input_tensor,
+                 past_key_values_length=past_seen_tokens,
+                 is_training=self.training,
+             ):
+                 return None
+
+         dtype = input_tensor.dtype
+         sequence_length = input_tensor.shape[1]
+         if using_compilable_cache:
+             target_length = past_key_values.get_max_cache_shape()
+         else:
+             target_length = (
+                 attention_mask.shape[-1]
+                 if isinstance(attention_mask, torch.Tensor)
+                 else past_seen_tokens + sequence_length + 1
+             )
+
+         causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
+             attention_mask,
+             sequence_length=sequence_length,
+             target_length=target_length,
+             dtype=dtype,
+             cache_position=cache_position,
+             batch_size=input_tensor.shape[0],
+         )
+
+         if (
+             self.config._attn_implementation == "sdpa"
+             and attention_mask is not None
+             and attention_mask.device.type in ["cuda", "xpu", "npu"]
+             and not output_attentions
+         ):
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+         return causal_mask
+
+     @staticmethod
+     def _prepare_4d_causal_attention_mask_with_cache_position(
+         attention_mask: torch.Tensor,
+         sequence_length: int,
+         target_length: int,
+         dtype: torch.dtype,
+         cache_position: torch.Tensor,
+         batch_size: int,
+         **kwargs,
+     ):
+         """Creates a causal 4D mask."""
+         if attention_mask is not None and attention_mask.dim() == 4:
+             causal_mask = attention_mask
+         else:
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = torch.full(
+                 (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
+             )
+             if sequence_length != 1:
+                 causal_mask = torch.triu(causal_mask, diagonal=1)
+             causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
+             causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+             if attention_mask is not None:
+                 causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                 mask_length = attention_mask.shape[-1]
+                 padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
+                     causal_mask.device
+                 )
+                 padding_mask = padding_mask == 0
+                 causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                     padding_mask, min_dtype
+                 )
+
+         return causal_mask
+
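The helper above materializes an additive mask: key positions a query may attend to get 0, everything after it gets a large negative value (the real code uses `torch.finfo(dtype).min`; `NEG` stands in for it here). A list-based sketch of the causal part, including the KV-cache offset:

```python
NEG = float("-inf")  # stand-in for torch.finfo(dtype).min

def causal_mask(sequence_length, target_length, past_seen_tokens=0):
    # Row q may attend to absolute key positions 0 .. past_seen_tokens + q
    mask = []
    for q in range(sequence_length):
        limit = past_seen_tokens + q
        mask.append([0.0 if k <= limit else NEG for k in range(target_length)])
    return mask

m = causal_mask(sequence_length=3, target_length=3)
```

Adding this matrix to the raw attention scores before softmax drives the masked positions' weights to zero.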
+     @can_return_tuple
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **flash_attn_kwargs,
+     ) -> BaseModelOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         if not isinstance(past_key_values, (type(None), Cache)):
+             raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache()
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         causal_mask = self._update_causal_mask(
+             attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+         )
+
+         hidden_states = inputs_embeds
+
+         # create position embeddings to be shared across the decoder layers
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             layer_outputs = decoder_layer(
+                 hidden_states,
+                 attention_mask=causal_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_values,
+                 output_attentions=output_attentions,
+                 use_cache=use_cache,
+                 cache_position=cache_position,
+                 position_embeddings=position_embeddings,
+                 **flash_attn_kwargs,
+             )
+
+             hidden_states = layer_outputs[0]
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values if use_cache else None,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+
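When `cache_position` is not passed in, the forward pass derives it from how many tokens the KV cache has already seen, which is what keeps absolute positions correct during incremental decoding. The arithmetic, in plain Python:

```python
def make_cache_position(past_seen_tokens, num_new_tokens):
    # Absolute positions of the tokens fed in this forward pass
    return list(range(past_seen_tokens, past_seen_tokens + num_new_tokens))

# Prefill of a 5-token prompt, then a single decode step:
prefill = make_cache_position(0, 5)
decode = make_cache_position(5, 1)
```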
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+
+ class OlmoEForCausalLM(OlmoEPreTrainedModel, GenerationMixin):
+     """OLMo Causal Language Model with extensions for custom training"""
+
+     _tied_weights_keys = ["lm_head.weight"]
+     _tp_plan = {"lm_head": "colwise_rep"}
+     _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = OlmoEModel(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder):
+         self.model = decoder
+
+     def get_decoder(self):
+         return self.model
+
+     @can_return_tuple
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         logits_to_keep: Union[int, torch.Tensor] = 0,
+         **kwargs,
+     ) -> CausalLMOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+
+         # Get model outputs
+         outputs = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             cache_position=cache_position,
+             **kwargs,
+         )
+
+         hidden_states = outputs.last_hidden_state
+         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+         logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+         loss = None
+         if labels is not None:
+             loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
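`logits_to_keep` avoids projecting hidden states that won't be used: during generation only the last position's logits matter. The slicing rule relies on `slice(-0, None)` covering the whole sequence, shown here on a plain list standing in for the sequence dimension:

```python
def keep_logits(hidden, logits_to_keep):
    # int -> keep only the last k positions; 0 means keep everything,
    # because slice(-0, None) == slice(0, None)
    slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
    return hidden[slice_indices]

hidden = ["h0", "h1", "h2", "h3"]
last_only = keep_logits(hidden, 1)
everything = keep_logits(hidden, 0)
```

For a 50k-token vocabulary, skipping the projection for all but one position saves a `seq_len × hidden × vocab` matmul per step.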
+ # Example of custom model extensions you can create:
+
+ class OlmoEWithAdaptersMLP(OlmoMLP):
+     """An extended MLP with adapters for parameter-efficient fine-tuning"""
+
+     def __init__(self, config):
+         super().__init__(config)
+         # Example adapter dimensions (typically much smaller than original dims)
+         adapter_size = getattr(config, "adapter_size", 64)
+
+         # Add adapter layers
+         self.down_adapter = nn.Sequential(
+             nn.Linear(self.hidden_size, adapter_size, bias=False),
+             nn.ReLU(),
+             nn.Linear(adapter_size, self.hidden_size, bias=False),
+         )
+
+         # Initialize adapter layers with small weights
+         self.down_adapter[0].weight.data.normal_(mean=0.0, std=0.01)
+         self.down_adapter[2].weight.data.normal_(mean=0.0, std=0.01)
+
+     def forward(self, x):
+         # Original MLP computation
+         mlp_output = super().forward(x)
+
+         # Add adapter path with residual connection
+         adapter_output = self.down_adapter(x)
+         return mlp_output + adapter_output
+
+
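The bottleneck adapter adds only `2 · hidden_size · adapter_size` weights per MLP, tiny next to the three `hidden_size × intermediate_size` projections it sits beside. Using illustrative sizes (not this model's actual config values):

```python
def adapter_params(hidden_size, adapter_size):
    # down-projection + up-projection, both bias-free
    return 2 * hidden_size * adapter_size

def mlp_params(hidden_size, intermediate_size):
    # gate_proj + up_proj + down_proj, all bias-free
    return 3 * hidden_size * intermediate_size

h, i, a = 2048, 8192, 64  # hypothetical dims for illustration
ratio = adapter_params(h, a) / mlp_params(h, i)
```

With these sizes the adapter is roughly half a percent of the MLP's parameter count, which is why freezing everything else makes fine-tuning cheap.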
+ class OlmoEWithAdaptersDecoderLayer(OlmoDecoderLayer):
+     """OLMo decoder layer with adapters for efficient fine-tuning"""
+
+     def __init__(self, config, layer_idx):
+         # Replace the standard MLP with an adapter-based MLP
+         super().__init__(config, layer_idx)
+         self.mlp = OlmoEWithAdaptersMLP(config)
+
+
+ class OlmoEWithAdaptersModel(OlmoEModel):
+     """OLMo model with adapter layers"""
+
+     def __init__(self, config):
+         super().__init__(config)
+         # Replace all layers with adapter-based layers
+         self.layers = nn.ModuleList(
+             [OlmoEWithAdaptersDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+
+         # Initialize weights
+         self.post_init()
+
+
+ class OlmoEWithAdaptersForCausalLM(OlmoEForCausalLM):
+     """OLMo for causal language modeling with adapters"""
+
+     def __init__(self, config, adapters_config: Optional[Dict[str, Any]] = None):
+         super().__init__(config)
+         self.adapters_config = adapters_config
+
+         # Initialize the model with adapters using the config
+         self.model = OlmoEWithAdaptersModel(config)
+
+         # Initialize weights
+         self.post_init()
+
+     def freeze_base_model(self):
+         """Freeze all parameters except adapters for efficient fine-tuning"""
+         for param in self.model.embed_tokens.parameters():
+             param.requires_grad = False
+
+         for layer in self.model.layers:
+             for name, param in layer.self_attn.named_parameters():
+                 param.requires_grad = False
+
+             for name, param in layer.mlp.named_parameters():
+                 if "down_adapter" not in name:
+                     param.requires_grad = False
+
+             for param in layer.input_layernorm.parameters():
+                 param.requires_grad = False
+             for param in layer.post_attention_layernorm.parameters():
+                 param.requires_grad = False
+
+         for param in self.model.norm.parameters():
+             param.requires_grad = False
+
+         # Uncomment to freeze LM head
+         # for param in self.lm_head.parameters():
+         #     param.requires_grad = False
+
+     def get_trainable_parameters(self):
+         """Return only trainable parameters for optimizer"""
+         return [p for p in self.parameters() if p.requires_grad]
+
+     @classmethod
+     def from_config_and_adapters(
+         cls,
+         config,
+         adapters_config: Optional[Dict[str, Any]] = None,
+     ) -> "OlmoEWithAdaptersForCausalLM":
+         """Optional factory method, if you want to keep this pattern."""
+         return cls(config=config, adapters_config=adapters_config)
+
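`freeze_base_model` leaves `requires_grad=True` only on MLP parameters whose name contains `"down_adapter"` (plus the LM head, unless the commented block is enabled). The same name-based filtering, sketched without torch using a dict of name → trainable flags:

```python
def freeze_except_adapters(named_params):
    # named_params: dict of parameter name -> requires_grad flag.
    # Freeze everything whose name does not mention the adapter.
    for name in named_params:
        if "down_adapter" not in name:
            named_params[name] = False
    return [n for n, trainable in named_params.items() if trainable]

params = {
    "mlp.gate_proj.weight": True,
    "mlp.down_adapter.0.weight": True,
    "self_attn.q_proj.weight": True,
}
trainable = freeze_except_adapters(params)
```

The surviving list mirrors what `get_trainable_parameters` would hand to the optimizer.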
oldcmds.txt ADDED
@@ -0,0 +1,3 @@
+ export CUDA_HOME=$(dirname $(dirname $(which nvcc)))
+ export PATH=$CUDA_HOME/bin:$PATH
+ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
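These exports derive `CUDA_HOME` from wherever `nvcc` resolves on `PATH`: one `dirname` strips the binary name, the second strips the `bin` directory. With a hypothetical install path (substitute `$(which nvcc)` on a real machine):

```shell
# Hypothetical nvcc location used only to illustrate the double-dirname
nvcc_path=/usr/local/cuda-12.1/bin/nvcc
CUDA_HOME=$(dirname "$(dirname "$nvcc_path")")
echo "$CUDA_HOME"
```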
output.txt ADDED
File without changes
randommoe.py ADDED
@@ -0,0 +1,1047 @@
+ from typing import Callable, Optional, Tuple, Union
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ from ...activations import ACT2FN
+ from ...cache_utils import Cache, DynamicCache
+ from ...generation import GenerationMixin
+ from ...modeling_attn_mask_utils import AttentionMaskConverter
+ from ...modeling_flash_attention_utils import FlashAttentionKwargs
+ from ...modeling_layers import GradientCheckpointingLayer
+ from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+ from ...modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+ from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+ from ...processing_utils import Unpack
+ from ...utils import (
+     LossKwargs,
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     can_return_tuple,
+     is_torch_flex_attn_available,
+     logging,
+     replace_return_docstrings,
+ )
+ from .configuration_olmo import OlmoConfig
+
+
+ if is_torch_flex_attn_available():
+     from torch.nn.attention.flex_attention import BlockMask
+
+     from ...integrations.flex_attention import make_flex_block_causal_mask
+
+
+ logger = logging.get_logger(__name__)
+ _CONFIG_FOR_DOC = "OlmoConfig"
+
+
+ class OlmoLayerNorm(nn.Module):
+     """LayerNorm but with no learnable weight or bias."""
+
+     def __init__(self, hidden_size: int) -> None:
+         super().__init__()
+         self.normalized_shape = (hidden_size,)
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         orig_dtype = hidden_states.dtype
+         return F.layer_norm(hidden_states.to(dtype=torch.float32), self.normalized_shape, None, None, eps=1e-5).to(
+             orig_dtype
+         )
+
+
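OlmoLayerNorm normalizes in float32 but, unlike standard LayerNorm, learns no scale or bias. The normalization itself, on a plain list (eps matching the 1e-5 above):

```python
import math

def layer_norm(xs, eps=1e-5):
    # Subtract the mean, divide by sqrt(variance + eps); no weight, no bias
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
```

The output is zero-mean and (approximately) unit-variance; with no affine parameters there is nothing for `_init_weights` to initialize here, which matches the module above.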
+ class OlmoMLP(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, x):
+         down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+         return down_proj
+
+
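This is a gated (SwiGLU-style) MLP when `hidden_act` is `"silu"`: `down_proj(silu(gate_proj(x)) * up_proj(x))`. The elementwise gating, with `silu(x) = x · sigmoid(x)`:

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def gated_act(gate, up):
    # silu(gate_proj(x)) * up_proj(x), elementwise
    return [silu(g) * u for g, u in zip(gate, up)]

out = gated_act([0.0, 10.0], [3.0, 2.0])
```

A zero gate suppresses its channel entirely, while a strongly positive gate passes the up-projection through almost unchanged.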
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors.
+
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+         position_ids (`torch.Tensor`, *optional*):
+             Deprecated and unused.
+         unsqueeze_dim (`int`, *optional*, defaults to 1):
+             The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+             sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+             that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+             k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+             cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+             the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+     Returns:
+         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos = cos.unsqueeze(unsqueeze_dim)
+     sin = sin.unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
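`rotate_half` splits the head dimension in half and maps `(x1, x2)` to `(-x2, x1)`, a 90° rotation of each dimension pair; applying it twice negates the input. The same operation on a plain list:

```python
def rotate_half(x):
    # (x1, x2) -> (-x2, x1) across the split halves of the vector
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1

r = rotate_half([1.0, 2.0, 3.0, 4.0])
```

Combined with the cos/sin tables, `x * cos + rotate_half(x) * sin` rotates each (x1, x2) pair by the position-dependent angle.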
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """
+     This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+     num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+     """
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
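`repeat_kv` expands each KV head `n_rep` times so grouped-query attention can share one key/value head across several query heads; as the docstring notes, it is interleaved like `repeat_interleave`, not tiled. The head-axis semantics, with heads as strings:

```python
def repeat_kv(kv_heads, n_rep):
    # ["k0", "k1"] with n_rep=2 -> ["k0", "k0", "k1", "k1"] (interleaved, not tiled)
    out = []
    for head in kv_heads:
        out.extend([head] * n_rep)
    return out

expanded = repeat_kv(["k0", "k1"], n_rep=2)
```

Interleaving matters: query head `i` must line up with KV head `i // n_rep`, which tiling (`["k0", "k1", "k0", "k1"]`) would break.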
+ def eager_attention_forward(
+     module: nn.Module,
+     query: torch.Tensor,
+     key: torch.Tensor,
+     value: torch.Tensor,
+     attention_mask: Optional[torch.Tensor],
+     scaling: float,
+     dropout: float = 0.0,
+     **kwargs,
+ ):
+     key_states = repeat_kv(key, module.num_key_value_groups)
+     value_states = repeat_kv(value, module.num_key_value_groups)
+
+     attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+     if attention_mask is not None:
+         causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+         attn_weights = attn_weights + causal_mask
+
+     attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+     attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+     attn_output = torch.matmul(attn_weights, value_states)
+     attn_output = attn_output.transpose(1, 2).contiguous()
+
+     return attn_output, attn_weights
+
+
+ class OlmoAttention(nn.Module):
+     """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+     def __init__(self, config: OlmoConfig, layer_idx: int):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+         self.head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+         self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+         self.scaling = self.head_dim**-0.5
+         self.attention_dropout = config.attention_dropout
+         self.is_causal = True
+
+         self.q_proj = nn.Linear(
+             config.hidden_size, config.num_attention_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.k_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.v_proj = nn.Linear(
+             config.hidden_size, config.num_key_value_heads * self.head_dim, bias=config.attention_bias
+         )
+         self.o_proj = nn.Linear(
+             config.num_attention_heads * self.head_dim, config.hidden_size, bias=config.attention_bias
+         )
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         position_embeddings: Tuple[torch.Tensor, torch.Tensor],
+         attention_mask: Optional[torch.Tensor],
+         past_key_value: Optional[Cache] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         input_shape = hidden_states.shape[:-1]
+         hidden_shape = (*input_shape, -1, self.head_dim)
+
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+
+         if self.config.clip_qkv is not None:
+             query_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
+             key_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
+             value_states.clamp_(min=-self.config.clip_qkv, max=self.config.clip_qkv)
+
+         query_states = query_states.view(hidden_shape).transpose(1, 2)
+         key_states = key_states.view(hidden_shape).transpose(1, 2)
+         value_states = value_states.view(hidden_shape).transpose(1, 2)
+
+         cos, sin = position_embeddings
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             # sin and cos are specific to RoPE models; cache_position needed for the static cache
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         attention_interface: Callable = eager_attention_forward
+         if self.config._attn_implementation != "eager":
+             if self.config._attn_implementation == "sdpa" and kwargs.get("output_attentions", False):
+                 logger.warning_once(
+                     "`torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to "
+                     'eager attention. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
+                 )
+             else:
+                 attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+         attn_output, attn_weights = attention_interface(
+             self,
+             query_states,
+             key_states,
+             value_states,
+             attention_mask,
+             dropout=0.0 if not self.training else self.attention_dropout,
+             scaling=self.scaling,
+             **kwargs,
+         )
+
+         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+         attn_output = self.o_proj(attn_output)
+         return attn_output, attn_weights
+
+
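OLMo's distinctive `clip_qkv` step clamps every Q/K/V activation into `[-clip_qkv, clip_qkv]` before the heads are reshaped, which bounds attention logits for training stability. The clamp itself, as a pure function:

```python
def clip_qkv(values, clip):
    # Equivalent to tensor.clamp_(min=-clip, max=clip), elementwise
    return [max(-clip, min(clip, v)) for v in values]

clipped = clip_qkv([-12.0, 0.5, 9.0], clip=8.0)
```

Values already inside the range pass through unchanged; only outliers are saturated at ±clip.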
+ class OlmoDecoderLayer(GradientCheckpointingLayer):
227
+ def __init__(self, config: OlmoConfig, layer_idx: int):
228
+ super().__init__()
229
+ self.hidden_size = config.hidden_size
230
+ self.self_attn = OlmoAttention(config=config, layer_idx=layer_idx)
231
+
232
+ self.mlp = OlmoMLP(config)
233
+ self.input_layernorm = OlmoLayerNorm(config.hidden_size)
234
+ self.post_attention_layernorm = OlmoLayerNorm(config.hidden_size)
235
+
236
+ def forward(
237
+ self,
238
+ hidden_states: torch.Tensor,
239
+ attention_mask: Optional[torch.Tensor] = None,
240
+ position_ids: Optional[torch.LongTensor] = None,
241
+ past_key_value: Optional[Cache] = None,
242
+ output_attentions: Optional[bool] = False,
243
+ use_cache: Optional[bool] = False,
244
+ cache_position: Optional[torch.LongTensor] = None,
245
+ position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None, # necessary, but kept here for BC
246
+ **kwargs: Unpack[FlashAttentionKwargs],
247
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
248
+ residual = hidden_states
249
+ hidden_states = self.input_layernorm(hidden_states)
250
+
251
+ # Self Attention
252
+ hidden_states, self_attn_weights = self.self_attn(
253
+ hidden_states=hidden_states,
254
+ attention_mask=attention_mask,
255
+ position_ids=position_ids,
256
+ past_key_value=past_key_value,
257
+ output_attentions=output_attentions,
258
+ use_cache=use_cache,
259
+ cache_position=cache_position,
260
+ position_embeddings=position_embeddings,
261
+ **kwargs,
262
+ )
263
+ hidden_states = residual + hidden_states
264
+
265
+ # Fully Connected
266
+ residual = hidden_states
267
+ hidden_states = self.post_attention_layernorm(hidden_states)
268
+ hidden_states = self.mlp(hidden_states)
269
+ hidden_states = residual + hidden_states
270
+
271
+ outputs = (hidden_states,)
272
+ if output_attentions:
273
+ outputs += (self_attn_weights,)
274
+
275
+ return outputs
276
+
277
+
+ class OlmoRotaryEmbedding(nn.Module):
+     def __init__(self, config: OlmoConfig, device=None):
+         super().__init__()
+         # BC: "rope_type" was originally "type"
+         if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
+             self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+         else:
+             self.rope_type = "default"
+         self.max_seq_len_cached = config.max_position_embeddings
+         self.original_max_seq_len = config.max_position_embeddings
+
+         self.config = config
+         self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+         inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.original_inv_freq = self.inv_freq
+
+     @torch.no_grad()
+     @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+     def forward(self, x, position_ids):
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device)
+         position_ids_expanded = position_ids[:, None, :].float()
+
+         device_type = x.device.type if isinstance(x.device.type, str) and x.device.type != "mps" else "cpu"
+         with torch.autocast(device_type=device_type, enabled=False):  # Force float32
+             freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos() * self.attention_scaling
+             sin = emb.sin() * self.attention_scaling
+
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
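For intuition, the frequency table that `OlmoRotaryEmbedding` materializes via `rope_init_fn` follows the standard RoPE recipe: one frequency per pair of channels, decaying geometrically with channel index, applied as a 2D rotation. A minimal pure-Python sketch (the helper names are illustrative, not part of the model code):

```python
import math

def rope_cos_sin(dim, positions, base=10000.0):
    """Compute RoPE cos/sin tables for an even head dimension `dim`."""
    # inv_freq[i] = 1 / base^(2i/dim), one frequency per (even, odd) channel pair
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin

def rotate_pair(x, y, c, s):
    # 2D rotation applied to one channel pair; preserves the pair's norm
    return x * c - y * s, x * s + y * c
```

Position 0 yields the identity rotation, which is why absolute position drops out and only relative offsets affect attention scores.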
+ OLMO_START_DOCSTRING = r"""
+     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+     library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+     etc.)
+
+     This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+     Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+     and behavior.
+
+     Parameters:
+         config ([`OlmoConfig`]):
+             Model configuration class with all the parameters of the model. Initializing with a config file does not
+             load the weights associated with the model, only the configuration. Check out the
+             [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Olmo Model outputting raw hidden-states without any specific head on top.",
+     OLMO_START_DOCSTRING,
+ )
+ class OlmoPreTrainedModel(PreTrainedModel):
+     config_class = OlmoConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["OlmoDecoderLayer"]
+     _skip_keys_device_placement = ["past_key_values"]
+     _supports_flash_attn_2 = True
+     _supports_sdpa = True
+     _supports_flex_attn = True
+     _supports_cache_class = True
+     _supports_quantized_cache = True
+     _supports_static_cache = True
+     _supports_attention_backend = True
+
+     def _init_weights(self, module):
+         std = self.config.initializer_range
+         if isinstance(module, nn.Linear):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=std)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+
+
+ OLMO_INPUTS_DOCSTRING = r"""
+     Args:
+         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+             Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+             it.
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             [What are input IDs?](../glossary#input-ids)
+         attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)` or `BlockMask`, *optional*):
+             Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+             - 1 for tokens that are **not masked**,
+             - 0 for tokens that are **masked**.
+
+             If the model is configured to use flex_attention, it will attempt to convert the mask Tensor into a BlockMask,
+             but you can also pass a `BlockMask` object directly here.
+
+             [What are attention masks?](../glossary#attention-mask)
+
+             Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+             [`PreTrainedTokenizer.__call__`] for details.
+
+             If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
+             `past_key_values`).
+
+             If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+             and modify it to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+             information on the default strategy.
+
+             - 1 indicates the head is **not masked**,
+             - 0 indicates the head is **masked**.
+         position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
+             config.n_positions - 1]`.
+
+             [What are position IDs?](../glossary#position-ids)
+         past_key_values (`Cache`, *optional*):
+             Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+             blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
+             returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
+
+             It is a [`~cache_utils.Cache`] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
+
+             If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
+             have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
+             of shape `(batch_size, sequence_length)`.
+         inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+             Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+             is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+             model's internal embedding lookup matrix.
+         use_cache (`bool`, *optional*):
+             If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+             `past_key_values`).
+         output_attentions (`bool`, *optional*):
+             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+             tensors for more detail.
+         output_hidden_states (`bool`, *optional*):
+             Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+             more detail.
+         return_dict (`bool`, *optional*):
+             Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+         cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
+             Indices depicting the position of the input sequence tokens in the sequence. Contrarily to `position_ids`,
+             this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
+             the complete sequence length.
+ """
+
+
+ @add_start_docstrings(
+     "The bare Olmo Model outputting raw hidden-states without any specific head on top.",
+     OLMO_START_DOCSTRING,
+ )
+ class OlmoModel(OlmoPreTrainedModel):
+     """
+     Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OlmoDecoderLayer`]
+
+     Args:
+         config: OlmoConfig
+     """
+
+     def __init__(self, config: OlmoConfig):
+         super().__init__(config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         self.layers = nn.ModuleList(
+             [OlmoDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = OlmoLayerNorm(config.hidden_size)
+         self.rotary_emb = OlmoRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.embed_tokens = value
+
+     @can_return_tuple
+     @add_start_docstrings_to_model_forward(OLMO_INPUTS_DOCSTRING)
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         **flash_attn_kwargs: Unpack[FlashAttentionKwargs],
+     ) -> BaseModelOutputWithPast:
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+         if (input_ids is None) ^ (inputs_embeds is not None):
+             raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+         if self.gradient_checkpointing and self.training and use_cache:
+             logger.warning_once(
+                 "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+             )
+             use_cache = False
+
+         # TODO (joao): remove this exception in v4.56 -- it exists for users that try to pass a legacy cache
+         if not isinstance(past_key_values, (type(None), Cache)):
+             raise ValueError("The `past_key_values` should be either a `Cache` object or `None`.")
+
+         if inputs_embeds is None:
+             inputs_embeds = self.embed_tokens(input_ids)
+
+         if use_cache and past_key_values is None:
+             past_key_values = DynamicCache()
+
+         if cache_position is None:
+             past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+             cache_position = torch.arange(
+                 past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
+             )
+
+         if position_ids is None:
+             position_ids = cache_position.unsqueeze(0)
+
+         causal_mask = self._update_causal_mask(
+             attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
+         )
+
+         hidden_states = inputs_embeds
+
+         # create position embeddings to be shared across the decoder layers
+         position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+         # decoder layers
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attns = () if output_attentions else None
+
+         for decoder_layer in self.layers[: self.config.num_hidden_layers]:
+             if output_hidden_states:
+                 all_hidden_states += (hidden_states,)
+
+             layer_outputs = decoder_layer(
+                 hidden_states,
+                 attention_mask=causal_mask,
+                 position_ids=position_ids,
+                 past_key_value=past_key_values,
+                 output_attentions=output_attentions,
+                 use_cache=use_cache,
+                 cache_position=cache_position,
+                 position_embeddings=position_embeddings,
+                 **flash_attn_kwargs,
+             )
+
+             hidden_states = layer_outputs[0]
+
+             if output_attentions:
+                 all_self_attns += (layer_outputs[1],)
+
+         hidden_states = self.norm(hidden_states)
+
+         # add hidden states from the last decoder layer
+         if output_hidden_states:
+             all_hidden_states += (hidden_states,)
+
+         return BaseModelOutputWithPast(
+             last_hidden_state=hidden_states,
+             past_key_values=past_key_values if use_cache else None,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attns,
+         )
+
+     def _update_causal_mask(
+         self,
+         attention_mask: Union[torch.Tensor, "BlockMask"],
+         input_tensor: torch.Tensor,
+         cache_position: torch.Tensor,
+         past_key_values: Cache,
+         output_attentions: bool = False,
+     ):
+         if self.config._attn_implementation == "flash_attention_2":
+             if attention_mask is not None and (attention_mask == 0.0).any():
+                 return attention_mask
+             return None
+         if self.config._attn_implementation == "flex_attention":
+             if isinstance(attention_mask, torch.Tensor):
+                 attention_mask = make_flex_block_causal_mask(attention_mask)
+             return attention_mask
+
+         # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
+         # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
+         # to infer the attention mask.
+         past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+         using_compilable_cache = past_key_values.is_compileable if past_key_values is not None else False
+
+         # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
+         if self.config._attn_implementation == "sdpa" and not using_compilable_cache and not output_attentions:
+             if AttentionMaskConverter._ignore_causal_mask_sdpa(
+                 attention_mask,
+                 inputs_embeds=input_tensor,
+                 past_key_values_length=past_seen_tokens,
+                 is_training=self.training,
+             ):
+                 return None
+
+         dtype = input_tensor.dtype
+         sequence_length = input_tensor.shape[1]
+         if using_compilable_cache:
+             target_length = past_key_values.get_max_cache_shape()
+         else:
+             target_length = (
+                 attention_mask.shape[-1]
+                 if isinstance(attention_mask, torch.Tensor)
+                 else past_seen_tokens + sequence_length + 1
+             )
+
+         # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
+         causal_mask = self._prepare_4d_causal_attention_mask_with_cache_position(
+             attention_mask,
+             sequence_length=sequence_length,
+             target_length=target_length,
+             dtype=dtype,
+             cache_position=cache_position,
+             batch_size=input_tensor.shape[0],
+         )
+
+         if (
+             self.config._attn_implementation == "sdpa"
+             and attention_mask is not None
+             and attention_mask.device.type in ["cuda", "xpu", "npu"]
+             and not output_attentions
+         ):
+             # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
+             # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
+             # Details: https://github.com/pytorch/pytorch/issues/110213
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
+
+         return causal_mask
+
+     @staticmethod
+     def _prepare_4d_causal_attention_mask_with_cache_position(
+         attention_mask: torch.Tensor,
+         sequence_length: int,
+         target_length: int,
+         dtype: torch.dtype,
+         cache_position: torch.Tensor,
+         batch_size: int,
+         **kwargs,
+     ):
+         """
+         Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
+         `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
+
+         Args:
+             attention_mask (`torch.Tensor`):
+                 A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape
+                 `(batch_size, 1, query_length, key_value_length)`.
+             sequence_length (`int`):
+                 The sequence length being processed.
+             target_length (`int`):
+                 The target length: when generating with static cache, the mask should be as long as the static cache,
+                 to account for the 0 padding, the part of the cache that is not filled yet.
+             dtype (`torch.dtype`):
+                 The dtype to use for the 4D attention mask.
+             cache_position (`torch.Tensor`):
+                 Indices depicting the position of the input sequence tokens in the sequence.
+             batch_size (`torch.Tensor`):
+                 Batch size.
+         """
+         if attention_mask is not None and attention_mask.dim() == 4:
+             # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
+             causal_mask = attention_mask
+         else:
+             min_dtype = torch.finfo(dtype).min
+             causal_mask = torch.full(
+                 (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
+             )
+             if sequence_length != 1:
+                 causal_mask = torch.triu(causal_mask, diagonal=1)
+             causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
+             causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
+             if attention_mask is not None:
+                 causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
+                 mask_length = attention_mask.shape[-1]
+                 padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
+                     causal_mask.device
+                 )
+                 padding_mask = padding_mask == 0
+                 causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
+                     padding_mask, min_dtype
+                 )
+
+         return causal_mask
+
+
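`_prepare_4d_causal_attention_mask_with_cache_position` combines two constraints: a query may only attend to keys at or before its own cache position, and padded keys are always masked. A boolean sketch of that logic (the real code encodes "masked" as `torch.finfo(dtype).min` in an additive 4D float mask; the function name here is illustrative):

```python
def build_causal_mask(seq_len, past_len, padding):
    """Allowed[q][k] over a key axis of length past_len + seq_len.

    padding: per-key flags, 1 = real token, 0 = pad.
    """
    total = past_len + seq_len
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(total):
            causal_ok = k <= past_len + q  # query q sits at cache position past_len + q
            not_pad = padding[k] == 1      # padded keys are never attended to
            row.append(causal_ok and not_pad)
        mask.append(row)
    return mask
```

With a one-token cache and a trailing pad, every query can see the cached key and itself, but never the pad column.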
+ class KwargsForCausalLM(FlashAttentionKwargs, LossKwargs): ...
+
+
+ class OlmoForCausalLM(OlmoPreTrainedModel, GenerationMixin):
+     _tied_weights_keys = ["lm_head.weight"]
+     _tp_plan = {"lm_head": "colwise_rep"}
+     _pp_plan = {"lm_head": (["hidden_states"], ["logits"])}
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = OlmoModel(config)
+         self.vocab_size = config.vocab_size
+         self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.model.embed_tokens
+
+     def set_input_embeddings(self, value):
+         self.model.embed_tokens = value
+
+     def get_output_embeddings(self):
+         return self.lm_head
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head = new_embeddings
+
+     def set_decoder(self, decoder):
+         self.model = decoder
+
+     def get_decoder(self):
+         return self.model
+
+     @can_return_tuple
+     @add_start_docstrings_to_model_forward(OLMO_INPUTS_DOCSTRING)
+     @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+     def forward(
+         self,
+         input_ids: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[Cache] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         cache_position: Optional[torch.LongTensor] = None,
+         logits_to_keep: Union[int, torch.Tensor] = 0,
+         **kwargs: Unpack[KwargsForCausalLM],
+     ) -> CausalLMOutputWithPast:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+             config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+             (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+         logits_to_keep (`int` or `torch.Tensor`, *optional*):
+             If an `int`, compute logits for the last `logits_to_keep` tokens. If `0`, calculate logits for all
+             `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
+             token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
+             If a `torch.Tensor`, must be 1D corresponding to the indices to keep in the sequence length dimension.
+             This is useful when using packed tensor format (single dimension for batch and sequence length).
+
+         Returns:
+
+         Example:
+
+         ```python
+         >>> from transformers import AutoTokenizer, OlmoForCausalLM
+
+         >>> model = OlmoForCausalLM.from_pretrained("allenai/OLMo-7B-hf")
+         >>> tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")
+
+         >>> prompt = "Hey, are you conscious? Can you talk to me?"
+         >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+         >>> # Generate
+         >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+         >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+         ```"""
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+
+         # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+         outputs: BaseModelOutputWithPast = self.model(
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             cache_position=cache_position,
+             **kwargs,
+         )
+
+         hidden_states = outputs.last_hidden_state
+         # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+         slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+         logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+         loss = None
+         if labels is not None:
+             loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
+
+         return CausalLMOutputWithPast(
+             loss=loss,
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
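The `slice_indices` trick in `forward` relies on `slice(-0, None)` being identical to `slice(0, None)`, so `logits_to_keep=0` naturally selects the whole sequence while any positive integer keeps only the trailing positions. A list-based sketch of that selection (the helper name is illustrative):

```python
def slice_for_logits(rows, logits_to_keep):
    if isinstance(logits_to_keep, int):
        # int: keep the last `logits_to_keep` rows; 0 is a special case meaning "all"
        # (slice(-0, None) == slice(0, None), so one expression covers both)
        return rows[slice(-logits_to_keep, None)]
    # sequence of indices: gather those positions explicitly (packed-tensor use case)
    return [rows[i] for i in logits_to_keep]
```

During generation only the last position's logits are needed, so `logits_to_keep=1` avoids materializing a `(batch, seq_len, vocab)` logits tensor.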
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ class OlmoMoERouter(nn.Module):
+     """
+     Router module that uses random importance sampling instead of deterministic top-k.
+
+     This router computes logits for each expert, converts them to probabilities,
+     and then randomly samples experts based on these probabilities.
+     """
+     def __init__(self, config):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.num_experts = config.num_experts
+         self.router = nn.Linear(self.hidden_size, self.num_experts, bias=False)
+         self.top_k = config.num_selected_experts
+         self.temperature = getattr(config, "router_temperature", 1.0)
+
+     def forward(self, hidden_states):
+         """
+         Args:
+             hidden_states: [batch_size, sequence_length, hidden_size]
+
+         Returns:
+             routing_weights: [batch_size, sequence_length, top_k]
+             routing_indices: [batch_size, sequence_length, top_k]
+         """
+         batch_size, sequence_length, _ = hidden_states.shape
+
+         # Compute router logits and apply temperature
+         router_logits = self.router(hidden_states) / self.temperature  # [batch_size, sequence_length, num_experts]
+
+         # Convert to probabilities using softmax
+         router_probs = F.softmax(router_logits, dim=-1)  # [batch_size, sequence_length, num_experts]
+
+         # For random importance sampling, we:
+         # 1. Add Gumbel noise to the log probabilities to induce randomness
+         # 2. Sample top-k experts using the perturbed probabilities
+
+         # Add Gumbel noise
+         gumbel_noise = -torch.log(-torch.log(torch.rand_like(router_probs) + 1e-10) + 1e-10)
+         perturbed_logits = torch.log(router_probs + 1e-10) + gumbel_noise
+
+         # Sample top-k experts based on the perturbed scores; the perturbed values are only
+         # used to pick indices -- the mixing weights below come from the clean probabilities
+         _, routing_indices = torch.topk(perturbed_logits, self.top_k, dim=-1)
+
+         # Gather the selected experts' (unperturbed) probabilities and re-normalize them
+         routing_weights = router_probs.gather(-1, routing_indices)
+         routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)
+
+         return routing_weights, routing_indices
+
+
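The router's sampling step is the Gumbel-top-k trick: adding independent Gumbel noise to log-probabilities and taking the top-k yields a random sample of k distinct experts whose selection frequencies track the router probabilities. A pure-Python sketch of why this works (function name is illustrative; the `1e-12` guards mirror the `1e-10` guards above):

```python
import math
import random

def gumbel_topk(probs, k, rng):
    """Sample k distinct indices whose marginal selection follows `probs`."""
    perturbed = [
        # log p_i + Gumbel(0, 1) noise, with small epsilons to avoid log(0)
        math.log(p + 1e-10) - math.log(-math.log(rng.random() + 1e-12) + 1e-12)
        for p in probs
    ]
    # indices of the k largest perturbed scores, like torch.topk(..., k).indices
    return sorted(range(len(probs)), key=lambda i: perturbed[i], reverse=True)[:k]
```

With k = 1 this reduces to ordinary categorical sampling, which a quick Monte Carlo run confirms: each index is picked roughly in proportion to its probability.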
+ class OlmoExpertMLP(nn.Module):
+     """
+     Expert MLP module similar to OlmoMLP but used in the MoE architecture.
+     """
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, x):
+         down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+         return down_proj
+
+
+ class OlmoMixtureOfExperts(nn.Module):
+     """
+     Mixture of Experts layer that replaces the standard MLP in OLMo.
+     """
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.num_experts = config.num_experts
+         self.num_selected_experts = config.num_selected_experts  # top_k
+
+         # Create router
+         self.router = OlmoMoERouter(config)
+
+         # Create experts
+         self.experts = nn.ModuleList([OlmoExpertMLP(config) for _ in range(self.num_experts)])
+
+         # Expert capacity factor (to avoid load balancing issues)
+         self.capacity_factor = getattr(config, "expert_capacity_factor", 1.0)
+
+     def forward(self, hidden_states):
+         """
+         Args:
+             hidden_states: [batch_size, sequence_length, hidden_size]
+
+         Returns:
+             outputs: [batch_size, sequence_length, hidden_size]
+         """
+         batch_size, sequence_length, hidden_size = hidden_states.shape
+
+         # Get routing weights and indices
+         routing_weights, routing_indices = self.router(hidden_states)
+
+         # Reshape tensors for processing
+         flat_hidden_states = hidden_states.reshape(-1, hidden_size)  # [batch_size * sequence_length, hidden_size]
+
+         # Initialize expert outputs
+         final_output = torch.zeros_like(flat_hidden_states)
+
+         # For each expert, compute its contribution
+         for expert_idx in range(self.num_experts):
+             # Create a mask to identify which tokens use this expert
+             expert_mask = (routing_indices == expert_idx).any(dim=-1).reshape(-1)
+
+             if not expert_mask.any():
+                 continue  # Skip if no token routes to this expert
+
+             # Get the hidden states for tokens routed to this expert
+             expert_inputs = flat_hidden_states[expert_mask]
+
+             # Process these hidden states through the expert
+             expert_outputs = self.experts[expert_idx](expert_inputs)
+
+             # Gather this expert's routing weights (one slot per selected token, in token order,
+             # since top-k indices are distinct within a token)
+             expert_weights = routing_weights[routing_indices == expert_idx].reshape(-1, 1)
+
+             # Multiply outputs by the routing weights
+             weighted_outputs = expert_outputs * expert_weights
+
+             # Combine the expert outputs into the final output tensor
+             final_output[expert_mask] += weighted_outputs
+
+         # Reshape back to original dimensions
+         final_output = final_output.reshape(batch_size, sequence_length, hidden_size)
+
+         return final_output
+
+
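Per token, the loop above amounts to a weighted sum of the selected experts' outputs under the re-normalized routing weights. A scalar sketch of that combination step (names and the toy experts are illustrative, not the model's experts):

```python
def moe_combine(tokens, routing, experts):
    """tokens: list of scalars; routing: per-token list of (expert_idx, weight);
    experts: list of callables acting as toy experts."""
    out = []
    for x, choices in zip(tokens, routing):
        # weighted sum of the selected experts' outputs, as in OlmoMixtureOfExperts.forward
        out.append(sum(w * experts[i](x) for i, w in choices))
    return out
```

So a token routed to experts 0 and 1 with equal weight receives the average of both outputs, while a token routed to a single expert receives that expert's output unscaled.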
+ # Modified OlmoDecoderLayer to use MoE instead of standard MLP
+ class OlmoMoEDecoderLayer(GradientCheckpointingLayer):
+     def __init__(self, config: OlmoConfig, layer_idx: int):
+         super().__init__()
+         self.hidden_size = config.hidden_size
+         self.self_attn = OlmoAttention(config=config, layer_idx=layer_idx)
+
+         # Use MoE instead of standard MLP
+         self.mlp = OlmoMixtureOfExperts(config)
+         self.input_layernorm = OlmoLayerNorm(config.hidden_size)
+         self.post_attention_layernorm = OlmoLayerNorm(config.hidden_size)
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: Optional[bool] = False,
+         use_cache: Optional[bool] = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+         **kwargs: Unpack[FlashAttentionKwargs],
+     ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+         residual = hidden_states
+         hidden_states = self.input_layernorm(hidden_states)
+
+         # Self Attention
+         hidden_states, self_attn_weights = self.self_attn(
+             hidden_states=hidden_states,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_value=past_key_value,
+             output_attentions=output_attentions,
+             use_cache=use_cache,
+             cache_position=cache_position,
+             position_embeddings=position_embeddings,
+             **kwargs,
+         )
+         hidden_states = residual + hidden_states
+
+         # MoE instead of Fully Connected
+         residual = hidden_states
+         hidden_states = self.post_attention_layernorm(hidden_states)
+         hidden_states = self.mlp(hidden_states)
+         hidden_states = residual + hidden_states
+
+         outputs = (hidden_states,)
+         if output_attentions:
+             outputs += (self_attn_weights,)
+
+         return outputs
+
+
+ # Modified OlmoConfig to include MoE-specific parameters
+ class OlmoMoEConfig(OlmoConfig):
+     def __init__(
+         self,
+         num_experts=8,
+         num_selected_experts=2,
+         expert_capacity_factor=1.0,
+         router_temperature=0.1,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+         self.num_experts = num_experts
+         self.num_selected_experts = num_selected_experts
+         self.expert_capacity_factor = expert_capacity_factor
+         self.router_temperature = router_temperature
+
+
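`OlmoMoEConfig` layers the MoE hyperparameters on top of the base config and forwards everything else through `**kwargs`, so base fields like `hidden_size` still reach `OlmoConfig.__init__` untouched. A minimal mimic of the pattern (the classes here are illustrative stand-ins, not the transformers classes):

```python
class BaseConfig:
    def __init__(self, hidden_size=64, **kwargs):
        self.hidden_size = hidden_size

class MoEConfig(BaseConfig):
    def __init__(self, num_experts=8, num_selected_experts=2, **kwargs):
        super().__init__(**kwargs)  # base hyperparameters pass through untouched
        self.num_experts = num_experts
        self.num_selected_experts = num_selected_experts
```

Unspecified MoE parameters fall back to their defaults, so a caller can override only what differs from the base model.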
+ # Modified OlmoModel to use MoE decoder layers
+ class OlmoMoEModel(OlmoModel):
+     def __init__(self, config: OlmoMoEConfig):
+         OlmoPreTrainedModel.__init__(self, config)
+         self.padding_idx = config.pad_token_id
+         self.vocab_size = config.vocab_size
+
+         self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+         # Use MoE decoder layers
+         self.layers = nn.ModuleList(
+             [OlmoMoEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+         )
+         self.norm = OlmoLayerNorm(config.hidden_size)
+         self.rotary_emb = OlmoRotaryEmbedding(config=config)
+         self.gradient_checkpointing = False
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+
1036
+ # Modified OlmoForCausalLM to use MoE model
1037
+ class OlmoMoEForCausalLM(OlmoForCausalLM):
1038
+ def __init__(self, config):
1039
+ OlmoPreTrainedModel.__init__(self, config)
1040
+ self.model = OlmoMoEModel(config)
1041
+ self.vocab_size = config.vocab_size
1042
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1043
+
1044
+ # Initialize weights and apply final processing
1045
+ self.post_init()
1046
+
+__all__ = ["OlmoForCausalLM", "OlmoModel", "OlmoPreTrainedModel", "OlmoMoEForCausalLM", "OlmoMoEModel", "OlmoMoEConfig"]
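The `OlmoMoEConfig` above introduces `num_experts`, `num_selected_experts`, and `router_temperature`, while the routing itself lives inside the MoE MLP (not shown in this hunk). As a rough illustration of what those fields control, here is a minimal temperature-scaled top-k routing sketch in pure Python; the `route` function and its inputs are illustrative assumptions, not code from this commit:

```python
import math

def route(logits, num_selected_experts=2, router_temperature=0.1):
    """Temperature-scaled softmax over per-expert router logits, then top-k.

    Returns (selected_expert_indices, normalized_mixture_weights).
    """
    scaled = [l / router_temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:num_selected_experts]
    # Renormalize so the selected experts' weights sum to 1
    denom = sum(probs[i] for i in top)
    return top, [probs[i] / denom for i in top]
```

A low `router_temperature` such as the default `0.1` sharpens the distribution, so the top-`num_selected_experts` weights dominate the mixture.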
requirements.txt ADDED
@@ -0,0 +1,9 @@
+torch>=2.0.0
+transformers>=4.34.0
+accelerate>=0.25.0
+datasets>=2.14.0
+tqdm>=4.66.0
+bitsandbytes>=0.41.0   # For 8-bit training if needed
+sentencepiece>=0.1.99  # For tokenization
+protobuf>=4.23.4       # For datasets loading
+tensorboard>=2.13.0    # For training monitoring
shellcommands.txt ADDED
@@ -0,0 +1,3 @@
+conda activate rlmoe
+cd SkipMoE
+python train.py
train.py ADDED
@@ -0,0 +1,130 @@
+#!/usr/bin/env python
+# train.py
+# Builds the argument list and launches train_olmoe_adapter.py.
+"""
+Run script for fine-tuning OlmoE with adapters on specific text domains.
+Handles argument parsing and configuration.
+"""
+
+import os
+import sys
+from dataclasses import dataclass, field
+
+from transformers import HfArgumentParser
+
+
+@dataclass
+class ScriptArguments:
+    """
+    Arguments for the run script that aren't covered by TrainingArguments.
+    """
+    model_path: str = field(
+        default="allenai/OLMo-7B-Instruct",
+        metadata={"help": "Path to the model to fine-tune"}
+    )
+    output_dir: str = field(
+        default="./output_olmoe_adapter",
+        metadata={"help": "Directory to save the model and logs"}
+    )
+    adapter_size: int = field(
+        default=64,
+        metadata={"help": "Size of the adapter layers"}
+    )
+    dataset_name: str = field(
+        default="mlfoundations/dclm-baseline-1.0",
+        metadata={"help": "Name of the dataset to use"}
+    )
+    max_steps: int = field(
+        default=10000,
+        metadata={"help": "Maximum number of training steps"}
+    )
+    learning_rate: float = field(
+        default=5e-5,
+        metadata={"help": "Learning rate for fine-tuning"}
+    )
+    per_device_batch_size: int = field(
+        default=8,
+        metadata={"help": "Batch size per device"}
+    )
+    gradient_accumulation_steps: int = field(
+        default=1,
+        metadata={"help": "Number of steps to accumulate gradients"}
+    )
+    # use_8bit: bool = field(
+    #     default=False,
+    #     metadata={"help": "Whether to use 8-bit precision"}
+    # )
+    # use_4bit: bool = field(
+    #     default=False,
+    #     metadata={"help": "Whether to use 4-bit precision"}
+    # )
+
+
+def main():
+    # Parse command-line arguments
+    parser = HfArgumentParser(ScriptArguments)
+    args = parser.parse_args_into_dataclasses()[0]
+
+    # Create output directory
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    # Prepare command for training
+    cmd = [
+        "python",
+        "train_olmoe_adapter.py",
+
+        # Model arguments
+        f"--model_name_or_path={args.model_path}",
+        f"--adapter_size={args.adapter_size}",
+        "--freeze_base_model=True",  # Always freeze the base model
+        f"--checkpoint_dir={args.output_dir}",
+
+        # Data arguments
+        f"--dataset_name={args.dataset_name}",
+        "--streaming=True",  # Always stream for large datasets
+        "--streaming_buffer_size=8192",
+        "--max_seq_length=1024",
+
+        # Training arguments
+        f"--output_dir={args.output_dir}",
+        f"--per_device_train_batch_size={args.per_device_batch_size}",
+        f"--gradient_accumulation_steps={args.gradient_accumulation_steps}",
+        f"--learning_rate={args.learning_rate}",
+        f"--max_steps={args.max_steps}",
+        "--warmup_steps=500",
+        "--logging_steps=10",
+        "--save_steps=1000",
+        "--save_total_limit=2",
+        "--dataloader_num_workers=4",
+        "--seed=42",
+    ]
+
+    # Add precision flags if needed
+    # if args.use_8bit:
+    #     cmd.append("--load_in_8bit")
+    # if args.use_4bit:
+    #     cmd.append("--load_in_4bit")
+
+    # Print the command for logging
+    cmd_str = " ".join(cmd)
+    print(f"Running command: {cmd_str}")
+
+    # Execute the training script
+    os.environ["PYTHONPATH"] = os.getcwd()
+    ret = os.system(cmd_str)
+
+    if ret != 0:
+        print(f"Training failed with exit code {ret}")
+        sys.exit(ret)
+
+    print("Training completed successfully!")
+
+
+if __name__ == "__main__":
+    main()
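One caveat worth noting about `train.py` above: joining `cmd` with plain spaces and handing the string to `os.system` breaks as soon as an argument value contains a space. If that ever matters, a small standard-library sketch (the `run_command` helper is illustrative, not part of this commit) quotes the log line with `shlex` and skips the shell entirely via `subprocess`:

```python
import shlex
import subprocess

def run_command(cmd):
    """Run a command list: shell-quote it for logging, but exec without a shell."""
    print(f"Running command: {shlex.join(cmd)}")  # quoted string, safe to copy-paste
    return subprocess.run(cmd, check=False).returncode
```

Passing the list directly to `subprocess.run` means each argument reaches the child process intact, no matter what characters it contains.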
train_olmoe_adapter.py ADDED
@@ -0,0 +1,404 @@
+#!/usr/bin/env python
+# train_olmoe_adapter.py
+"""
+Training script for the OlmoE model with adapters on the mlfoundations/dclm-baseline-1.0 dataset.
+This script demonstrates parameter-efficient fine-tuning using adapters.
+"""
+
+import os
+import logging
+from dataclasses import dataclass, field
+from typing import Optional, Tuple
+
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader, IterableDataset
+from torch.optim import AdamW
+
+from datasets import load_dataset
+from transformers import (
+    OlmoConfig,
+    OlmoForCausalLM,
+    AutoTokenizer,
+    DataCollatorForLanguageModeling,
+    HfArgumentParser,
+    TrainingArguments,
+    set_seed,
+    get_scheduler,
+)
+from tqdm import tqdm
+from accelerate import Accelerator
+
+from modeling_olmoe import (
+    OlmoEWithAdaptersForCausalLM,
+    OlmoEForCausalLM,
+)
+
+# Set up logging
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+    datefmt="%m/%d/%Y %H:%M:%S",
+    level=logging.INFO,
+)
+
+
+@dataclass
+class ModelArguments:
+    """Arguments for model configuration."""
+    model_name_or_path: str = field(
+        default="allenai/OLMo-7B-Instruct",
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    adapter_size: int = field(
+        default=64,
+        metadata={"help": "Size of the adapter layers"}
+    )
+    freeze_base_model: bool = field(
+        default=True,
+        metadata={"help": "Whether to freeze all parameters except the adapters"}
+    )
+    checkpoint_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Path to save model checkpoints"}
+    )
+
+
+@dataclass
+class DataArguments:
+    """Arguments for dataset configuration."""
+    dataset_name: str = field(
+        default="mlfoundations/dclm-baseline-1.0",
+        metadata={"help": "Dataset name or path for training"}
+    )
+    streaming: bool = field(
+        default=True,
+        metadata={"help": "Whether to stream the dataset"}
+    )
+    streaming_buffer_size: int = field(
+        default=8192,
+        metadata={"help": "Buffer size for streaming dataset"}
+    )
+    max_seq_length: int = field(
+        default=1024,
+        metadata={"help": "Maximum sequence length for training"}
+    )
+    preprocessing_num_workers: Optional[int] = field(
+        default=None,
+        metadata={"help": "Number of workers for preprocessing"}
+    )
+    text_column_name: str = field(
+        default="text",
+        metadata={"help": "Column name for text data"}
+    )
+
+class StreamingTextDataset(IterableDataset):
+    """Dataset for streaming text data."""
+
+    def __init__(
+        self,
+        dataset_name: str,
+        tokenizer,
+        max_seq_length: int,
+        streaming: bool = True,
+        text_column_name: str = "text",
+        buffer_size: int = 8192,
+        split: str = "train",
+    ):
+        self.tokenizer = tokenizer
+        self.max_seq_length = max_seq_length
+        self.text_column_name = text_column_name
+
+        # Load dataset in streaming mode
+        self.dataset = load_dataset(
+            dataset_name,
+            split=split,
+            streaming=streaming,
+        )
+        if streaming:
+            # Shuffle through a fixed-size buffer while streaming
+            self.dataset = self.dataset.shuffle(buffer_size=buffer_size)
+
+    def __iter__(self):
+        buffer = []
+
+        for example in self.dataset:
+            text = example[self.text_column_name]
+            if not text or len(text.strip()) == 0:
+                continue
+
+            tokenized = self.tokenizer(
+                text,
+                truncation=False,
+                return_attention_mask=False,
+                return_token_type_ids=False,
+                add_special_tokens=False,
+            )
+
+            buffer.extend(tokenized["input_ids"])
+
+            # Yield complete sequences; the remainder stays in the buffer
+            while len(buffer) >= self.max_seq_length:
+                yield {
+                    "input_ids": torch.tensor(buffer[:self.max_seq_length], dtype=torch.long),
+                    "labels": torch.tensor(buffer[:self.max_seq_length], dtype=torch.long),
+                }
+                buffer = buffer[self.max_seq_length:]
+
+
+def create_optimizer_and_scheduler(
+    model: nn.Module,
+    args: TrainingArguments,
+    num_training_steps: int,
+) -> Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LRScheduler]:
+    """Create optimizer and learning rate scheduler."""
+
+    # Use only trainable parameters when adapters are trained with a frozen base model
+    if hasattr(model, "get_trainable_parameters"):
+        optimizer_params = model.get_trainable_parameters()
+        logger.info(f"Training with {len(optimizer_params)} trainable parameters")
+    else:
+        # No parameter filtering - take all parameters that require grad
+        optimizer_params = [p for p in model.parameters() if p.requires_grad]
+        logger.info(f"Training with {len(optimizer_params)} parameters")
+
+    # Create optimizer
+    optimizer = AdamW(
+        optimizer_params,
+        lr=args.learning_rate,
+        betas=(args.adam_beta1, args.adam_beta2),
+        eps=args.adam_epsilon,
+        weight_decay=args.weight_decay,
+    )
+
+    # Create scheduler
+    scheduler = get_scheduler(
+        name=args.lr_scheduler_type,
+        optimizer=optimizer,
+        num_warmup_steps=args.warmup_steps,
+        num_training_steps=num_training_steps,
+    )
+
+    return optimizer, scheduler
+
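`StreamingTextDataset.__iter__` above packs token ids from consecutive documents into a rolling buffer and emits fixed-length blocks, discarding any short remainder at end of stream. The packing logic can be isolated and checked with a toy example (the `pack` helper below is an illustration, not code from this commit):

```python
def pack(token_streams, max_seq_length):
    """Concatenate per-document token lists and cut them into fixed-length blocks.

    Leftover tokens shorter than max_seq_length stay in the buffer and are
    dropped when the stream ends, mirroring the dataset's __iter__.
    """
    buffer = []
    for ids in token_streams:
        buffer.extend(ids)
        while len(buffer) >= max_seq_length:
            yield buffer[:max_seq_length]
            buffer = buffer[max_seq_length:]
```

With `max_seq_length=4`, the three short "documents" `[1,2,3]`, `[4,5]`, `[6,7,8,9]` yield two full blocks and silently drop the one-token remainder.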
+def train(
+    model_args: ModelArguments,
+    data_args: DataArguments,
+    training_args: TrainingArguments,
+):
+    """Main training function."""
+
+    # Set up accelerator
+    accelerator = Accelerator(
+        gradient_accumulation_steps=training_args.gradient_accumulation_steps,
+        mixed_precision="fp16" if training_args.fp16 else "bf16" if training_args.bf16 else "no",
+    )
+
+    # Log information about the training setup
+    logger.info(accelerator.state)
+    if accelerator.is_local_main_process:
+        logger.info(f"Model arguments: {model_args}")
+        logger.info(f"Data arguments: {data_args}")
+        logger.info(f"Training arguments: {training_args}")
+
+    # Set seed for reproducibility
+    set_seed(training_args.seed)
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+
+    # Ensure the tokenizer has a padding token set
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+
+    # Load model config and update it with the adapter size
+    config = OlmoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+    config.adapter_size = model_args.adapter_size
+
+    # Load the base model
+    logger.info(f"Loading OlmoE model with adapters from {model_args.model_name_or_path}")
+    base_model = OlmoForCausalLM.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+
+    # Create the adapter model from the base model weights
+    model = OlmoEWithAdaptersForCausalLM(config)
+
+    # Copy weights from the base model to the adapter model.
+    # strict=False because the adapter parameters have no counterpart in the base checkpoint.
+    model.load_state_dict(base_model.state_dict(), strict=False)
+
+    # Freeze base model parameters if requested
+    if model_args.freeze_base_model:
+        logger.info("Freezing base model parameters")
+        model.freeze_base_model()
+
+    # Set up streaming dataset
+    logger.info(f"Loading dataset: {data_args.dataset_name}")
+    train_dataset = StreamingTextDataset(
+        dataset_name=data_args.dataset_name,
+        tokenizer=tokenizer,
+        max_seq_length=data_args.max_seq_length,
+        streaming=data_args.streaming,
+        buffer_size=data_args.streaming_buffer_size,
+        text_column_name=data_args.text_column_name,
+    )
+
+    # Data collator to handle batching
+    data_collator = DataCollatorForLanguageModeling(
+        tokenizer=tokenizer,
+        mlm=False,
+    )
+
+    # Create data loader
+    train_dataloader = DataLoader(
+        train_dataset,
+        batch_size=training_args.per_device_train_batch_size,
+        collate_fn=data_collator,
+        num_workers=data_args.preprocessing_num_workers or 0,
+    )
+
+    # For a streaming dataset the epoch length is unknown, so train for a fixed number of steps
+    num_update_steps_per_epoch = training_args.max_steps
+    num_training_steps = training_args.max_steps
+
+    # Create optimizer and scheduler
+    optimizer, lr_scheduler = create_optimizer_and_scheduler(
+        model=model,
+        args=training_args,
+        num_training_steps=num_training_steps,
+    )
+
+    # Prepare for distributed training with the accelerator
+    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
+        model, optimizer, train_dataloader, lr_scheduler
+    )
+
+    # Compute the total batch size for logging
+    total_batch_size = (
+        training_args.per_device_train_batch_size
+        * accelerator.num_processes
+        * training_args.gradient_accumulation_steps
+    )
+    logger.info(f"Total batch size (with parallelism & accumulation): {total_batch_size}")
+    logger.info(f"Number of training steps: {num_training_steps}")
+    logger.info(f"Number of warmup steps: {training_args.warmup_steps}")
+
+    # Keep track of training progress
+    progress_bar = tqdm(
+        range(num_training_steps),
+        disable=not accelerator.is_local_main_process,
+        desc="Training",
+    )
+    completed_steps = 0
+    starting_epoch = 0
+    global_step = 0
+
+    # Training loop
+    logger.info("Starting training...")
+    model.train()
+
+    for step, batch in enumerate(train_dataloader):
+        # Skip steps when resuming
+        if starting_epoch > 0 and step < starting_epoch * num_update_steps_per_epoch:
+            progress_bar.update(1)
+            continue
+
+        with accelerator.accumulate(model):
+            # Forward pass
+            outputs = model(**batch)
+            loss = outputs.loss
+
+            # Backward pass
+            accelerator.backward(loss)
+
+            # Update weights
+            optimizer.step()
+            lr_scheduler.step()
+            optimizer.zero_grad()
+
+        # Update progress bar
+        progress_bar.update(1)
+        completed_steps += 1
+        global_step += 1
+
+        # Log metrics
+        if global_step % training_args.logging_steps == 0:
+            # Gather loss from all processes
+            loss_value = accelerator.gather(loss).mean().item()
+            logger.info(f"Step {global_step}: loss = {loss_value:.4f}, lr = {lr_scheduler.get_last_lr()[0]:.8f}")
+
+            # Log to the tracker (e.g. tensorboard) if one is configured
+            if accelerator.trackers and hasattr(accelerator.trackers[0], "store"):
+                accelerator.trackers[0].store({
+                    "loss": loss_value,
+                    "learning_rate": lr_scheduler.get_last_lr()[0],
+                    "step": global_step,
+                })
+
+        # Save checkpoint
+        if training_args.save_steps > 0 and global_step % training_args.save_steps == 0:
+            if model_args.checkpoint_dir is not None:
+                output_dir = os.path.join(model_args.checkpoint_dir, f"checkpoint-{global_step}")
+                accelerator.save_state(output_dir)
+                logger.info(f"Saved checkpoint to {output_dir}")
+
+                # Save the model separately
+                if accelerator.is_main_process:
+                    unwrapped_model = accelerator.unwrap_model(model)
+                    unwrapped_model.save_pretrained(
+                        output_dir,
+                        is_main_process=accelerator.is_main_process,
+                        save_function=accelerator.save,
+                    )
+                    tokenizer.save_pretrained(output_dir)
+
+        # Check if we've reached max steps
+        if completed_steps >= num_training_steps:
+            break
+
+    # Save final model
+    if model_args.checkpoint_dir is not None:
+        output_dir = os.path.join(model_args.checkpoint_dir, "final-model")
+        accelerator.save_state(output_dir)
+
+        # Save the model separately
+        if accelerator.is_main_process:
+            unwrapped_model = accelerator.unwrap_model(model)
+            unwrapped_model.save_pretrained(
+                output_dir,
+                is_main_process=accelerator.is_main_process,
+                save_function=accelerator.save,
+            )
+            tokenizer.save_pretrained(output_dir)
+
+        logger.info(f"Saved final model to {output_dir}")
+
+    logger.info("Training complete!")
+
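The `total_batch_size` logged in `train()` above is simply the product of per-device batch size, process count, and gradient-accumulation steps; multiplying by the sequence length gives tokens consumed per optimizer update. A quick sanity-check helper (the `effective_batch` name is illustrative, not part of this commit; the numbers below use this repo's `train.py` defaults):

```python
def effective_batch(per_device_batch, num_processes, grad_accum_steps, max_seq_length):
    """Sequences and tokens consumed per optimizer update."""
    sequences = per_device_batch * num_processes * grad_accum_steps
    return sequences, sequences * max_seq_length
```

With the defaults (batch 8, one process, no accumulation, 1024-token sequences), each optimizer step sees 8 sequences, i.e. 8192 tokens.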
+def main():
+    """Main entry point."""
+    parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
+    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    # Set up the output directory
+    if model_args.checkpoint_dir is None:
+        model_args.checkpoint_dir = training_args.output_dir
+    os.makedirs(model_args.checkpoint_dir, exist_ok=True)
+
+    # Run training
+    train(model_args, data_args, training_args)
+
+
+if __name__ == "__main__":
+    main()