
No Kernels


First, we run the model without any custom kernels to get a reference point.


Forward

Cell: no_kernels | 106.70s

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
import time
import torch.nn as nn
from kernels import register_kernel_mapping, Mode, LayerRepository, replace_kernel_forward_from_hub
import sys
import torch.profiler
import gc
import logging
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssRMSNorm

# set to debug logging
logging.basicConfig(level=logging.INFO)

def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    gc.collect()

def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }

def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)

model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=False,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 256

with torch.inference_mode():
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {end_time - start_time:.2f} seconds")
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-24

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

What is Tensor Parallelism?

<|end|><|start|>assistant<|channel|>analysis<|message|>Need to explain tensor parallelism.<|end|><|start|>assistant<|channel|>final<|message|>**Tensor Parallelism** is a technique used to distribute the computation of a single tensor (or a set of tensors) across multiple devices (GPUs, TPUs, or other accelerators) in order to speed up large‑scale deep‑learning training and inference. It is one of several parallelism strategies—alongside data parallelism, pipeline parallelism, and model parallelism—designed to scale models that exceed the memory or compute capacity of a single device.

---

## 1. Why Tensor Parallelism?

- **Memory constraints**: Modern transformer models (e.g., GPT‑4, PaLM, LLaMA‑70B) can have billions of parameters. A single GPU may not hold the entire model or its intermediate activations.
- **Compute bottlenecks**: Even if memory were sufficient, the sheer number of floating‑point operations can make training prohibitively slow on a single device.
- **Scalability**: Tensor parallelism allows you to add more devices to reduce per‑device memory usage and accelerate matrix operations.

---

## 2. Core Idea

In tensor parallelism, a *tensor* (typically a weight matrix or an activation tensor

Generation took 25.73 seconds

Forward and Backward


Next, we'll attempt to run a forward and backward pass without any custom kernels. This will likely run out of memory since the default implementation is not optimized for memory usage.

Cell: forward_and_backward_no_kernel | 98.96s | FAILED

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
import time
import torch.nn as nn
from kernels import register_kernel_mapping, Mode, LayerRepository, replace_kernel_forward_from_hub
import sys
import torch.profiler
import gc
import logging
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssRMSNorm

# remove liger kernel for testing
replace_kernel_forward_from_hub(GptOssRMSNorm, None)

# set to debug logging
logging.basicConfig(level=logging.INFO)

def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    gc.collect()

def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }

def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)

model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=False,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 128  # Reduced to help with memory usage

# Clear memory before backward pass
reset_peak_memory_stats()
print(f"Pre-generation memory: {get_memory_stats()}")

# forward and backward pass
with torch.autograd.set_grad_enabled(True):
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()
    print(tokenizer.decode(generated[0], skip_special_tokens=False))
    print(f"Generation took {end_time - start_time:.2f} seconds")
    print(f"Post-generation memory: {get_memory_stats()}")

    # Use gradient checkpointing to reduce memory usage
    if hasattr(model, 'gradient_checkpointing_enable'):
        model.gradient_checkpointing_enable()
        print("Enabled gradient checkpointing")

    # Reduce sequence length if needed for memory
    max_seq_len = 512  # Limit sequence length for backward pass
    if generated.size(1) > max_seq_len:
        print(f"Truncating sequence from {generated.size(1)} to {max_seq_len} tokens")
        full_sequence = generated[:, -max_seq_len:]
    else:
        full_sequence = generated

    # Get model outputs for the full sequence
    model.train()  # Enable dropout and other training behaviors

    try:
        outputs = model(
            input_ids=full_sequence,
            labels=full_sequence,  # This will compute loss internally
            return_dict=True
        )
        print(f"Post-forward memory: {get_memory_stats()}")

        # If model doesn't compute loss, compute it manually
        if outputs.loss is None:
            shift_logits = outputs.logits[..., :-1, :].contiguous()
            shift_labels = full_sequence[..., 1:].contiguous()

            # Use CrossEntropyLoss with ignore_index for padding tokens
            loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else -100)
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )
        else:
            loss = outputs.loss

        print(f"Loss: {loss.item():.4f}")

        # Clear intermediate tensors to save memory
        del outputs
        torch.cuda.empty_cache()

        # Perform backward pass with memory management
        print("Running backward pass...")
        print(f"Pre-backward memory: {get_memory_stats()}")

        loss.backward()
        print(f"Post-backward memory: {get_memory_stats()}")

    except torch.cuda.OutOfMemoryError as e:
        print(f"OOM during forward/backward pass: {e}")
        print("Try reducing max_tokens or max_seq_len")
        raise

    # Calculate gradient statistics and print sample gradients
    total_norm = 0.0
    param_count = 0
    grad_samples = {}

    for name, p in model.named_parameters():
        if p.grad is not None:
            param_count += 1
            grad_norm = p.grad.data.norm(2).item()
            total_norm += grad_norm ** 2

            # Collect gradient statistics for key layers
            if any(key in name for key in ['embed', 'lm_head', 'mlp.up', 'mlp.down', 'self_attn.q_proj', 'norm']):
                grad_samples[name] = {
                    'norm': grad_norm,
                    'mean': p.grad.data.mean().item(),
                    'std': p.grad.data.std().item(),
                    'max': p.grad.data.max().item(),
                    'min': p.grad.data.min().item(),
                }

    total_norm = total_norm ** 0.5

    print(f"\nGradient norm: {total_norm:.4f}")
    print(f"Parameters with gradients: {param_count}")

    # Print sample gradients from important layers
    print("\nSample gradient statistics:")
    for i, (name, stats) in enumerate(list(grad_samples.items())[:10]):
        print(f"  {name[:60]:<60} | norm: {stats['norm']:.4e} | mean: {stats['mean']:.4e} | std: {stats['std']:.4e}")

    # Optional: zero gradients for next iteration
    model.zero_grad()
    model.eval()  # Switch back to eval mode
Pre-generation memory: {'allocated_gb': 9.390148608, 'peak_gb': 9.390148608, 'reserved_gb': 17.177772032}
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-24

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

What is Tensor Parallelism?

<|end|><|start|>assistant<|channel|>analysis<|message|>Need to explain tensor parallelism.<|end|><|start|>assistant<|channel|>final<|message|>**Tensor Parallelism** is a technique used to distribute the computation of a single tensor (or a set of tensors) across multiple devices (GPUs, TPUs, or other accelerators) in order to speed up large‑scale deep‑learning training and inference. It is one of several parallelism strategies—alongside data parallelism, pipeline parallelism, and model parallelism—designed to scale models that exceed the memory or compute capacity of a single device.

---

## 1. Why Tensor Parallelism?

- **Memory constraints**: Modern
Generation took 13.16 seconds
Post-generation memory: {'allocated_gb': 9.398670336, 'peak_gb': 9.514059776, 'reserved_gb': 17.188257792}
Enabled gradient checkpointing
Post-forward memory: {'allocated_gb': 9.487933952, 'peak_gb': 9.514059776, 'reserved_gb': 17.188257792}
Loss: 1.9761
Running backward pass...
Pre-backward memory: {'allocated_gb': 9.405890048, 'peak_gb': 9.514059776, 'reserved_gb': 17.177772032}
OOM during forward/backward pass: CUDA out of memory. Tried to allocate 508.00 MiB. GPU 2 has a total capacity of 22.30 GiB of which 118.69 MiB is free. Process 68744 has 22.18 GiB memory in use. Of the allocated memory 21.52 GiB is allocated by PyTorch, and 357.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Try reducing max_tokens or max_seq_len
UV install logs (trimmed):

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Traceback (most recent call last):
  File "/repo/moe_benchmarks/megablocks/.uvnote/cells/forward_and_backward_no_kernel.py", line 154, in <module>
    loss.backward()
  ...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB. GPU 2 has a total capacity of 22.30 GiB of which 118.69 MiB is free. Process 68744 has 22.18 GiB memory in use. Of the allocated memory 21.52 GiB is allocated by PyTorch, and 357.89 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
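
The allocator's error message suggests enabling expandable segments to reduce fragmentation. A minimal sketch of that mitigation is below; note this is an assumption drawn from the error text, it only addresses fragmentation, and a run this close to the memory limit may still go OOM.

import os

# Set the allocator option before torch initializes CUDA (i.e. before the first CUDA call).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var so the allocator picks it up

assert torch.cuda.is_available()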

Kernels


Next we can run with Megablocks kernels enabled.
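
The only change relative to the cells above is passing `use_kernels=True` at load time; with that flag, transformers resolves the MoE MLP to the `MegaBlocksMoeMLP` layer from the `kernels-community/megablocks` repo on the Hub (visible in the install logs further down). A minimal sketch of the load call, using the same model and quantization config as before:

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config

model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)

# Same load call as the no-kernel cells, but with use_kernels=True so the
# MoE MLP forward is served by the Hub-provided Megablocks kernel.
model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=Mxfp4Config(dequantize=True),
).eval()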


Forward


First, we run a forward pass with Megablocks kernels.
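
This follows the same pattern as the no-kernel forward cell. A minimal sketch, assuming `model` and `tokenizer` are loaded with `use_kernels=True` as in the snippet above:

import time
import torch

messages = [{"role": "system", "content": "What is Tensor Parallelism?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

# Greedy decoding with no gradients: this exercises only the Megablocks forward kernels.
with torch.inference_mode():
    start = time.perf_counter()
    generated = model.generate(**inputs, max_new_tokens=256, do_sample=False, temperature=None)
    elapsed = time.perf_counter() - start

print(tokenizer.decode(generated[0], skip_special_tokens=False))
print(f"Generation took {elapsed:.2f} seconds")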


Forward and Backward


Next, we run a forward and backward pass with Megablocks kernels enabled. This should be more memory efficient and allow us to complete the backward pass without running out of memory.

Cell: forward_and_backward | 106.33s

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "accelerate>=1.10.1",
#     "torch>=2.7.0",
#     "kernels==0.10.0",
#     "transformers@https://github.com/huggingface/transformers.git",
#     "ipdb>=0.13.13",
#     "matplotlib>=3.7.2",
#     "numpy>=1.24.3",
# ]
# ///

import torch
from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
import time
import torch.nn as nn
from kernels import register_kernel_mapping, Mode, LayerRepository, replace_kernel_forward_from_hub
import sys
import torch.profiler
import gc
import logging
from transformers.models.gpt_oss.modeling_gpt_oss import GptOssRMSNorm

# remove liger kernel for testing
replace_kernel_forward_from_hub(GptOssRMSNorm, None)

# set to debug logging
logging.basicConfig(level=logging.INFO)

def reset_peak_memory_stats():
    """Clear CUDA cache and reset memory allocation counters."""
    torch.cuda.empty_cache()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    gc.collect()

def get_memory_stats():
    """Get current and peak CUDA memory usage."""
    if not torch.cuda.is_available():
        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
    return {
        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
    }

def override_kernel_layer_name(cls_name: str, value) -> bool:
    """Helper to dynamically override the kernel_layer_name in a model class."""
    for mod in sys.modules.values():
        if mod is None:
            continue
        obj = getattr(mod, cls_name, None)
        if isinstance(obj, type) and issubclass(obj, nn.Module):
            setattr(obj, "kernel_layer_name", value)
            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
            return True
    return False


# Init the model the normal way
model_id = "openai/gpt-oss-20b"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
quantization_config = Mxfp4Config(dequantize=True)

model = GptOssForCausalLM.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    use_kernels=True,
    quantization_config=quantization_config,
).eval()

messages = [
    {"role": "system", "content": "What is Tensor Parallelism?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to("cuda")

max_tokens = 128  # Reduced to help with memory usage

# Clear memory before backward pass
reset_peak_memory_stats()
print(f"Pre-generation memory: {get_memory_stats()}")

# forward and backward pass
with torch.autograd.set_grad_enabled(True):
    start_time = time.perf_counter()
    generated = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,
        temperature=None,
    )
    end_time = time.perf_counter()
    print(tokenizer.decode(generated[0], skip_special_tokens=False))
    print(f"Generation took {end_time - start_time:.2f} seconds")
    print(f"Post-generation memory: {get_memory_stats()}")

    # Use gradient checkpointing to reduce memory usage
    if hasattr(model, 'gradient_checkpointing_enable'):
        model.gradient_checkpointing_enable()
        print("Enabled gradient checkpointing")

    # Reduce sequence length if needed for memory
    max_seq_len = 512  # Limit sequence length for backward pass
    if generated.size(1) > max_seq_len:
        print(f"Truncating sequence from {generated.size(1)} to {max_seq_len} tokens")
        full_sequence = generated[:, -max_seq_len:]
    else:
        full_sequence = generated

    # Get model outputs for the full sequence
    model.train()  # Enable dropout and other training behaviors

    try:
        outputs = model(
            input_ids=full_sequence,
            labels=full_sequence,  # This will compute loss internally
            return_dict=True
        )
        print(f"Post-forward memory: {get_memory_stats()}")

        # If model doesn't compute loss, compute it manually
        if outputs.loss is None:
            shift_logits = outputs.logits[..., :-1, :].contiguous()
            shift_labels = full_sequence[..., 1:].contiguous()

            # Use CrossEntropyLoss with ignore_index for padding tokens
            loss_fct = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else -100)
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1)
            )
        else:
            loss = outputs.loss

        print(f"Loss: {loss.item():.4f}")

        # Clear intermediate tensors to save memory
        del outputs
        torch.cuda.empty_cache()

        # Perform backward pass with memory management
        print("Running backward pass...")
        print(f"Pre-backward memory: {get_memory_stats()}")

        loss.backward()
        print(f"Post-backward memory: {get_memory_stats()}")

    except torch.cuda.OutOfMemoryError as e:
        print(f"OOM during forward/backward pass: {e}")
        print("Try reducing max_tokens or max_seq_len")
        raise

    # Calculate gradient statistics and print sample gradients
    total_norm = 0.0
    param_count = 0
    grad_samples = {}

    for name, p in model.named_parameters():
        if p.grad is not None:
            param_count += 1
            grad_norm = p.grad.data.norm(2).item()
            total_norm += grad_norm ** 2

            # Collect gradient statistics for key layers
            if any(key in name for key in ['embed', 'lm_head', 'mlp.up', 'mlp.down', 'self_attn.q_proj', 'norm']):
                grad_samples[name] = {
                    'norm': grad_norm,
                    'mean': p.grad.data.mean().item(),
                    'std': p.grad.data.std().item(),
                    'max': p.grad.data.max().item(),
                    'min': p.grad.data.min().item(),
                }

    total_norm = total_norm ** 0.5

    print(f"\nGradient norm: {total_norm:.4f}")
    print(f"Parameters with gradients: {param_count}")

    # Print sample gradients from important layers
    print("\nSample gradient statistics:")
    for i, (name, stats) in enumerate(list(grad_samples.items())[:10]):
        print(f"  {name[:60]:<60} | norm: {stats['norm']:.4e} | mean: {stats['mean']:.4e} | std: {stats['std']:.4e}")

    # Optional: zero gradients for next iteration
    model.zero_grad()
    model.eval()  # Switch back to eval mode
Pre-generation memory: {'allocated_gb': 9.390148608, 'peak_gb': 9.390148608, 'reserved_gb': 17.177772032}
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-24

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

What is Tensor Parallelism?

<|end|><|start|>assistant<|channel|>analysis<|message|>We need to explain what Tensor Parallelism is. It's a concept in distributed training of large language models. It refers to splitting the weight matrices (tensors) across multiple devices. Provide details: how it works, benefits, challenges, typical frameworks, etc. Also mention difference from data parallelism, pipeline parallelism. Provide example: splitting a weight matrix across GPUs, each GPU holds a slice, compute partial results, then gather. Provide mention of communication overhead, scaling, etc. Also mention that it's used in large models like GPT-3, Megatron-LM, DeepSpeed. Provide references. Also mention that it's
Generation took 17.99 seconds
Post-generation memory: {'allocated_gb': 9.398670336, 'peak_gb': 9.67278848, 'reserved_gb': 17.188257792}
Enabled gradient checkpointing
Post-forward memory: {'allocated_gb': 9.487933952, 'peak_gb': 9.67278848, 'reserved_gb': 17.188257792}
Loss: 2.8572
Running backward pass...
Pre-backward memory: {'allocated_gb': 9.405890048, 'peak_gb': 9.67278848, 'reserved_gb': 17.179869184}
Post-backward memory: {'allocated_gb': 18.801934336, 'peak_gb': 18.803661312, 'reserved_gb': 19.94391552}

Gradient norm: 133.4979
Parameters with gradients: 411

Sample gradient statistics:
  model.embed_tokens.weight                      | norm: 3.9844e-01 | mean: 4.5657e-10 | std: 1.6570e-05
  model.layers.0.self_attn.q_proj.weight         | norm: 6.1875e+00 | mean: 2.9430e-07 | std: 1.8082e-03
  model.layers.0.self_attn.q_proj.bias           | norm: 1.6797e-01 | mean: -2.6584e-05 | std: 2.6245e-03
  model.layers.0.input_layernorm.weight          | norm: 6.4941e-02 | mean: 1.1826e-04 | std: 1.2054e-03
  model.layers.0.post_attention_layernorm.weight | norm: 1.1084e-01 | mean: -5.7220e-05 | std: 2.0599e-03
  model.layers.1.self_attn.q_proj.weight         | norm: 8.3125e+00 | mean: 1.3784e-06 | std: 2.4109e-03
  model.layers.1.self_attn.q_proj.bias           | norm: 2.0215e-01 | mean: 8.4877e-05 | std: 3.1586e-03
  model.layers.1.input_layernorm.weight          | norm: 6.6406e-02 | mean: 5.7697e-05 | std: 1.2436e-03
  model.layers.1.post_attention_layernorm.weight | norm: 8.7891e-02 | mean: -4.9770e-06 | std: 1.6403e-03
  model.layers.2.self_attn.q_proj.weight         | norm: 4.5312e+00 | mean: 3.9116e-07 | std: 1.3199e-03
UV install logs (trimmed):

You are using full precision kernels, we will dequantize the model to bf16. To use the quantized model with quantization kernels, please set use_kernels=False
INFO:root:Using layer `MegaBlocksMoeMLP` from repo `kernels-community/megablocks` (revision: main) for layer `MegaBlocksMoeMLP`  (repeated for each MoE layer)
UserWarning: No kernel mapping found for layer `None`. Check if the layer name matches one of the kernels in the mapping or add the kernel you want to use to the mapping. Defaulting to original forward implementation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.