+
+Cell: no_kernels | 106.70s
+
+# /// script
+# requires-python = ">=3.12"
+# dependencies = [
+# "accelerate>=1.10.1",
+# "torch>=2.7.0",
+# "kernels==0.10.0",
+# "transformers@https://github.com/huggingface/transformers.git",
+# "ipdb>=0.13.13",
+# "matplotlib>=3.7.2",
+# "numpy>=1.24.3",
+# ]
+# ///
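+# (The block above is PEP 723 inline script metadata; `uv run` reads it and builds
+# a matching environment with these dependencies before executing the script.)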
+
+import torch
+from transformers import GptOssForCausalLM, PreTrainedTokenizerFast, Mxfp4Config
+import time
+import torch.nn as nn
+from kernels import register_kernel_mapping, Mode, LayerRepository, replace_kernel_forward_from_hub
+import sys
+import torch.profiler
+import gc
+import logging
+from transformers.models.gpt_oss.modeling_gpt_oss import GptOssRMSNorm
+
+# configure basic logging at INFO level
+logging.basicConfig(level=logging.INFO)
+
+def reset_peak_memory_stats():
+    """Clear CUDA cache and reset memory allocation counters."""
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.reset_peak_memory_stats()
+    gc.collect()
+
+def get_memory_stats():
+    """Get current and peak CUDA memory usage."""
+    if not torch.cuda.is_available():
+        return {"allocated_gb": 0, "peak_gb": 0, "reserved_gb": 0}
+    return {
+        "allocated_gb": torch.cuda.memory_allocated() / 1e9,
+        "peak_gb": torch.cuda.max_memory_allocated() / 1e9,
+        "reserved_gb": torch.cuda.memory_reserved() / 1e9,
+    }
+
+def override_kernel_layer_name(cls_name: str, value) -> bool:
+    """Helper to dynamically override the kernel_layer_name in a model class."""
+    for mod in sys.modules.values():
+        if mod is None:
+            continue
+        obj = getattr(mod, cls_name, None)
+        if isinstance(obj, type) and issubclass(obj, nn.Module):
+            setattr(obj, "kernel_layer_name", value)
+            print(f"Overrode {cls_name}.kernel_layer_name to {value}")
+            return True
+    return False
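+# (The kernels imports and the helper above are not exercised in this baseline
+# "no_kernels" cell; they only matter when layers are remapped to Hub kernels.)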
+
+
+# Init the model the normal way
+model_id = "openai/gpt-oss-20b"
+tokenizer = PreTrainedTokenizerFast.from_pretrained(model_id)
+quantization_config = Mxfp4Config(dequantize=True)
+
+
+
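+# Baseline load: dequantize the MXFP4 checkpoint to bfloat16 and keep
+# use_kernels=False so generation runs on the stock transformers layers.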
+model = GptOssForCausalLM.from_pretrained(
+    model_id,
+    dtype="bfloat16",
+    device_map="auto",
+    use_kernels=False,
+    quantization_config=quantization_config,
+).eval()
+
+messages = [
+    {"role": "system", "content": "What is Tensor Parallelism?"},
+]
+
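+# Build the prompt with the model's chat template; gpt-oss templates accept a
+# reasoning_effort hint, which appears as "Reasoning: low" in the rendered system message.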
+inputs = tokenizer.apply_chat_template(
+    messages,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    return_dict=True,
+    reasoning_effort="low",
+).to("cuda")
+
+max_tokens = 256
+
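+# Time a greedy decode (do_sample=False) of up to 256 new tokens.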
+with torch.inference_mode():
+    start_time = time.perf_counter()
+    generated = model.generate(
+        **inputs,
+        max_new_tokens=max_tokens,
+        do_sample=False,
+        temperature=None,
+    )
+    end_time = time.perf_counter()
+
+print(tokenizer.decode(generated[0], skip_special_tokens=False))
+print(f"Generation took {end_time - start_time:.2f} seconds")
+
+
+<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
+Knowledge cutoff: 2024-06
+Current date: 2025-09-24
+
+Reasoning: low
+
+# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions
+
+What is Tensor Parallelism?
+
+<|end|><|start|>assistant<|channel|>analysis<|message|>Need to explain tensor parallelism.<|end|><|start|>assistant<|channel|>final<|message|>**Tensor Parallelism** is a technique used to distribute the computation of a single tensor (or a set of tensors) across multiple devices (GPUs, TPUs, or other accelerators) in order to speed up large‑scale deep‑learning training and inference. It is one of several parallelism strategies—alongside data parallelism, pipeline parallelism, and model parallelism—designed to scale models that exceed the memory or compute capacity of a single device.
+
+---
+
+## 1. Why Tensor Parallelism?
+
+- **Memory constraints**: Modern transformer models (e.g., GPT‑4, PaLM, LLaMA‑70B) can have billions of parameters. A single GPU may not hold the entire model or its intermediate activations.
+- **Compute bottlenecks**: Even if memory were sufficient, the sheer number of floating‑point operations can make training prohibitively slow on a single device.
+- **Scalability**: Tensor parallelism allows you to add more devices to reduce per‑device memory usage and accelerate matrix operations.
+
+---
+
+## 2. Core Idea
+
+In tensor parallelism, a *tensor* (typically a weight matrix or an activation tensor
+Generation took 25.73 seconds
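+
+The answer above is cut off by the 256-token limit just as it reaches the core idea. As a
+minimal illustration of what it is describing (not part of the benchmark cell), the sketch
+below splits a weight matrix column-wise across two workers and checks that concatenating
+the partial results reproduces the single-device matmul; real tensor parallelism would place
+each shard on its own GPU and replace the concatenation with a collective gather.
+
+import torch
+
+x = torch.randn(4, 8)           # activations
+W = torch.randn(8, 16)          # full weight matrix
+
+W0, W1 = W.chunk(2, dim=1)      # split output columns across two workers
+y0, y1 = x @ W0, x @ W1         # each worker computes its slice independently
+y = torch.cat([y0, y1], dim=1)  # gather the partial outputs
+
+assert torch.allclose(y, x @ W, atol=1e-6)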
+
+
+
+
+Fetching 3 files: 100%|██████████| 3/3 [00:08<00:00, 2.89s/it]
+
+Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00, 1.93s/it]
+