---

FP8 Quantized version of: [CohereLabs/command-a-translate-08-2025](https://huggingface.co/CohereLabs/command-a-translate-08-2025)
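
The checkpoint can be served with vLLM, which loads `compressed-tensors` FP8 checkpoints produced by `llmcompressor`. Below is a minimal sketch, not a tested recipe: the repo id is a placeholder for this repository's actual id, `tensor_parallel_size=4` assumes the same 4-GPU setup used for quantization, and the prompt is illustrative only.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Placeholder id -- replace with the actual Hugging Face id of this FP8 checkpoint
MODEL_ID = "<your-namespace>/command-a-translate-FP8-Dynamic"

# Build a prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Translate to German: The weather is nice today."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Load the FP8 weights across 4 GPUs and generate greedily
llm = LLM(model=MODEL_ID, tensor_parallel_size=4)
outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=256))
print(outputs[0].outputs[0].text)
```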

Code used to perform quantization using `llmcompressor`:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import torch
import time

MODEL_ID = "CohereLabs/command-a-translate-08-2025"

# Check your GPUs
print(f"Found {torch.cuda.device_count()} GPUs")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    print(f" Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")

start_time = time.time()

# Load model across all 4 GPUs
print("Loading model across 4x A100 GPUs...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatically distributes across all GPUs
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    max_memory={
        0: "70GB",  # Leave some headroom on each GPU
        1: "70GB",
        2: "70GB",
        3: "70GB",
        "cpu": "800GB"  # Use CPU for overflow if needed
    }
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

print("Model distributed across GPUs!")
print(model.hf_device_map)  # Shows which layers are on which device

# Apply FP8 quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

print("Starting FP8 quantization on multi-GPU setup...")
oneshot(model=model, recipe=recipe)

# Save quantized model
SAVE_DIR = "command-a-translate-FP8-Dynamic"
print(f"Saving to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR, safe_serialization=True)
tokenizer.save_pretrained(SAVE_DIR)

elapsed = time.time() - start_time
print(f"✓ Quantization completed in {elapsed/60:.2f} minutes!")
```
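
As a quick sanity check after saving, the export can be inspected for its recorded quantization scheme. This is a hypothetical snippet: it assumes the `compressed-tensors` export writes a `quantization_config` entry into the saved `config.json`, which is the usual behavior for `llmcompressor` FP8 checkpoints.

```python
import json

# Hypothetical check: print the quantization metadata recorded in the saved config
with open("command-a-translate-FP8-Dynamic/config.json") as f:
    cfg = json.load(f)

print(json.dumps(cfg.get("quantization_config"), indent=2))
```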