hammh0a/command-a-translate-FP8-Dynamic

compressed-tensors


hammh0a committed on Sep 18, 2025

Commit 681a670 · verified · 1 Parent(s): ebd50fa

Update README.md

Files changed (1)

README.md +1 -62


@@ -3,65 +3,4 @@ base_model:
 - CohereLabs/command-a-translate-08-2025
 ---
-FP8 Quantized version of: [CohereLabs/command-a-translate-08-2025](https://huggingface.co/CohereLabs/command-a-translate-08-2025)
-Code used to perform quantization using `llmcompressor`.
-```
-from transformers import AutoTokenizer, AutoModelForCausalLM
-from llmcompressor import oneshot
-from llmcompressor.modifiers.quantization import QuantizationModifier
-import torch
-import time
-MODEL_ID = "CohereLabs/command-a-translate-08-2025"
-# Check your GPUs
-print(f"Found {torch.cuda.device_count()} GPUs")
-for i in range(torch.cuda.device_count()):
-    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
-    print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")
-start_time = time.time()
-# Load model across all 4 GPUs
-print("Loading model across 4x A100 GPUs...")
-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_ID,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",  # Automatically distributes across all GPUs
-    low_cpu_mem_usage=True,
-    trust_remote_code=True,
-    max_memory={
-        0: "70GB",  # Leave some headroom on each GPU
-        1: "70GB",
-        2: "70GB",
-        3: "70GB",
-        "cpu": "800GB"  # Use CPU for overflow if needed
-    }
-)
-tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
-print("Model distributed across GPUs!")
-print(model.hf_device_map)  # Shows which layers are on which device
-# Apply FP8 quantization
-recipe = QuantizationModifier(
-    targets="Linear",
-    scheme="FP8_DYNAMIC",
-    ignore=["lm_head"]
-)
-print("Starting FP8 quantization on multi-GPU setup...")
-oneshot(model=model, recipe=recipe)
-# Save quantized model
-SAVE_DIR = "command-a-translate-FP8-Dynamic"
-print(f"Saving to {SAVE_DIR}...")
-model.save_pretrained(SAVE_DIR, safe_serialization=True)
-tokenizer.save_pretrained(SAVE_DIR)
-elapsed = time.time() - start_time
-print(f"✓ Quantization completed in {elapsed/60:.2f} minutes!")
-```

+FP8 Quantized version of: [CohereLabs/command-a-translate-08-2025](https://huggingface.co/CohereLabs/command-a-translate-08-2025)
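
The script removed in this commit calls `oneshot` with the `FP8_DYNAMIC` scheme and no calibration dataset. That fits a scheme in which activation scales are computed at runtime rather than calibrated in advance. The following is a minimal pure-Python sketch of that idea, not the `llmcompressor` implementation: the helper names (`dynamic_scale`, `scale_and_clamp`, `dequantize`) are hypothetical, the per-row (per-token) granularity is an assumption, and rounding to the actual FP8 grid is deliberately omitted. The only fixed fact it relies on is that the largest finite FP8 E4M3 value is 448.

```python
# Illustrative sketch only: dynamic per-row scaling into the FP8 E4M3 range.
# Helper names are hypothetical and not part of llmcompressor.

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def dynamic_scale(row):
    """Return the scale that maps the row's largest magnitude onto FP8_E4M3_MAX."""
    amax = max(abs(v) for v in row)
    # Guard against an all-zero row so the later division is always defined.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(row):
    """Scale a row into the FP8 range and clamp; rounding to the FP8 grid is omitted."""
    s = dynamic_scale(row)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
    return scaled, s


def dequantize(scaled, s):
    """Map scaled values back to the original range using the stored scale."""
    return [v * s for v in scaled]


# Each row stands in for one token's activations (assumed granularity).
activations = [[0.5, -2.0, 1.0], [100.0, 25.0, -50.0]]
for row in activations:
    q, s = scale_and_clamp(row)
    print(f"scale={s:.6f}  max|q|={max(abs(v) for v in q):.1f}")
```

Because the scale is derived from each row at the moment it is seen, rows with very different magnitudes each use the full FP8 range, at the cost of computing a maximum per row at inference time.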