---
license: mit
language:
- en
- ro
library_name: transformers
pipeline_tag: text-generation
tags:
- translation
- en-ro
- w8a8
- int8
- quantization
- vllm
- llm-compressor
- compressed-tensors
- gemma-3
- 12b
base_model: gemma-3-12b-instruct
precision: int8 (W8A8)
pretty_name: TF2-12B-W8A8 — EN→RO Fable Translator (LLM Compressor)
---

# 🦊 **klusai/tf2-12b-w8a8** — EN→RO Translator (W8A8 via LLM Compressor)

**Gemma 3 12B (Instruct)**, LoRA-tuned on **15k English→Romanian** fable pairs (synthetic, generated with **GPT-o3**), then **compressed to W8A8 (INT8 weights and activations)** using **LLM Compressor** from the vLLM project for fast, memory-efficient inference.
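As a rough sanity check on the memory claim, halving weight precision from BF16 to INT8 roughly halves the weight footprint. A back-of-the-envelope sketch (weights only; ignores activations, KV cache, and runtime overhead):

```python
def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 12e9  # ~12B parameters, rounded
print(f"BF16 weights: ~{weight_footprint_gib(N, 16):.1f} GiB")  # ~22.4 GiB
print(f"INT8 weights: ~{weight_footprint_gib(N, 8):.1f} GiB")   # ~11.2 GiB
```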

> Focus: faithful, fluent EN→RO translation of short moral fables, preserving tone, structure, and proper Romanian diacritics.

---

## ✨ What's in this build

* **Training:** same LoRA fine-tune as [tf2-12b-gguf](https://huggingface.co/klusai/tf2-12b-gguf) (15k EN→RO fable pairs, GPT-o3).
* **Compression:** **W8A8** (INT8 weights + INT8 activations) with **LLM Compressor**; channel-wise weight quantization plus dynamic per-token activation quantization. ([VLLM Documentation][1])
* **Format:** published for **Transformers** with **compressed-tensors** metadata; runs great on a **vLLM** server. ([Hugging Face][2], [VLLM Documentation][3])
* **Hardware notes:** INT8 execution is supported on modern NVIDIA GPUs (compute capability ≥ 7.5, i.e. Turing/Ampere/Ada/Hopper). ([BookStack][4])
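The channel-wise weight / dynamic per-token activation setup above corresponds to an LLM Compressor recipe along these lines. This is a hypothetical sketch, not the exact recipe used for this checkpoint; field names follow the llm-compressor / compressed-tensors quantization schema:

```yaml
# Hypothetical W8A8 recipe sketch (illustrative, not the one shipped with this build)
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]        # keep the output head in higher precision
      config_groups:
        group_0:
          targets: ["Linear"]    # quantize the linear layers
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: channel    # channel-wise weight scales
          input_activations:
            num_bits: 8
            type: int
            dynamic: true
            strategy: token      # per-token dynamic activation scales
```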

---

## 🚀 Inference

### vLLM

```bash
pip install vllm

vllm serve tf2-12b-w8a8/ --dtype bfloat16 --tensor-parallel-size 1 --trust-remote-code --port 8000 --host 0.0.0.0
```

Then call it with any OpenAI-compatible client (set `base_url` to your server).
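For example, a minimal stdlib sketch against the OpenAI-compatible chat endpoint started above. The model name and prompt wrapper here are illustrative assumptions; adapt them to your deployment:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # matches the serve flags above

def build_payload(english_text: str) -> dict:
    # Hypothetical prompt wrapper; adjust to your own prompting convention.
    return {
        "model": "tf2-12b-w8a8",
        "messages": [{
            "role": "user",
            "content": f"Translate the following fable into Romanian:\n\n{english_text}",
        }],
        "temperature": 0.2,
    }

def translate(english_text: str) -> str:
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(english_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server running, `translate("The fox praised the crow's voice.")` returns the Romanian rendering.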

---

## ✅ Intended Use

* EN→RO **machine translation** of narrative prose (fables), educational localization, and style-preserving story translation.

## ⚠️ Limitations

* Domain-specific (fables); not tuned for legal/medical jargon or highly technical text.
* Synthetic training data may encode stylistic biases; human review is recommended for production use.

---

## 🔑 License

* **This repository:** MIT
* Use must also comply with the **Gemma** base model license.

---

### 📝 Changelog

* **v1.0 (W8A8):** LoRA merged, then quantized to W8A8 with LLM Compressor; uploaded with compressed-tensors metadata for Transformers & vLLM. ([VLLM Documentation][5])