---
license: mit
language:
- en
- ro
library_name: transformers
pipeline_tag: text-generation
tags:
- translation
- en-ro
- w8a8
- int8
- quantization
- vllm
- llm-compressor
- compressed-tensors
- gemma-3
- 12b
base_model: gemma-3-12b-instruct
precision: int8 (W8A8)
pretty_name: TF2-12B-W8A8 — EN→RO Fable Translator (LLM Compressor)
---

# 🦊 **klusai/tf2-12b-w8a8** — EN→RO Translator (W8A8 via LLM Compressor)

**Gemma 3 12B (Instruct)**, LoRA-tuned on **15k English→Romanian** fable pairs (synthetic, generated with **GPT-o3**), then **compressed to W8A8 (INT8 weights and activations)** using **LLM Compressor from the vLLM project** for fast, memory-efficient inference.

> Focus: faithful, fluent EN→RO translation of short moral fables, preserving tone, structure, and proper Romanian diacritics.

---

## ✨ What’s in this build

* **Training:** Same LoRA fine-tune as [tf2-12b-gguf](https://huggingface.co/klusai/tf2-12b-gguf) (15k EN→RO fables, GPT-o3).
* **Compression:** **W8A8** (INT8 weights + INT8 activations) with **LLM Compressor**; channel-wise weight quantization plus dynamic per-token activation quantization. ([vLLM Documentation][1])
* **Format:** Published for **Transformers** with **compressed-tensors** metadata; runs well on a **vLLM** server. ([Hugging Face][2], [vLLM Documentation][3])
* **Hardware notes:** INT8 execution is supported on NVIDIA GPUs with compute capability ≥ 7.5 (e.g., Turing/Ampere/Ada/Hopper). ([BookStack][4])
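As a rough illustration of the compression step, a W8A8 pass of this kind can be expressed as an LLM Compressor recipe. The modifier names below follow the llm-compressor documentation, but the specific combination, smoothing strength, and ignore list are assumptions for illustration, not the recipe actually used for this model:

```yaml
# Hypothetical llm-compressor recipe sketch (NOT this model's actual recipe):
# smooth activation outliers, then apply INT8 weight+activation quantization
# to all Linear layers, leaving the output head in higher precision.
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
    GPTQModifier:
      targets: ["Linear"]
      ignore: ["lm_head"]
      scheme: "W8A8"
```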

---

## 🚀 Inference

### vLLM

```bash
pip install vllm

vllm serve tf2-12b-w8a8/ --dtype bfloat16 --tensor-parallel-size 1 --trust-remote-code --port 8000 --host 0.0.0.0
```

Then call it from any OpenAI-compatible client (set `base_url` to your server's address).

---

## ✅ Intended Use

* EN→RO **machine translation** of narrative prose (fables), educational localization, style-preserving story translation.

## ⚠️ Limitations

* Domain-specific (fables); not tuned for legal/medical jargon or highly technical text.
* Synthetic training data may encode stylistic biases; human review is recommended for production use.

---

## 🔑 License

* **This repository:** MIT
* Use must also comply with the **Gemma** base model license.

---

### 📝 Changelog

* **v1.0 (W8A8):** LoRA-merged, quantized with LLM Compressor to W8A8; uploaded with compressed-tensors metadata for Transformers & vLLM. ([vLLM Documentation][5])