---
license: mit
language:
- en
- ro
library_name: transformers
pipeline_tag: text-generation
tags:
- translation
- en-ro
- w8a8
- int8
- quantization
- vllm
- llm-compressor
- compressed-tensors
- gemma-3
- 12b
base_model: gemma-3-12b-instruct
precision: int8 (W8A8)
pretty_name: TF2-12B-W8A8 — EN→RO Fable Translator (LLM Compressor)
---

# 🦊 **klusai/tf2-12b-w8a8** — EN→RO Translator (W8A8 via LLM Compressor)

**Gemma 3 12B (Instruct)**, LoRA-tuned on **15k English→Romanian** fable pairs (synthetic, generated with **GPT-o3**), then **compressed to W8A8 (INT8 weights and activations)** using **LLM Compressor** from the vLLM project for fast, memory-efficient inference.
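As a rough sanity check on the memory claim, halving weight precision from BF16 to INT8 roughly halves the weight footprint. A back-of-the-envelope sketch (weights only; ignores activations, KV cache, and runtime overhead):

```python
def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 12e9  # ~12B parameters, rounded
print(f"BF16 weights: ~{weight_footprint_gib(N, 16):.1f} GiB")  # ~22.4 GiB
print(f"INT8 weights: ~{weight_footprint_gib(N, 8):.1f} GiB")   # ~11.2 GiB
```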

> Focus: faithful, fluent EN→RO translation of short moral fables, preserving tone, structure, and proper Romanian diacritics.

---

## ✨ What's in this build

* **Training:** same LoRA fine-tune as [tf2-12b-gguf](https://huggingface.co/klusai/tf2-12b-gguf) (15k EN→RO fable pairs, GPT-o3).
* **Compression:** **W8A8** (INT8 weights + INT8 activations) with **LLM Compressor**; channel-wise weight quantization plus dynamic per-token activation quantization. ([VLLM Documentation][1])
* **Format:** published for **Transformers** with **compressed-tensors** metadata; runs great on a **vLLM** server. ([Hugging Face][2], [VLLM Documentation][3])
* **Hardware notes:** INT8 execution is supported on modern NVIDIA GPUs (compute capability ≥ 7.5, i.e. Turing/Ampere/Ada/Hopper). ([BookStack][4])
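The channel-wise weight / dynamic per-token activation setup above corresponds to an LLM Compressor recipe along these lines. This is a hypothetical sketch, not the exact recipe used for this checkpoint; field names follow the llm-compressor / compressed-tensors quantization schema:

```yaml
# Hypothetical W8A8 recipe sketch (illustrative, not the one shipped with this build)
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]        # keep the output head in higher precision
      config_groups:
        group_0:
          targets: ["Linear"]    # quantize the linear layers
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: channel    # channel-wise weight scales
          input_activations:
            num_bits: 8
            type: int
            dynamic: true
            strategy: token      # per-token dynamic activation scales
```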

---

## 🚀 Inference

### vLLM

```bash
pip install vllm

vllm serve tf2-12b-w8a8/ --dtype bfloat16 --tensor-parallel-size 1 --trust-remote-code --port 8000 --host 0.0.0.0
```

Then call it with any OpenAI-compatible client (set `base_url` to your server).
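For example, a minimal stdlib sketch against the OpenAI-compatible chat endpoint started above. The model name and prompt wrapper here are illustrative assumptions; adapt them to your deployment:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # matches the serve flags above

def build_payload(english_text: str) -> dict:
    # Hypothetical prompt wrapper; adjust to your own prompting convention.
    return {
        "model": "tf2-12b-w8a8",
        "messages": [{
            "role": "user",
            "content": f"Translate the following fable into Romanian:\n\n{english_text}",
        }],
        "temperature": 0.2,
    }

def translate(english_text: str) -> str:
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(english_text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the server running, `translate("The fox praised the crow's voice.")` returns the Romanian rendering.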

---

## ✅ Intended Use

* EN→RO **machine translation** of narrative prose (fables), educational localization, and style-preserving story translation.

## ⚠️ Limitations

* Domain-specific (fables); not tuned for legal/medical jargon or highly technical text.
* Synthetic training data may encode stylistic biases; human review is recommended for production use.

---

## 🔑 License

* **This repository:** MIT
* Use must also comply with the **Gemma** base model license.

---

### 📝 Changelog

* **v1.0 (W8A8):** LoRA merged, then quantized to W8A8 with LLM Compressor; uploaded with compressed-tensors metadata for Transformers & vLLM. ([VLLM Documentation][5])