---
license: mit
language:
  - en
  - ro
library_name: transformers
pipeline_tag: text-generation
tags:
  - translation
  - en-ro
  - w8a8
  - int8
  - quantization
  - vllm
  - llm-compressor
  - compressed-tensors
  - gemma-3
  - 12b
base_model: google/gemma-3-12b-it
precision: int8 (W8A8)
pretty_name: TF2-12B-W8A8 EN→RO Fable Translator (LLM Compressor)
---

🦊 klusai/tf2-12b-w8a8 — EN→RO Translator (W8A8 via LLM Compressor)

Gemma 3 12B (Instruct), LoRA-tuned on 15k English→Romanian fable pairs (synthetic, generated with GPT-o3), then compressed to W8A8 (INT8 weights and activations) using vLLM's LLM Compressor for fast, memory-efficient inference.

Focus: faithful, fluent EN→RO translations of short moral fables, preserving tone/structure and proper Romanian diacritics.


✨ What’s in this build

  • Training: Same LoRA fine-tune as [tf2-12b-gguf](https://huggingface.co/klusai/tf2-12b-gguf) (15k EN→RO fables, GPT-o3).
  • Compression: W8A8 (INT8 weights + INT8 activations) with LLM Compressor; channel-wise weight quant + dynamic per-token activation quant. ([VLLM Documentation][1])
  • Format: Published for Transformers with compressed-tensors metadata; runs well on a vLLM server. ([Hugging Face][2], [VLLM Documentation][3])
  • Hardware notes: INT8 execution supported on modern NVIDIA GPUs (compute capability ≥ 7.5; e.g., Turing/Ampere/Ada/Hopper). ([BookStack][4])
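To make the W8A8 scheme above concrete, here is a minimal stdlib-only sketch of symmetric INT8 quantization: one scale per output channel for weights (computed offline) and one scale per token vector for activations (computed dynamically at inference time). The function names and toy values are illustrative, not part of the released model.

```python
# Illustrative sketch of W8A8: symmetric INT8 quantization with
# channel-wise weight scales and dynamic per-token activation scales.
# Pure Python; no real model or GPU involved.

def int8_quantize(values):
    """Symmetric INT8: scale = max|x| / 127, q = clamp(round(x / scale))."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid 0 for all-zero input
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def quantize_weights_per_channel(weight_rows):
    # One scale per output channel (row), computed once, offline.
    return [int8_quantize(row) for row in weight_rows]

def quantize_activations_per_token(tokens):
    # One scale per token vector, computed on the fly during inference.
    return [int8_quantize(tok) for tok in tokens]

w_q = quantize_weights_per_channel([[0.5, -1.27, 0.01], [2.54, 0.0, -2.54]])
a_q = quantize_activations_per_token([[1.0, -0.5, 0.25]])
```

The per-channel/per-token granularity keeps quantization error low because each row or token gets its own scale rather than sharing one global range.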

🚀 Inference

vLLM

```shell
pip install vllm

vllm serve klusai/tf2-12b-w8a8 \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --port 8000 \
  --host 0.0.0.0
```

Then call it with any OpenAI-compatible client (set base_url to your server).
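As a minimal sketch of that client call, the helper below builds an OpenAI-compatible chat-completions payload using only the standard library. The system prompt, temperature, and helper name are illustrative assumptions, not part of this card; POST the JSON body to your server's /v1/chat/completions endpoint.

```python
import json

# Hypothetical helper: builds an OpenAI-compatible request body for the
# vLLM server started above. Prompt wording and sampling settings are
# illustrative assumptions.
def build_translation_request(english_text: str,
                              model: str = "klusai/tf2-12b-w8a8") -> dict:
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Translate the following English fable into Romanian, "
                        "preserving tone, structure, and Romanian diacritics."},
            {"role": "user", "content": english_text},
        ],
        "temperature": 0.2,
    }

payload = build_translation_request("The fox praised the crow's voice.")
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
# with header "Content-Type: application/json".
```

The same payload works with the official `openai` Python client if you point `base_url` at `http://localhost:8000/v1`.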


✅ Intended Use

  • EN→RO machine translation of narrative prose (fables), educational localization, style-preserving story translation.

⚠️ Limitations

  • Domain-specific (fables); not tuned for legal or medical jargon, slang, or highly technical text.
  • Synthetic training data may encode stylistic biases; human review recommended for production.

🔑 License

  • This repository: MIT
  • Use must also comply with the Gemma base model license.

📝 Changelog

  • v1.0 (W8A8): LoRA-merged, quantized with LLM Compressor to W8A8; uploaded with compressed-tensors metadata for Transformers & vLLM. ([VLLM Documentation][5])