🦊 klusai/tf2-12b-w8a8 — EN→RO Translator (W8A8 via LLM Compressor)
Gemma 3 12B (Instruct), LoRA-tuned on 15k English→Romanian fable pairs (synthetic, GPT-o3), then compressed to W8A8 (INT8 weights & activations) using LLM Compressor by vLLM for fast, memory-efficient inference.
Focus: faithful, fluent EN→RO translations of short moral fables, preserving tone/structure and proper Romanian diacritics.
✨ What’s in this build
- Training: Same LoRA fine-tune as [tf2-12b-gguf](https://huggingface.co/klusai/tf2-12b-gguf) (15k EN→RO fables, GPT-o3).
- Compression: W8A8 (INT8 weights + INT8 activations) with LLM Compressor; channel-wise weight quantization plus dynamic per-token activation quantization. ([VLLM Documentation][1])
- Format: Published in Transformers format with compressed-tensors metadata; can be served directly with vLLM. ([Hugging Face][2], [VLLM Documentation][3])
- Hardware notes: INT8 execution supported on modern NVIDIA GPUs (compute capability ≥ 7.5; e.g., Turing/Ampere/Ada/Hopper). ([BookStack][4])
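To make the compression bullet concrete, here is a small pure-Python sketch of what "channel-wise weight quantization plus dynamic per-token activation quantization" means arithmetically. It is an illustration only, not LLM Compressor's or vLLM's implementation: the helper names `quantize_rows` and `w8a8_matmul` are hypothetical, and symmetric scaling to the range [-127, 127] is an assumption made for simplicity.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    Returns (integer rows, per-row scales). An all-zero row gets scale 1.0
    so that no division by zero occurs.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(x, w):
    """Approximate x @ w^T using INT8 operands.

    x: activations, shape [tokens][in_features]
    w: weights, shape [out_channels][in_features]
    """
    # Activations: one scale per token, computed from the live input (dynamic).
    qx, sx = quantize_rows(x)
    # Weights: one scale per output channel; in a real deployment this is
    # computed once, offline, and stored with the checkpoint.
    qw, sw = quantize_rows(w)

    out = []
    for t, x_row in enumerate(qx):
        out_row = []
        for c, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            out_row.append(acc * sx[t] * sw[c])             # rescale to float
        out.append(out_row)
    return out
```

Because every token and every output channel carries its own scale, a single large value only reduces the precision of its own row rather than of the whole tensor, which is the usual motivation for this granularity.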
🚀 Inference
vLLM
pip install vllm
vllm serve tf2-12b-w8a8/ --dtype bfloat16 --tensor-parallel-size 1 --trust-remote-code --port 8000 --host 0.0.0.0
Then call it with any OpenAI-compatible client, setting base_url to your server's /v1 endpoint (for example http://localhost:8000/v1 for the command above).
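A minimal sketch of such a call using the `openai` Python package follows. Several things here are assumptions rather than part of this model card: the prompt wording in `build_messages`, the helper names, the sampling settings, and the `model` value, which must match whatever name the server registered (by default vLLM uses the path passed to `vllm serve`, here `tf2-12b-w8a8/`).

```python
def build_messages(english_text):
    """Wrap the source text in a single user turn.

    The instruction wording is an assumption; adjust it to the prompt
    format the model was actually fine-tuned with.
    """
    prompt = "Translate the following fable from English to Romanian:\n\n" + english_text
    return [{"role": "user", "content": prompt}]


def translate(english_text, base_url="http://localhost:8000/v1", model="tf2-12b-w8a8/"):
    """Send one translation request to a running vLLM server."""
    # Imported lazily so the helper above works without the package installed.
    from openai import OpenAI  # requires: pip install openai

    # vLLM ignores the API key unless the server was started with --api-key.
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(english_text),
        temperature=0.0,   # deterministic decoding, an assumed default for translation
        max_tokens=512,
    )
    return response.choices[0].message.content
```

With the server from the previous step running, `print(translate("A fox saw some grapes hanging high on a vine."))` should print the Romanian translation.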
✅ Intended Use
- EN→RO machine translation of narrative prose (fables), educational localization, style-preserving story translation.
⚠️ Limitations
- Domain-specific (fables); not tuned for legal or medical text, slang, or highly technical content.
- Synthetic training data may encode stylistic biases; human review recommended for production.
🔑 License
- This repository: MIT
- Use must also comply with the Gemma base model license.
📝 Changelog
- v1.0 (W8A8): LoRA-merged, quantized with LLM Compressor to W8A8; uploaded with compressed-tensors metadata for Transformers & vLLM. ([VLLM Documentation][5])