Broken-Tutu-24B — Quantized (compressed-tensors for vLLM)
This repository provides quantized runtime builds of
ReadyArt/Broken-Tutu-24B (marked “not for all audiences”), repackaged for vLLM using the compressed-tensors format.
TL;DR
- Six branches covering INT4 (W4A16) and INT8 (W8A16) with group sizes 32 / 64 / 128.
- Same calibration recipe as our recent cards: 512 chat samples, 2048 max sequence length, dataset `neuralmagic/LLM_compression_calibration` (messages rendered with the model’s chat template).
- Weight-only AWQ; `lm_head` kept high-precision; exported with `save_compressed=True` for vLLM.
Revisions & Branches
The `main` branch is a landing page (model card + links). Runnable artifacts live in the per-quant branches.
- main — placeholder / landing page
- W4A16_GS32 — INT4 weights / A16 activations, group size 32 (highest fidelity)
- W4A16_GS64 — INT4 / A16, group size 64 (balanced)
- W4A16_GS128 — INT4 / A16, group size 128 (leanest INT4)
- W8A16_GS32 — INT8 / A16, group size 32 (max fidelity for INT8)
- W8A16_GS64 — INT8 / A16, group size 64 (balanced INT8)
- W8A16_GS128 — INT8 / A16, group size 128 (leanest INT8)
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/main
- W4A16_GS32: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W4A16_GS32
- W4A16_GS64: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W4A16_GS64
- W4A16_GS128: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W4A16_GS128
- W8A16_GS32: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W8A16_GS32
- W8A16_GS64: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W8A16_GS64
- W8A16_GS128: https://huggingface.co/TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors/tree/W8A16_GS128
What’s inside (per revision)
- Sharded quantized weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with compressed-tensors metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)
Exact file lists may differ between branches — see Files and versions for each revision.
Quantization & calibration details (same script/recipe family as prior cards)
Method / flow
- `llmcompressor` `oneshot` pipeline with an `AWQModifier` (weight-only quantization).
Targets / exclusions
- Quantize `Linear` layers; ignore `lm_head` (kept high-precision).
Weights / grouping
- Branches W4A16_* use INT4 (`num_bits=4`, `symmetric=True`).
- Branches W8A16_* use INT8 (`num_bits=8`, `symmetric=True`).
- Strategy: `"group"` with group size ∈ {32, 64, 128} according to branch.
- Activations not quantized (runtime A16: BF16/FP16).
- Export with `save_compressed=True` so vLLM loads the compressed-tensors layout directly.
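
To make these settings concrete, here is a minimal, hypothetical recipe sketch. It is not the exact release script: the module path and argument names follow recent llm-compressor releases and may differ in your version, and the group-size handling is noted only in comments.

```python
# Hypothetical AWQ recipe sketch (illustrative, not the exact release script).
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        targets=["Linear"],   # quantize Linear layers only
        ignore=["lm_head"],   # keep the output head in high precision
        scheme="W4A16",       # INT4 weights, 16-bit activations, symmetric, grouped scales
        # The W8A16_* branches use an 8-bit weight scheme instead; the per-branch
        # group sizes (32 / 64 / 128) are set in the scheme's weight quantization args.
    ),
]
```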
Calibration dataset & preprocessing
- Dataset: `neuralmagic/LLM_compression_calibration`, split `train`.
- NUM_CALIBRATION_SAMPLES = 512 (random subset with fixed seed).
- MAX_SEQUENCE_LENGTH = 2048.
- Each sample’s `messages` list is rendered via `tokenizer.apply_chat_template(..., tokenize=False)`, then tokenized with `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`.
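
A minimal preprocessing sketch under these settings (standard 🤗 `datasets`/`transformers` calls; the seed value and variable names are illustrative assumptions):

```python
# Illustrative calibration-data preprocessing sketch.
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "ReadyArt/Broken-Tutu-24B"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Random 512-sample subset with a fixed seed for reproducibility.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def render_chat(example):
    # Render the chat messages with the model's own chat template, without tokenizing yet.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(render_chat)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```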
Compression call
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset.
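
Putting the recipe and dataset sketches above together, the compression step looks roughly like this (again a sketch, not the release script; the output directory name is a placeholder):

```python
# Continuing the sketch above: apply the AWQ recipe with one-shot calibration.
from llmcompressor import oneshot  # older releases expose this under llmcompressor.transformers
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

oneshot(
    model=model,
    dataset=ds,                                    # preprocessed calibration samples
    recipe=recipe,                                 # AWQModifier recipe from the sketch above
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    tokenizer=tokenizer,
)

# Export in the compressed-tensors layout that vLLM loads directly.
model.save_pretrained("Broken-Tutu-24B-W4A16_GS128", save_compressed=True)
tokenizer.save_pretrained("Broken-Tutu-24B-W4A16_GS128")
```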
Why group size matters in AWQ (W4A16 / W8A16)
- Definition: Group size controls how many consecutive weights share a single set of quantization scales (see the back-of-the-envelope example after this list).
- Trade-offs (accuracy ↔ throughput/VRAM):
- GS32 (smallest groups): Most scale sets → highest fidelity (often best perplexity/task scores), but more scale metadata and slightly lower throughput.
- GS64 (middle ground): Good balance of quality and performance; a solid default if you haven’t profiled yet.
- GS128 (largest groups): Fewest scales → leaner/faster (less bandwidth/metadata), with slightly higher quantization error; good for throughput-critical serving.
- Bit-width differences:
- W4A16 tends to be smaller/faster than INT8 at the same GS, but can be more sensitive to GS and calibration coverage.
- W8A16 offers more headroom (often closer to FP in quality), at the cost of more VRAM/bandwidth.
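
As a rough illustration of the metadata overhead involved, the sketch below counts per-group scales for one hypothetical 4096×4096 Linear layer. The layer shape and FP16 scale storage are assumptions made for the arithmetic, not measurements of this model:

```python
# Back-of-the-envelope: how many quantization scales a single Linear layer carries.
out_features, in_features = 4096, 4096  # hypothetical layer shape

for group_size in (32, 64, 128):
    # One scale per group of consecutive weights along the input dimension.
    num_scales = out_features * (in_features // group_size)
    overhead_mib = num_scales * 2 / 2**20  # assuming FP16 scales (2 bytes each)
    print(f"GS{group_size}: {num_scales:,} scales ≈ {overhead_mib:.2f} MiB")
```

Smaller groups roughly double the scale metadata each time the group size halves, which is the accuracy-for-overhead trade-off described above.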
Context length
- Calibration context: up to 2048 tokens per sample (as above).
- Model context window: inherited from ReadyArt/Broken-Tutu-24B; quantization does not change RoPE/position encodings—only the numeric representation of the weights.
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
pip install vllm
Serve (adjust to your hardware):
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors \
--quantization compressed-tensors \
--tensor-parallel-size 4 \
--max-model-len 2048 \
--gpu-memory-utilization 0.70 \
--dtype bfloat16
Example Chat Completions:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
"messages": [
{"role":"system","content":"You are Broken-Tutu — helpful, precise, and safe."},
{"role":"user","content":"Draft a short, character-driven opening in under 200 words."}
],
"max_tokens": 512,
"temperature": 0.7,
"top_p": 0.95
}'
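
The same request can also be made from Python via the OpenAI client pointed at the local server (a sketch; the `openai` package is assumed to be installed, and vLLM ignores the API key):

```python
# Same chat-completions request through the OpenAI-compatible endpoint served by vLLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored by vLLM

response = client.chat.completions.create(
    model="TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Broken-Tutu — helpful, precise, and safe."},
        {"role": "user", "content": "Draft a short, character-driven opening in under 200 words."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
```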
Note: `compressed-tensors` is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible export (e.g., GPTQ/AWQ) or the full-precision finetune.
Prompting / chat template
This package follows the finetuned parent’s chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages (a short example follows the guidelines below).
Guidelines:
- Keep the system message concise (behavior, tone, safety constraints).
- Provide clear user instructions; for multi-step tasks, list steps explicitly.
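
For example, with 🤗 Transformers (a sketch; the branch chosen here is just one of the quant revisions, any of which carries the tokenizer files):

```python
# Render a conversation with the packaged chat template before sending it to the server.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    revision="W4A16_GS128",  # any quant branch; main is a placeholder landing page
)

messages = [
    {"role": "system", "content": "You are Broken-Tutu — helpful, precise, and safe."},
    {"role": "user", "content": "Outline a three-step plan, one sentence per step."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```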
Intended use & safety
This quantization:
- Does not change underlying behavior or content tendencies.
- Only changes weight storage for efficient inference.
Because the base is flagged “not for all audiences,” apply appropriate content filters / policies for your deployment context.
Lineage
- Finetuned parent: https://huggingface.co/ReadyArt/Broken-Tutu-24B
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Hardware tips
- 24B models benefit from multi-GPU tensor parallel for throughput.
- Long contexts are KV-cache heavy — tune `--max-model-len` and batch size.
- Prefer BF16 on GPUs with native support; otherwise FP16.
- Enable P2P/NVLink when available; consider CUDA Graphs if stable.
Changelog
- v1 (current) — Initial compressed-tensors release; branches
W4A16_GS32 / W4A16_GS64 / W4A16_GS128 / W8A16_GS32 / W8A16_GS64 / W8A16_GS128 with 512-sample / 2048-token AWQ calibration; vLLM-ready packaging.