Broken-Tutu-24B — Quantized (compressed-tensors for vLLM)

This repository provides quantized runtime builds of
ReadyArt/Broken-Tutu-24B (marked “not for all audiences”), repackaged for vLLM using the compressed-tensors format.

TL;DR

  • Six branches covering INT4 (W4A16) and INT8 (W8A16) with group sizes 32 / 64 / 128.
  • Same calibration recipe as our recent cards: 512 chat samples, 2048 max sequence length, dataset neuralmagic/LLM_compression_calibration (messages rendered with the model’s chat template).
  • Weight-only AWQ; lm_head kept high-precision; exported with save_compressed=True for vLLM.

Revisions & Branches

The main branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

  • main — placeholder / landing page
  • W4A16_GS32 — INT4 weights / A16 activations, group size 32 (highest fidelity)
  • W4A16_GS64 — INT4 / A16, group size 64 (balanced)
  • W4A16_GS128 — INT4 / A16, group size 128 (leanest INT4)
  • W8A16_GS32 — INT8 / A16, group size 32 (max fidelity for INT8)
  • W8A16_GS64 — INT8 / A16, group size 64 (balanced INT8)
  • W8A16_GS128 — INT8 / A16, group size 128 (leanest INT8)
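
To pull a single quant branch locally, here is a minimal sketch using huggingface_hub (the branch name is just one choice from the list above; any branch works):

# Illustrative: download one quantization branch (revision) of this repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    revision="W4A16_GS64",  # branch = quantization variant; see the list above
)
print(local_dir)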

What’s inside (per revision)

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (weight_format, quantization, quantization_config, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)
  • Optional: chat_template.jinja (inherits the finetune’s chat style)

Exact file lists may differ between branches — see Files and versions for each revision.


Quantization & calibration details (same script/recipe family as prior cards)

Method / flow

  • llmcompressor oneshot pipeline with an AWQModifier (weight-only quantization).

Targets / exclusions

  • Quantize Linear layers; ignore lm_head (kept high-precision).

Weights / grouping

  • Branches W4A16_* use INT4 (num_bits=4, symmetric=True).
  • Branches W8A16_* use INT8 (num_bits=8, symmetric=True).
  • Strategy: "group" with group size ∈ {32, 64, 128} according to branch.
  • Activations not quantized (runtime A16: BF16/FP16).
  • Export with save_compressed=True so vLLM loads the compressed-tensors layout directly.
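
Expressed as a recipe, the settings above look roughly like the sketch below. This assumes llmcompressor’s AWQModifier accepts a compressed-tensors-style config_groups mapping; the exact script for these branches is not published, so treat the field values (especially group_size and num_bits) as per-branch placeholders:

# Sketch of the per-branch recipe implied above (not the published script).
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    ignore=["lm_head"],                  # lm_head stays high-precision
    config_groups={
        "group_0": {
            "targets": ["Linear"],       # quantize Linear layers only
            "weights": {
                "num_bits": 4,           # 8 for the W8A16_* branches
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 32,        # 32 / 64 / 128 depending on branch
            },
            # activations omitted: weight-only quantization (A16 at runtime)
        }
    },
)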

Calibration dataset & preprocessing

  • Dataset: neuralmagic/LLM_compression_calibration, split train.
  • NUM_CALIBRATION_SAMPLES = 512 (random subset with fixed seed).
  • MAX_SEQUENCE_LENGTH = 2048.
  • Each sample’s messages list is rendered via tokenizer.apply_chat_template(..., tokenize=False), then tokenized with:
    • max_length=2048, truncation=True, padding=False, add_special_tokens=False.
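
A sketch of that preprocessing using the datasets and transformers APIs (the helper names and the seed value are illustrative; the card does not include the exact script):

# Sketch: load and preprocess the calibration split as described above.
from datasets import load_dataset
from transformers import AutoTokenizer

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SEED = 42  # placeholder for the fixed seed

tokenizer = AutoTokenizer.from_pretrained("ReadyArt/Broken-Tutu-24B")

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=SEED).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render the messages list with the model's chat template (text only).
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

def tokenize(example):
    return tokenizer(
        example["text"],
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)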

Compression call

  • oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer) on the preprocessed dataset.
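
Continuing the sketches above, the compression call and export look roughly like this (the base-model load and output directory are illustrative, following the standard llmcompressor flow rather than the exact script used for these branches):

# Sketch: run one-shot AWQ calibration and export compressed-tensors for vLLM.
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ReadyArt/Broken-Tutu-24B", torch_dtype="auto")

oneshot(
    model=model,
    dataset=ds,                                        # preprocessed calibration set from above
    recipe=recipe,                                     # AWQModifier from above
    max_seq_length=MAX_SEQUENCE_LENGTH,                # 2048
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,   # 512
    tokenizer=tokenizer,
)

model.save_pretrained("Broken-Tutu-24B_W4A16_GS32", save_compressed=True)
tokenizer.save_pretrained("Broken-Tutu-24B_W4A16_GS32")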

Why group size matters in AWQ (W4A16 / W8A16)

  • Definition: Group size controls how many consecutive weights share a single set of quantization scales.
  • Trade-offs (accuracy ↔ throughput/VRAM):
    • GS32 (smallest groups): Most scale sets → highest fidelity (often best perplexity/task scores), but more scale metadata and slightly lower throughput.
    • GS64 (middle ground): Good balance of quality and performance; a solid default if you haven’t profiled yet.
    • GS128 (largest groups): Fewest scales → leaner/faster (less bandwidth/metadata), with slightly higher quantization error; good for throughput-critical serving.
  • Bit-width differences:
    • W4A16 tends to be smaller/faster than INT8 at the same GS, but can be more sensitive to GS and calibration coverage.
    • W8A16 offers more headroom (often closer to FP in quality), at the cost of more VRAM/bandwidth.
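
As a back-of-the-envelope illustration of the scale-metadata overhead (a hypothetical 4096×4096 Linear layer, not measured on this model):

# Per-group FP16 scales for one hypothetical 4096 x 4096 Linear layer.
rows, cols = 4096, 4096
weights = rows * cols                       # 16,777,216 weights
for gs in (32, 64, 128):
    scales = weights // gs                  # one scale per group of `gs` consecutive weights
    overhead = (scales * 2) / (weights / 2) # FP16 scale bytes vs. packed INT4 weight bytes
    print(f"GS{gs}: {scales:,} scales, ~{overhead:.1%} extra memory at W4")
# GS32 adds roughly 12.5% on top of packed INT4 weights, GS128 roughly 3.1%.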

Context length

  • Calibration context: up to 2048 tokens per sample (as above).
  • Model context window: inherited from ReadyArt/Broken-Tutu-24B; quantization does not change RoPE/position encodings, only the numeric representation of the weights.

Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

pip install vllm

Serve (adjust to your hardware; the main branch holds no weights, so select a quant branch with --revision):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors \
  --revision W4A16_GS64 \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16

Example Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Broken-Tutu — helpful, precise, and safe."},
      {"role":"user","content":"Draft a short, character-driven opening in under 200 words."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
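
The same request via the OpenAI Python client (a sketch; assumes the openai package and the server started above; vLLM accepts any placeholder API key):

# Equivalent request using the OpenAI Python client against the local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored by vLLM
resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Broken-Tutu: helpful, precise, and safe."},
        {"role": "user", "content": "Draft a short, character-driven opening in under 200 words."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)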

Note: compressed-tensors is a vLLM runtime format. Loading directly with vanilla 🤗 Transformers is not supported.
For Transformers, use a compatible export (e.g., GPTQ/AWQ) or the full-precision finetune.


Prompting / chat template

This package follows the finetuned parent’s chat conventions. If a chat_template.jinja is present, libraries that support apply_chat_template will automatically format messages.
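
If you format prompts outside of vLLM (for raw completion endpoints or offline inspection), the template can be rendered with transformers; a minimal sketch, with the branch choice and message contents as illustrative placeholders:

# Render a conversation with the bundled chat template (if present) via transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors",
    revision="W4A16_GS64",  # any quant branch carries the tokenizer and template
)
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "Behavior, tone, and safety constraints go here."},
        {"role": "user", "content": "Step 1: ... Step 2: ..."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)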

Guidelines:

  • Keep the system message concise (behavior, tone, safety constraints).
  • Provide clear user instructions; for multi-step tasks, list steps explicitly.

Intended use & safety

This quantization:

  • Does not retrain or realign the model; behavior and content tendencies are inherited from the base finetune (small numeric differences from quantization aside).
  • Only changes how the weights are stored, for more efficient inference.

Because the base is flagged “not for all audiences,” apply appropriate content filters / policies for your deployment context.


Lineage

  • Base finetune: ReadyArt/Broken-Tutu-24B (flagged “not for all audiences”)
  • This repository: TheHouseOfTheDude/Broken-Tutu-24B_Compressed-Tensors, compressed-tensors quantizations of the base finetune (one branch per scheme, listed above)

Hardware tips

  • 24B models benefit from multi-GPU tensor parallel for throughput.
  • Long contexts are KV-cache heavy — tune --max-model-len and batch size.
  • Prefer BF16 on GPUs with native support; otherwise FP16.
  • Enable P2P/NVLink when available; consider CUDA Graphs if stable.

Changelog

  • v1 (current) — Initial compressed-tensors release; branches
    W4A16_GS32 / W4A16_GS64 / W4A16_GS128 / W8A16_GS32 / W8A16_GS64 / W8A16_GS128 with 512-sample / 2048-token AWQ calibration; vLLM-ready packaging.