majentik committed on
Commit 8d75fbc · verified · 1 Parent(s): 821a388

Update KV-cache card with accurate template and fork requirements

Files changed (1)
  1. README.md +99 -51
README.md CHANGED
@@ -1,75 +1,123 @@
  ---
- base_model: mistralai/Voxtral-4B-TTS-2603
- library_name: transformers
  license: apache-2.0
- pipeline_tag: text-to-speech
  tags:
  - voxtral
- - audio
- - speech
  - tts
- - text-to-speech
  - voice-cloning
  - zero-shot
- - kv-cache
- - rotorquant
- - quantization
  ---

  # Voxtral-4B-TTS-2603-RotorQuant

- RotorQuant KV-cache bundle for [`mistralai/Voxtral-4B-TTS-2603`](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603). Rotational online re-basis of the acoustic KV-cache, recommended when switching voices, languages, or styles within a batch.

- This artifact ships **only the quantized KV-cache**; weights load from upstream.

- ## Overview

- - **Base model:** `mistralai/Voxtral-4B-TTS-2603`
- - **Capabilities:** TTS, zero-shot voice cloning, 9 languages
- - **Quantization target:** attention KV-cache only
- - **Method:** RotorQuant: orthogonal rotation + low-bit quantization, refreshed per session

  ## Quickstart

- ```python
- from transformers import VoxtralForConditionalGeneration, AutoProcessor
- from majentik_quant import RotorQuantCache

- model_id = "mistralai/Voxtral-4B-TTS-2603"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

- cache = RotorQuantCache.from_pretrained("majentik/Voxtral-4B-TTS-2603-RotorQuant")

- for line, voice in utterances:
-     inputs = processor(text=line, speaker_audio=voice, return_tensors="pt")
-     audio = model.generate(**inputs, past_key_values=cache, max_new_tokens=2048)
-     processor.save_audio(audio, f"{line[:10]}.wav")
  ```

- ## Model specs

- | Field | Value |
- |---|---|
  | Parameters | 4B |
- | Modality | Text-in, audio-out |
- | Languages | 9 |
- | Voice cloning | Zero-shot |
- | Cache quantization | RotorQuant (rotated int4) |
- | License | Apache 2.0 |
-
- ## RotorQuant vs TurboQuant
-
- | | RotorQuant | TurboQuant |
- |---|---|---|
- | Strategy | Rotational online re-basis | Per-head static calibration |
- | Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
- | Best for | Multi-voice / multi-language batches | Single-voice, single-language sessions |
-
- ## See also
-
- - [`majentik/Voxtral-4B-TTS-2603-TurboQuant`](https://huggingface.co/majentik/Voxtral-4B-TTS-2603-TurboQuant)
- - [`majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-8bit)
- - [`majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-4bit)
- - [`majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-4B-TTS-2603-RotorQuant-MLX-2bit)
- - [`mistralai/Voxtral-4B-TTS-2603`](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603), upstream base model

  ---
  license: apache-2.0
+ base_model: mistralai/Voxtral-4B-TTS-2603
  tags:
+ - rotorquant
+ - kv-cache-quantization
  - voxtral
+ - mistral
  - tts
  - voice-cloning
  - zero-shot
+ - quantized
+ library_name: transformers
+ pipeline_tag: text-to-speech
  ---

  # Voxtral-4B-TTS-2603-RotorQuant

+ **RotorQuant KV cache compression** for [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603).
+
+ This is a **documentation repository** that explains how to combine Voxtral-4B-TTS-2603's weights with RotorQuant inference-time KV cache compression. No weights are stored here; use the base model directly and apply RotorQuant via the Python package or llama.cpp fork.

+ ## What is this?

+ KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime, so the same base weights can be used with or without compression.

+ | Technique | Where it's applied | Savings |
+ |-----------|-------------------|---------|
+ | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
+ | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |
+
+ Both can be combined for maximum efficiency.
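
As a rough illustration of why the attention cache matters at long context, the sketch below estimates KV cache size at different bit widths. The layer count, KV head count, and head dimension are placeholder values chosen to make the arithmetic easy to follow; they are not Voxtral-4B-TTS-2603's published configuration. Quantization metadata such as per-block scales is ignored.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits, batch=1):
    """Approximate KV cache size: keys and values for every layer and token."""
    # Factor 2 accounts for storing both K and V.
    elements = 2 * batch * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits / 8


# Hypothetical dimensions, for illustration only (not taken from the model card).
cfg = dict(seq_len=32_768, n_layers=32, n_kv_heads=8, head_dim=128)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(bits=bits, **cfg) / 2**30
    print(f"{bits:>2}-bit cache: {gib:.2f} GiB")
```

With these placeholder values the estimate falls from 4 GiB at 16 bits to 1 GiB at 4 bits and 0.5 GiB at 2 bits, while the weight memory is unchanged.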

  ## Quickstart

+ ### Option A: Python / transformers

+ Install the `rotorquant` package:

+ ```bash
+ pip install rotorquant
+ ```
+
+ Then use it with the base model:

+ ```python
+ import torch
+ from transformers import VoxtralTTSForConditionalGeneration, AutoTokenizer
+ from rotorquant import IsoQuantCache
+
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Voxtral-4B-TTS-2603", trust_remote_code=True)
+ model = VoxtralTTSForConditionalGeneration.from_pretrained(
+     "mistralai/Voxtral-4B-TTS-2603",
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Apply RotorQuant to the KV cache
+ cache = IsoQuantCache(bits=4)  # or bits=2 for more aggressive compression
+
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=128,
+     past_key_values=cache,
+     use_cache=True,
+ )
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
  ```


+ ## Model Specifications
+
+ | Property | Value |
+ |----------|-------|
+ | Base Model | [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) |
+ | Architecture | Text-to-speech with zero-shot voice cloning |
  | Parameters | 4B |
+ | Context Length | 32K |
+ | BF16 Size | ~8 GB |
+ | Modalities | Text → Audio |
+ | License | apache-2.0 |
+
+ ## What is RotorQuant?
+
+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors: a faster, more parameter-efficient alternative to Google's TurboQuant. Uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet) achieving O(d) complexity instead of O(d log d), fully parallelisable with no inter-element dependencies.
+
+ **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090; results vary by model and hardware):
+
+ - Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
+ - Decode: 119 tok/s (vs TurboQuant 93 tok/s)
+ - Perplexity: 6.91 (vs TurboQuant 7.07)
+ - Parameters: 4 per rotor (vs TurboQuant 16,384)
+
+ > Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on Voxtral-4B-TTS-2603 will differ. Please open a discussion if you have independent results.
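
To make the "independent 2D rotations per pair" idea above concrete, here is a minimal pure-Python sketch of a block-diagonal rotation followed by low-bit quantization. It is not RotorQuant's implementation: the random angles, the per-vector symmetric quantizer, and the helper names `rotate_pairs`, `quantize`, and `dequantize` are all assumptions made for illustration. It only shows that each pair is rotated on its own, so the work is linear in the vector length, and that the rotation is undone exactly after dequantization.

```python
import math
import random


def rotate_pairs(x, angles, inverse=False):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by its own angle: O(d) work."""
    out = []
    for i, theta in enumerate(angles):
        if inverse:
            theta = -theta
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [c * a - s * b, s * a + c * b]
    return out


def quantize(x, bits):
    """Symmetric uniform quantization with a single scale per vector."""
    levels = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in x) / levels) or 1.0  # avoid a zero scale
    return [round(v / scale) for v in x], scale


def dequantize(q, scale):
    return [v * scale for v in q]


random.seed(0)
d = 8  # toy head dimension; must be even for 2D pairing
x = [random.gauss(0, 1) for _ in range(d)]
angles = [random.uniform(0, 2 * math.pi) for _ in range(d // 2)]  # placeholder angles

q, scale = quantize(rotate_pairs(x, angles), bits=4)
x_hat = rotate_pairs(dequantize(q, scale), angles, inverse=True)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
print(f"reconstruction error: {err:.4f}")
```

Because each 2D rotation is orthogonal, the error introduced by quantization in the rotated basis has the same norm after rotating back; the pairs never interact, which is what makes the scheme trivially parallel.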
+
+ ## Current Ecosystem Support
+
+ | Runtime | RotorQuant Support | Notes |
+ |---------|----------------------|-------|
+ | Python transformers + `rotorquant` | ✅ Full | Drop-in cache class |
+ | llama.cpp upstream | ❌ Not merged | Use fork below |
+ | llama-cpp-turboquant fork | ✅ `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
+ | LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
+ | Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
+ | vLLM | ❌ Not supported | - |
+ | koboldcpp | ❌ Not supported | - |
+
+ ## Pre-quantized weight variants
+
+ If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:
+
+ - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=Voxtral-4B-TTS-2603+MLX)
+ - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=Voxtral-4B-TTS-2603+GGUF)
+
+ ## See Also
+
+ - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
+ - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
+ - [Base model: mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603)
+ - [Voxtral-4B-TTS-2603 announcement](https://mistral.ai/news/voxtral)