Commit 542ee2b (verified), committed by majentik
1 parent: efea9f8

chore(card): add hardware compatibility section

Files changed (1): README.md (+11 -8)

@@ -2,16 +2,13 @@
 license: mit
 base_model: deepseek-ai/DeepSeek-V3.2
 tags:
-- rotorquant
-- kv-cache-quantization
-- deepseek
-- moe
-- quantized
+- rotorquant
+- kv-cache-quantization
+- deepseek
+- moe
+- quantized
 library_name: transformers
 pipeline_tag: text-generation
-language:
-- en
-inference: false
 ---
 
 # DeepSeek-V3.2-RotorQuant
@@ -20,6 +17,12 @@ inference: false
 
 This is a **documentation repository** that explains how to combine DeepSeek-V3.2's weights with RotorQuant inference-time KV cache compression. No weights are stored here — use the base model directly and apply RotorQuant via the Python package or llama.cpp fork.
 
+## Hardware compatibility
+
+| Device | VRAM / RAM | Recommendation |
+| --- | --- | --- |
+| Any host that runs the base model | baseline + runtime savings | RotorQuant/TurboQuant is a KV-cache runtime modifier; pair with any weight variant |
+
 ## What is this?
 
 KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime — so the same base weights can be used with or without compression.
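
To give a rough sense of the "baseline + runtime savings" entry in the added table, the sketch below estimates how KV cache size scales with the bit width of each cached value. It is a back-of-the-envelope calculation only. The function name `kv_cache_bytes` and every configuration number are hypothetical, and it assumes a plain multi-head or grouped-query cache layout; it does not model DeepSeek-V3.2's actual attention layout or RotorQuant's real storage format, and it ignores per-block overhead such as scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes for a standard attention cache.

    Assumes one key vector and one value vector per layer, per KV head,
    per token. Quantization metadata (scales, zero points) is ignored.
    """
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # keys + values
    total_bits = values_per_token * seq_len * batch_size * bits_per_value
    return total_bits // 8


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)  # uncompressed 16-bit cache
q4 = kv_cache_bytes(**cfg, bits_per_value=4)     # illustrative 4-bit cache

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f" 4-bit cache: {q4 / 2**30:.2f} GiB")
```

Under these assumed numbers the cache shrinks in proportion to the bit width, while the weight footprint is unchanged, which is why the table describes the requirement as the base model's baseline plus runtime savings.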