---
license: mit
library_name: mlx
pipeline_tag: text-generation
tags:
- transformers
- mlx
base_model: meituan-longcat/LongCat-Flash-Chat
---

# TOTORONG/LongCat-Flash-3.5bits

This model [TOTORONG/LongCat-Flash-3.5bits](https://huggingface.co/TOTORONG/LongCat-Flash-3.5bits) was converted to MLX format from [meituan-longcat/LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat) using mlx-lm version **0.27.1**.
## Quantization

The model is quantized to an average of 3.516 bits per weight so that it fits on an M3 Ultra with 256 GB of unified memory. As a rough sanity check, at the ~560B total parameters reported for LongCat-Flash, 560e9 × 3.516 / 8 ≈ 246 GB of weights, just under the 256 GB ceiling.

**Selected layers (the precision bump mask).** A layer is considered early, late, or periodic if its index `i` (from `model.layers.i`) satisfies any of:

- `i < num_layers // 8`, or
- `i >= 7 * num_layers // 8`, or
- `(i - num_layers // 8) % 3 == 2`

These layers receive:

- Q/K/V: 3-bit → 4-bit
- O-proj: 4-bit → 6-bit
- Experts (`.mlps.<idx>.*`): 2-bit → 3-bit

The Switch-MLP remains at 3-bit across all layers. This mask preserves prompt sensitivity at the front of the stack and output stability at the tail, with a periodic boost in between to reduce worst-case error accumulation.
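For concreteness, here is a minimal Python sketch of the selection rule and the resulting bit widths. It is illustrative only: `is_bumped` and `bits_for` are hypothetical names, and the substring checks on parameter paths (`q_proj`, `o_proj`, `.mlps.`) are assumptions about naming, not the exact logic of the conversion script.

```python
def is_bumped(i: int, num_layers: int) -> bool:
    """True if layer i is early, late, or on the periodic boost schedule."""
    return (
        i < num_layers // 8
        or i >= 7 * num_layers // 8
        or (i - num_layers // 8) % 3 == 2
    )


def bits_for(path: str, i: int, num_layers: int) -> int:
    """Bit width for a weight tensor at `path` inside model.layers.<i> (sketch)."""
    bumped = is_bumped(i, num_layers)
    if any(p in path for p in ("q_proj", "k_proj", "v_proj")):
        return 4 if bumped else 3  # Q/K/V: 3b -> 4b on selected layers
    if "o_proj" in path:
        return 6 if bumped else 4  # O-proj: 4b -> 6b on selected layers
    if ".mlps." in path:
        return 3 if bumped else 2  # experts: 2b -> 3b on selected layers
    return 3  # Switch-MLP (and everything else) stays at 3b in all layers


# With a hypothetical 64-layer stack: layers 0-7 (early), 56-63 (late), and
# every third middle layer starting at 10 (10, 13, 16, ...) get the bump.
print([i for i in range(64) if is_bumped(i, 64)])
```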
## Use with mlx

```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized weights and tokenizer.
model, tokenizer = load("TOTORONG/LongCat-Flash-3.5bits")

prompt = "hello"

# Apply the model's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
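For interactive use, mlx-lm also exposes a streaming generator. A minimal sketch, assuming the `stream_generate` API available in recent mlx-lm releases:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("TOTORONG/LongCat-Flash-3.5bits")

messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)
print()
```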