# LFM2.5-350M-4bit (MLX)

4-bit quantized MLX conversion of LiquidAI/LFM2.5-350M.

Converted with `mlx-lm==0.31.2` using the standard quantization method.

## Quantization

| Setting | Value |
|---|---|
| Method | Standard (weight-only RTN) |
| Bits | 4 |
| Group size | 64 |
| Effective bits/weight | 4.502 |
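The effective bits/weight can be sanity-checked from the group-wise overhead: MLX's standard affine quantization stores a scale and a bias for each group of weights, so every 64-weight group of 4-bit values carries extra metadata bits. A minimal sketch, assuming an fp16 scale and fp16 bias per group (the small remainder above 4.5 comes from tensors kept in higher precision):

```python
# Per-group storage for weight-only RTN quantization:
# 64 weights x 4 bits, plus per-group metadata.
group_size = 64
bits = 4
overhead_bits = 16 + 16  # assumption: one fp16 scale + one fp16 bias per group

effective_bits = (group_size * bits + overhead_bits) / group_size
print(effective_bits)  # 4.5 -- the reported 4.502 also counts unquantized tensors
```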

## Quality

Perplexity on allenai/tulu-3-sft-mixture (256 samples, seq_len=512), measured with `mlx_lm.perplexity`:

| Model | Perplexity | Δ vs bf16 |
|---|---|---|
| LiquidAI/LFM2.5-350M (bf16) | 118.70 ± 1.69 | baseline |
| This model (4-bit) | 180.60 ± 2.66 | +52% |

Note: Sub-1B models are more sensitive to low-bit quantization. If quality matters more than size, consider the official LiquidAI/LFM2.5-350M-MLX-6bit / -8bit variants, or a future DWQ/AWQ build.
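The +52% figure above is simply the relative perplexity increase over the bf16 baseline:

```python
ppl_bf16 = 118.70  # bf16 baseline perplexity
ppl_4bit = 180.60  # this 4-bit model

delta = (ppl_4bit - ppl_bf16) / ppl_bf16
print(f"{delta:+.0%}")  # +52%
```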

## Performance

Benchmarked with `mlx_lm.benchmark -p 512 -g 128` on an Apple M4 Pro (48 GB):

| Metric | Value |
|---|---|
| Prefill | 9,470 tok/s |
| Generation | 676 tok/s |
| Peak memory | 465 MB |
| Size on disk | 195 MB |
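The disk footprint is consistent with the effective bits/weight reported in the Quantization section. A rough cross-check, assuming a nominal 350M parameters (the exact count may differ slightly):

```python
params = 350e6          # assumption: nominal count from the model name
effective_bits = 4.502  # effective bits/weight after quantization

size_bytes = params * effective_bits / 8
print(f"{size_bytes / 1e6:.0f} MB")  # ~197 MB, close to the 195 MB on disk
```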

## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("BRlin/LFM2.5-350M-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=100)
print(response)
```

Or via CLI:

```shell
mlx_lm.generate --model BRlin/LFM2.5-350M-4bit --prompt "Hello"
```

## License

Inherits the LFM Open License v1.0 from the base model.
