Intern-S2-Preview FP8 MLX 4-bit

This repository contains an MLX-compatible 4-bit version of internlm/Intern-S2-Preview.

Local Usage

python -m mlx_lm generate \
  --model <namespace>/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096

For a local checkout:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096
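
For scripted runs, the same invocation can be wrapped in a small Python helper. This is a minimal sketch that only reuses the flags shown above; the function names `build_command` and `run_generation` are illustrative and not part of mlx_lm.

```python
import subprocess
import sys


def build_command(model: str, prompt: str, max_tokens: int = 4096) -> list[str]:
    """Return the argv list for the mlx_lm generate call shown above."""
    return [
        sys.executable, "-m", "mlx_lm", "generate",
        "--model", model,          # repo id or local checkout path
        "--trust-remote-code",
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
    ]


def run_generation(model: str, prompt: str, max_tokens: int = 4096) -> str:
    """Run the CLI and return its stdout (whatever the CLI prints)."""
    result = subprocess.run(
        build_command(model, prompt, max_tokens),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `run_generation("/path/to/Intern-S2-Preview-FP8-MLX-4bit", "your prompt")` requires mlx_lm to be installed in the same Python environment; `build_command` on its own has no such requirement.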

Local Benchmark

Benchmarks were run locally with mlx_lm generate on Apple Silicon.

Basic Generation

Command:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Write a concise response to your prompt here." \
  --max-tokens 4096

Observed output stats:

| Metric | Value |
| --- | --- |
| Prompt tokens | 19 |
| Prompt throughput | 306.835 tokens/sec |
| Generation tokens | 702 |
| Generation throughput | 123.388 tokens/sec |
| Peak memory | 19.651 GB |
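
To relate these figures to wall-clock time, each phase takes roughly tokens divided by throughput. The sketch below assumes the reported throughput is an average over the whole phase and ignores model load time, which is not part of the stats above.

```python
def phase_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Approximate duration of one phase from its token count and throughput."""
    return tokens / tokens_per_sec


# Values copied from the basic generation run above.
prompt_s = phase_seconds(19, 306.835)
generation_s = phase_seconds(702, 123.388)

print(f"prompt: {prompt_s:.2f} s, generation: {generation_s:.2f} s")
# prints "prompt: 0.06 s, generation: 5.69 s"
```

Generation dominates: the 702 output tokens account for nearly all of the run time, so `--max-tokens` is the main lever on latency.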

Prompted Final-Only Output Test

Command:

python -m mlx_lm generate \
  --model /path/to/Intern-S2-Preview-FP8-MLX-4bit \
  --trust-remote-code \
  --prompt "Do not show reasoning, analysis, thinking process, scratchpad, or <think> text. Output only the final answer. Write a concise response to your prompt here." \
  --max-tokens 4096

Observed output stats:

| Metric | Value |
| --- | --- |
| Prompt tokens | 44 |
| Prompt throughput | 487.095 tokens/sec |
| Generation tokens | 817 |
| Generation throughput | 122.650 tokens/sec |
| Peak memory | 19.695 GB |

The model still emitted visible reasoning text in this raw generation mode, so prompt-only suppression was not sufficient.

Notes

  • Format: MLX sharded safetensors
  • Quantization: FP8/4-bit MLX local build
  • Base model: internlm/Intern-S2-Preview
  • The model may emit visible reasoning text in raw generation. For chat applications, use a serving layer or post-processor that strips reasoning if needed.
  • Raw generation throughput was about 123 tokens/sec in the local smoke tests above.
  • Peak memory in these tests was about 19.7 GB.
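
A post-processor of the kind mentioned in the notes could look like the sketch below. It assumes the reasoning is delimited by `<think>` and `</think>` tags; that delimiter is an assumption based on the tag named in the test prompt, so check it against the model's actual output before relying on it. The function name `strip_reasoning` is illustrative.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>. Adjust if the
# model uses different delimiters.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Return the text with reasoning spans removed."""
    # Remove every complete <think>...</think> block.
    cleaned = _THINK_BLOCK.sub("", text)
    # A closing tag with no opener: keep only what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>")[-1]
    # An opener with no closing tag (e.g. output cut off by the token limit):
    # keep only what precedes it.
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>")[0]
    return cleaned.strip()


print(strip_reasoning("<think>draft a plan</think>\nFinal answer."))
# prints "Final answer."
```

If generation stops inside the reasoning span, this returns an empty string, which a caller can treat as a signal to retry with a larger token budget.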

License

This is a derived MLX build of internlm/Intern-S2-Preview. Refer to the base model repository for upstream license and usage terms.
