	---
	license: other
	library_name: vllm
	pipeline_tag: text-generation
	base_model: Soofi-Project/Soofi-S-Instruct-Preview
	base_model_relation: quantized
	tags:
	- soofi
	- mamba-2
	- moe
	- text-generation
	- sovereign-ai
	- preview
	- quantized
	- entquant
	- fp8
	- 4-bit
	- vllm
	language:
	- en
	- de
	- es
	- fr
	- it
	---

	# Soofi-S-Instruct-Preview-EntQuant-4bit Overview

	> ⚠️ Preview / internal checkpoint. Weights and metadata may still change.
	>
	> Quantized derivative of [`Soofi-Project/Soofi-S-Instruct-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview).
	> See [Quantization details](#quantization-details) for the recipe; the base model card has the underlying model's full description, training data, and evaluation.

	## Description

	EntQuant-compressed serving variant of Soofi-S-Instruct-Preview — the instruction-tuned, non-thinking variant of SOOFI-S, a sovereign, open-source language model developed by a German research consortium. SOOFI (Sovereign Open Source Foundation Models) is designed to provide a secure, European open-source alternative to US and Chinese AI models for industrial use, featuring strong reasoning and AI-agent capabilities.

	This checkpoint is compressed with [EntQuant](#quantization-details), which applies a lossless entropy-coding pass to entropy-optimized FP8 codes and brings the compressed weights to an effective size of ~4 bits per parameter. Following standard practice, only the Transformer linear weights (attention projections and MoE expert projections) are compressed; the Mamba-2 state-space layers, the embedding table, and the LM head are kept at the base model's precision.

	For explicit chain-of-thought reasoning, use the thinking variants Soofi-S-Isar-Preview and Soofi-S-Rhine-Preview (and their quantized derivatives).

	This model is for research and development only (Preview).

	## License/Terms of Use

	Released under a custom license ("Other"). TODO: add the full license text / link — inherits from the base model.

	## Deployment Geography

	Global (open release on the Hugging Face Hub). Development and training infrastructure are located in Europe (see [Computational Load](#computational-load) on the base model card).

	## Use Case

	Enterprise developers and researchers seeking a sovereign, European open-source LLM for industrial use: general assistant tasks, instruction following, and AI-agent / tool-use workflows. English and German are the primary languages. This quantized variant targets cost-effective inference on a single GPU.

	## Quick start

	This repository is self-contained: it ships the model weights, the EntQuant plugin source, a Dockerfile, and a Compose file. Three lines and you have an OpenAI-compatible server:

	```bash
	hf download Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-4bit --local-dir ./Soofi-S-Instruct-Preview-EntQuant-4bit
	cd Soofi-S-Instruct-Preview-EntQuant-4bit
	docker compose up -d
	```

	(`hf` is the Hugging Face CLI; run `pip install huggingface_hub` if you don't have it. `git clone` also works, but only if `git-lfs` is installed first. Without it you get tiny pointer files instead of the 13 GB of weights, a common gotcha.)

	The server is then live on `http://localhost:8000/v1`. The model name to send in API requests is `Soofi-S-Instruct-Preview-EntQuant-4bit`.

	Behind a corporate proxy? `export HTTP_PROXY=http://your-proxy:port HTTPS_PROXY=http://your-proxy:port` before `docker compose up` — the build picks them up via `build.args`.

	Pin a specific GPU on a multi-GPU host? In `docker-compose.yml`, replace `count: 1` with `device_ids: ["3"]` (GPU index) or `device_ids: ["GPU-<uuid>"]`, where the UUID is the one printed by `nvidia-smi -L`.

	Smoke test:

	```bash
	curl http://localhost:8000/v1/chat/completions \
	  -H 'Content-Type: application/json' \
	  -d '{
	    "model": "Soofi-S-Instruct-Preview-EntQuant-4bit",
	    "messages": [{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}]
	  }'
	```
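
The same check can be scripted, for example to wait for the container to finish loading before sending traffic. A minimal sketch using only the Python standard library, assuming the server answers the OpenAI-compatible `GET /v1/models` endpoint; `served_model_ids` and `wait_until_ready` are illustrative helper names, not part of this repo.

```python
import json
import time
import urllib.error
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model names from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def wait_until_ready(base_url: str = "http://localhost:8000/v1",
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> list[str]:
    """Poll /v1/models until the server answers, then return the served model names."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
                return served_model_ids(json.load(resp))
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            time.sleep(poll_s)  # server is not accepting requests yet
    raise TimeoutError(f"no answer from {base_url}/models within {timeout_s} s")
```

Calling `wait_until_ready()` after `docker compose up -d` should return a list containing `Soofi-S-Instruct-Preview-EntQuant-4bit` once the server is up.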

	Override settings via `.env` (see `.env.example` in the repo): port, GPU index, max context length (the model supports up to 262144 / 256K), GPU memory fraction. The defaults serve at 32K context and 90% of GPU memory on a single GPU.

	### Requirements

	- NVIDIA GPU with compute capability ≥ 9.0 (Hopper / Blackwell) for the production fp8 W8A8 path. Older GPUs work but fall back to a slower bf16 linear path.
	- ~13 GB of GPU memory for the weights, plus KV cache (highly dependent on max-model-len and concurrency).
	- NVIDIA Container Toolkit; Docker Engine 24+ / Compose v2.
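
How much memory the KV cache needs on top of the weights can be estimated with the usual per-token formula. This is a rough sketch: the 6 attention layers come from the architecture section of this card, but `num_kv_heads=8` and `head_dim=128` are placeholder values chosen for illustration (check the model's `config.json`), and the estimate ignores the Mamba-2 state and any allocator overhead in vLLM.

```python
def kv_cache_bytes(tokens: int,
                   attention_layers: int = 6,   # from this card's architecture section
                   num_kv_heads: int = 8,       # ASSUMPTION: placeholder, see config.json
                   head_dim: int = 128,         # ASSUMPTION: placeholder, see config.json
                   bytes_per_value: int = 2) -> int:  # 2 bytes for a bf16/fp16 cache
    """Bytes needed to cache keys and values for `tokens` tokens in total."""
    per_token = 2 * attention_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token


def gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# One sequence at the default 32K context vs. the full 256K context.
for context in (32_768, 262_144):
    print(f"{context:>7} tokens: {gib(kv_cache_bytes(context)):.2f} GiB per sequence")
```

Multiply the per-sequence figure by the number of concurrent sequences to get a first-order budget for a given `max-model-len`.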

	## Quantization details

	EntQuant ([ICML 2026 paper](https://icml.cc/virtual/2026/poster/66714) · [source](https://github.com/merantix-momentum/entquant)) is a weight-only, scheme-agnostic post-training quantization method. It optimizes one scale per output channel with L-BFGS to minimize

	```
	L = reconstruction_error(x, q(x)) + λ · L1(q(x))
	```

	The L1 term pushes the distribution of quantized weights toward low Shannon entropy. The weights stay in their target format (FP8 here) but become highly compressible: at λ ≈ 3.9, the FP8 codes entropy-code to an effective size of ~4 bits per parameter. Because the codes are still FP8, they keep far more representable values than fixed 4-bit integer quantization, which has only 16 levels.
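
To make the trade-off concrete, here is a toy sketch of the objective for a single output channel. It is not the EntQuant implementation: a uniform integer grid stands in for the FP8 `e4m3fn` grid, a grid search over candidate scales stands in for L-BFGS, the L1 term is taken over the integer codes (one reading of `L1(q(x))`), and all function names are made up for this example. It shows how a larger λ pushes the chosen scale toward codes with lower Shannon entropy.

```python
import math
import random
from collections import Counter


def quantize(weights, scale):
    """Round each weight to the nearest integer code (toy stand-in for FP8 rounding)."""
    return [round(w / scale) for w in weights]


def objective(weights, scale, lam):
    """Mean squared reconstruction error plus lam times the mean absolute code value."""
    codes = quantize(weights, scale)
    mse = sum((w - c * scale) ** 2 for w, c in zip(weights, codes)) / len(weights)
    l1 = sum(abs(c) for c in codes) / len(codes)
    return mse + lam * l1


def entropy_bits(codes):
    """Empirical Shannon entropy of the codes, in bits per code."""
    n = len(codes)
    return -sum(k / n * math.log2(k / n) for k in Counter(codes).values())


def best_scale(weights, lam, candidates):
    """Pick the candidate scale with the lowest objective (grid search, not L-BFGS)."""
    return min(candidates, key=lambda s: objective(weights, s, lam))


rng = random.Random(0)
channel = [rng.gauss(0.0, 1.0) for _ in range(4096)]   # synthetic weights for one channel
scales = [0.01 * 1.25 ** i for i in range(25)]          # geometric grid of candidate scales

for lam in (0.0, 1e-3, 1e-1):
    s = best_scale(channel, lam, scales)
    h = entropy_bits(quantize(channel, s))
    print(f"lambda={lam:g}: scale={s:.3f}, entropy={h:.2f} bits/code")
```

With λ = 0 the search simply picks the finest scale, because only reconstruction error counts; raising λ trades some reconstruction error for a more peaked, more compressible code distribution.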

	Scope of compression — following standard practice for hybrid Mamba/Transformer architectures, only the Transformer linear weights are quantized:

	| Component | Status |
	|---|---|
	| Attention projections (q/k/v/o_proj) | ✅ EntQuant FP8 → 4-bit-effective entropy-coded |
	| MoE expert projections (w1/w2 per expert) | ✅ EntQuant FP8 → 4-bit-effective entropy-coded |
	| Mamba-2 state-space layers (in/out projections, conv1d, A/B/C/D parameters) | ❌ Kept at base precision |
	| Token embedding table | ❌ Kept at base precision |
	| LM head | ❌ Kept at base precision |
	| LayerNorm / RMSNorm weights | ❌ Kept at base precision |

	This checkpoint specifically:

	| Property | Value |
	|---|---|
	| Base model | [`Soofi-Project/Soofi-S-Instruct-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview) (bf16) |
	| Storage format | `float-quantized` (compressed-tensors), per-channel FP8 (`e4m3fn`) codes + entropy-coded payload |
	| Quant method | `entquant_coding` (auto-discovered by vLLM via the plugin entry point) |
	| Effective bit-size (Transformer linear weights) | ~4 bits/parameter |
	| Resident model size on disk | ~18 GB |
	| Decode | nvCOMP ANS GPU decompressor, run on every forward pass, writing into a static scratch buffer that is reused across MoE layers |
	| Reference numerics | W8A16 (weight-only) by default; W8A8 with `ENTQUANT_LINEAR_COMPUTE=fp8` (on by default in this image) |

	Important: the `4bit` in the name refers to the effective compressed size (storage cost) of the quantized Transformer linear weights, not to 4-bit integer quantization in the conventional sense. The weights themselves are FP8 codes; entropy coding reduces their storage cost to ~4 bits each. At inference time the entropy-coded payload is decoded back to the original FP8 codes (the decoding step loses no information), and vLLM's fused W8A8 kernels consume them directly.
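
The difference between storage cost and numeric format can be illustrated with a toy round trip. This sketch uses `zlib` from the Python standard library as a stand-in for the nvCOMP ANS codec and random bytes with a peaked distribution as a stand-in for FP8 codes, so the number it prints says nothing about this checkpoint. The point is only that the decoded bytes are bit-identical to the input while the stored payload costs fewer than 8 bits per code.

```python
import random
import zlib

rng = random.Random(0)

# Synthetic 8-bit codes concentrated around one value, mimicking a low-entropy
# code distribution (ASSUMPTION: illustrative data, not real model weights).
codes = bytes(
    min(255, max(0, round(rng.gauss(128, 3)))) for _ in range(200_000)
)

payload = zlib.compress(codes, level=9)   # stand-in for the ANS encoder
decoded = zlib.decompress(payload)        # stand-in for the GPU ANS decoder

bits_per_code = 8 * len(payload) / len(codes)
print(f"stored: {bits_per_code:.2f} bits per code (raw format: 8 bits)")
print("round trip is lossless:", decoded == codes)
```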

	### What's in this image

	| Layer | What |
	|---|---|
	| `vllm/vllm-openai:v0.21.0` | vLLM, CUDA 13, torch 2.11, OpenAI server, tokenizer libs |
	| `nvidia-nvcomp-cu12==5.2.0.13` | GPU ANS / zstd decompressor |
	| `entquant-coding` (bundled in this repo) | EntQuant codec + chunked container + decoder |
	| `entquant-vllm` (bundled in this repo) | vLLM plugin: registers `quant_method: entquant_coding`, FULL-graph capturable decode, fp8 W8A8 linear, selective MoE decode |

	### Performance

	Measured on B200 (NVIDIA Blackwell), single GPU, full CUDA-graph capture, vLLM 0.21.0:

	| Batch size | Tokens/s |
	|---|---|
	| B=1 | TODO: measure on the 4-bit checkpoint specifically |
	| B=16 | TODO |
	| B=64 | TODO |

	Reference numbers from the closely related 3-bit-effective checkpoint are in the project's THROUGHPUT_LOG. The table above will be filled in after end-to-end benchmarking of the 4-bit weights.

	## Release Date

	Hugging Face Hub — Preview at https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-4bit. TODO: final release date (MM/DD/YYYY).

	## Reference(s)

	- Project: https://soofi.info
	- Base model: [`Soofi-Project/Soofi-S-Instruct-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview)
	- EntQuant paper (ICML 2026): https://icml.cc/virtual/2026/poster/66714
	- EntQuant compressor source: https://github.com/merantix-momentum/entquant
	- Related models: see [Related models](#related-models).

	## Model Architecture

	Inherits the architecture of the base model unchanged.

	- Architecture Type: Transformer-based hybrid Mixture-of-Experts (MoE) with Mamba-2 state-space (SSM) layers and attention layers.
	- Network Architecture: Custom hybrid Mamba-2/MoE (Nemotron-style), designed from scratch: 23 Mamba-2/MoE layers + 6 attention layers; 128 routed experts + 1 shared expert per MoE layer; 6 experts activated per token.
	- Number of model parameters: 3.0×10^10 total (30B), with ~3.5B active parameters during inference.
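
The routing numbers above can be sketched in a few lines. This is a generic top-k router, not the model's actual gating code: normalizing with a softmax over the selected experts is an assumption, and the random logits stand in for a learned router's output. Only the counts (128 routed experts, 6 active per token, 1 shared expert that always runs) come from this card.

```python
import math
import random

NUM_ROUTED_EXPERTS = 128   # from this card
TOP_K = 6                  # experts activated per token, from this card


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)[:k]
    # ASSUMPTION: normalize with a softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]  # stand-in router output
selected = route(logits)

# The shared expert runs for every token in addition to the routed ones.
experts_run = len(selected) + 1
print(f"{experts_run} of {NUM_ROUTED_EXPERTS + 1} expert MLPs run for this token")
for index, weight in selected:
    print(f"  routed expert {index:3d}  weight {weight:.3f}")
```

Because only a small subset of experts runs per token, the active parameter count stays far below the total, which is the gap between the 30B total and ~3.5B active figures quoted above.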

	This model was developed from scratch (no parent model); the quantization is applied post-training to the bf16 base.

	## Computational Load

	See the base model card for training compute, energy and emissions. Inference on this quantized variant runs comfortably on a single B200 / H100 (and works on smaller GPUs with reduced KV cache).

	## Input

	- Input Type(s): Text
	- Input Format(s): String
	- Input Parameters: One-Dimensional (1D)
	- Other Properties Related to Input: Chat/ChatML-style messages via the embedded chat template. No system prompt is required (none is injected by default). Context length up to 262144 (256K), capped at 32768 by default in this image; raise `MAX_MODEL_LEN` in `.env` to go higher.

	## Output

	- Output Type(s): Text
	- Output Format(s): String
	- Output Parameters: One-Dimensional (1D)
	- Other Properties Related to Output: Non-thinking by default (no explicit reasoning trace). Supports the model's native tool-calling format (`<tool_call><function=...><parameter=...>...</parameter></function></tool_call>`) — vLLM's `qwen3_xml` tool-call parser is wired up in the image so OpenAI-style `tool_choice: "auto"` works out of the box.
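
A tool-calling request then looks like any other OpenAI-style chat request with a `tools` array. A minimal sketch using only the standard library; the `get_weather` tool and its schema are invented for this example, and nothing is sent unless you call `send` against a running server.

```python
import json
import urllib.request

MODEL = "Soofi-S-Instruct-Preview-EntQuant-4bit"

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_request(user_message: str) -> dict:
    """Chat-completions body that lets the model decide whether to call the tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def send(body: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the body to a running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_request("What is the weather in Munich right now?")
print(json.dumps(body, indent=2))
```

If the model decides to call the tool, the parsed call should appear under `choices[0].message.tool_calls` in the reply, following the OpenAI response format.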

	## Software Integration

	Runtime Engine(s):

	- vLLM 0.21.0 with the `entquant_coding` plugin (bundled in this repo; auto-discovered).
	- Other engines (HF `transformers`, `llama.cpp`/Ollama) do not load this checkpoint; the on-disk format is vLLM-specific. For HF `transformers` use the [bf16 base model](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview); for `llama.cpp` use the [GGUF variant](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-GGUF).

	Supported Hardware Microarchitecture Compatibility:

	- NVIDIA Hopper (H100, H200) or Blackwell (B200, RTX PRO 6000) — for the production W8A8 fp8 path.
	- Ampere and newer work via the bf16 linear fallback (set `ENTQUANT_LINEAR_COMPUTE=bf16`); some throughput is lost.
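
The choice between the two paths can be made from the GPU's compute capability. A small sketch, assuming the 9.0 threshold stated in this card's requirements; `pick_linear_compute` and `query_compute_caps` are illustrative helpers, not part of the plugin, and the `nvidia-smi` query needs a driver recent enough to support the `compute_cap` field.

```python
import subprocess


def pick_linear_compute(compute_cap: str) -> str:
    """Map a compute capability such as "9.0" to a value for ENTQUANT_LINEAR_COMPUTE."""
    major, minor = (int(part) for part in compute_cap.strip().split("."))
    return "fp8" if (major, minor) >= (9, 0) else "bf16"


def query_compute_caps() -> list[str]:
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in output.splitlines() if line.strip()]


# Example capabilities: Ampere A100, Hopper H100, Blackwell B200.
for cap in ("8.0", "9.0", "10.0"):
    print(cap, "->", pick_linear_compute(cap))
```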

	Preferred/Supported Operating System(s): Linux.

	The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment.

	## Model Version(s)

	- This repo: `Soofi-S-Instruct-Preview-EntQuant-4bit` — EntQuant, ~4 effective bits, vLLM-only.
	- Base: [`Soofi-S-Instruct-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview) (bf16, unquantized).
	- Other quantized derivatives: see [Related models](#related-models).

	## Installation & Usage

	### Docker / Compose (recommended)

	See [Quick start](#quick-start) above.

	### Direct vLLM (if you've already installed the plugin)

	```bash
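	# Option 1: from PyPI (once the packages are published there).
	# The quotes keep shells such as zsh from expanding the [gpu] extra.
	pip install "entquant-coding[gpu]" entquant-vllm
	# Option 2: from this repo
	pip install ./entquant-coding ./entquant-vllm --no-deps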

	ENTQUANT_ANS_GRAPH=1 ENTQUANT_LINEAR_COMPUTE=fp8 ENTQUANT_MOE_SELECTIVE=1 \
	  vllm serve ./ \
	    --trust-remote-code \
	    --max-model-len 32768 \
	    --served-model-name Soofi-S-Instruct-Preview-EntQuant-4bit \
	    --enable-auto-tool-choice --tool-call-parser qwen3_xml
	```

	### OpenAI client (Python)

	```python
	from openai import OpenAI
	client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

	resp = client.chat.completions.create(
	    model="Soofi-S-Instruct-Preview-EntQuant-4bit",
	    messages=[{"role": "user", "content": "Briefly explain the concept of AI sovereignty."}],
	)
	print(resp.choices[0].message.content)
	```

	## Training, Testing, and Evaluation Datasets

	See the base model card — quantization is post-training and does not change training data. The quantization process itself is data-free.

	### Dataset Overview (from the base model)

	- Total Size: ~2.5×10^13 tokens (25 trillion).
	- Languages: English, German (primary); French, Italian, Spanish (limited). English acts as the pivot language.
	- Knowledge Cutoff: End of 2025.
	- Training Start: April 2026.

	### Quantization data

	This is a data-free post-training quantization — no calibration set is used. EntQuant optimizes scales purely from the weights, with no forward pass through any data. No personal data is used in the quantization process.

	### Evaluation Dataset

	TODO: add accuracy delta vs. the bf16 base on standard benchmarks (MMLU, etc.) once measured. Expected: within EntQuant's published accuracy band at a 4-bit effective rate (a loss of at most a few percent on most benchmarks; see the EntQuant paper for the reference rates on Llama-2-7B and Llama-3-8B).

	## Inference

	- Acceleration Engine: vLLM 0.21.0 with the bundled `entquant_coding` plugin (FULL CUDA-graph capture; on-the-fly nvCOMP ANS decode; fp8 W8A8 cutlass matmul; selective expert decode at B=1).
	- Specific Test Hardware: Validated on NVIDIA B200 (DGX, 183 GiB HBM3e). Also tested on RTX PRO 6000 Blackwell (97 GiB).

	## Ethical Considerations

	The SOOFI consortium believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. Developers who download or use this model should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.

	For more detailed information, see the Model Card++ subcards below. Please report model quality, risk, security vulnerabilities, or concerns to [contact@soofi.info](mailto:contact@soofi.info).

	The quantization step (EntQuant) is a deterministic, weight-only transformation; it does not introduce new training data, new behaviors, or new alignment properties beyond what the base model already exhibits. Any biases, capability ceilings, and safety considerations from the base model apply unchanged here.

	### Bias Subcard

	| Field | Response |
	|---|---|
	| Participation considerations from adversely impacted groups in model design and testing | Inherited from the base model; see its card. Quantization adds no training data, so it is not expected to introduce new bias. |
	| Measures taken to mitigate against unwanted bias | See the base model card. |
	| Bias Metric (if measured) | See the base model card. |

	### Explainability Subcard

	| Field | Response |
	|---|---|
	| Intended Task/Domain | General assistant, instruction following, AI-agent/tool use |
	| Model Type | Hybrid Mixture-of-Experts (MoE) autoregressive language model, EntQuant-quantized to ~4 effective bits |
	| Intended Users | Enterprise developers and researchers |
	| Output | Text (String) |
	| Describe how the model works | Generates text autoregressively; a router activates 6 of 128 experts per token across hybrid Mamba-2/MoE and attention layers; quantized weights are decoded on the fly via nvCOMP ANS into a fixed GPU scratch buffer and consumed by fused fp8 W8A8 matmuls |
	| Technical Limitations | Preview checkpoint; non-primary languages (FR/IT/ES) are limited; ~4-bit-effective quantization may introduce a small accuracy delta vs. the bf16 reference (TODO: quantify); requires vLLM with the bundled plugin (cannot be loaded by stock HF `transformers`) |
	| Verified to have met prescribed quality standards | TODO |
	| Performance Metrics | TODO: accuracy delta vs. bf16 base on standard benchmarks |
	| Potential Known Risks and Mitigation | May generate incorrect, biased, or unsafe content; apply use-case-specific testing and guardrails before deployment |
	| Terms of Use/Licensing | Other (see [License/Terms of Use](#licenseterms-of-use)) |

	### Privacy Subcard

	| Field | Response |
	|---|---|
	| Generatable or reverse engineerable personal data? | TODO: see base model card |
	| Personal data used to create this model? | Quantization is data-free; for training-data privacy see the base model card |
	| Was consent obtained for any personal data used? | See the base model card |
	| How often is dataset reviewed? | See the base model card |
	| Was data from user interactions with the AI model used to train the model? | No |
	| Is there provenance for all datasets used in training? | See the base model card |
	| Applicable Privacy Policy | TODO |

	### Safety & Security Subcard

	| Field | Response |
	|---|---|
	| Model Application Field(s) | Industrial use; customer service; general-purpose assistant and agent applications |
	| Describe the life critical impact (if present) | None intended. Not for use in life-critical or safety-critical decision-making without independent validation |
	| Use Case Restrictions | Abide by the applicable license agreement (see [License/Terms of Use](#licenseterms-of-use)) |
	| Model and dataset restrictions | TODO |

	## Related models

	### Base model

	- [`Soofi-Project/Soofi-S-Instruct-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview) — bf16 reference; use with HF `transformers`.

	### Variants of this checkpoint

	- [`Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-2bit`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-2bit) — ~2 effective bits (highest compression, largest accuracy delta from bf16).
	- [`Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-3bit`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-3bit) — ~3 effective bits (mid compression, mid accuracy delta from bf16).
	- [`Soofi-Project/Soofi-S-Instruct-Preview-FP8`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-FP8) — uncompressed FP8 (no EntQuant).
	- [`Soofi-Project/Soofi-S-Instruct-Preview-GGUF`](https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-GGUF) — for llama.cpp / Ollama.

	### Reasoning variants of the base

	- [`Soofi-Project/Soofi-S-Isar-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Isar-Preview) and its EntQuant derivatives.
	- [`Soofi-Project/Soofi-S-Rhine-Preview`](https://huggingface.co/Soofi-Project/Soofi-S-Rhine-Preview) and its EntQuant derivatives.

	## Citation

	If you use this model, please cite both the base model and the EntQuant paper:

	```bibtex
	@misc{soofi_s_instruct_preview,
	title = {Soofi-S-Instruct-Preview},
	author = {SOOFI Consortium},
	year = {2026},
	url = {https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview}
	}

	@misc{soofi_s_instruct_preview_entquant_4bit,
	title = {Soofi-S-Instruct-Preview-EntQuant-4bit (EntQuant-quantized)},
	author = {SOOFI Consortium},
	year = {2026},
	url = {https://huggingface.co/Soofi-Project/Soofi-S-Instruct-Preview-EntQuant-4bit}
	}

	@inproceedings{entquant_icml2026,
	title = {EntQuant: Entropy-Optimized Post-Training Quantization},
	author = {Merantix Momentum},
	booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
	year = {2026},
	url = {https://icml.cc/virtual/2026/poster/66714}
	}
	```