---
license: mit
base_model:
- inclusionAI/Ling-mini-2.0
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- chatllm.cpp
- quantization
- int4
- int8
- cpu-inference
- ggmm
quantized_by: riverkan
language:
- en
- pt
- es
- fr
- zh
- ja
- de
- it
---
# Ling‑Mini‑2.0 — ChatLLM.cpp Quantizations (Q4_0 and Q8_0)
Author and distribution: [Riverkan](https://riverkan.com)
This repository provides CPU/GPU-friendly quantized builds of Ling‑Mini‑2.0 for [ChatLLM.cpp](https://github.com/foldl/chatllm.cpp). It is not a LLaMA model, is not affiliated with Meta, and does not use the LLaMA license. Files are distributed in ChatLLM.cpp’s GGMM-based format (.bin), ready for local inference.
- Available quantizations: Q4_0 (int4), Q8_0 (int8)
- Tested runtime: ChatLLM.cpp
- Target use: real-time chat/instruction-following on commodity hardware
Notes:
- The model is architecturally distinct from LLaMA-family models.
## ChatLLM.cpp Quantizations of Ling‑Mini‑2.0
Quantized with the ChatLLM.cpp toolchain for GGMM-format inference (.bin). These builds are intended for the ChatLLM.cpp runtime (CPU and optional GPU acceleration as provided by ChatLLM’s GGMM backends). Use ChatLLM.cpp’s convert and run flow described below.
Original (float) model: [inclusionAI/Ling-mini-2.0](https://huggingface.co/inclusionAI/Ling-mini-2.0).
Run them with ChatLLM.cpp or your preferred ChatLLM-based UI.
## Prompt format
Ling‑Mini‑2.0 does not require a special role-tag chat template. Plain prompts work well. If your tooling prefers an explicit chat structure, you can use this neutral format:
```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
{your question}
[Assistant]
```
Example:
```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
List three tips to speed up CPU inference.
[Assistant]
```
No special tokens are required by the model itself; most UIs can just send user text.
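Because the template is plain text, it is easy to assemble in a script. A minimal sketch (the `build_prompt` helper is illustrative, not part of any ChatLLM.cpp API):

```python
def build_prompt(
    user_message: str,
    system_message: str = "You are Ling-Mini-2.0, a helpful, concise assistant.",
) -> str:
    """Assemble the neutral [System]/[User]/[Assistant] template shown above."""
    return (
        f"[System]\n{system_message}\n"
        f"[User]\n{user_message}\n"
        f"[Assistant]\n"
    )

print(build_prompt("List three tips to speed up CPU inference."))
```

The string ends at `[Assistant]` so the model's completion naturally continues as the assistant's turn.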
## Download a file (not the whole branch) from below
| Filename | Quant type | File Size | Split | Description |
|---------------------------|------------|-----------|-------|------------------------------------------------------------------|
| Ling-Mini-2.0-Q8_0.bin | Q8_0 | 16 GB | false | Highest quality quant provided here; best for quality, moderate speed. |
| Ling-Mini-2.0-Q4_0.bin | Q4_0 | 8.52 GB | false | Great balance of speed and memory; recommended for CPU‑only setups. |
Notes:
- File sizes depend on the base model size; check the release or hosting page for exact sizes.
- These are GGMM (.bin) files for ChatLLM.cpp, not GGUF.
## How to use with ChatLLM.cpp
1) Clone and build ChatLLM.cpp (follow upstream docs for optional GPU backends):
```bash
git clone --recursive https://github.com/foldl/chatllm.cpp.git
cd chatllm.cpp
cmake -B build
cmake --build build -j --config Release
```
2) Place the quantized model file (e.g., Ling‑Mini‑2.0‑Q4_0.bin) somewhere accessible.
3) Run interactive chat:
```bash
# Linux / macOS
rlwrap ./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin -i
# Windows (PowerShell)
.\build\bin\Release\main.exe -m C:\path\to\Ling-Mini-2.0-Q4_0.bin -i
```
4) Single-shot example:
```bash
./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin --prompt "Explain memory-bound vs compute-bound."
```
Tip: Run `./build/bin/main -h` for all options (context size, threads, GPU offload where applicable, etc.).
## Example usage
Prompt:
```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.
[User]
Give me a 1‑paragraph summary of what quantization does for LLMs.
[Assistant]
```
Running:
```bash
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin -i --prompt "Give me a 1‑paragraph summary of what quantization does for LLMs."
```
In interactive mode (`-i`), simply paste your question and press Enter. The chat history is used as context for subsequent turns.
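The single-shot invocation above can also be scripted. A sketch that only builds the argument vector (the binary path and flags are the ones shown above; `chatllm_cmd` is an illustrative helper, not a ChatLLM.cpp API):

```python
import shlex

def chatllm_cmd(model_path: str, prompt: str,
                binary: str = "./build/bin/main") -> list[str]:
    """Argument vector for a single-shot ChatLLM.cpp run, using the flags shown above."""
    return [binary, "-m", model_path, "--prompt", prompt]

cmd = chatllm_cmd("/path/to/Ling-Mini-2.0-Q4_0.bin",
                  "Explain memory-bound vs compute-bound.")
print(shlex.join(cmd))  # copy-pasteable shell command
```

Pass the list to `subprocess.run(cmd, check=True)` to execute it once the binary is built.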
## Performance (CPU)
```bash
./build/bin/main -m Ling-Mini-2.0-Q4_0.bin --seed 1
```
- Q4_0 on AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~35 tokens/second (output), measured in a typical chat generation scenario.
```bash
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin --seed 1
```
- Q8_0 on AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~20 tokens/second (output), measured in a typical chat generation scenario.
Notes:
- Actual throughput varies with prompt length, context size, threads, OS, and build flags.
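The figures above are simple generated-tokens-over-wall-clock ratios. A sketch of how such a number is derived (the token count and timing below are illustrative, not measurements):

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Output throughput: generated tokens divided by wall-clock generation time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return generated_tokens / elapsed_seconds

# Illustrative numbers: 280 tokens generated in 8.0 s gives 35.0 tok/s,
# in line with the Q4_0 figure reported above.
print(f"{tokens_per_second(280, 8.0):.1f} tok/s")
```

Note this measures output (decode) throughput only; prompt-processing (prefill) speed is a separate number.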
## Which file should I choose?
- Want the fastest CPU experience and smallest memory footprint? Choose Q4_0.
- Want maximum response quality on CPU (or if you have headroom)? Choose Q8_0.
- If you’re offloading to GPU via ChatLLM.cpp backends, both will work; Q8_0 usually provides slightly better output fidelity at the cost of more memory.
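A rough rule of thumb for the memory trade-off: weight memory is roughly parameters times effective bits per weight. This sketch assumes a 16e9-parameter model and the GGML-style effective rates of ~4.5 bits (Q4_0) and ~8.5 bits (Q8_0), which include per-block scales; both the parameter count and the per-weight rates are illustrative assumptions, so treat the output as a ballpark, not the exact sizes in the table above:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight-only memory estimate in GB: parameters x bits per weight / 8.
    Ignores activations, KV cache, and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed 16e9 parameters; 4.5 / 8.5 effective bits per weight (assumption)
for name, bits in [("Q4_0", 4.5), ("Q8_0", 8.5)]:
    print(f"{name}: ~{weight_memory_gb(16e9, bits):.1f} GB")
```

Add headroom for the KV cache and OS overhead when deciding whether a file fits in RAM.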
## Downloading using huggingface-cli
If hosted on Hugging Face, you can fetch specific files with the CLI:
Install:
```bash
pip install -U "huggingface_hub[cli]"
```
Download a specific file:
```bash
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q4_0.bin" --local-dir ./
```
Or the Q8_0 build:
```bash
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q8_0.bin" --local-dir ./
```
Replace the model repo path with the actual hosting path if different.
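If you prefer a scripted download without the CLI, files on Hugging Face are served under a predictable `/resolve/<revision>/<filename>` URL scheme. A stdlib-only sketch that builds that URL (the repo path is the one assumed above; the `hf_file_url` helper is illustrative):

```python
from urllib.parse import quote

def hf_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL Hugging Face serves repo files from."""
    return f"https://huggingface.co/{repo_id}/resolve/{quote(revision)}/{quote(filename)}"

url = hf_file_url("RiverkanIT/Ling-mini-2.0-Quantized", "Ling-Mini-2.0-Q4_0.bin")
print(url)
```

Fetch the printed URL with any HTTP client, e.g. `curl -L -O "<url>"`.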
## Building your own quant (optional)
If you have the float/base weights and want to generate your own GGMM quantized file for ChatLLM.cpp:
1) Install Python deps for ChatLLM.cpp’s conversion pipeline:
```bash
pip install -r requirements.txt
```
2) Convert to Q8_0:
```bash
python convert.py -i /path/to/base/model -t q8_0 -o Ling-Mini-2.0-Q8_0.bin --name "Ling-Mini-2.0"
```
3) Convert to Q4_0:
```bash
python convert.py -i /path/to/base/model -t q4_0 -o Ling-Mini-2.0-Q4_0.bin --name "Ling-Mini-2.0"
```
Notes:
- ChatLLM.cpp uses GGMM-based .bin files (not GGUF).
- See ChatLLM.cpp docs for model-specific flags and supported architectures.
## Credits
- Model and quantized distributions by Riverkan
- Runtime and tooling: ChatLLM.cpp (thanks to the maintainers and the GGMM community)
- Thanks to the InclusionAI team for their foundational work and support!
- Everyone in the open-source LLM community who provided benchmarks, ideas, and tools
For issues, feature requests, or contributions, please open a discussion or pull request in this repo.