---
license: mit
base_model:
- inclusionAI/Ling-mini-2.0
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- chatllm.cpp
- quantization
- int4
- int8
- cpu-inference
- ggmm
quantized_by: riverkan
language:
- en
- pt
- es
- fr
- zh
- ja
- de
- it
---

# Ling‑Mini‑2.0 — ChatLLM.cpp Quantizations (Q4_0 and Q8_0)

Author and distribution: [Riverkan](https://riverkan.com)

This repository provides CPU/GPU-friendly quantized builds of Ling‑Mini‑2.0 for [ChatLLM.cpp](https://github.com/foldl/chatllm.cpp). It is not a LLaMA model, is not affiliated with Meta, and does not use the LLaMA license. Files are distributed in ChatLLM.cpp’s GGMM-based format (.bin), ready for local inference.

- Available quantizations: Q4_0 (int4), Q8_0 (int8)
- Tested runtime: ChatLLM.cpp
- Target use: real-time chat/instruction-following on commodity hardware

Notes:
- The model is architecturally distinct from LLaMA-family models.

## ChatLLM.cpp Quantizations of Ling‑Mini‑2.0

Quantized with the ChatLLM.cpp toolchain for GGMM-format inference (.bin). These builds are intended for the ChatLLM.cpp runtime (CPU and optional GPU acceleration as provided by ChatLLM’s GGMM backends). Use ChatLLM.cpp’s convert and run flow described below.

Original (float) model: [inclusionAI/Ling-mini-2.0](https://huggingface.co/inclusionAI/Ling-mini-2.0) (see `base_model` above).

Run them with ChatLLM.cpp or your preferred ChatLLM-based UI.

## Prompt format

Ling‑Mini‑2.0 does not require a special role-tag chat template. Plain prompts work well. If your tooling prefers an explicit chat structure, you can use this neutral format:

```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.

[User]
{your question}

[Assistant]
```

Example:
```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.

[User]
List three tips to speed up CPU inference.

[Assistant]
```

No special tokens are required by the model itself; most UIs can just send user text.
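
If you drive the binary from a script, the same neutral structure can be assembled in the shell and passed via `--prompt`. This is a minimal sketch; the model path is a placeholder and `--prompt` is the flag used in the examples further down.

```bash
# Build the neutral chat structure in a shell variable (no special tokens needed)
PROMPT=$'[System]\nYou are Ling-Mini-2.0, a helpful, concise assistant.\n\n[User]\nList three tips to speed up CPU inference.\n\n[Assistant]'

# Single-shot run; adjust the model path to wherever you placed the .bin file
./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin --prompt "$PROMPT"
```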

## Download a file (not the whole branch) from below

| Filename                  | Quant type | File Size | Split | Description                                                      |
|---------------------------|------------|-----------|-------|------------------------------------------------------------------|
| Ling-Mini-2.0-Q8_0.bin    | Q8_0       | 16 GB     | false | Highest quality quant provided here; best for quality, moderate speed. |
| Ling-Mini-2.0-Q4_0.bin    | Q4_0       | 8.52 GB   | false | Great balance of speed and memory; recommended for CPU-only setups.     |

Notes:
- The sizes listed above are approximate; check the release or hosting page for exact figures.
- These are GGMM (.bin) files for ChatLLM.cpp, not GGUF.
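
After downloading, a quick integrity check helps catch truncated transfers. This uses standard checksum tools only; compare the digest against the one published on the hosting page, if provided.

```bash
# Verify the download is complete and uncorrupted
ls -lh Ling-Mini-2.0-Q4_0.bin        # size should roughly match the table above
sha256sum Ling-Mini-2.0-Q4_0.bin     # Linux (use `shasum -a 256` on macOS)
```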

## How to use with ChatLLM.cpp

1) Clone and build ChatLLM.cpp (follow upstream docs for optional GPU backends):
```bash
git clone --recursive https://github.com/foldl/chatllm.cpp.git
cd chatllm.cpp
cmake -B build
cmake --build build -j --config Release
```

2) Place the quantized model file (e.g., Ling‑Mini‑2.0‑Q4_0.bin) somewhere accessible.

3) Run interactive chat:
```bash
# Linux / macOS
rlwrap ./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin -i

# Windows (PowerShell)
.\build\bin\Release\main.exe -m C:\path\to\Ling-Mini-2.0-Q4_0.bin -i
```

4) Single-shot example:
```bash
./build/bin/main -m /path/to/Ling-Mini-2.0-Q4_0.bin --prompt "Explain memory-bound vs compute-bound."
```

Tip: Run `./build/bin/main -h` for all options (context size, threads, GPU offload where applicable, etc.).
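
For repeated use, a small wrapper keeps the invocation in one place. This is just a convenience sketch using the same flags shown above; the script name and default model path are placeholders.

```bash
#!/usr/bin/env bash
# run-ling.sh — start an interactive chat with a given (or default) model file
MODEL="${1:-/path/to/Ling-Mini-2.0-Q4_0.bin}"
exec rlwrap ./build/bin/main -m "$MODEL" -i
```

Usage: `./run-ling.sh` or `./run-ling.sh /path/to/Ling-Mini-2.0-Q8_0.bin`.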

## Example usage

Prompt:
```
[System]
You are Ling‑Mini‑2.0, a helpful, concise assistant.

[User]
Give me a 1‑paragraph summary of what quantization does for LLMs.

[Assistant]
```

Running:
```bash
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin -i --prompt "Give me a 1-paragraph summary of what quantization does for LLMs."
```

In interactive mode (`-i`), simply paste your question and press Enter. The chat history is used as context for subsequent turns.

## Performance (CPU)

```bash
./build/bin/main -m Ling-Mini-2.0-Q4_0.bin --seed 1
```
- Q4_0 on AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~35 tokens/second (output), measured in a typical chat generation scenario.

```bash
./build/bin/main -m Ling-Mini-2.0-Q8_0.bin --seed 1
```
- Q8_0 on AMD Ryzen 5 5600G with Radeon Graphics (3.90 GHz): ~20 tokens/second (output), measured in a typical chat generation scenario.

Notes:
- Actual throughput varies with prompt length, context size, threads, OS, and build flags.
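
To reproduce a rough figure on your own machine, wall-clock timing of a single-shot run is a simple first check; the runtime's own reported statistics, if printed, are more precise. The flags and prompt below are the ones used elsewhere in this README.

```bash
# Rough throughput check: time a single-shot generation with a fixed seed
time ./build/bin/main -m Ling-Mini-2.0-Q4_0.bin --seed 1 \
  --prompt "Explain memory-bound vs compute-bound."
```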


## Which file should I choose?

- Want the fastest CPU experience and smallest memory footprint? Choose Q4_0.
- Want maximum response quality on CPU (or if you have headroom)? Choose Q8_0.
- If you’re offloading to GPU via ChatLLM.cpp backends, both will work; Q8_0 usually provides slightly better output fidelity at the cost of more memory.

## Downloading using huggingface-cli

If hosted on Hugging Face, you can fetch specific files with the CLI:

Install:
```bash
pip install -U "huggingface_hub[cli]"
```

Download a specific file:
```bash
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q4_0.bin" --local-dir ./
```

Or the Q8_0 build:
```bash
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized --include "Ling-Mini-2.0-Q8_0.bin" --local-dir ./
```

Replace the model repo path with the actual hosting path if different.
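
Putting the steps together, a download followed by an interactive session looks like this (repo path and flags as above; `./models` is just an example target directory):

```bash
# Fetch the Q4_0 build into ./models and start chatting
huggingface-cli download RiverkanIT/Ling-mini-2.0-Quantized \
  --include "Ling-Mini-2.0-Q4_0.bin" --local-dir ./models
rlwrap ./build/bin/main -m ./models/Ling-Mini-2.0-Q4_0.bin -i
```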

## Building your own quant (optional)

If you have the float/base weights and want to generate your own GGMM quantized file for ChatLLM.cpp:

1) Install Python deps for ChatLLM.cpp’s conversion pipeline:
```bash
pip install -r requirements.txt
```

2) Convert to Q8_0:
```bash
python convert.py -i /path/to/base/model -t q8_0 -o Ling-Mini-2.0-Q8_0.bin --name "Ling-Mini-2.0"
```

3) Convert to Q4_0:
```bash
python convert.py -i /path/to/base/model -t q4_0 -o Ling-Mini-2.0-Q4_0.bin --name "Ling-Mini-2.0"
```

Notes:
- ChatLLM.cpp uses GGMM-based .bin files (not GGUF).
- See ChatLLM.cpp docs for model-specific flags and supported architectures.
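
A quick sanity check after conversion is to load the new file and generate a short reply, using the same runtime flags as in the usage section above:

```bash
# Smoke-test the freshly converted file
./build/bin/main -m Ling-Mini-2.0-Q4_0.bin --prompt "Say hello in one sentence."
```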

## Credits

- Model and quantized distributions by Riverkan
- Runtime and tooling: ChatLLM.cpp (thanks to the maintainers and the GGMM community)
- Thanks to the InclusionAI team for their foundational work and support!
- Everyone in the open-source LLM community who provided benchmarks, ideas, and tools

For issues, feature requests, or contributions, please open a discussion or pull request in this repo.