File size: 6,900 Bytes
18e0633
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
<div align="center">

# ⚑ Selential Core

### MoLoRA Inference Engine β€” Runtime-Hot-Swappable LoRA Adapters for Qwen

[![Rust](https://img.shields.io/badge/Rust-1.75%2B-orange)](https://www.rust-lang.org)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

</div>

**Selential Core** is a **Rust-native inference engine** for the [Qwen3.5](https://github.com/QwenLM/Qwen) family of models. It implements **MoLoRA (Mixture of LoRA Experts)** β€” a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.

Instead of one model doing everything, Selential builds an **orchestra of specialists**: a generalist core + coding experts for structural code, flow/error handling, and system I/O.

---

## πŸš€ Quick Start

### Prerequisites

- **Rust** 1.75+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- **4GB+ RAM** (8GB+ recommended)
- **Optional:** NVIDIA GPU with CUDA 12+ for acceleration

### Setup

```bash
# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI

# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh

# 3. Build & run
cargo run --release -- interactive
```

For GPU acceleration:
```bash
./setup.sh --cuda
cargo run --release -- interactive
```

For the full 35B model (24GB+ VRAM):
```bash
./setup.sh --big
cargo run --release -- interactive
```

---

## 🎯 Features

| Feature | Description |
|---|---|
| **MoLoRA Orchestras** | Query automatically routes to the right expert combo |
| **Hot-Swap Adapters** | Switch between coding domains mid-conversation |
| **Hashtag Routing** | `#struct #match #io` β€” or just describe what you need |
| **KB Cache** | Semantic cache for repeated queries (instant response) |
| **Chat History** | Full multi-turn conversation with context |
| **Russian Support** | Detects Russian queries, translates internally, responds naturally |
| **KV-Cache Quantization** | Q4_0 KV-cache saves ~75% VRAM |

### Expert Orchestra Architecture

```
User Query
   β”‚
   β”œβ”€πŸŒ Generalist Core (#70)        ← always active
   β”‚     Syntax, logic, coherence
   β”‚
   β””β”€πŸŽ― Coding Specialists (by topic)
         β”‚
         β”œβ”€πŸ—οΈ  Structural β€” #164, #92
         β”‚     struct, impl, trait, generics
         β”‚
         β”œβ”€πŸ”€ Flow & Error β€” #116, #115
         β”‚     match, Result, Option, concurrency
         β”‚
         β””β”€πŸ“ System & IO β€” #172, #116
               File, HashMap, iterators
```

---

## πŸ’» Usage

### Interactive Mode

```bash
cargo run --release -- interactive
```

Type anything β€” the engine detects what you need and routes to the right expert:

```
> Implement a generic binary search tree in Rust

  🏷️  #algorithms #struct #trait #make

[πŸ—οΈ structural]
// Here's a generic BST implementation...
```

### Commands

| Command | Description |
|---|---|
| `/help` | Show all commands |
| `/orchestra` | Show current expert orchestra |
| `/tags` | List routing hashtags |
| `/hashtags <query>` | Preview hashtag routing |
| `/stats` | Session statistics |
| `/reset` | Clear conversation |
| `/exit` | Quit |

### Single Prompt Mode

```bash
cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural
```

---

## 🧠 How It Works

### Expert Extraction

1. **Probe phase:** Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
2. **Selection:** Pick the most specialized experts per sub-domain (probe β†’ cosine similarity)
3. **SVD Compression:** Compress each expert's weights (3Γ— matrices: gate, up, down) into rank-16 LoRA adapters
4. **GGUF conversion:** Merge selected experts into orchestrated GGUF files for llama.cpp

### Inference Pipeline

```
Query β†’ Hashtag Extraction β†’ Language Detection β†’ KB Cache Lookup
                                                     ↓ (miss)
                                              Router (keyword + hashtag) β†’ Select Expert
                                                                              ↓
                                          ChatML Prompt Builder β†’ llama.cpp (GGUF LoRA)
```

---

## πŸ—οΈ Project Structure

```
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.rs          # CLI entry point
β”‚   β”œβ”€β”€ engine.rs        # llama.cpp inference engine
β”‚   β”œβ”€β”€ inference.rs     # High-level inference pipeline
β”‚   β”œβ”€β”€ pipeline.rs      # Preprocess β†’ Route β†’ Generate flow
β”‚   β”œβ”€β”€ router.rs        # Keyword + hashtag-based routing
β”‚   β”œβ”€β”€ hashtags.rs      # Semantic hashtag extraction
β”‚   β”œβ”€β”€ config.rs        # Configuration + expert definitions
β”‚   β”œβ”€β”€ kb.rs            # Knowledge base semantic cache
β”‚   β”œβ”€β”€ translator.rs    # Russian β†’ English translation
β”œβ”€β”€ adapters/            # GGUF LoRA adapters (orchestra files)
β”œβ”€β”€ tokenizers/          # Qwen tokenizer
β”œβ”€β”€ training/            # Python scripts for expert extraction
β”œβ”€β”€ Cargo.toml
└── setup.sh             # Model download script
```

---

## πŸ”¬ Performance

| Configuration | VRAM | t/s (vs baseline) |
|---|---|---|
| **Baseline** (no LoRA) | 0.91 GB | 9.7 tok/s |
| **1 expert** | +28 MB | -13% |
| **2 experts** | +17 MB | -10% |
| **3 experts** | +22 MB | -10% |

LoRA experts add only **~17-28 MB VRAM** with **~10% speed impact** β€” negligible overhead for specialist capabilities.

---

## πŸ› οΈ Building from Source

### CPU-only (no CUDA)

```bash
# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release
```

### GPU (CUDA)

```bash
# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release
```

### Full 35B Model

```bash
./setup.sh --big
# Edit src/config.rs β†’ update base_model_path to the 35B GGUF
# Edit inference.rs β†’ set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive
```

---

## πŸ“Š Probe Results

From our full probe of all 256 MoE experts in Qwen3.5-35B:

| Category | Count | % |
|---|---|---|
| **Active experts** | 208 | 81.2% |
| **Coding specialists** | 70 | 27.3% |
| **Generalists** | 138 | 53.9% |
| **Low-activity** | 48 | 18.8% |

Qwen's MoE is **well-designed** β€” 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.

---

## πŸ”— Links

- [Qwen3.5 on HuggingFace](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-UD-GGUF)
- [llama.cpp](https://github.com/ggml-ai/llama.cpp)
- [llama-cpp-2 (Rust bindings)](https://crates.io/crates/llama-cpp-2)

---

<div align="center">

**Built with ❀️ using Rust + llama.cpp**

</div>