<div align="center">
# ⚡ Selential Core
### MoLoRA Inference Engine: Runtime-Hot-Swappable LoRA Adapters for Qwen
[Rust](https://www.rust-lang.org) | [License](LICENSE)
</div>
**Selential Core** is a **Rust-native inference engine** for the [Qwen3.5](https://github.com/QwenLM/Qwen) family of models. It implements **MoLoRA (Mixture of LoRA Experts)**, a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.
Instead of one model doing everything, Selential builds an **orchestra of specialists**: a generalist core + coding experts for structural code, flow/error handling, and system I/O.
---
## Quick Start
### Prerequisites
- **Rust** 1.75+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- **4GB+ RAM** (8GB+ recommended)
- **Optional:** NVIDIA GPU with CUDA 12+ for acceleration
### Setup
```bash
# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI
# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh
# 3. Build & run
cargo run --release -- interactive
```
For GPU acceleration:
```bash
./setup.sh --cuda
cargo run --release -- interactive
```
For the full 35B model (24GB+ VRAM):
```bash
./setup.sh --big
cargo run --release -- interactive
```
---
## Features
| Feature | Description |
|---|---|
| **MoLoRA Orchestras** | Query automatically routes to the right expert combo |
| **Hot-Swap Adapters** | Switch between coding domains mid-conversation |
| **Hashtag Routing** | `#struct #match #io`, or just describe what you need |
| **KB Cache** | Semantic cache for repeated queries (instant response) |
| **Chat History** | Full multi-turn conversation with context |
| **Russian Support** | Detects Russian queries, translates internally, responds naturally |
| **KV-Cache Quantization** | Q4_0 KV-cache cuts KV-cache VRAM by ~75% |
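
To put the ~75% KV-cache figure in context, here is a rough back-of-the-envelope sketch. It assumes the ggml `Q4_0` layout (blocks of 32 values stored as one 16-bit scale plus 32 four-bit quants, 18 bytes in total) and an F16 cache as the comparison point. The README does not state the baseline, so treat that choice, and the function name, as assumptions.

```rust
/// Values per Q4_0 block in ggml.
const Q4_0_BLOCK_VALUES: f64 = 32.0;
/// Bytes per Q4_0 block: 2-byte f16 scale + 16 bytes of packed 4-bit quants.
const Q4_0_BLOCK_BYTES: f64 = 18.0;
/// Bytes per value in an unquantized F16 cache (assumed baseline).
const F16_BYTES_PER_VALUE: f64 = 2.0;

/// Fraction of KV-cache memory saved by Q4_0 relative to F16.
fn q4_0_savings_vs_f16() -> f64 {
    let q4_bytes_per_value = Q4_0_BLOCK_BYTES / Q4_0_BLOCK_VALUES; // 0.5625
    1.0 - q4_bytes_per_value / F16_BYTES_PER_VALUE
}
```

Under these assumptions the saving comes out at about 72% of the KV cache, in the same range as the ~75% quoted in the table. It applies to the cache only, not to the model weights.
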
### Expert Orchestra Architecture
```
User Query
│
├── Generalist Core (#70): always active
│   Syntax, logic, coherence
│
└── Coding Specialists (by topic)
    │
    ├── Structural: #164, #92
    │   struct, impl, trait, generics
    │
    ├── Flow & Error: #116, #115
    │   match, Result, Option, concurrency
    │
    └── System & IO: #172, #116
        File, HashMap, iterators
```
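
The diagram can also be read as a small lookup table. The sketch below encodes it in Rust. The type and function names (`Domain`, `specialists`, `orchestra`) are illustrative and are not taken from `src/config.rs`; only the expert IDs come from the diagram. Note that expert #116 appears in both the Flow & Error and System & IO groups.

```rust
/// Coding sub-domains shown in the diagram (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Domain {
    Structural,
    FlowAndError,
    SystemIo,
}

/// Expert ID of the generalist core, which is always active.
const GENERALIST: u32 = 70;

/// Specialist expert IDs for a domain, as listed in the diagram.
fn specialists(domain: Domain) -> [u32; 2] {
    match domain {
        Domain::Structural => [164, 92],
        Domain::FlowAndError => [116, 115],
        Domain::SystemIo => [172, 116],
    }
}

/// Full orchestra for a domain: the generalist plus its two specialists.
fn orchestra(domain: Domain) -> Vec<u32> {
    let mut ids = vec![GENERALIST];
    ids.extend(specialists(domain));
    ids
}
```
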
---
## Usage
### Interactive Mode
```bash
cargo run --release -- interactive
```
Type anything. The engine detects what you need and routes to the right expert:
```
> Implement a generic binary search tree in Rust
🏷️ #algorithms #struct #trait #make
[structural]
// Here's a generic BST implementation...
```
### Commands
| Command | Description |
|---|---|
| `/help` | Show all commands |
| `/orchestra` | Show current expert orchestra |
| `/tags` | List routing hashtags |
| `/hashtags <query>` | Preview hashtag routing |
| `/stats` | Session statistics |
| `/reset` | Clear conversation |
| `/exit` | Quit |
### Single Prompt Mode
```bash
cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural
```
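
The second command passes explicit hashtags inside the prompt. Pulling such tags out of a query is straightforward. The following is a minimal sketch of that step only: `extract_hashtags` is a hypothetical helper, not the function in `src/hashtags.rs`, and it does not attempt the semantic tagging that the engine performs on untagged queries.

```rust
/// Collect explicit `#tag` tokens from a query, lowercased and without the `#`.
fn extract_hashtags(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .filter_map(|token| token.strip_prefix('#'))
        // Drop trailing punctuation such as "#io," or "#match."
        .map(|tag| tag.trim_end_matches(|c: char| !c.is_alphanumeric() && c != '_'))
        .filter(|tag| !tag.is_empty())
        .map(|tag| tag.to_lowercase())
        .collect()
}
```
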
---
## How It Works
### Expert Extraction
1. **Probe phase:** Analyze the 256 MoE experts of Qwen3.5-35B using activation patterns on coding, reasoning, and chat queries
2. **Selection:** Pick the most specialized experts per sub-domain (probe, then cosine similarity)
3. **SVD compression:** Compress each expert's weights (3 matrices: gate, up, down) into rank-16 LoRA adapters
4. **GGUF conversion:** Merge selected experts into orchestrated GGUF files for llama.cpp
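
Step 3 amounts to a truncated SVD: a weight matrix W is approximated by the product of two thin matrices, W ≈ B·A, whose inner dimension is the LoRA rank (16 in this project). The sketch below illustrates the idea on a tiny matrix using power iteration with deflation. It is not the project's extraction code, which lives in the Python scripts under `training/`. The function names, the fixed start vector, and the plain `Vec<Vec<f64>>` representation are assumptions chosen for readability; a real pipeline would call a linear-algebra library.

```rust
type Matrix = Vec<Vec<f64>>;

fn norm(v: &[f64]) -> f64 {
    v.iter().map(|x| x * x).sum::<f64>().sqrt()
}

/// Compute `w * v` for an m×n matrix and a length-n vector.
fn mat_vec(w: &Matrix, v: &[f64]) -> Vec<f64> {
    w.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}

/// Compute `wᵀ * u` for an m×n matrix and a length-m vector.
fn mat_t_vec(w: &Matrix, u: &[f64]) -> Vec<f64> {
    let mut out = vec![0.0; w[0].len()];
    for (row, &ui) in w.iter().zip(u) {
        for (o, &a) in out.iter_mut().zip(row) {
            *o += a * ui;
        }
    }
    out
}

/// Factor `w` (m×n) into `b` (m×rank) and `a` (rank×n) so that `b * a ≈ w`.
fn low_rank_factors(w: &Matrix, rank: usize, iters: usize) -> (Matrix, Matrix) {
    let (m, n) = (w.len(), w[0].len());
    let mut residual = w.clone();
    let mut b = vec![vec![0.0; rank]; m];
    let mut a = vec![vec![0.0; n]; rank];
    for k in 0..rank {
        // Deterministic start vector; a real implementation would randomize it.
        let mut v: Vec<f64> = (0..n).map(|j| 1.0 + 0.01 * j as f64).collect();
        let start_len = norm(&v);
        v.iter_mut().for_each(|x| *x /= start_len);
        // Power iteration on residualᵀ·residual finds the top right singular vector.
        for _ in 0..iters {
            let next = mat_t_vec(&residual, &mat_vec(&residual, &v));
            let len = norm(&next);
            if len < 1e-12 {
                break;
            }
            v = next.iter().map(|x| x / len).collect();
        }
        let u_raw = mat_vec(&residual, &v);
        let sigma = norm(&u_raw); // singular value estimate
        if sigma < 1e-12 {
            break; // residual is numerically zero; remaining factors stay zero
        }
        let scale = sigma.sqrt(); // split the singular value between B and A
        for i in 0..m {
            let u_i = u_raw[i] / sigma;
            b[i][k] = scale * u_i;
            for j in 0..n {
                residual[i][j] -= sigma * u_i * v[j]; // deflate the found component
            }
        }
        for j in 0..n {
            a[k][j] = scale * v[j];
        }
    }
    (b, a)
}

/// Multiply the factors back together to check the approximation.
fn mat_mul(b: &Matrix, a: &Matrix) -> Matrix {
    b.iter()
        .map(|row| {
            (0..a[0].len())
                .map(|j| row.iter().zip(a).map(|(x, a_row)| x * a_row[j]).sum::<f64>())
                .collect::<Vec<f64>>()
        })
        .collect()
}
```

Calling `low_rank_factors(&w, 16, iters)` on each of an expert's gate, up, and down matrices would yield one (B, A) pair per matrix. Storing B and A instead of W is what makes the adapter small: the cost is `rank * (m + n)` values rather than `m * n`.
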
### Inference Pipeline
```
Query → Hashtag Extraction → Language Detection → KB Cache Lookup
    ↓ (miss)
Router (keyword + hashtag) → Select Expert
    ↓
ChatML Prompt Builder → llama.cpp (GGUF LoRA)
```
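
The flow above can be sketched as a cache check followed by tag- and keyword-based routing. This is a simplified illustration under several assumptions: the cache is an exact-match `HashMap` rather than a semantic cache, the `generate` callback stands in for the ChatML prompt builder and the llama.cpp call, the keyword lists are borrowed from the orchestra diagram, and every name is hypothetical rather than taken from `src/pipeline.rs` or `src/router.rs`.

```rust
use std::collections::HashMap;

/// True if `text` contains any of the given substrings.
fn contains_any(text: &str, needles: &[&str]) -> bool {
    needles.iter().any(|needle| text.contains(*needle))
}

/// Pick an expert label from hashtags or keywords; fall back to the generalist.
fn route(query: &str) -> &'static str {
    let q = query.to_lowercase();
    if contains_any(&q, &["struct", "trait", "generic"]) {
        "structural"
    } else if contains_any(&q, &["match", "result", "option", "concurrency"]) {
        "flow_error"
    } else if contains_any(&q, &["#io", "file", "hashmap", "iterator"]) {
        "system_io"
    } else {
        "generalist"
    }
}

/// Answer a query: return the cached response on a hit, otherwise route and generate.
fn answer<F>(cache: &mut HashMap<String, String>, query: &str, generate: F) -> String
where
    F: Fn(&str, &str) -> String,
{
    let key = query.trim().to_lowercase();
    if let Some(hit) = cache.get(&key) {
        return hit.clone(); // cache hit: skip routing and generation
    }
    let expert = route(query);
    let response = generate(expert, query);
    cache.insert(key, response.clone());
    response
}
```
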
---
## Project Structure
```
├── src/
│   ├── main.rs           # CLI entry point
│   ├── engine.rs         # llama.cpp inference engine
│   ├── inference.rs      # High-level inference pipeline
│   ├── pipeline.rs       # Preprocess → Route → Generate flow
│   ├── router.rs         # Keyword + hashtag-based routing
│   ├── hashtags.rs       # Semantic hashtag extraction
│   ├── config.rs         # Configuration + expert definitions
│   ├── kb.rs             # Knowledge base semantic cache
│   └── translator.rs     # Russian/English translation
├── adapters/             # GGUF LoRA adapters (orchestra files)
├── tokenizers/           # Qwen tokenizer
├── training/             # Python scripts for expert extraction
├── Cargo.toml
└── setup.sh              # Model download script
```
---
## Performance
| Configuration | VRAM (total, or added vs. baseline) | Speed (tok/s, or change vs. baseline) |
|---|---|---|
| **Baseline** (no LoRA) | 0.91 GB | 9.7 tok/s |
| **1 expert** | +28 MB | -13% |
| **2 experts** | +17 MB | -10% |
| **3 experts** | +22 MB | -10% |
LoRA experts add only **~17-28 MB VRAM** with a **~10-13% slowdown**, a small overhead for specialist capabilities.
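
The relative figures convert to absolute throughput with simple arithmetic. Starting from the 9.7 tok/s baseline in the table, a 13% slowdown gives roughly 8.4 tok/s and a 10% slowdown roughly 8.7 tok/s. A minimal sketch of that calculation (the helper name is illustrative):

```rust
/// Apply a relative change in percent (e.g. -13.0) to a baseline throughput.
fn adjusted_tokens_per_sec(baseline: f64, change_percent: f64) -> f64 {
    baseline * (1.0 + change_percent / 100.0)
}
```
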
---
## Building from Source
### CPU-only (no CUDA)
```bash
# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release
```
### GPU (CUDA)
```bash
# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release
```
### Full 35B Model
```bash
./setup.sh --big
# Edit src/config.rs: update base_model_path to the 35B GGUF
# Edit src/inference.rs: set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive
```
---
## Probe Results
From our full probe of all 256 MoE experts in Qwen3.5-35B:
| Category | Count | % |
|---|---|---|
| **Active experts** | 208 | 81.2% |
| **Coding specialists** | 70 | 27.3% |
| **Generalists** | 138 | 53.9% |
| **Low-activity** | 48 | 18.8% |
Qwen's MoE is **well-designed**: 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.
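
The categories in the table nest: the 208 active experts are the 70 coding specialists plus the 138 generalists, and together with the 48 low-activity experts they account for all 256. The percentages follow from dividing each count by 256, as in this small sketch (the helper name is illustrative):

```rust
/// Share of the 256 probed experts, in percent.
fn percent_of_experts(count: u32) -> f64 {
    const TOTAL_EXPERTS: f64 = 256.0;
    100.0 * f64::from(count) / TOTAL_EXPERTS
}
```
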
---
## Links
- [Qwen3.5 on HuggingFace](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-UD-GGUF)
- [llama.cpp](https://github.com/ggml-org/llama.cpp)
- [llama-cpp-2 (Rust bindings)](https://crates.io/crates/llama-cpp-2)
---
<div align="center">
**Built with ❤️ using Rust + llama.cpp**
</div>