| <div align="center"> |
|
|
| # ⚡ Selential Core |
|
|
| ### MoLoRA Inference Engine: Runtime Hot-Swappable LoRA Adapters for Qwen |
|
|
| [Rust](https://www.rust-lang.org) |
| [License](LICENSE) |
|
|
| </div> |
|
|
| **Selential Core** is a **Rust-native inference engine** for the [Qwen3.5](https://github.com/QwenLM/Qwen) family of models. It implements **MoLoRA (Mixture of LoRA Experts)**, a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type. |
|
|
| Instead of one model doing everything, Selential builds an **orchestra of specialists**: a generalist core + coding experts for structural code, flow/error handling, and system I/O. |
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ### Prerequisites |
|
|
| - **Rust** 1.75+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`) |
| - **4GB+ RAM** (8GB+ recommended) |
| - **Optional:** NVIDIA GPU with CUDA 12+ for acceleration |
|
|
| ### Setup |
|
|
| ```bash |
| # 1. Clone |
| git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git |
| cd SelentialCore-New-level-optimization-AI |
| |
| # 2. Download the model + tokenizer |
| chmod +x setup.sh |
| ./setup.sh |
| |
| # 3. Build & run |
| cargo run --release -- interactive |
| ``` |
|
|
| For GPU acceleration: |
| ```bash |
| ./setup.sh --cuda |
| cargo run --release -- interactive |
| ``` |
|
|
| For the full 35B model (24GB+ VRAM): |
| ```bash |
| ./setup.sh --big |
| cargo run --release -- interactive |
| ``` |
|
|
| --- |
|
|
| ## 🎯 Features |
|
|
| | Feature | Description | |
| |---|---| |
| | **MoLoRA Orchestras** | Queries are routed automatically to the right expert combination | |
| | **Hot-Swap Adapters** | Switch between coding domains mid-conversation | |
| | **Hashtag Routing** | `#struct #match #io`, or just describe what you need | |
| | **KB Cache** | Semantic cache for repeated queries (instant response) | |
| | **Chat History** | Full multi-turn conversation with context | |
| | **Russian Support** | Detects Russian queries, translates internally, responds naturally | |
| | **KV-Cache Quantization** | Q4_0 KV-cache cuts KV-cache VRAM by ~75% | |
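
The KV-cache saving can be sanity-checked with a little arithmetic. The sketch below assumes the comparison is against an FP16 cache (the README does not say so) and uses the ggml Q4_0 block layout: 32 values stored in 18 bytes (16 bytes of 4-bit values plus one 2-byte scale). The function names are illustrative, not part of Selential Core.

```rust
/// Number of values stored in one ggml Q4_0 block.
const Q4_0_BLOCK_VALUES: usize = 32;
/// Bytes per Q4_0 block: 16 bytes of packed 4-bit values + 2-byte f16 scale.
const Q4_0_BLOCK_BYTES: usize = 18;
/// Bytes per value in an FP16 cache (assumed baseline, not stated in the README).
const F16_BYTES: usize = 2;

/// Size in bytes of an FP16 cache holding `n_values` elements.
fn f16_cache_bytes(n_values: usize) -> usize {
    n_values * F16_BYTES
}

/// Size in bytes of a Q4_0 cache; `n_values` is rounded up to whole blocks.
fn q4_0_cache_bytes(n_values: usize) -> usize {
    let blocks = (n_values + Q4_0_BLOCK_VALUES - 1) / Q4_0_BLOCK_VALUES;
    blocks * Q4_0_BLOCK_BYTES
}

/// Fraction of cache memory saved by Q4_0 relative to FP16.
fn q4_0_savings(n_values: usize) -> f64 {
    1.0 - q4_0_cache_bytes(n_values) as f64 / f16_cache_bytes(n_values) as f64
}
```

Under these assumptions the saving works out to about 72% for block-aligned sizes (18 bytes instead of 64 per 32 values), which is in the same range as the ~75% quoted in the table.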
| |
| ### Expert Orchestra Architecture |
| |
| ``` |
| User Query
| │
| ├── Generalist Core (#70): always active
| │     Syntax, logic, coherence
| │
| └── Coding Specialists (by topic)
|     │
|     ├── Structural → #164, #92
|     │     struct, impl, trait, generics
|     │
|     ├── Flow & Error → #116, #115
|     │     match, Result, Option, concurrency
|     │
|     └── System & IO → #172, #116
|           File, HashMap, iterators
| ``` |
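
As a rough illustration, the orchestra above can be expressed as a small lookup table. The expert numbers come from the diagram; the `Domain` enum and the function names are hypothetical and are not taken from `src/config.rs`.

```rust
/// Coding sub-domains from the diagram (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Domain {
    Structural,
    FlowError,
    SystemIo,
}

/// The generalist core expert, always active.
const GENERALIST: u16 = 70;

/// Specialist expert IDs for a domain, as listed in the diagram.
fn specialists(domain: Domain) -> [u16; 2] {
    match domain {
        Domain::Structural => [164, 92],
        Domain::FlowError => [116, 115],
        Domain::SystemIo => [172, 116],
    }
}

/// Full expert set for a query: the generalist plus any matched specialists.
fn orchestra(domain: Option<Domain>) -> Vec<u16> {
    let mut experts = vec![GENERALIST];
    if let Some(d) = domain {
        experts.extend_from_slice(&specialists(d));
    }
    experts
}
```

For example, `orchestra(Some(Domain::Structural))` yields experts 70, 164 and 92, while a query with no coding topic falls back to expert 70 alone.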
| |
| --- |
| |
| ## 💻 Usage
| |
| ### Interactive Mode |
| |
| ```bash |
| cargo run --release -- interactive |
| ``` |
| |
| Type anything: the engine detects what you need and routes to the right expert.
| |
| ```
| > Implement a generic binary search tree in Rust
| 
| 🏷️ #algorithms #struct #trait #make
| 
| [structural]
| // Here's a generic BST implementation...
| ```
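
A minimal sketch of how explicit hashtags might be separated from the rest of a prompt is shown below. It only handles tags the user types (such as `#struct`); the engine also infers tags from plain descriptions, which this sketch does not attempt. `extract_hashtags` is a hypothetical name, not the function in `src/hashtags.rs`.

```rust
/// Splits a query into explicit hashtags (lowercased, without '#')
/// and the remaining prompt text.
fn extract_hashtags(query: &str) -> (Vec<String>, String) {
    let mut tags = Vec::new();
    let mut rest = Vec::new();
    for word in query.split_whitespace() {
        match word.strip_prefix('#') {
            // Accept only non-empty tags made of letters, digits or '_'.
            Some(tag)
                if !tag.is_empty()
                    && tag.chars().all(|c| c.is_alphanumeric() || c == '_') =>
            {
                tags.push(tag.to_lowercase());
            }
            // Everything else stays part of the prompt.
            _ => rest.push(word),
        }
    }
    (tags, rest.join(" "))
}
```

Calling it on `"#struct #io Implement a BufReader line counter"` returns the tags `struct` and `io` plus the prompt text without them.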
| |
| ### Commands |
| |
| | Command | Description | |
| |---|---| |
| | `/help` | Show all commands | |
| | `/orchestra` | Show current expert orchestra | |
| | `/tags` | List routing hashtags | |
| | `/hashtags <query>` | Preview hashtag routing | |
| | `/stats` | Session statistics | |
| | `/reset` | Clear conversation | |
| | `/exit` | Quit | |
| |
| ### Single Prompt Mode |
| |
| ```bash |
| cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust" |
| cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural |
| ``` |
| |
| --- |
| |
| ## How It Works
| |
| ### Expert Extraction |
| |
| 1. **Probe phase:** Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
| 2. **Selection:** Pick the most specialized experts per sub-domain (probe → cosine similarity)
| 3. **SVD Compression:** Compress each expert's three weight matrices (gate, up, down) into rank-16 LoRA adapters
| 4. **GGUF conversion:** Merge selected experts into orchestrated GGUF files for llama.cpp
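
The selection step can be sketched as a cosine-similarity ranking. This assumes each expert is summarized by an activation profile vector and each sub-domain by a target vector. Both representations are assumptions made for illustration, and the actual scripts in `training/` are written in Python and may work differently.

```rust
/// Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the IDs of the `k` experts whose activation profiles are most
/// similar to the sub-domain `target` (hypothetical data layout).
fn top_k_experts(profiles: &[(u16, Vec<f32>)], target: &[f32], k: usize) -> Vec<u16> {
    let mut scored: Vec<(u16, f32)> = profiles
        .iter()
        .map(|(id, profile)| (*id, cosine_similarity(profile, target)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```

With `k = 2`, this picks the two experts whose profiles point most nearly in the same direction as the sub-domain target, which matches the two-specialists-per-domain layout of the orchestra.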
| |
| ### Inference Pipeline |
| |
| ``` |
| Query → Hashtag Extraction → Language Detection → KB Cache Lookup
|                                                       ↓ (miss)
|                        Router (keyword + hashtag) → Select Expert
|                                                       ↓
|                        ChatML Prompt Builder → llama.cpp (GGUF LoRA)
| ``` |
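
The control flow of this pipeline can be sketched in a few functions. Everything here is a simplified stand-in. The cache is an exact-match `HashMap` rather than a semantic cache, language detection only checks for Cyrillic characters, and the keyword lists (including `thread` for concurrency) are assumptions. None of the names come from `src/pipeline.rs` or `src/router.rs`.

```rust
use std::collections::HashMap;

/// True if the query contains any character from the Cyrillic block.
fn looks_russian(query: &str) -> bool {
    query.chars().any(|c| ('\u{0400}'..='\u{04FF}').contains(&c))
}

/// Normalizes a query into a cache key (lowercase, single spaces).
fn cache_key(query: &str) -> String {
    query
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// True if `text` contains any of `words` as a substring.
fn has_any(text: &str, words: &[&str]) -> bool {
    words.iter().any(|w| text.contains(*w))
}

/// Keyword router: maps a query to an expert label (assumed keyword lists).
fn route(query: &str) -> &'static str {
    let q = query.to_lowercase();
    if has_any(&q, &["struct", "impl", "trait", "generic"]) {
        "structural"
    } else if has_any(&q, &["match", "result", "option", "thread"]) {
        "flow_error"
    } else if has_any(&q, &["file", "hashmap", "iterator"]) {
        "system_io"
    } else {
        "generalist"
    }
}

/// What the pipeline decides to do with a query.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The answer was found in the cache.
    CacheHit(String),
    /// The query must be generated by the given expert.
    Generate { expert: &'static str, translate: bool },
}

/// Language detection, then cache lookup, then routing on a miss.
fn handle(query: &str, cache: &HashMap<String, String>) -> Outcome {
    let translate = looks_russian(query);
    if let Some(answer) = cache.get(&cache_key(query)) {
        return Outcome::CacheHit(answer.clone());
    }
    Outcome::Generate { expert: route(query), translate }
}
```

Substring matching keeps the sketch short but over-matches (for example, `match` also hits `matching`), so a fuller router would tokenize the query and combine keywords with the extracted hashtags.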
| |
| --- |
| |
| ## Project Structure
| |
| ```
| ├── src/
| │   ├── main.rs          # CLI entry point
| │   ├── engine.rs        # llama.cpp inference engine
| │   ├── inference.rs     # High-level inference pipeline
| │   ├── pipeline.rs      # Preprocess → Route → Generate flow
| │   ├── router.rs        # Keyword + hashtag-based routing
| │   ├── hashtags.rs      # Semantic hashtag extraction
| │   ├── config.rs        # Configuration + expert definitions
| │   ├── kb.rs            # Knowledge base semantic cache
| │   └── translator.rs    # Russian/English translation
| ├── adapters/            # GGUF LoRA adapters (orchestra files)
| ├── tokenizers/          # Qwen tokenizer
| ├── training/            # Python scripts for expert extraction
| ├── Cargo.toml
| └── setup.sh             # Model download script
| ```
| |
| --- |
| |
| ## 🔬 Performance
| |
| | Configuration | VRAM (experts: added vs. baseline) | Throughput (experts: change vs. baseline) |
| |---|---|---| |
| | **Baseline** (no LoRA) | 0.91 GB | 9.7 tok/s | |
| | **1 expert** | +28 MB | -13% | |
| | **2 experts** | +17 MB | -10% | |
| | **3 experts** | +22 MB | -10% | |
| |
| LoRA experts add only **~17-28 MB VRAM** at a **~10-13% throughput cost**, a small overhead for specialist capabilities.
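
To read the relative figures as absolute throughput, multiply the baseline by the percentage change. The small helper below does this with the 9.7 tok/s baseline from the table; the function name is illustrative.

```rust
/// Baseline throughput from the table above, in tokens per second.
const BASELINE_TOK_PER_S: f64 = 9.7;

/// Converts a relative change (e.g. -13.0 for -13%) into absolute tokens/s.
fn absolute_throughput(delta_percent: f64) -> f64 {
    BASELINE_TOK_PER_S * (1.0 + delta_percent / 100.0)
}
```

This gives roughly 8.4 tok/s with one expert (-13%) and roughly 8.7 tok/s with two or three experts (-10%).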
| |
| --- |
| |
| ## 🛠️ Building from Source
| |
| ### CPU-only (no CUDA) |
| |
| ```bash |
| # Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps |
| # Then build: |
| cargo build --release |
| ``` |
| |
| ### GPU (CUDA) |
| |
| ```bash |
| # Requirements: CUDA 12+, cuBLAS |
| ./setup.sh --cuda |
| cargo build --release |
| ``` |
| |
| ### Full 35B Model |
| |
| ```bash |
| ./setup.sh --big |
| # Edit src/config.rs: update base_model_path to the 35B GGUF
| # Edit src/inference.rs: set n_gpu_layers to 25+ (depends on your VRAM)
| cargo run --release -- interactive |
| ``` |
| |
| --- |
| |
| ## Probe Results
| |
| From our full probe of all 256 MoE experts in Qwen3.5-35B: |
| |
| | Category | Count | % | |
| |---|---|---| |
| | **Active experts** | 208 | 81.2% | |
| | **Coding specialists** | 70 | 27.3% | |
| | **Generalists** | 138 | 53.9% | |
| | **Low-activity** | 48 | 18.8% | |
| |
| Qwen's MoE is **well-designed**: 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.
| |
| --- |
| |
| ## Links
| |
| - [Qwen3.5 on HuggingFace](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-UD-GGUF) |
| - [llama.cpp](https://github.com/ggml-org/llama.cpp)
| - [llama-cpp-2 (Rust bindings)](https://crates.io/crates/llama-cpp-2) |
| |
| --- |
| |
| <div align="center"> |
| |
| **Built with ❤️ using Rust + llama.cpp**
| |
| </div> |
| |