# ⚡ Selential Core

### MoLoRA Inference Engine: Runtime Hot-Swappable LoRA Adapters for Qwen

[![Rust](https://img.shields.io/badge/Rust-1.75%2B-orange)](https://www.rust-lang.org) [![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

**Selential Core** is a **Rust-native inference engine** for the [Qwen3.5](https://github.com/QwenLM/Qwen) family of models. It implements **MoLoRA (Mixture of LoRA Experts)**, a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.

Instead of one model doing everything, Selential builds an **orchestra of specialists**: a generalist core plus coding experts for structural code, flow/error handling, and system I/O.

---

## 🚀 Quick Start

### Prerequisites

- **Rust** 1.75+ (`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`)
- **4GB+ RAM** (8GB+ recommended)
- **Optional:** NVIDIA GPU with CUDA 12+ for acceleration

### Setup

```bash
# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI

# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh

# 3. Build & run
cargo run --release -- interactive
```

For GPU acceleration:

```bash
./setup.sh --cuda
cargo run --release -- interactive
```

For the full 35B model (24GB+ VRAM):

```bash
./setup.sh --big
cargo run --release -- interactive
```

---

## 🎯 Features

| Feature | Description |
|---|---|
| **MoLoRA Orchestras** | Each query is routed automatically to the right expert combination |
| **Hot-Swap Adapters** | Switch between coding domains mid-conversation |
| **Hashtag Routing** | `#struct #match #io`, or just describe what you need |
| **KB Cache** | Semantic cache for repeated queries (instant response) |
| **Chat History** | Full multi-turn conversation with context |
| **Russian Support** | Detects Russian queries, translates internally, responds naturally |
| **KV-Cache Quantization** | Q4_0 KV-cache cuts KV-cache VRAM by ~75% |

### Expert Orchestra Architecture

```
User Query
│
├─🌐 Generalist Core (#70)  ← always active
│    Syntax, logic, coherence
│
└─🎯 Coding Specialists (by topic)
     │
     ├─🏗️ Structural: #164, #92
     │    struct, impl, trait, generics
     │
     ├─🔀 Flow & Error: #116, #115
     │    match, Result, Option, concurrency
     │
     └─📁 System & IO: #172, #116
          File, HashMap, iterators
```

---

## 💻 Usage

### Interactive Mode

```bash
cargo run --release -- interactive
```

Type anything; the engine detects what you need and routes to the right expert:

```
> Implement a generic binary search tree in Rust
🏷️ #algorithms #struct #trait #make
[🏗️ structural]
// Here's a generic BST implementation...
```

### Commands

| Command | Description |
|---|---|
| `/help` | Show all commands |
| `/orchestra` | Show current expert orchestra |
| `/tags` | List routing hashtags |
| `/hashtags ` | Preview hashtag routing |
| `/stats` | Session statistics |
| `/reset` | Clear conversation |
| `/exit` | Quit |

### Single Prompt Mode

```bash
cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural
```

---

## 🧠 How It Works

### Expert Extraction

1. **Probe phase:** Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries.
2. **Selection:** Pick the most specialized experts per sub-domain (probe → cosine similarity).
3. **SVD Compression:** Compress each expert's weights (three matrices: gate, up, down) into rank-16 LoRA adapters.
4. **GGUF conversion:** Merge selected experts into orchestrated GGUF files for llama.cpp.

### Inference Pipeline

```
Query → Hashtag Extraction → Language Detection → KB Cache Lookup
                                                        ↓ (miss)
                          Router (keyword + hashtag) → Select Expert
                                                        ↓
                          ChatML Prompt Builder → llama.cpp (GGUF LoRA)
```

---

## 🏗️ Project Structure

```
├── src/
│   ├── main.rs          # CLI entry point
│   ├── engine.rs        # llama.cpp inference engine
│   ├── inference.rs     # High-level inference pipeline
│   ├── pipeline.rs      # Preprocess → Route → Generate flow
│   ├── router.rs        # Keyword + hashtag-based routing
│   ├── hashtags.rs      # Semantic hashtag extraction
│   ├── config.rs        # Configuration + expert definitions
│   ├── kb.rs            # Knowledge base semantic cache
│   └── translator.rs    # Russian → English translation
├── adapters/            # GGUF LoRA adapters (orchestra files)
├── tokenizers/          # Qwen tokenizer
├── training/            # Python scripts for expert extraction
├── Cargo.toml
└── setup.sh             # Model download script
```

---

## 🔬 Performance

| Configuration | VRAM (vs. baseline) | Throughput (vs. baseline) |
|---|---|---|
| **Baseline** (no LoRA) | 0.91 GB | 9.7 tok/s |
| **1 expert** | +28 MB | -13% |
| **2 experts** | +17 MB | -10% |
| **3 experts** | +22 MB | -10% |

LoRA experts add only **~17-28 MB of VRAM** and reduce throughput by **~10-13%**, a small overhead for specialist capabilities.

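
To make the relative figures in the performance table concrete, the sketch below converts them into absolute numbers. It assumes the throughput percentages are relative to the 9.7 tok/s baseline (as the column header suggests) and that 0.91 GB means 910 MB (decimal units, an assumption). The function and constant names are illustrative and not part of the codebase.

```rust
/// Baseline figures taken from the performance table.
const BASELINE_TOK_PER_S: f64 = 9.7;
/// 0.91 GB, assuming 1 GB = 1000 MB (assumption, not stated in the table).
const BASELINE_VRAM_MB: f64 = 910.0;

/// Table rows: (label, extra VRAM in MB, throughput change in percent).
const ROWS: [(&str, f64, f64); 3] = [
    ("1 expert", 28.0, -13.0),
    ("2 experts", 17.0, -10.0),
    ("3 experts", 22.0, -10.0),
];

/// Absolute throughput implied by a relative change against the baseline.
fn absolute_tok_per_s(delta_percent: f64) -> f64 {
    BASELINE_TOK_PER_S * (1.0 + delta_percent / 100.0)
}

/// Extra VRAM expressed as a percentage of the baseline footprint.
fn vram_overhead_percent(extra_mb: f64) -> f64 {
    extra_mb / BASELINE_VRAM_MB * 100.0
}

fn main() {
    for (label, extra_mb, delta) in ROWS {
        println!(
            "{label}: ~{:.1} tok/s, +{:.1}% VRAM",
            absolute_tok_per_s(delta),
            vram_overhead_percent(extra_mb)
        );
    }
}
```

Under these assumptions, one expert works out to roughly 8.4 tok/s with about 3% extra VRAM, and two or three experts to roughly 8.7 tok/s with about 2% extra VRAM.
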

---

## 🛠️ Building from Source

### CPU-only (no CUDA)

```bash
# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release
```

### GPU (CUDA)

```bash
# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release
```

### Full 35B Model

```bash
./setup.sh --big
# Edit src/config.rs → update base_model_path to the 35B GGUF
# Edit src/inference.rs → set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive
```

---

## 📊 Probe Results

From our full probe of all 256 MoE experts in Qwen3.5-35B:

| Category | Count | % |
|---|---|---|
| **Active experts** | 208 | 81.2% |
| **Coding specialists** | 70 | 27.3% |
| **Generalists** | 138 | 53.9% |
| **Low-activity** | 48 | 18.8% |

Coding specialists and generalists together make up the 208 active experts. Qwen's MoE is **well-designed**: 81% of experts actively contribute. The 70 coding-specific experts were our focus for the orchestra architecture.

---

## 🔗 Links

- [Qwen3.5 on HuggingFace](https://huggingface.co/Qwen/Qwen3.5-35B-A3B-UD-GGUF)
- [llama.cpp](https://github.com/ggml-org/llama.cpp)
- [llama-cpp-2 (Rust bindings)](https://crates.io/crates/llama-cpp-2)

---

**Built with ❤️ using Rust + llama.cpp**