YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
Selential Core is a Rust-native inference engine for the Qwen3.5 family of models. It implements MoLoRA (Mixture of LoRA Experts) β a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.
Instead of one model doing everything, Selential builds an orchestra of specialists: a generalist core + coding experts for structural code, flow/error handling, and system I/O.
π Quick Start
Prerequisites
- Rust 1.75+ (
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh) - 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA GPU with CUDA 12+ for acceleration
Setup
# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI
# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh
# 3. Build & run
cargo run --release -- interactive
For GPU acceleration:
./setup.sh --cuda
cargo run --release -- interactive
For the full 35B model (24GB+ VRAM):
./setup.sh --big
cargo run --release -- interactive
π― Features
| Feature | Description |
|---|---|
| MoLoRA Orchestras | Query automatically routes to the right expert combo |
| Hot-Swap Adapters | Switch between coding domains mid-conversation |
| Hashtag Routing | #struct #match #io β or just describe what you need |
| KB Cache | Semantic cache for repeated queries (instant response) |
| Chat History | Full multi-turn conversation with context |
| Russian Support | Detects Russian queries, translates internally, responds naturally |
| KV-Cache Quantization | Q4_0 KV-cache saves ~75% VRAM |
Expert Orchestra Architecture
User Query
β
ββπ Generalist Core (#70) β always active
β Syntax, logic, coherence
β
ββπ― Coding Specialists (by topic)
β
ββποΈ Structural β #164, #92
β struct, impl, trait, generics
β
ββπ Flow & Error β #116, #115
β match, Result, Option, concurrency
β
ββπ System & IO β #172, #116
File, HashMap, iterators
π» Usage
Interactive Mode
cargo run --release -- interactive
Type anything β the engine detects what you need and routes to the right expert:
> Implement a generic binary search tree in Rust
π·οΈ #algorithms #struct #trait #make
[ποΈ structural]
// Here's a generic BST implementation...
Commands
| Command | Description |
|---|---|
/help |
Show all commands |
/orchestra |
Show current expert orchestra |
/tags |
List routing hashtags |
/hashtags <query> |
Preview hashtag routing |
/stats |
Session statistics |
/reset |
Clear conversation |
/exit |
Quit |
Single Prompt Mode
cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural
π§ How It Works
Expert Extraction
- Probe phase: Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
- Selection: Pick the most specialized experts per sub-domain (probe β cosine similarity)
- SVD Compression: Compress each expert's weights (3Γ matrices: gate, up, down) into rank-16 LoRA adapters
- GGUF conversion: Merge selected experts into orchestrated GGUF files for llama.cpp
Inference Pipeline
Query β Hashtag Extraction β Language Detection β KB Cache Lookup
β (miss)
Router (keyword + hashtag) β Select Expert
β
ChatML Prompt Builder β llama.cpp (GGUF LoRA)
ποΈ Project Structure
βββ src/
β βββ main.rs # CLI entry point
β βββ engine.rs # llama.cpp inference engine
β βββ inference.rs # High-level inference pipeline
β βββ pipeline.rs # Preprocess β Route β Generate flow
β βββ router.rs # Keyword + hashtag-based routing
β βββ hashtags.rs # Semantic hashtag extraction
β βββ config.rs # Configuration + expert definitions
β βββ kb.rs # Knowledge base semantic cache
β βββ translator.rs # Russian β English translation
βββ adapters/ # GGUF LoRA adapters (orchestra files)
βββ tokenizers/ # Qwen tokenizer
βββ training/ # Python scripts for expert extraction
βββ Cargo.toml
βββ setup.sh # Model download script
π¬ Performance
| Configuration | VRAM | t/s (vs baseline) |
|---|---|---|
| Baseline (no LoRA) | 0.91 GB | 9.7 tok/s |
| 1 expert | +28 MB | -13% |
| 2 experts | +17 MB | -10% |
| 3 experts | +22 MB | -10% |
LoRA experts add only ~17-28 MB VRAM with ~10% speed impact β negligible overhead for specialist capabilities.
π οΈ Building from Source
CPU-only (no CUDA)
# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release
GPU (CUDA)
# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release
Full 35B Model
./setup.sh --big
# Edit src/config.rs β update base_model_path to the 35B GGUF
# Edit inference.rs β set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive
π Probe Results
From our full probe of all 256 MoE experts in Qwen3.5-35B:
| Category | Count | % |
|---|---|---|
| Active experts | 208 | 81.2% |
| Coding specialists | 70 | 27.3% |
| Generalists | 138 | 53.9% |
| Low-activity | 48 | 18.8% |
Qwen's MoE is well-designed β 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.
π Links
Built with β€οΈ using Rust + llama.cpp
- Downloads last month
- 84
We're not able to determine the quantization variants.