YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

⚑ Selential Core

MoLoRA Inference Engine β€” Runtime-Hot-Swappable LoRA Adapters for Qwen

Rust License

Selential Core is a Rust-native inference engine for the Qwen3.5 family of models. It implements MoLoRA (Mixture of LoRA Experts) β€” a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.

Instead of one model doing everything, Selential builds an orchestra of specialists: a generalist core + coding experts for structural code, flow/error handling, and system I/O.


πŸš€ Quick Start

Prerequisites

  • Rust 1.75+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • 4GB+ RAM (8GB+ recommended)
  • Optional: NVIDIA GPU with CUDA 12+ for acceleration

Setup

# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI

# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh

# 3. Build & run
cargo run --release -- interactive

For GPU acceleration:

./setup.sh --cuda
cargo run --release -- interactive

For the full 35B model (24GB+ VRAM):

./setup.sh --big
cargo run --release -- interactive

🎯 Features

Feature Description
MoLoRA Orchestras Query automatically routes to the right expert combo
Hot-Swap Adapters Switch between coding domains mid-conversation
Hashtag Routing #struct #match #io β€” or just describe what you need
KB Cache Semantic cache for repeated queries (instant response)
Chat History Full multi-turn conversation with context
Russian Support Detects Russian queries, translates internally, responds naturally
KV-Cache Quantization Q4_0 KV-cache saves ~75% VRAM

Expert Orchestra Architecture

User Query
   β”‚
   β”œβ”€πŸŒ Generalist Core (#70)        ← always active
   β”‚     Syntax, logic, coherence
   β”‚
   β””β”€πŸŽ― Coding Specialists (by topic)
         β”‚
         β”œβ”€πŸ—οΈ  Structural β€” #164, #92
         β”‚     struct, impl, trait, generics
         β”‚
         β”œβ”€πŸ”€ Flow & Error β€” #116, #115
         β”‚     match, Result, Option, concurrency
         β”‚
         β””β”€πŸ“ System & IO β€” #172, #116
               File, HashMap, iterators

πŸ’» Usage

Interactive Mode

cargo run --release -- interactive

Type anything β€” the engine detects what you need and routes to the right expert:

> Implement a generic binary search tree in Rust

  🏷️  #algorithms #struct #trait #make

[πŸ—οΈ structural]
// Here's a generic BST implementation...

Commands

Command Description
/help Show all commands
/orchestra Show current expert orchestra
/tags List routing hashtags
/hashtags <query> Preview hashtag routing
/stats Session statistics
/reset Clear conversation
/exit Quit

Single Prompt Mode

cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural

🧠 How It Works

Expert Extraction

  1. Probe phase: Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
  2. Selection: Pick the most specialized experts per sub-domain (probe β†’ cosine similarity)
  3. SVD Compression: Compress each expert's weights (3Γ— matrices: gate, up, down) into rank-16 LoRA adapters
  4. GGUF conversion: Merge selected experts into orchestrated GGUF files for llama.cpp

Inference Pipeline

Query β†’ Hashtag Extraction β†’ Language Detection β†’ KB Cache Lookup
                                                     ↓ (miss)
                                              Router (keyword + hashtag) β†’ Select Expert
                                                                              ↓
                                          ChatML Prompt Builder β†’ llama.cpp (GGUF LoRA)

πŸ—οΈ Project Structure

β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.rs          # CLI entry point
β”‚   β”œβ”€β”€ engine.rs        # llama.cpp inference engine
β”‚   β”œβ”€β”€ inference.rs     # High-level inference pipeline
β”‚   β”œβ”€β”€ pipeline.rs      # Preprocess β†’ Route β†’ Generate flow
β”‚   β”œβ”€β”€ router.rs        # Keyword + hashtag-based routing
β”‚   β”œβ”€β”€ hashtags.rs      # Semantic hashtag extraction
β”‚   β”œβ”€β”€ config.rs        # Configuration + expert definitions
β”‚   β”œβ”€β”€ kb.rs            # Knowledge base semantic cache
β”‚   β”œβ”€β”€ translator.rs    # Russian β†’ English translation
β”œβ”€β”€ adapters/            # GGUF LoRA adapters (orchestra files)
β”œβ”€β”€ tokenizers/          # Qwen tokenizer
β”œβ”€β”€ training/            # Python scripts for expert extraction
β”œβ”€β”€ Cargo.toml
└── setup.sh             # Model download script

πŸ”¬ Performance

Configuration VRAM t/s (vs baseline)
Baseline (no LoRA) 0.91 GB 9.7 tok/s
1 expert +28 MB -13%
2 experts +17 MB -10%
3 experts +22 MB -10%

LoRA experts add only ~17-28 MB VRAM with ~10% speed impact β€” negligible overhead for specialist capabilities.


πŸ› οΈ Building from Source

CPU-only (no CUDA)

# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release

GPU (CUDA)

# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release

Full 35B Model

./setup.sh --big
# Edit src/config.rs β†’ update base_model_path to the 35B GGUF
# Edit inference.rs β†’ set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive

πŸ“Š Probe Results

From our full probe of all 256 MoE experts in Qwen3.5-35B:

Category Count %
Active experts 208 81.2%
Coding specialists 70 27.3%
Generalists 138 53.9%
Low-activity 48 18.8%

Qwen's MoE is well-designed β€” 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.


πŸ”— Links


Built with ❀️ using Rust + llama.cpp

Downloads last month
84
GGUF
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support