SelentialCore / README.md
S4ntyC1t's picture
Upload 23 files
18e0633 verified
|
Raw
History Blame Contribute Delete
6.9 kB

⚑ Selential Core

MoLoRA Inference Engine β€” Runtime-Hot-Swappable LoRA Adapters for Qwen

Rust License

Selential Core is a Rust-native inference engine for the Qwen3.5 family of models. It implements MoLoRA (Mixture of LoRA Experts) β€” a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.

Instead of one model doing everything, Selential builds an orchestra of specialists: a generalist core + coding experts for structural code, flow/error handling, and system I/O.


πŸš€ Quick Start

Prerequisites

  • Rust 1.75+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
  • 4GB+ RAM (8GB+ recommended)
  • Optional: NVIDIA GPU with CUDA 12+ for acceleration

Setup

# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI

# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh

# 3. Build & run
cargo run --release -- interactive

For GPU acceleration:

./setup.sh --cuda
cargo run --release -- interactive

For the full 35B model (24GB+ VRAM):

./setup.sh --big
cargo run --release -- interactive

🎯 Features

Feature Description
MoLoRA Orchestras Query automatically routes to the right expert combo
Hot-Swap Adapters Switch between coding domains mid-conversation
Hashtag Routing #struct #match #io β€” or just describe what you need
KB Cache Semantic cache for repeated queries (instant response)
Chat History Full multi-turn conversation with context
Russian Support Detects Russian queries, translates internally, responds naturally
KV-Cache Quantization Q4_0 KV-cache saves ~75% VRAM

Expert Orchestra Architecture

User Query
   β”‚
   β”œβ”€πŸŒ Generalist Core (#70)        ← always active
   β”‚     Syntax, logic, coherence
   β”‚
   β””β”€πŸŽ― Coding Specialists (by topic)
         β”‚
         β”œβ”€πŸ—οΈ  Structural β€” #164, #92
         β”‚     struct, impl, trait, generics
         β”‚
         β”œβ”€πŸ”€ Flow & Error β€” #116, #115
         β”‚     match, Result, Option, concurrency
         β”‚
         β””β”€πŸ“ System & IO β€” #172, #116
               File, HashMap, iterators

πŸ’» Usage

Interactive Mode

cargo run --release -- interactive

Type anything β€” the engine detects what you need and routes to the right expert:

> Implement a generic binary search tree in Rust

  🏷️  #algorithms #struct #trait #make

[πŸ—οΈ structural]
// Here's a generic BST implementation...

Commands

Command Description
/help Show all commands
/orchestra Show current expert orchestra
/tags List routing hashtags
/hashtags <query> Preview hashtag routing
/stats Session statistics
/reset Clear conversation
/exit Quit

Single Prompt Mode

cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural

🧠 How It Works

Expert Extraction

  1. Probe phase: Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
  2. Selection: Pick the most specialized experts per sub-domain (probe β†’ cosine similarity)
  3. SVD Compression: Compress each expert's weights (3Γ— matrices: gate, up, down) into rank-16 LoRA adapters
  4. GGUF conversion: Merge selected experts into orchestrated GGUF files for llama.cpp

Inference Pipeline

Query β†’ Hashtag Extraction β†’ Language Detection β†’ KB Cache Lookup
                                                     ↓ (miss)
                                              Router (keyword + hashtag) β†’ Select Expert
                                                                              ↓
                                          ChatML Prompt Builder β†’ llama.cpp (GGUF LoRA)

πŸ—οΈ Project Structure

β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.rs          # CLI entry point
β”‚   β”œβ”€β”€ engine.rs        # llama.cpp inference engine
β”‚   β”œβ”€β”€ inference.rs     # High-level inference pipeline
β”‚   β”œβ”€β”€ pipeline.rs      # Preprocess β†’ Route β†’ Generate flow
β”‚   β”œβ”€β”€ router.rs        # Keyword + hashtag-based routing
β”‚   β”œβ”€β”€ hashtags.rs      # Semantic hashtag extraction
β”‚   β”œβ”€β”€ config.rs        # Configuration + expert definitions
β”‚   β”œβ”€β”€ kb.rs            # Knowledge base semantic cache
β”‚   β”œβ”€β”€ translator.rs    # Russian β†’ English translation
β”œβ”€β”€ adapters/            # GGUF LoRA adapters (orchestra files)
β”œβ”€β”€ tokenizers/          # Qwen tokenizer
β”œβ”€β”€ training/            # Python scripts for expert extraction
β”œβ”€β”€ Cargo.toml
└── setup.sh             # Model download script

πŸ”¬ Performance

Configuration VRAM t/s (vs baseline)
Baseline (no LoRA) 0.91 GB 9.7 tok/s
1 expert +28 MB -13%
2 experts +17 MB -10%
3 experts +22 MB -10%

LoRA experts add only ~17-28 MB VRAM with ~10% speed impact β€” negligible overhead for specialist capabilities.


πŸ› οΈ Building from Source

CPU-only (no CUDA)

# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release

GPU (CUDA)

# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release

Full 35B Model

./setup.sh --big
# Edit src/config.rs β†’ update base_model_path to the 35B GGUF
# Edit inference.rs β†’ set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive

πŸ“Š Probe Results

From our full probe of all 256 MoE experts in Qwen3.5-35B:

Category Count %
Active experts 208 81.2%
Coding specialists 70 27.3%
Generalists 138 53.9%
Low-activity 48 18.8%

Qwen's MoE is well-designed β€” 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.


πŸ”— Links


Built with ❀️ using Rust + llama.cpp