SelentialCore / README.md

Upload 23 files

18e0633 verified about 1 month ago

preview code

Raw

History Blame Contribute Delete

6.9 kB

⚡ Selential Core

MoLoRA Inference Engine — Runtime-Hot-Swappable LoRA Adapters for Qwen

Selential Core is a Rust-native inference engine for the Qwen3.5 family of models. It implements MoLoRA (Mixture of LoRA Experts) — a technique that extracts individual MoE experts from Qwen's transformer layers, compresses them via SVD into LoRA adapters, and hot-swaps them at runtime based on the query type.

Instead of one model doing everything, Selential builds an orchestra of specialists: a generalist core + coding experts for structural code, flow/error handling, and system I/O.

🚀 Quick Start

Prerequisites

Rust 1.75+ (curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh)
4GB+ RAM (8GB+ recommended)
Optional: NVIDIA GPU with CUDA 12+ for acceleration

Setup

# 1. Clone
git clone https://github.com/S4ntyC1t/SelentialCore-New-level-optimization-AI.git
cd SelentialCore-New-level-optimization-AI

# 2. Download the model + tokenizer
chmod +x setup.sh
./setup.sh

# 3. Build & run
cargo run --release -- interactive

For GPU acceleration:

./setup.sh --cuda
cargo run --release -- interactive

For the full 35B model (24GB+ VRAM):

./setup.sh --big
cargo run --release -- interactive

🎯 Features

Feature	Description
MoLoRA Orchestras	Query automatically routes to the right expert combo
Hot-Swap Adapters	Switch between coding domains mid-conversation
Hashtag Routing	`#struct #match #io` — or just describe what you need
KB Cache	Semantic cache for repeated queries (instant response)
Chat History	Full multi-turn conversation with context
Russian Support	Detects Russian queries, translates internally, responds naturally
KV-Cache Quantization	Q4_0 KV-cache saves ~75% VRAM

Expert Orchestra Architecture

User Query
   │
   ├─🌐 Generalist Core (#70)        ← always active
   │     Syntax, logic, coherence
   │
   └─🎯 Coding Specialists (by topic)
         │
         ├─🏗️  Structural — #164, #92
         │     struct, impl, trait, generics
         │
         ├─🔀 Flow & Error — #116, #115
         │     match, Result, Option, concurrency
         │
         └─📁 System & IO — #172, #116
               File, HashMap, iterators

💻 Usage

Interactive Mode

cargo run --release -- interactive

Type anything — the engine detects what you need and routes to the right expert:

> Implement a generic binary search tree in Rust

  🏷️  #algorithms #struct #trait #make

[🏗️ structural]
// Here's a generic BST implementation...

Commands

Command	Description
`/help`	Show all commands
`/orchestra`	Show current expert orchestra
`/tags`	List routing hashtags
`/hashtags <query>`	Preview hashtag routing
`/stats`	Session statistics
`/reset`	Clear conversation
`/exit`	Quit

Single Prompt Mode

cargo run --release -- prompt "Write a thread-safe HashMap wrapper in Rust"
cargo run --release -- prompt "#struct #io Implement a BufReader line counter" -e structural

🧠 How It Works

Expert Extraction

Probe phase: Analyze Qwen3.5-35B's 256 MoE experts using activation patterns on coding, reasoning, and chat queries
Selection: Pick the most specialized experts per sub-domain (probe → cosine similarity)
SVD Compression: Compress each expert's weights (3× matrices: gate, up, down) into rank-16 LoRA adapters
GGUF conversion: Merge selected experts into orchestrated GGUF files for llama.cpp

Inference Pipeline

Query → Hashtag Extraction → Language Detection → KB Cache Lookup
                                                     ↓ (miss)
                                              Router (keyword + hashtag) → Select Expert
                                                                              ↓
                                          ChatML Prompt Builder → llama.cpp (GGUF LoRA)

🏗️ Project Structure

├── src/
│   ├── main.rs          # CLI entry point
│   ├── engine.rs        # llama.cpp inference engine
│   ├── inference.rs     # High-level inference pipeline
│   ├── pipeline.rs      # Preprocess → Route → Generate flow
│   ├── router.rs        # Keyword + hashtag-based routing
│   ├── hashtags.rs      # Semantic hashtag extraction
│   ├── config.rs        # Configuration + expert definitions
│   ├── kb.rs            # Knowledge base semantic cache
│   ├── translator.rs    # Russian → English translation
├── adapters/            # GGUF LoRA adapters (orchestra files)
├── tokenizers/          # Qwen tokenizer
├── training/            # Python scripts for expert extraction
├── Cargo.toml
└── setup.sh             # Model download script

🔬 Performance

Configuration	VRAM	t/s (vs baseline)
Baseline (no LoRA)	0.91 GB	9.7 tok/s
1 expert	+28 MB	-13%
2 experts	+17 MB	-10%
3 experts	+22 MB	-10%

LoRA experts add only ~17-28 MB VRAM with ~10% speed impact — negligible overhead for specialist capabilities.

🛠️ Building from Source

CPU-only (no CUDA)

# Edit Cargo.toml: remove "cuda" feature from llama-cpp-2 deps
# Then build:
cargo build --release

GPU (CUDA)

# Requirements: CUDA 12+, cuBLAS
./setup.sh --cuda
cargo build --release

Full 35B Model

./setup.sh --big
# Edit src/config.rs → update base_model_path to the 35B GGUF
# Edit inference.rs → set n_gpu_layers to 25+ (depends on your VRAM)
cargo run --release -- interactive

📊 Probe Results

From our full probe of all 256 MoE experts in Qwen3.5-35B:

Category	Count	%
Active experts	208	81.2%
Coding specialists	70	27.3%
Generalists	138	53.9%
Low-activity	48	18.8%

Qwen's MoE is well-designed — 81% of experts actively contribute. The coding-specific experts (70 total) were our focus for the orchestra architecture.

🔗 Links

Built with ❤️ using Rust + llama.cpp