---
language:
- en
license: apache-2.0
library_name: gguf
tags:
- ruvltra
- sona
- adaptive-learning
- gguf
- quantized
- edge-device
- embedded
- iot
- turboquant
- kv-cache-compression
- flash-attention
- speculative-decoding
- graph-rag
- hybrid-search
- vector-database
- ruvector
- diskann
- mamba-ssm
- colbert
pipeline_tag: text-generation
---
<div align="center">

# RuvLTRA Small

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) • [Model: Hugging Face](https://huggingface.co/ruv/ruvltra-small) • [Format: GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)

**📱 Compact Model Optimized for Edge Devices**

[Quick Start](#-quick-start) • [Use Cases](#-use-cases) • [Integration](#-integration)

</div>
---
## Overview
RuvLTRA Small is a compact 0.5B parameter model designed for edge deployment. Perfect for mobile apps, IoT devices, and resource-constrained environments.
## Model Card
| Property | Value |
|----------|-------|
| **Parameters** | 0.5 Billion |
| **Quantization** | Q4_K_M |
| **Context** | 4,096 tokens |
| **Size** | ~398 MB |
| **Min RAM** | 1 GB |
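The size and parameter count above can be sanity-checked against each other. A quick back-of-the-envelope calculation (using only the numbers from the table; the ~6.7 bits/weight result reflects Q4_K_M keeping embeddings and some tensors at higher precision):

```python
# Effective bits per weight implied by the quantized file size in the table.
params = 0.5e9               # 0.5 billion parameters (from the model card)
size_bytes = 398 * 1024**2   # ~398 MiB (from the model card)

bpw = size_bytes * 8 / params
print(f"effective bits/weight: {bpw:.1f}")  # ~6.7
```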
## 🚀 Quick Start
```bash
# Download
wget https://huggingface.co/ruv/ruvltra-small/resolve/main/ruvltra-0.5b-q4_k_m.gguf
# Run with llama.cpp
./llama-cli -m ruvltra-0.5b-q4_k_m.gguf -p "Hello, I am" -n 64
```
## 💡 Use Cases
- **Mobile Apps**: On-device AI assistant
- **IoT**: Smart home device intelligence
- **Edge Computing**: Local inference without cloud
- **Prototyping**: Quick model experimentation
## 🔧 Integration
### Rust (RuvLLM)
```rust
use ruvllm::hub::ModelDownloader;

// Inside an async fn that returns a Result (the download is awaited and may fail)
let path = ModelDownloader::new()
    .download("ruv/ruvltra-small", None)
    .await?;
```
### Python
```python
from huggingface_hub import hf_hub_download
model = hf_hub_download("ruv/ruvltra-small", "ruvltra-0.5b-q4_k_m.gguf")
```
## Hardware Support
- ✅ Apple Silicon (M1/M2/M3)
- ✅ NVIDIA CUDA
- ✅ CPU (x86/ARM)
- ✅ Raspberry Pi 4/5
---
**License**: Apache 2.0 | **GitHub**: [ruvnet/ruvector](https://github.com/ruvnet/ruvector)
---
## ⚡ TurboQuant KV-Cache Compression
RuvLTRA models are fully compatible with **TurboQuant**, a 2-4 bit KV-cache quantization scheme that shrinks KV-cache memory by 8-32x (depending on bit-width) with quality loss ranging from under 0.5% to roughly 2%.
| Quantization | Compression | Quality Loss | Best For |
|-------------|-------------|--------------|----------|
| 3-bit | 10.7x | <1% | **Recommended**: best balance |
| 4-bit | 8x | <0.5% | High quality, long context |
| 2-bit | 32x | ~2% | Edge devices, max savings |
### Usage with RuvLLM
```bash
cargo add ruvllm # Rust
npm install @ruvector/ruvllm # Node.js
```
```rust
use ruvllm::quantize::turbo_quant::{TurboQuantCompressor, TurboQuantConfig, TurboQuantBits};
let config = TurboQuantConfig {
bits: TurboQuantBits::Bit3_5, // 10.7x compression
use_qjl: true,
..Default::default()
};
let compressor = TurboQuantCompressor::new(config)?;
let compressed = compressor.compress_batch(&kv_vectors)?;
let scores = compressor.inner_product_batch_optimized(&query, &compressed)?;
```
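To see what the compression ratios in the table mean in practice, here is a rough KV-cache sizing sketch. The layer count and hidden size below are hypothetical placeholders (the card does not publish the model's internal shapes); only the context length and the compression ratios come from this card:

```python
# HYPOTHETICAL shapes for a ~0.5B model; only ctx and the ratios are from the card.
n_layers, d_model, ctx = 24, 896, 4096
fp16_bytes = 2

# K and V tensors per layer, for a full 4,096-token context
baseline = 2 * n_layers * ctx * d_model * fp16_bytes

ratios = {"4-bit": 8, "3-bit": 10.7, "2-bit": 32}  # from the TurboQuant table
for name, r in ratios.items():
    print(f"{name}: {baseline / r / 1024**2:.1f} MiB "
          f"(vs {baseline / 1024**2:.1f} MiB fp16)")
```

Even under these assumed shapes, the pattern is the point: the fp16 KV cache at full context is on the same order as the weights themselves, so a 8-32x reduction dominates the memory budget on edge devices.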
### v2.1.0 Ecosystem
- **Hybrid Search** – Sparse + dense vectors with RRF fusion (20-49% better retrieval)
- **Graph RAG** – Knowledge graph + community detection for multi-hop queries
- **DiskANN** – Billion-scale SSD-backed ANN with <10ms latency
- **FlashAttention-3** – IO-aware tiled attention, O(N) memory
- **MLA** – Multi-Head Latent Attention (~93% KV-cache compression)
- **Mamba SSM** – Linear-time selective state space models
- **Speculative Decoding** – 2-3x generation speedup
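The RRF fusion named in the Hybrid Search bullet is Reciprocal Rank Fusion: each document scores the sum of 1/(k + rank) over the sparse and dense rankings. A minimal sketch (the example rankings and k=60 are illustrative, not RuVector's actual defaults):

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d3", "d2"]   # e.g. BM25 order
dense  = ["d2", "d1", "d4"]   # e.g. vector-similarity order
print(rrf([sparse, dense]))   # -> ['d1', 'd2', 'd3', 'd4']
```

Because RRF only consumes ranks, it fuses sparse and dense retrievers without having to calibrate their incompatible raw scores.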
[RuVector GitHub](https://github.com/ruvnet/ruvector) | [ruvllm crate](https://crates.io/crates/ruvllm) | [@ruvector/ruvllm npm](https://www.npmjs.com/package/@ruvector/ruvllm)
---
## Benchmarks (L4 GPU, 24GB VRAM)
| Metric | Result |
|--------|--------|
| **Inference Speed** | 75.4 tok/s |
| **Model Load Time** | 1.44s |
| **Parameters** | 0.5B |
| **TurboQuant KV (3-bit)** | 10.7x compression, <1% PPL loss |
| **TurboQuant KV (4-bit)** | 8x compression, <0.5% PPL loss |
*Benchmarked on Google Cloud L4 GPU via `ruvltra-calibration` Cloud Run Job (2026-03-28)*
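Combining two rows of the table gives a rough end-to-end latency estimate for a cold start (load time plus generation time; this is simple arithmetic on the table's numbers, not an additional measurement):

```python
# Cold-start time for a 64-token completion, from the benchmark table above.
tok_s, load_s, n_tokens = 75.4, 1.44, 64

total = load_s + n_tokens / tok_s
print(f"~{total:.2f}s end-to-end")  # ~2.29s
```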