---
language:
- en
license: apache-2.0
library_name: gguf
tags:
- ruvltra
- sona
- adaptive-learning
- gguf
- quantized
- edge-device
- embedded
- iot
- turboquant
- kv-cache-compression
- flash-attention
- speculative-decoding
- graph-rag
- hybrid-search
- vector-database
- ruvector
- diskann
- mamba-ssm
- colbert
pipeline_tag: text-generation
---
<div align="center">

# RuvLTRA Small

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) • [Model: Hugging Face](https://huggingface.co/ruv/ruvltra-small) • [Format: GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)

**📱 Compact Model Optimized for Edge Devices**

[Quick Start](#-quick-start) • [Use Cases](#-use-cases) • [Integration](#-integration)

</div>
---
## Overview
RuvLTRA Small is a compact 0.5B parameter model designed for edge deployment. Perfect for mobile apps, IoT devices, and resource-constrained environments.
## Model Card
| Property | Value |
|----------|-------|
| **Parameters** | 0.5 Billion |
| **Quantization** | Q4_K_M |
| **Context** | 4,096 tokens |
| **Size** | ~398 MB |
| **Min RAM** | 1 GB |
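The size and parameter count above can be sanity-checked against each other. A quick back-of-the-envelope calculation (using only the numbers from the table; the ~6.7 bits/weight result reflects Q4_K_M keeping embeddings and some tensors at higher precision):

```python
# Effective bits per weight implied by the quantized file size in the table.
params = 0.5e9               # 0.5 billion parameters (from the model card)
size_bytes = 398 * 1024**2   # ~398 MiB (from the model card)

bpw = size_bytes * 8 / params
print(f"effective bits/weight: {bpw:.1f}")  # ~6.7
```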
## 🚀 Quick Start
```bash
# Download
wget https://huggingface.co/ruv/ruvltra-small/resolve/main/ruvltra-0.5b-q4_k_m.gguf
# Run with llama.cpp
./llama-cli -m ruvltra-0.5b-q4_k_m.gguf -p "Hello, I am" -n 64
```
## 💡 Use Cases
- **Mobile Apps**: On-device AI assistant
- **IoT**: Smart home device intelligence
- **Edge Computing**: Local inference without cloud
- **Prototyping**: Quick model experimentation
## 🔧 Integration
### Rust (RuvLLM)
```rust
use ruvllm::hub::ModelDownloader;

// Inside an async fn that returns a Result (the download is awaited and may fail)
let path = ModelDownloader::new()
    .download("ruv/ruvltra-small", None)
    .await?;
```
### Python
```python
from huggingface_hub import hf_hub_download
model = hf_hub_download("ruv/ruvltra-small", "ruvltra-0.5b-q4_k_m.gguf")
```
## Hardware Support
- ✅ Apple Silicon (M1/M2/M3)
- ✅ NVIDIA CUDA
- ✅ CPU (x86/ARM)
- ✅ Raspberry Pi 4/5
---
**License**: Apache 2.0 | **GitHub**: [ruvnet/ruvector](https://github.com/ruvnet/ruvector)
---
## ⚡ TurboQuant KV-Cache Compression
RuvLTRA models are fully compatible with **TurboQuant**, a 2-4 bit KV-cache quantization scheme that shrinks KV-cache memory by 8-32x (depending on bit-width) with quality loss ranging from under 0.5% to roughly 2%.
| Quantization | Compression | Quality Loss | Best For |
|-------------|-------------|--------------|----------|
| 3-bit | 10.7x | <1% | **Recommended**: best balance |
| 4-bit | 8x | <0.5% | High quality, long context |
| 2-bit | 32x | ~2% | Edge devices, max savings |
### Usage with RuvLLM
```bash
cargo add ruvllm # Rust
npm install @ruvector/ruvllm # Node.js
```
```rust
use ruvllm::quantize::turbo_quant::{TurboQuantCompressor, TurboQuantConfig, TurboQuantBits};
let config = TurboQuantConfig {
bits: TurboQuantBits::Bit3_5, // 10.7x compression
use_qjl: true,
..Default::default()
};
let compressor = TurboQuantCompressor::new(config)?;
let compressed = compressor.compress_batch(&kv_vectors)?;
let scores = compressor.inner_product_batch_optimized(&query, &compressed)?;
```
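To see what the compression ratios in the table mean in practice, here is a rough KV-cache sizing sketch. The layer count and hidden size below are hypothetical placeholders (the card does not publish the model's internal shapes); only the context length and the compression ratios come from this card:

```python
# HYPOTHETICAL shapes for a ~0.5B model; only ctx and the ratios are from the card.
n_layers, d_model, ctx = 24, 896, 4096
fp16_bytes = 2

# K and V tensors per layer, for a full 4,096-token context
baseline = 2 * n_layers * ctx * d_model * fp16_bytes

ratios = {"4-bit": 8, "3-bit": 10.7, "2-bit": 32}  # from the TurboQuant table
for name, r in ratios.items():
    print(f"{name}: {baseline / r / 1024**2:.1f} MiB "
          f"(vs {baseline / 1024**2:.1f} MiB fp16)")
```

Even under these assumed shapes, the pattern is the point: the fp16 KV cache at full context is on the same order as the weights themselves, so a 8-32x reduction dominates the memory budget on edge devices.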
### v2.1.0 Ecosystem
- **Hybrid Search** – Sparse + dense vectors with RRF fusion (20-49% better retrieval)
- **Graph RAG** – Knowledge graph + community detection for multi-hop queries
- **DiskANN** – Billion-scale SSD-backed ANN with <10ms latency
- **FlashAttention-3** – IO-aware tiled attention, O(N) memory
- **MLA** – Multi-Head Latent Attention (~93% KV-cache compression)
- **Mamba SSM** – Linear-time selective state space models
- **Speculative Decoding** – 2-3x generation speedup
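The RRF fusion named in the Hybrid Search bullet is Reciprocal Rank Fusion: each document scores the sum of 1/(k + rank) over the sparse and dense rankings. A minimal sketch (the example rankings and k=60 are illustrative, not RuVector's actual defaults):

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d3", "d2"]   # e.g. BM25 order
dense  = ["d2", "d1", "d4"]   # e.g. vector-similarity order
print(rrf([sparse, dense]))   # -> ['d1', 'd2', 'd3', 'd4']
```

Because RRF only consumes ranks, it fuses sparse and dense retrievers without having to calibrate their incompatible raw scores.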
[RuVector GitHub](https://github.com/ruvnet/ruvector) | [ruvllm crate](https://crates.io/crates/ruvllm) | [@ruvector/ruvllm npm](https://www.npmjs.com/package/@ruvector/ruvllm)
---
## Benchmarks (L4 GPU, 24GB VRAM)
| Metric | Result |
|--------|--------|
| **Inference Speed** | 75.4 tok/s |
| **Model Load Time** | 1.44s |
| **Parameters** | 0.5B |
| **TurboQuant KV (3-bit)** | 10.7x compression, <1% PPL loss |
| **TurboQuant KV (4-bit)** | 8x compression, <0.5% PPL loss |
*Benchmarked on Google Cloud L4 GPU via `ruvltra-calibration` Cloud Run Job (2026-03-28)*
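Combining two rows of the table gives a rough end-to-end latency estimate for a cold start (load time plus generation time; this is simple arithmetic on the table's numbers, not an additional measurement):

```python
# Cold-start time for a 64-token completion, from the benchmark table above.
tok_s, load_s, n_tokens = 75.4, 1.44, 64

total = load_s + n_tokens / tok_s
print(f"~{total:.2f}s end-to-end")  # ~2.29s
```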