---
language:
- en
- code
tags:
- python
- javascript
- go
- java
- php
- ruby
- c++
- embeddings
- code-search
- onnx
- albert
- matryoshka
- extreme-compression
license: mit
pipeline_tag: sentence-similarity
---

# ALRI: Ultra-Efficient Code Embeddings 🚀

ALRI (A Lightweight Retrieval Intelligence) is a family of next-generation embedding models designed for extreme efficiency and high-speed code retrieval. By combining modern architectural innovations with aggressive parameter optimization, ALRI achieves near-state-of-the-art retrieval performance at a fraction of the size of standard models like MiniLM.

## 🧬 Key Technologies

- **ALBERT-style Weight Sharing**: Recursive transformer blocks keep representations deep while drastically reducing the unique parameter count.
- **Extreme Hashed Embeddings**: Vocabulary compression that maps 151k virtual tokens onto 32k real vectors, eliminating redundancy and shrinking the memory footprint.
- **Funnel Attention**: Dynamic sequence pooling that accelerates inference by reducing token density in deeper layers.
- **Matryoshka Representation Learning (MRL)**: Flexible output dimensions (32, 64, 128, 384) let you trade accuracy for even greater speed (see the truncation sketch at the end of this card).
- **Distilled Intelligence**: Knowledge distilled from a 24M-parameter teacher into a sub-million-parameter "Nano" engine.

## 📊 Models

| Model | Parameters | Size (ONNX) | Acc@1 (Python) | Latency (CPU) |
|---|---|---|---|---|
| **ALRI-Tiny** | 24M | ~90 MB | **96.8%** | ~35 ms |
| **ALRI-Nano** | **0.93M** | **~6 MB** | **94.0%** | **~2 ms** |
| *MiniLM-L6* | *22M* | *~80 MB* | *92.0%* | *~40 ms* |

*Note: ALRI-Nano has ~24× fewer parameters than MiniLM-L6 while scoring higher on code retrieval tasks.*

## 🚀 Getting Started (ONNX)

The models are optimized for [ONNX Runtime](https://onnxruntime.ai/). You can run them on any CPU with minimal latency.

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load the Nano model and its tokenizer
session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

text = "how to read a json file in python"
inputs = tokenizer(text, return_tensors="np")

outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
})

embedding = outputs[0]  # (1, 128)
```

## 🎯 Use Cases

- **Real-time IDE Autocomplete**: Lightning-fast context retrieval.
- **Mobile & Edge Search**: High-quality search on low-power devices.
- **Massive Code Indexing**: Extremely low storage costs per embedding.

## 📜 License

MIT
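
## 🪆 Appendix: Matryoshka Truncation

MRL-trained embeddings can usually be shortened by keeping only their leading dimensions and re-normalizing. The sketch below shows that pattern applied to a 128-dim output like ALRI-Nano's; it assumes ALRI follows the standard MRL convention (prefix truncation plus L2 re-normalization), and the random vectors and the 64-dim target are purely illustrative stand-ins for real model outputs.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int = 64) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize the result.

    Standard MRL convention (assumed here for ALRI): the leading
    dimensions carry the most information, so a truncated prefix
    stays usable once it is re-normalized.
    """
    truncated = emb[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random demo vectors; in practice `query` would be `outputs[0]` from
# the quickstart above and `corpus` a stack of such embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(1, 128)).astype(np.float32)
corpus = rng.normal(size=(1000, 128)).astype(np.float32)

q64 = truncate_embedding(query, dim=64)   # (1, 64)
c64 = truncate_embedding(corpus, dim=64)  # (1000, 64)

# After L2 normalization, cosine similarity is a plain dot product.
scores = c64 @ q64.T                      # (1000, 1)
top5 = np.argsort(-scores[:, 0])[:5]
print(top5)
```

Halving the dimension to 64 halves index storage and roughly halves dot-product cost, at the accuracy trade-off the MRL bullet above describes.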