---
language:
- en
- code
tags:
- python
- javascript
- go
- java
- php
- ruby
- c++
- embeddings
- code-search
- onnx
- albert
- matryoshka
- extreme-compression
license: mit
pipeline_tag: sentence-similarity
---
# ALRI: Ultra-Efficient Code Embeddings
ALRI (A Lightweight Retrieval Intelligence) is a family of next-generation embedding models specifically designed for extreme efficiency and high-speed code retrieval.
By combining modern architectural innovations with aggressive parameter optimization, ALRI achieves near-state-of-the-art retrieval performance at a fraction of the size of standard models like MiniLM.
## Key Technologies
- **ALBERT-style Weight Sharing**: Utilizes recursive transformer blocks to maintain deep representations while drastically reducing the unique parameter count.
- **Extreme Hashed Embeddings**: Vocabulary compression that maps 151k virtual tokens onto 32k real vectors, eliminating redundancy and reducing the memory footprint (see the sketch after this list).
- **Funnel Attention**: Dynamic sequence pooling that accelerates inference by reducing token density in deeper layers.
- **Matryoshka Representation Learning (MRL)**: Flexible output dimensions (32, 64, 128, 384) allowing you to trade off accuracy for even greater speed.
- **Distilled Intelligence**: Knowledge distilled from a 24M parameter teacher into a sub-million parameter "Nano" engine.
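To make the hashed-embedding idea concrete, here is a minimal, illustrative sketch, not the model's actual implementation: virtual token ids beyond the real table size are folded into it with a hash, so a 151k-token vocabulary can share a 32k-row embedding matrix. Production systems typically combine several independent hash functions to soften collisions; a single modulo is shown for clarity.

```python
import numpy as np

VIRTUAL_VOCAB = 151_000   # token ids the tokenizer can emit
REAL_ROWS = 32_000        # rows actually stored in the embedding matrix
DIM = 128

# The only trainable table: 32k rows instead of 151k.
table = np.random.randn(REAL_ROWS, DIM).astype(np.float32)

def hashed_embedding(token_ids: np.ndarray) -> np.ndarray:
    # Fold the large virtual id space into the small real table.
    rows = token_ids % REAL_ROWS
    return table[rows]

vecs = hashed_embedding(np.array([7, 42_001, 150_999]))
print(vecs.shape)  # (3, 128)
```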
## Models
| Model | Parameters | Size (ONNX) | Acc@1 (Python) | CPU Latency |
|---|---|---|---|---|
| **ALRI-Tiny** | 24M | ~90 MB | **96.8%** | ~35 ms |
| **ALRI-Nano** | **0.93M** | **~6 MB** | **94.0%** | **~2 ms** |
| *MiniLM-L6* | *22M* | *~80 MB* | *92.0%* | *~40 ms* |
*Note: ALRI-Nano has roughly 24x fewer parameters than MiniLM-L6 (0.93M vs 22M) while scoring higher on code retrieval tasks.*
## Getting Started (ONNX)
The models are optimized for [ONNX Runtime](https://onnxruntime.ai/). You can run them on any CPU with minimal latency.
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the quantized Nano model and its tokenizer
session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

# Tokenize the query as NumPy arrays, which ONNX Runtime expects
text = "how to read a json file in python"
inputs = tokenizer(text, return_tensors="np")

# Run inference; the model takes int64 input ids and attention mask
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
})

embedding = outputs[0]  # shape (1, 128): one embedding per input text
```
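Because the embeddings are trained with Matryoshka Representation Learning, you can shrink them by truncating to a smaller prefix. A minimal sketch, assuming the leading dimensions carry the MRL-ordered information (as the supported sizes 32/64/128 suggest) and re-normalizing afterwards:

```python
# Matryoshka truncation: keep the first 64 dimensions and re-normalize
# so cosine similarity remains meaningful on the shorter vectors.
dim = 64
small = embedding[:, :dim]
small = small / np.linalg.norm(small, axis=1, keepdims=True)
```

Halving the dimension halves index storage and roughly halves similarity-search compute, at a modest accuracy cost.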
## Use Cases
- **Real-time IDE Autocomplete**: Lightning-fast context retrieval.
- **Mobile & Edge Search**: High-quality search on low-power devices.
- **Massive Code Indexing**: Extremely low storage costs per embedding.
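To make the indexing use case concrete, here is a toy end-to-end retrieval sketch reusing the Getting Started setup. It is illustrative only: the `embed` helper, the in-memory corpus, and the single-output assumption are ours, and a real deployment would batch requests and use a proper vector index.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

def embed(texts):
    # Embed a batch of strings and L2-normalize for cosine similarity.
    inputs = tokenizer(texts, return_tensors="np", padding=True)
    emb = session.run(None, {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
    })[0]
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Tiny in-memory "index" of code snippets
corpus = [
    "def read_json(path):\n    import json\n    return json.load(open(path))",
    "def fetch_url(url):\n    import requests\n    return requests.get(url).text",
]
index = embed(corpus)

# On normalized vectors, cosine similarity is just a dot product
query = embed(["how to read a json file in python"])
scores = query @ index.T
print(corpus[int(scores.argmax())])
```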
## License
MIT