---
language:
- en
- code
tags:
- python
- javascript
- go
- java
- php
- ruby
- c++
- embeddings
- code-search
- onnx
- albert
- matryoshka
- extreme-compression
license: mit
pipeline_tag: sentence-similarity
---

# ALRI: Ultra-Efficient Code Embeddings 🚀

ALRI (A Lightweight Retrieval Intelligence) is a family of next-generation embedding models designed for extreme efficiency and high-speed code retrieval. By combining modern architectural innovations with aggressive parameter optimization, ALRI achieves near-state-of-the-art retrieval performance at a fraction of the size of standard models like MiniLM.

## 🧬 Key Technologies

- **ALBERT-style Weight Sharing**: Recursive transformer blocks keep representations deep while drastically reducing the unique parameter count.
- **Extreme Hashed Embeddings**: Vocabulary compression that maps 151k virtual tokens onto 32k real vectors, eliminating redundancy and shrinking the memory footprint.
- **Funnel Attention**: Dynamic sequence pooling that accelerates inference by reducing token density in deeper layers.
- **Matryoshka Representation Learning (MRL)**: Flexible output dimensions (32, 64, 128, 384) let you trade accuracy for even greater speed (see the truncation sketch at the end of this card).
- **Distilled Intelligence**: Knowledge distilled from a 24M-parameter teacher into a sub-million-parameter "Nano" engine.

## 📊 Models

| Model | Parameters | Size (ONNX) | Acc@1 (Python) | Latency (CPU) |
|---|---|---|---|---|
| **ALRI-Tiny** | 24M | ~90 MB | **96.8%** | ~35 ms |
| **ALRI-Nano** | **0.93M** | **~6 MB** | **94.0%** | **~2 ms** |
| *MiniLM-L6* | *22M* | *~80 MB* | *92.0%* | *~40 ms* |

*Note: ALRI-Nano has ~24× fewer parameters than MiniLM-L6 while scoring higher on code retrieval tasks.*

## 🚀 Getting Started (ONNX)

The models are optimized for [ONNX Runtime](https://onnxruntime.ai/). You can run them on any CPU with minimal latency.

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load the Nano model and its tokenizer
session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

text = "how to read a json file in python"
inputs = tokenizer(text, return_tensors="np")

outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
})

embedding = outputs[0]  # (1, 128)
```

## 🎯 Use Cases

- **Real-time IDE Autocomplete**: Lightning-fast context retrieval.
- **Mobile & Edge Search**: High-quality search on low-power devices.
- **Massive Code Indexing**: Extremely low storage costs per embedding.

## 📜 License

MIT
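
## 🪆 Appendix: Matryoshka Truncation

MRL-trained embeddings can usually be shortened by keeping only their leading dimensions and re-normalizing. The sketch below shows that pattern applied to a 128-dim output like ALRI-Nano's; it assumes ALRI follows the standard MRL convention (prefix truncation plus L2 re-normalization), and the random vectors and the 64-dim target are purely illustrative stand-ins for real model outputs.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int = 64) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize the result.

    Standard MRL convention (assumed here for ALRI): the leading
    dimensions carry the most information, so a truncated prefix
    stays usable once it is re-normalized.
    """
    truncated = emb[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Random demo vectors; in practice `query` would be `outputs[0]` from
# the quickstart above and `corpus` a stack of such embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=(1, 128)).astype(np.float32)
corpus = rng.normal(size=(1000, 128)).astype(np.float32)

q64 = truncate_embedding(query, dim=64)   # (1, 64)
c64 = truncate_embedding(corpus, dim=64)  # (1000, 64)

# After L2 normalization, cosine similarity is a plain dot product.
scores = c64 @ q64.T                      # (1000, 1)
top5 = np.argsort(-scores[:, 0])[:5]
print(top5)
```

Halving the dimension to 64 halves index storage and roughly halves dot-product cost, at the accuracy trade-off the MRL bullet above describes.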