---
language:
- en
- code
tags:
- python
- javascript
- go
- java
- php
- ruby
- c++
- embeddings
- code-search
- onnx
- albert
- matryoshka
- extreme-compression
license: mit
pipeline_tag: sentence-similarity
---

# ALRI: Ultra-Efficient Code Embeddings 🚀

ALRI (A Lightweight Retrieval Intelligence) is a family of next-generation embedding models designed for extreme efficiency and high-speed code retrieval.

By combining modern architectural innovations with aggressive parameter optimization, ALRI achieves near state-of-the-art retrieval performance at a fraction of the size of standard models like MiniLM.

## 🧬 Key Technologies

- **ALBERT-style Weight Sharing**: Recursive transformer blocks keep representations deep while drastically reducing the unique parameter count.
- **Extreme Hashed Embeddings**: Vocabulary compression that maps 151k virtual tokens onto 32k real vectors, eliminating redundancy and shrinking the memory footprint (see the first sketch after this list).
- **Funnel Attention**: Dynamic sequence pooling that accelerates inference by reducing token density in deeper layers.
- **Matryoshka Representation Learning (MRL)**: Flexible output dimensions (32, 64, 128, 384) let you trade accuracy for even greater speed (see the second sketch after this list).
- **Distilled Intelligence**: Knowledge distilled from a 24M-parameter teacher into a sub-million-parameter "Nano" engine.

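
To make the hashed-embedding idea concrete, here is a minimal sketch. The modulo mapping and the `VIRTUAL_VOCAB`/`REAL_ROWS` names are illustrative assumptions; the card does not specify the actual hashing scheme or table width.

```python
import numpy as np

VIRTUAL_VOCAB = 151_000  # virtual token ids exposed to the tokenizer
REAL_ROWS = 32_000       # embedding rows actually stored in memory

def hashed_row(token_id: int) -> int:
    """Map a virtual token id onto a shared real row (modulo is one simple scheme)."""
    return token_id % REAL_ROWS

emb_table = np.random.randn(REAL_ROWS, 64).astype(np.float32)  # hypothetical 64-dim table
vector = emb_table[hashed_row(142_007)]  # many virtual ids share this row
```

MRL embeddings are typically consumed by truncating the full vector and re-normalizing. The sketch below follows that standard Matryoshka convention, continuing with the same `numpy` import; the random `full` vector is a stand-in for real model output:

```python
def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

full = np.random.randn(384).astype(np.float32)  # stand-in for a full 384-dim embedding
for dim in (32, 64, 128, 384):
    print(dim, truncate_embedding(full, dim).shape)  # each prefix is a usable embedding
```
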
## 📊 Models

| Model | Parameters | Size (ONNX) | Acc@1 (Python) | Latency (CPU) |
|---|---|---|---|---|
| **ALRI-Tiny** | 24M | ~90 MB | **96.8%** | ~35 ms |
| **ALRI-Nano** | **0.93M** | **~6 MB** | **94.0%** | **~2 ms** |
| *MiniLM-L6* | *22M* | *~80 MB* | *92.0%* | *~40 ms* |

*Note: ALRI-Nano has roughly 24x fewer parameters than MiniLM-L6 (0.93M vs. 22M) while scoring higher on code retrieval.*

## 📖 Getting Started (ONNX)

The models are optimized for [ONNX Runtime](https://onnxruntime.ai/) and run on any CPU with minimal latency.

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load the int8-quantized Nano model and its tokenizer
session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

text = "how to read a json file in python"
inputs = tokenizer(text, return_tensors="np")

# ONNX Runtime expects int64 input tensors
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64)
})
embedding = outputs[0]  # shape: (1, 128)
```
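
Since the pipeline tag is sentence-similarity, the natural next step is ranking candidates against a query. This sketch reuses `session` and `tokenizer` from the snippet above; because the card does not state whether outputs are already L2-normalized, it computes cosine similarity explicitly, and the candidate snippets are hypothetical:

```python
def embed(text: str) -> np.ndarray:
    """Embed one string with the ONNX session loaded above; returns a (128,) vector."""
    inputs = tokenizer(text, return_tensors="np")
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    })
    return outputs[0][0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("how to read a json file in python")
candidates = [
    "with open('data.json') as f: data = json.load(f)",
    "rows = list(csv.reader(open('data.csv')))",
]
# Rank candidates by similarity to the query (higher = closer)
for snippet in sorted(candidates, key=lambda s: cosine(query, embed(s)), reverse=True):
    print(f"{cosine(query, embed(snippet)):.3f}  {snippet}")
```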

## 🎯 Use Cases

- **Real-time IDE Autocomplete**: Lightning-fast context retrieval.
- **Mobile & Edge Search**: High-quality search on low-power devices.
- **Massive Code Indexing**: Extremely low storage cost per embedding (see the back-of-the-envelope estimate after this list).

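
As a rough illustration of the storage claim, here is a back-of-the-envelope estimate assuming float32 vectors and a hypothetical corpus size (int8 quantization would shrink these numbers by another 4x):

```python
corpus = 100_000_000  # hypothetical: 100M indexed code snippets
for dim in (384, 128, 64, 32):  # the MRL output sizes listed above
    gb = corpus * dim * 4 / 1e9  # float32 = 4 bytes per dimension
    print(f"{dim:>3} dims -> {gb:6.1f} GB")
# 384 -> 153.6 GB, 128 -> 51.2 GB, 64 -> 25.6 GB, 32 -> 12.8 GB
```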

## 📄 License

MIT