---
language:
- en
- code
tags:
- python
- javascript
- go
- java
- php
- ruby
- c++
- embeddings
- code-search
- onnx
- albert
- matryoshka
- extreme-compression
license: mit
pipeline_tag: sentence-similarity
---

# ALRI: Ultra-Efficient Code Embeddings 🚀

ALRI (A Lightweight Retrieval Intelligence) is a family of next-generation embedding models designed for extreme efficiency and high-speed code retrieval.

By combining modern architectural innovations with aggressive parameter optimization, ALRI achieves near state-of-the-art retrieval performance at a fraction of the size of standard models like MiniLM.

## 🧬 Key Technologies

- **ALBERT-style Weight Sharing**: Recursive transformer blocks keep representations deep while drastically reducing the unique parameter count.
- **Extreme Hashed Embeddings**: Vocabulary compression that maps 151k virtual tokens onto 32k real vectors, eliminating redundancy and shrinking the memory footprint (see the first sketch after this list).
- **Funnel Attention**: Dynamic sequence pooling that accelerates inference by reducing token density in deeper layers.
- **Matryoshka Representation Learning (MRL)**: Flexible output dimensions (32, 64, 128, 384) let you trade accuracy for even greater speed (see the second sketch after this list).
- **Distilled Intelligence**: Knowledge distilled from a 24M-parameter teacher into a sub-million-parameter "Nano" engine.

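
To make the hashed-embedding idea concrete, here is a minimal sketch. The modulo mapping and the `VIRTUAL_VOCAB`/`REAL_ROWS` names are illustrative assumptions; the card does not specify the actual hashing scheme or table width.

```python
import numpy as np

VIRTUAL_VOCAB = 151_000  # virtual token ids exposed to the tokenizer
REAL_ROWS = 32_000       # embedding rows actually stored in memory

def hashed_row(token_id: int) -> int:
    """Map a virtual token id onto a shared real row (modulo is one simple scheme)."""
    return token_id % REAL_ROWS

emb_table = np.random.randn(REAL_ROWS, 64).astype(np.float32)  # hypothetical 64-dim table
vector = emb_table[hashed_row(142_007)]  # many virtual ids share this row
```

MRL embeddings are typically consumed by truncating the full vector and re-normalizing. The sketch below follows that standard Matryoshka convention, continuing with the same `numpy` import; the random `full` vector is a stand-in for real model output:

```python
def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize to unit length."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

full = np.random.randn(384).astype(np.float32)  # stand-in for a full 384-dim embedding
for dim in (32, 64, 128, 384):
    print(dim, truncate_embedding(full, dim).shape)  # each prefix is a usable embedding
```
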
## 📊 Models

| Model | Parameters | Size (ONNX) | Acc@1 (Python) | Latency (CPU) |
|---|---|---|---|---|
| **ALRI-Tiny** | 24M | ~90 MB | **96.8%** | ~35 ms |
| **ALRI-Nano** | **0.93M** | **~6 MB** | **94.0%** | **~2 ms** |
| *MiniLM-L6* | *22M* | *~80 MB* | *92.0%* | *~40 ms* |

*Note: ALRI-Nano has roughly 24x fewer parameters than MiniLM-L6 (0.93M vs. 22M) while scoring higher on code retrieval.*

## 📖 Getting Started (ONNX)

The models are optimized for [ONNX Runtime](https://onnxruntime.ai/) and run on any CPU with minimal latency.

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load the int8-quantized Nano model and its tokenizer
session = ort.InferenceSession("alri-nano-onnx/model_int8.onnx")
tokenizer = AutoTokenizer.from_pretrained("alri-nano-onnx/tokenizer")

text = "how to read a json file in python"
inputs = tokenizer(text, return_tensors="np")

# ONNX Runtime expects int64 input tensors
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64)
})
embedding = outputs[0]  # shape: (1, 128)
```
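
Since the pipeline tag is sentence-similarity, the natural next step is ranking candidates against a query. This sketch reuses `session` and `tokenizer` from the snippet above; because the card does not state whether outputs are already L2-normalized, it computes cosine similarity explicitly, and the candidate snippets are hypothetical:

```python
def embed(text: str) -> np.ndarray:
    """Embed one string with the ONNX session loaded above; returns a (128,) vector."""
    inputs = tokenizer(text, return_tensors="np")
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64)
    })
    return outputs[0][0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("how to read a json file in python")
candidates = [
    "with open('data.json') as f: data = json.load(f)",
    "rows = list(csv.reader(open('data.csv')))",
]
# Rank candidates by similarity to the query (higher = closer)
for snippet in sorted(candidates, key=lambda s: cosine(query, embed(s)), reverse=True):
    print(f"{cosine(query, embed(snippet)):.3f}  {snippet}")
```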

## 🎯 Use Cases

- **Real-time IDE Autocomplete**: Lightning-fast context retrieval.
- **Mobile & Edge Search**: High-quality search on low-power devices.
- **Massive Code Indexing**: Extremely low storage cost per embedding (see the back-of-the-envelope estimate after this list).

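
As a rough illustration of the storage claim, here is a back-of-the-envelope estimate assuming float32 vectors and a hypothetical corpus size (int8 quantization would shrink these numbers by another 4x):

```python
corpus = 100_000_000  # hypothetical: 100M indexed code snippets
for dim in (384, 128, 64, 32):  # the MRL output sizes listed above
    gb = corpus * dim * 4 / 1e9  # float32 = 4 bytes per dimension
    print(f"{dim:>3} dims -> {gb:6.1f} GB")
# 384 -> 153.6 GB, 128 -> 51.2 GB, 64 -> 25.6 GB, 32 -> 12.8 GB
```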

## 📄 License

MIT