|
|
---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---
|
|
|
|
|
# MiniEmbed: Tiny, Powerful Embedding Models from Scratch |
|
|
|
|
|
**MiniEmbed** is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No Hugging Face Transformers, no pre-trained weights -- just pure PyTorch.
|
|
|
|
|
**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation) |
|
|
|
|
|
| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
```bash
pip install torch numpy scikit-learn huggingface_hub
```
|
|
|
|
|
```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load -- just like sentence-transformers!
model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual cosine similarity
# Since embeddings are L2-normalized, the dot product is the cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# 4. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 5. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```
|
|
|
|
|
## Also Available via GitHub |
|
|
|
|
|
```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```
|
|
|
|
|
## Capabilities |
|
|
|
|
|
- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance.
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms with messy titles.
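
Because the embeddings are unit-length, all of these tasks bottom out in dot products. A minimal NumPy sketch of search/re-ranking over normalized vectors (random vectors stand in for real `model.encode` output; the real API call is shown in Quick Start):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for model.encode(...) output: rows normalized to unit length
docs = rng.normal(size=(5, 256))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=256)
query /= np.linalg.norm(query)

# Cosine similarity of the query against all docs is a single matvec
scores = docs @ query

# Rank documents by score, descending; keep the top 2
top_k = np.argsort(scores)[::-1][:2]
```

Re-ranking is the same operation applied to a candidate list instead of the full corpus.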
|
|
|
|
|
## Architecture |
|
|
|
|
|
Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
  - Multi-Head Self-Attention (4 heads, d_k=64)
  - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)
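
The pooling head at the end of the stack is simple enough to sketch in a few lines of PyTorch. This is an illustrative reimplementation, not the exact code in `src/model.py`:

```python
import torch

def pool_and_normalize(hidden, attention_mask):
    """Mean-pool token states over non-padded positions, then L2-normalize.

    hidden:         (batch, seq_len, dim) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)               # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1e-9)          # avoid division by zero
    mean = summed / counts
    return torch.nn.functional.normalize(mean, p=2, dim=1)  # unit vectors

# Example: batch of 2 sequences; the second has 2 padded positions
hidden = torch.randn(2, 4, 256)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
emb = pool_and_normalize(hidden, mask)
print(emb.norm(dim=1))  # both norms are ~1.0
```

Masking before the mean matters: averaging over padded positions would let sequence length leak into the embedding.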
|
|
|
|
|
## Training |
|
|
|
|
|
Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |
|
|
|
|
|
**Training details:**

- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256
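
MNRL uses in-batch negatives: for each query, its paired document should out-score every other document in the batch, which reduces to a cross-entropy loss over a similarity matrix. A hedged sketch of the idea (the temperature `scale=20.0` is a common convention, not necessarily the value used here; the actual training code lives in the GitHub repo):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(query_emb, doc_emb, scale=20.0):
    """Multiple Negatives Ranking Loss.

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings, where
    doc_emb[i] is the positive for query_emb[i] and every other row
    in the batch serves as a negative.
    """
    # (batch, batch) cosine-similarity matrix, temperature-scaled
    scores = query_emb @ doc_emb.T * scale
    # row i's correct "class" is column i (its own positive)
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)

q = F.normalize(torch.randn(8, 256), dim=1)
d = F.normalize(torch.randn(8, 256), dim=1)
loss = mnrl_loss(q, d)
```

A side effect of this formulation is that larger batches give more (and harder) negatives per query for free, which is why the batch size of 256 matters.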
|
|
|
|
|
## Files |
|
|
|
|
|
```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (Safe & Fast)
|-- model.pt             # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128-token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)
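
The first limitation is easy to demonstrate: with a word-level vocabulary, any surface form absent from training (typos, inflections, rare words) collapses to a single [UNK] id and its meaning is lost. A toy sketch with a hypothetical five-word vocabulary, not the real 30K `tokenizer.json`:

```python
# Toy word-level tokenizer with a hypothetical vocabulary (not the real one)
vocab = {"[UNK]": 0, "machine": 1, "learning": 2, "is": 3, "great": 4}

def tokenize(text):
    # Lowercase, split on whitespace, fall back to [UNK] for unseen words
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

ids = tokenize("Machine learnin is great")  # note the typo "learnin"
# -> [1, 0, 3, 4]: the misspelling maps to [UNK]
```

A subword tokenizer (e.g. BPE) would instead decompose "learnin" into known pieces, which is why most production embedding models use one.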
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@software{Bhandari_MiniEmbed_2026,
  author  = {Bhandari, Suraj},
  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url     = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year    = {2026}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
MIT |
|
|
|