---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MS MARCO, WDC, ECInstruct) |

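The size and parameter figures are consistent with each other: ~10.8M float32 weights at 4 bytes each come out to roughly the listed checkpoint size. A quick sanity check:

```python
# Sanity check: ~10.8M float32 parameters at 4 bytes each
params = 10_800_000
print(f"{params * 4 / 1024**2:.1f} MB")  # ~41.2, matching the ~42 MB checkpoint
```
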
## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load model
model = EmbeddingInference.from_pretrained(model_dir)

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual cosine similarity
# Since embeddings are L2-normalized, the dot product is the cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# 4. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 5. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
for cluster_id, texts in result['texts_by_cluster'].items():
    print(f"Cluster {cluster_id + 1}: {texts}")
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

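For a larger corpus you can also skip `model.search` and score documents yourself. A minimal sketch, assuming `model.encode` returns an array of unit-norm rows (as the `np.dot` example above implies):

```python
import numpy as np

corpus = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
corpus_emb = np.asarray(model.encode(corpus))         # (3, 256), rows are unit-norm

query_emb = np.asarray(model.encode(["deep learning frameworks"]))[0]
scores = corpus_emb @ query_emb                       # cosine similarity per document
for i in np.argsort(-scores):                         # rank best-first
    print(f"[{scores[i]:.3f}] {corpus[i]}")
```

Encoding the corpus once and reusing `corpus_emb` for every query is the main practical win of a bi-encoder.
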
## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance.
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms despite messy titles (see the sketch below).

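As an illustration of product matching, here is a minimal sketch that pairs each title in one catalog with its nearest neighbor in another. The store catalogs are made-up examples; only the documented `model.encode` API is assumed:

```python
import numpy as np

store_a = ["Apple iPhone 15 Pro 256GB Black", "Samsung Galaxy S24 Ultra 512GB"]
store_b = ["iPhone 15 Pro (256 GB, black)", "Galaxy S24 Ultra 5G 512 GB Gray"]

emb_a = np.asarray(model.encode(store_a))
emb_b = np.asarray(model.encode(store_b))

sim = emb_a @ emb_b.T                        # pairwise cosine similarities
for i, j in enumerate(sim.argmax(axis=1)):   # best match per store_a title
    print(f"{store_a[i]!r} -> {store_b[j]!r} ({sim[i, j]:.3f})")
```
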
## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
  - Multi-Head Self-Attention (4 heads, d_k=64)
  - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)

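A minimal sketch of the final two steps (not the repository's exact code): mean pooling that ignores padding tokens, followed by L2 normalization so every embedding lands on the unit hypersphere.

```python
import torch
import torch.nn.functional as F

def pool_and_normalize(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool encoder outputs over non-padded tokens, then L2-normalize."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)           # per-example token counts
    return F.normalize(summed / counts, p=2, dim=1)    # project onto unit hypersphere

hidden = torch.randn(2, 128, 256)                      # (batch, max_seq_len, embed_dim)
mask = torch.ones(2, 128, dtype=torch.long)            # 1 = real token, 0 = padding
print(pool_and_normalize(hidden, mask).norm(dim=1))    # ~tensor([1., 1.])
```
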
## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**
- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256

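For reference, a minimal sketch of MNRL as it is commonly implemented (not necessarily the repository's exact code): each query's paired document is the positive, and every other document in the batch serves as an in-batch negative. The `scale` of 20 is a typical choice, not a confirmed training detail.

```python
import torch
import torch.nn.functional as F

def mnrl_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple Negatives Ranking Loss over a batch of (query, positive doc) pairs.

    Rows are assumed L2-normalized, so the matmul yields cosine similarities.
    The correct document for query i sits on the diagonal, hence label i.
    """
    scores = query_emb @ doc_emb.T * scale               # (batch, batch) logits
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

q = F.normalize(torch.randn(256, 256), dim=1)            # batch of 256 query embeddings
d = F.normalize(torch.randn(256, 256), dim=1)            # their positive documents
print(mnrl_loss(q, d))
```
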
## Files

```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (Safe & Fast)
|-- model.pt             # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```

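If you want the raw weights without the high-level API, the `.safetensors` file can be inspected directly. A minimal sketch using the `safetensors` package (an extra dependency, not listed in Quick Start; `model_dir` comes from `snapshot_download` as above):

```python
from safetensors.torch import load_file

state_dict = load_file(f"{model_dir}/model.safetensors")  # dict of name -> tensor
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```
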
## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128-token max sequence length (a chunking workaround is sketched below)
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)

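One hypothetical way to work around the 128-token limit for longer documents: split the text into word chunks, embed each chunk, then average and re-normalize. The `max_words` value of 90 is a rough margin for the word-level tokenizer, not a tested figure:

```python
import numpy as np

def embed_long(text: str, model, max_words: int = 90) -> np.ndarray:
    """Chunk a long text by words, embed each chunk, average, re-normalize."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    emb = np.asarray(model.encode(chunks)).mean(axis=0)
    return emb / np.linalg.norm(emb)
```
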
## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author = {Bhandari, Suraj},
  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year = {2026}
}
```

## License

MIT