---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)

| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer encoder |
| Pooling | Mean pooling + L2 normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MS MARCO, WDC, ECInstruct) |

## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download the model files (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to the import path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load -- just like sentence-transformers!
model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual cosine similarity
# Since embeddings are L2-normalized, the dot product equals cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# 4. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 5. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt
python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance (see the sketch below).
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms with messy titles.
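The Quick Start covers search and clustering end to end, but re-ranking has no dedicated helper above. It can be built directly on `encode`: because the embeddings are L2-normalized, a single matrix-vector product scores every candidate at once. Below is a minimal sketch, assuming only the `encode` API shown in the Quick Start; the `rerank` helper and the example strings are illustrative, not part of the shipped API.

```python
import numpy as np

def rerank(model, query, candidates, top_k=None):
    """Hypothetical re-ranker built on MiniEmbed's `encode` API."""
    query_emb = np.asarray(model.encode([query]))[0]   # shape: (256,)
    cand_embs = np.asarray(model.encode(candidates))   # shape: (n, 256)
    # L2-normalized embeddings: dot product == cosine similarity
    scores = cand_embs @ query_emb
    order = np.argsort(-scores)                        # best first
    ranked = [(candidates[i], float(scores[i])) for i in order]
    return ranked[:top_k] if top_k else ranked

# Example: re-order candidates from any first-stage retriever (e.g. BM25)
# for text, score in rerank(model, "wireless earbuds", titles, top_k=5):
#     print(f"{score:.3f}  {text}")
```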
## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token embedding (30K vocab) + sinusoidal positional encoding
- 4x pre-LayerNorm Transformer encoder layers
  - Multi-head self-attention (4 heads, d_k=64)
  - Position-wise feed-forward (GELU activation, d_ff=1024)
- Mean pooling over non-padded tokens
- L2 normalization (unit hypersphere projection)

A minimal PyTorch sketch of this forward pass is included as an appendix at the end of this card.

## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**

- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256

## Files

```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (safe & fast)
|-- model.pt             # Pre-trained weights (legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```

## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128-token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)

## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author  = {Bhandari, Suraj},
  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url     = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year    = {2026}
}
```

## License

MIT
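## Appendix: Architecture Sketch

For readers who want to see the shape of the model without opening `src/model.py`, here is a minimal PyTorch sketch of the encoder described in the Architecture section. It mirrors the documented spec (4 pre-LN layers, 4 heads so d_k = 256/4 = 64, d_model=256, d_ff=1024, GELU, masked mean pooling, L2 normalization), but the class and argument names are illustrative assumptions, not the actual `src/model.py` API.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniEmbedSketch(nn.Module):
    """Illustrative sketch of the documented spec (not the real src/model.py)."""

    def __init__(self, vocab_size=30_000, d_model=256, n_heads=4,
                 d_ff=1024, n_layers=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal positional encoding
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Pre-LayerNorm encoder layers: 4 heads, GELU feed-forward
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff,
            activation="gelu", norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, attention_mask):
        # token_ids: (batch, seq_len); attention_mask: 1 = real token, 0 = pad
        x = self.embed(token_ids) + self.pe[: token_ids.size(1)]
        # src_key_padding_mask expects True where tokens should be ignored
        h = self.encoder(x, src_key_padding_mask=~attention_mask.bool())
        # Mean pooling over non-padded tokens only
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return F.normalize(pooled, p=2, dim=1)  # unit hypersphere projection

# Quick shape check with random token IDs:
# m = MiniEmbedSketch()
# ids = torch.randint(0, 30_000, (2, 16))
# print(m(ids, torch.ones(2, 16, dtype=torch.long)).shape)  # torch.Size([2, 256])
```

As a sanity check against the spec table: the embedding table alone is 30,000 x 256 ≈ 7.7M parameters, and the four encoder layers add roughly 3.2M more, consistent with the documented ~10.8M total.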