---
language: en
license: mit
tags:
  - text-embedding
  - sentence-similarity
  - semantic-search
  - product-matching
  - transformer
  - pytorch
  - from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
  - name: MiniEmbed-Mini
    results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers dependency, no weights borrowed from pre-trained models -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)

| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MS MARCO, WDC, ECInstruct) |

## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load model
model = EmbeddingInference.from_pretrained(model_dir)

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Normal Embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual Cosine Similarity
# Since embeddings are L2-normalized, dot product is cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# Semantic Search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
for cluster_id, texts in result['texts_by_cluster'].items():
    print(f"Cluster {cluster_id + 1}: {texts}")
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance (see the sketch after this list).
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms with messy titles.
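
As a hedged illustration of re-ranking with the documented `encode()` API (reusing the `model` from Quick Start; the query and candidate texts are made up for this example):

```python
import numpy as np

# Re-rank candidates for a query; reuses `model` from Quick Start.
# encode() returns L2-normalized vectors, so dot products are cosine scores.
query = "wireless noise cancelling headphones"
candidates = [
    "Sony WH-1000XM5 Wireless Noise Canceling Headphones",
    "USB-C charging cable, 2 m",
    "Bose QuietComfort Bluetooth headphones",
]
q_vec = model.encode([query])[0]
c_vecs = model.encode(candidates)
scores = c_vecs @ q_vec
for i in np.argsort(-scores):                 # highest score first
    print(f"{scores[i]:.3f}  {candidates[i]}")
```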

## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
- Multi-Head Self-Attention (4 heads, d_k=64)
- Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)
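
The full implementation lives in `src/model.py` on GitHub; the sketch below is a minimal, hypothetical re-creation of the same stack using the hyperparameters from the spec table above (the class name `MiniEncoder` and the use of `nn.TransformerEncoderLayer` are illustrative assumptions, not the actual code):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniEncoder(nn.Module):
    """Illustrative sketch only -- not the actual src/model.py."""

    def __init__(self, vocab_size=30_000, d_model=256, n_heads=4,
                 d_ff=1024, n_layers=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal positional encodings, precomputed once.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, activation="gelu",
            norm_first=True, batch_first=True)  # pre-LayerNorm, GELU FFN
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids, mask):
        # ids: (B, T) token ids; mask: (B, T), 1 for real tokens, 0 for padding
        x = self.embed(ids) + self.pe[: ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=(mask == 0))
        # Mean pooling over non-padded positions only.
        m = mask.unsqueeze(-1).float()
        pooled = (x * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)
        return F.normalize(pooled, p=2, dim=-1)  # project to unit hypersphere
```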

## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**
- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256
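
For reference, MNRL (the training loss named in the spec table) treats each (query, positive) pair in a batch as a classification problem: every other positive in the batch acts as an in-batch negative, and cross-entropy pushes the matching pair's similarity above the rest. A minimal sketch, assuming L2-normalized embeddings as this model produces (the `scale` value is a common default, not a confirmed training setting):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q_emb, p_emb, scale=20.0):
    """Sketch of Multiple Negatives Ranking Loss.
    q_emb, p_emb: (B, D) L2-normalized query/positive embeddings.
    NOTE: scale=20.0 is a common default, not a confirmed setting here."""
    sim = q_emb @ p_emb.T * scale                      # (B, B) scaled cosine sims
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)                # diagonal = matching pairs
```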

## Files

```
surazbhandari/miniembed
|-- README.md           # This model card
|-- config.json         # Architecture config
|-- model.safetensors   # Pre-trained weights (Safe & Fast)
|-- model.pt            # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json      # 30K word-level vocabulary
|-- training_info.json  # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py        # Full architecture code
    |-- tokenizer.py    # Tokenizer implementation
    |-- inference.py    # High-level API (supports HF auto-download)
```

## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to `[UNK]` (see the example below)
- 128 token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)
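
A quick, hypothetical way to observe the tokenizer limitation (exact scores will vary; the point is that out-of-vocabulary words lose all distinguishing signal):

```python
# A misspelled word falls outside the 30K vocabulary and becomes [UNK],
# so it carries no distinguishing signal for similarity.
print(model.similarity("transformer models", "transfomer models"))  # typo -> [UNK]
print(model.similarity("transformer models", "transformer models")) # in-vocab
```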

## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author  = {Bhandari, Suraj},
  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url     = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year    = {2026}
}
```

## License

MIT