---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MS MARCO, WDC, ECInstruct) |

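The size and parameter figures are consistent with each other: ~10.8M float32 weights at 4 bytes each come out to roughly the listed checkpoint size. A quick sanity check:

```python
# Sanity check: ~10.8M float32 parameters at 4 bytes each
params = 10_800_000
print(f"{params * 4 / 1024**2:.1f} MB")  # ~41.2, matching the ~42 MB checkpoint
```
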
## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load model
model = EmbeddingInference.from_pretrained(model_dir)

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual cosine similarity
# Since embeddings are L2-normalized, the dot product is the cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# 4. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 5. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
for cluster_id, texts in result['texts_by_cluster'].items():
    print(f"Cluster {cluster_id + 1}: {texts}")
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

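For a larger corpus you can also skip `model.search` and score documents yourself. A minimal sketch, assuming `model.encode` returns an array of unit-norm rows (as the `np.dot` example above implies):

```python
import numpy as np

corpus = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
corpus_emb = np.asarray(model.encode(corpus))         # (3, 256), rows are unit-norm

query_emb = np.asarray(model.encode(["deep learning frameworks"]))[0]
scores = corpus_emb @ query_emb                       # cosine similarity per document
for i in np.argsort(-scores):                         # rank best-first
    print(f"[{scores[i]:.3f}] {corpus[i]}")
```

Encoding the corpus once and reusing `corpus_emb` for every query is the main practical win of a bi-encoder.
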
## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance.
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms despite messy titles (see the sketch below).

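As an illustration of product matching, here is a minimal sketch that pairs each title in one catalog with its nearest neighbor in another. The store catalogs are made-up examples; only the documented `model.encode` API is assumed:

```python
import numpy as np

store_a = ["Apple iPhone 15 Pro 256GB Black", "Samsung Galaxy S24 Ultra 512GB"]
store_b = ["iPhone 15 Pro (256 GB, black)", "Galaxy S24 Ultra 5G 512 GB Gray"]

emb_a = np.asarray(model.encode(store_a))
emb_b = np.asarray(model.encode(store_b))

sim = emb_a @ emb_b.T                        # pairwise cosine similarities
for i, j in enumerate(sim.argmax(axis=1)):   # best match per store_a title
    print(f"{store_a[i]!r} -> {store_b[j]!r} ({sim[i, j]:.3f})")
```
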
## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
  - Multi-Head Self-Attention (4 heads, d_k=64)
  - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)

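A minimal sketch of the final two steps (not the repository's exact code): mean pooling that ignores padding tokens, followed by L2 normalization so every embedding lands on the unit hypersphere.

```python
import torch
import torch.nn.functional as F

def pool_and_normalize(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool encoder outputs over non-padded tokens, then L2-normalize."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)                # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)           # per-example token counts
    return F.normalize(summed / counts, p=2, dim=1)    # project onto unit hypersphere

hidden = torch.randn(2, 128, 256)                      # (batch, max_seq_len, embed_dim)
mask = torch.ones(2, 128, dtype=torch.long)            # 1 = real token, 0 = padding
print(pool_and_normalize(hidden, mask).norm(dim=1))    # ~tensor([1., 1.])
```
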
## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**
- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256

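For reference, a minimal sketch of MNRL as it is commonly implemented (not necessarily the repository's exact code): each query's paired document is the positive, and every other document in the batch serves as an in-batch negative. The `scale` of 20 is a typical choice, not a confirmed training detail.

```python
import torch
import torch.nn.functional as F

def mnrl_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple Negatives Ranking Loss over a batch of (query, positive doc) pairs.

    Rows are assumed L2-normalized, so the matmul yields cosine similarities.
    The correct document for query i sits on the diagonal, hence label i.
    """
    scores = query_emb @ doc_emb.T * scale               # (batch, batch) logits
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

q = F.normalize(torch.randn(256, 256), dim=1)            # batch of 256 query embeddings
d = F.normalize(torch.randn(256, 256), dim=1)            # their positive documents
print(mnrl_loss(q, d))
```
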
## Files

```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (Safe & Fast)
|-- model.pt             # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```

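If you want the raw weights without the high-level API, the `.safetensors` file can be inspected directly. A minimal sketch using the `safetensors` package (an extra dependency, not listed in Quick Start; `model_dir` comes from `snapshot_download` as above):

```python
from safetensors.torch import load_file

state_dict = load_file(f"{model_dir}/model.safetensors")  # dict of name -> tensor
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```
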
## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128-token max sequence length (a chunking workaround is sketched below)
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)

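One hypothetical way to work around the 128-token limit for longer documents: split the text into word chunks, embed each chunk, then average and re-normalize. The `max_words` value of 90 is a rough margin for the word-level tokenizer, not a tested figure:

```python
import numpy as np

def embed_long(text: str, model, max_words: int = 90) -> np.ndarray:
    """Chunk a long text by words, embed each chunk, average, re-normalize."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    emb = np.asarray(model.encode(chunks)).mean(axis=0)
    return emb / np.linalg.norm(emb)
```
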
## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author = {Bhandari, Suraj},
  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year = {2026}
}
```

## License

MIT