Sync from GitHub Actions

Files changed (3)

.gitattributes +2 -0
MODEL_CARD.md +173 -0
README.md +7 -3

.gitattributes ADDED

@@ -0,0 +1,2 @@
+model.pt filter=lfs diff=lfs merge=lfs -text
+model.safetensors filter=lfs diff=lfs merge=lfs -text

MODEL_CARD.md ADDED

@@ -0,0 +1,173 @@

+---
+language: en
+license: mit
+tags:
+  - text-embedding
+  - sentence-similarity
+  - semantic-search
+  - product-matching
+  - transformer
+  - pytorch
+  - from-scratch
+library_name: pytorch
+pipeline_tag: sentence-similarity
+model-index:
+  - name: MiniEmbed-Mini
+    results: []
+---
+# MiniEmbed: Tiny, Powerful Embedding Models from Scratch
+**MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
+**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
+| Spec | Value |
+|---|---|
+| Parameters | ~10.8M |
+| Model Size | ~42 MB |
+| Embedding Dim | 256 |
+| Vocab Size | 30,000 |
+| Max Seq Length | 128 tokens |
+| Architecture | 4-layer Transformer Encoder |
+| Pooling | Mean Pooling + L2 Normalization |
+| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
+| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
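As a rough cross-check of the spec table, the parameter count and file size can be estimated from the dimensions the card lists under Architecture (4 heads with d_k=64, so a model width of 256; d_ff=1024; 4 layers; 30K vocab). This is a back-of-envelope sketch, not code from the repository. It assumes fp32 weights, standard Q/K/V/output projections with biases, two LayerNorms per layer, parameter-free sinusoidal positions, and no extra output head. The helper name `estimate_params` is hypothetical.

```python
def estimate_params(vocab=30_000, d_model=256, d_ff=1024, n_layers=4):
    """Rough parameter count for a Transformer encoder (assumed layout)."""
    embedding = vocab * d_model                     # token embedding table
    attention = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections + biases
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                 # two LayerNorms, scale + shift each
    per_layer = attention + feed_forward + layer_norms
    return embedding + n_layers * per_layer


total = estimate_params()
size_mb = total * 4 / 1e6  # 4 bytes per fp32 weight
print(f"{total:,} parameters, ~{size_mb:.1f} MB in fp32")
```

Under these assumptions the estimate comes to about 10.8M parameters and roughly 43 MB (about 41 MiB), which is in line with the table's ~10.8M and ~42 MB. The token embedding table alone accounts for about 7.7M of the total.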
+## Quick Start
+```bash
+pip install torch numpy scikit-learn huggingface_hub
+```
+```python
+from huggingface_hub import snapshot_download
+# Download model (one-time)
+model_dir = snapshot_download("surazbhandari/miniembed")
+# Add src to path
+import sys
+sys.path.insert(0, model_dir)
+from src.inference import EmbeddingInference
+# Load -- just like sentence-transformers!
+model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
+# 1. Similarity
+score = model.similarity("Machine learning is great", "AI is wonderful")
+print(f"Similarity: {score:.4f}")  # 0.4287
+# 2. Normal Embeddings
+embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+# 3. Manual Cosine Similarity
+# Since embeddings are L2-normalized, dot product is cosine similarity
+import numpy as np
+score = np.dot(embeddings[0], embeddings[1])
+print(f"Similarity: {score:.4f}")
+# Semantic Search
+docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
+results = model.search("deep learning frameworks", docs, top_k=2)
+for r in results:
+    print(f"  [{r['score']:.3f}] {r['text']}")
+# [0.498] Neural networks learn patterns
+# [0.413] Python is great for AI
+# Clustering
+result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
+# Cluster 1: ['Pizza is food']
+# Cluster 2: ['ML is cool', 'AI rocks']
+```
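The manual cosine step in the Quick Start relies on the embeddings being unit length: for unit vectors, cosine similarity reduces to a plain dot product. A minimal, dependency-free sketch of how a top-k search over such vectors could work follows. It is not the implementation behind `model.search`; the helpers `l2_normalize`, `dot`, and `top_k_search` and the toy 3-dimensional vectors are hypothetical stand-ins for real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k_search(query_vec, doc_vecs, docs, top_k=2):
    """Rank docs by dot product with the query; inputs are assumed unit length."""
    scored = [{"score": dot(query_vec, v), "text": t} for v, t in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


# Toy vectors standing in for real embeddings.
docs = ["doc A", "doc B", "doc C"]
doc_vecs = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query_vec = l2_normalize([3.0, 1.0, 0.0])

results = top_k_search(query_vec, doc_vecs, docs, top_k=2)
for r in results:
    print(f"[{r['score']:.3f}] {r['text']}")
```

With these toy vectors, "doc A" ranks first and "doc B" second, while the orthogonal "doc C" scores zero and is cut by `top_k=2`.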
+## Also Available via GitHub
+```bash
+git clone https://github.com/bhandarisuraz/miniembed.git
+cd miniembed
+pip install -r requirements.txt
+python -c "
+from src.inference import EmbeddingInference
+model = EmbeddingInference.from_pretrained('models/mini')
+print(model.similarity('hello world', 'hi there'))
+"
+```
+## Capabilities
+- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
+- **Re-Ranking** -- Sort candidates by true semantic relevance.
+- **Clustering** -- Group texts into logical categories automatically.
+- **Product Matching** -- Match items across platforms with messy titles.
+## Architecture
+Custom 4-layer Transformer encoder built from first principles:
+- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
+- 4x Pre-LayerNorm Transformer Encoder Layers
+- Multi-Head Self-Attention (4 heads, d_k=64)
+- Position-wise Feed-Forward (GELU activation, d_ff=1024)
+- Mean Pooling over non-padded tokens
+- L2 Normalization (unit hypersphere projection)
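The last two bullets can be illustrated without PyTorch. The sketch below averages token vectors over non-padded positions and then rescales the result to unit length. It is a plain-Python illustration under assumed names (`mean_pool`, `l2_normalize`, a 0/1 `mask`), not the code in `src/model.py`.

```python
import math


def mean_pool(token_vecs, mask):
    """Average token vectors where mask == 1; padded positions are ignored."""
    dim = len(token_vecs[0])
    count = max(sum(mask), 1)  # guard against an all-padding sequence
    return [
        sum(vec[i] for vec, m in zip(token_vecs, mask) if m) / count
        for i in range(dim)
    ]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Three token positions; the last one is padding and must not affect the result.
token_vecs = [[1.0, 2.0], [5.0, 2.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(token_vecs, mask)  # [3.0, 2.0]
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Masking matters because short inputs in a batch are padded up to the longest one; averaging over the padding would pull every short text toward the same vector.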
+## Training
+Trained on ~3.8 million text pairs from public datasets:
+| Dataset | Type |
+|---|---|
+| Natural Questions (NQ) | Q&A / General |
+| GooAQ | Knowledge Search |
+| WDC Product Matching | E-commerce |
+| ECInstruct | E-commerce Tasks |
+| MS MARCO | Web Search |
+**Training details:**
+- Training time: ~49 hours
+- Final loss: 0.0748
+- Optimizer: AdamW
+- Batch size: 256
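MNRL treats every other positive in the batch as a negative: for each query, the loss is the cross-entropy of picking its own paired text out of all texts in the batch. The plain-Python sketch below shows that computation for precomputed unit-length embeddings. The `scale` of 20 is an assumption (the card does not state the value used in training), and `mnrl_loss` is a hypothetical name rather than a function from the repository.

```python
import math


def mnrl_loss(query_vecs, positive_vecs, scale=20.0):
    """In-batch softmax cross-entropy: row i should score highest at column i."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Scaled similarity of query i against every positive in the batch.
        logits = [scale * sum(a * b for a, b in zip(q, p)) for p in positive_vecs]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs with unit-length 2-D embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each query matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each query matches the other pair's positive

print(mnrl_loss(queries, aligned))  # close to 0
print(mnrl_loss(queries, swapped))  # close to 20
```

One consequence of this formulation is that a batch size of 256 gives each query 255 in-batch negatives at no extra encoding cost.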
+## Files
+```
+surazbhandari/miniembed
+|-- README.md           # This model card
+|-- config.json         # Architecture config
+|-- model.safetensors   # Pre-trained weights (Safe & Fast)
+|-- model.pt            # Pre-trained weights (Legacy PyTorch)
+|-- tokenizer.json      # 30K word-level vocabulary
+|-- training_info.json  # Training metadata
+|-- src/
+    |-- __init__.py
+    |-- model.py        # Full architecture code
+    |-- tokenizer.py    # Tokenizer implementation
+    |-- inference.py    # High-level API (supports HF auto-download)
+```
+## Limitations
+- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
+- 128 token max sequence length
+- Trained primarily on English text
+- Best suited for short-form text (queries, product titles, sentences)
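The first two limitations can be seen in a toy word-level tokenizer. This is an illustrative sketch, not `src/tokenizer.py`: the four-entry vocabulary, the lowercase whitespace split, the `[PAD]` entry, and the `encode` helper are all assumptions.

```python
MAX_LEN = 128  # max sequence length from the spec table

# Tiny hypothetical vocabulary; the real one has 30,000 entries.
vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3}


def encode(text, vocab, max_len=MAX_LEN):
    """Map whitespace-separated words to ids, using [UNK] for unseen words."""
    unk = vocab["[UNK]"]
    ids = [vocab.get(word, unk) for word in text.lower().split()]
    return ids[:max_len]  # anything past max_len is silently dropped


print(encode("Machine learning", vocab))     # [2, 3]
print(encode("machine learnings", vocab))    # [2, 1]: the inflected form is unknown
print(len(encode("machine " * 200, vocab)))  # 128
```

Unlike a subword tokenizer, a word-level one cannot fall back to smaller pieces, so typos, rare inflections, and product codes all collapse to the same `[UNK]` id.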
+## Citation
+```bibtex
+@software{Bhandari_MiniEmbed_2026,
+  author  = {Bhandari, Suraj},
+  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
+  url     = {https://github.com/bhandarisuraz/miniembed},
+  version = {1.0.0},
+  year    = {2026}
+}
+```
+## License
+MIT

README.md CHANGED

@@ -61,10 +61,14 @@ print(f"Similarity: {score:.4f}")  # 0.4287
 # 2. Normal Embeddings
 embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+# 3. Manual Cosine Similarity
+# Since embeddings are L2-normalized, dot product is cosine similarity
 import numpy as np
-manual_score = np.dot(embeddings[0], embeddings[1]) # Dot product = Cosine Similarity
-# 3. Semantic Search
+score = np.dot(embeddings[0], embeddings[1])
+print(f"Similarity: {score:.4f}")
+# Semantic Search
 docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
 results = model.search("deep learning frameworks", docs, top_k=2)
 for r in results:
@@ -72,7 +76,7 @@ for r in results:
 # [0.498] Neural networks learn patterns
 # [0.413] Python is great for AI
-# 4. Clustering
+# Clustering
 result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
 # Cluster 1: ['Pizza is food']
 # Cluster 2: ['ML is cool', 'AI rocks']