surazbhandari committed · Commit 3da5f3c (0 parents) · Update docs

Files changed (2):
1. MODEL_CARD.md (+169)
2. README.md (+230)

MODEL_CARD.md ADDED
---
language: en
license: mit
tags:
- text-embedding
- sentence-similarity
- semantic-search
- product-matching
- transformer
- pytorch
- from-scratch
library_name: pytorch
pipeline_tag: sentence-similarity
model-index:
- name: MiniEmbed-Mini
  results: []
---

# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.

**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)

| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |

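As a back-of-the-envelope sanity check, the ~10.8M parameter figure can be reconstructed from the spec table above (assuming bias terms on all linear layers and two LayerNorms per encoder layer -- an illustrative estimate, not the exact checkpoint breakdown; sinusoidal positional encodings add no parameters):

```python
# Rough parameter count from the spec table; assumes biased linear layers
# and 2 LayerNorms per layer (standard Pre-LN Transformer layout).
d_model, d_ff, vocab, layers = 256, 1024, 30_000, 4

embedding = vocab * d_model                                  # token embedding table
attention = 4 * (d_model * d_model + d_model)                # Q, K, V, O projections + bias
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)   # two position-wise linears
norms = 2 * 2 * d_model                                      # 2 LayerNorms (gamma + beta)

total = embedding + layers * (attention + ffn + norms)
print(f"{total:,}")  # 10,839,040 -- i.e. ~10.8M
```

The token embedding table alone accounts for ~7.7M of the ~10.8M parameters, which is why trimming the vocabulary is the most effective lever for shrinking models of this shape.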
## Quick Start

```bash
pip install torch numpy scikit-learn huggingface_hub
```

```python
from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load -- just like sentence-transformers!
model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
import numpy as np
manual_score = np.dot(embeddings[0], embeddings[1])  # embeddings are L2-normalized, so dot product = cosine similarity

# 3. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 4. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

## Also Available via GitHub

```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```

## Capabilities

- **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- **Re-Ranking** -- Sort candidates by true semantic relevance.
- **Clustering** -- Group texts into logical categories automatically.
- **Product Matching** -- Match items across platforms with messy titles.

## Architecture

Custom 4-layer Transformer encoder built from first principles:

- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
- Multi-Head Self-Attention (4 heads, d_k=64)
- Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)

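The last two steps -- mean pooling over non-padded tokens, then L2 normalization -- can be sketched in a few lines of PyTorch. This is an illustrative sketch (the function name `pool_and_normalize` is made up here; the actual implementation lives in `src/model.py`):

```python
import torch

def pool_and_normalize(hidden, attention_mask):
    """Mean-pool encoder outputs over real tokens, then L2-normalize.

    hidden:         (batch, seq_len, d_model) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)               # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1e-9)          # number of real tokens per sequence
    mean = summed / counts                            # mean over non-padded tokens only
    return torch.nn.functional.normalize(mean, p=2, dim=-1)  # project to unit hypersphere

# Example: batch of 2 sequences, the second padded to half its length
hidden = torch.randn(2, 4, 256)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
emb = pool_and_normalize(hidden, mask)
print(emb.shape)  # torch.Size([2, 256]); every row has unit L2 norm
```

Because the output vectors have unit norm, downstream similarity reduces to a plain dot product.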
## Training

Trained on ~3.8 million text pairs from public datasets:

| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |

**Training details:**
- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256

## Files

```
surazbhandari/miniembed
|-- README.md              # This model card
|-- config.json            # Architecture config
|-- model.safetensors      # Pre-trained weights (safe & fast)
|-- model.pt               # Pre-trained weights (legacy PyTorch)
|-- tokenizer.json         # 30K word-level vocabulary
|-- training_info.json     # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py           # Full architecture code
    |-- tokenizer.py       # Tokenizer implementation
    |-- inference.py       # High-level API (supports HF auto-download)
```

## Limitations

- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- 128 token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)

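To illustrate the first limitation: a word-level tokenizer looks each whitespace-split word up in a fixed vocabulary, so any out-of-vocabulary word collapses to a single [UNK] id and its meaning is lost. A minimal sketch (illustrative only -- the vocabulary and ids below are made up; the real tokenizer, with its own normalization rules and special tokens, is in `src/tokenizer.py`):

```python
# Toy word-level vocabulary; ids are arbitrary for illustration.
vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3, "is": 4, "great": 5}

def tokenize(text, vocab, max_len=128):
    words = text.lower().split()[:max_len]                # truncate at max sequence length
    return [vocab.get(w, vocab["[UNK]"]) for w in words]  # OOV words -> [UNK]

print(tokenize("Machine learning is great", vocab))           # [2, 3, 4, 5]
print(tokenize("Machine learning is transformative", vocab))  # [2, 3, 4, 1] -- OOV word collapsed
```

This is why misspellings, rare product codes, and morphological variants all map to the same [UNK] embedding, whereas a subword/BPE tokenizer would still recover partial signal from word pieces.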
## Citation

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author = {Bhandari, Suraj},
  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year = {2026}
}
```

## License

MIT
README.md ADDED
# MiniEmbed: Tiny, Powerful Embedding Models from Scratch

**MiniEmbed** is a research-grade toolkit for training and deploying ultra-compact text embedding models (Bi-Encoders) built entirely from scratch in PyTorch. While the industry chases billion-parameter giants, MiniEmbed shows that a **~42 MB / 10.8M parameter** model can deliver production-grade semantic intelligence for specialized domains.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.8+](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://python.org)
[![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-orange)](https://huggingface.co/surazbhandari/miniembed)

---

## What Can MiniEmbed Do?

| Capability | Description |
|---|---|
| **Semantic Search** | Find meaning, not just keywords. Understands that *"kitten"* is similar to *"cat"*. |
| **Re-Ranking** | Sort candidates by true semantic relevance. Reduces false positives. |
| **Clustering** | Group thousands of texts into logical categories automatically. |
| **Product Matching** | Match identical items across stores, even with messy or inconsistent titles. |
| **Text Encoding** | Convert any text into a dense 256-dimensional vector for downstream tasks. |

---
## Project Structure

```
miniembed/
|-- README.md                  # You are here
|-- LICENSE                    # MIT License
|-- requirements.txt           # Python dependencies
|-- demo.py                    # Interactive Streamlit demo
|-- src/                       # Core library
|   |-- __init__.py
|   |-- model.py               # Transformer architecture (from scratch)
|   |-- tokenizer.py           # Custom word-level tokenizer
|   |-- inference.py           # High-level API for encoding & search
|-- models/
|   |-- mini/                  # Pre-trained Mini model
|       |-- model.safetensors  # Pre-trained weights (safe & fast)
|       |-- model.pt           # Pre-trained weights (legacy)
|       |-- config.json        # Architecture blueprint
|       |-- tokenizer.json     # 30K vocabulary
|       |-- training_info.json # Training metadata
|-- examples/                  # Ready-to-run scripts
|   |-- basic_usage.py         # Encoding & similarity
|   |-- semantic_search.py     # Document retrieval
|   |-- clustering.py          # Text clustering with K-Means
|-- data/
    |-- sample_data.jsonl      # 10-pair demo dataset
```

> **Note:** Pre-trained weights (`model.safetensors` / `model.pt`, ~42 MB) are included in this repository. Clone and use immediately. `.safetensors` is recommended for security and faster loading.

---
## Quick Start

### 1. Install Dependencies
```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt
```

### 2. Use the Model

The pre-trained Mini model is included in `models/mini/`. Alternatively, you can load it directly from Hugging Face:

```python
from src.inference import EmbeddingInference

# Option A: From local files
model = EmbeddingInference.from_pretrained("models/mini")

# Option B: Direct from Hugging Face (auto-downloads)
model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
```

### 3. Try It Instantly
```python
from src.inference import EmbeddingInference

model = EmbeddingInference.from_pretrained("models/mini")

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Raw embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
import numpy as np
manual_score = np.dot(embeddings[0], embeddings[1])  # embeddings are L2-normalized, so dot product = cosine similarity

# 3. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")

# 4. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```

For full Hugging Face integration, ensure you have `huggingface_hub` installed:
```bash
pip install huggingface_hub
```

---
## Interactive Demo (`demo.py`)

A full-featured Streamlit dashboard for exploring the model's capabilities without writing code:

- **Similarity** -- Real-time cosine similarity between any two texts.
- **Semantic Search** -- Rank a custom document set against your query.
- **Clustering** -- Automatically categorize items using K-Means.
- **Text Encoding** -- Inspect raw 256-D vectors and their statistics.
- **CSV Matcher** -- Match records between two CSV files for deduplication or cross-platform product mapping.

```bash
streamlit run demo.py
```

---
## Architecture

MiniEmbed uses a **custom 4-layer Transformer encoder** built from scratch -- no HuggingFace, no pre-trained weights:

| Component | Specification |
|---|---|
| Embedding Dimension | 256 |
| Attention Heads | 4 |
| Transformer Layers | 4 |
| Feed-Forward Dimension | 1,024 |
| Vocabulary Size | 30,000 |
| Max Sequence Length | 128 tokens |
| Total Parameters | ~10.8M |
| Model Size on Disk | ~42 MB |
| Pooling Strategy | Mean Pooling + L2 Normalization |

### Training Objective

Training uses **Multiple Negatives Ranking Loss (MNRL)**, the industry-standard contrastive objective for Bi-Encoders. Each query $q_i$ is pulled toward its paired positive $p_i$, while every other in-batch positive $p_j$ ($j \neq i$) serves as a negative:

$$\mathcal{L} = -\sum_{i=1}^{n} \log \frac{e^{\operatorname{sim}(q_i, p_i) / \tau}}{\sum_{j=1}^{n} e^{\operatorname{sim}(q_i, p_j) / \tau}}$$

All embeddings are **L2-normalized**, projecting text onto a unit hypersphere where cosine similarity equals dot product -- enabling ultra-fast nearest-neighbor search.

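In code, MNRL reduces to cross-entropy over the in-batch similarity matrix, where the diagonal entries are the correct (query, positive) pairs. A self-contained PyTorch sketch of this objective (illustrative; the project's actual training loop may differ in details):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q, p, tau=0.05):
    """Multiple Negatives Ranking Loss over a batch of (query, positive) pairs.

    q, p: (batch, dim) embedding matrices; q[i] and p[i] are a positive pair,
    and every other p[j] acts as an in-batch negative for q[i].
    """
    q = F.normalize(q, dim=-1)             # unit hypersphere: dot product = cosine
    p = F.normalize(p, dim=-1)
    sim = q @ p.T / tau                    # (batch, batch) scaled similarity matrix
    labels = torch.arange(q.size(0))       # correct match for row i is column i
    return F.cross_entropy(sim, labels)    # -log softmax mass on the diagonal

# Example with random embeddings
torch.manual_seed(0)
q, p = torch.randn(8, 256), torch.randn(8, 256)
loss = mnrl_loss(q, p)
print(loss.item())
```

Note the free supervision signal: a batch of 256 pairs gives each query 255 negatives at no extra labeling cost, which is why larger batch sizes generally help contrastive training.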
---

## Training Data Sources

The pre-trained model was trained on ~3.8 million text pairs from the following open-source datasets:

| Dataset | Type | Source |
|---|---|---|
| **Natural Questions (NQ)** | Q&A / General | [HuggingFace](https://huggingface.co/datasets/google-research-datasets/natural_questions) |
| **GooAQ** | Knowledge Search | [HuggingFace](https://huggingface.co/datasets/sentence-transformers/gooaq) |
| **WDC Product Matching** | E-commerce | [HuggingFace](https://huggingface.co/datasets/wdc/products-2017) |
| **ECInstruct** | E-commerce Tasks | [HuggingFace](https://huggingface.co/datasets/NingLab/ECInstruct) |
| **MS MARCO** | Web Search | [HuggingFace](https://huggingface.co/datasets/microsoft/ms_marco) |

> **Legal Disclaimer**: These public datasets remain the property of their respective creators. Consult each dataset's license and usage terms before redistributing or building on it.

---
## Performance

Results from the pre-trained Mini model:

| Metric | Value |
|---|---|
| **Training Loss** | 0.0748 (final) |
| **Training Samples** | 3,817,707 pairs |
| **Throughput** | ~1,000 samples/sec |
| **Encoding Latency** | ~3-5 ms per text |
| **Training Epochs** | 10 |

---
## Examples

Ready-to-run scripts in the `examples/` folder:

```bash
cd examples

# Basic encoding and similarity
python basic_usage.py

# Document retrieval
python semantic_search.py

# Text clustering with K-Means
python clustering.py
```

---
## Roadmap

- **mini-product** -- A further fine-tuned version of the Mini model, specialized for high-accuracy **product matching**. Coming soon.

---
## Citation

If you use MiniEmbed in your research, please cite:

```bibtex
@software{Bhandari_MiniEmbed_2026,
  author = {Bhandari, Suraj},
  title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year = {2026}
}
```

---

## License

MIT License. See [LICENSE](LICENSE) for details.

Explore, learn, and build smaller, smarter AI.