surazbhandari committed

Commit e190deb · verified · 1 Parent(s): adc0ea3

Update to Hugging Face standard model format
.gitattributes CHANGED
@@ -1,2 +1,4 @@
  models/mini/model.pt filter=lfs diff=lfs merge=lfs -text
  models/mini/model.safetensors filter=lfs diff=lfs merge=lfs -text
+ model.pt filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
LICENSE DELETED
@@ -1,21 +0,0 @@
- MIT License
-
- Copyright (c) 2024
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
MODEL_CARD.md DELETED
@@ -1,164 +0,0 @@
- ---
- language: en
- license: mit
- tags:
- - text-embedding
- - sentence-similarity
- - semantic-search
- - product-matching
- - transformer
- - pytorch
- - from-scratch
- library_name: pytorch
- pipeline_tag: sentence-similarity
- model-index:
- - name: MiniEmbed-Mini
-   results: []
- ---
-
- # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
- **MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
-
- **GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
-
- | Spec | Value |
- |---|---|
- | Parameters | ~10.8M |
- | Model Size | ~42 MB |
- | Embedding Dim | 256 |
- | Vocab Size | 30,000 |
- | Max Seq Length | 128 tokens |
- | Architecture | 4-layer Transformer Encoder |
- | Pooling | Mean Pooling + L2 Normalization |
- | Training Loss | MNRL (Multiple Negatives Ranking Loss) |
- | Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
-
- ## Quick Start
-
- ```bash
- pip install torch numpy scikit-learn huggingface_hub
- ```
-
- ```python
- from huggingface_hub import snapshot_download
-
- # Download model (one-time)
- model_dir = snapshot_download("surazbhandari/miniembed")
-
- # Add src to path
- import sys
- sys.path.insert(0, model_dir)
-
- from src.inference import EmbeddingInference
-
- # Load -- just like sentence-transformers!
- model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
-
- # Similarity
- score = model.similarity("Machine learning is great", "AI is wonderful")
- print(f"Similarity: {score:.4f}")  # 0.4287
-
- # Semantic Search
- docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
- results = model.search("deep learning frameworks", docs, top_k=2)
- for r in results:
-     print(f"  [{r['score']:.3f}] {r['text']}")
- # [0.498] Neural networks learn patterns
- # [0.413] Python is great for AI
-
- # Clustering
- result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
- # Cluster 1: ['Pizza is food']
- # Cluster 2: ['ML is cool', 'AI rocks']
- ```
-
- ## Also Available via GitHub
-
- ```bash
- git clone https://github.com/bhandarisuraz/miniembed.git
- cd miniembed
- pip install -r requirements.txt
-
- python -c "
- from src.inference import EmbeddingInference
- model = EmbeddingInference.from_pretrained('models/mini')
- print(model.similarity('hello world', 'hi there'))
- "
- ```
-
- ## Capabilities
-
- - **Semantic Search** -- Find meaning-based matches, not keyword overlap.
- - **Re-Ranking** -- Sort candidates by true semantic relevance.
- - **Clustering** -- Group texts into logical categories automatically.
- - **Product Matching** -- Match items across platforms with messy titles.
-
- ## Architecture
-
- Custom 4-layer Transformer encoder built from first principles:
-
- - Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- - 4x Pre-LayerNorm Transformer Encoder Layers
- - Multi-Head Self-Attention (4 heads, d_k=64)
- - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- - Mean Pooling over non-padded tokens
- - L2 Normalization (unit hypersphere projection)
-
- ## Training
-
- Trained on ~3.8 million text pairs from public datasets:
-
- | Dataset | Type |
- |---|---|
- | Natural Questions (NQ) | Q&A / General |
- | GooAQ | Knowledge Search |
- | WDC Product Matching | E-commerce |
- | ECInstruct | E-commerce Tasks |
- | MS MARCO | Web Search |
-
- **Training details:**
- - Training time: ~49 hours
- - Final loss: 0.0748
- - Optimizer: AdamW
- - Batch size: 256
-
- ## Files
-
- ```
- surazbhandari/miniembed
- |-- README.md           # This model card
- |-- config.json         # Architecture config
- |-- model.safetensors   # Pre-trained weights (Safe & Fast)
- |-- model.pt            # Pre-trained weights (Legacy PyTorch)
- |-- tokenizer.json      # 30K word-level vocabulary
- |-- training_info.json  # Training metadata
- |-- src/
-     |-- __init__.py
-     |-- model.py        # Full architecture code
-     |-- tokenizer.py    # Tokenizer implementation
-     |-- inference.py    # High-level API (supports HF auto-download)
- ```
-
- ## Limitations
-
- - Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
- - 128 token max sequence length
- - Trained primarily on English text
- - Best suited for short-form text (queries, product titles, sentences)
-
- ## Citation
-
- ```bibtex
- @software{Bhandari_MiniEmbed_2026,
-   author = {Bhandari, Suraj},
-   title = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
-   url = {https://github.com/bhandarisuraz/miniembed},
-   version = {1.0.0},
-   year = {2026}
- }
- ```
-
- ## License
-
- MIT
README.md CHANGED
@@ -1,206 +1,159 @@
- # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
-
- **MiniEmbed** is a research-grade toolkit for training and deploying ultra-compact text embedding models (Bi-Encoders) built entirely from scratch in PyTorch. While the industry chases billion-parameter giants, MiniEmbed proves that a **~42 MB / 10.8M parameter** model can deliver production-grade semantic intelligence for specialized domains.
-
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
- [![Python 3.8+](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://python.org)
- [![PyTorch 2.0+](https://img.shields.io/badge/PyTorch-2.0+-ee4c2c.svg)](https://pytorch.org)
- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-orange)](https://huggingface.co/surazbhandari/miniembed)
-
  ---
-
- ## What Can MiniEmbed Do?
-
- | Capability | Description |
- |---|---|
- | **Semantic Search** | Find meaning, not just keywords. Understands that *"kitten"* is similar to *"cat"*. |
- | **Re-Ranking** | Sort candidates by true semantic relevance. Eliminates false positives. |
- | **Clustering** | Group thousands of texts into logical categories automatically. |
- | **Product Matching** | Match identical items across stores, even with messy or inconsistent titles. |
- | **Text Encoding** | Convert any text into a dense 256-dimensional vector for downstream tasks. |
-
  ---
  
- ## Project Structure
-
- ```
- miniembed/
- |-- README.md              # You are here
- |-- LICENSE                # MIT License
- |-- requirements.txt       # Python dependencies
- |-- demo.py                # Interactive Streamlit demo
- |-- src/                   # Core library
- |   |-- __init__.py
- |   |-- model.py           # Transformer architecture (from scratch)
- |   |-- tokenizer.py       # Custom word-level tokenizer
- |   |-- inference.py       # High-level API for encoding & search
- |-- models/
- |   |-- mini/              # Pre-trained Mini model
- |       |-- model.safetensors   # Pre-trained weights (Safe & Fast)
- |       |-- model.pt            # Pre-trained weights (Legacy)
- |       |-- config.json         # Architecture blueprint
- |       |-- tokenizer.json      # 30K vocabulary
- |       |-- training_info.json  # Training metadata
- |-- examples/              # Ready-to-run scripts
- |   |-- basic_usage.py     # Encoding & similarity
- |   |-- semantic_search.py # Document retrieval
- |   |-- clustering.py      # Text clustering with K-Means
- |-- data/
-     |-- sample_data.jsonl  # 10-pair demo dataset
- ```
-
- > **Note:** Pre-trained weights (`model.safetensors` / `model.pt`, ~42 MB) are included in this repository. Clone and use immediately. `.safetensors` is recommended for security and faster loading.
-
- ---
  
  ## Quick Start
  
- ### 1. Install Dependencies
  ```bash
- git clone https://github.com/bhandarisuraz/miniembed.git
- cd miniembed
- pip install -r requirements.txt
  ```
  
- ### 2. Use the Model
-
- The pre-trained Mini model is included in `models/mini/`. Alternatively, you can load it directly from Hugging Face:
-
  ```python
- from src.inference import EmbeddingInference
  
- # Option A: From local files
- model = EmbeddingInference.from_pretrained("models/mini")
  
- # Option B: Direct from Hugging Face (auto-downloads)
- model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
- ```
  
- ### 3. Try It Instantly
- ```python
  from src.inference import EmbeddingInference
  
- model = EmbeddingInference.from_pretrained("models/mini")
  
- # Similarity
  score = model.similarity("Machine learning is great", "AI is wonderful")
  print(f"Similarity: {score:.4f}")  # 0.4287
  
- # Semantic Search
  docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
  results = model.search("deep learning frameworks", docs, top_k=2)
  for r in results:
      print(f"  [{r['score']:.3f}] {r['text']}")
- ```
  
- For full Hugging Face integration, ensure you have `huggingface_hub` installed:
- ```bash
- pip install huggingface_hub
  ```
  
- ---
-
- ## Interactive Demo (`demo.py`)
-
- A full-featured Streamlit dashboard for exploring the model's capabilities without writing code:
-
- - **Similarity** -- Real-time cosine similarity between any two texts.
- - **Semantic Search** -- Rank a custom document set against your query.
- - **Clustering** -- Automatically categorize items using K-Means.
- - **Text Encoding** -- Inspect raw 256-D vectors and their statistics.
- - **CSV Matcher** -- Match records between two CSV files for deduplication or cross-platform product mapping.
  
  ```bash
- streamlit run demo.py
- ```
-
- ---
-
- ## Architecture
-
- MiniEmbed uses a **custom 4-layer Transformer encoder** built from scratch -- no HuggingFace, no pre-trained weights:
-
- | Component | Specification |
- |---|---|
- | Embedding Dimension | 256 |
- | Attention Heads | 4 |
- | Transformer Layers | 4 |
- | Feed-Forward Dimension | 1,024 |
- | Vocabulary Size | 30,000 |
- | Max Sequence Length | 128 tokens |
- | Total Parameters | ~10.8M |
- | Model Size on Disk | ~42 MB |
- | Pooling Strategy | Mean Pooling + L2 Normalization |
-
- ### Training Objective
-
- Training uses **Multiple Negatives Ranking Loss (MNRL)**, the industry-standard contrastive objective for Bi-Encoders:
-
- $$\mathcal{L} = -\sum_{i=1}^{n} \log \frac{e^{sim(q_i, p_i) / \tau}}{\sum_{j=1}^{n} e^{sim(q_i, p_j) / \tau}}$$
-
- All embeddings are **L2-normalized**, projecting text onto a unit hypersphere where cosine similarity equals dot product -- enabling ultra-fast nearest-neighbor search.
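Editor's note: the MNRL formula above is exactly an in-batch softmax cross-entropy over a similarity matrix, with each query's paired passage as the positive and every other passage in the batch as a negative. A minimal PyTorch sketch of that objective (an illustration only, not the repository's actual training code; the `temperature` value and random toy batch are assumptions):

```python
import torch
import torch.nn.functional as F

def mnrl_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Multiple Negatives Ranking Loss: positive for query i is passage i;
    all other passages in the batch act as in-batch negatives."""
    q = F.normalize(q_emb, dim=-1)   # L2-normalize, so dot product == cosine sim
    p = F.normalize(p_emb, dim=-1)
    scores = q @ p.T / temperature   # (n, n) similarity matrix, scaled by 1/tau
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(scores, labels)             # softmax over each row

# Toy batch of 4 pairs: positives are noisy copies of their queries.
q = torch.randn(4, 256)
p = q + 0.1 * torch.randn(4, 256)
loss = mnrl_loss(q, p)
```

Because each row of `scores` is handled by a standard cross-entropy, the batch size directly controls how many negatives each query sees.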
  
- ---
  
- ## Training Data Sources
  
- The pre-trained model was trained on ~3.8 million text pairs from the following open-source datasets:
  
- | Dataset | Type | Source |
- |---|---|---|
- | **Natural Questions (NQ)** | Q&A / General | [HuggingFace](https://huggingface.co/datasets/google-research-datasets/natural_questions) |
- | **GooAQ** | Knowledge Search | [HuggingFace](https://huggingface.co/datasets/sentence-transformers/gooaq) |
- | **WDC Product Matching** | E-commerce | [HuggingFace](https://huggingface.co/datasets/wdc/products-2017) |
- | **ECInstruct** | E-commerce Tasks | [HuggingFace](https://huggingface.co/datasets/NingLab/ECInstruct) |
- | **MS MARCO** | Web Search | [HuggingFace](https://huggingface.co/datasets/microsoft/ms_marco) |
  
- > **Legal Disclaimer**: These public datasets belong to their respective stakeholders and creators. Any copyright, licensing, or legal usage constraints must be consulted with the original authors individually.
  
- ---
  
- ## Performance
  
- Results from the pre-trained Mini model:
  
- | Metric | Value |
  |---|---|
- | **Training Loss** | 0.0748 (final) |
- | **Training Samples** | 3,817,707 pairs |
- | **Throughput** | ~1,000 samples/sec |
- | **Encoding Latency** | ~3-5 ms per text |
- | **Training Epochs** | 10 |
-
- ---
-
- ## Examples
  
- Ready-to-run scripts in the `examples/` folder:
  
- ```bash
- cd examples
-
- # Basic encoding and similarity
- python basic_usage.py
  
- # Document retrieval
- python semantic_search.py
-
- # Text clustering with K-Means
- python clustering.py
  ```
  
- ---
-
- ## Roadmap
-
- - **mini-product** -- A further fine-tuned version of the Mini model, specialized for high-accuracy **product matching** is Coming soon...
  
- ---
  
  ## Citation
  
- If you use MiniEmbed in your research, please cite:
-
  ```bibtex
  @software{Bhandari_MiniEmbed_2026,
    author = {Bhandari, Suraj},
@@ -211,10 +164,6 @@ If you use MiniEmbed in your research, please cite:
  }
  ```
  
- ---
-
  ## License
  
- MIT License. See [LICENSE](LICENSE) for details.
-
- Explore, learn, and build smaller, smarter AI.
  ---
+ language: en
+ license: mit
+ tags:
+ - text-embedding
+ - sentence-similarity
+ - semantic-search
+ - product-matching
+ - transformer
+ - pytorch
+ - from-scratch
+ library_name: pytorch
+ pipeline_tag: sentence-similarity
+ model-index:
+ - name: MiniEmbed-Mini
+   results: []
  ---
  
+ # MiniEmbed: Tiny, Powerful Embedding Models from Scratch
  
+ **MiniEmbed** is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no pre-trained weights -- just pure PyTorch.
  
+ **GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, interactive demo, and documentation)
  
+ | Spec | Value |
+ |---|---|
+ | Parameters | ~10.8M |
+ | Model Size | ~42 MB |
+ | Embedding Dim | 256 |
+ | Vocab Size | 30,000 |
+ | Max Seq Length | 128 tokens |
+ | Architecture | 4-layer Transformer Encoder |
+ | Pooling | Mean Pooling + L2 Normalization |
+ | Training Loss | MNRL (Multiple Negatives Ranking Loss) |
+ | Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
  
  ## Quick Start
  
  ```bash
+ pip install torch numpy scikit-learn huggingface_hub
  ```
  
  ```python
+ from huggingface_hub import snapshot_download
  
+ # Download model (one-time)
+ model_dir = snapshot_download("surazbhandari/miniembed")
  
+ # Add src to path
+ import sys
+ sys.path.insert(0, model_dir)
  
  from src.inference import EmbeddingInference
  
+ # Load -- just like sentence-transformers!
+ model = EmbeddingInference.from_pretrained("surazbhandari/miniembed")
  
+ # 1. Similarity
  score = model.similarity("Machine learning is great", "AI is wonderful")
  print(f"Similarity: {score:.4f}")  # 0.4287
  
+ # 2. Normal Embeddings
+ embeddings = model.encode(["Machine learning is great", "AI is wonderful"])
+ import numpy as np
+ manual_score = np.dot(embeddings[0], embeddings[1])  # Dot product = Cosine Similarity
+
+ # 3. Semantic Search
  docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
  results = model.search("deep learning frameworks", docs, top_k=2)
  for r in results:
      print(f"  [{r['score']:.3f}] {r['text']}")
+ # [0.498] Neural networks learn patterns
+ # [0.413] Python is great for AI
  
+ # 4. Clustering
+ result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
+ # Cluster 1: ['Pizza is food']
+ # Cluster 2: ['ML is cool', 'AI rocks']
  ```
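Editor's note on the `manual_score` line in the added snippet above: the dot product only equals cosine similarity because the model returns L2-normalized embeddings. A self-contained check with synthetic vectors (numpy only, no model download needed; the vector size 256 mirrors the model's embedding dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=256)
b = rng.normal(size=256)

# L2-normalize, as the model does before returning embeddings
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

dot = float(a_n @ b_n)  # plain dot product of unit vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # textbook cosine similarity

# Identical once the vectors live on the unit hypersphere
assert abs(dot - cosine) < 1e-9
```

This is also why nearest-neighbor search over such embeddings can use fast dot-product indexes directly.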
  
+ ## Also Available via GitHub
  
  ```bash
+ git clone https://github.com/bhandarisuraz/miniembed.git
+ cd miniembed
+ pip install -r requirements.txt
  
+ python -c "
+ from src.inference import EmbeddingInference
+ model = EmbeddingInference.from_pretrained('models/mini')
+ print(model.similarity('hello world', 'hi there'))
+ "
+ ```
  
+ ## Capabilities
  
+ - **Semantic Search** -- Find meaning-based matches, not keyword overlap.
+ - **Re-Ranking** -- Sort candidates by true semantic relevance.
+ - **Clustering** -- Group texts into logical categories automatically.
+ - **Product Matching** -- Match items across platforms with messy titles.
  
+ ## Architecture
  
+ Custom 4-layer Transformer encoder built from first principles:
  
+ - Token Embedding (30K vocab) + Sinusoidal Positional Encoding
+ - 4x Pre-LayerNorm Transformer Encoder Layers
+ - Multi-Head Self-Attention (4 heads, d_k=64)
+ - Position-wise Feed-Forward (GELU activation, d_ff=1024)
+ - Mean Pooling over non-padded tokens
+ - L2 Normalization (unit hypersphere projection)
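Editor's note: the last two architecture bullets (mean pooling over non-padded tokens, then L2 normalization) can be sketched as follows. This is a minimal illustration, not the repository's actual `model.py` code; the `(batch, seq, d_model)` layout and 0/1 attention mask are assumptions:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token states over real (non-padded) positions, then L2-normalize.
    hidden: (batch, seq, d_model); attention_mask: (batch, seq), 1 = real token."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1), broadcastable
    summed = (hidden * mask).sum(dim=1)           # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1e-9)      # per-sequence token counts
    pooled = summed / counts                      # mean over real tokens only
    return F.normalize(pooled, dim=-1)            # project onto the unit hypersphere

# Two sequences of 128 tokens; the second one is half padding.
hidden = torch.randn(2, 128, 256)
mask = torch.ones(2, 128)
mask[1, 64:] = 0
emb = mean_pool(hidden, mask)
```

Masked pooling matters here: averaging over padded positions as well would drag embeddings of short texts toward the padding representation.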
 
+ ## Training
  
+ Trained on ~3.8 million text pairs from public datasets:
  
+ | Dataset | Type |
  |---|---|
+ | Natural Questions (NQ) | Q&A / General |
+ | GooAQ | Knowledge Search |
+ | WDC Product Matching | E-commerce |
+ | ECInstruct | E-commerce Tasks |
+ | MS MARCO | Web Search |
  
+ **Training details:**
+ - Training time: ~49 hours
+ - Final loss: 0.0748
+ - Optimizer: AdamW
+ - Batch size: 256
  
+ ## Files
  
+ ```
+ surazbhandari/miniembed
+ |-- README.md           # This model card
+ |-- config.json         # Architecture config
+ |-- model.safetensors   # Pre-trained weights (Safe & Fast)
+ |-- model.pt            # Pre-trained weights (Legacy PyTorch)
+ |-- tokenizer.json      # 30K word-level vocabulary
+ |-- training_info.json  # Training metadata
+ |-- src/
+     |-- __init__.py
+     |-- model.py        # Full architecture code
+     |-- tokenizer.py    # Tokenizer implementation
+     |-- inference.py    # High-level API (supports HF auto-download)
+ ```
  
+ ## Limitations
  
+ - Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK]
+ - 128 token max sequence length
+ - Trained primarily on English text
+ - Best suited for short-form text (queries, product titles, sentences)
  
  ## Citation
  
  ```bibtex
  @software{Bhandari_MiniEmbed_2026,
    author = {Bhandari, Suraj},
  }
  ```
  
  ## License
  
+ MIT
models/mini/config.json → config.json RENAMED
File without changes
data/sample_data.jsonl DELETED
@@ -1,10 +0,0 @@
- {"query": "how to train an embedding model", "passage": "Training an embedding model involves using contrastive learning on query-passage pairs.", "source": "sample"}
- {"query": "what is a transformer", "passage": "The Transformer is a deep learning model that uses self-attention mechanisms to process sequence data.", "source": "sample"}
- {"query": "nike air max 90", "passage": "Men's Nike Air Max 90 Casual Shoes in Black and White.", "source": "sample"}
- {"query": "samsung galaxy s21", "passage": "Samsung Galaxy S21 5G 128GB Unlocked Smartphone - Phantom Gray.", "source": "sample"}
- {"query": "best winter coats", "passage": "The North Face Gotham Jacket III is one of the warmest winter parkas for heavy snow.", "source": "sample"}
- {"query": "python programming for beginners", "passage": "Learn Python with this comprehensive guide covering variables, loops, and functions.", "source": "sample"}
- {"query": "benefits of meditation", "passage": "Meditation can reduce stress, improve concentration, and increase happiness.", "source": "sample"}
- {"query": "how to bake chocolate cake", "passage": "Whisk eggs and sugar, then fold in flour and melted chocolate for a perfect moist cake.", "source": "sample"}
- {"query": "what is machine learning", "passage": "Machine learning is a field of AI that allows systems to learn patterns from data without explicit programming.", "source": "sample"}
- {"query": "running shoes for flat feet", "passage": "Brooks Adrenaline GTS 22 provides excellent stability and support for runners with low arches.", "source": "sample"}
demo.py DELETED
@@ -1,510 +0,0 @@
- """
- MiniEmbed - Interactive Demo
- ================================
- Explore the embedding model's capabilities through a Streamlit dashboard.
-
- Features:
- - Pairwise text similarity (cosine distance)
- - Semantic document search with ranked results
- - Unsupervised text clustering via K-Means
- - Raw embedding vector inspection and visualization
- - Bulk CSV-to-CSV record matching
-
- Run: streamlit run demo.py
- """
-
- import streamlit as st
- import numpy as np
- import pandas as pd
- import os
- import sys
- import io
-
- # Add src to path
- sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
-
- from src.inference import EmbeddingInference, EmbeddingModelManager
-
- # ============================================================================
- # PAGE CONFIG
- # ============================================================================
-
- st.set_page_config(
-     page_title="MiniEmbed Demo",
-     page_icon="M",
-     layout="wide"
- )
-
- # Custom CSS
- st.markdown("""
- <style>
-     .main-header {
-         font-size: 2.5rem;
-         font-weight: 700;
-         background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
-         -webkit-background-clip: text;
-         -webkit-text-fill-color: transparent;
-         text-align: center;
-         margin-bottom: 1rem;
-     }
-     .sub-header {
-         text-align: center;
-         color: #888;
-         margin-bottom: 2rem;
-     }
-     .result-box {
-         background: rgba(100, 100, 100, 0.1);
-         border-radius: 10px;
-         padding: 1rem;
-         margin: 0.5rem 0;
-         color: inherit;
-     }
-     .high-score { border-left: 4px solid #28a745; background: rgba(40, 167, 69, 0.1); }
-     .medium-score { border-left: 4px solid #ffc107; background: rgba(255, 193, 7, 0.1); }
-     .low-score { border-left: 4px solid #dc3545; background: rgba(220, 53, 69, 0.1); }
-     .score-text { font-weight: bold; }
- </style>
- """, unsafe_allow_html=True)
-
- # ============================================================================
- # LOAD MODEL
- # ============================================================================
-
- @st.cache_resource
- def load_model(model_name):
-     """Load the embedding model from disk."""
-     model_dir = f"models/{model_name}"
-     if model_name == "Legacy (model/)":
-         model_dir = "model"
-     return EmbeddingInference.from_pretrained(model_dir)
-
-
- # Header
- st.markdown('<h1 class="main-header">MiniEmbed Demo</h1>', unsafe_allow_html=True)
- st.markdown('<p class="sub-header">Explore semantic similarity, search, clustering, and bulk matching</p>', unsafe_allow_html=True)
-
- # -----------------------------------------------------------------------------
- # Model Selection
- # -----------------------------------------------------------------------------
- available_models = EmbeddingModelManager.list_models()
- if os.path.exists("model/model.pt"):
-     available_models.append("Legacy (model/)")
-
- if not available_models:
-     st.error("No models found. Train a model first or place weights in models/mini/model.pt.")
-     st.info("Models should be located in the `models/` directory (e.g., `models/mini/`).")
-     st.stop()
-
- selected_model_name = st.sidebar.selectbox(
-     "Select Model",
-     available_models,
-     index=0,
-     help="Select which trained model to load for inference."
- )
-
- model = load_model(selected_model_name)
-
- if model is None:
-     st.error("Model not found. Please train the model first.")
-     st.stop()
-
- # Model info
- with st.expander("Model Info", expanded=False):
-     st.markdown("""
-     This panel shows the architecture of the currently loaded model.
-     - **Embedding Dim**: The size of each output vector (higher = more expressive).
-     - **Layers**: Number of Transformer encoder layers stacked in the model.
-     - **Vocab Size**: Total number of unique tokens the model can recognize.
-     """)
-     col1, col2, col3 = st.columns(3)
-     with col1:
-         st.metric("Embedding Dim", model.model.d_model)
-     with col2:
-         st.metric("Layers", len(model.model.layers))
-     with col3:
-         st.metric("Vocab Size", len(model.tokenizer.word_to_id))
-
- # ============================================================================
- # TABS
- # ============================================================================
-
- tab1, tab2, tab3, tab4, tab5 = st.tabs([
-     "Similarity",
-     "Semantic Search",
-     "Clustering",
-     "Encode Text",
-     "CSV Matcher"
- ])
-
- # ============================================================================
- # TAB 1: SIMILARITY
- # ============================================================================
-
- with tab1:
-     st.markdown("### Pairwise Text Similarity")
-     st.markdown("""
-     Enter two texts to compute their **cosine similarity** (range: 0 to 1).
-     The model encodes each text into a 256-dimensional vector and measures
-     the angular distance between them. A score close to 1.0 means the texts
-     are semantically equivalent; a score near 0.0 means they are unrelated.
-     """)
-
-     col1, col2 = st.columns(2)
-
-     with col1:
-         text1 = st.text_area(
-             "Text 1",
-             "Machine learning is a branch of artificial intelligence",
-             height=100,
-             key="sim_text1"
-         )
-
-     with col2:
-         text2 = st.text_area(
-             "Text 2",
-             "AI systems can learn patterns from data",
-             height=100,
-             key="sim_text2"
-         )
-
-     if st.button("Compute Similarity", type="primary", key="sim_btn"):
-         if text1 and text2:
-             with st.spinner("Computing..."):
-                 similarity = model.similarity(text1, text2)
-
-             if similarity > 0.7:
-                 color = "#28a745"
-                 label = "Very Similar"
-             elif similarity > 0.4:
-                 color = "#ffc107"
-                 label = "Somewhat Similar"
-             else:
-                 color = "#dc3545"
-                 label = "Not Similar"
-
-             st.markdown(f"""
-             <div style="text-align: center; padding: 2rem;">
-                 <div style="font-size: 4rem; font-weight: bold; color: {color};">
-                     {similarity:.3f}
-                 </div>
-                 <div style="font-size: 1.2rem; color: {color};">
-                     {label}
-                 </div>
-             </div>
-             """, unsafe_allow_html=True)
-
-     # Example pairs
-     st.markdown("---")
-     st.markdown("#### Example Pairs")
-     st.markdown("These pairs demonstrate how the model distinguishes related from unrelated content:")
-
-     examples = [
-         ("Python is a programming language", "Java is used for software development"),
-         ("The cat sat on the mat", "A feline rested on the rug"),
-         ("Machine learning is fascinating", "I love eating pizza"),
-     ]
-
-     for t1, t2 in examples:
-         similarity = model.similarity(t1, t2)
-
-         if similarity > 0.5:
-             css_class = "high-score"
-         elif similarity > 0.3:
-             css_class = "medium-score"
-         else:
-             css_class = "low-score"
-
-         st.markdown(f"""
-         <div class="result-box {css_class}">
-             <strong>{similarity:.3f}</strong> | "{t1}" vs "{t2}"
-         </div>
-         """, unsafe_allow_html=True)
-
- # ============================================================================
- # TAB 2: SEMANTIC SEARCH
- # ============================================================================
-
- with tab2:
-     st.markdown("### Semantic Document Search")
-     st.markdown("""
-     Enter a natural-language query. The model encodes your query and all
-     documents into the same vector space, then ranks documents by cosine
-     similarity. This finds **meaning-based** matches, not just keyword overlap.
-     """)
-
-     default_docs = """Python is a high-level programming language
- Machine learning algorithms learn patterns from data
- The weather today is sunny and warm
- Neural networks are inspired by the human brain
- JavaScript is used for web development
- Deep learning has transformed computer vision
- Cats are popular pets around the world
- TensorFlow and PyTorch are ML frameworks
- The stock market had a volatile day
- Natural language processing understands text"""
-
-     query = st.text_input(
-         "Search Query",
-         "How do AI systems learn from examples?",
-         key="search_query"
-     )
-
-     documents_text = st.text_area(
-         "Documents (one per line)",
-         default_docs,
-         height=200,
-         key="search_docs"
-     )
-
-     top_k = st.slider("Number of results", 1, 10, 5, key="search_topk")
-
-     if st.button("Search", type="primary", key="search_btn"):
-         documents = [d.strip() for d in documents_text.split('\n') if d.strip()]
-
-         if query and documents:
-             with st.spinner("Searching..."):
-                 results = model.search(query, documents, top_k=top_k)
-
-             st.markdown("### Results")
-             st.markdown("Documents ranked by semantic relevance to your query:")
-
-             for r in results:
-                 score = r['score']
-                 if score > 0.6:
-                     indicator = "[HIGH]"
-                     css_class = "high-score"
-                 elif score > 0.4:
-                     indicator = "[MED]"
-                     css_class = "medium-score"
-                 else:
-                     indicator = "[LOW]"
-                     css_class = "low-score"
-
-                 st.markdown(f"""
-                 <div class="result-box {css_class}">
-                     <strong>{indicator} #{r['rank']}</strong> (score: {score:.4f})<br>
-                     {r['text']}
-                 </div>
-                 """, unsafe_allow_html=True)
-
- # ============================================================================
- # TAB 3: CLUSTERING
- # ============================================================================
-
- with tab3:
-     st.markdown("### Unsupervised Text Clustering")
-     st.markdown("""
-     The model encodes each text into a dense vector. K-Means clustering
-     then groups these vectors by proximity in the embedding space.
-     Texts that are semantically similar end up in the same cluster,
-     even if they share no common words.
-     """)
-
-     default_cluster_texts = """Python programming language
- Machine learning algorithms
- Deep learning neural networks
- JavaScript web development
- Cats and dogs as pets
- Pizza and pasta Italian food
- Sunny weather today
- Rainy day forecast
- Stock market trends
- Financial news update"""
-
-     cluster_texts = st.text_area(
-         "Texts to cluster (one per line)",
-         default_cluster_texts,
-         height=200,
-         key="cluster_texts"
-     )
-
-     n_clusters = st.slider("Number of clusters", 2, 10, 3, key="n_clusters")
-
-     if st.button("Run Clustering", type="primary", key="cluster_btn"):
-         texts = [t.strip() for t in cluster_texts.split('\n') if t.strip()]
325
-
326
- if len(texts) >= n_clusters:
327
- with st.spinner("Clustering..."):
328
- result = model.cluster_texts(texts, n_clusters=n_clusters)
329
-
330
- st.markdown("### Cluster Assignments")
331
- st.markdown("Each group contains texts that the model considers semantically related:")
332
-
333
- colors = ["#667eea", "#28a745", "#ffc107", "#dc3545", "#17a2b8",
334
- "#6f42c1", "#fd7e14", "#20c997", "#e83e8c", "#6c757d"]
335
-
336
- for cluster_id in sorted(result['texts_by_cluster'].keys()):
337
- cluster_texts_list = result['texts_by_cluster'][cluster_id]
338
- color = colors[cluster_id % len(colors)]
339
-
340
- st.markdown(f"""
341
- <div style="background: {color}15; border-left: 4px solid {color};
342
- padding: 1rem; border-radius: 5px; margin: 0.5rem 0;">
343
- <strong style="color: {color};">Cluster {cluster_id + 1}</strong>
344
- ({len(cluster_texts_list)} texts)
345
- </div>
346
- """, unsafe_allow_html=True)
347
-
348
- for text in cluster_texts_list:
349
- st.markdown(f" - {text}")
350
- else:
351
- st.warning(f"Need at least {n_clusters} texts to create {n_clusters} clusters.")
352
-
353
- # ============================================================================
354
- # TAB 4: ENCODE TEXT
355
- # ============================================================================
356
-
357
- with tab4:
358
- st.markdown("### Raw Embedding Inspector")
359
- st.markdown("""
360
- Convert any text into its dense vector representation. The output is a
361
- 256-dimensional float vector that is **L2-normalized** (unit length = 1.0).
362
- This is the same representation used internally for similarity and search.
363
- """)
364
-
365
- encode_text = st.text_area(
366
- "Text to encode",
367
- "Machine learning is a fascinating field of study.",
368
- height=100,
369
- key="encode_text"
370
- )
371
-
372
- if st.button("Encode", type="primary", key="encode_btn"):
373
- if encode_text:
374
- with st.spinner("Encoding..."):
375
- embedding = model.encode(encode_text)
376
-
377
- st.markdown("### Embedding Vector")
378
-
379
- col1, col2, col3 = st.columns(3)
380
- with col1:
381
- st.metric("Dimensions", embedding.shape[1])
382
- with col2:
383
- st.metric("L2 Norm", f"{np.linalg.norm(embedding[0]):.4f}")
384
- with col3:
385
- st.metric("Mean Value", f"{embedding[0].mean():.4f}")
386
-
387
- st.markdown("#### First 20 values:")
388
- st.code(str(embedding[0][:20].round(4).tolist()))
389
-
390
- st.markdown("#### Value Distribution")
391
- st.markdown("A well-trained model produces a roughly Gaussian distribution centered near zero:")
392
- import plotly.express as px
393
- fig = px.histogram(
394
- x=embedding[0],
395
- nbins=50,
396
- title="Embedding Value Distribution",
397
- labels={'x': 'Value', 'y': 'Count'}
398
- )
399
- fig.update_layout(showlegend=False)
400
- st.plotly_chart(fig, width="stretch")
401
-
402
- # ============================================================================
403
- # TAB 5: CSV MATCHER
404
- # ============================================================================
405
-
406
- with tab5:
407
- st.markdown("### Bulk CSV Record Matcher")
408
- st.markdown("""
409
- Upload two CSV files and match rows across them using semantic similarity.
410
- This is useful for:
411
- - **Product deduplication** across e-commerce platforms
412
- - **Record linkage** between databases with inconsistent naming
413
- - **Cross-platform mapping** (e.g., matching supplier catalogs to your inventory)
414
-
415
- The model encodes the selected text column from each CSV, then ranks
416
- every row in CSV 2 against each row in CSV 1 by cosine similarity.
417
- """)
418
-
419
- col1, col2 = st.columns(2)
420
-
421
- with col1:
422
- st.markdown("#### Upload CSV 1 (Queries)")
423
- file1 = st.file_uploader("Upload primary CSV", type=['csv'], key="csv_file_1")
424
-
425
- with col2:
426
- st.markdown("#### Upload CSV 2 (Knowledge Base)")
427
- file2 = st.file_uploader("Upload secondary CSV", type=['csv'], key="csv_file_2")
428
-
429
- if file1 and file2:
430
- df1 = pd.read_csv(file1)
431
- df2 = pd.read_csv(file2)
432
-
433
- st.markdown("---")
434
- col_m1, col_m2 = st.columns(2)
435
-
436
- with col_m1:
437
- col1_name = st.selectbox("Select column to match from CSV 1", df1.columns, key="col1_sel")
438
-
439
- with col_m2:
440
- col2_name = st.selectbox("Select column to search in CSV 2", df2.columns, key="col2_sel")
441
-
442
- col_p1, col_p2 = st.columns(2)
443
- with col_p1:
444
- top_n_candidates = st.slider("Step 1: Top candidates to fetch", 1, 50, 10, help="Initial semantic search depth")
445
- with col_p2:
446
- top_m_final = st.slider("Step 2: Top matches to keep", 1, 10, 3, help="Final number of matches per row")
447
-
448
- if st.button("Start Bulk Matching", type="primary"):
449
- progress_bar = st.progress(0)
450
- status_text = st.empty()
451
-
452
- queries = df1[col1_name].fillna("").astype(str).tolist()
453
- corpus = df2[col2_name].fillna("").astype(str).tolist()
454
-
455
- status_text.text("Encoding search corpus (CSV 2)...")
456
- corpus_embs = model.encode(corpus, batch_size=128)
457
- progress_bar.progress(20)
458
-
459
- status_text.text("Encoding queries (CSV 1)...")
460
- query_embs = model.encode(queries, batch_size=128)
461
- progress_bar.progress(50)
462
-
463
- status_text.text("Computing similarities and mapping...")
464
- similarities = np.dot(query_embs, corpus_embs.T)
465
- progress_bar.progress(80)
466
-
467
- all_results = []
468
- for i in range(len(queries)):
469
- row_scores = similarities[i]
470
- top_indices = np.argsort(row_scores)[::-1][:top_m_final]
471
-
472
- res_row = df1.iloc[i].to_dict()
473
- for rank, idx in enumerate(top_indices, 1):
474
- res_row[f'Match_{rank}_{col2_name}'] = corpus[idx]
475
- res_row[f'Match_{rank}_Score'] = round(float(row_scores[idx]), 4)
476
- all_results.append(res_row)
477
-
478
- res_df = pd.DataFrame(all_results)
479
-
480
- progress_bar.progress(100)
481
- status_text.text("Matching complete.")
482
-
483
- st.markdown("### Results Preview")
484
- st.dataframe(res_df.head(50), width="stretch")
485
-
486
- output = io.StringIO()
487
- res_df.to_csv(output, index=False)
488
- csv_string = output.getvalue()
489
-
490
- st.download_button(
491
- label="Download Full Results CSV",
492
- data=csv_string,
493
- file_name="semantic_matching_results.csv",
494
- mime="text/csv",
495
- )
496
- else:
497
- st.info("Upload both CSV files to begin matching.")
498
-
499
-
500
- # ============================================================================
501
- # FOOTER
502
- # ============================================================================
503
-
504
- st.markdown("---")
505
- st.markdown("""
506
- <div style="text-align: center; color: #666; padding: 1rem;">
507
- <strong>MiniEmbed</strong> | Lightweight Text Embeddings |
508
- <a href="https://github.com/bhandarisuraz/miniembed">GitHub</a>
509
- </div>
510
- """, unsafe_allow_html=True)
examples/basic_usage.py DELETED
@@ -1,85 +0,0 @@
- """
- Basic Usage Example
- ===================
- Demonstrates encoding texts and computing similarity using MiniEmbed.
-
- This script shows the three core operations:
- 1. Encoding raw text into dense vectors
- 2. Computing pairwise similarity between two texts
- 3. Building a full similarity matrix across sets of texts
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Basic Usage Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Example 1: Encode texts
-     # -------------------------------------------------------------------------
-     print("-" * 40)
-     print("Example 1: Encoding Texts")
-     print("-" * 40)
-
-     texts = [
-         "Machine learning is a branch of artificial intelligence",
-         "Deep learning uses neural networks with many layers",
-         "I love eating pizza on weekends"
-     ]
-
-     embeddings = model.encode(texts)
-     print(f"Input: {len(texts)} texts")
-     print(f"Output: {embeddings.shape}")  # (3, 256)
-
-     # -------------------------------------------------------------------------
-     # Example 2: Compute similarity
-     # -------------------------------------------------------------------------
-     print("\n" + "-" * 40)
-     print("Example 2: Computing Similarity")
-     print("-" * 40)
-
-     pairs = [
-         ("Machine learning is great", "AI is wonderful"),
-         ("Machine learning is great", "I love pizza"),
-         ("The cat sat on the mat", "A feline rested on the rug"),
-     ]
-
-     for text1, text2 in pairs:
-         similarity = model.similarity(text1, text2)
-         tag = "MATCH" if similarity > 0.5 else " LOW"
-         print(f" [{tag}] {similarity:.4f} | '{text1}' vs '{text2}'")
-
-     # -------------------------------------------------------------------------
-     # Example 3: Pairwise similarity matrix
-     # -------------------------------------------------------------------------
-     print("\n" + "-" * 40)
-     print("Example 3: Pairwise Similarity Matrix")
-     print("-" * 40)
-
-     texts_a = ["Machine learning", "Deep learning", "Natural language"]
-     texts_b = ["AI models", "Neural networks", "Text processing"]
-
-     similarity_matrix = model.pairwise_similarity(texts_a, texts_b)
-
-     print("\nSimilarity Matrix:")
-     print(" ", " ".join(f"{t[:10]:>10}" for t in texts_b))
-     for i, text in enumerate(texts_a):
-         row = " ".join(f"{similarity_matrix[i, j]:>10.4f}" for j in range(len(texts_b)))
-         print(f"{text[:12]:>12}: {row}")
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
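Example 2 in the deleted script reports a cosine similarity for each text pair. Under the hood this is presumably the cosine of the angle between the two embedding vectors; a self-contained sketch with NumPy, using toy vectors rather than real MiniEmbed embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of norms.
    # For L2-normalized embeddings the denominator is 1, so the raw dot
    # product alone already gives the similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])
print(f"{cosine_similarity(a, b):.4f}")  # 0.7071 (i.e. 1/sqrt(2), a 45-degree angle)
```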
examples/clustering.py DELETED
@@ -1,109 +0,0 @@
- """
- Text Clustering Example
- =======================
- Demonstrates how to cluster texts by semantic similarity using MiniEmbed.
-
- The model encodes each text into a dense vector. K-Means clustering then
- groups these vectors by proximity in the embedding space, even if the texts
- share no common words.
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Text Clustering Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Text collection (mixed topics)
-     # -------------------------------------------------------------------------
-     texts = [
-         # Technology
-         "Python is a versatile programming language",
-         "Machine learning models learn from data",
-         "JavaScript is used for web development",
-         "Neural networks process information like the brain",
-         "Software engineering involves designing systems",
-
-         # Food
-         "Pizza is my favorite Italian dish",
-         "Sushi is a traditional Japanese cuisine",
-         "Tacos are delicious Mexican street food",
-         "Pasta with marinara sauce is comforting",
-         "Ramen noodles are popular in Japan",
-
-         # Sports
-         "Football is the most popular sport worldwide",
-         "Basketball requires teamwork and skill",
-         "Tennis is an exciting individual sport",
-         "Swimming is great for cardiovascular health",
-         "Soccer World Cup attracts billions of viewers",
-
-         # Nature
-         "Mountains offer breathtaking scenic views",
-         "Oceans cover most of the Earth's surface",
-         "Forests are home to diverse wildlife",
-         "Rivers provide fresh water to ecosystems",
-         "Deserts have extreme temperature variations",
-     ]
-
-     print(f"Text Collection: {len(texts)} texts (4 topics)")
-
-     # -------------------------------------------------------------------------
-     # Cluster texts
-     # -------------------------------------------------------------------------
-     print("\nClustering texts into 4 groups...")
-
-     result = model.cluster_texts(texts, n_clusters=4)
-
-     # -------------------------------------------------------------------------
-     # Display results
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Clustering Results")
-     print("=" * 60)
-
-     for cluster_id in sorted(result['texts_by_cluster'].keys()):
-         cluster_texts = result['texts_by_cluster'][cluster_id]
-
-         print(f"\n Cluster {cluster_id + 1} ({len(cluster_texts)} texts)")
-         print("-" * 40)
-
-         for text in cluster_texts:
-             print(f" - {text}")
-
-     # -------------------------------------------------------------------------
-     # Evaluate clustering (simple check)
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Clustering Analysis")
-     print("=" * 60)
-
-     # Expected groupings (approximate)
-     expected = {
-         "Technology": texts[0:5],
-         "Food": texts[5:10],
-         "Sports": texts[10:15],
-         "Nature": texts[15:20],
-     }
-
-     print("\nLabels assigned to each text:")
-     for i, (text, label) in enumerate(zip(texts, result['labels'])):
-         topic = list(expected.keys())[i // 5]
-         print(f" [{label}] ({topic}) {text[:50]}...")
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
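The `cluster_texts` call in the deleted script presumably wraps K-Means over the encoded vectors (scikit-learn is listed in `requirements.txt`, though the internals of `cluster_texts` are not shown in this diff). A toy sketch of that grouping step, with hand-made 2-D points standing in for embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points stand in for text embeddings.
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],   # group A
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],   # group B
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Group item indices by cluster id, mirroring result['texts_by_cluster']
by_cluster = {}
for i, label in enumerate(labels):
    by_cluster.setdefault(int(label), []).append(i)
print(by_cluster)
```

With real embeddings the same pattern applies: K-Means only sees vectors, so any texts whose embeddings land close together end up in the same cluster regardless of shared vocabulary.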
examples/semantic_search.py DELETED
@@ -1,108 +0,0 @@
- """
- Semantic Search Example
- =======================
- Demonstrates how to use MiniEmbed for document retrieval.
-
- The model encodes a query and a corpus of documents into the same vector space,
- then ranks documents by cosine similarity to the query. This finds results based
- on meaning, not keyword overlap.
- """
-
- import sys
- sys.path.insert(0, '..')
-
- from src.inference import EmbeddingInference
-
-
- def main():
-     print("=" * 60)
-     print("MiniEmbed - Semantic Search Example")
-     print("=" * 60)
-
-     # Load the model
-     print("\nLoading model...")
-     model = EmbeddingInference.from_pretrained("../models/mini")
-     print("Model loaded.\n")
-
-     # -------------------------------------------------------------------------
-     # Document collection
-     # -------------------------------------------------------------------------
-     documents = [
-         "Python is a high-level programming language known for its simplicity",
-         "Machine learning algorithms can learn patterns from data",
-         "The weather today is sunny with a high of 75 degrees",
-         "Neural networks are computational models inspired by the brain",
-         "JavaScript is widely used for web development",
-         "Deep learning has revolutionized computer vision and NLP",
-         "Cats are popular pets known for their independence",
-         "TensorFlow and PyTorch are popular deep learning frameworks",
-         "The stock market showed strong gains today",
-         "Natural language processing helps computers understand text"
-     ]
-
-     print(f"Document Collection: {len(documents)} documents")
-     for i, doc in enumerate(documents, 1):
-         print(f" {i}. {doc[:60]}...")
-
-     # -------------------------------------------------------------------------
-     # Search queries
-     # -------------------------------------------------------------------------
-     queries = [
-         "How do AI systems learn from examples?",
-         "What programming language is good for beginners?",
-         "Tell me about artificial neural networks",
-     ]
-
-     print("\n" + "=" * 60)
-     print("Search Results")
-     print("=" * 60)
-
-     for query in queries:
-         print(f"\n Query: \"{query}\"")
-         print("-" * 50)
-
-         results = model.search(query, documents, top_k=3)
-
-         for r in results:
-             score = r['score']
-             if score > 0.6:
-                 tag = "[HIGH]"
-             elif score > 0.4:
-                 tag = "[ MED]"
-             else:
-                 tag = "[ LOW]"
-
-             print(f" {tag} #{r['rank']} (score: {score:.4f})")
-             print(f" {r['text']}")
-
-     # -------------------------------------------------------------------------
-     # Interactive search (optional)
-     # -------------------------------------------------------------------------
-     print("\n" + "=" * 60)
-     print("Interactive Search")
-     print("=" * 60)
-     print("Enter your own queries (type 'quit' to exit):\n")
-
-     while True:
-         try:
-             query = input(" Query: ").strip()
-             if query.lower() in ['quit', 'exit', 'q']:
-                 break
-             if not query:
-                 continue
-
-             results = model.search(query, documents, top_k=3)
-
-             print("\n Results:")
-             for r in results:
-                 print(f" - [{r['score']:.3f}] {r['text'][:70]}...")
-             print()
-
-         except (KeyboardInterrupt, EOFError):
-             break
-
-     print("\nDone.")
-
-
- if __name__ == "__main__":
-     main()
models/mini/model.pt → model.pt RENAMED
File without changes
models/mini/model.safetensors → model.safetensors RENAMED
File without changes
models/large/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Large
-
- Full-scale variant for maximum accuracy on complex semantic tasks.
-
- Coming soon...
models/medium/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Medium
-
- Balanced variant offering higher accuracy with moderate compute requirements.
-
- Coming soon...
models/product/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Product
-
- Fine-tuned variant of Mini, specialized for high-accuracy product matching.
-
- Coming soon...
models/small/README.md DELETED
@@ -1,5 +0,0 @@
- # MiniEmbed - Small
-
- A larger variant with increased capacity for general-purpose embeddings.
-
- Coming soon...
requirements.txt DELETED
@@ -1,14 +0,0 @@
- # Core
- torch>=2.0.0
- numpy>=1.21.0
- tqdm>=4.64.0
-
- # Demo UI
- streamlit>=1.30.0
- plotly>=5.0.0
-
- # Optional (for clustering, CSV processing, & Benchmarking)
- scikit-learn>=1.0.0
- pandas>=2.0.0
- psutil>=5.9.0
- sentence-transformers>=2.2.0
src/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (500 Bytes). View file
 
src/__pycache__/inference.cpython-313.pyc ADDED
Binary file (14.7 kB). View file
 
src/__pycache__/model.cpython-313.pyc ADDED
Binary file (15 kB). View file
 
src/__pycache__/tokenizer.cpython-313.pyc ADDED
Binary file (7.06 kB). View file
 
models/mini/tokenizer.json → tokenizer.json RENAMED
File without changes
models/mini/training_info.json → training_info.json RENAMED
File without changes